- update the watcher design and automation guide to treat `--categorize` as sequential ATVM sub-runs rather than one parent run with internal phases
- document that categorized runs should send one Mattermost status per completed grouped sub-run instead of one parent-only final post
- add a `--categorize` option to the watcher start helper so categorized mode is explicit in watcher startup
- update the watcher implementation to track categorized sub-runs separately, write per-subrun state, and post each completed grouped run once

# ATVM Mattermost Watcher Design

## Purpose

Design a controller-local watcher on the ATVM Cypress machine (`192.168.3.190`) that monitors an ATVM automation run and posts final run status to Mattermost only after the watched scope has fully completed.

This watcher must continue working even if the local operator machine is offline.

## Implementation Approach

Use a `systemd`-managed watcher on the ATVM Cypress controller.

Recommended structure:

- one watcher script that evaluates a specific ATVM run request
- one `systemd` service to execute the watcher
- no always-on daemon: the service runs for one run request and exits once that request is closed
- for categorized ATVM runs, one watcher instance tracks the parent request and posts each categorized sub-run separately as those grouped runs complete

Preferred deployment target:

- controller host: `192.168.3.190`
- ATVM automation root: `/root/cdc-e2e-cyp-12.17.4`

## Mattermost Destination

Use the local credential file in this workspace as the source of defaults:

- `/home/aw/code/cds/.env.credentials.local`

Expected variables:

- `MATTERMOST_ATVM_WEBHOOK`
- `MATTERMOST_ATVM_CHANNEL`

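
As a rough illustration of how those two variables could be consumed, the sketch below parses a credential file and builds a Mattermost incoming-webhook request. It assumes the file uses plain `KEY=VALUE` lines, which this document does not specify. The function names `parse_env_file` and `build_request` are hypothetical. The request is built but not sent.

```python
import json
import urllib.request


def parse_env_file(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines (assumed format); skip blanks and '#' comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def build_request(env: dict[str, str], text: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST for a Mattermost incoming webhook."""
    payload = {"text": text}
    channel = env.get("MATTERMOST_ATVM_CHANNEL")
    if channel:
        # Incoming webhooks accept an optional channel override.
        payload["channel"] = channel
    return urllib.request.Request(
        env["MATTERMOST_ATVM_WEBHOOK"],
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A caller would send the result with `urllib.request.urlopen(req, timeout=10)`. Keeping the build step separate from the send makes the payload easy to check without network access.
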
## Run Completion Rule

The watcher must send Mattermost results only after the watched scope has fully completed.

A non-categorized run is considered fully completed only when:

- there are no active runner processes for the run
- the expected machine scope has final result artifacts
- no machine remains in `RUNNING` or `NOT STARTED`
- final reporter artifacts confirm the run has ended

A categorized run must be treated differently:

- `--categorize` splits the request into sequential ATVM sub-runs
- each categorized group is its own run/job
- the watcher must detect each grouped sub-run in order
- the watcher must wait for that grouped sub-run to complete
- then send that grouped sub-run's final Mattermost status
- then continue watching for the next grouped sub-run
- the watcher must not wait until the very end to send one single parent-only post

Evidence sources:

- live runner processes on `192.168.3.190`
- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/logs/`
- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/xml/`
- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/mochawesome/`

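
The four completion conditions for a single watched scope can be sketched as one predicate. This is a minimal sketch, assuming the evidence has already been gathered into a snapshot. The `ScopeSnapshot` type, its field names, and the per-machine status strings other than `RUNNING` and `NOT STARTED` are hypothetical.

```python
from dataclasses import dataclass

# Machine statuses that mean the scope is not finished yet.
NON_FINAL = {"RUNNING", "NOT STARTED"}


@dataclass
class ScopeSnapshot:
    active_runner_pids: list[int]    # live runner processes for this scope
    expected_machines: set[str]      # machines the scope is supposed to cover
    machine_status: dict[str, str]   # machine -> status read from result artifacts
    reporter_finished: bool          # final reporter artifacts confirm the end


def is_fully_completed(s: ScopeSnapshot) -> bool:
    """Return True only when all four completion conditions hold."""
    if s.active_runner_pids:
        return False
    if not s.expected_machines.issubset(s.machine_status):
        return False  # some expected machine has no final result yet
    if any(status in NON_FINAL for status in s.machine_status.values()):
        return False
    return s.reporter_finished
```

For a categorized request, the same predicate would be evaluated once per grouped sub-run rather than once for the parent.
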
## Required Run States

The watcher must distinguish these run-level states:

- `COMPLETED`
- `FAILED`
- `CANCELLED`
- `TERMINATED`
- `HUNG`
- `UNKNOWN`
- `RUNNING`

Definitions:

- `COMPLETED`
  - the run finished normally
  - all machines have final results
  - no run-level failure state blocks completion
- `FAILED`
  - the run finished, but one or more hosts failed
  - this is still a completed run
- `CANCELLED`
  - the run was intentionally cancelled through an explicit cancellation path
- `TERMINATED`
  - the run was manually killed or stopped before normal completion
- `HUNG`
  - the run appears stuck and does not meet completion rules within the expected policy window
- `UNKNOWN`
  - the watcher cannot safely determine the true state
- `RUNNING`
  - the run is still active and not yet complete

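
One way to keep these states distinct in code is an enumeration plus a single classification function. The sketch below is illustrative only. The input flags and the order in which they are checked are assumptions, not something this design prescribes.

```python
from enum import Enum


class RunState(Enum):
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"
    TERMINATED = "TERMINATED"
    HUNG = "HUNG"
    UNKNOWN = "UNKNOWN"
    RUNNING = "RUNNING"


def classify(
    *,
    scope_completed: bool,
    failed_hosts: int = 0,
    cancel_marker: bool = False,
    killed: bool = False,
    stalled: bool = False,
    evidence_readable: bool = True,
) -> RunState:
    """Map observed evidence to one run-level state (assumed precedence)."""
    if not evidence_readable:
        return RunState.UNKNOWN      # cannot safely determine the true state
    if cancel_marker:
        return RunState.CANCELLED    # explicit cancellation path
    if killed:
        return RunState.TERMINATED   # stopped before normal completion
    if scope_completed:
        # A finished run with failed hosts is still a completed run.
        return RunState.FAILED if failed_hosts else RunState.COMPLETED
    if stalled:
        return RunState.HUNG         # no completion within the policy window
    return RunState.RUNNING
```
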
## Mattermost Posting Rule

Post to Mattermost only when the watched scope has fully completed.

Send Mattermost status for:

- `COMPLETED`
- `FAILED`

Do not send Mattermost status for:

- `CANCELLED`
- `TERMINATED`
- `HUNG`
- `UNKNOWN`
- `RUNNING`

Important clarification:

- a completed run with failed hosts should still be posted
- a cancelled, terminated, hung, or unknown run should not be posted
- for categorized execution, this rule applies per categorized sub-run
- one categorized group completion should produce one Mattermost post
- do not send one parent-level aggregate post in place of the per-group posts

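
The posting rule reduces to a small allow-list check, applied per watched scope. The sketch below shows that check for a single run and for a set of categorized sub-runs. The function names and the `group-*` sub-run names in the usage note are hypothetical.

```python
# Only these run-level states produce a Mattermost post.
POSTABLE = frozenset({"COMPLETED", "FAILED"})


def should_post(state: str, already_posted: bool) -> bool:
    """Post only for a finished scope that has not been posted yet."""
    return state in POSTABLE and not already_posted


def subruns_to_post(states: dict[str, str], posted: set[str]) -> list[str]:
    """Return sub-runs that are due a post, in detection (insertion) order."""
    return [
        name
        for name, state in states.items()
        if should_post(state, name in posted)
    ]
```

With `states = {"group-a": "COMPLETED", "group-b": "FAILED", "group-c": "RUNNING"}` and `posted = {"group-a"}`, only `group-b` is due a post.
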
## Required Cancellation / Termination Handling

If a run is cancelled or terminated, the watcher must:

- detect that the run was cancelled or manually killed
- stop waiting for normal completion
- mark the run as closed without posting final Mattermost status
- prevent any later success/failure post for that same run

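
The "prevent any later post" requirement amounts to a latch: once a run is closed as cancelled or terminated, later observations are ignored. A minimal sketch under that reading follows. The `RunRecord` type and `observe` function are hypothetical names.

```python
from dataclasses import dataclass

SUPPRESSED = {"CANCELLED", "TERMINATED"}
POSTABLE = {"COMPLETED", "FAILED"}


@dataclass
class RunRecord:
    run_id: str
    state: str = "RUNNING"
    closed: bool = False


def observe(record: RunRecord, new_state: str) -> bool:
    """Apply an observed state; return True only if a post is now due."""
    if record.closed:
        return False              # latched: nothing reopens a closed run
    record.state = new_state
    if new_state in SUPPRESSED:
        record.closed = True      # close without posting
        return False
    if new_state in POSTABLE:
        record.closed = True      # caller posts once, then persists the record
        return True
    return False
```
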
## State Tracking Requirements

The watcher must track each monitored run by run id or build name.

For each run, keep durable state such as:

- tracked run id / build name
- controller-side watcher state
- completion marker
- cancellation / termination marker
- Mattermost posted marker
- last observed machine summary
- timestamps for first seen, last seen, closed

For categorized runs, keep durable state for:

- the parent request build name
- each detected categorized sub-run
- whether each categorized sub-run has already been posted

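
A possible shape for that durable state is a small JSON document per parent request, written atomically so a crash mid-write cannot leave a half-written file. The sketch below is one such layout. The class names, field names, and JSON format are assumptions.

```python
import json
import os
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class SubRunState:
    name: str
    state: str = "RUNNING"
    posted: bool = False


@dataclass
class ParentState:
    build_name: str
    first_seen: str
    last_seen: str
    closed_at: Optional[str] = None
    subruns: dict[str, SubRunState] = field(default_factory=dict)


def save_state(path: Path, state: ParentState) -> None:
    """Write to a temp file, then rename over the target in one step."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(asdict(state), indent=2))
    os.replace(tmp, path)


def load_state(path: Path) -> ParentState:
    """Read the JSON document back into the dataclasses."""
    raw = json.loads(path.read_text())
    raw["subruns"] = {k: SubRunState(**v) for k, v in raw["subruns"].items()}
    return ParentState(**raw)
```
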
## Duplicate-Post Prevention

The watcher must prevent duplicate Mattermost posts.

Required behavior:

- for non-categorized execution, only one final post per run
- for categorized execution, only one final post per categorized sub-run
- if a watched scope is already marked as posted, do not send again
- if a run or categorized sub-run is marked `CANCELLED`, `TERMINATED`, `HUNG`, or `UNKNOWN`, do not later convert it into a posted completion unless explicitly reset by an operator workflow

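
One common way to enforce "post once" on a single host is an exclusively created marker file: the first caller creates it and every later caller sees that it already exists. The sketch below shows that technique. The function name `claim_post` is hypothetical.

```python
import os
from pathlib import Path


def claim_post(marker: Path) -> bool:
    """Atomically create the posted marker; True only for the first caller."""
    marker.parent.mkdir(parents=True, exist_ok=True)
    try:
        # O_EXCL makes creation fail if the marker already exists.
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```

Claiming the marker before sending gives at-most-once behavior: a failed send leaves the scope marked as posted and needs an operator reset. Claiming after sending would instead risk a duplicate post after a crash. Which trade-off fits better is a design decision this sketch does not settle.
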
## Recommended State Files

Use a durable controller-local state directory, for example:

- `/var/lib/atvm-run-watcher/`

Possible contents:

- one parent state file per requested build name
- one posted marker per non-categorized run
- one subdirectory per categorized sub-run with its own state and posted marker
- one cancellation marker per parent run id
- optional lock file to prevent multiple watcher instances from racing

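
To make the layout concrete, the helpers below map a build name and optional sub-run name to paths under the state directory. Only the root directory comes from this document. The file and directory names (`state.json`, `posted`, `cancelled`, `subruns`, `watcher.lock`) are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional

STATE_ROOT = PurePosixPath("/var/lib/atvm-run-watcher")


def scope_dir(build_name: str, subrun: Optional[str] = None) -> PurePosixPath:
    """Directory for a parent request, or for one of its categorized sub-runs."""
    base = STATE_ROOT / build_name
    return base / "subruns" / subrun if subrun else base


def state_file(build_name: str, subrun: Optional[str] = None) -> PurePosixPath:
    return scope_dir(build_name, subrun) / "state.json"


def posted_marker(build_name: str, subrun: Optional[str] = None) -> PurePosixPath:
    return scope_dir(build_name, subrun) / "posted"


def cancel_marker(build_name: str) -> PurePosixPath:
    # Cancellation is tracked once per parent run.
    return scope_dir(build_name) / "cancelled"


def lock_file(build_name: str) -> PurePosixPath:
    return scope_dir(build_name) / "watcher.lock"
```
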
## Recommended Operator Workflow

Normal completion workflow:

1. ATVM run starts.
2. Watcher tracks the requested build name.
3. Watcher polls run state and artifacts.
4. For non-categorized execution:
   - wait for the run to fully complete
   - build one final status summary
   - post one final Mattermost status
5. For categorized execution:
   - detect each grouped sub-run in order
   - wait for that grouped sub-run to fully complete
   - build that grouped sub-run's final status summary
   - post that grouped sub-run's final Mattermost status
   - continue to the next grouped sub-run
6. Watcher marks the completed watched scope as posted and closed.

Cancellation / termination workflow:

1. Operator stops the ATVM run.
2. Watcher detects cancellation / termination, or an explicit cancellation marker is written.
3. Watcher marks the run `CANCELLED` or `TERMINATED`.
4. Watcher exits cleanly without posting to Mattermost.
5. Watcher prevents later duplicate or misleading final-post behavior.

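
The categorized branch of the workflow can be sketched as a loop that handles one grouped sub-run at a time. In this sketch the waiting and posting steps are injected as callables so the control flow can be exercised on its own. All names are hypothetical, and stopping at the first non-postable state is an assumption about how "require operator review" should behave.

```python
from typing import Callable, Iterable

POSTABLE = {"COMPLETED", "FAILED"}


def watch_categorized(
    subruns: Iterable[str],
    wait_for_final_state: Callable[[str], str],
    post: Callable[[str, str], None],
    posted: set[str],
) -> list[str]:
    """Post once per finished sub-run, in order; return the names posted."""
    sent: list[str] = []
    for name in subruns:
        state = wait_for_final_state(name)   # blocks until this sub-run closes
        if state not in POSTABLE:
            break                            # cancelled/terminated/hung/unknown: stop, no post
        if name not in posted:
            post(name, state)
            posted.add(name)                 # mark as posted before moving on
            sent.append(name)
    return sent
```
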
## Failure Semantics

Host-level failures do not suppress Mattermost posting.

If:

- the run has fully completed
- and one or more hosts failed

Then:

- final Mattermost status should still be sent
- final run-level state should be `FAILED`, which still counts as a completed run

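
A short sketch of how a finished run's host results could be folded into a run-level state and a one-line status summary. The per-host result strings (`PASSED`, `FAILED`) and the message wording are assumptions.

```python
def summarize(build_name: str, host_results: dict[str, str]) -> tuple[str, str]:
    """Return (run_state, summary_text) for a fully completed run."""
    failed = sorted(h for h, result in host_results.items() if result != "PASSED")
    state = "FAILED" if failed else "COMPLETED"
    passed = len(host_results) - len(failed)
    text = f"ATVM run {build_name}: {state} ({passed}/{len(host_results)} hosts passed)"
    if failed:
        text += "; failed: " + ", ".join(failed)
    return state, text
```

Either state is postable, so the summary is sent whether or not any host failed.
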
## Hang / Unknown Semantics

If the run cannot be safely classified as completed, failed, cancelled, or terminated:

- classify it as `HUNG` or `UNKNOWN`
- do not post to Mattermost
- require operator review

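
One way to separate `HUNG` from `UNKNOWN` is to compare the time since the last observed progress against a policy window, and to fall back to `UNKNOWN` when the evidence itself cannot be read. The sketch below assumes that reading. The function name and the window value in the usage note are hypothetical.

```python
from datetime import datetime, timedelta


def stalled_state(
    last_progress: datetime,
    now: datetime,
    window: timedelta,
    evidence_readable: bool = True,
) -> str:
    """Classify a run that has not completed yet."""
    if not evidence_readable:
        return "UNKNOWN"   # cannot safely determine the true state
    if now - last_progress > window:
        return "HUNG"      # no progress within the policy window
    return "RUNNING"
```

For example, with `window=timedelta(hours=2)`, a run whose artifacts last changed three hours ago would be classified as `HUNG` and left for operator review.
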
## Logging Requirements

The watcher should log:

- the run id / build name being monitored
- each state transition
- posting decisions
- reasons for suppressing a Mattermost post
- duplicate-post prevention decisions
- final closed state

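
A minimal sketch of such logging with Python's standard `logging` module, writing to stderr. A `systemd` service's stderr goes to the journal by default, so no log file is configured here. The logger name, message format, and helper names are assumptions.

```python
import logging
import sys


def make_logger(name: str = "atvm-run-watcher") -> logging.Logger:
    """Create a stderr logger; safe to call more than once."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def log_transition(logger: logging.Logger, run_id: str, old: str, new: str) -> None:
    logger.info("run=%s transition %s -> %s", run_id, old, new)


def log_post_decision(
    logger: logging.Logger, run_id: str, state: str, will_post: bool, reason: str
) -> None:
    # Covers posting, suppression, and duplicate-prevention decisions.
    logger.info("run=%s state=%s post=%s reason=%s", run_id, state, will_post, reason)
```
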
## Summary

This watcher design must satisfy all of the following:

- run on the ATVM Cypress controller
- survive local operator machine downtime
- use `systemd`
- distinguish run states clearly
- send Mattermost only after full completion of the watched scope
- send completion results whether hosts passed or failed
- never send Mattermost for cancelled, terminated, hung, or unknown runs
- prevent duplicate or misleading posts
- treat `--categorize` as sequential ATVM sub-runs, not as one parent run with internal phases
- send one Mattermost post per completed categorized sub-run