Update ATVM watcher for categorized sub-run posting

- update the watcher design and automation guide to treat --categorize as sequential ATVM sub-runs rather than one parent run with internal phases
- document that categorized runs should send one Mattermost status per completed grouped sub-run instead of one parent-only final post
- add a --categorize option to the watcher start helper so categorized mode is explicit in watcher startup
- update the watcher implementation to track categorized sub-runs separately, write per-subrun state, and post each completed grouped run once
2026-03-26 11:00:39 -04:00
parent 68cd428733
commit d60b8b9b18
6 changed files with 399 additions and 89 deletions


@@ -1,7 +1,7 @@
# ATVM Mattermost Watcher Design
## Purpose
Design a controller-local watcher on the ATVM Cypress machine (`192.168.3.190`) that monitors an ATVM automation run and posts the final run status to Mattermost only after the run has fully completed.
Design a controller-local watcher on the ATVM Cypress machine (`192.168.3.190`) that monitors an ATVM automation run and posts final run status to Mattermost only after the watched scope has fully completed.
This watcher must continue working even if the local operator machine is offline.
@@ -9,9 +9,10 @@ This watcher must continue working even if the local operator machine is offline
Use a `systemd`-managed watcher on the ATVM Cypress controller.
Recommended structure:
- one watcher script that evaluates the state of a specific ATVM run
- one watcher script that evaluates a specific ATVM run request
- one `systemd` service to execute the watcher
- optionally one `systemd` timer for periodic polling if the watcher is not implemented as a long-running process
- no always-on daemon
- for categorized ATVM runs, one watcher instance tracks the parent request and posts each categorized sub-run separately as those grouped runs complete
Preferred deployment target:
- controller host: `192.168.3.190`
@@ -26,14 +27,23 @@ Expected variables:
- `MATTERMOST_ATVM_CHANNEL`
## Run Completion Rule
The watcher must send Mattermost results only after the ATVM run has fully completed.
The watcher must send Mattermost results only after the watched scope has fully completed.
A run is considered fully completed only when:
A non-categorized run is considered fully completed only when:
- there are no active runner processes for the run
- the expected machine scope has final result artifacts
- no machine remains in `RUNNING` or `NOT STARTED`
- final reporter artifacts confirm the run has ended
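The four conditions above can be read as a single predicate. The following is a minimal sketch, not the watcher's actual code: `RunSnapshot` and its field names are hypothetical, and it assumes the expected machine scope is the set of machines the watcher has seen a state for.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# States that mean a machine has not finished (named in the design doc).
NOT_FINISHED = {"RUNNING", "NOT STARTED"}


@dataclass
class RunSnapshot:
    """Hypothetical container for one poll of a non-categorized run."""
    active_runner_pids: list[int] = field(default_factory=list)
    machine_states: dict[str, str] = field(default_factory=dict)  # machine -> state
    machines_with_final_artifacts: set[str] = field(default_factory=set)
    reporter_confirms_end: bool = False


def is_fully_completed(snap: RunSnapshot) -> bool:
    """Return True only when every completion condition holds."""
    if snap.active_runner_pids:
        return False  # runner processes are still alive
    if not snap.machine_states:
        return False  # assumption: an empty scope is never "complete"
    if not set(snap.machine_states) <= snap.machines_with_final_artifacts:
        return False  # some expected machine has no final result artifact
    if any(state in NOT_FINISHED for state in snap.machine_states.values()):
        return False  # a machine is still RUNNING or NOT STARTED
    return snap.reporter_confirms_end
```

Checking all four conditions together, rather than any single one, keeps a run whose processes exited early from being reported as finished.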
A categorized run must be treated differently:
- `--categorize` splits the request into sequential ATVM sub-runs
- each categorized group is its own run/job
- the watcher must detect each grouped sub-run in order
- the watcher must wait for that grouped sub-run to complete
- then send that grouped sub-run's final Mattermost status
- then continue watching for the next grouped sub-run
- the watcher must not wait until the very end to send one single parent-only post
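The categorized rule above amounts to a sequential loop with one post per finished group. A rough sketch follows, assuming detection, waiting, and posting are supplied as callables; `watch_categorized` and its parameters are hypothetical names. It also assumes the watcher moves on to the next group after a group that must not be posted, which the design does not state explicitly.

```python
from __future__ import annotations

from typing import Callable, Iterable


def watch_categorized(
    sub_runs: Iterable[str],
    wait_until_ended: Callable[[str], str],
    post_status: Callable[[str, str], None],
) -> list[str]:
    """Handle grouped sub-runs one at a time, in the order they are detected.

    wait_until_ended blocks until the given sub-run has ended and returns its
    final status; post_status sends one Mattermost message for that sub-run.
    Returns the sub-runs that were posted.
    """
    posted: list[str] = []
    for sub_run in sub_runs:
        final_status = wait_until_ended(sub_run)
        # Only COMPLETED is treated as postable here; cancelled, terminated,
        # hung, or unknown sub-runs are skipped without a post.
        if final_status == "COMPLETED":
            post_status(sub_run, final_status)  # post now, not at the very end
            posted.append(sub_run)
    return posted
```

Because the post happens inside the loop, each group's result reaches Mattermost before the next group is watched.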
Evidence sources:
- live runner processes on `192.168.3.190`
- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/logs/`
@@ -70,7 +80,7 @@ Definitions:
- the run is still active and not yet complete
## Mattermost Posting Rule
Post to Mattermost only when the run has fully completed.
Post to Mattermost only when the watched scope has fully completed.
Send Mattermost status for:
- `COMPLETED`
@@ -86,6 +96,9 @@ Do not send Mattermost status for:
Important clarification:
- a completed run with failed hosts should still be posted
- a cancelled, terminated, hung, or unknown run should not be posted
- for categorized execution, this rule applies per categorized sub-run
- one categorized group completion should produce one Mattermost post
- do not send one parent-level aggregate post in place of the per-group posts
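As an illustration of the posting rule, the decision can be expressed per watched scope (a whole run, or one categorized sub-run). This is a sketch under assumptions: `should_post` is a hypothetical helper, and only `COMPLETED` is listed as postable because it is the only postable status visible in this excerpt.

```python
from __future__ import annotations

# Statuses the design says must never produce a Mattermost post.
NEVER_POST = frozenset({"CANCELLED", "TERMINATED", "HUNG", "UNKNOWN"})
# Assumption: extend this set if the full document lists more postable statuses.
POSTABLE = frozenset({"COMPLETED"})


def should_post(status: str, failed_hosts: int, already_posted: bool) -> bool:
    """Decide whether one watched scope should be posted.

    failed_hosts is accepted but deliberately ignored: a completed scope is
    posted whether its hosts passed or failed.
    """
    if already_posted:
        return False
    if status in NEVER_POST:
        return False
    return status in POSTABLE
```

For a categorized request the function would be called once per sub-run, never once for the parent as an aggregate.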
## Required Cancellation / Termination Handling
If a run is cancelled or terminated, the watcher must:
@@ -106,33 +119,47 @@ For each run, keep durable state such as:
- last observed machine summary
- timestamps for first seen, last seen, closed
For categorized runs, keep durable state for:
- the parent request build name
- each detected categorized sub-run
- whether each categorized sub-run has already been posted
## Duplicate-Post Prevention
The watcher must prevent duplicate Mattermost posts.
Required behavior:
- only one final post per run
- if a run is already marked as posted, do not send again
- if a run is marked `CANCELLED`, `TERMINATED`, `HUNG`, or `UNKNOWN`, do not later convert it into a posted completion unless explicitly reset by an operator workflow
- for non-categorized execution, only one final post per run
- for categorized execution, only one final post per categorized sub-run
- if a watched scope is already marked as posted, do not send again
- if a run or categorized sub-run is marked `CANCELLED`, `TERMINATED`, `HUNG`, or `UNKNOWN`, do not later convert it into a posted completion unless explicitly reset by an operator workflow
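One way to enforce a single post per watched scope is an atomically created marker file. The sketch below is an assumption about how this could be done, not the committed implementation; `claim_post` and the `posted` file name are hypothetical.

```python
from __future__ import annotations

import os
from pathlib import Path


def claim_post(scope_dir: Path) -> bool:
    """Atomically claim the right to post for one run or categorized sub-run.

    Returns True for the first caller and False for every later caller,
    including after a watcher restart, because the marker lives on disk.
    """
    scope_dir.mkdir(parents=True, exist_ok=True)
    marker = scope_dir / "posted"
    try:
        # O_CREAT | O_EXCL fails if the marker already exists.
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```

Claiming before sending gives at-most-once behavior: a failed send is not retried automatically, which favors avoiding duplicates over guaranteed delivery. Writing the marker after a successful send would make the opposite trade-off.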
## Recommended State Files
Use a durable controller-local state directory, for example:
- `/var/lib/atvm-run-watcher/`
Possible contents:
- one state file per run id
- one posted marker per run id
- one cancellation marker per run id
- one parent state file per requested build name
- one posted marker per non-categorized run
- one subdirectory per categorized sub-run with its own state and posted marker
- one cancellation marker per parent run id
- optional lock file to prevent multiple watcher instances from racing
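The layout above could map to paths as in the following sketch. Only the `/var/lib/atvm-run-watcher/` root comes from the design; the helper names, the `subruns` directory, and every file name are hypothetical placeholders.

```python
from __future__ import annotations

from pathlib import Path

# Example root from the design doc; everything below it is an assumed layout.
STATE_ROOT = Path("/var/lib/atvm-run-watcher")


def parent_state_file(build_name: str, root: Path = STATE_ROOT) -> Path:
    """One parent state file per requested build name."""
    return root / build_name / "state.json"


def cancellation_marker(build_name: str, root: Path = STATE_ROOT) -> Path:
    """One cancellation marker per parent run."""
    return root / build_name / "cancelled"


def subrun_dir(build_name: str, sub_run: str, root: Path = STATE_ROOT) -> Path:
    """One subdirectory per categorized sub-run (holds its state and posted marker)."""
    return root / build_name / "subruns" / sub_run


def lock_file(build_name: str, root: Path = STATE_ROOT) -> Path:
    """Optional lock file that keeps two watcher instances from racing."""
    return root / build_name / "watcher.lock"
```

Keeping each sub-run in its own directory lets its state and posted marker be written or reset without touching the parent's files.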
## Recommended Operator Workflow
Normal completion workflow:
1. ATVM run starts.
2. Watcher tracks the run id / build name.
2. Watcher tracks the requested build name.
3. Watcher polls run state and artifacts.
4. Run fully completes.
5. Watcher builds final status summary.
6. Watcher posts final status to Mattermost once.
7. Watcher marks the run as posted and closed.
4. For non-categorized execution:
   - wait for the run to fully complete
   - build one final status summary
   - post one final Mattermost status
5. For categorized execution:
   - detect each grouped sub-run in order
   - wait for that grouped sub-run to fully complete
   - build that grouped sub-run's final status summary
   - post that grouped sub-run's final Mattermost status
   - continue to the next grouped sub-run
6. Watcher marks the completed watched scope as posted and closed.
Cancellation / termination workflow:
1. Operator stops the ATVM run.
@@ -173,7 +200,9 @@ This watcher design must satisfy all of the following:
- survive local operator machine downtime
- use `systemd`
- distinguish run states clearly
- send Mattermost only after full completion
- send Mattermost only after full completion of the watched scope
- send completion results whether hosts passed or failed
- never send Mattermost for cancelled, terminated, hung, or unknown runs
- prevent duplicate or misleading posts
- treat `--categorize` as sequential ATVM sub-runs, not as one parent run with internal phases
- send one Mattermost post per completed categorized sub-run