# ATVM Mattermost Watcher Design

## Purpose

Design a controller-local watcher on the ATVM Cypress machine (`192.168.3.190`) that monitors an ATVM automation run and posts the final run status to Mattermost only after the run has fully completed.

This watcher must continue working even if the local operator machine is offline.

## Implementation Approach

Use a `systemd`-managed watcher on the ATVM Cypress controller.

Recommended structure:

- one watcher script that evaluates the state of a specific ATVM run
- one `systemd` service to execute the watcher
- optionally one `systemd` timer for periodic polling if the watcher is not implemented as a long-running process

Preferred deployment target:

- controller host: `192.168.3.190`
- ATVM automation root: `/root/cdc-e2e-cyp-12.17.4`

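The service-plus-timer structure above could look roughly like the following. This is a minimal sketch, not a finished deployment: the unit name `atvm-run-watcher`, the script path `/usr/local/bin/atvm-run-watcher`, and the `2min` polling interval are all hypothetical values chosen for illustration. The helpers only render unit text; installing the files and enabling the timer are left to the operator.

```python
from textwrap import dedent

# Hypothetical names: the unit name, script path, and interval below are
# illustrative assumptions, not values taken from the design.
UNIT_NAME = "atvm-run-watcher"
WATCHER_SCRIPT = "/usr/local/bin/atvm-run-watcher"


def render_service(script: str = WATCHER_SCRIPT) -> str:
    """Return a oneshot service unit that runs a single watcher pass."""
    return dedent(f"""\
        [Unit]
        Description=ATVM run watcher (single polling pass)

        [Service]
        Type=oneshot
        ExecStart={script}
        """)


def render_timer(interval: str = "2min") -> str:
    """Return a timer unit that re-runs the service at a fixed interval."""
    return dedent(f"""\
        [Unit]
        Description=Poll ATVM run state periodically

        [Timer]
        OnBootSec={interval}
        OnUnitActiveSec={interval}
        Unit={UNIT_NAME}.service

        [Install]
        WantedBy=timers.target
        """)
```

A oneshot service driven by a timer keeps each watcher pass short and stateless in memory, which fits the durable state files described later. A long-running watcher would drop the timer and use a different service type.
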
## Mattermost Destination

Use the local credential file in this workspace as the source of defaults:

- `/home/aw/code/cds/.env.credentials.local`

Expected variables:

- `MATTERMOST_ATVM_WEBHOOK`
- `MATTERMOST_ATVM_CHANNEL`

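A minimal sketch of reading those two variables follows. It assumes the credentials file uses plain `KEY=VALUE` lines, optionally quoted or prefixed with `export`; the design does not specify the file format. The path is passed in by the caller because the design does not say where the file will live on the controller.

```python
from pathlib import Path

REQUIRED_KEYS = ("MATTERMOST_ATVM_WEBHOOK", "MATTERMOST_ATVM_CHANNEL")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments.

    Assumed format only: no multi-line values or variable expansion.
    """
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip().removeprefix("export ").strip()
        values[key] = value.strip().strip("'\"")
    return values


def load_mattermost_defaults(path: Path) -> dict[str, str]:
    """Return the two Mattermost settings, or raise if either is missing."""
    values = parse_env_text(path.read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not values.get(key)]
    if missing:
        raise KeyError(f"missing Mattermost settings: {', '.join(missing)}")
    return {key: values[key] for key in REQUIRED_KEYS}
```

Failing loudly on a missing variable keeps the watcher from silently skipping a post because of a configuration gap.
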
## Run Completion Rule

The watcher must send Mattermost results only after the ATVM run has fully completed.

A run is considered fully completed only when:

- there are no active runner processes for the run
- the expected machine scope has final result artifacts
- no machine remains in `RUNNING` or `NOT STARTED`
- final reporter artifacts confirm the run has ended

Evidence sources:

- live runner processes on `192.168.3.190`
- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/logs/`
- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/xml/`
- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/mochawesome/`

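The four completion conditions can be expressed as one predicate over already-collected evidence. The sketch below assumes a hypothetical `RunEvidence` structure; how each field is filled from the process list and the reporter directories is not shown, and per-machine state names other than `RUNNING` and `NOT STARTED` (for example `PASSED`) are illustrative.

```python
from dataclasses import dataclass

NON_FINAL_MACHINE_STATES = frozenset({"RUNNING", "NOT STARTED"})


@dataclass
class RunEvidence:
    """Hypothetical snapshot of the evidence sources for one run."""
    active_runner_pids: tuple[int, ...]
    expected_machines: frozenset[str]
    machine_states: dict[str, str]  # machine -> state label
    machines_with_artifacts: frozenset[str]
    reporter_confirms_end: bool


def is_fully_completed(ev: RunEvidence) -> bool:
    """Return True only when every completion condition holds."""
    if ev.active_runner_pids:
        return False  # runner processes are still alive
    if not ev.expected_machines <= ev.machines_with_artifacts:
        return False  # some expected machine lacks final artifacts
    for machine in ev.expected_machines:
        # A machine with no recorded state is treated as not started.
        state = ev.machine_states.get(machine, "NOT STARTED")
        if state in NON_FINAL_MACHINE_STATES:
            return False
    return ev.reporter_confirms_end
```

All four checks must pass; any single missing piece of evidence keeps the run in a non-completed state.
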
## Required Run States

The watcher must distinguish these run-level states:

- `COMPLETED`
- `FAILED`
- `CANCELLED`
- `TERMINATED`
- `HUNG`
- `UNKNOWN`
- `RUNNING`

Definitions:

- `COMPLETED`
  - the run finished normally
  - all machines have final results
  - no run-level failure state blocks completion
- `FAILED`
  - the run finished, but one or more hosts failed
  - this is still a completed run
- `CANCELLED`
  - the run was intentionally cancelled through an explicit cancellation path
- `TERMINATED`
  - the run was manually killed or stopped before normal completion
- `HUNG`
  - the run appears stuck and does not meet completion rules within the expected policy window
- `UNKNOWN`
  - the watcher cannot safely determine the true state
- `RUNNING`
  - the run is still active and not yet complete

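One way to keep these seven states unambiguous in code is a string-valued enumeration. The sketch below is illustrative; the `finished_state` helper is a hypothetical name that applies the `COMPLETED` versus `FAILED` definitions to a run that has already met the completion rule.

```python
from enum import Enum


class RunState(str, Enum):
    """Run-level states from the design; values match the labels."""
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"
    TERMINATED = "TERMINATED"
    HUNG = "HUNG"
    UNKNOWN = "UNKNOWN"
    RUNNING = "RUNNING"


# Both of these count as "the run fully completed".
FINISHED_STATES = frozenset({RunState.COMPLETED, RunState.FAILED})


def finished_state(failed_host_count: int) -> RunState:
    """Pick the state for a run that has already fully completed."""
    if failed_host_count < 0:
        raise ValueError("failed_host_count cannot be negative")
    return RunState.FAILED if failed_host_count else RunState.COMPLETED
```

Because the enum members are also strings, they serialize directly into JSON state files and log lines.
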
## Mattermost Posting Rule

Post to Mattermost only when the run has fully completed.

Send Mattermost status for:

- `COMPLETED`
- `FAILED`

Do not send Mattermost status for:

- `CANCELLED`
- `TERMINATED`
- `HUNG`
- `UNKNOWN`
- `RUNNING`

Important clarification:

- a completed run with failed hosts should still be posted
- a cancelled, terminated, hung, or unknown run should not be posted

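The posting rule reduces to a small lookup. The following sketch uses plain strings for the state labels and rejects anything outside the seven defined states, so that a typo cannot accidentally trigger or suppress a post; the function name is hypothetical.

```python
POSTABLE_STATES = frozenset({"COMPLETED", "FAILED"})
SUPPRESSED_STATES = frozenset(
    {"CANCELLED", "TERMINATED", "HUNG", "UNKNOWN", "RUNNING"}
)


def should_post(state: str) -> bool:
    """Return True only for states that represent a fully completed run."""
    if state in POSTABLE_STATES:
        return True
    if state in SUPPRESSED_STATES:
        return False
    raise ValueError(f"unrecognized run state: {state!r}")
```
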
## Required Cancellation / Termination Handling

If a run is cancelled or terminated, the watcher must:

- detect that the run was cancelled or manually killed
- stop waiting for normal completion
- mark the run as closed without posting final Mattermost status
- prevent any later success/failure post for that same run

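The closing behavior can be modeled as a one-way transition on the watcher's view of a run. In this sketch, `WatchStatus`, `close_stopped`, and `may_post` are hypothetical names; detection itself (how the watcher learns that a run was cancelled or killed) is outside the example.

```python
from dataclasses import dataclass, replace

STOP_STATES = frozenset({"CANCELLED", "TERMINATED"})


@dataclass(frozen=True)
class WatchStatus:
    """Hypothetical in-memory view of one tracked run."""
    run_id: str
    state: str = "RUNNING"
    closed: bool = False
    posted: bool = False


def close_stopped(status: WatchStatus, stop_state: str) -> WatchStatus:
    """Close a run as cancelled/terminated without marking it posted."""
    if stop_state not in STOP_STATES:
        raise ValueError(f"not a stop state: {stop_state!r}")
    if status.closed:
        return status  # already closed: the first outcome wins
    return replace(status, state=stop_state, closed=True)


def may_post(status: WatchStatus) -> bool:
    """A closed run never posts, whatever state is observed later."""
    return (
        not status.closed
        and not status.posted
        and status.state in ("COMPLETED", "FAILED")
    )
```

Checking `closed` before the state is what blocks a later success or failure post if final artifacts happen to appear after the run was stopped.
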
## State Tracking Requirements

The watcher must track each monitored run by run id or build name.

For each run, keep durable state such as:

- tracked run id / build name
- controller-side watcher state
- completion marker
- cancellation / termination marker
- Mattermost posted marker
- last observed machine summary
- timestamps for first seen, last seen, closed

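A durable record with those fields might be stored as one JSON file per run. The sketch below is one possible shape, not the required schema: the field names, the `.json` suffix, and the use of epoch-second timestamps are assumptions. Writes go through a temporary file and an atomic rename so that a crash mid-write does not leave a truncated record.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical durable state for one tracked run."""
    run_id: str
    watcher_state: str = "RUNNING"
    completed: bool = False
    stopped: bool = False  # cancellation / termination marker
    posted: bool = False
    machine_summary: dict = field(default_factory=dict)
    first_seen: Optional[float] = None
    last_seen: Optional[float] = None
    closed_at: Optional[float] = None


def save_record(directory: Path, record: RunRecord) -> Path:
    """Write the record atomically: temp file, fsync, then rename."""
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / f"{record.run_id}.json"
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(asdict(record), handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_name, target)
    return target


def load_record(directory: Path, run_id: str) -> Optional[RunRecord]:
    """Return the stored record, or None if the run is not tracked yet."""
    path = directory / f"{run_id}.json"
    if not path.exists():
        return None
    return RunRecord(**json.loads(path.read_text(encoding="utf-8")))
```
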
## Duplicate-Post Prevention

The watcher must prevent duplicate Mattermost posts.

Required behavior:

- only one final post per run
- if a run is already marked as posted, do not send again
- if a run is marked `CANCELLED`, `TERMINATED`, `HUNG`, or `UNKNOWN`, do not later convert it into a posted completion unless explicitly reset by an operator workflow

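One way to enforce "only one final post per run" is to claim a marker file with an exclusive create before sending. The sketch below assumes hypothetical marker names (`<run id>.posted`, `<run id>.cancelled`) and covers only the cancellation marker as an example of a suppressing state; the other suppressed states would need an equivalent check.

```python
import os
from pathlib import Path


def claim_post(state_dir: Path, run_id: str) -> bool:
    """Return True for exactly one caller per run; False otherwise.

    O_CREAT | O_EXCL makes marker creation atomic, so two watcher
    passes racing on the same run cannot both win the claim.
    """
    state_dir.mkdir(parents=True, exist_ok=True)
    if (state_dir / f"{run_id}.cancelled").exists():
        return False  # suppressed run: never convert into a post
    marker = state_dir / f"{run_id}.posted"
    try:
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False  # already posted (or already claimed)
    os.close(fd)
    return True
```

Claiming before sending gives at-most-once delivery: if the send fails after the claim, the post is lost until an operator removes the marker. Marking after a successful send instead would risk a duplicate if the watcher crashes between the two steps. Which trade-off is acceptable is a policy decision the design leaves open.
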
## Recommended State Files

Use a durable controller-local state directory, for example:

- `/var/lib/atvm-run-watcher/`

Possible contents:

- one state file per run id
- one posted marker per run id
- one cancellation marker per run id
- optional lock file to prevent multiple watcher instances from racing

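The directory layout and the optional lock could be sketched as below. The file suffixes and the `watcher.lock` name are assumptions; only the directory path comes from the example above. The lock uses `fcntl.flock`, which is available on Linux and released automatically if the watcher process dies.

```python
import fcntl
from contextlib import contextmanager
from pathlib import Path

STATE_DIR = Path("/var/lib/atvm-run-watcher")


def run_paths(run_id: str, state_dir: Path = STATE_DIR) -> dict[str, Path]:
    """Return the per-run file paths (suffixes are illustrative)."""
    return {
        "state": state_dir / f"{run_id}.json",
        "posted": state_dir / f"{run_id}.posted",
        "cancelled": state_dir / f"{run_id}.cancelled",
    }


@contextmanager
def watcher_lock(state_dir: Path = STATE_DIR):
    """Hold an exclusive, non-blocking lock for one watcher pass."""
    state_dir.mkdir(parents=True, exist_ok=True)
    with open(state_dir / "watcher.lock", "w") as handle:
        try:
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise RuntimeError("another watcher instance holds the lock") from None
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

A watcher pass would wrap its work in `with watcher_lock(): ...` and exit quietly on `RuntimeError`, leaving the run to the instance that already holds the lock.
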
## Recommended Operator Workflow

Normal completion workflow:

1. ATVM run starts.
2. Watcher tracks the run id / build name.
3. Watcher polls run state and artifacts.
4. Run fully completes.
5. Watcher builds final status summary.
6. Watcher posts final status to Mattermost once.
7. Watcher marks the run as posted and closed.

Cancellation / termination workflow:

1. Operator stops the ATVM run.
2. Watcher detects cancellation / termination, or an explicit cancellation marker is written.
3. Watcher marks the run `CANCELLED` or `TERMINATED`.
4. Watcher exits cleanly without posting to Mattermost.
5. Watcher prevents later duplicate or misleading final-post behavior.

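Both workflows can be driven by the same polling loop. The sketch below assumes a long-running watcher and takes its collaborators as injected callables; `classify`, `post`, `already_posted`, and `mark_posted` are hypothetical hooks, and the 60-second default interval is an arbitrary placeholder.

```python
import time
from typing import Callable


def watch_run(
    run_id: str,
    classify: Callable[[str], str],
    post: Callable[[str, str], None],
    already_posted: Callable[[str], bool],
    mark_posted: Callable[[str], None],
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the run leaves RUNNING, then post at most once.

    Returns the final run-level state so the caller can record it.
    """
    while True:
        state = classify(run_id)
        if state == "RUNNING":
            sleep(poll_seconds)
            continue
        # Only fully completed runs are posted; every other terminal
        # state (cancelled, terminated, hung, unknown) exits silently.
        if state in ("COMPLETED", "FAILED") and not already_posted(run_id):
            post(run_id, state)
            mark_posted(run_id)
        return state
```

Injecting `sleep` and the other callables keeps the loop testable without a real controller, Mattermost endpoint, or wall-clock delay.
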
## Failure Semantics

Host-level failures do not suppress Mattermost posting.

If:

- the run has fully completed
- and one or more hosts failed

Then:

- final Mattermost status should still be sent
- final run-level state should be treated as completed-with-failures, that is, `FAILED` as defined under Required Run States

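As an illustration of posting a completed-with-failures result, the sketch below builds a Mattermost incoming-webhook payload (a JSON body with `text` and an optional `channel` override) and sends it with the standard library. The message wording is a placeholder and not the approved status template; the `host_results` mapping of host name to pass/fail is a hypothetical input shape.

```python
import json
import urllib.request


def build_payload(run_id: str, host_results: dict[str, bool], channel: str) -> dict:
    """Build the webhook body for a run that has fully completed."""
    failed = sorted(host for host, ok in host_results.items() if not ok)
    state = "FAILED" if failed else "COMPLETED"
    passed_count = len(host_results) - len(failed)
    lines = [
        f"ATVM run `{run_id}` finished: **{state}**",
        f"Hosts: {passed_count} passed, {len(failed)} failed",
    ]
    if failed:
        lines.append("Failed hosts: " + ", ".join(failed))
    return {"channel": channel, "text": "\n".join(lines)}


def send(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Host failures change only the message content and the run-level label; the decision to send is the same for both outcomes.
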
## Hang / Unknown Semantics

If the run cannot be safely classified as completed, failed, cancelled, or terminated:

- classify it as `HUNG` or `UNKNOWN`
- do not post to Mattermost
- require operator review

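A possible way to separate `HUNG` from `UNKNOWN` is sketched below. It assumes the policy window is measured from the last observed progress and that `UNKNOWN` means the evidence could not be read at all; the design does not fix either definition, so both are assumptions, as is the caller-supplied `hang_after_seconds` threshold.

```python
def classify_unfinished(
    seconds_since_progress: float,
    evidence_readable: bool,
    hang_after_seconds: float,
) -> str:
    """Classify a run that has not met the completion rules."""
    if not evidence_readable:
        return "UNKNOWN"  # cannot safely determine the true state
    if seconds_since_progress >= hang_after_seconds:
        return "HUNG"  # no progress within the policy window
    return "RUNNING"


def needs_operator_review(state: str) -> bool:
    """HUNG and UNKNOWN runs are never posted; they go to an operator."""
    return state in ("HUNG", "UNKNOWN")
```
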
## Logging Requirements

The watcher should log:

- the run id / build name being monitored
- each state transition
- posting decisions
- reasons for suppressing a Mattermost post
- duplicate-post prevention decisions
- final closed state

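These log events map naturally onto a few small helpers around the standard `logging` module. The logger name and message formats below are illustrative assumptions. When the watcher runs as a `systemd` service, output written to stdout or stderr is captured by the journal, so no log file handling is shown.

```python
import logging

log = logging.getLogger("atvm-run-watcher")  # hypothetical logger name


def log_transition(run_id: str, old: str, new: str) -> bool:
    """Log a state change; return False if nothing changed."""
    if old == new:
        return False
    log.info("run=%s transition %s -> %s", run_id, old, new)
    return True


def log_post_decision(run_id: str, state: str, will_post: bool, reason: str) -> None:
    """Log why a Mattermost post is being sent or suppressed."""
    action = "posting" if will_post else "suppressing post"
    log.info("run=%s state=%s %s: %s", run_id, state, action, reason)


def log_closed(run_id: str, final_state: str) -> None:
    """Log the final closed state of a run."""
    log.info("run=%s closed final_state=%s", run_id, final_state)
```

Keeping the run id as the first key in every message makes it straightforward to filter the journal for a single run.
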
## Summary

This watcher design must satisfy all of the following:

- run on the ATVM Cypress controller
- survive local operator machine downtime
- use `systemd`
- distinguish run states clearly
- send Mattermost only after full completion
- send completion results whether hosts passed or failed
- never send Mattermost for cancelled, terminated, hung, or unknown runs
- prevent duplicate or misleading posts