Update ATVM status reporting and credential handling docs

- change ATVM status formatting to the approved Markdown-table template with SUMMARY:, HOSTS:, TIMING:, and NOTES:
- document that normal status requests print locally only unless explicitly asked to send to Mattermost
- document Mattermost defaults and posting rules, including only sending after full run completion
- document the controller-side systemd watcher design for future automation
- add the secrets migration/cleanup review doc
- ignore .env.credentials.local in git and reflect the move toward using that local credentials file instead of hardcoded secrets
179
atvm/docs/automation/mattermost-watcher-design.md
Normal file
@@ -0,0 +1,179 @@
# ATVM Mattermost Watcher Design

## Purpose

Design a controller-local watcher on the ATVM Cypress machine (`192.168.3.190`) that monitors an ATVM automation run and posts the final run status to Mattermost only after the run has fully completed.

This watcher must continue working even if the local operator machine is offline.

## Implementation Approach

Use a `systemd`-managed watcher on the ATVM Cypress controller.

Recommended structure:

- one watcher script that evaluates the state of a specific ATVM run
- one `systemd` service to execute the watcher
- optionally one `systemd` timer for periodic polling if the watcher is not implemented as a long-running process

Preferred deployment target:

- controller host: `192.168.3.190`
- ATVM automation root: `/root/cdc-e2e-cyp-12.17.4`

## Mattermost Destination

Use the local credential file in this workspace as the source of defaults:

- `/home/aw/code/cds/.env.credentials.local`

Expected variables:

- `MATTERMOST_ATVM_WEBHOOK`
- `MATTERMOST_ATVM_CHANNEL`
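
A minimal sketch of how the watcher could read these two variables. It assumes the credentials file uses plain `KEY=VALUE` lines (dotenv style), which this document does not confirm; `parse_env_text` and `load_mattermost_defaults` are hypothetical helper names.

```python
from pathlib import Path

# Variable names come from this design doc; everything else is illustrative.
REQUIRED_KEYS = ("MATTERMOST_ATVM_WEBHOOK", "MATTERMOST_ATVM_CHANNEL")


def parse_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments.

    Assumption: the file is dotenv-like. An optional leading 'export ' and
    surrounding quotes are tolerated.
    """
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_mattermost_defaults(path: Path) -> dict:
    """Return only the two expected variables, failing loudly if one is missing."""
    values = parse_env_text(path.read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not values.get(key)]
    if missing:
        raise KeyError("missing credential variables: " + ", ".join(missing))
    return {key: values[key] for key in REQUIRED_KEYS}
```

Failing on a missing variable, rather than falling back to a hardcoded value, keeps the watcher consistent with the move away from hardcoded secrets.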

## Run Completion Rule

The watcher must send Mattermost results only after the ATVM run has fully completed.

A run is considered fully completed only when:

- there are no active runner processes for the run
- the expected machine scope has final result artifacts
- no machine remains in `RUNNING` or `NOT STARTED`
- final reporter artifacts confirm the run has ended

Evidence sources:

- live runner processes on `192.168.3.190`
- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/logs/`
- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/xml/`
- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/mochawesome/`
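
The four completion conditions can be expressed as one predicate over already-collected evidence. This is a sketch only: the `RunEvidence` fields are hypothetical, and how each field is filled from the process list and the `cmcReporter` directories is left open because the artifact formats are not described here.

```python
from dataclasses import dataclass

# Machine-level states that this document says must not remain at completion.
NON_FINAL_HOST_STATES = {"RUNNING", "NOT STARTED"}


@dataclass
class RunEvidence:
    """Evidence gathered for one run (field names are illustrative)."""
    active_runner_pids: list          # live runner processes for this run
    expected_hosts: set               # the expected machine scope
    hosts_with_final_artifacts: set   # hosts that have final result artifacts
    host_states: dict                 # host -> last observed machine state
    reporter_confirms_end: bool       # final reporter artifacts say the run ended


def is_fully_completed(ev: RunEvidence) -> bool:
    """Return True only when all four completion conditions hold."""
    if ev.active_runner_pids:
        return False
    if not ev.expected_hosts <= ev.hosts_with_final_artifacts:
        return False
    for host in ev.expected_hosts:
        # A host with no observed state is treated as NOT STARTED.
        if ev.host_states.get(host, "NOT STARTED") in NON_FINAL_HOST_STATES:
            return False
    return ev.reporter_confirms_end
```

Requiring every condition, rather than any single signal, avoids posting while a runner has exited but artifacts are still being written.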

## Required Run States

The watcher must distinguish these run-level states:

- `COMPLETED`
- `FAILED`
- `CANCELLED`
- `TERMINATED`
- `HUNG`
- `UNKNOWN`
- `RUNNING`

Definitions:

- `COMPLETED`
  - the run finished normally
  - all machines have final results
  - no run-level failure state blocks completion
- `FAILED`
  - the run finished, but one or more hosts failed
  - this is still a completed run
- `CANCELLED`
  - the run was intentionally cancelled through an explicit cancellation path
- `TERMINATED`
  - the run was manually killed or stopped before normal completion
- `HUNG`
  - the run appears stuck and does not meet completion rules within the expected policy window
- `UNKNOWN`
  - the watcher cannot safely determine the true state
- `RUNNING`
  - the run is still active and not yet complete
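
One way to keep these seven states from drifting into free-form strings is an enumeration. The sketch below is illustrative; the `RunState` class and its `is_finished_run` property are hypothetical names, not part of any existing ATVM code.

```python
from enum import Enum


class RunState(Enum):
    """Run-level states listed in this design."""
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"
    TERMINATED = "TERMINATED"
    HUNG = "HUNG"
    UNKNOWN = "UNKNOWN"
    RUNNING = "RUNNING"

    @property
    def is_finished_run(self) -> bool:
        """True for runs that finished, with or without host failures."""
        return self in (RunState.COMPLETED, RunState.FAILED)
```

Parsing a stored value with `RunState("FAILED")` raises `ValueError` for anything outside the list, which surfaces a corrupted state file early.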

## Mattermost Posting Rule

Post to Mattermost only when the run has fully completed.

Send Mattermost status for:

- `COMPLETED`
- `FAILED`

Do not send Mattermost status for:

- `CANCELLED`
- `TERMINATED`
- `HUNG`
- `UNKNOWN`
- `RUNNING`

Important clarification:

- a completed run with failed hosts should still be posted
- a cancelled, terminated, hung, or unknown run should not be posted
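
The posting rule reduces to a small lookup. The sketch below returns a reason string alongside the decision so the watcher can record why a post was sent or suppressed; `posting_decision` is a hypothetical name, and treating an unrecognised state as suppressed is an assumption chosen to err on the side of not posting.

```python
POSTABLE_STATES = frozenset({"COMPLETED", "FAILED"})
SUPPRESSED_STATES = frozenset(
    {"CANCELLED", "TERMINATED", "HUNG", "UNKNOWN", "RUNNING"}
)


def posting_decision(state: str) -> tuple:
    """Return (should_post, reason) for a run-level state name."""
    if state in POSTABLE_STATES:
        return True, f"run finished with state {state}"
    if state in SUPPRESSED_STATES:
        return False, f"state {state} is never posted"
    # Assumption: anything outside the documented states is handled like UNKNOWN.
    return False, f"unrecognised state {state!r}; treated as UNKNOWN"
```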

## Required Cancellation / Termination Handling

If a run is cancelled or terminated, the watcher must:

- detect that the run was cancelled or manually killed
- stop waiting for normal completion
- mark the run as closed without posting final Mattermost status
- prevent any later success/failure post for that same run

## State Tracking Requirements

The watcher must track each monitored run by run id or build name.

For each run, keep durable state such as:

- tracked run id / build name
- controller-side watcher state
- completion marker
- cancellation / termination marker
- Mattermost posted marker
- last observed machine summary
- timestamps for first seen, last seen, closed
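
A sketch of one possible durable record, stored as a JSON file per run and written atomically so a crash mid-write cannot leave a half-written state file. The `RunRecord` field names and the JSON layout are assumptions for illustration, and the run id is assumed to be safe to use as a file name.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class RunRecord:
    """Per-run watcher state (field names are illustrative)."""
    run_id: str
    watcher_state: str = "RUNNING"
    completed: bool = False
    cancelled_or_terminated: bool = False
    mattermost_posted: bool = False
    last_machine_summary: str = ""
    first_seen: Optional[float] = None
    last_seen: Optional[float] = None
    closed_at: Optional[float] = None


def save_record(state_dir: Path, record: RunRecord) -> Path:
    """Write the record to a temp file, fsync it, then rename over the target."""
    state_dir.mkdir(parents=True, exist_ok=True)
    target = state_dir / f"{record.run_id}.json"
    fd, tmp_name = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(asdict(record), handle, indent=2)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_name, target)  # atomic within one filesystem
    return target


def load_record(state_dir: Path, run_id: str) -> Optional[RunRecord]:
    """Return the stored record, or None if this run has not been seen."""
    path = state_dir / f"{run_id}.json"
    if not path.exists():
        return None
    return RunRecord(**json.loads(path.read_text(encoding="utf-8")))
```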

## Duplicate-Post Prevention

The watcher must prevent duplicate Mattermost posts.

Required behavior:

- only one final post per run
- if a run is already marked as posted, do not send again
- if a run is marked `CANCELLED`, `TERMINATED`, `HUNG`, or `UNKNOWN`, do not later convert it into a posted completion unless explicitly reset by an operator workflow
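
One way to get "only one final post per run" is to claim a marker file with an exclusive create before sending. The sketch below assumes marker files named `<run id>.posted` and `<run id>.suppressed`; both names and the `claim_post` helper are hypothetical.

```python
import os
from pathlib import Path


def claim_post(state_dir: Path, run_id: str) -> bool:
    """Return True only for the first caller allowed to post for this run.

    A '.suppressed' marker (written when a run is marked CANCELLED,
    TERMINATED, HUNG, or UNKNOWN) blocks posting until an operator removes it.
    """
    state_dir.mkdir(parents=True, exist_ok=True)
    if (state_dir / f"{run_id}.suppressed").exists():
        return False
    marker = state_dir / f"{run_id}.posted"
    try:
        # O_EXCL makes creation fail if the marker already exists.
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```

Claiming before sending gives at-most-once behavior: if the send itself fails, no post is made until an operator removes the marker. Claiming after sending would instead risk a duplicate on retry, which this design rules out.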

## Recommended State Files

Use a durable controller-local state directory, for example:

- `/var/lib/atvm-run-watcher/`

Possible contents:

- one state file per run id
- one posted marker per run id
- one cancellation marker per run id
- optional lock file to prevent multiple watcher instances from racing
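
The optional lock file could be implemented with an advisory `flock`, which the kernel releases automatically if the watcher process dies. This is a sketch for Linux; `single_instance` is a hypothetical helper and the lock file name is up to the implementation.

```python
import fcntl
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def single_instance(lock_path: Path):
    """Yield True if this caller holds the watcher lock, False otherwise."""
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # Non-blocking exclusive lock: fail fast instead of queueing.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        os.close(fd)  # closing the descriptor also releases the lock
```

A timer-driven watcher would wrap each poll in `with single_instance(...) as held:` and return immediately when `held` is false, so an overlapping invocation cannot race the one already running.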

## Recommended Operator Workflow

Normal completion workflow:

1. ATVM run starts.
2. Watcher tracks the run id / build name.
3. Watcher polls run state and artifacts.
4. Run fully completes.
5. Watcher builds final status summary.
6. Watcher posts final status to Mattermost once.
7. Watcher marks the run as posted and closed.

Cancellation / termination workflow:

1. Operator stops the ATVM run.
2. Watcher detects cancellation / termination, or an explicit cancellation marker is written.
3. Watcher marks the run `CANCELLED` or `TERMINATED`.
4. Watcher exits cleanly without posting to Mattermost.
5. Watcher prevents later duplicate or misleading final-post behavior.
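
Step 6 of the normal completion workflow could look like the sketch below, using only the standard library. Mattermost incoming webhooks accept a JSON body with a `text` field and an optional `channel` override; the function names and the 10-second timeout are assumptions for illustration.

```python
import json
import urllib.request


def build_payload(channel: str, summary_markdown: str) -> bytes:
    """Encode the final status summary as an incoming-webhook JSON body."""
    body = {"channel": channel, "text": summary_markdown}
    return json.dumps(body).encode("utf-8")


def post_final_status(webhook_url: str, channel: str,
                      summary_markdown: str, timeout: float = 10.0) -> int:
    """POST the summary to the webhook and return the HTTP status code.

    urllib raises HTTPError for 4xx/5xx responses and URLError for network
    problems; the caller decides how to record a failed send.
    """
    request = urllib.request.Request(
        webhook_url,
        data=build_payload(channel, summary_markdown),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

The webhook URL and channel would come from `MATTERMOST_ATVM_WEBHOOK` and `MATTERMOST_ATVM_CHANNEL`, and the call would be made only after the run is classified as `COMPLETED` or `FAILED`.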

## Failure Semantics

Host-level failures do not suppress Mattermost posting.

If:

- the run has fully completed
- and one or more hosts failed

Then:

- final Mattermost status should still be sent
- final run-level state should be `FAILED`, which this design treats as a completed run with failures

## Hang / Unknown Semantics

If the run cannot be safely classified as completed, failed, cancelled, or terminated:

- classify it as `HUNG` or `UNKNOWN`
- do not post to Mattermost
- require operator review
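
A sketch of how an unfinished run might be split between `RUNNING`, `HUNG`, and `UNKNOWN`. The inputs, the idea of measuring time since the last observed progress, and the choice of policy window are all assumptions; this document does not define the hang policy.

```python
from typing import Optional


def classify_unfinished(active_runners: bool,
                        seconds_since_progress: Optional[float],
                        hang_window_seconds: float) -> str:
    """Classify a run that does not yet meet the completion rule.

    seconds_since_progress is None when the watcher has no usable evidence
    of progress at all (for example, unreadable artifacts).
    """
    if seconds_since_progress is None:
        return "UNKNOWN"
    if seconds_since_progress > hang_window_seconds:
        return "HUNG"
    # Inside the policy window: active runners mean the run is still going.
    # No runners and no completion is ambiguous, so leave it for an operator.
    return "RUNNING" if active_runners else "UNKNOWN"
```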

## Logging Requirements

The watcher should log:

- the run id / build name being monitored
- each state transition
- posting decisions
- reasons for suppressing a Mattermost post
- duplicate-post prevention decisions
- final closed state
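
A small sketch of log helpers covering state transitions and posting decisions, using the standard `logging` module. The logger name and message layout are illustrative; under `systemd`, output written to stderr is normally captured by the journal, so no file handler is configured here.

```python
import logging

logger = logging.getLogger("atvm_run_watcher")


def log_transition(run_id: str, old_state: str, new_state: str, reason: str) -> None:
    """Record one run-level state transition with the reason for it."""
    logger.info("run=%s transition %s -> %s reason=%s",
                run_id, old_state, new_state, reason)


def log_post_decision(run_id: str, will_post: bool, reason: str) -> None:
    """Record whether a Mattermost post is sent or suppressed, and why."""
    action = "post" if will_post else "suppress"
    logger.info("run=%s mattermost=%s reason=%s", run_id, action, reason)
```

Putting the run id first in every message keeps a per-run history easy to filter out of the journal.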

## Summary

This watcher design must satisfy all of the following:

- run on the ATVM Cypress controller
- survive local operator machine downtime
- use `systemd`
- distinguish run states clearly
- send Mattermost only after full completion
- send completion results whether hosts passed or failed
- never send Mattermost for cancelled, terminated, hung, or unknown runs
- prevent duplicate or misleading posts