cds-ai/atvm/docs/automation/mattermost-watcher-design.md
anthony.wen fa97ce5ad0 Update ATVM status reporting and credential handling docs
- change ATVM status formatting to the approved Markdown-table template with SUMMARY:, HOSTS:, TIMING:, and NOTES:
- document that normal status requests print locally only unless explicitly asked to send to Mattermost
- document Mattermost defaults and posting rules, including only sending after full run completion
- document the controller-side systemd watcher design for future automation
- add the secrets migration/cleanup review doc
- ignore .env.credentials.local in git and reflect the move toward using that local credentials file instead of hardcoded secrets
2026-03-24 14:27:00 -04:00

ATVM Mattermost Watcher Design

Purpose

Design a controller-local watcher on the ATVM Cypress machine (192.168.3.190) that monitors an ATVM automation run and posts the final run status to Mattermost only after the run has fully completed.

This watcher must continue working even if the local operator machine is offline.

Implementation Approach

Use a systemd-managed watcher on the ATVM Cypress controller.

Recommended structure:

  • one watcher script that evaluates the state of a specific ATVM run
  • one systemd service to execute the watcher
  • optionally one systemd timer for periodic polling if the watcher is not implemented as a long-running process

Preferred deployment target:

  • controller host: 192.168.3.190
  • ATVM automation root: /root/cdc-e2e-cyp-12.17.4

Mattermost Destination

Use the local credential file in this workspace as the source of defaults:

  • /home/aw/code/cds/.env.credentials.local

Expected variables:

  • MATTERMOST_ATVM_WEBHOOK
  • MATTERMOST_ATVM_CHANNEL
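
As an illustration, the two variables could be read from the credentials file as sketched below. This assumes the file uses plain `KEY=VALUE` lines (optionally quoted or prefixed with `export`), which the design does not state; `parse_env_text` and `load_mattermost_defaults` are hypothetical helper names.

```python
from pathlib import Path

REQUIRED_KEYS = ("MATTERMOST_ATVM_WEBHOOK", "MATTERMOST_ATVM_CHANNEL")


def parse_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines (assumed format); skip blanks and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        # Drop one layer of surrounding quotes, if present.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_mattermost_defaults(path: str) -> dict:
    """Return the two expected settings, or raise if either is missing/empty."""
    values = parse_env_text(Path(path).read_text(encoding="utf-8"))
    missing = [k for k in REQUIRED_KEYS if not values.get(k)]
    if missing:
        raise KeyError("missing Mattermost settings: " + ", ".join(missing))
    return {k: values[k] for k in REQUIRED_KEYS}
```

Failing loudly on a missing key keeps the watcher from silently running with no destination configured.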

Run Completion Rule

The watcher must send Mattermost results only after the ATVM run has fully completed.

A run is considered fully completed only when:

  • there are no active runner processes for the run
  • the expected machine scope has final result artifacts
  • no machine remains in RUNNING or NOT STARTED
  • final reporter artifacts confirm the run has ended

Evidence sources:

  • live runner processes on 192.168.3.190
  • /root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/logs/
  • /root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/xml/
  • /root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/mochawesome/
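
The four completion conditions can be expressed as a single predicate. The sketch below assumes the evidence has already been collected into a small record; `RunEvidence`, its field names, and the `PASSED` / `FAILED` machine states used in the example are hypothetical, and how each field is derived from the logs, XML, and mochawesome directories is left open.

```python
from dataclasses import dataclass

# Machine states that block completion, per the rule above.
NON_FINAL_MACHINE_STATES = {"RUNNING", "NOT STARTED"}


@dataclass
class RunEvidence:
    active_runner_pids: list          # live runner processes for this run
    machine_states: dict              # host -> last observed machine state
    hosts_with_final_artifacts: set   # hosts that have final result artifacts
    reporter_confirms_end: bool       # final reporter artifacts say the run ended


def is_fully_completed(ev: RunEvidence) -> bool:
    """True only when every completion condition holds at once."""
    expected_hosts = set(ev.machine_states)
    return (
        not ev.active_runner_pids
        and bool(expected_hosts)
        and expected_hosts <= ev.hosts_with_final_artifacts
        and not any(s in NON_FINAL_MACHINE_STATES for s in ev.machine_states.values())
        and ev.reporter_confirms_end
    )
```

An empty machine scope is treated as "not completed" here, so a run with no observed hosts is never reported as finished.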

Required Run States

The watcher must distinguish these run-level states:

  • COMPLETED
  • FAILED
  • CANCELLED
  • TERMINATED
  • HUNG
  • UNKNOWN
  • RUNNING

Definitions:

  • COMPLETED
    • the run finished normally
    • all machines have final results
    • no cancellation, termination, or hang was detected for the run
  • FAILED
    • the run finished, but one or more hosts failed
    • this is still a completed run
  • CANCELLED
    • the run was intentionally cancelled through an explicit cancellation path
  • TERMINATED
    • the run was manually killed or stopped before normal completion
  • HUNG
    • the run appears stuck and does not meet completion rules within the expected policy window
  • UNKNOWN
    • the watcher cannot safely determine the true state
  • RUNNING
    • the run is still active and not yet complete

Mattermost Posting Rule

Post to Mattermost only when the run has fully completed.

Send Mattermost status for:

  • COMPLETED
  • FAILED

Do not send Mattermost status for:

  • CANCELLED
  • TERMINATED
  • HUNG
  • UNKNOWN
  • RUNNING

Important clarification:

  • a completed run with failed hosts should still be posted
  • a cancelled, terminated, hung, or unknown run should not be posted
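
The posting rule reduces to a small lookup over the seven run states. A minimal sketch, where `RunState`, `POSTABLE_STATES`, and `should_post` are illustrative names rather than part of the design:

```python
from enum import Enum


class RunState(Enum):
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"
    TERMINATED = "TERMINATED"
    HUNG = "HUNG"
    UNKNOWN = "UNKNOWN"
    RUNNING = "RUNNING"


# Only fully completed runs are posted; FAILED still counts as completed.
POSTABLE_STATES = frozenset({RunState.COMPLETED, RunState.FAILED})


def should_post(state: RunState, already_posted: bool) -> bool:
    """Post only for COMPLETED or FAILED, and only if not posted before."""
    return state in POSTABLE_STATES and not already_posted
```

Listing the postable states explicitly means any state added later is suppressed by default.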

Required Cancellation / Termination Handling

If a run is cancelled or terminated, the watcher must:

  • detect that the run was cancelled or manually killed
  • stop waiting for normal completion
  • mark the run as closed without posting final Mattermost status
  • prevent any later success/failure post for that same run

State Tracking Requirements

The watcher must track each monitored run by run id or build name.

For each run, keep durable state such as:

  • tracked run id / build name
  • controller-side watcher state
  • completion marker
  • cancellation / termination marker
  • Mattermost posted marker
  • last observed machine summary
  • timestamps for first seen, last seen, closed
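
One possible shape for the per-run record is a JSON file written atomically, so a crash mid-write cannot leave a truncated state file. This is a sketch under assumptions: the field names, the JSON format, and the `<run id>.json` file name are illustrative, and the run id is assumed to be safe to use as a file name.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class RunRecord:
    run_id: str                              # tracked run id / build name
    watcher_state: str = "RUNNING"           # controller-side watcher state
    completed: bool = False                  # completion marker
    cancelled_or_terminated: bool = False    # cancellation / termination marker
    posted: bool = False                     # Mattermost posted marker
    last_machine_summary: dict = field(default_factory=dict)
    first_seen: Optional[str] = None         # ISO-8601 timestamps
    last_seen: Optional[str] = None
    closed_at: Optional[str] = None


def save_record(state_dir: Path, rec: RunRecord) -> Path:
    """Write to a temp file in the same directory, then rename over the target."""
    state_dir.mkdir(parents=True, exist_ok=True)
    target = state_dir / f"{rec.run_id}.json"
    fd, tmp_name = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(asdict(rec), fh, indent=2, sort_keys=True)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp_name, target)  # atomic within one filesystem
    return target


def load_record(state_dir: Path, run_id: str) -> Optional[RunRecord]:
    """Return the stored record, or None if this run has never been tracked."""
    path = state_dir / f"{run_id}.json"
    if not path.exists():
        return None
    return RunRecord(**json.loads(path.read_text(encoding="utf-8")))
```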

Duplicate-Post Prevention

The watcher must prevent duplicate Mattermost posts.

Required behavior:

  • only one final post per run
  • if a run is already marked as posted, do not send again
  • if a run is marked CANCELLED, TERMINATED, HUNG, or UNKNOWN, do not later convert it into a posted completion unless explicitly reset by an operator workflow

Use a durable controller-local state directory, for example:

  • /var/lib/atvm-run-watcher/

Possible contents:

  • one state file per run id
  • one posted marker per run id
  • one cancellation marker per run id
  • optional lock file to prevent multiple watcher instances from racing
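
Per-run marker files can enforce "only one final post per run" if they are created with an exclusive-create flag, since only the first creator succeeds. The sketch below assumes marker files named `<run id>.posted` and `<run id>.cancelled`; those names and the helper functions are hypothetical.

```python
import os
from pathlib import Path


def claim_marker(state_dir: Path, run_id: str, kind: str) -> bool:
    """Atomically create '<run_id>.<kind>'; True only for the first caller."""
    state_dir.mkdir(parents=True, exist_ok=True)
    try:
        fd = os.open(
            state_dir / f"{run_id}.{kind}",
            os.O_CREAT | os.O_EXCL | os.O_WRONLY,
            0o644,
        )
    except FileExistsError:
        return False
    os.close(fd)
    return True


def has_marker(state_dir: Path, run_id: str, kind: str) -> bool:
    return (state_dir / f"{run_id}.{kind}").exists()


def may_post(state_dir: Path, run_id: str) -> bool:
    """Allow a post only if the run was never cancelled and never posted."""
    if has_marker(state_dir, run_id, "cancelled"):
        return False
    return claim_marker(state_dir, run_id, "posted")
```

Claiming the posted marker before sending gives at-most-once behavior: if the send then fails, the run stays unposted until an operator resets it. Claiming after a successful send would instead risk a duplicate on a crash between the two steps; which trade-off is preferred is a design decision this sketch does not settle.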

Normal completion workflow:

  1. ATVM run starts.
  2. Watcher tracks the run id / build name.
  3. Watcher polls run state and artifacts.
  4. Run fully completes.
  5. Watcher builds final status summary.
  6. Watcher posts final status to Mattermost once.
  7. Watcher marks the run as posted and closed.
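
Step 6 could be implemented with a Mattermost incoming webhook, which accepts a JSON body containing a `text` field and, optionally, a `channel` override. A minimal sketch using only the standard library; `build_payload` and `post_status` are illustrative names, and whether the channel override is honored depends on how the webhook is configured on the server.

```python
import json
import urllib.request
from typing import Optional


def build_payload(text: str, channel: Optional[str] = None) -> bytes:
    """Encode the webhook body; 'channel' is sent only when provided."""
    body = {"text": text}
    if channel:
        body["channel"] = channel
    return json.dumps(body).encode("utf-8")


def post_status(webhook_url: str, text: str,
                channel: Optional[str] = None, timeout: float = 10.0) -> int:
    """POST the status once and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=build_payload(text, channel),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError / URLError on failure, so the
    # caller can leave the run unposted instead of marking it as sent.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```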

Cancellation / termination workflow:

  1. Operator stops the ATVM run.
  2. Watcher detects cancellation / termination, or an explicit cancellation marker is written.
  3. Watcher marks the run CANCELLED or TERMINATED.
  4. Watcher exits cleanly without posting to Mattermost.
  5. Watcher prevents later duplicate or misleading final-post behavior.
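
Both workflows can be driven by one polling step that takes the currently observed run state and decides what to do. The sketch below keeps the record as a plain dict and takes the posting function as a parameter so it can be exercised without a network; `step` and the returned action strings are hypothetical.

```python
from typing import Callable

POSTABLE = {"COMPLETED", "FAILED"}


def step(record: dict, observed_state: str, post: Callable[[dict], None]) -> str:
    """Advance one tracked run by one poll and return the action taken."""
    if record.get("closed"):
        # A closed run is never reopened, so a cancelled or terminated run
        # cannot later turn into a success/failure post.
        return "skip-closed"
    record["state"] = observed_state
    if observed_state == "RUNNING":
        return "keep-waiting"
    if observed_state in POSTABLE:
        post(record)  # if this raises, the run stays open and unposted
        record["posted"] = True
        record["closed"] = True
        return "posted"
    # CANCELLED, TERMINATED, HUNG, UNKNOWN, or anything unrecognized.
    record["closed"] = True
    return "closed-without-post"
```

For example, a run observed as `TERMINATED` on one poll and `COMPLETED` on the next yields `closed-without-post` followed by `skip-closed`, and `post` is never called.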

Failure Semantics

Host-level failures do not suppress Mattermost posting.

If:

  • the run has fully completed
  • and one or more hosts failed

Then:

  • final Mattermost status should still be sent
  • final run-level state should be FAILED, that is, completed with failures

Hang / Unknown Semantics

If the run cannot be safely classified as COMPLETED, FAILED, CANCELLED, or TERMINATED:

  • classify it as HUNG if it appears stuck past the expected policy window, otherwise as UNKNOWN
  • do not post to Mattermost
  • require operator review
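
One way to separate RUNNING, HUNG, and UNKNOWN for a run that has not met the completion rules is to compare the time since the last observed progress against the policy window. This is only a sketch: the design does not define the window length or what counts as progress, so `policy_window`, `last_progress`, and `evidence_readable` are assumptions.

```python
from datetime import datetime, timedelta, timezone


def classify_incomplete(last_progress: datetime, now: datetime,
                        evidence_readable: bool, policy_window: timedelta) -> str:
    """Classify a run that is neither complete, cancelled, nor terminated."""
    if not evidence_readable:
        # Processes or artifacts could not be inspected: do not guess.
        return "UNKNOWN"
    if now - last_progress > policy_window:
        return "HUNG"
    return "RUNNING"


# Example call with a hypothetical two-hour policy window.
_now = datetime(2026, 3, 24, 18, 0, tzinfo=timezone.utc)
example_state = classify_incomplete(
    _now - timedelta(hours=3), _now, True, timedelta(hours=2)
)
```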

Logging Requirements

The watcher should log:

  • the run id / build name being monitored
  • each state transition
  • posting decisions
  • reasons for suppressing a Mattermost post
  • duplicate-post prevention decisions
  • final closed state
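
A minimal logging sketch covering state transitions and posting decisions is shown below. It writes to stderr, which systemd captures into the journal for a service by default; the logger name, message format, and helper names are illustrative.

```python
import logging

log = logging.getLogger("atvm_run_watcher")


def configure_logging() -> None:
    """Send watcher logs to stderr with a timestamp and level."""
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    log.addHandler(handler)
    log.setLevel(logging.INFO)


def log_transition(run_id: str, old_state: str, new_state: str) -> None:
    log.info("run=%s transition %s -> %s", run_id, old_state, new_state)


def log_post_decision(run_id: str, state: str, will_post: bool, reason: str) -> None:
    """Record every posting decision, including suppressed and duplicate posts."""
    decision = "post" if will_post else "suppress"
    log.info("run=%s state=%s decision=%s reason=%s", run_id, state, decision, reason)
```

A call such as `log_post_decision("build-1", "TERMINATED", False, "run was terminated")` records why no Mattermost message was sent for that run.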

Summary

This watcher design must satisfy all of the following:

  • run on the ATVM Cypress controller
  • survive local operator machine downtime
  • use systemd
  • distinguish run states clearly
  • send Mattermost only after full completion
  • send completion results whether hosts passed or failed
  • never send Mattermost for cancelled, terminated, hung, or unknown runs
  • prevent duplicate or misleading posts