cds-ai/atvm/docs/automation/mattermost-watcher-design.md
anthony.wen f5eb21cccd Infer categorized watcher group names from actual host execution
- update the watcher to stop trusting misleading categorized child build labels when they do not match the host/spec actually being executed
- infer the reported categorized group name from the actual host being run, so mismatched labels like ubuntu-batch for a Red Hat host are corrected in status reporting
- document the categorized watcher workaround in the ATVM guide, watcher design, and watcher README without changing the underlying ATVM runner scripts
2026-03-26 14:20:22 -04:00


ATVM Mattermost Watcher Design

Purpose

Design a controller-local watcher on the ATVM Cypress machine (192.168.3.190) that monitors an ATVM automation run and posts final run status to Mattermost only after the watched scope has fully completed.

This watcher must continue working even if the local operator machine is offline.

Implementation Approach

Use a systemd-managed watcher on the ATVM Cypress controller.

Recommended structure:

  • one watcher script that evaluates a specific ATVM run request
  • one systemd service to execute the watcher
  • no always-on daemon
  • for categorized ATVM runs, one watcher instance tracks the parent request and posts each categorized sub-run separately as those grouped runs complete

Preferred deployment target:

  • controller host: 192.168.3.190
  • ATVM automation root: /root/cdc-e2e-cyp-12.17.4

Mattermost Destination

Use the local credential file in this workspace as the source of defaults:

  • /home/aw/code/cds/.env.credentials.local

Expected variables:

  • MATTERMOST_ATVM_WEBHOOK
  • MATTERMOST_ATVM_CHANNEL
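
As an illustration of reading these defaults, here is a minimal sketch. It assumes the credential file uses dotenv-style KEY=VALUE lines, which this design does not state; the function names load_env_file and mattermost_settings are hypothetical.

```python
from pathlib import Path

REQUIRED_KEYS = ("MATTERMOST_ATVM_WEBHOOK", "MATTERMOST_ATVM_CHANNEL")


def load_env_file(path):
    """Parse KEY=VALUE lines (assumed format), skipping blanks and # comments."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def mattermost_settings(path):
    """Return (webhook, channel); fail loudly if either variable is missing."""
    env = load_env_file(path)
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("missing Mattermost variables: " + ", ".join(missing))
    return env["MATTERMOST_ATVM_WEBHOOK"], env["MATTERMOST_ATVM_CHANNEL"]
```

Failing at startup when a variable is missing keeps the watcher from running to completion and only then discovering that it cannot post.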

Run Completion Rule

The watcher must send Mattermost results only after the watched scope has fully completed.

A non-categorized run is considered fully completed only when:

  • there are no active runner processes for the run
  • the expected machine scope has final result artifacts
  • no machine remains in RUNNING or NOT STARTED
  • final reporter artifacts confirm the run has ended
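
The four conditions above can be sketched as a single predicate. This is an illustration only: RunSnapshot and its fields are hypothetical names, and how each field is collected from processes and reporter artifacts is left open.

```python
from dataclasses import dataclass

# Machine states that block completion, as named in the rule above.
NON_FINAL_MACHINE_STATES = {"RUNNING", "NOT STARTED"}


@dataclass
class RunSnapshot:
    """One poll of the evidence for a run (hypothetical structure)."""
    active_runner_pids: list
    machine_states: dict            # host -> last observed status string
    hosts_with_final_artifacts: set
    reporter_confirms_end: bool


def is_fully_completed(snap, expected_hosts):
    """True only when every condition of the completion rule holds."""
    if snap.active_runner_pids:
        return False
    if not set(expected_hosts) <= snap.hosts_with_final_artifacts:
        return False
    for host in expected_hosts:
        # A host with no recorded state is treated as not started.
        if snap.machine_states.get(host, "NOT STARTED") in NON_FINAL_MACHINE_STATES:
            return False
    return snap.reporter_confirms_end
```

All four checks must pass together; no single signal, such as the absence of runner processes, is treated as sufficient.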

A categorized run must be treated differently:

  • --categorize splits the request into sequential ATVM sub-runs
  • each categorized group is its own run/job
  • the watcher must detect each grouped sub-run in order
  • the watcher must wait for that grouped sub-run to complete
  • then send that grouped sub-run's final Mattermost status
  • then continue watching for the next grouped sub-run
  • the watcher must remain alive while the parent categorized request or any related child Cypress process is still active
  • one completed grouped sub-run must not be treated as proof that the parent categorized request is finished
  • if the child build id label does not match the actual host/spec being executed, the watcher must infer the real group from host execution and use that inferred group for reporting
  • the watcher must not wait until the very end to send one single parent-only post
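
The label-mismatch rule can be sketched as follows. The host-to-group mapping, the host name, and the redhat-batch group name are hypothetical; this design does not specify how the watcher learns which group a host belongs to.

```python
def infer_group(label_group, executing_host, host_to_group):
    """Return (group_to_report, was_corrected).

    label_group: group name taken from the child build id label.
    executing_host: host observed actually running the specs.
    host_to_group: assumed lookup from host to its real categorized group.
    """
    actual = host_to_group.get(executing_host)
    if actual is None:
        # No evidence to contradict the label, so report it unchanged.
        return label_group, False
    return actual, actual != label_group


# Hypothetical mapping: a Red Hat host mislabelled as ubuntu-batch.
mapping = {"rhel9-01": "redhat-batch"}
group, corrected = infer_group("ubuntu-batch", "rhel9-01", mapping)
```

Returning the corrected flag lets the watcher log every case where the label and the executing host disagreed.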

Evidence sources:

  • live runner processes on 192.168.3.190
  • /root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/logs/
  • /root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/xml/
  • /root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/mochawesome/

Required Run States

The watcher must distinguish these run-level states:

  • COMPLETED
  • FAILED
  • CANCELLED
  • TERMINATED
  • HUNG
  • UNKNOWN
  • RUNNING

Definitions:

  • COMPLETED
    • the run finished normally
    • all machines have final results
    • no run-level failure state blocks completion
  • FAILED
    • the run finished, but one or more hosts failed
    • this is still a completed run
  • CANCELLED
    • the run was intentionally cancelled through an explicit cancellation path
  • TERMINATED
    • the run was manually killed or stopped before normal completion
  • HUNG
    • the run appears stuck and does not satisfy the completion rule within the configured policy window
  • UNKNOWN
    • the watcher cannot safely determine the true state
  • RUNNING
    • the run is still active and not yet complete
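
One way to encode these states is an enumeration plus a helper that picks the finished state. This is a sketch, and finished_state is a hypothetical name.

```python
from enum import Enum


class RunState(Enum):
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"
    TERMINATED = "TERMINATED"
    HUNG = "HUNG"
    UNKNOWN = "UNKNOWN"
    RUNNING = "RUNNING"


# Both of these mean the run finished; FAILED only records host failures.
FINISHED_STATES = frozenset({RunState.COMPLETED, RunState.FAILED})


def finished_state(failed_hosts):
    """Pick the run-level state for a run that has fully completed."""
    return RunState.FAILED if failed_hosts else RunState.COMPLETED
```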

Mattermost Posting Rule

Post to Mattermost only when the watched scope has fully completed.

Send Mattermost status for:

  • COMPLETED
  • FAILED

Do not send Mattermost status for:

  • CANCELLED
  • TERMINATED
  • HUNG
  • UNKNOWN
  • RUNNING

Important clarification:

  • a completed run with failed hosts should still be posted
  • a cancelled, terminated, hung, or unknown run should not be posted
  • for categorized execution, this rule applies per categorized sub-run
  • one categorized group completion should produce one Mattermost post
  • do not send one parent-level aggregate post in place of the per-group posts
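
The posting rule can be illustrated with a small payload builder and sender. Mattermost incoming webhooks accept a JSON body with a text field and an optional channel field; the message wording and the function names here are hypothetical.

```python
import json
import urllib.request

# Only these run-level states produce a post.
POSTABLE_STATES = frozenset({"COMPLETED", "FAILED"})


def build_payload(scope_name, state, failed_hosts, channel):
    """Return the webhook payload for one watched scope, or None to suppress."""
    if state not in POSTABLE_STATES:
        return None
    text = f"ATVM run {scope_name}: {state}"
    if failed_hosts:
        text += " (failed hosts: " + ", ".join(sorted(failed_hosts)) + ")"
    return {"channel": channel, "text": text}


def send(webhook_url, payload, timeout=10):
    """POST the payload to the incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For categorized execution, build_payload would be called once per completed sub-run with that sub-run's group name as scope_name.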

Required Cancellation / Termination Handling

If a run is cancelled or terminated, the watcher must:

  • detect that the run was cancelled or manually killed
  • stop waiting for normal completion
  • mark the run as closed without posting final Mattermost status
  • prevent any later success/failure post for that same run

State Tracking Requirements

The watcher must track each monitored run by run id or build name.

For each run, keep durable state such as:

  • tracked run id / build name
  • controller-side watcher state
  • completion marker
  • cancellation / termination marker
  • Mattermost posted marker
  • last observed machine summary
  • timestamps for first seen, last seen, closed

For categorized runs, keep durable state for:

  • the parent request build name
  • each detected categorized sub-run
  • whether each categorized sub-run has already been posted
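
A minimal sketch of a durable state record follows. The field names and the JSON layout are assumptions for illustration; the design only lists what must be tracked.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RunRecord:
    build_name: str
    state: str = "RUNNING"
    completed: bool = False
    cancelled: bool = False
    posted: bool = False
    machine_summary: dict = field(default_factory=dict)
    subruns_posted: dict = field(default_factory=dict)  # group name -> posted?
    first_seen: float = 0.0
    last_seen: float = 0.0
    closed_at: Optional[float] = None


def save_record(record, path):
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(asdict(record), handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)


def load_record(path):
    with open(path) as handle:
        return RunRecord(**json.load(handle))
```

Writing through a temporary file and os.replace means a watcher restart reads either the previous record or the new one, never a half-written file.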

Duplicate-Post Prevention

The watcher must prevent duplicate Mattermost posts.

Required behavior:

  • for non-categorized execution, only one final post per run
  • for categorized execution, only one final post per categorized sub-run
  • if a watched scope is already marked as posted, do not send again
  • if a run or categorized sub-run is marked CANCELLED, TERMINATED, HUNG, or UNKNOWN, do not later convert it into a posted completion unless explicitly reset by an operator workflow
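
One way to enforce a single post per watched scope is to claim a marker file atomically before posting. This is a sketch: claim_post is a hypothetical helper, and the posted.marker file name follows the marker names mentioned later in this document.

```python
import os
from pathlib import Path


def claim_post(scope_dir):
    """Return True only for the first caller that claims this scope."""
    scope_dir = Path(scope_dir)
    scope_dir.mkdir(parents=True, exist_ok=True)
    try:
        # O_EXCL makes creation fail if the marker already exists.
        fd = os.open(scope_dir / "posted.marker",
                     os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```

Claiming before sending favors never posting twice over always posting once: if the send fails after the claim, the scope stays marked and needs operator review. The opposite ordering would risk duplicates instead.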

Use a durable controller-local state directory, for example:

  • /var/lib/atvm-run-watcher/

Possible contents:

  • one parent state file per requested build name
  • one posted marker per non-categorized run
  • one subdirectory per categorized sub-run with its own state and posted marker
  • one cancellation marker per parent run id
  • optional lock file to prevent multiple watcher instances from racing

When the same requested parent build name is reused for a new run:

  • the watcher start workflow must clear old watcher state for that requested build name before starting
  • stale cancelled.marker, posted.marker, state.json, and subruns/ contents must not be allowed to affect the new run
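
The reset step can be sketched as below, assuming one directory per requested build name under the state root. That layout and the function name reset_watcher_state are hypothetical.

```python
import shutil
from pathlib import Path

# Names taken from the stale-state rule above.
STALE_NAMES = ("cancelled.marker", "posted.marker", "state.json", "subruns")


def reset_watcher_state(state_root, build_name):
    """Remove stale state for build_name and return the names that were removed."""
    run_dir = Path(state_root) / build_name
    removed = []
    for name in STALE_NAMES:
        target = run_dir / name
        if target.is_dir():
            shutil.rmtree(target)
        elif target.exists():
            target.unlink()
        else:
            continue
        removed.append(name)
    # Leave an empty directory ready for the new run.
    run_dir.mkdir(parents=True, exist_ok=True)
    return removed
```

The returned list can be logged so an operator can see that a reused build name started from a clean slate.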

Normal completion workflow:

  1. ATVM run starts.
  2. Watcher tracks the requested build name.
  3. Watcher polls run state and artifacts.
  4. For non-categorized execution:
    • wait for the run to fully complete
    • build one final status summary
    • post one final Mattermost status
  5. For categorized execution:
    • detect each grouped sub-run in order
    • wait for that grouped sub-run to fully complete
    • build that grouped sub-run's final status summary
    • post that grouped sub-run's final Mattermost status
    • continue to the next grouped sub-run
  6. Watcher marks the completed watched scope as posted and closed.
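
The categorized branch of this workflow can be sketched as a polling loop. The four callables are hypothetical injection points standing in for the real detection, completion, liveness, and posting logic; the 30-second poll interval is also an assumption.

```python
import time


def watch_categorized(next_subrun, is_complete, parent_active, post,
                      poll_seconds=30, sleep=time.sleep):
    """Post each grouped sub-run as it completes; return the posted groups in order.

    next_subrun(posted): the sub-run currently detected, or None.
    is_complete(subrun): True once that sub-run meets the completion rule.
    parent_active(): True while the parent request or a child process is alive.
    post(subrun): send that sub-run's final status.
    """
    posted = []
    current = None
    while True:
        if current is None:
            current = next_subrun(posted)
        if current is not None and is_complete(current):
            post(current)
            posted.append(current)
            current = None
            continue  # look for the next sub-run immediately
        if not parent_active():
            # Any unfinished sub-run is left to termination / hang handling.
            return posted
        sleep(poll_seconds)
```

A completed sub-run never ends the loop by itself; only the parent request becoming inactive does, which matches the rule that one finished group is not proof that the parent is finished.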

Cancellation / termination workflow:

  1. Operator stops the ATVM run.
  2. Watcher detects cancellation / termination, or an explicit cancellation marker is written.
  3. Watcher marks the run CANCELLED or TERMINATED.
  4. Watcher exits cleanly without posting to Mattermost.
  5. Watcher prevents later duplicate or misleading final-post behavior.
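
Detection in step 2 might look like the sketch below. It assumes an explicit cancelled.marker file signals cancellation and that a runner which disappeared before the completion rule was met was terminated; both are illustrative assumptions, not confirmed behavior.

```python
from pathlib import Path


def detect_stop(run_dir, runner_alive, run_complete):
    """Return "CANCELLED", "TERMINATED", or None if neither applies."""
    if (Path(run_dir) / "cancelled.marker").exists():
        # Explicit cancellation path: the marker wins over everything else.
        return "CANCELLED"
    if not runner_alive and not run_complete:
        # Assumed heuristic: the runner is gone but the run never completed.
        return "TERMINATED"
    return None
```

When either state is returned, the watcher would record it durably and exit without posting.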

Failure Semantics

Host-level failures do not suppress Mattermost posting.

If:

  • the run has fully completed
  • and one or more hosts failed

Then:

  • final Mattermost status should still be sent
  • final run-level state should be FAILED, which still counts as a completed run

Hang / Unknown Semantics

If the run cannot be safely classified as completed, failed, cancelled, or terminated:

  • classify it as HUNG or UNKNOWN
  • do not post to Mattermost
  • require operator review
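
A sketch of this classification for a run that has not met the completion rule follows. The policy window value and the notion of evidence consistency are hypothetical parameters that the design leaves open.

```python
def classify_unfinished(seconds_since_progress, policy_window_seconds,
                        evidence_consistent):
    """Classify a run that has not (yet) met the completion rule.

    evidence_consistent: False when processes and artifacts disagree or
    cannot be read, so no safe conclusion is possible.
    """
    if not evidence_consistent:
        return "UNKNOWN"
    if seconds_since_progress > policy_window_seconds:
        return "HUNG"
    return "RUNNING"


# Example with an assumed two-hour policy window.
state = classify_unfinished(3 * 3600, 2 * 3600, evidence_consistent=True)
```

Neither HUNG nor UNKNOWN leads to a post; both would be logged and left for operator review.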

Logging Requirements

The watcher should log:

  • the run id / build name being monitored
  • each state transition
  • posting decisions
  • reasons for suppressing a Mattermost post
  • duplicate-post prevention decisions
  • final closed state
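
These logging requirements could be met with the standard logging module, as in this sketch. The logger name and message formats are hypothetical. Under systemd, output written to stderr is captured by the journal by default.

```python
import logging

log = logging.getLogger("atvm-run-watcher")


def configure_logging():
    """Send INFO and above to stderr."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )


def log_transition(build_name, old_state, new_state, reason):
    log.info("run=%s state %s -> %s (%s)", build_name, old_state, new_state, reason)


def log_post_decision(build_name, state, will_post, reason):
    decision = "sent" if will_post else "suppressed"
    log.info("run=%s post=%s state=%s reason=%s", build_name, decision, state, reason)
```

Logging the reason alongside every suppressed post makes it possible to explain later why a run never appeared in the channel.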

Summary

This watcher design must satisfy all of the following:

  • run on the ATVM Cypress controller
  • survive local operator machine downtime
  • use systemd
  • distinguish run states clearly
  • post to Mattermost only after full completion of the watched scope
  • post completion results whether hosts passed or failed
  • never post to Mattermost for cancelled, terminated, hung, or unknown runs
  • prevent duplicate or misleading posts
  • treat --categorize as sequential ATVM sub-runs, not as one parent run with internal phases
  • send one Mattermost post per completed categorized sub-run