cds-ai/atvm/watcher-service/INSTALL.md
anthony.wen c9706e9702 Record cancelled watcher state on ATVM run cancellation
- update the watcher cancel helper so it writes a final CANCELLED state into state.json before stopping the service
- record cancellation timestamps and a cancellation note in the watcher state file for clearer post-run inspection
- update the watcher service docs so the documented cancel behavior matches the state-file handling
2026-03-25 18:24:17 -04:00

ATVM Watcher Service Install Plan

This document describes how to deploy the ATVM per-run watcher service to the ATVM Cypress controller at 192.168.3.190.

This is a deployment plan only. It does not perform the installation.

Goal

Install the local watcher package so the controller can:

  • watch one ATVM run per watcher instance
  • send final Mattermost status only for COMPLETED or FAILED
  • suppress Mattermost posts for CANCELLED, TERMINATED, HUNG, and UNKNOWN
  • stop automatically after the watched run reaches a terminal state

Controller Target Layout

Recommended controller paths:

  • package root:
    • /opt/atvm-watcher-service
  • service unit:
    • /etc/systemd/system/atvm-run-watcher@.service
  • global environment file:
    • /etc/atvm-run-watcher.env
  • state root:
    • /var/lib/atvm-run-watcher
  • ATVM automation root:
    • /root/cdc-e2e-cyp-12.17.4

Best-practice rule:

  • install the watcher service package under /opt/atvm-watcher-service
  • do not use /root/atvm-watcher-service as the standard install location
  • if a temporary /root/atvm-watcher-service install exists, replace it with a clean /opt/atvm-watcher-service install
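
The install-location rule above could be checked before deploying with something like the following. This is a minimal sketch: the function name `check_install_location` and the `PKG_ROOT` / `LEGACY_ROOT` variables are illustrative and not part of the watcher package.

```shell
#!/usr/bin/env bash
# Paths from the controller layout above; overridable so the check can be dry-run.
PKG_ROOT="${PKG_ROOT:-/opt/atvm-watcher-service}"
LEGACY_ROOT="${LEGACY_ROOT:-/root/atvm-watcher-service}"

# check_install_location (hypothetical helper):
# returns 1 and warns if a temporary install under /root is still present.
check_install_location() {
  if [ -d "$LEGACY_ROOT" ]; then
    echo "WARN: found $LEGACY_ROOT; replace it with a clean install under $PKG_ROOT" >&2
    return 1
  fi
  echo "OK: no legacy install at $LEGACY_ROOT"
}
```

Calling `check_install_location` only reports. Removing the old directory is left as a manual, reviewed step.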

Files To Install

From the local workspace:

  • /home/aw/code/cds/atvm/watcher-service/atvm_run_watcher.py
  • /home/aw/code/cds/atvm/watcher-service/atvm-run-watcher@.service
  • /home/aw/code/cds/atvm/watcher-service/start-atvm-run-watcher.sh
  • /home/aw/code/cds/atvm/watcher-service/cancel-atvm-run-watcher.sh
  • /home/aw/code/cds/atvm/inventory/vm-inventory.md

Optional reference docs:

  • /home/aw/code/cds/atvm/watcher-service/README.md
  • /home/aw/code/cds/atvm/watcher-service/INSTALL.md

Required Controller Environment

The controller must have:

  • python3
  • systemd
  • outbound network access to the Mattermost webhook
  • read access to:
    • /root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter
    • /tmp/<build-name>.log
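
A preflight check for these requirements might look like the sketch below. The function name `preflight` and the `REQUIRED_CMDS`, `ATVM_ROOT`, and `LOG_DIR` variables are assumptions made for illustration. Webhook reachability is not tested here, because doing so safely depends on the Mattermost setup.

```shell
#!/usr/bin/env bash
# Defaults mirror the requirements listed above; all are overridable.
ATVM_ROOT="${ATVM_ROOT:-/root/cdc-e2e-cyp-12.17.4}"
REQUIRED_CMDS="${REQUIRED_CMDS:-python3 systemctl}"
LOG_DIR="${LOG_DIR:-/tmp}"

# preflight [BUILD_NAME] (hypothetical helper): returns 0 only if every check passes.
preflight() {
  local build="${1:-}" rc=0 cmd
  for cmd in $REQUIRED_CMDS; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "ok: $cmd"
    else
      echo "missing command: $cmd" >&2
      rc=1
    fi
  done
  if [ -r "$ATVM_ROOT/cypress/cmcReporter" ]; then
    echo "ok: $ATVM_ROOT/cypress/cmcReporter"
  else
    echo "not readable: $ATVM_ROOT/cypress/cmcReporter" >&2
    rc=1
  fi
  # The per-run log is checked only when a build name is supplied.
  if [ -n "$build" ] && [ ! -r "$LOG_DIR/$build.log" ]; then
    echo "not readable: $LOG_DIR/$build.log" >&2
    rc=1
  fi
  return "$rc"
}
```

Running `preflight` with no argument checks the static requirements. Passing a build name also checks that the matching log file is readable.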

Required Secrets

The controller needs a watcher environment file with:

  • MATTERMOST_ATVM_WEBHOOK
  • MATTERMOST_ATVM_CHANNEL

Recommended file:

  • /etc/atvm-run-watcher.env

Recommended permissions:

  • owner: root
  • mode: 0600
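
One way to create the environment file with the recommended ownership and mode is sketched below. The `KEY=value` layout assumes the unit loads the file through systemd's `EnvironmentFile=`, which this plan does not confirm. The placeholder values and the function name `write_env_template` are illustrative.

```shell
#!/usr/bin/env bash
# write_env_template [PATH] (hypothetical helper):
# writes a placeholder env file that is never group- or world-readable.
write_env_template() {
  local env_file="${1:-/etc/atvm-run-watcher.env}"
  (
    # Restrictive umask in a subshell so a newly created file starts at 0600.
    umask 077
    printf '%s\n' \
      'MATTERMOST_ATVM_WEBHOOK=REPLACE_WITH_WEBHOOK_URL' \
      'MATTERMOST_ATVM_CHANNEL=REPLACE_WITH_CHANNEL_NAME' > "$env_file"
  )
  # chmod also covers the case where the file already existed with a looser mode.
  chmod 0600 "$env_file"
  if [ "$(id -u)" -eq 0 ]; then
    chown root:root "$env_file"
  fi
}
```

After running `write_env_template`, replace the placeholders with the real webhook URL and channel by hand so the secret never appears in shell history.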

Deployment Steps

  1. Create controller directories.

    • /opt/atvm-watcher-service
    • /var/lib/atvm-run-watcher
  2. Copy package files to the controller.

    • copy the Python watcher
    • copy the systemd unit file
    • copy the helper scripts
    • copy vm-inventory.md
  3. Set executable permissions.

    • atvm_run_watcher.py
    • start-atvm-run-watcher.sh
    • cancel-atvm-run-watcher.sh
  4. Create /etc/atvm-run-watcher.env.

    • add Mattermost webhook/channel
    • keep permissions restricted
  5. Install the systemd unit file.

    • copy to /etc/systemd/system/atvm-run-watcher@.service
  6. Reload systemd.

    • systemctl daemon-reload
  7. Run a syntax/smoke validation.

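    • check that the Python watcher imports and launches (for example, with --help)
    • check that the helper scripts print their usage
    • verify that systemd resolves the unit file (systemctl cat)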
  8. Do a non-production test.

    • start a watcher for a fake or completed build name
    • confirm state directory creation
    • confirm the watcher exits as expected
  9. Do a real ATVM run test.

    • launch a real run
    • start the watcher for that build name
    • confirm final Mattermost delivery for a completed run

Examples for later execution on the controller:

mkdir -p /opt/atvm-watcher-service /var/lib/atvm-run-watcher
chmod 755 /opt/atvm-watcher-service/atvm_run_watcher.py
chmod 755 /opt/atvm-watcher-service/start-atvm-run-watcher.sh
chmod 755 /opt/atvm-watcher-service/cancel-atvm-run-watcher.sh
systemctl daemon-reload
systemctl cat atvm-run-watcher@.service
python3 /opt/atvm-watcher-service/atvm_run_watcher.py --help
/opt/atvm-watcher-service/start-atvm-run-watcher.sh --help
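
The directory, copy, permission, and unit-install steps (1, 2, 3, 5, and 6) could be combined into one function, sketched here. It assumes the five package files have already been copied into a single staging directory on the controller and that vm-inventory.md belongs in the package root. Neither detail is specified by this plan. `install_watcher` and `DESTDIR` are illustrative names.

```shell
#!/usr/bin/env bash
# install_watcher STAGING_DIR (hypothetical helper).
# Set DESTDIR to install into a scratch tree for a dry run; leave it empty on the controller.
install_watcher() {
  local src="$1" dest="${DESTDIR:-}"
  local pkg="$dest/opt/atvm-watcher-service"

  # Step 1: package root and state root (plus the unit directory for dry runs).
  install -d -m 0755 "$pkg" "$dest/var/lib/atvm-run-watcher" "$dest/etc/systemd/system"

  # Steps 2 and 3: copy the watcher and helpers with executable permissions.
  install -m 0755 \
    "$src/atvm_run_watcher.py" \
    "$src/start-atvm-run-watcher.sh" \
    "$src/cancel-atvm-run-watcher.sh" \
    "$pkg/"

  # Inventory file: destination is an assumption (package root).
  install -m 0644 "$src/vm-inventory.md" "$pkg/"

  # Step 5: install the systemd template unit.
  install -m 0644 "$src/atvm-run-watcher@.service" "$dest/etc/systemd/system/"

  # Step 6: reload systemd only for a real install.
  if [ -z "$dest" ]; then
    systemctl daemon-reload
  fi
}
```

A dry run such as `DESTDIR=/tmp/atvm-stage; install_watcher ./staging` lets the resulting tree be reviewed before anything under /opt or /etc is touched.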

Per-Run Usage After Install

Once installed, the intended workflow is:

  1. Launch the ATVM run as usual.
  2. Start the watcher for that build name.
  3. Let the watcher run on the controller.
  4. The watcher exits when the run reaches a terminal state.

Example:

/opt/atvm-watcher-service/start-atvm-run-watcher.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
  --template cmc-e2e \
  --config-family gold \
  --migration-style "ATVM end-to-end migration validation" \
  --integration-plugin "pure with fc" \
  --scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"

Cancel example:

/opt/atvm-watcher-service/cancel-atvm-run-watcher.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc

The cancel helper should:

  • write cancelled.marker
  • update state.json so the final watcher state is CANCELLED
  • stop the watcher instance
  • avoid any Mattermost post for that run
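
The cancel sequence above can be sketched as follows. This is not the shipped cancel-atvm-run-watcher.sh. The state.json field names (`state`, `cancelled_at`, `note`), the marker contents, and the use of the build name as the systemd instance name are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
STATE_ROOT="${STATE_ROOT:-/var/lib/atvm-run-watcher}"
SYSTEMCTL="${SYSTEMCTL:-systemctl}"   # overridable for dry runs

# cancel_watcher BUILD_NAME (hypothetical helper)
cancel_watcher() {
  local build="$1"
  local state_dir="$STATE_ROOT/$build"
  mkdir -p "$state_dir"

  # 1. Write the cancel marker first so no Mattermost post is attempted.
  date -u +%Y-%m-%dT%H:%M:%SZ > "$state_dir/cancelled.marker"

  # 2. Record the final CANCELLED state before the service is stopped.
  python3 - "$state_dir/state.json" <<'PY'
import json, sys, time

path = sys.argv[1]
try:
    with open(path) as fh:
        state = json.load(fh)
except (FileNotFoundError, json.JSONDecodeError):
    state = {}  # start fresh if the watcher has not written usable state yet

state.update({
    "state": "CANCELLED",  # assumed field name
    "cancelled_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "note": "run cancelled by operator",
})
with open(path, "w") as fh:
    json.dump(state, fh, indent=2)
PY

  # 3. Stop the per-run instance (instance naming is an assumption).
  "$SYSTEMCTL" stop "atvm-run-watcher@${build}.service"
}
```

Writing state.json before stopping the unit means post-run inspection sees CANCELLED even if the stop itself fails.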

Operational Notes

  • The watcher is not a long-running daemon.
  • One watcher instance is started per ATVM run.
  • The watcher exits after the run reaches a terminal state.
  • The watcher writes state under /var/lib/atvm-run-watcher/<build-name>.
  • The watcher prevents duplicate Mattermost posts by writing a posted marker.
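
A posted marker can guard against duplicate posts along the lines of the sketch below. The marker file name `posted.marker` and the function name `claim_post` are assumptions, and the actual watcher is written in Python and may implement this differently.

```shell
#!/usr/bin/env bash
# claim_post STATE_DIR (hypothetical helper):
# succeeds only for the first caller; later callers find the marker and fail.
claim_post() {
  local marker="$1/posted.marker"
  # With noclobber, '>' refuses to overwrite an existing file, so creation
  # doubles as the "has this run been posted?" test. A missing STATE_DIR also
  # fails, which errs on the side of not posting.
  ( set -o noclobber; : > "$marker" ) 2>/dev/null
}
```

Typical use is `if claim_post "/var/lib/atvm-run-watcher/$build"; then ...post...; fi`, so a restarted watcher does not post a second time.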

Failure Handling

Expected terminal behavior:

  • COMPLETED
    • post to Mattermost
    • verify the post succeeded
    • exit
  • FAILED
    • post to Mattermost
    • verify the post succeeded
    • exit
  • CANCELLED
    • write final CANCELLED state to state.json
    • do not post
    • exit
  • TERMINATED
    • do not post
    • exit
  • HUNG
    • do not post
    • exit
  • UNKNOWN
    • do not post
    • exit
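
The posting policy above reduces to a small lookup, sketched here. The function name `terminal_action` and its output words are illustrative. Treating any other state as still running is an assumption.

```shell
#!/usr/bin/env bash
# terminal_action STATE (hypothetical helper):
# prints "post" or "skip" for terminal states, "running" for anything else.
terminal_action() {
  case "$1" in
    COMPLETED|FAILED)
      echo post ;;      # post to Mattermost, verify, then exit
    CANCELLED|TERMINATED|HUNG|UNKNOWN)
      echo skip ;;      # exit without posting
    *)
      echo running ;;   # not a terminal state; keep watching
  esac
}
```

For example, `terminal_action FAILED` prints `post` and `terminal_action HUNG` prints `skip`.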

Answer To "Do We Need An Installer README?"

Not strictly, but it is useful.

Why:

  • it gives a repeatable controller deployment procedure
  • it separates local package design from controller installation steps
  • it makes later install/reinstall safer
  • it gives you a review checkpoint before anything is installed on 192.168.3.190

That is the purpose of this file.