cds-ai/atvm/watcher-service/INSTALL.md
anthony.wen f5849dde0c Reset reused watcher state before starting a new ATVM run
- update the watcher start helper to stop any stale watcher instance for the same requested parent build name and remove its old state directory before starting fresh
- document that reused parent build names must not inherit stale cancelled, posted, state.json, or subruns state from older runs
- update the watcher install and design docs so the controller workflow explicitly treats stale reused-build-name state as part of startup cleanup
2026-03-26 11:30:28 -04:00

ATVM Watcher Service Install Plan

This document describes how to deploy the ATVM per-run watcher service to the ATVM Cypress controller at 192.168.3.190.

This is a deployment plan only. It does not perform the installation.

Goal

Install the local watcher package so the controller can:

  • watch one requested ATVM run per watcher instance
  • for non-categorized runs, send one final Mattermost status only for COMPLETED or FAILED
  • for categorized runs, send one final Mattermost status per completed categorized sub-run/group
  • suppress Mattermost posts for CANCELLED, TERMINATED, HUNG, and UNKNOWN
  • stop automatically after the watched run reaches a terminal state

Controller Target Layout

Recommended controller paths:

  • package root:
    • /opt/atvm-watcher-service
  • service unit:
    • /etc/systemd/system/atvm-run-watcher@.service
  • global environment file:
    • /etc/atvm-run-watcher.env
  • state root:
    • /var/lib/atvm-run-watcher
  • ATVM automation root:
    • /root/cdc-e2e-cyp-12.17.4

Best-practice rule:

  • install the watcher service package under /opt/atvm-watcher-service
  • do not use /root/atvm-watcher-service as the standard install location
  • if a temporary /root/atvm-watcher-service install exists, replace it with a clean /opt/atvm-watcher-service install

Files To Install

From the local workspace:

  • /home/aw/code/cds/atvm/watcher-service/atvm_run_watcher.py
  • /home/aw/code/cds/atvm/watcher-service/atvm-run-watcher@.service
  • /home/aw/code/cds/atvm/watcher-service/start-atvm-run-watcher.sh
  • /home/aw/code/cds/atvm/watcher-service/cancel-atvm-run-watcher.sh
  • /home/aw/code/cds/atvm/inventory/vm-inventory.md

Optional reference docs:

  • /home/aw/code/cds/atvm/watcher-service/README.md
  • /home/aw/code/cds/atvm/watcher-service/INSTALL.md

Required Controller Environment

The controller must have:

  • python3
  • systemd
  • outbound network access to the Mattermost webhook
  • read access to:
    • /root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter
    • /tmp/<build-name>.log

Required Secrets

The controller needs a watcher environment file with:

  • MATTERMOST_ATVM_WEBHOOK
  • MATTERMOST_ATVM_CHANNEL

Recommended file:

  • /etc/atvm-run-watcher.env

Recommended permissions:

  • owner: root
  • mode: 0600
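
One way to create the environment file with the recommended mode is sketched below. It assumes a plain `KEY=value` layout of the kind systemd's `EnvironmentFile=` accepts, and that the script runs as root so the file ends up owned by root. The function name `write_env_file` is hypothetical, and values containing spaces or quotes would need extra quoting that this sketch does not handle.

```python
import os


def write_env_file(path, webhook_url, channel):
    """Write the watcher environment file so only its owner can read or write it."""
    content = (
        f"MATTERMOST_ATVM_WEBHOOK={webhook_url}\n"
        f"MATTERMOST_ATVM_CHANNEL={channel}\n"
    )
    # When the file is new, it is created with mode 0600 from the start.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    # An existing file keeps its old mode when reopened, so tighten it explicitly.
    os.chmod(path, 0o600)
```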

Deployment Steps

  1. Create controller directories.

    • /opt/atvm-watcher-service
    • /var/lib/atvm-run-watcher
  2. Copy package files to the controller.

    • copy the Python watcher
    • copy the systemd unit file
    • copy the helper scripts
    • copy vm-inventory.md
  3. Set executable permissions.

    • atvm_run_watcher.py
    • start-atvm-run-watcher.sh
    • cancel-atvm-run-watcher.sh
  4. Create /etc/atvm-run-watcher.env.

    • add Mattermost webhook/channel
    • keep permissions restricted
  5. Install the systemd unit file.

    • copy to /etc/systemd/system/atvm-run-watcher@.service
  6. Reload systemd.

    • systemctl daemon-reload
  7. Run a syntax/smoke validation.

    • check Python import/launch
    • check helper script usage
    • verify the unit resolves
  8. Do a non-production test.

    • start a watcher for a fake or completed build name
    • confirm state directory creation
    • confirm the watcher exits as expected
  9. Do a real ATVM run test.

    • launch a real run
    • start the watcher for that build name
    • if the run uses --categorize, also pass --categorize to the watcher start helper
    • confirm final Mattermost delivery for a completed run
    • confirm categorized execution sends one post per completed grouped sub-run
    • confirm reused parent build names do not inherit stale cancelled.marker, posted.marker, state.json, or subruns/ state from older runs

Examples for later execution on the controller:

mkdir -p /opt/atvm-watcher-service /var/lib/atvm-run-watcher
chmod 755 /opt/atvm-watcher-service/atvm_run_watcher.py
chmod 755 /opt/atvm-watcher-service/start-atvm-run-watcher.sh
chmod 755 /opt/atvm-watcher-service/cancel-atvm-run-watcher.sh
systemctl daemon-reload
systemctl cat atvm-run-watcher@.service
python3 /opt/atvm-watcher-service/atvm_run_watcher.py --help
/opt/atvm-watcher-service/start-atvm-run-watcher.sh --help
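
For the syntax check in step 7, the watcher script can also be compiled without being executed. The helper below is a sketch with a hypothetical name, `syntax_ok`. It only proves the file parses as Python, not that its imports resolve or that it runs correctly.

```python
def syntax_ok(script_path):
    """Return True if script_path parses as Python; the script itself is never run."""
    with open(script_path, "rb") as handle:
        source = handle.read()
    try:
        # compile() only parses and builds a code object; nothing is executed.
        compile(source, script_path, "exec")
        return True
    except SyntaxError:
        return False
```

For example, `syntax_ok("/opt/atvm-watcher-service/atvm_run_watcher.py")` could be run on the controller after the files are copied.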

Per-Run Usage After Install

Once installed, the intended workflow is:

  1. Launch the ATVM run as usual.
  2. Start the watcher for that build name.
    • the start helper must stop any stale watcher instance for that same requested build name and remove its old state directory before starting the new watcher instance
  3. Let the watcher run on the controller.
  4. The watcher exits on terminal state.
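
The stale-state cleanup in step 2 could look roughly like the sketch below. It covers only the state directory reset; stopping an already running watcher instance for the same build name is left out. The function name `reset_run_state` and the build-name safety check are assumptions, not the start helper's actual code.

```python
import shutil
from pathlib import Path

STATE_ROOT = Path("/var/lib/atvm-run-watcher")


def reset_run_state(build_name, state_root=STATE_ROOT):
    """Delete any old state directory for build_name and return a fresh, empty one."""
    # Reject names that could escape the state root (for example "../etc").
    if not build_name or build_name in (".", "..") or "/" in build_name:
        raise ValueError(f"unsafe build name: {build_name!r}")
    run_dir = Path(state_root) / build_name
    if run_dir.exists():
        # Removes cancelled.marker, posted.marker, state.json, and subruns/ together.
        shutil.rmtree(run_dir)
    run_dir.mkdir(parents=True)
    return run_dir
```

Removing the whole directory, rather than individual files, means a new kind of state file added later cannot be forgotten by the cleanup.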

Example:

/opt/atvm-watcher-service/start-atvm-run-watcher.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
  --template cmc-e2e \
  --config-family gold \
  --migration-style "ATVM end-to-end migration validation" \
  --integration-plugin "pure with fc" \
  --categorize \
  --scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"

Cancel example:

/opt/atvm-watcher-service/cancel-atvm-run-watcher.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc

The cancel helper should:

  • write cancelled.marker
  • update state.json so the final watcher state is CANCELLED
  • stop the watcher instance
  • avoid any Mattermost post for that run
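
The file-level part of the cancel behavior can be sketched as follows. The layout of state.json is not described in this document, so the `"state"` key is a placeholder, and `mark_cancelled` is a hypothetical name. Stopping the watcher instance itself is not shown.

```python
import json
from pathlib import Path


def mark_cancelled(run_dir):
    """Record a cancellation in run_dir so the final recorded state is CANCELLED."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    # The marker's presence is the signal; its content is not meaningful here.
    (run_dir / "cancelled.marker").touch()

    state_path = run_dir / "state.json"
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    state["state"] = "CANCELLED"  # placeholder key name, see the note above

    # Write to a temporary file and rename it, so a reader never sees partial JSON.
    tmp_path = state_path.with_name(state_path.name + ".tmp")
    tmp_path.write_text(json.dumps(state, indent=2) + "\n")
    tmp_path.replace(state_path)
    return state
```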

Operational Notes

  • The watcher is not a long-running daemon.
  • One watcher instance is started per ATVM run.
  • A categorized run uses one watcher instance that tracks its grouped ATVM sub-runs in sequence.
  • The watcher exits after the run reaches a terminal state.
  • The watcher writes state under /var/lib/atvm-run-watcher/<build-name>.
  • The watcher prevents duplicate Mattermost posts by writing a posted.marker file once a post has been made.
  • Categorized sub-run state is written under /var/lib/atvm-run-watcher/<build-name>/subruns/<subrun-key>/.
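
The state layout and the duplicate-post guard described above can be illustrated with a short sketch. The function names are hypothetical, and placing a posted.marker inside each sub-run directory is an assumption about how the per-sub-run posts are tracked.

```python
from pathlib import Path

STATE_ROOT = Path("/var/lib/atvm-run-watcher")


def state_dir(build_name, subrun_key=None, state_root=STATE_ROOT):
    """Return the state directory for a run, or for one categorized sub-run."""
    run_dir = Path(state_root) / build_name
    return run_dir if subrun_key is None else run_dir / "subruns" / subrun_key


def claim_post(directory):
    """Return True only for the first caller; later calls find posted.marker and return False."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    try:
        # Mode "x" fails if the file already exists, so the existence check
        # and the creation happen in a single step.
        with open(directory / "posted.marker", "x"):
            return True
    except FileExistsError:
        return False
```

This sketch claims the marker before posting. A watcher that prefers to retry after a failed delivery would instead write the marker only once the post has been confirmed.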

Failure Handling

Expected terminal behavior:

  • COMPLETED
    • post to Mattermost
    • verify the post returned ok
    • exit
  • FAILED
    • post to Mattermost
    • verify the post returned ok
    • exit
  • categorized COMPLETED / FAILED
    • post once for that grouped sub-run
    • verify the post returned ok
    • continue until the parent request finishes
  • CANCELLED
    • write final CANCELLED state to state.json
    • do not post
    • exit
  • TERMINATED
    • do not post
    • exit
  • HUNG
    • do not post
    • exit
  • UNKNOWN
    • do not post
    • exit
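
The same policy can be written as a small lookup, which makes it easy to unit-test. This is a sketch, not the watcher's actual code: the function name and the action strings are illustrative labels.

```python
POSTING_STATES = frozenset({"COMPLETED", "FAILED"})
SILENT_STATES = frozenset({"CANCELLED", "TERMINATED", "HUNG", "UNKNOWN"})


def plan_terminal_actions(state, categorized_subrun=False):
    """Return the ordered actions for a terminal state, using illustrative action names."""
    if state in POSTING_STATES:
        # A finished sub-run posts once, then the watcher keeps going until
        # the parent request finishes; a plain run posts and exits.
        last = "continue" if categorized_subrun else "exit"
        return ["post", "verify", last]
    if state in SILENT_STATES:
        actions = ["exit"]
        if state == "CANCELLED":
            # Only a cancellation records its final state before exiting.
            actions.insert(0, "write_cancelled_state")
        return actions
    raise ValueError(f"not a terminal state: {state!r}")
```

Raising on an unrecognized state, instead of silently exiting, keeps a typo in a state name from being mistaken for a quiet terminal state.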

Answer To "Do We Need An Installer README?"

It is not strictly required, but it is useful.

Why:

  • it gives a repeatable controller deployment procedure
  • it separates local package design from controller installation steps
  • it makes later install/reinstall safer
  • it gives you a review checkpoint before anything is installed on 192.168.3.190

That is the purpose of this file.