ATVM Watcher Service Install Plan

This document describes how to deploy the ATVM per-run watcher and runner services to the ATVM Cypress controller at 192.168.3.190.

This is a deployment plan only. It does not perform the installation.

Goal

Install the local watcher/runner package so the controller can:

  • start one requested ATVM Cypress runner per service instance
  • watch one requested ATVM run per watcher instance
  • for non-categorized runs, send one final Mattermost status only for COMPLETED or FAILED
  • for categorized runs, send one final Mattermost status per completed categorized sub-run/group
  • suppress Mattermost posts for CANCELLED, TERMINATED, HUNG, and UNKNOWN
  • stop automatically after the watched run reaches a terminal state

Controller Target Layout

Recommended controller paths:

  • package root:
    • /opt/atvm-watcher-service
  • runner service unit:
    • /etc/systemd/system/atvm-runner@.service
  • watcher service unit:
    • /etc/systemd/system/atvm-run-watcher@.service
  • global environment file:
    • /etc/atvm-run-watcher.env
  • state root:
    • /var/lib/atvm-run-watcher
  • ATVM automation root:
    • /root/cdc-e2e-cyp-12.17.4

Best-practice rule:

  • install the watcher service package under /opt/atvm-watcher-service
  • do not use /root/atvm-watcher-service as the standard install location
  • if a temporary /root/atvm-watcher-service install exists, replace it with a clean /opt/atvm-watcher-service install

Files To Install

From the local workspace:

  • /home/aw/code/cds/atvm/watcher-service/atvm_run_watcher.py
  • /home/aw/code/cds/atvm/watcher-service/run-atvm-runner.sh
  • /home/aw/code/cds/atvm/watcher-service/atvm-runner@.service
  • /home/aw/code/cds/atvm/watcher-service/start-atvm-runner.sh
  • /home/aw/code/cds/atvm/watcher-service/cancel-atvm-runner.sh
  • /home/aw/code/cds/atvm/watcher-service/start-atvm-run.sh
  • /home/aw/code/cds/atvm/watcher-service/atvm-run-watcher@.service
  • /home/aw/code/cds/atvm/watcher-service/start-atvm-run-watcher.sh
  • /home/aw/code/cds/atvm/watcher-service/cancel-atvm-run-watcher.sh
  • /home/aw/code/cds/atvm/inventory/vm-inventory.md

Optional reference docs:

  • /home/aw/code/cds/atvm/watcher-service/README.md
  • /home/aw/code/cds/atvm/watcher-service/INSTALL.md

Required Controller Environment

The controller must have:

  • python3
  • systemd
  • outbound network access to the Mattermost webhook
  • read access to:
    • /root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter
    • /tmp/<build-name>.log
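
The requirements above can be checked before anything is installed. The following is a minimal sketch, not part of the package: the function name check_controller is hypothetical, and the build log path is optional because it depends on the build name. Outbound access to the Mattermost webhook is not tested here.

```shell
# Hypothetical preflight check; not shipped with the watcher package.
# Usage: check_controller <reporter-dir> [<build-log>]
check_controller() {
  reporter_dir="$1"
  build_log="${2:-}"
  rc=0

  # Required commands named in this plan.
  for cmd in python3 systemctl; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing command: $cmd" >&2
      rc=1
    fi
  done

  # Read access to the Cypress reporter directory.
  if [ ! -r "$reporter_dir" ]; then
    echo "not readable: $reporter_dir" >&2
    rc=1
  fi

  # The build log is per-run, so check it only when a path is given.
  if [ -n "$build_log" ] && [ ! -r "$build_log" ]; then
    echo "not readable: $build_log" >&2
    rc=1
  fi

  return "$rc"
}
```

On the controller this could be called as check_controller /root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter; a non-zero exit means at least one requirement is missing.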

Required Secrets

The controller needs a watcher environment file with:

  • MATTERMOST_ATVM_WEBHOOK
  • MATTERMOST_ATVM_CHANNEL

Recommended file:

  • /etc/atvm-run-watcher.env

Recommended permissions:

  • owner: root
  • mode: 0600
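
One way to create the file with these permissions is sketched below. The helper name write_watcher_env is hypothetical, and the KEY=value layout is an assumption (it is the format systemd reads from an EnvironmentFile, but this plan does not show how the units load the file). The webhook and channel values must be supplied by the operator.

```shell
# Hypothetical helper: write the watcher environment file with mode 0600.
# Usage: write_watcher_env <env-file> <webhook-url> <channel>
write_watcher_env() {
  env_file="$1"
  webhook="$2"
  channel="$3"

  # Create the file under a restrictive umask so a new file is never
  # briefly readable by other users.
  (
    umask 077
    printf 'MATTERMOST_ATVM_WEBHOOK=%s\nMATTERMOST_ATVM_CHANNEL=%s\n' \
      "$webhook" "$channel" > "$env_file"
  )

  # Enforce the mode even if the file already existed with looser permissions.
  chmod 0600 "$env_file"
}
```

Run as root against /etc/atvm-run-watcher.env, a newly created file is owned by root. An existing file keeps its current owner, so check ownership separately in that case.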

Deployment Steps

  1. Create controller directories.

    • /opt/atvm-watcher-service
    • /var/lib/atvm-run-watcher
  2. Copy package files to the controller.

    • copy the runner wrapper
    • copy the runner systemd unit file
    • copy the runner helper scripts
    • copy the Python watcher
    • copy the watcher systemd unit file
    • copy the watcher helper scripts
    • copy vm-inventory.md
  3. Set executable permissions.

    • run-atvm-runner.sh
    • start-atvm-runner.sh
    • cancel-atvm-runner.sh
    • start-atvm-run.sh
    • atvm_run_watcher.py
    • start-atvm-run-watcher.sh
    • cancel-atvm-run-watcher.sh
  4. Create /etc/atvm-run-watcher.env.

    • add Mattermost webhook/channel
    • keep permissions restricted
  5. Install the systemd unit files.

    • copy the runner unit to /etc/systemd/system/atvm-runner@.service
    • copy the watcher unit to /etc/systemd/system/atvm-run-watcher@.service
  6. Reload systemd.

    • systemctl daemon-reload
  7. Run a syntax/smoke validation.

    • check Python import/launch
    • check helper script usage
    • verify the unit resolves
  8. Do a non-production test.

    • start a watcher for a fake or completed build name
    • confirm state directory creation
    • confirm the watcher exits as expected
  9. Do a real ATVM run test.

    • launch a real run
    • start the watcher for that build name
    • if the run uses --categorize, also pass --categorize to the watcher start helper
    • confirm final Mattermost delivery for a completed run
    • confirm categorized execution sends one post per completed grouped sub-run
    • confirm the watcher stays alive between categorized grouped runs while the parent request is still active
    • confirm reused parent build names do not inherit stale cancelled.marker, posted.marker, or subruns/ state from older runs

Examples for later execution on the controller:

mkdir -p /opt/atvm-watcher-service /var/lib/atvm-run-watcher
chmod 755 /opt/atvm-watcher-service/run-atvm-runner.sh
chmod 755 /opt/atvm-watcher-service/start-atvm-runner.sh
chmod 755 /opt/atvm-watcher-service/cancel-atvm-runner.sh
chmod 755 /opt/atvm-watcher-service/start-atvm-run.sh
chmod 755 /opt/atvm-watcher-service/atvm_run_watcher.py
chmod 755 /opt/atvm-watcher-service/start-atvm-run-watcher.sh
chmod 755 /opt/atvm-watcher-service/cancel-atvm-run-watcher.sh
systemctl daemon-reload
systemctl cat atvm-runner@.service
systemctl cat atvm-run-watcher@.service
python3 /opt/atvm-watcher-service/atvm_run_watcher.py --help
/opt/atvm-watcher-service/start-atvm-run.sh --help
/opt/atvm-watcher-service/start-atvm-runner.sh --help
/opt/atvm-watcher-service/start-atvm-run-watcher.sh --help
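
The commands above assume the package files are already in place. Steps 2, 3, and 5 could be combined with install, as in the sketch below. It assumes the files were first transferred to a staging directory on the controller (this plan does not specify the transfer method) and that vm-inventory.md belongs in the package root. The function name is hypothetical.

```shell
# Hypothetical helper: install staged package files with the modes from step 3.
# Usage: install_watcher_package <staging-dir> <package-root> <unit-dir>
install_watcher_package() {
  src="$1"
  pkg="$2"
  units="$3"

  mkdir -p "$pkg" "$units" || return 1

  # Executable files listed under "Set executable permissions".
  for f in run-atvm-runner.sh start-atvm-runner.sh cancel-atvm-runner.sh \
           start-atvm-run.sh atvm_run_watcher.py \
           start-atvm-run-watcher.sh cancel-atvm-run-watcher.sh; do
    install -m 755 "$src/$f" "$pkg/$f" || return 1
  done

  # Assumption: the inventory file lives next to the scripts.
  install -m 644 "$src/vm-inventory.md" "$pkg/vm-inventory.md" || return 1

  # Unit files do not need the execute bit.
  for u in atvm-runner@.service atvm-run-watcher@.service; do
    install -m 644 "$src/$u" "$units/$u" || return 1
  done
}
```

With a staging directory such as /tmp/atvm-staging (a placeholder), the call would be install_watcher_package /tmp/atvm-staging /opt/atvm-watcher-service /etc/systemd/system, followed by systemctl daemon-reload.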

Per-Run Usage After Install

Once installed, the intended workflow is:

  1. Start the watcher for the requested build name.
    • the start helper must clear any stale watcher state for that same requested build name before starting the new watcher instance
  2. Start the runner service for that build name.
  3. Let the runner and watcher run on the controller.
  4. The watcher exits on terminal state.
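
The stale-state cleanup required in step 1 might look like the sketch below. It removes only the items this plan names as stale state (cancelled.marker, posted.marker, and subruns/). Whether the real start helper also resets state.json is not stated here, so the sketch leaves that file alone. The function name is hypothetical.

```shell
# Hypothetical sketch of the stale-state cleanup a start helper performs.
# Usage: clear_stale_watch_state <state-root> <build-name>
clear_stale_watch_state() {
  state_root="$1"
  build_name="$2"

  # Refuse empty or path-like build names so the rm calls below cannot
  # reach outside the state root.
  case "$build_name" in
    ''|.|..|*/*)
      echo "invalid build name: $build_name" >&2
      return 1
      ;;
  esac

  state_dir="$state_root/$build_name"
  rm -f "$state_dir/cancelled.marker" "$state_dir/posted.marker"
  rm -rf "$state_dir/subruns"

  # Leave an existing (possibly empty) state directory for the new instance.
  mkdir -p "$state_dir"
}
```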

Example:

/opt/atvm-watcher-service/start-atvm-run-watcher.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
  --template cmc-e2e \
  --template-command "python3 ./cmc-templates.py --template_name cmc-e2e --config_file cypress.atvm-config-gold.ts" \
  --runner-command "python3 ./run-sorry-cypress.py --config_file cypress.atvm-config-gold.ts --build_name e2e-redhat9.6-ubuntu24.04-w2k25-fc --categorize" \
  --config-family gold \
  --migration-style "ATVM end-to-end migration validation" \
  --integration-plugin "pure with fc" \
  --categorize \
  --scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"

/opt/atvm-watcher-service/start-atvm-runner.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
  --runner-command "python3 ./run-sorry-cypress.py --config_file cypress.atvm-config-gold.ts --build_name e2e-redhat9.6-ubuntu24.04-w2k25-fc --categorize"

Preferred combined start:

/opt/atvm-watcher-service/start-atvm-run.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
  --template cmc-e2e \
  --template-command "python3 ./cmc-templates.py --template_name cmc-e2e --config_file cypress.atvm-config-gold.ts" \
  --runner-command "python3 ./run-sorry-cypress.py --config_file cypress.atvm-config-gold.ts --build_name e2e-redhat9.6-ubuntu24.04-w2k25-fc --categorize" \
  --config-family gold \
  --config-file cypress.atvm-config-gold.ts \
  --migration-style "ATVM end-to-end migration validation" \
  --integration-plugin "pure with fc" \
  --categorize

Cancel example:

/opt/atvm-watcher-service/cancel-atvm-runner.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc
/opt/atvm-watcher-service/cancel-atvm-run-watcher.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc

The watcher cancel helper should:

  • write cancelled.marker
  • update state.json so the final watcher state is CANCELLED
  • stop the watcher instance
  • avoid any Mattermost post for that run
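
The sequence above can be sketched as follows. This is not the shipped cancel-atvm-run-watcher.sh. It assumes that state.json holds a "state" key and that the systemd instance name equals the build name; neither is confirmed by this plan. The SYSTEMCTL override is a hypothetical convenience for dry runs.

```shell
# Hypothetical sketch of the watcher cancel sequence; not the shipped helper.
# Usage: cancel_watch <state-root> <build-name>
# SYSTEMCTL can be overridden (for example with "echo") for a dry run.
cancel_watch() {
  state_dir="$1/$2"
  mkdir -p "$state_dir" || return 1

  # 1. Leave a marker so the stop is recorded as a cancel.
  : > "$state_dir/cancelled.marker"

  # 2. Record the final state. Assumption: a "state" key. This overwrites
  #    the file; a real helper would likely preserve its other fields.
  printf '{"state": "CANCELLED"}\n' > "$state_dir/state.json"

  # 3. Stop the instance. Assumption: the instance name is the build name.
  "${SYSTEMCTL:-systemctl}" stop "atvm-run-watcher@$2.service"
}
```

The sketch never calls the webhook, which satisfies the no-post rule for cancelled runs.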

Operational Notes

  • This is not a long-running daemon; each instance exists only for the duration of one run.
  • One runner instance is started per ATVM run.
  • One watcher instance is started per ATVM run.
  • Prefer the atvm-runner@... service over detached SSH background launch patterns for run-sorry-cypress.py.
  • Prefer start-atvm-run.sh over launching watcher and runner separately when both are needed, because it enforces the safe watcher-first order.
  • Categorized execution is treated as one watcher instance tracking sequential grouped ATVM sub-runs.
  • In categorized execution, the watcher must remain alive until the parent request has actually gone inactive past the grace window, even if one grouped sub-run already completed.
  • The watcher exits after the run reaches a terminal state.
  • The watcher writes state under /var/lib/atvm-run-watcher/<build-name>.
  • The watcher prevents duplicate Mattermost posts by writing posted markers.
  • Categorized sub-run state is written under /var/lib/atvm-run-watcher/<build-name>/subruns/<subrun-key>/.
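
A posted-marker guard of the kind described above could be sketched as below. The function name claim_post is hypothetical, and claiming the marker before sending is only one possible arrangement: it rules out duplicates but would also suppress a retry if the send itself fails. The actual watcher may order these steps differently.

```shell
# Hypothetical sketch of a posted-marker guard.
# Usage: claim_post <state-dir>
# Returns 0 only for the first caller; later callers get a non-zero status.
claim_post() {
  state_dir="$1"
  mkdir -p "$state_dir" || return 1

  # With noclobber (set -C) the redirection fails if posted.marker already
  # exists, so only the first claim succeeds.
  ( set -C; : > "$state_dir/posted.marker" ) 2>/dev/null
}
```

A caller would send the Mattermost post only when claim_post succeeds for the run's state directory. For categorized runs the same guard would presumably be applied to each subruns/<subrun-key>/ directory.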

Failure Handling

Expected terminal behavior:

  • COMPLETED
    • post to Mattermost
    • verify the webhook responded ok
    • exit
  • FAILED
    • post to Mattermost
    • verify the webhook responded ok
    • exit
  • categorized COMPLETED / FAILED
    • post once for that grouped sub-run
    • verify the webhook responded ok
    • continue until the parent request finishes
  • CANCELLED
    • write final CANCELLED state to state.json
    • do not post
    • exit
  • TERMINATED
    • do not post
    • exit
  • HUNG
    • do not post
    • exit
  • UNKNOWN
    • do not post
    • exit
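
The post/suppress split above reduces to a small lookup. The sketch below is illustrative only: the function name terminal_action is hypothetical, and rejecting any other value as non-terminal is an assumption about how the watcher treats states not listed here.

```shell
# Hypothetical mapping from terminal state to action, mirroring the list above.
# Prints "post" or "suppress"; returns 1 for a state that is not terminal here.
terminal_action() {
  case "$1" in
    COMPLETED|FAILED)
      echo post
      ;;
    CANCELLED|TERMINATED|HUNG|UNKNOWN)
      echo suppress
      ;;
    *)
      echo "not a terminal state: $1" >&2
      return 1
      ;;
  esac
}
```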

Answer To "Do We Need An Installer README?"

Not strictly, but it is useful.

Why:

  • it gives a repeatable controller deployment procedure
  • it separates local package design from controller installation steps
  • it makes later install/reinstall safer
  • it gives you a review checkpoint before anything is installed on 192.168.3.190

That is the purpose of this file.