# ATVM Watcher Service Install Plan

This document describes how to deploy the ATVM per-run watcher service to the ATVM Cypress controller at `192.168.3.190`.

This is a deployment plan only. It does not perform the installation.

## Goal

Install the local watcher package so the controller can:

- watch one requested ATVM run per watcher instance
- for non-categorized runs, send one final Mattermost status only for `COMPLETED` or `FAILED`
- for categorized runs, send one final Mattermost status per completed categorized sub-run/group
- suppress Mattermost posts for `CANCELLED`, `TERMINATED`, `HUNG`, and `UNKNOWN`
- stop automatically after the watched run reaches a terminal state

## Controller Target Layout

Recommended controller paths:

- package root:
  - `/opt/atvm-watcher-service`
- service unit:
  - `/etc/systemd/system/atvm-run-watcher@.service`
- global environment file:
  - `/etc/atvm-run-watcher.env`
- state root:
  - `/var/lib/atvm-run-watcher`
- ATVM automation root:
  - `/root/cdc-e2e-cyp-12.17.4`

Best-practice rule:

- install the watcher service package under `/opt/atvm-watcher-service`
- do not use `/root/atvm-watcher-service` as the standard install location
- if a temporary `/root/atvm-watcher-service` install exists, replace it with a clean `/opt/atvm-watcher-service` install

## Files To Install

From the local workspace:

- `/home/aw/code/cds/atvm/watcher-service/atvm_run_watcher.py`
- `/home/aw/code/cds/atvm/watcher-service/atvm-run-watcher@.service`
- `/home/aw/code/cds/atvm/watcher-service/start-atvm-run-watcher.sh`
- `/home/aw/code/cds/atvm/watcher-service/cancel-atvm-run-watcher.sh`
- `/home/aw/code/cds/atvm/inventory/vm-inventory.md`

Optional reference docs:

- `/home/aw/code/cds/atvm/watcher-service/README.md`
- `/home/aw/code/cds/atvm/watcher-service/INSTALL.md`

## Required Controller Environment

The controller must have:

- `python3`
- `systemd`
- outbound network access to the Mattermost webhook
- read access to:
  - `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`
  - `/tmp/.log`

## Required Secrets

The controller needs a watcher environment file with:

- `MATTERMOST_ATVM_WEBHOOK`
- `MATTERMOST_ATVM_CHANNEL`

Recommended file:

- `/etc/atvm-run-watcher.env`

Recommended permissions:

- owner: `root`
- mode: `0600`

## Deployment Steps

1. Create controller directories.
   - `/opt/atvm-watcher-service`
   - `/var/lib/atvm-run-watcher`
2. Copy package files to the controller.
   - copy the Python watcher
   - copy the `systemd` unit file
   - copy the helper scripts
   - copy `vm-inventory.md`
3. Set executable permissions.
   - `atvm_run_watcher.py`
   - `start-atvm-run-watcher.sh`
   - `cancel-atvm-run-watcher.sh`
4. Create `/etc/atvm-run-watcher.env`.
   - add Mattermost webhook/channel
   - keep permissions restricted
5. Install the `systemd` unit file.
   - copy to `/etc/systemd/system/atvm-run-watcher@.service`
6. Reload `systemd`.
   - `systemctl daemon-reload`
7. Run a syntax/smoke validation.
   - check Python import/launch
   - check helper script usage
   - verify the unit resolves
8. Do a non-production test.
   - start a watcher for a fake or completed build name
   - confirm state directory creation
   - confirm the watcher exits as expected
9. Do a real ATVM run test.
   - launch a real run
   - start the watcher for that build name
   - if the run uses `--categorize`, also pass `--categorize` to the watcher start helper
   - confirm final Mattermost delivery for a completed run
   - confirm categorized execution sends one post per completed grouped sub-run
   - confirm the watcher stays alive between categorized grouped runs while the parent request is still active
   - confirm reused parent build names do not inherit stale `cancelled.marker`, `posted.marker`, or `subruns/` state from older runs

## Recommended Validation Commands

Examples for later execution on the controller:

```bash
mkdir -p /opt/atvm-watcher-service /var/lib/atvm-run-watcher
```

```bash
chmod 755 /opt/atvm-watcher-service/atvm_run_watcher.py
chmod 755 /opt/atvm-watcher-service/start-atvm-run-watcher.sh
chmod 755 /opt/atvm-watcher-service/cancel-atvm-run-watcher.sh
```

```bash
systemctl daemon-reload
systemctl cat atvm-run-watcher@.service
```

```bash
python3 /opt/atvm-watcher-service/atvm_run_watcher.py --help
```

```bash
/opt/atvm-watcher-service/start-atvm-run-watcher.sh --help
```

## Per-Run Usage After Install

Once installed, the intended workflow is:

1. Launch the ATVM run as usual.
2. Start the watcher for that build name.
   - the start helper must clear any stale watcher state for that same requested build name before starting the new watcher instance
3. Let the watcher run on the controller.
4. The watcher exits on terminal state.
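
The stale-state rule in step 2 can be sketched as a small shell function. This is only an illustration of the idea, not the actual contents of `start-atvm-run-watcher.sh`. It assumes that each build keeps its state in a directory named after the build under the state root; that layout, the function name `clear_stale_state`, and the `STATE_ROOT` variable are all assumptions made for this sketch. It removes only the three items this plan names as stale (`cancelled.marker`, `posted.marker`, and `subruns/`).

```bash
# Sketch only: clear stale per-build watcher state before a new watcher starts.
# ASSUMPTION: state for a build lives in "$STATE_ROOT/<build name>/".
# The real helper may use a different layout.
STATE_ROOT="${STATE_ROOT:-/var/lib/atvm-run-watcher}"

clear_stale_state() {
    css_build_name="$1"

    # Reject empty names and anything that could escape STATE_ROOT,
    # so the rm calls below can only touch one build directory.
    case "$css_build_name" in
        ""|*/*|.|..)
            echo "invalid build name: '$css_build_name'" >&2
            return 1
            ;;
    esac

    css_build_dir="$STATE_ROOT/$css_build_name"

    # Remove only the items this plan lists as stale state.
    rm -f -- "$css_build_dir/cancelled.marker" "$css_build_dir/posted.marker"
    rm -rf -- "$css_build_dir/subruns"
}
```

Removing the markers by name, rather than deleting the whole build directory, keeps the cleanup limited to what this plan calls out. Whether the real helper should also reset other files is a design decision this sketch does not make.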

Example:

```bash
/opt/atvm-watcher-service/start-atvm-run-watcher.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
  --template cmc-e2e \
  --config-family gold \
  --migration-style "ATVM end-to-end migration validation" \
  --integration-plugin "pure with fc" \
  --categorize \
  --scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"
```

Cancel example:

```bash
/opt/atvm-watcher-service/cancel-atvm-run-watcher.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc
```

The cancel helper should:

- write `cancelled.marker`
- update `state.json` so the final watcher state is `CANCELLED`
- stop the watcher instance
- avoid any Mattermost post for that run

## Operational Notes

- This is not a daemon.
- One watcher instance is started per ATVM run.
- Categorized execution is treated as one watcher instance tracking sequential grouped ATVM sub-runs.
- In categorized execution, the watcher must remain alive until the parent request has actually gone inactive past the grace window, even if one grouped sub-run already completed.
- The watcher exits after the run reaches a terminal state.
- The watcher writes state under `/var/lib/atvm-run-watcher/`.
- The watcher prevents duplicate Mattermost posts by writing posted markers.
- Categorized sub-run state is written under `/var/lib/atvm-run-watcher//subruns//`.

## Failure Handling

Expected terminal behavior:

- `COMPLETED`
  - post to Mattermost
  - verify `ok`
  - exit
- `FAILED`
  - post to Mattermost
  - verify `ok`
  - exit
- categorized `COMPLETED` / `FAILED`
  - post once for that grouped sub-run
  - verify `ok`
  - continue until the parent request finishes
- `CANCELLED`
  - write final `CANCELLED` state to `state.json`
  - do not post
  - exit
- `TERMINATED`
  - do not post
  - exit
- `HUNG`
  - do not post
  - exit
- `UNKNOWN`
  - do not post
  - exit

## Answer To "Do We Need An Installer README?"

Not strictly, but yes, it is useful.

Why:

- it gives a repeatable controller deployment procedure
- it separates local package design from controller installation steps
- it makes later install/reinstall safer
- it gives you a review checkpoint before anything is installed on `192.168.3.190`

That is the purpose of this file.
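
As one possible review checkpoint, a pre-flight check could confirm that the watcher environment file defines both variables listed under Required Secrets before any watcher is started. The sketch below is not part of the package. `check_watcher_env` is a hypothetical helper name, and the sketch assumes the file uses plain `KEY=value` lines, which is the format a `systemd` `EnvironmentFile=` directive reads. Whether the unit file actually loads the file that way is not confirmed here.

```bash
# Sketch only: pre-flight check for the watcher environment file.
# ASSUMPTION: the file uses plain KEY=value lines; check_watcher_env is a
# hypothetical helper, not part of the watcher package.
check_watcher_env() {
    cwe_file="${1:-/etc/atvm-run-watcher.env}"

    if [ ! -r "$cwe_file" ]; then
        echo "cannot read $cwe_file" >&2
        return 1
    fi

    cwe_status=0
    for cwe_key in MATTERMOST_ATVM_WEBHOOK MATTERMOST_ATVM_CHANNEL; do
        # Require at least one character after "KEY=" at the start of a line.
        if ! grep -Eq "^${cwe_key}=.+" "$cwe_file"; then
            echo "missing or empty: $cwe_key" >&2
            cwe_status=1
        fi
    done

    return "$cwe_status"
}
```

Called with no argument, the function checks `/etc/atvm-run-watcher.env`; passing a path lets it be run against a draft file before that file is copied to the controller. It reports every missing key rather than stopping at the first one.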