Add ATVM watcher service and explicit watcher approval flow

- add the per-run ATVM watcher service package under atvm/watcher-service, including the Python watcher, systemd template unit, helper scripts, and deployment docs
- document the watcher-service install and operating model, including one-run-per-instance behavior, Mattermost posting rules, and the best-practice /opt/atvm-watcher-service install path
- clarify ATVM run approval semantics so `approve` means run without watcher and `approve with watcher` means run and start the watcher
- update the ATVM automation guide and AGENTS rules so watcher usage and approval behavior are explicit and consistent
2026-03-25 17:41:50 -04:00
parent fe228ff0e9
commit ba8354b95c
9 changed files with 962 additions and 8 deletions

@@ -0,0 +1,220 @@
# ATVM Watcher Service Install Plan
This document describes how to deploy the ATVM per-run watcher service to the ATVM Cypress controller at `192.168.3.190`.
This is a deployment plan only. It does not perform the installation.
## Goal
Install the local watcher package so the controller can:
- watch one ATVM run per watcher instance
- send final Mattermost status only for `COMPLETED` or `FAILED`
- suppress Mattermost posts for `CANCELLED`, `TERMINATED`, `HUNG`, and `UNKNOWN`
- stop automatically after the watched run reaches a terminal state
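The posting rules above can be expressed as a small lookup. This is a minimal sketch, not the code in `atvm_run_watcher.py`: the names `RunState`, `parse_state`, and `should_post` are hypothetical, and treating an unrecognized status string as `UNKNOWN` is an assumption.

```python
from enum import Enum


class RunState(str, Enum):
    """Terminal states named in this plan."""
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"
    TERMINATED = "TERMINATED"
    HUNG = "HUNG"
    UNKNOWN = "UNKNOWN"


# Only these two terminal states produce a Mattermost post.
POSTING_STATES = frozenset({RunState.COMPLETED, RunState.FAILED})


def parse_state(raw: str) -> RunState:
    """Map a raw status string to a RunState.

    Assumption: anything unrecognized is treated as UNKNOWN.
    """
    try:
        return RunState(raw.strip().upper())
    except ValueError:
        return RunState.UNKNOWN


def should_post(raw: str) -> bool:
    """Return True only for terminal states that should be posted."""
    return parse_state(raw) in POSTING_STATES
```

Keeping the allow-list explicit means that a newly introduced state stays silent until someone deliberately adds it to `POSTING_STATES`.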
## Controller Target Layout
Recommended controller paths:
- package root:
  - `/opt/atvm-watcher-service`
- service unit:
  - `/etc/systemd/system/atvm-run-watcher@.service`
- global environment file:
  - `/etc/atvm-run-watcher.env`
- state root:
  - `/var/lib/atvm-run-watcher`
- ATVM automation root:
  - `/root/cdc-e2e-cyp-12.17.4`
Best-practice rule:
- install the watcher service package under `/opt/atvm-watcher-service`
- do not use `/root/atvm-watcher-service` as the standard install location
- if a temporary `/root/atvm-watcher-service` install exists, replace it with a clean `/opt/atvm-watcher-service` install
## Files To Install
From the local workspace:
- `/home/aw/code/cds/atvm/watcher-service/atvm_run_watcher.py`
- `/home/aw/code/cds/atvm/watcher-service/atvm-run-watcher@.service`
- `/home/aw/code/cds/atvm/watcher-service/start-atvm-run-watcher.sh`
- `/home/aw/code/cds/atvm/watcher-service/cancel-atvm-run-watcher.sh`
- `/home/aw/code/cds/atvm/inventory/vm-inventory.md`
Optional reference docs:
- `/home/aw/code/cds/atvm/watcher-service/README.md`
- `/home/aw/code/cds/atvm/watcher-service/INSTALL.md`
## Required Controller Environment
The controller must have:
- `python3`
- `systemd`
- outbound network access to the Mattermost webhook
- read access to:
  - `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`
  - `/tmp/<build-name>.log`
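The local requirements above could be checked before installation with a short preflight routine. This is a sketch under stated assumptions: `preflight_problems` is a hypothetical helper, `systemctl` on the `PATH` is used as a stand-in for "systemd is present", and outbound access to the Mattermost webhook is not tested here.

```python
import os
import shutil
from pathlib import Path


def preflight_problems(reporter_dir, log_path, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK.

    `which` is injectable so the check can be exercised without the
    real tools installed.
    """
    problems = []
    # Assumption: finding `systemctl` is a good-enough proxy for systemd.
    for tool in ("python3", "systemctl"):
        if which(tool) is None:
            problems.append(f"missing executable: {tool}")
    # Both paths must be readable by the user running the watcher.
    for path in (Path(reporter_dir), Path(log_path)):
        if not os.access(path, os.R_OK):
            problems.append(f"not readable: {path}")
    return problems
```

On the controller this might be called with `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter` and the run's `/tmp/<build-name>.log`. The log file may not exist until a run has started, so a "not readable" result for it before the first run is not necessarily an error.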
## Required Secrets
The controller needs a watcher environment file with:
- `MATTERMOST_ATVM_WEBHOOK`
- `MATTERMOST_ATVM_CHANNEL`
Recommended file:
- `/etc/atvm-run-watcher.env`
Recommended permissions:
- owner: `root`
- mode: `0600`
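One way to create the environment file with restricted permissions from the start is sketched below. It assumes the file holds plain `KEY=value` lines, which is a guess about how the unit and helper scripts read it. `write_env_file` and `env_file_mode` are hypothetical names, and the values in the usage note are placeholders.

```python
import os
import stat

REQUIRED_KEYS = ("MATTERMOST_ATVM_WEBHOOK", "MATTERMOST_ATVM_CHANNEL")


def write_env_file(path, values):
    """Write the required keys as KEY=value lines with mode 0600."""
    missing = [key for key in REQUIRED_KEYS if not values.get(key)]
    if missing:
        raise ValueError("missing values for: " + ", ".join(missing))
    body = "".join(f"{key}={values[key]}\n" for key in REQUIRED_KEYS)
    # Create with 0600 so the secret is never briefly world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # Also tighten a pre-existing file that had looser permissions.
        os.fchmod(fd, 0o600)
    except BaseException:
        os.close(fd)
        raise
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(body)


def env_file_mode(path):
    """Return the permission bits of the file, e.g. 0o600."""
    return stat.S_IMODE(os.stat(path).st_mode)
```

For example, `write_env_file("/etc/atvm-run-watcher.env", {"MATTERMOST_ATVM_WEBHOOK": "<webhook-url>", "MATTERMOST_ATVM_CHANNEL": "<channel>"})` run as `root` would give the recommended owner and mode. The sketch does not change ownership itself.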
## Deployment Steps
1. Create controller directories.
   - `/opt/atvm-watcher-service`
   - `/var/lib/atvm-run-watcher`
2. Copy package files to the controller.
   - copy the Python watcher
   - copy the `systemd` unit file
   - copy the helper scripts
   - copy `vm-inventory.md`
3. Set executable permissions.
   - `atvm_run_watcher.py`
   - `start-atvm-run-watcher.sh`
   - `cancel-atvm-run-watcher.sh`
4. Create `/etc/atvm-run-watcher.env`.
   - add Mattermost webhook/channel
   - keep permissions restricted
5. Install the `systemd` unit file.
   - copy to `/etc/systemd/system/atvm-run-watcher@.service`
6. Reload `systemd`.
   - `systemctl daemon-reload`
7. Run a syntax/smoke validation.
   - check Python import/launch
   - check helper script usage
   - verify the unit resolves
8. Do a non-production test.
   - start a watcher for a fake or completed build name
   - confirm state directory creation
   - confirm the watcher exits as expected
9. Do a real ATVM run test.
   - launch a real run
   - start the watcher for that build name
   - confirm final Mattermost delivery for a completed run
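Steps 2, 3, and 5 amount to a fixed mapping from package files to controller paths and modes, which can be written down as a dry-run plan for review before anything is copied. The sketch below only builds and prints that plan. Placing `vm-inventory.md` in the package root and using mode `0644` for the non-executable files are assumptions, and `install_plan` and `render_plan` are hypothetical names.

```python
from pathlib import PurePosixPath

PACKAGE_ROOT = PurePosixPath("/opt/atvm-watcher-service")
UNIT_DIR = PurePosixPath("/etc/systemd/system")

# (file name, destination directory, permission bits)
INSTALL_ITEMS = (
    ("atvm_run_watcher.py", PACKAGE_ROOT, 0o755),
    ("start-atvm-run-watcher.sh", PACKAGE_ROOT, 0o755),
    ("cancel-atvm-run-watcher.sh", PACKAGE_ROOT, 0o755),
    # Assumption: the inventory file lives next to the watcher.
    ("vm-inventory.md", PACKAGE_ROOT, 0o644),
    # Assumption: 0644 is used for the unit file.
    ("atvm-run-watcher@.service", UNIT_DIR, 0o644),
)


def install_plan():
    """Return (file name, destination path, mode) for each file."""
    return [(name, str(dest / name), mode) for name, dest, mode in INSTALL_ITEMS]


def render_plan(plan):
    """Format the plan as one reviewable line per file."""
    return "\n".join(f"{mode:04o}  {name} -> {target}" for name, target, mode in plan)


print(render_plan(install_plan()))
```

Because the plan is plain data, it can be reviewed or diffed without touching the controller.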
## Recommended Validation Commands
Examples for later execution on the controller:
```bash
mkdir -p /opt/atvm-watcher-service /var/lib/atvm-run-watcher
```
```bash
chmod 755 /opt/atvm-watcher-service/atvm_run_watcher.py
chmod 755 /opt/atvm-watcher-service/start-atvm-run-watcher.sh
chmod 755 /opt/atvm-watcher-service/cancel-atvm-run-watcher.sh
```
```bash
systemctl daemon-reload
systemctl cat atvm-run-watcher@.service
```
```bash
python3 /opt/atvm-watcher-service/atvm_run_watcher.py --help
```
```bash
/opt/atvm-watcher-service/start-atvm-run-watcher.sh --help
```
## Per-Run Usage After Install
Once installed, the intended workflow is:
1. Launch the ATVM run as usual.
2. Start the watcher for that build name.
3. Let the watcher run on the controller.
4. The watcher exits on terminal state.
Example:
```bash
/opt/atvm-watcher-service/start-atvm-run-watcher.sh \
--build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
--template cmc-e2e \
--config-family gold \
--migration-style "ATVM end-to-end migration validation" \
--integration-plugin "pure with fc" \
--scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"
```
Cancel example:
```bash
/opt/atvm-watcher-service/cancel-atvm-run-watcher.sh \
--build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc
```
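If the helper scripts are driven from automation rather than typed by hand, the argument lists can be assembled programmatically so that values containing spaces are never re-parsed by a shell. This sketch uses only the flags shown in the examples above. The function names are hypothetical, and it does not assume anything about what the scripts do internally.

```python
import shlex

PACKAGE_ROOT = "/opt/atvm-watcher-service"


def start_watcher_argv(build_name, template, config_family,
                       migration_style, integration_plugin,
                       scope_description):
    """Build the argument list for start-atvm-run-watcher.sh."""
    return [
        f"{PACKAGE_ROOT}/start-atvm-run-watcher.sh",
        "--build-name", build_name,
        "--template", template,
        "--config-family", config_family,
        "--migration-style", migration_style,
        "--integration-plugin", integration_plugin,
        "--scope-description", scope_description,
    ]


def cancel_watcher_argv(build_name):
    """Build the argument list for cancel-atvm-run-watcher.sh."""
    return [
        f"{PACKAGE_ROOT}/cancel-atvm-run-watcher.sh",
        "--build-name", build_name,
    ]


def as_shell(argv):
    """Render an argument list as a copy-pasteable shell command."""
    return shlex.join(argv)
```

The resulting list can be passed to `subprocess.run(argv, check=True)` without `shell=True`, while `as_shell(argv)` is useful for logging the exact command.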
## Operational Notes
- The watcher is not a persistent background daemon.
- One watcher instance is started per ATVM run.
- The watcher exits after the run reaches a terminal state.
- The watcher writes state under `/var/lib/atvm-run-watcher/<build-name>`.
- The watcher prevents duplicate Mattermost posts by writing a posted marker.
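The posted-marker idea can be illustrated with an atomic file create, where only the first caller for a given build wins. This is a sketch of the technique, not the watcher's actual code: the marker file name `posted` and the helper names are hypothetical.

```python
import os
from pathlib import Path

STATE_ROOT = Path("/var/lib/atvm-run-watcher")
MARKER_NAME = "posted"  # hypothetical marker file name


def state_dir(build_name, state_root=STATE_ROOT):
    """Return the per-run state directory for a build name."""
    if not build_name or "/" in build_name or build_name in (".", ".."):
        raise ValueError(f"unsafe build name: {build_name!r}")
    return Path(state_root) / build_name


def claim_post(build_name, state_root=STATE_ROOT):
    """Atomically create the marker; True only for the first caller."""
    directory = state_dir(build_name, state_root)
    directory.mkdir(parents=True, exist_ok=True)
    try:
        # O_EXCL makes creation fail if the marker already exists.
        fd = os.open(directory / MARKER_NAME,
                     os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```

Whether the marker is written before the post or only after a verified delivery is a design choice. Claiming first guarantees at most one post but never retries a failed one, while marking afterwards allows retries at the cost of a possible duplicate if the watcher dies between the two steps.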
## Failure Handling
Expected terminal behavior:
- `COMPLETED`
  - post to Mattermost
  - verify `ok`
  - exit
- `FAILED`
  - post to Mattermost
  - verify `ok`
  - exit
- `CANCELLED`
  - do not post
  - exit
- `TERMINATED`
  - do not post
  - exit
- `HUNG`
  - do not post
  - exit
- `UNKNOWN`
  - do not post
  - exit
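For the two posting states, "post, then verify `ok`" can be sketched with the standard library alone. The sketch assumes the webhook is a Mattermost incoming webhook that accepts a JSON body with `channel` and `text` and replies with the plain text `ok` on success. The function names are hypothetical, and the transport is injectable so the logic can be checked without network access.

```python
import json
import urllib.request


def build_payload(channel, text):
    """Encode the message as a JSON request body."""
    return json.dumps({"channel": channel, "text": text}).encode("utf-8")


def http_post(url, body, timeout=10.0):
    """POST the body and return the response text."""
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def post_and_verify(url, channel, text, send=http_post):
    """Send the message and report whether the reply was `ok`.

    Assumption: a successful delivery is answered with the body `ok`.
    """
    reply = send(url, build_payload(channel, text))
    return reply.strip() == "ok"
```

A caller would read the URL and channel from `MATTERMOST_ATVM_WEBHOOK` and `MATTERMOST_ATVM_CHANNEL`, and would treat a `False` result or a raised `urllib.error.URLError` as a failed delivery.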
## Answer To "Do We Need An Installer README?"
It is not strictly required, but it is useful.
Why:
- it gives a repeatable controller deployment procedure
- it separates local package design from controller installation steps
- it makes later install/reinstall safer
- it gives you a review checkpoint before anything is installed on `192.168.3.190`
That is the purpose of this file.