cds-ai/atvm/watcher-service/INSTALL.md
anthony.wen 9673d769e2 fix atvm watcher-backed run launch sequence
Execute the template step before starting watcher-backed ATVM runs.

- run --template-command synchronously in start-atvm-run.sh
- write template output to /tmp/<build>.launch.log
- stop before watcher/runner startup if template generation fails
- document the corrected wrapper behavior in watcher-service docs
- record the stale specPattern failure mode in automation run learnings
2026-04-29 12:14:55 -04:00


# ATVM Watcher Service Install Plan
This document describes how to deploy the ATVM per-run watcher and runner services to the ATVM Cypress controller at `192.168.3.190`.
This is a deployment plan only. It does not perform the installation.
## Goal
Install the local watcher/runner package so the controller can:
- start one requested ATVM Cypress runner per service instance
- watch one requested ATVM run per watcher instance
- for non-categorized runs, send one final Mattermost status only for `COMPLETED` or `FAILED`
- for categorized runs, send one final Mattermost status per completed categorized sub-run/group
- suppress Mattermost posts for `CANCELLED`, `TERMINATED`, `HUNG`, and `UNKNOWN`
- stop automatically after the watched run reaches a terminal state
## Controller Target Layout
Recommended controller paths:
- package root:
  - `/opt/atvm-watcher-service`
- runner service unit:
  - `/etc/systemd/system/atvm-runner@.service`
- watcher service unit:
  - `/etc/systemd/system/atvm-run-watcher@.service`
- global environment file:
  - `/etc/atvm-run-watcher.env`
- state root:
  - `/var/lib/atvm-run-watcher`
- ATVM automation root:
  - `/root/cdc-e2e-cyp-12.17.4`
Best-practice rule:
- install the watcher service package under `/opt/atvm-watcher-service`
- do not use `/root/atvm-watcher-service` as the standard install location
- if a temporary `/root/atvm-watcher-service` install exists, replace it with a clean `/opt/atvm-watcher-service` install
## Files To Install
From the local workspace:
- `/home/aw/code/cds/atvm/watcher-service/atvm_run_watcher.py`
- `/home/aw/code/cds/atvm/watcher-service/run-atvm-runner.sh`
- `/home/aw/code/cds/atvm/watcher-service/atvm-runner@.service`
- `/home/aw/code/cds/atvm/watcher-service/start-atvm-runner.sh`
- `/home/aw/code/cds/atvm/watcher-service/cancel-atvm-runner.sh`
- `/home/aw/code/cds/atvm/watcher-service/start-atvm-run.sh`
- `/home/aw/code/cds/atvm/watcher-service/atvm-run-watcher@.service`
- `/home/aw/code/cds/atvm/watcher-service/start-atvm-run-watcher.sh`
- `/home/aw/code/cds/atvm/watcher-service/cancel-atvm-run-watcher.sh`
- `/home/aw/code/cds/atvm/inventory/vm-inventory.md`
Optional reference docs:
- `/home/aw/code/cds/atvm/watcher-service/README.md`
- `/home/aw/code/cds/atvm/watcher-service/INSTALL.md`
## Required Controller Environment
The controller must have:
- `python3`
- `systemd`
- outbound network access to the Mattermost webhook
- read access to:
  - `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`
  - `/tmp/<build-name>.log`
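
The requirements above could be checked before installing with a small preflight. This is a minimal sketch, not part of the watcher-service package: `require_commands` and `require_readable` are hypothetical helper names, and the sketch does not test outbound access to the Mattermost webhook.

```bash
# Preflight sketch (hypothetical helpers, not shipped with the package).

# Succeeds only if every named command is found on PATH.
require_commands() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing command: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Succeeds only if every named path exists and is readable by the caller.
require_readable() {
  local missing=0 path
  for path in "$@"; do
    if [ ! -r "$path" ]; then
      echo "not readable: $path" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

On the controller this might be called as `require_commands python3 systemctl` followed by `require_readable /root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`. The `/tmp/<build-name>.log` file presumably exists only once a run has started, so it is better checked per run than at install time.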
## Required Secrets
The controller needs a watcher environment file with:
- `MATTERMOST_ATVM_WEBHOOK`
- `MATTERMOST_ATVM_CHANNEL`
Recommended file:
- `/etc/atvm-run-watcher.env`
Recommended permissions:
- owner: `root`
- mode: `0600`
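
One way to create that file with the recommended mode is sketched below. It assumes the file holds plain `KEY=value` lines; `write_watcher_env` is a hypothetical helper name, and the webhook and channel values are supplied by the caller rather than hard-coded.

```bash
# Hypothetical helper: write the watcher environment file readable by its owner only.
# Assumes plain KEY=value lines; adjust if the units expect another format.
write_watcher_env() {
  local env_file="$1" webhook="$2" channel="$3"
  (
    umask 077   # a newly created file gets mode 0600
    {
      printf 'MATTERMOST_ATVM_WEBHOOK=%s\n' "$webhook"
      printf 'MATTERMOST_ATVM_CHANNEL=%s\n' "$channel"
    } > "$env_file"
  ) || return 1
  chmod 600 "$env_file"   # also tightens a file that already existed
}
```

Run as `root` with `/etc/atvm-run-watcher.env` as the first argument, the resulting file is owned by `root`, which matches the recommended ownership.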
## Deployment Steps
1. Create controller directories.
   - `/opt/atvm-watcher-service`
   - `/var/lib/atvm-run-watcher`
2. Copy package files to the controller.
   - copy the runner wrapper
   - copy the runner `systemd` unit file
   - copy the runner helper scripts
   - copy the Python watcher
   - copy the watcher `systemd` unit file
   - copy the watcher helper scripts
   - copy `vm-inventory.md`
3. Set executable permissions.
   - `run-atvm-runner.sh`
   - `start-atvm-runner.sh`
   - `cancel-atvm-runner.sh`
   - `start-atvm-run.sh`
   - `atvm_run_watcher.py`
   - `start-atvm-run-watcher.sh`
   - `cancel-atvm-run-watcher.sh`
4. Create `/etc/atvm-run-watcher.env`.
   - add the Mattermost webhook and channel
   - keep permissions restricted (owner `root`, mode `0600`)
5. Install the `systemd` unit files.
   - copy the runner unit to `/etc/systemd/system/atvm-runner@.service`
   - copy the watcher unit to `/etc/systemd/system/atvm-run-watcher@.service`
6. Reload `systemd`.
   - `systemctl daemon-reload`
7. Run a syntax/smoke validation.
   - check Python import/launch
   - check helper script usage
   - verify the unit resolves
8. Do a non-production test.
   - start a watcher for a fake or completed build name
   - confirm state directory creation
   - confirm the watcher exits as expected
9. Do a real ATVM run test.
   - launch a real run
   - start the watcher for that build name
   - if the run uses `--categorize`, also pass `--categorize` to the watcher start helper
   - confirm final Mattermost delivery for a completed run
   - confirm categorized execution sends one post per completed grouped sub-run
   - confirm the watcher stays alive between categorized grouped runs while the parent request is still active
   - confirm reused parent build names do not inherit stale `cancelled.marker`, `posted.marker`, or `subruns/` state from older runs
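
Steps 2, 3, and 5 could be combined into one copy pass once the files are staged on the controller. The sketch below is one possible shape under that assumption: `install_watcher_package` and its three directory arguments are hypothetical. It leaves out `vm-inventory.md` because this plan does not state where that file should live on the controller.

```bash
# Hypothetical helper: copy staged package files into place with the intended modes.
#   $1 = staging directory holding the files, $2 = package root, $3 = systemd unit directory
install_watcher_package() {
  local src="$1" dest="$2" unit_dir="$3" f
  mkdir -p "$dest" "$unit_dir" || return 1
  # Scripts and the Python watcher are installed executable (step 3).
  for f in run-atvm-runner.sh start-atvm-runner.sh cancel-atvm-runner.sh \
           start-atvm-run.sh atvm_run_watcher.py \
           start-atvm-run-watcher.sh cancel-atvm-run-watcher.sh; do
    install -m 755 "$src/$f" "$dest/$f" || return 1
  done
  # Unit templates are installed non-executable (step 5).
  for f in atvm-runner@.service atvm-run-watcher@.service; do
    install -m 644 "$src/$f" "$unit_dir/$f" || return 1
  done
}
```

A call such as `install_watcher_package /tmp/staging /opt/atvm-watcher-service /etc/systemd/system` (the staging path is illustrative) would still need to be followed by `systemctl daemon-reload`. The helper stops at the first file that is missing from the staging directory.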
## Recommended Validation Commands
Examples for later execution on the controller:
```bash
mkdir -p /opt/atvm-watcher-service /var/lib/atvm-run-watcher
```
```bash
chmod 755 /opt/atvm-watcher-service/run-atvm-runner.sh
chmod 755 /opt/atvm-watcher-service/start-atvm-runner.sh
chmod 755 /opt/atvm-watcher-service/cancel-atvm-runner.sh
chmod 755 /opt/atvm-watcher-service/start-atvm-run.sh
chmod 755 /opt/atvm-watcher-service/atvm_run_watcher.py
chmod 755 /opt/atvm-watcher-service/start-atvm-run-watcher.sh
chmod 755 /opt/atvm-watcher-service/cancel-atvm-run-watcher.sh
```
```bash
systemctl daemon-reload
systemctl cat atvm-runner@.service
systemctl cat atvm-run-watcher@.service
```
```bash
python3 /opt/atvm-watcher-service/atvm_run_watcher.py --help
```
```bash
/opt/atvm-watcher-service/start-atvm-run.sh --help
```
```bash
/opt/atvm-watcher-service/start-atvm-runner.sh --help
```
```bash
/opt/atvm-watcher-service/start-atvm-run-watcher.sh --help
```
## Per-Run Usage After Install
Once installed, the intended workflow is:
1. Run the approved `cmc-templates.py` command for that build name.
   - when using `start-atvm-run.sh`, the wrapper should execute `--template-command` synchronously and stop immediately if that step fails
2. Start the watcher for that build name.
   - the start helper must clear any stale watcher state for that same requested build name before starting the new watcher instance
3. Start the runner service for that build name.
4. Let the runner and watcher run on the controller.
5. The watcher exits on terminal state.
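
The stale-state cleanup required in step 2 might look like the sketch below. It assumes the per-run state directory is `/var/lib/atvm-run-watcher/<build-name>` and removes only the items this document names (`cancelled.marker`, `posted.marker`, `subruns/`). `clear_stale_state` is a hypothetical name, and whether the real helper also resets `state.json` is not specified here, so the sketch leaves that file alone.

```bash
# Hypothetical helper: drop markers and sub-run state left behind by an earlier run
# that used the same build name.
clear_stale_state() {
  local state_dir="$1"
  # Refuse an empty argument so the removals below can never resolve under "/".
  [ -n "$state_dir" ] || return 1
  rm -f  "$state_dir/cancelled.marker" "$state_dir/posted.marker"
  rm -rf "$state_dir/subruns"
  mkdir -p "$state_dir"
}
```
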
Example:
```bash
/opt/atvm-watcher-service/start-atvm-run-watcher.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
  --template cmc-e2e \
  --template-command "python3 ./cmc-templates.py --template_name cmc-e2e --config_file cypress.atvm-config-gold.ts" \
  --runner-command "python3 ./run-sorry-cypress.py --config_file cypress.atvm-config-gold.ts --build_name e2e-redhat9.6-ubuntu24.04-w2k25-fc --categorize" \
  --config-family gold \
  --migration-style "ATVM end-to-end migration validation" \
  --integration-plugin "pure with fc" \
  --categorize \
  --scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"

/opt/atvm-watcher-service/start-atvm-runner.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
  --runner-command "python3 ./run-sorry-cypress.py --config_file cypress.atvm-config-gold.ts --build_name e2e-redhat9.6-ubuntu24.04-w2k25-fc --categorize"
```
Preferred combined start:
```bash
/opt/atvm-watcher-service/start-atvm-run.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
  --template cmc-e2e \
  --template-command "python3 ./cmc-templates.py --template_name cmc-e2e --config_file cypress.atvm-config-gold.ts" \
  --runner-command "python3 ./run-sorry-cypress.py --config_file cypress.atvm-config-gold.ts --build_name e2e-redhat9.6-ubuntu24.04-w2k25-fc --categorize" \
  --config-family gold \
  --config-file cypress.atvm-config-gold.ts \
  --migration-style "ATVM end-to-end migration validation" \
  --integration-plugin "pure with fc" \
  --categorize
```
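
The ordering that the combined wrapper is expected to enforce can be illustrated with a short sketch. It is not the real `start-atvm-run.sh`: `launch_run` and its arguments are hypothetical, and the watcher and runner start commands are passed in as strings so that only the sequence is shown. Template output goes to `<build>.launch.log` under `/tmp` by default, following the commit note for this change.

```bash
# Hypothetical sketch of the launch order: template step, then watcher, then runner.
#   $1 = build name, $2 = template command, $3 = watcher start command,
#   $4 = runner start command, $5 = log directory (defaults to /tmp)
launch_run() {
  local build_name="$1" template_command="$2" start_watcher="$3" start_runner="$4"
  local log_dir="${5:-/tmp}"
  local launch_log="$log_dir/$build_name.launch.log"

  # 1. Run the template step in the foreground and capture its output.
  #    If it fails, nothing else is started.
  if ! sh -c "$template_command" >"$launch_log" 2>&1; then
    echo "template step failed, see $launch_log" >&2
    return 1
  fi

  # 2. Start the watcher before the runner (watcher-first order).
  sh -c "$start_watcher" || return 1

  # 3. Start the runner last.
  sh -c "$start_runner"
}
```

Running the template step synchronously means a failed generation is caught before any watcher or runner instance exists for the build name.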
Cancel example:
```bash
/opt/atvm-watcher-service/cancel-atvm-runner.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc
```
```bash
/opt/atvm-watcher-service/cancel-atvm-run-watcher.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc
```
The cancel helper should:
- write `cancelled.marker`
- update `state.json` so the final watcher state is `CANCELLED`
- stop the watcher instance
- avoid any Mattermost post for that run
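
The file-level part of that behavior could be sketched as follows. The `state.json` schema is not documented here, so the single `state` field is an assumption, and the sketch overwrites the file rather than merging into existing fields. `mark_cancelled` and `stop_watcher` are hypothetical names, and `stop_watcher` assumes the build name is used directly as the `systemd` instance name.

```bash
# Hypothetical helper: record a cancellation in the per-run state directory.
mark_cancelled() {
  local state_dir="$1"
  mkdir -p "$state_dir" || return 1
  : > "$state_dir/cancelled.marker"
  # Assumed minimal schema; the real state.json probably carries more fields.
  printf '{"state": "CANCELLED"}\n' > "$state_dir/state.json"
}

# Hypothetical helper: stop the watcher instance for a build name.
# Assumes the instance name after "@" is the build name itself.
stop_watcher() {
  systemctl stop "atvm-run-watcher@$1.service"
}
```

Writing the marker before stopping the instance gives the watcher a chance to see the cancellation instead of reporting some other terminal state.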
## Operational Notes
- The watcher is not a long-running daemon; each instance exists only for the run it watches.
- One runner instance is started per ATVM run.
- One watcher instance is started per ATVM run.
- Prefer the `atvm-runner@...` service over detached SSH background launch patterns for `run-sorry-cypress.py`.
- Prefer `start-atvm-run.sh` over launching watcher and runner separately when both are needed, because it enforces the safe watcher-first order.
- Categorized execution is treated as one watcher instance tracking sequential grouped ATVM sub-runs.
- In categorized execution, the watcher must remain alive until the parent request has actually gone inactive past the grace window, even if one grouped sub-run already completed.
- The watcher exits after the run reaches a terminal state.
- The watcher writes state under `/var/lib/atvm-run-watcher/<build-name>`.
- The watcher prevents duplicate Mattermost posts by writing posted markers.
- Categorized sub-run state is written under `/var/lib/atvm-run-watcher/<build-name>/subruns/<subrun-key>/`.
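
The posted-marker idea can be shown in a few lines. This is a sketch of the general technique, not the watcher's actual code (the watcher is written in Python): `post_once` is a hypothetical helper that runs a given post command only when no `posted.marker` exists in the state directory, and writes the marker only after the command succeeds.

```bash
# Hypothetical helper: run a post command at most once per state directory.
#   $1 = state directory, remaining arguments = the command that performs the post
post_once() {
  local state_dir="$1"; shift
  local marker="$state_dir/posted.marker"
  if [ -e "$marker" ]; then
    return 0              # already posted for this run or sub-run
  fi
  "$@" || return 1        # a failed post leaves no marker, so it can be retried
  mkdir -p "$state_dir" && : > "$marker"
}
```

Pointing the same helper at a `subruns/<subrun-key>/` directory would give one post per categorized sub-run, assuming each sub-run keeps its own marker there.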
## Failure Handling
Expected terminal behavior:
- `COMPLETED`
  - post to Mattermost
  - verify `ok`
  - exit
- `FAILED`
  - post to Mattermost
  - verify `ok`
  - exit
- categorized `COMPLETED` / `FAILED`
  - post once for that grouped sub-run
  - verify `ok`
  - continue until the parent request finishes
- `CANCELLED`
  - write final `CANCELLED` state to `state.json`
  - do not post
  - exit
- `TERMINATED`
  - do not post
  - exit
- `HUNG`
  - do not post
  - exit
- `UNKNOWN`
  - do not post
  - exit
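
The post/no-post decision above reduces to a small lookup. The sketch below expresses it as a shell `case` statement; `should_post` is a hypothetical name, and treating unlisted states as silent is an assumption of the sketch rather than documented watcher behavior.

```bash
# Hypothetical helper: exit status 0 means "post to Mattermost", 1 means "stay silent".
should_post() {
  case "$1" in
    COMPLETED|FAILED) return 0 ;;
    CANCELLED|TERMINATED|HUNG|UNKNOWN) return 1 ;;
    *) return 1 ;;   # assumption: any unlisted state is treated as silent
  esac
}
```

A caller could then write `if should_post "$state"; then ...; fi` around whatever performs the webhook call.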
## Answer To "Do We Need An Installer README?"
Not strictly required, but it is useful.
Why:
- it gives a repeatable controller deployment procedure
- it separates local package design from controller installation steps
- it makes later install/reinstall safer
- it gives you a review checkpoint before anything is installed on `192.168.3.190`
That is the purpose of this file.