# ATVM Watcher Service Install Plan

This document describes how to deploy the ATVM per-run watcher and runner services to the ATVM Cypress controller at `192.168.3.190`.

This is a deployment plan only. It does not perform the installation.

## Goal

Install the local watcher/runner package so the controller can:

- start one requested ATVM Cypress runner per service instance
- watch one requested ATVM run per watcher instance
- for non-categorized runs, send one final Mattermost status only for `COMPLETED` or `FAILED`
- for categorized runs, send one final Mattermost status per completed categorized sub-run/group
- suppress Mattermost posts for `CANCELLED`, `TERMINATED`, `HUNG`, and `UNKNOWN`
- stop automatically after the watched run reaches a terminal state

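The posting rules in this list reduce to a single decision on the terminal state. The following is a minimal sketch of that decision, not the watcher's actual code: the function name `should_post_status` is hypothetical, and the real logic lives in `atvm_run_watcher.py` and may differ.

```bash
# Hypothetical helper: decide whether a terminal state warrants a Mattermost post.
# Returns 0 (post) for COMPLETED or FAILED, 1 (suppress) for every other state,
# including CANCELLED, TERMINATED, HUNG, and UNKNOWN.
should_post_status() {
    case "$1" in
        COMPLETED|FAILED) return 0 ;;
        *) return 1 ;;
    esac
}
```

A caller would use it as `should_post_status "$state" && post_something`, which keeps the suppression list implicit: any state not explicitly allowed is silent.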
## Controller Target Layout

Recommended controller paths:

- package root:
  - `/opt/atvm-watcher-service`
- runner service unit:
  - `/etc/systemd/system/atvm-runner@.service`
- watcher service unit:
  - `/etc/systemd/system/atvm-run-watcher@.service`
- global environment file:
  - `/etc/atvm-run-watcher.env`
- state root:
  - `/var/lib/atvm-run-watcher`
- ATVM automation root:
  - `/root/cdc-e2e-cyp-12.17.4`

Best-practice rule:

- install the watcher service package under `/opt/atvm-watcher-service`
- do not use `/root/atvm-watcher-service` as the standard install location
- if a temporary `/root/atvm-watcher-service` install exists, replace it with a clean `/opt/atvm-watcher-service` install

## Files To Install

From the local workspace:

- `/home/aw/code/cds/atvm/watcher-service/atvm_run_watcher.py`
- `/home/aw/code/cds/atvm/watcher-service/run-atvm-runner.sh`
- `/home/aw/code/cds/atvm/watcher-service/atvm-runner@.service`
- `/home/aw/code/cds/atvm/watcher-service/start-atvm-runner.sh`
- `/home/aw/code/cds/atvm/watcher-service/cancel-atvm-runner.sh`
- `/home/aw/code/cds/atvm/watcher-service/start-atvm-run.sh`
- `/home/aw/code/cds/atvm/watcher-service/atvm-run-watcher@.service`
- `/home/aw/code/cds/atvm/watcher-service/start-atvm-run-watcher.sh`
- `/home/aw/code/cds/atvm/watcher-service/cancel-atvm-run-watcher.sh`
- `/home/aw/code/cds/atvm/inventory/vm-inventory.md`

Optional reference docs:

- `/home/aw/code/cds/atvm/watcher-service/README.md`
- `/home/aw/code/cds/atvm/watcher-service/INSTALL.md`

## Required Controller Environment

The controller must have:

- `python3`
- `systemd`
- outbound network access to the Mattermost webhook
- read access to:
  - `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`
  - `/tmp/<build-name>.log`

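These requirements can be checked before installing anything. The sketch below is illustrative only: the helper names `require_commands` and `require_readable` are hypothetical and are not part of the watcher package.

```bash
# Hypothetical preflight helpers. Each prints one line per problem to stderr
# and returns non-zero if anything is missing.
require_commands() {
    local cmd missing=0
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || { echo "missing command: $cmd" >&2; missing=1; }
    done
    return "$missing"
}

require_readable() {
    local path missing=0
    for path in "$@"; do
        [ -r "$path" ] || { echo "not readable: $path" >&2; missing=1; }
    done
    return "$missing"
}
```

On the controller this might be called as `require_commands python3 systemctl && require_readable /root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`. The `/tmp/<build-name>.log` check probably only makes sense once a run for that build name exists, so it is left out of the example call.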
## Required Secrets

The controller needs a watcher environment file with:

- `MATTERMOST_ATVM_WEBHOOK`
- `MATTERMOST_ATVM_CHANNEL`

Recommended file:

- `/etc/atvm-run-watcher.env`

Recommended permissions:

- owner: `root`
- mode: `0600`

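One way to create that file without ever exposing it to other users is sketched below. The function name `write_env_file` is hypothetical, and the `KEY=value` layout assumes the units load the file as a plain environment file; check the unit files before relying on it.

```bash
# Hypothetical helper: write the watcher environment file with owner-only access.
# Usage: write_env_file FILE WEBHOOK_URL CHANNEL
write_env_file() {
    local file="$1" webhook="$2" channel="$3"
    # umask 077 in a subshell: a newly created file is never group/world readable.
    ( umask 077
      printf 'MATTERMOST_ATVM_WEBHOOK=%s\nMATTERMOST_ATVM_CHANNEL=%s\n' \
          "$webhook" "$channel" > "$file" ) || return 1
    # Tighten the mode explicitly in case the file already existed with wider permissions.
    chmod 0600 "$file"
}
```

When run as `root` against `/etc/atvm-run-watcher.env`, the resulting file is owned by `root` with mode `0600`, matching the recommendation above. The webhook URL and channel are secrets and should be supplied at install time, not committed anywhere.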
## Deployment Steps

1. Create controller directories.
   - `/opt/atvm-watcher-service`
   - `/var/lib/atvm-run-watcher`

2. Copy package files to the controller.
   - copy the runner wrapper
   - copy the runner `systemd` unit file
   - copy the runner helper scripts
   - copy the Python watcher
   - copy the watcher `systemd` unit file
   - copy the watcher helper scripts
   - copy `vm-inventory.md`

3. Set executable permissions.
   - `run-atvm-runner.sh`
   - `start-atvm-runner.sh`
   - `cancel-atvm-runner.sh`
   - `start-atvm-run.sh`
   - `atvm_run_watcher.py`
   - `start-atvm-run-watcher.sh`
   - `cancel-atvm-run-watcher.sh`

4. Create `/etc/atvm-run-watcher.env`.
   - add Mattermost webhook/channel
   - keep permissions restricted

5. Install the `systemd` unit files.
   - copy the runner unit to `/etc/systemd/system/atvm-runner@.service`
   - copy the watcher unit to `/etc/systemd/system/atvm-run-watcher@.service`

6. Reload `systemd`.
   - `systemctl daemon-reload`

7. Run a syntax/smoke validation.
   - check Python import/launch
   - check helper script usage
   - verify the units resolve

8. Do a non-production test.
   - start a watcher for a fake or completed build name
   - confirm state directory creation
   - confirm the watcher exits as expected

9. Do a real ATVM run test.
   - launch a real run
   - start the watcher for that build name
   - if the run uses `--categorize`, also pass `--categorize` to the watcher start helper
   - confirm final Mattermost delivery for a completed run
   - confirm categorized execution sends one post per completed grouped sub-run
   - confirm the watcher stays alive between categorized grouped runs while the parent request is still active
   - confirm reused parent build names do not inherit stale `cancelled.marker`, `posted.marker`, or `subruns/` state from older runs

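Steps 1, 2, 3, and 5 could be combined into one repeatable function. The sketch below is only an illustration under stated assumptions: the function name `install_watcher_package` is hypothetical, it assumes the package files have already been transferred to a staging directory on the controller, and it takes a root prefix so it can be rehearsed against a scratch directory first. `vm-inventory.md` is left out because this plan does not name its destination path.

```bash
# Hypothetical installer sketch.
# Usage: install_watcher_package STAGING_DIR ROOT_PREFIX
#   ROOT_PREFIX is "" for the real filesystem, or a scratch directory for a dry run.
install_watcher_package() {
    local src="$1" root="$2" f u
    local pkg="$root/opt/atvm-watcher-service"
    local units="$root/etc/systemd/system"
    mkdir -p "$pkg" "$units" "$root/var/lib/atvm-run-watcher" || return 1
    # Executable helpers and the Python watcher (mode 755).
    for f in run-atvm-runner.sh start-atvm-runner.sh cancel-atvm-runner.sh \
             start-atvm-run.sh atvm_run_watcher.py \
             start-atvm-run-watcher.sh cancel-atvm-run-watcher.sh; do
        install -m 755 "$src/$f" "$pkg/$f" || return 1
    done
    # systemd template units (mode 644).
    for u in atvm-runner@.service atvm-run-watcher@.service; do
        install -m 644 "$src/$u" "$units/$u" || return 1
    done
}
```

A rehearsal such as `install_watcher_package /path/to/staging "$(mktemp -d)"` shows the resulting layout without touching `/opt` or `/etc`. After a real install, `systemctl daemon-reload` (step 6) is still required.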
## Recommended Validation Commands

Examples for later execution on the controller:

```bash
mkdir -p /opt/atvm-watcher-service /var/lib/atvm-run-watcher
```

```bash
chmod 755 /opt/atvm-watcher-service/run-atvm-runner.sh
chmod 755 /opt/atvm-watcher-service/start-atvm-runner.sh
chmod 755 /opt/atvm-watcher-service/cancel-atvm-runner.sh
chmod 755 /opt/atvm-watcher-service/start-atvm-run.sh
chmod 755 /opt/atvm-watcher-service/atvm_run_watcher.py
chmod 755 /opt/atvm-watcher-service/start-atvm-run-watcher.sh
chmod 755 /opt/atvm-watcher-service/cancel-atvm-run-watcher.sh
```

```bash
systemctl daemon-reload
systemctl cat atvm-runner@.service
systemctl cat atvm-run-watcher@.service
```

```bash
python3 /opt/atvm-watcher-service/atvm_run_watcher.py --help
```

```bash
/opt/atvm-watcher-service/start-atvm-run.sh --help
```

```bash
/opt/atvm-watcher-service/start-atvm-runner.sh --help
```

```bash
/opt/atvm-watcher-service/start-atvm-run-watcher.sh --help
```

## Per-Run Usage After Install

Once installed, the intended workflow is:

1. Run the approved `cmc-templates.py` command for that build name.
   - when using `start-atvm-run.sh`, the wrapper should execute `--template-command` synchronously and stop immediately if that step fails
2. Start the watcher for that build name.
   - the start helper must clear any stale watcher state for that same requested build name before starting the new watcher instance
3. Start the runner service for that build name.
4. Let the runner and watcher run on the controller.
5. The watcher exits on terminal state.

Example:

```bash
/opt/atvm-watcher-service/start-atvm-run-watcher.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
  --template cmc-e2e \
  --template-command "python3 ./cmc-templates.py --template_name cmc-e2e --config_file cypress.atvm-config-gold.ts" \
  --runner-command "python3 ./run-sorry-cypress.py --config_file cypress.atvm-config-gold.ts --build_name e2e-redhat9.6-ubuntu24.04-w2k25-fc --categorize" \
  --config-family gold \
  --migration-style "ATVM end-to-end migration validation" \
  --integration-plugin "pure with fc" \
  --categorize \
  --scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"

/opt/atvm-watcher-service/start-atvm-runner.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
  --runner-command "python3 ./run-sorry-cypress.py --config_file cypress.atvm-config-gold.ts --build_name e2e-redhat9.6-ubuntu24.04-w2k25-fc --categorize"
```

Preferred combined start:

```bash
/opt/atvm-watcher-service/start-atvm-run.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
  --template cmc-e2e \
  --template-command "python3 ./cmc-templates.py --template_name cmc-e2e --config_file cypress.atvm-config-gold.ts" \
  --runner-command "python3 ./run-sorry-cypress.py --config_file cypress.atvm-config-gold.ts --build_name e2e-redhat9.6-ubuntu24.04-w2k25-fc --categorize" \
  --config-family gold \
  --config-file cypress.atvm-config-gold.ts \
  --migration-style "ATVM end-to-end migration validation" \
  --integration-plugin "pure with fc" \
  --categorize
```

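The synchronous template step that `start-atvm-run.sh` is expected to enforce (workflow step 1) can be pictured as a gate in front of the watcher and runner startup. The sketch below is not the wrapper's actual code: the function name `run_template_step`, the `LOG_DIR` variable, and the `<build-name>.launch.log` file name are assumptions made for illustration.

```bash
# Hypothetical sketch of the template gate.
# Usage: run_template_step BUILD_NAME "TEMPLATE COMMAND"
# Assumes the caller has already changed into the ATVM automation root,
# since the template command uses a relative script path.
run_template_step() {
    local build_name="$1" template_command="$2"
    local log_file="${LOG_DIR:-/tmp}/${build_name}.launch.log"
    # Run in the foreground so the exit status is known before anything else starts.
    if ! bash -c "$template_command" >"$log_file" 2>&1; then
        echo "template step failed for ${build_name}; see ${log_file}" >&2
        return 1
    fi
}
```

A wrapper built this way would call `run_template_step "$BUILD_NAME" "$TEMPLATE_COMMAND" || exit 1` and only then start the watcher, followed by the runner. Failing early here avoids launching a run against templates that were never regenerated.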
Cancel example:

```bash
/opt/atvm-watcher-service/cancel-atvm-runner.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc
```

```bash
/opt/atvm-watcher-service/cancel-atvm-run-watcher.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc
```

The cancel helper should:

- write `cancelled.marker`
- update `state.json` so the final watcher state is `CANCELLED`
- stop the watcher instance
- avoid any Mattermost post for that run

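The file-side part of that cancel contract might look like the following sketch. It is an illustration under assumptions, not the contents of `cancel-atvm-run-watcher.sh`: the function name `mark_run_cancelled`, the `STATE_ROOT` override, and the one-field `state.json` shape are all hypothetical, and a real helper would more likely update the existing `state.json` than replace it.

```bash
# Hypothetical sketch: record a cancellation in the watcher state directory.
# Usage: mark_run_cancelled BUILD_NAME
mark_run_cancelled() {
    local build_name="$1"
    local state_dir="${STATE_ROOT:-/var/lib/atvm-run-watcher}/${build_name}"
    mkdir -p "$state_dir" || return 1
    # The marker records that the stop was requested, so no post should follow.
    date -u +%Y-%m-%dT%H:%M:%SZ > "$state_dir/cancelled.marker" || return 1
    # Write to a temporary file and rename, so readers never see a partial state.json.
    printf '{"state": "CANCELLED"}\n' > "$state_dir/state.json.tmp" || return 1
    mv "$state_dir/state.json.tmp" "$state_dir/state.json"
}
```

Writing the marker before stopping the watcher instance matters for ordering: if the unit were stopped first, the watcher could observe its own shutdown without any record that the stop was deliberate. Stopping the instance itself would be a `systemctl stop` on the matching `atvm-run-watcher@` instance; how the build name maps to the instance name is not specified in this plan.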
## Operational Notes

- This is not a daemon.
- One runner instance is started per ATVM run.
- One watcher instance is started per ATVM run.
- Prefer the `atvm-runner@...` service over detached SSH background launch patterns for `run-sorry-cypress.py`.
- Prefer `start-atvm-run.sh` over launching watcher and runner separately when both are needed, because it enforces the safe watcher-first order.
- Categorized execution is treated as one watcher instance tracking sequential grouped ATVM sub-runs.
- In categorized execution, the watcher must remain alive until the parent request has actually gone inactive past the grace window, even if one grouped sub-run already completed.
- The watcher exits after the run reaches a terminal state.
- The watcher writes state under `/var/lib/atvm-run-watcher/<build-name>`.
- The watcher prevents duplicate Mattermost posts by writing posted markers.
- Categorized sub-run state is written under `/var/lib/atvm-run-watcher/<build-name>/subruns/<subrun-key>/`.

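The posted-marker idea can be illustrated with a small check-and-create helper. This is a sketch of the general technique, not the watcher's implementation (which is in Python): the function name `claim_post` is hypothetical, and whether the real watcher writes the marker before or after the post is not stated in this plan.

```bash
# Hypothetical sketch: allow exactly one post per state directory.
# Returns 0 for the first caller and non-zero for every later one.
# Usage: claim_post STATE_DIR
claim_post() {
    local state_dir="$1"
    mkdir -p "$state_dir" || return 1
    # With noclobber set, the redirection fails if posted.marker already exists,
    # so the existence check and the creation happen in one step.
    ( set -o noclobber; : > "$state_dir/posted.marker" ) 2>/dev/null
}
```

For a non-categorized run the directory would be `/var/lib/atvm-run-watcher/<build-name>`; for a categorized run each sub-run directory under `subruns/<subrun-key>/` would get its own marker, which gives one post per grouped sub-run. One trade-off of claiming before posting is that a failed delivery is not retried, so an implementation that needs retries might write the marker only after the post is verified.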
## Failure Handling

Expected terminal behavior:

- `COMPLETED`
  - post to Mattermost
  - verify `ok`
  - exit
- `FAILED`
  - post to Mattermost
  - verify `ok`
  - exit
- categorized `COMPLETED` / `FAILED`
  - post once for that grouped sub-run
  - verify `ok`
  - continue until the parent request finishes
- `CANCELLED`
  - write final `CANCELLED` state to `state.json`
  - do not post
  - exit
- `TERMINATED`
  - do not post
  - exit
- `HUNG`
  - do not post
  - exit
- `UNKNOWN`
  - do not post
  - exit

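The "post, then verify `ok`" pair can be sketched with `curl`. A Mattermost incoming webhook answers a successful post with the plain body `ok`, which is what the verification step checks. The function names below are hypothetical, the payload is reduced to a bare `text` field, and the real watcher is written in Python, so treat this only as an illustration of the check.

```bash
# Hypothetical sketch: verify a webhook reply body.
is_webhook_ok() {
    [ "$1" = "ok" ]
}

# Hypothetical sketch: post one message and verify the reply.
# Expects MATTERMOST_ATVM_WEBHOOK in the environment (from the watcher env file).
# NOTE: the text must already be JSON-safe; this sketch does no escaping.
post_final_status() {
    local text="$1" response
    response="$(curl -sS --max-time 30 -H 'Content-Type: application/json' \
        -d "{\"text\": \"${text}\"}" "$MATTERMOST_ATVM_WEBHOOK")" || return 1
    is_webhook_ok "$response"
}
```

Keeping the check separate from the transport makes the rule easy to test without network access, and any non-`ok` body (an error message, an HTML login page, an empty reply) counts as a failed delivery.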
## Answer To "Do We Need An Installer README?"

Not strictly, but yes, it is useful.

Why:

- it gives a repeatable controller deployment procedure
- it separates local package design from controller installation steps
- it makes later install/reinstall safer
- it gives you a review checkpoint before anything is installed on `192.168.3.190`

That is the purpose of this file.