# ATVM Watcher Service
This folder contains a per-run ATVM watcher service package. Review it locally first; install it on the ATVM Cypress controller only when explicitly requested.
## Purpose
Watch an ATVM automation request until it reaches a terminal state, then:
- for non-categorized runs:
  - post one final status to Mattermost if the run state is `COMPLETED` or `FAILED`
- for categorized runs:
  - detect each sequential categorized sub-run
  - post one final status per completed categorized sub-run if that grouped run state is `COMPLETED` or `FAILED`
- verify each Mattermost post succeeded
- write durable watcher state
- exit cleanly so the service stops
The watcher does not run indefinitely. It is designed for one run per service instance.
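The post-and-verify step can be sketched as follows. This is a minimal illustration, assuming the webhook is a standard Mattermost incoming webhook that accepts a JSON payload and answers with a plain `ok` body on success. The function name `post_status` and the injectable `opener` parameter are hypothetical and are not taken from `atvm_run_watcher.py`.

```python
import json
import urllib.request


def post_status(webhook_url, channel, text, opener=urllib.request.urlopen):
    """Post one status message; return True only if the reply is HTTP 200 with body `ok`.

    Hypothetical helper for illustration; the real watcher may differ.
    """
    payload = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener(request, timeout=30) as response:
        body = response.read().decode("utf-8", errors="replace").strip()
        # Treat anything other than an explicit `ok` as an unverified post.
        return response.status == 200 and body == "ok"
```

Passing the opener as a parameter keeps the verification logic testable without a live Mattermost endpoint.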
## Files
- `atvm-runner@.service`
  - `systemd` template unit for one runner instance per build name
- `atvm_run_watcher.py`
  - main watcher implementation
- `atvm-run-watcher@.service`
  - `systemd` template unit for one watcher instance per build name
- `run-atvm-runner.sh`
  - runner wrapper used by the `systemd` runner unit
- `start-atvm-runner.sh`
  - helper to write per-run runner environment data and start a runner instance
- `cancel-atvm-runner.sh`
  - helper to stop a runner instance
- `start-atvm-run.sh`
  - wrapper that starts watcher first, waits for it to be active, then starts the runner
- `start-atvm-run-watcher.sh`
  - helper to write per-run environment data and start a watcher instance
- `cancel-atvm-run-watcher.sh`
  - helper to mark a run cancelled and stop the watcher instance
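The watcher-first ordering described for `start-atvm-run.sh` can be sketched in Python as below. This is an illustration only: the function names, the timeout, and the polling interval are assumptions, and the sketch skips the per-run environment files that the real helpers write. The unit names follow the two template units listed above.

```python
import subprocess
import time


def unit_is_active(unit):
    # `systemctl is-active --quiet` exits 0 only when the unit is active.
    return subprocess.run(["systemctl", "is-active", "--quiet", unit]).returncode == 0


def start_unit(unit):
    subprocess.run(["systemctl", "start", unit], check=True)


def start_run(build_name, start=start_unit, is_active=unit_is_active,
              timeout_s=30.0, poll_s=0.5, sleep=time.sleep):
    """Start the watcher, wait until it is active, then start the runner."""
    watcher = f"atvm-run-watcher@{build_name}.service"
    runner = f"atvm-runner@{build_name}.service"
    start(watcher)
    waited = 0.0
    while not is_active(watcher):
        if waited >= timeout_s:
            # Never start the runner if the watcher is not confirmed active.
            raise TimeoutError(f"{watcher} did not become active")
        sleep(poll_s)
        waited += poll_s
    start(runner)
```

The runner is started only after the watcher is confirmed active, which is the ordering property the wrapper is meant to guarantee.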
## Intended Controller Paths
These are the default install targets assumed by the included unit files:
- service package root: `/opt/atvm-watcher-service`
- runner unit: `/etc/systemd/system/atvm-runner@.service`
- watcher state root: `/var/lib/atvm-run-watcher`
- controller ATVM automation root: `/root/cdc-e2e-cyp-12.17.4`
- watcher environment file: `/etc/atvm-run-watcher.env`
Use `/opt/atvm-watcher-service` as the controller install root for future installs and reinstalls.
Do not treat `/root/atvm-watcher-service` as the preferred long-term install location.
## Per-Run Behavior
Each watcher instance is tied to one requested build name.
Typical workflow:
1. Start the watcher for that run.
   - before starting, the helper resets any prior watcher state for the same requested build name so stale cancellation or posted markers do not leak into a new run
2. Start the runner service for that run.
3. The watcher polls the runner log, process state, and `cmcReporter` artifacts.
4. For non-categorized runs, when the run reaches a terminal state:
   - `COMPLETED` or `FAILED`
     - build the final ATVM status
     - send the status to Mattermost
     - verify Mattermost returned `ok`
     - mark the run as posted
     - exit
   - `CANCELLED`, `TERMINATED`, `HUNG`, or `UNKNOWN`
     - do not post
     - mark the final state
     - exit
5. For categorized runs:
   - detect each grouped sub-run in sequence from the parent run log
   - wait for that grouped sub-run to finish
   - send one Mattermost post for that grouped sub-run if it reached `COMPLETED` or `FAILED`
   - keep the watcher alive while the parent categorized runner or related child Cypress process is still active
   - do not treat one completed grouped sub-run as proof that the whole parent request is finished
   - continue to the next grouped sub-run
   - exit after the parent request reaches a terminal state
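The decision logic in the workflow above can be sketched as a small pure function. The state names come from this README; the function `plan_posts`, its arguments, and the non-terminal `RUNNING` value used in the usage example are assumptions made for illustration, not the watcher's actual code.

```python
POSTABLE = {"COMPLETED", "FAILED"}
NO_POST = {"CANCELLED", "TERMINATED", "HUNG", "UNKNOWN"}
TERMINAL = POSTABLE | NO_POST


def plan_posts(subrun_states, parent_state, already_posted=()):
    """Return (sub-runs to post now, whether the watcher may exit).

    subrun_states maps a grouped sub-run name to its current state, in the
    order the sub-runs were detected. Hypothetical helper for illustration.
    """
    posted = set(already_posted)
    to_post = [
        name
        for name, state in subrun_states.items()
        if state in POSTABLE and name not in posted
    ]
    # A finished sub-run never ends the watch; only the parent state does.
    return to_post, parent_state in TERMINAL
```

For example, `plan_posts({"group-a": "COMPLETED", "group-b": "RUNNING"}, "RUNNING")` yields one post for `group-a` and keeps the watcher alive. A non-categorized run can be modelled as a single entry whose state matches the parent state.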
## Required Environment
The watcher service expects the values from the local credentials file to be available on the controller through the service environment:
- `MATTERMOST_ATVM_WEBHOOK`
- `MATTERMOST_ATVM_CHANNEL`
Optional metadata for better status formatting:
- `ATVM_WATCHER_TEMPLATE`
- `ATVM_WATCHER_CONFIG_FAMILY`
- `ATVM_WATCHER_MIGRATION_STYLE`
- `ATVM_WATCHER_INTEGRATION_PLUGIN`
- `ATVM_WATCHER_TEMPLATE_COMMAND`
- `ATVM_WATCHER_RUNNER_COMMAND`
- `ATVM_WATCHER_SCOPE_DESCRIPTION`
- `ATVM_WATCHER_CATEGORIZED`
Runner environment required per run:
- `ATVM_RUNNER_COMMAND`
Runner environment optional per run:
- `ATVM_RUNNER_WORKDIR`
- `ATVM_RUNNER_LOG`
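A start-up check for the required variables could look like the sketch below. The variable names are the ones listed above; the helper `missing_vars` and the choice to treat blank values as missing are assumptions.

```python
import os

REQUIRED_WATCHER_VARS = ("MATTERMOST_ATVM_WEBHOOK", "MATTERMOST_ATVM_CHANNEL")
REQUIRED_RUNNER_VARS = ("ATVM_RUNNER_COMMAND",)


def missing_vars(required, env=None):
    """Return the required variable names that are unset or blank.

    Hypothetical helper for illustration; defaults to the process environment.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]
```

Failing fast on a non-empty result avoids a watcher that runs to completion and only then discovers it cannot post.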
## Start Example
These helpers write per-run environment files and start the matching instances:
```bash
./start-atvm-run-watcher.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
  --template cmc-e2e \
  --template-command "python3 ./cmc-templates.py --template_name cmc-e2e --config_file cypress.atvm-config-gold.ts" \
  --runner-command "python3 ./run-sorry-cypress.py --config_file cypress.atvm-config-gold.ts --build_name e2e-redhat9.6-ubuntu24.04-w2k25-fc --categorize" \
  --config-family gold \
  --migration-style "ATVM end-to-end migration validation" \
  --integration-plugin "pure with fc" \
  --categorize \
  --scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"

./start-atvm-runner.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
  --runner-command "python3 ./run-sorry-cypress.py --config_file cypress.atvm-config-gold.ts --build_name e2e-redhat9.6-ubuntu24.04-w2k25-fc --categorize"
```
Preferred one-shot wrapper:
```bash
./start-atvm-run.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
  --template cmc-e2e \
  --template-command "python3 ./cmc-templates.py --template_name cmc-e2e --config_file cypress.atvm-config-gold.ts" \
  --runner-command "python3 ./run-sorry-cypress.py --config_file cypress.atvm-config-gold.ts --build_name e2e-redhat9.6-ubuntu24.04-w2k25-fc --categorize" \
  --config-family gold \
  --config-file cypress.atvm-config-gold.ts \
  --migration-style "ATVM end-to-end migration validation" \
  --integration-plugin "pure with fc" \
  --categorize
```
That results in:
- state dir:
  - `/var/lib/atvm-run-watcher/e2e-redhat9.6-ubuntu24.04-w2k25-fc`
- service instances:
  - `atvm-run-watcher@e2e-redhat9.6-ubuntu24.04-w2k25-fc.service`
  - `atvm-runner@e2e-redhat9.6-ubuntu24.04-w2k25-fc.service`
The helper also:
- stops any stale watcher instance for that same requested build name
- removes the old watcher state directory for that requested build name
- starts the new watcher with a clean state root for the new run
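The state reset described above can be sketched as follows. This is an illustration under assumptions: the function `reset_state_dir` and its build-name validation are hypothetical, and the sketch covers only the directory reset, not stopping a stale watcher instance.

```python
import shutil
from pathlib import Path


def reset_state_dir(state_root, build_name):
    """Remove any old state for build_name and return a fresh, empty state dir.

    Hypothetical helper; on the controller state_root would be
    /var/lib/atvm-run-watcher according to this README.
    """
    if not build_name or "/" in build_name or build_name in {".", ".."}:
        # Refuse names that could escape the state root.
        raise ValueError(f"unsafe build name: {build_name!r}")
    state_dir = Path(state_root) / build_name
    shutil.rmtree(state_dir, ignore_errors=True)
    state_dir.mkdir(parents=True)
    return state_dir
```

Deleting the whole per-build directory, rather than individual files, is what keeps stale cancellation or posted markers from leaking into the new run.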
## Cancel Example
```bash
./cancel-atvm-run-watcher.sh --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc
```
This writes a cancellation marker, updates `state.json` to `CANCELLED`, and stops the watcher instance. The watcher will not send Mattermost results for that run.
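The marker-and-state update can be sketched as below. The `state.json` file name and the `CANCELLED` value come from this README; the marker file name `cancelled`, the `state` key, and the function `mark_cancelled` are assumptions for illustration.

```python
import json
from pathlib import Path


def mark_cancelled(state_dir, marker_name="cancelled"):
    """Write a cancellation marker and set state.json to CANCELLED.

    Hypothetical helper; the marker name and JSON key are assumed.
    """
    state_dir = Path(state_dir)
    state_dir.mkdir(parents=True, exist_ok=True)
    (state_dir / marker_name).touch()

    state_path = state_dir / "state.json"
    try:
        state = json.loads(state_path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        state = {}
    if not isinstance(state, dict):
        state = {}
    state["state"] = "CANCELLED"

    # Write to a temporary file and rename so readers never see a partial file.
    tmp_path = state_dir / "state.json.tmp"
    tmp_path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp_path.replace(state_path)
    return state
```

Writing the marker before stopping the service lets a watcher that is still polling observe the cancellation and skip the Mattermost post.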
Runner cancel example:
```bash
./cancel-atvm-runner.sh --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc
```
## Notes
- The watcher uses the same ATVM status layout documented in `atvm/docs/automation/status-template.md`.
- Prefer the controller-local `atvm-runner@...` service over ad hoc `nohup` or detached SSH launch patterns for `run-sorry-cypress.py`.
- Prefer `start-atvm-run.sh` when launching both services together because it prevents the watcher/runner log-path race by enforcing watcher-first ordering.
- Kernel values are resolved from `atvm/inventory/vm-inventory.md`.
- Categorized execution is treated as sequential grouped ATVM sub-runs, not as one parent run with internal phases.
- In categorized mode, the watcher writes per-subrun state under `subruns/` and posts each completed grouped run separately.
- In categorized mode, if the child build id label does not match the host/spec actually being executed, the watcher reports the grouped run using the inferred host-based group name instead of trusting the raw child build id label.
- In categorized mode, grouped XML can finish with only `check-xml-files.ts`; when that happens, the watcher must recover per-host results from the matching host reporter artifacts.
- Do not infer `PASS completed` from host artifact presence alone. Parse the per-host reporter result and preserve real `FAIL` and `RUN/pending` state when reconstructing grouped results.
- When the repo copy of the watcher changes, the controller install under `/opt/atvm-watcher-service` must be updated before expecting the new reporting behavior from live runs.
- Best-practice controller install path: `/opt/atvm-watcher-service`.
- This package is local-only right now. Nothing here is installed on the controller yet.
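The note about not inferring `PASS completed` from artifact presence can be illustrated with a small normalization step. The input shape (a mapping from host to an already parsed status string, or `None` when an artifact exists but no result could be parsed) and the `UNKNOWN` fallback are assumptions; the actual reporter artifact format is not described in this README.

```python
RECOGNIZED = {"PASS", "FAIL", "RUN"}


def reconstruct_host_results(parsed_by_host):
    """Normalize per-host statuses without ever defaulting to PASS.

    Hypothetical helper; real parsing of reporter artifacts is out of scope.
    """
    results = {}
    for host, status in parsed_by_host.items():
        normalized = (status or "").strip().upper()
        # A missing or unrecognized result stays visible as UNKNOWN.
        results[host] = normalized if normalized in RECOGNIZED else "UNKNOWN"
    return results
```

The key property is that a host counts as passed only when its parsed result says so; real `FAIL` and `RUN` values pass through unchanged.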