# ATVM Watcher Service Install Plan

This document describes how to deploy the ATVM per-run watcher service to the ATVM Cypress controller at `192.168.3.190`.

This is a deployment plan only. It does not perform the installation.

## Goal

Install the local watcher package so the controller can:

- watch one ATVM run per watcher instance
- send final Mattermost status only for `COMPLETED` or `FAILED`
- suppress Mattermost posts for `CANCELLED`, `TERMINATED`, `HUNG`, and `UNKNOWN`
- stop automatically after the watched run reaches a terminal state

## Controller Target Layout

Recommended controller paths:

- package root:
  - `/opt/atvm-watcher-service`
- service unit:
  - `/etc/systemd/system/atvm-run-watcher@.service`
- global environment file:
  - `/etc/atvm-run-watcher.env`
- state root:
  - `/var/lib/atvm-run-watcher`
- ATVM automation root:
  - `/root/cdc-e2e-cyp-12.17.4`

Best-practice rule:

- install the watcher service package under `/opt/atvm-watcher-service`
- do not use `/root/atvm-watcher-service` as the standard install location
- if a temporary `/root/atvm-watcher-service` install exists, replace it with a clean `/opt/atvm-watcher-service` install

## Files To Install

From the local workspace:

- `/home/aw/code/cds/atvm/watcher-service/atvm_run_watcher.py`
- `/home/aw/code/cds/atvm/watcher-service/atvm-run-watcher@.service`
- `/home/aw/code/cds/atvm/watcher-service/start-atvm-run-watcher.sh`
- `/home/aw/code/cds/atvm/watcher-service/cancel-atvm-run-watcher.sh`
- `/home/aw/code/cds/atvm/inventory/vm-inventory.md`

Optional reference docs:

- `/home/aw/code/cds/atvm/watcher-service/README.md`
- `/home/aw/code/cds/atvm/watcher-service/INSTALL.md`

## Required Controller Environment

The controller must have:

- `python3`
- `systemd`
- outbound network access to the Mattermost webhook
- read access to:
  - `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`
  - `/tmp/<build-name>.log`

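The requirements above could be checked with a small pre-flight script before anything is installed. This is a minimal sketch, not part of the watcher package: the function names `require_cmd`, `require_readable`, and `preflight` are hypothetical, and the presence of `systemctl` is used as a stand-in for "`systemd` is available".

```bash
# Pre-flight sketch (hypothetical helpers, not shipped with the watcher package).

# Return 0 if the named command is on PATH, otherwise report it and return 1.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing command: $1" >&2
    return 1
}

# Return 0 if the path exists and is readable by the current user.
require_readable() {
    if [ -r "$1" ]; then
        return 0
    fi
    echo "not readable: $1" >&2
    return 1
}

# Run every check and count failures instead of stopping at the first one.
preflight() {
    failures=0
    require_cmd python3 || failures=$((failures + 1))
    require_cmd systemctl || failures=$((failures + 1))
    require_readable /root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter || failures=$((failures + 1))
    echo "preflight failures: $failures"
    [ "$failures" -eq 0 ]
}
```

Calling `preflight` on the controller would report everything that is missing in one pass. The per-run log `/tmp/<build-name>.log` is left out because its name depends on the build; it can be checked with `require_readable` once the build name is known. Outbound access to the Mattermost webhook is not checked here.
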
## Required Secrets

The controller needs a watcher environment file with:

- `MATTERMOST_ATVM_WEBHOOK`
- `MATTERMOST_ATVM_CHANNEL`

Recommended file:

- `/etc/atvm-run-watcher.env`

Recommended permissions:

- owner: `root`
- mode: `0600`

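One way to create that file with the recommended mode is sketched below. `write_env_file` is a hypothetical helper, and the sketch assumes the file uses the plain `KEY=value` form that systemd's `EnvironmentFile=` directive reads; the unit file itself is not shown in this plan, so treat that as an assumption.

```bash
# Sketch: create the watcher environment file with restricted permissions.
# write_env_file is a hypothetical helper; callers supply the real values.
#   $1 = path to the environment file
#   $2 = Mattermost webhook URL
#   $3 = Mattermost channel
write_env_file() {
    env_file=$1
    webhook=$2
    channel=$3
    # Tighten the umask in a subshell so the file is never created
    # with wider permissions, and the caller's umask is untouched.
    (
        umask 077
        printf 'MATTERMOST_ATVM_WEBHOOK=%s\nMATTERMOST_ATVM_CHANNEL=%s\n' \
            "$webhook" "$channel" > "$env_file"
    ) || return 1
    # Also covers the case where the file already existed with a wider mode.
    chmod 0600 "$env_file"
}
```

Run as `root`, a call such as `write_env_file /etc/atvm-run-watcher.env '<webhook-url>' '<channel>'` would leave the file owned by `root` with mode `0600`. The angle-bracket values are placeholders, not real settings.
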
## Deployment Steps

1. Create controller directories.
   - `/opt/atvm-watcher-service`
   - `/var/lib/atvm-run-watcher`

2. Copy package files to the controller.
   - copy the Python watcher
   - copy the `systemd` unit file
   - copy the helper scripts
   - copy `vm-inventory.md`

3. Set executable permissions.
   - `atvm_run_watcher.py`
   - `start-atvm-run-watcher.sh`
   - `cancel-atvm-run-watcher.sh`

4. Create `/etc/atvm-run-watcher.env`.
   - add Mattermost webhook/channel
   - keep permissions restricted

5. Install the `systemd` unit file.
   - copy to `/etc/systemd/system/atvm-run-watcher@.service`

6. Reload `systemd`.
   - `systemctl daemon-reload`

7. Run a syntax/smoke validation.
   - check Python import/launch
   - check helper script usage
   - verify the unit resolves

8. Do a non-production test.
   - start a watcher for a fake or completed build name
   - confirm state directory creation
   - confirm the watcher exits as expected

9. Do a real ATVM run test.
   - launch a real run
   - start the watcher for that build name
   - confirm final Mattermost delivery for a completed run

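Steps 1 to 3 could be combined into one staging function, sketched below. `stage_watcher_package` is a hypothetical helper, not part of the package. It assumes the files have already been transferred to a directory on the controller, that `vm-inventory.md` belongs in the package root, and that mode `0755` is acceptable for both directories; none of these is specified by this plan.

```bash
# Sketch of steps 1-3: create directories, copy package files, set modes.
# stage_watcher_package is a hypothetical helper.
#   $1 = directory holding the watcher-service files
#   $2 = path to vm-inventory.md
#   $3 = package root (for example /opt/atvm-watcher-service)
#   $4 = state root (for example /var/lib/atvm-run-watcher)
stage_watcher_package() {
    src=$1
    inventory=$2
    pkg_root=$3
    state_root=$4
    # Step 1: create both directories (mode 0755 is an assumption).
    install -d -m 0755 "$pkg_root" "$state_root" || return 1
    # Steps 2 and 3: copy the watcher and helper scripts as executables.
    for f in atvm_run_watcher.py start-atvm-run-watcher.sh cancel-atvm-run-watcher.sh; do
        install -m 0755 "$src/$f" "$pkg_root/$f" || return 1
    done
    # Step 2: copy the non-executable files.
    install -m 0644 "$src/atvm-run-watcher@.service" "$pkg_root/atvm-run-watcher@.service" || return 1
    install -m 0644 "$inventory" "$pkg_root/vm-inventory.md" || return 1
}
```

Because every path is an argument, the function can be rehearsed against a scratch directory before it is pointed at `/opt/atvm-watcher-service`. Steps 4 to 6 (environment file, unit installation under `/etc/systemd/system`, and `systemctl daemon-reload`) are deliberately left out.
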
## Recommended Validation Commands

Examples for later execution on the controller:

```bash
mkdir -p /opt/atvm-watcher-service /var/lib/atvm-run-watcher
```

```bash
chmod 755 /opt/atvm-watcher-service/atvm_run_watcher.py
chmod 755 /opt/atvm-watcher-service/start-atvm-run-watcher.sh
chmod 755 /opt/atvm-watcher-service/cancel-atvm-run-watcher.sh
```

```bash
systemctl daemon-reload
systemctl cat atvm-run-watcher@.service
```

```bash
python3 /opt/atvm-watcher-service/atvm_run_watcher.py --help
```

```bash
/opt/atvm-watcher-service/start-atvm-run-watcher.sh --help
```

## Per-Run Usage After Install

Once installed, the intended workflow is:

1. Launch the ATVM run as usual.
2. Start the watcher for that build name.
3. Let the watcher run on the controller.
4. The watcher exits on terminal state.

Example:

```bash
/opt/atvm-watcher-service/start-atvm-run-watcher.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
  --template cmc-e2e \
  --config-family gold \
  --migration-style "ATVM end-to-end migration validation" \
  --integration-plugin "pure with fc" \
  --scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"
```

Cancel example:

```bash
/opt/atvm-watcher-service/cancel-atvm-run-watcher.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc
```

The cancel helper should:

- write `cancelled.marker`
- update `state.json` so the final watcher state is `CANCELLED`
- stop the watcher instance
- avoid any Mattermost post for that run

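The state-file part of that behavior might look like the sketch below. It is an illustration only: `mark_cancelled` is a hypothetical function, the JSON field names (`state`, `cancelled_at`, `note`) are assumptions rather than the watcher's real schema, and the sketch overwrites `state.json` where the real helper would update the existing file.

```bash
# Sketch of the cancel helper's state handling (hypothetical, simplified).
#   $1 = per-run state directory, e.g. /var/lib/atvm-run-watcher/<build-name>
#   $2 = optional cancellation note (must not contain double quotes here)
mark_cancelled() {
    state_dir=$1
    note=${2:-cancelled by operator}
    now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    mkdir -p "$state_dir" || return 1
    # 1. Write the marker file, stamped with the cancellation time.
    printf '%s\n' "$now" > "$state_dir/cancelled.marker" || return 1
    # 2. Write the final state to a temp file, then rename it into place so
    #    readers never observe a half-written state.json.
    printf '{"state": "CANCELLED", "cancelled_at": "%s", "note": "%s"}\n' \
        "$now" "$note" > "$state_dir/state.json.tmp" || return 1
    mv "$state_dir/state.json.tmp" "$state_dir/state.json" || return 1
    # 3. Stopping the watcher instance would follow here, for example with
    #    `systemctl stop` on the matching atvm-run-watcher@ instance. How the
    #    build name maps to the instance name is not covered by this plan.
}
```

Writing the state before stopping the service means the final `CANCELLED` record exists even if the stop itself fails. A real implementation would likely use a JSON-aware tool so that the note is escaped correctly and existing fields in `state.json` are preserved.
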
## Operational Notes

- This is not a daemon.
- One watcher instance is started per ATVM run.
- The watcher exits after the run reaches a terminal state.
- The watcher writes state under `/var/lib/atvm-run-watcher/<build-name>`.
- The watcher prevents duplicate Mattermost posts by writing a posted marker.

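The posted-marker idea in the last note can be sketched as follows. This is not the watcher's actual code: `post_once` is a hypothetical function, the file name `posted.marker` is an assumption, and the send command passed in stands for whatever performs the real webhook call.

```bash
# Sketch of marker-based de-duplication (hypothetical names throughout).
#   $1    = per-run state directory
#   $2... = command that performs the post; it must exit 0 on success
post_once() {
    state_dir=$1
    shift
    marker="$state_dir/posted.marker"
    # A marker from an earlier attempt means the post already went out.
    if [ -e "$marker" ]; then
        return 0
    fi
    mkdir -p "$state_dir" || return 1
    # Run the send command; record the marker only if it succeeded,
    # so a failed post can still be retried later.
    "$@" || return 1
    date -u +%Y-%m-%dT%H:%M:%SZ > "$marker"
}
```

With this shape, restarting a watcher for a build whose state directory already holds the marker would skip the post instead of sending it a second time.
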
## Failure Handling

Expected terminal behavior:

- `COMPLETED`
  - post to Mattermost
  - verify `ok`
  - exit
- `FAILED`
  - post to Mattermost
  - verify `ok`
  - exit
- `CANCELLED`
  - write final `CANCELLED` state to `state.json`
  - do not post
  - exit
- `TERMINATED`
  - do not post
  - exit
- `HUNG`
  - do not post
  - exit
- `UNKNOWN`
  - do not post
  - exit

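The policy above reduces to two small predicates, sketched here for reference. The function names are hypothetical and this is not taken from `atvm_run_watcher.py`; it only restates the list as code.

```bash
# Sketch of the terminal-state policy (hypothetical function names).

# Succeeds for any state after which the watcher exits.
is_terminal_state() {
    case "$1" in
        COMPLETED|FAILED|CANCELLED|TERMINATED|HUNG|UNKNOWN) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds only for the states that trigger a Mattermost post.
should_post() {
    case "$1" in
        COMPLETED|FAILED) return 0 ;;
        *) return 1 ;;
    esac
}
```

A caller would test `is_terminal_state "$state"` to decide whether to exit, and `should_post "$state"` to decide whether to post first. Keeping the two checks separate makes the "exit without posting" cases explicit.
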
## Answer To "Do We Need An Installer README?"

Not strictly, but yes, it is useful.

Why:

- it gives a repeatable controller deployment procedure
- it separates local package design from controller installation steps
- it makes later install/reinstall safer
- it gives you a review checkpoint before anything is installed on `192.168.3.190`

That is the purpose of this file.