diff --git a/atvm/AGENTS.md b/atvm/AGENTS.md
index 7945421..6898c79 100644
--- a/atvm/AGENTS.md
+++ b/atvm/AGENTS.md
@@ -78,6 +78,7 @@ This file defines how to operate and maintain the ATVM workspace in `/home/aw/co
 - Unless the operator explicitly reconfirms `both` for `cmc-reboot`, prefer only `fc` or only `iscsi`, not both.
 - When the watcher is requested, start the watcher before `run-sorry-cypress.py`.
 - When the watcher is requested, build the watcher-start command so it automatically includes the exact approved `cmc-templates.py` command via `--template-command` and the exact approved `run-sorry-cypress.py` command via `--runner-command`; the operator should not need to restate them separately.
+- When watcher-backed execution is used, prefer the controller-local `atvm-runner@...` systemd service over detached SSH background launch patterns for `run-sorry-cypress.py`.
 - Do not start the runner before the watcher, because the watcher helper clears stale `/tmp/.log` and can delete the fresh live runner log if the runner starts first.
 - For host-level test detail and failed-test investigation, use `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`, especially `logs/`, `xml/`, and `mochawesome/`.
 - Apply failed-host detail recovery consistently for every ATVM template run, not just `cmc-reboot`.
diff --git a/atvm/docs/automation/guide.md b/atvm/docs/automation/guide.md
index 2a3422a..44d11d5 100644
--- a/atvm/docs/automation/guide.md
+++ b/atvm/docs/automation/guide.md
@@ -105,8 +105,9 @@ Typical sequence:
 6. Verify the generated `.ts` files and the config `specPattern` include every requested VM before starting the runner.
 7. If the watcher is approved, make sure the controller's deployed watcher code is the intended version before relying on its posts.
 8. If the watcher is approved, build the watcher-start command so it automatically includes the exact approved `cmc-templates.py` command via `--template-command` and the exact approved `run-sorry-cypress.py` command via `--runner-command`.
-9. If the watcher is approved, start the watcher before launching `run-sorry-cypress.py`.
-10. Run `run-sorry-cypress.py` with the matching approved config and build name.
+9. If the watcher is approved, prefer the controller-local `atvm-runner@...` systemd service instead of detached SSH background launch patterns for `run-sorry-cypress.py`.
+10. If the watcher is approved, start the watcher before launching the runner service.
+11. Start the runner with the matching approved config and build name.

 Completed-run verification sequence:
 1. Read the launch log for the build.
@@ -205,15 +206,15 @@ Before any new automation request:
 8. If the run uses `--categorize` and the watcher is requested, include `--categorize` on the watcher start command too so the watcher tracks sequential categorized sub-runs correctly.
 9. Run only approved command(s), no extra options and no silent substitutions.
 10. When both template generation and the Cypress runner are requested, run them sequentially, not in parallel.
-11. Do not launch `run-sorry-cypress.py` until `cmc-templates.py` has exited successfully and finished updating the intended config/spec files.
-12. After `cmc-templates.py`, always verify that the generated spec files on disk and the config `specPattern` both contain the full requested VM set before launching `run-sorry-cypress.py`.
+11. Do not launch the ATVM runner until `cmc-templates.py` has exited successfully and finished updating the intended config/spec files.
+12. After `cmc-templates.py`, always verify that the generated spec files on disk and the config `specPattern` both contain the full requested VM set before launching the ATVM runner.
 13. If any requested VM is missing from the generated files or `specPattern`, stop and report the mismatch instead of launching the runner.
 14. Treat displayed commands as a review gate: do not execute either command until the operator has had a chance to review them and explicitly approve.
 15. If the operator asks to change plugin, config, filters, build name, Gold Disk, or scope after commands are shown, discard the old plan, show the revised commands, and wait for new approval.
 16. If the planned command is `cmc-reboot` with `--use_specified_plugin both`, add the FC+iSCSI timing warning to the review message and require explicit confirmation that `both` is intended before execution.
 17. If monitoring was not requested, report immediate success/failure for each command.
 18. If monitoring was requested, keep monitoring until completion and report final outcome.
-19. When the watcher is requested, launch the watcher before `run-sorry-cypress.py`.
+19. When the watcher is requested, launch the watcher before the runner service.
 20. Do not start the runner before the watcher, because the watcher helper clears stale `/tmp/.log` and can delete the fresh live runner log if the runner starts first.

 ## Requested Test Style
diff --git a/atvm/watcher-service/INSTALL.md b/atvm/watcher-service/INSTALL.md
index 660e9e1..f1e10a5 100644
--- a/atvm/watcher-service/INSTALL.md
+++ b/atvm/watcher-service/INSTALL.md
@@ -1,13 +1,14 @@
 # ATVM Watcher Service Install Plan

-This document describes how to deploy the ATVM per-run watcher service to the ATVM Cypress controller at `192.168.3.190`.
+This document describes how to deploy the ATVM per-run watcher and runner services to the ATVM Cypress controller at `192.168.3.190`.

 This is a deployment plan only. It does not perform the installation.

 ## Goal

-Install the local watcher package so the controller can:
+Install the local watcher/runner package so the controller can:

+- start one requested ATVM Cypress runner per service instance
 - watch one requested ATVM run per watcher instance
 - for non-categorized runs, send one final Mattermost status only for `COMPLETED` or `FAILED`
 - for categorized runs, send one final Mattermost status per completed categorized sub-run/group
@@ -20,6 +21,8 @@ Recommended controller paths:

 - package root:
   - `/opt/atvm-watcher-service`
+- runner service unit:
+  - `/etc/systemd/system/atvm-runner@.service`
 - service unit:
   - `/etc/systemd/system/atvm-run-watcher@.service`
 - global environment file:
@@ -40,6 +43,10 @@ Best-practice rule:
 From the local workspace:

 - `/home/aw/code/cds/atvm/watcher-service/atvm_run_watcher.py`
+- `/home/aw/code/cds/atvm/watcher-service/run-atvm-runner.sh`
+- `/home/aw/code/cds/atvm/watcher-service/atvm-runner@.service`
+- `/home/aw/code/cds/atvm/watcher-service/start-atvm-runner.sh`
+- `/home/aw/code/cds/atvm/watcher-service/cancel-atvm-runner.sh`
 - `/home/aw/code/cds/atvm/watcher-service/atvm-run-watcher@.service`
 - `/home/aw/code/cds/atvm/watcher-service/start-atvm-run-watcher.sh`
 - `/home/aw/code/cds/atvm/watcher-service/cancel-atvm-run-watcher.sh`
@@ -84,12 +91,18 @@ Recommended permissions:
    - `/var/lib/atvm-run-watcher`

 2. Copy package files to the controller.
+   - copy the runner wrapper
+   - copy the runner `systemd` unit file
+   - copy the runner helper scripts
    - copy the Python watcher
    - copy the `systemd` unit file
    - copy the helper scripts
    - copy `vm-inventory.md`

 3. Set executable permissions.
+   - `run-atvm-runner.sh`
+   - `start-atvm-runner.sh`
+   - `cancel-atvm-runner.sh`
    - `atvm_run_watcher.py`
    - `start-atvm-run-watcher.sh`
    - `cancel-atvm-run-watcher.sh`
@@ -99,6 +112,7 @@ Recommended permissions:
    - keep permissions restricted

 5. Install the `systemd` unit file.
+   - copy the runner unit to `/etc/systemd/system/atvm-runner@.service`
    - copy to `/etc/systemd/system/atvm-run-watcher@.service`

 6. Reload `systemd`.
@@ -132,6 +146,9 @@ mkdir -p /opt/atvm-watcher-service /var/lib/atvm-run-watcher
 ```

 ```bash
+chmod 755 /opt/atvm-watcher-service/run-atvm-runner.sh
+chmod 755 /opt/atvm-watcher-service/start-atvm-runner.sh
+chmod 755 /opt/atvm-watcher-service/cancel-atvm-runner.sh
 chmod 755 /opt/atvm-watcher-service/atvm_run_watcher.py
 chmod 755 /opt/atvm-watcher-service/start-atvm-run-watcher.sh
 chmod 755 /opt/atvm-watcher-service/cancel-atvm-run-watcher.sh
@@ -139,6 +156,7 @@ chmod 755 /opt/atvm-watcher-service/cancel-atvm-run-watcher.sh

 ```bash
 systemctl daemon-reload
+systemctl cat atvm-runner@.service
 systemctl cat atvm-run-watcher@.service
 ```

@@ -146,6 +164,10 @@ systemctl cat atvm-run-watcher@.service
 python3 /opt/atvm-watcher-service/atvm_run_watcher.py --help
 ```

+```bash
+/opt/atvm-watcher-service/start-atvm-runner.sh --help
+```
+
 ```bash
 /opt/atvm-watcher-service/start-atvm-run-watcher.sh --help
 ```
@@ -154,10 +176,10 @@ python3 /opt/atvm-watcher-service/atvm_run_watcher.py --help

 Once installed, the intended workflow is:

-1. Launch the ATVM run as usual.
-2. Start the watcher for that build name.
+1. Start the watcher for that build name.
    - the start helper must clear any stale watcher state for that same requested build name before starting the new watcher instance
-3. Let the watcher run on the controller.
+2. Start the runner service for that build name.
+3. Let the runner and watcher run on the controller.
 4. The watcher exits on terminal state.

 Example:
@@ -173,10 +195,19 @@ Example:
   --integration-plugin "pure with fc" \
   --categorize \
   --scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"
+
+/opt/atvm-watcher-service/start-atvm-runner.sh \
+  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
+  --runner-command "python3 ./run-sorry-cypress.py --config_file cypress.atvm-config-gold.ts --build_name e2e-redhat9.6-ubuntu24.04-w2k25-fc --categorize"
 ```

 Cancel example:

+```bash
+/opt/atvm-watcher-service/cancel-atvm-runner.sh \
+  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc
+```
+
 ```bash
 /opt/atvm-watcher-service/cancel-atvm-run-watcher.sh \
   --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc
@@ -192,7 +223,9 @@ The cancel helper should:
 ## Operational Notes

 - This is not a daemon.
+- One runner instance is started per ATVM run.
 - One watcher instance is started per ATVM run.
+- Prefer the `atvm-runner@...` service over detached SSH background launch patterns for `run-sorry-cypress.py`.
 - Categorized execution is treated as one watcher instance tracking sequential grouped ATVM sub-runs.
 - In categorized execution, the watcher must remain alive until the parent request has actually gone inactive past the grace window, even if one grouped sub-run already completed.
 - The watcher exits after the run reaches a terminal state.
diff --git a/atvm/watcher-service/README.md b/atvm/watcher-service/README.md
index f9100d9..ec62724 100644
--- a/atvm/watcher-service/README.md
+++ b/atvm/watcher-service/README.md
@@ -19,10 +19,18 @@ The watcher does not run indefinitely. It is designed for one run per service in

 ## Files

+- `atvm-runner@.service`
+  - `systemd` template unit for one runner instance per build name
 - `atvm_run_watcher.py`
   - main watcher implementation
 - `atvm-run-watcher@.service`
   - `systemd` template unit for one watcher instance per build name
+- `run-atvm-runner.sh`
+  - runner wrapper used by the `systemd` runner unit
+- `start-atvm-runner.sh`
+  - helper to write per-run runner environment data and start a runner instance
+- `cancel-atvm-runner.sh`
+  - helper to stop a runner instance
 - `start-atvm-run-watcher.sh`
   - helper to write per-run environment data and start a watcher instance
 - `cancel-atvm-run-watcher.sh`
@@ -33,6 +41,7 @@ The watcher does not run indefinitely. It is designed for one run per service in
 These are the default install targets assumed by the included unit file:

 - service package root: `/opt/atvm-watcher-service`
+- runner unit: `/etc/systemd/system/atvm-runner@.service`
 - watcher state root: `/var/lib/atvm-run-watcher`
 - controller ATVM automation root: `/root/cdc-e2e-cyp-12.17.4`
 - watcher environment file: `/etc/atvm-run-watcher.env`
@@ -46,9 +55,9 @@ Each watcher instance is tied to one requested build name.

 Typical workflow:

-1. Launch the ATVM run.
-2. Start the watcher for that run.
-3. The watcher polls the run log, process state, and `cmcReporter` artifacts.
+1. Start the watcher for that run.
+2. Start the runner service for that run.
+3. The watcher polls the runner log, process state, and `cmcReporter` artifacts.
    - before starting, the helper resets any prior watcher state for the same requested build name so stale cancellation or posted markers do not leak into a new run
 4. For non-categorized runs, when the run reaches a terminal state:
    - `COMPLETED` or `FAILED`
@@ -88,9 +97,18 @@ Optional metadata for better status formatting:
 - `ATVM_WATCHER_SCOPE_DESCRIPTION`
 - `ATVM_WATCHER_CATEGORIZED`

+Runner environment required per run:
+
+- `ATVM_RUNNER_COMMAND`
+
+Runner environment optional per run:
+
+- `ATVM_RUNNER_WORKDIR`
+- `ATVM_RUNNER_LOG`
+
 ## Start Example

-This helper writes a per-run environment file and starts the matching instance:
+These helpers write per-run environment files and start the matching instances:

 ```bash
 ./start-atvm-run-watcher.sh \
@@ -103,6 +121,10 @@ This helper writes a per-run environment file and starts the matching instance:
   --integration-plugin "pure with fc" \
   --categorize \
   --scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"
+
+./start-atvm-runner.sh \
+  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
+  --runner-command "python3 ./run-sorry-cypress.py --config_file cypress.atvm-config-gold.ts --build_name e2e-redhat9.6-ubuntu24.04-w2k25-fc --categorize"
 ```

 That results in:
@@ -111,6 +133,7 @@ That results in:
   - `/var/lib/atvm-run-watcher/e2e-redhat9.6-ubuntu24.04-w2k25-fc`
 - service instance:
   - `atvm-run-watcher@e2e-redhat9.6-ubuntu24.04-w2k25-fc.service`
+  - `atvm-runner@e2e-redhat9.6-ubuntu24.04-w2k25-fc.service`

 The helper also:

@@ -126,9 +149,16 @@ The helper also:

 This writes a cancellation marker, updates `state.json` to `CANCELLED`, and stops the watcher instance. The watcher will not send Mattermost results for that run.

+Runner cancel example:
+
+```bash
+./cancel-atvm-runner.sh --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc
+```
+
 ## Notes

 - The watcher uses the same ATVM status layout documented in `atvm/docs/automation/status-template.md`.
+- Prefer the controller-local `atvm-runner@...` service over ad hoc `nohup` or detached SSH launch patterns for `run-sorry-cypress.py`.
 - Kernel values are resolved from `atvm/inventory/vm-inventory.md`.
 - Categorized execution is treated as sequential grouped ATVM sub-runs, not as one parent run with internal phases.
 - In categorized mode, the watcher writes per-subrun state under `subruns/` and posts each completed grouped run separately.
diff --git a/atvm/watcher-service/atvm-runner@.service b/atvm/watcher-service/atvm-runner@.service
new file mode 100644
index 0000000..f433d72
--- /dev/null
+++ b/atvm/watcher-service/atvm-runner@.service
@@ -0,0 +1,14 @@
+[Unit]
+Description=ATVM Cypress runner for %i
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=simple
+WorkingDirectory=/opt/atvm-watcher-service
+EnvironmentFile=-/var/lib/atvm-run-watcher/%i/run.env
+ExecStart=/opt/atvm-watcher-service/run-atvm-runner.sh %i
+Restart=no
+
+[Install]
+WantedBy=multi-user.target
diff --git a/atvm/watcher-service/cancel-atvm-runner.sh b/atvm/watcher-service/cancel-atvm-runner.sh
new file mode 100644
index 0000000..459e932
--- /dev/null
+++ b/atvm/watcher-service/cancel-atvm-runner.sh
@@ -0,0 +1,27 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+usage() {
+  cat <<'EOF'
+Usage:
+  cancel-atvm-runner.sh --build-name
+EOF
+}
+
+BUILD_NAME=""
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --build-name) BUILD_NAME="${2:-}"; shift 2 ;;
+    -h|--help) usage; exit 0 ;;
+    *) echo "Unknown argument: $1" >&2; usage >&2; exit 1 ;;
+  esac
+done
+
+if [[ -z "$BUILD_NAME" ]]; then
+  echo "--build-name is required" >&2
+  usage >&2
+  exit 1
+fi
+
+systemctl stop "atvm-runner@${BUILD_NAME}.service" || true
diff --git a/atvm/watcher-service/run-atvm-runner.sh b/atvm/watcher-service/run-atvm-runner.sh
new file mode 100644
index 0000000..a658eb0
--- /dev/null
+++ b/atvm/watcher-service/run-atvm-runner.sh
@@ -0,0 +1,37 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+usage() {
+  cat <<'EOF'
+Usage:
+  run-atvm-runner.sh
+
+This script is intended to be launched by systemd for one ATVM run.
+It expects environment variables from the runner unit/environment files:
+  ATVM_RUNNER_COMMAND
+  ATVM_RUNNER_WORKDIR
+  ATVM_RUNNER_LOG
+EOF
+}
+
+BUILD_NAME="${1:-}"
+if [[ -z "$BUILD_NAME" ]]; then
+  echo "build name is required" >&2
+  usage >&2
+  exit 1
+fi
+
+RUNNER_COMMAND="${ATVM_RUNNER_COMMAND:-}"
+RUNNER_WORKDIR="${ATVM_RUNNER_WORKDIR:-/root/cdc-e2e-cyp-12.17.4}"
+RUNNER_LOG="${ATVM_RUNNER_LOG:-/tmp/${BUILD_NAME}.log}"
+
+if [[ -z "$RUNNER_COMMAND" ]]; then
+  echo "ATVM_RUNNER_COMMAND is required" >&2
+  exit 1
+fi
+
+mkdir -p "$(dirname "$RUNNER_LOG")"
+: > "$RUNNER_LOG"
+
+cd "$RUNNER_WORKDIR"
+exec bash -lc "$RUNNER_COMMAND" >>"$RUNNER_LOG" 2>&1
diff --git a/atvm/watcher-service/start-atvm-runner.sh b/atvm/watcher-service/start-atvm-runner.sh
new file mode 100644
index 0000000..63012bd
--- /dev/null
+++ b/atvm/watcher-service/start-atvm-runner.sh
@@ -0,0 +1,63 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+usage() {
+  cat <<'EOF'
+Usage:
+  start-atvm-runner.sh --build-name --runner-command [options]
+
+Options:
+  --build-name
+  --runner-command
+  --workdir Default: /root/cdc-e2e-cyp-12.17.4
+  --log-path Default: /tmp/.log
+  --state-root Default: /var/lib/atvm-run-watcher
+EOF
+}
+
+BUILD_NAME=""
+RUNNER_COMMAND=""
+RUNNER_WORKDIR="/root/cdc-e2e-cyp-12.17.4"
+RUNNER_LOG=""
+STATE_ROOT="/var/lib/atvm-run-watcher"
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --build-name) BUILD_NAME="${2:-}"; shift 2 ;;
+    --runner-command) RUNNER_COMMAND="${2:-}"; shift 2 ;;
+    --workdir) RUNNER_WORKDIR="${2:-}"; shift 2 ;;
+    --log-path) RUNNER_LOG="${2:-}"; shift 2 ;;
+    --state-root) STATE_ROOT="${2:-}"; shift 2 ;;
+    -h|--help) usage; exit 0 ;;
+    *) echo "Unknown argument: $1" >&2; usage >&2; exit 1 ;;
+  esac
+done
+
+if [[ -z "$BUILD_NAME" ]]; then
+  echo "--build-name is required" >&2
+  usage >&2
+  exit 1
+fi
+
+if [[ -z "$RUNNER_COMMAND" ]]; then
+  echo "--runner-command is required" >&2
+  usage >&2
+  exit 1
+fi
+
+if [[ -z "$RUNNER_LOG" ]]; then
+  RUNNER_LOG="/tmp/${BUILD_NAME}.log"
+fi
+
+RUN_DIR="${STATE_ROOT}/${BUILD_NAME}"
+mkdir -p "$RUN_DIR"
+
+cat >"${RUN_DIR}/run.env" </dev/null 2>&1 || true
+systemctl start "atvm-runner@${BUILD_NAME}.service"
+systemctl status --no-pager "atvm-runner@${BUILD_NAME}.service" || true