From 9673d769e2a2d42a1c11d33b5bad4eab2a07d114 Mon Sep 17 00:00:00 2001 From: "anthony.wen" Date: Wed, 29 Apr 2026 12:14:55 -0400 Subject: [PATCH] fix atvm watcher-backed run launch sequence Execute the template step before starting watcher-backed ATVM runs. - run --template-command synchronously in start-atvm-run.sh - write template output to /tmp/.launch.log - stop before watcher/runner startup if template generation fails - document the corrected wrapper behavior in watcher-service docs - record the stale specPattern failure mode in automation run learnings --- atvm/docs/automation/run-learnings.md | 9 +++++++++ atvm/watcher-service/INSTALL.md | 10 ++++++---- atvm/watcher-service/README.md | 14 +++++++++----- atvm/watcher-service/start-atvm-run.sh | 22 ++++++++++++++++++++++ 4 files changed, 46 insertions(+), 9 deletions(-) diff --git a/atvm/docs/automation/run-learnings.md b/atvm/docs/automation/run-learnings.md index 1ad8aa2..388a6bd 100644 --- a/atvm/docs/automation/run-learnings.md +++ b/atvm/docs/automation/run-learnings.md @@ -9,6 +9,15 @@ This file stores run-specific examples only when a run produced a new learning r ## Current State - No run-learning entries recorded yet from `guide.md` source material. +## Run Learning: 2026-04-29 (Combined watcher wrapper must execute template generation before runner startup) +- Observed failure mode: + - A watcher-backed `start-atvm-run.sh` launch for `cmc-migrateops-compute-migration` started `run-sorry-cypress.py` without ever running the approved `cmc-templates.py` command. + - The wrapper passed `--template-command` into watcher metadata only, so the runner consumed stale controller config state and started against a previous `specPattern` pointing at `atvm121-ubuntu24.04`. +- Action for future runs: + - The combined watcher wrapper must execute `--template-command` synchronously before watcher and runner startup. + - Write the template phase output to `/tmp/.launch.log` so template activity is preserved separately from the live runner log. + - If the template step fails, stop immediately and do not start the watcher or the runner. + ## Run Learning: 2026-04-24 (Categorized watcher false-PASS guardrail) - Observed failure mode: - A categorized compute-migration run was incorrectly reported as `PASS` for `atvm121-ubuntu24.04` even though the actual Ubuntu grouped sub-run failed. diff --git a/atvm/watcher-service/INSTALL.md b/atvm/watcher-service/INSTALL.md index 7b1a413..1eedc3e 100644 --- a/atvm/watcher-service/INSTALL.md +++ b/atvm/watcher-service/INSTALL.md @@ -183,11 +183,13 @@ python3 /opt/atvm-watcher-service/atvm_run_watcher.py --help Once installed, the intended workflow is: -1. Start the watcher for that build name. +1. Run the approved `cmc-templates.py` command for that build name. + - when using `start-atvm-run.sh`, the wrapper should execute `--template-command` synchronously and stop immediately if that step fails +2. Start the watcher for that build name. - the start helper must clear any stale watcher state for that same requested build name before starting the new watcher instance -2. Start the runner service for that build name. -3. Let the runner and watcher run on the controller. -4. The watcher exits on terminal state. +3. Start the runner service for that build name. +4. Let the runner and watcher run on the controller. +5. The watcher exits on terminal state. Example: diff --git a/atvm/watcher-service/README.md b/atvm/watcher-service/README.md index cd9374d..910448e 100644 --- a/atvm/watcher-service/README.md +++ b/atvm/watcher-service/README.md @@ -57,11 +57,12 @@ Each watcher instance is tied to one requested build name. Typical workflow: -1. Start the watcher for that run. -2. Start the runner service for that run. -3. The watcher polls the runner log, process state, and `cmcReporter` artifacts. +1. Run the approved `cmc-templates.py` command for that run when one is provided. +2. Start the watcher for that run. +3. Start the runner service for that run. +4. The watcher polls the runner log, process state, and `cmcReporter` artifacts. - before starting, the helper resets any prior watcher state for the same requested build name so stale cancellation or posted markers do not leak into a new run -4. For non-categorized runs, when the run reaches a terminal state: +5. For non-categorized runs, when the run reaches a terminal state: - `COMPLETED` or `FAILED` - build the final ATVM status - send the status to Mattermost @@ -72,7 +73,7 @@ Typical workflow: - do not post - mark the final state - exit -5. For categorized runs: +6. For categorized runs: - detect each grouped sub-run in sequence from the parent run log - wait for that grouped sub-run to finish - send one Mattermost post for that grouped sub-run if it reached `COMPLETED` or `FAILED` @@ -154,6 +155,9 @@ That results in: The helper also: +- runs `--template-command` synchronously first when one is provided +- writes the template phase output to `/tmp/.launch.log` +- exits before watcher/runner startup if the template step fails - stops any stale watcher instance for that same requested build name - removes the old watcher state directory for that requested build name - starts the new watcher with a clean state root for the new run diff --git a/atvm/watcher-service/start-atvm-run.sh b/atvm/watcher-service/start-atvm-run.sh index 1796df6..77ac674 100644 --- a/atvm/watcher-service/start-atvm-run.sh +++ b/atvm/watcher-service/start-atvm-run.sh @@ -37,6 +37,7 @@ SCOPE_DESCRIPTION="" WATCHER_CATEGORIZED="false" RUNNER_WORKDIR="/root/cdc-e2e-cyp-12.17.4" RUNNER_LOG="" +LAUNCH_LOG="" STATE_ROOT="/var/lib/atvm-run-watcher" while [[ $# -gt 0 ]]; do @@ -76,6 +77,8 @@ if [[ -z "$RUNNER_LOG" ]]; then RUNNER_LOG="/tmp/${BUILD_NAME}.log" fi +LAUNCH_LOG="/tmp/${BUILD_NAME}.launch.log" + SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" watcher_cmd=( @@ -112,6 +115,25 @@ runner_cmd=( --state-root "${STATE_ROOT}" ) +mkdir -p "$(dirname "${LAUNCH_LOG}")" +: > "${LAUNCH_LOG}" + +if [[ -n "${TEMPLATE_COMMAND}" ]]; then + { + echo "Running template command:" + echo "${TEMPLATE_COMMAND}" + echo + } >>"${LAUNCH_LOG}" + + if ! ( + cd "${RUNNER_WORKDIR}" + bash -lc "${TEMPLATE_COMMAND}" + ) >>"${LAUNCH_LOG}" 2>&1; then + echo "Template command failed for ${BUILD_NAME}. See ${LAUNCH_LOG}" >&2 + exit 1 + fi +fi + "${watcher_cmd[@]}" for _ in {1..15}; do