diff --git a/atvm/AGENTS.md b/atvm/AGENTS.md
index 8fe3ee2..48dd1a6 100644
--- a/atvm/AGENTS.md
+++ b/atvm/AGENTS.md
@@ -74,6 +74,10 @@ This file defines how to operate and maintain the ATVM workspace in `/home/aw/co
 - When the watcher is requested, start the watcher before `run-sorry-cypress.py`.
 - Do not start the runner before the watcher, because the watcher helper clears stale `/tmp/.log` and can delete the fresh live runner log if the runner starts first.
 - For host-level test detail and failed-test investigation, use `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`, especially `logs/`, `xml/`, and `mochawesome/`.
+- Treat `/var/lib/atvm-run-watcher//state.json` as cached watcher output, not the source of truth for a completed-run confirmation.
+- Before confirming a completed ATVM run status, verify in this order: live launch log, matching reporter artifacts, `Cloud Run Finished` summary / Currents URL, then compare against saved watcher state.
+- If saved watcher state disagrees with the launch log or a replay of the exact artifacts through the current watcher code, treat the saved state as stale and do not report from it.
+- Never confirm a completed ATVM run from `state.json` alone.
 - If the operator asks for ATVM run status without mentioning Mattermost, respond locally only and do not post externally.
 - If the operator asks to send ATVM run status to Mattermost, use `MATTERMOST_ATVM_WEBHOOK` and `MATTERMOST_ATVM_CHANNEL` from `/home/aw/code/cds/.env.credentials.local` by default and send the final status only after the run has fully completed, whether the run passed or failed.
 - Do not call out expected, harmless `systemctl reset-failed ... unit not loaded` output in routine run updates; mention it only if it blocks startup or matters for debugging.
diff --git a/atvm/docs/automation/guide.md b/atvm/docs/automation/guide.md
index 944ff24..43df118 100644
--- a/atvm/docs/automation/guide.md
+++ b/atvm/docs/automation/guide.md
@@ -76,6 +76,14 @@ Run ATVM CMC automation tests on the designated automation VM without unintended
 - Do not treat the existence of a per-host reporter artifact by itself as proof that the host passed.
 - For categorized grouped recovery, prefer the matching per-host reporter JSON or mochawesome result and carry through the real `failures`, `pending`, and failure message instead of assuming `PASS completed`.
 - If grouped XML only contains `check-xml-files.ts`, cross-check the grouped result against the per-host reporter artifacts before posting or repeating status for that grouped sub-run.
+- Treat saved watcher state under `/var/lib/atvm-run-watcher//state.json` as cached status only.
+- For completed-run verification, confirm in this order:
+  - launch log under `/tmp/.launch.log`
+  - matching `cmcReporter` artifacts
+  - `Cloud Run Finished` summary and Currents URL
+  - saved watcher state only as a comparison layer
+- If saved watcher state disagrees with the launch log or with a replay of the exact artifacts through the current watcher code, treat the saved state as stale and do not use it as the reported result.
+- Never confirm a completed run from `state.json` alone.
 
 Typical sequence:
 1. Build the exact `cmc-templates.py` and `run-sorry-cypress.py` commands for the request.
@@ -88,6 +96,13 @@ Typical sequence:
 8. If the watcher is approved, start the watcher before launching `run-sorry-cypress.py`.
 9. Run `run-sorry-cypress.py` with the matching approved config and build name.
 
+Completed-run verification sequence:
+1. Read the launch log for the build.
+2. Inspect the matching reporter artifacts for the relevant host(s).
+3. Use the `Cloud Run Finished` summary and Currents URL as the final parent-run check when present.
+4. Compare that result against saved watcher state.
+5. If there is any disagreement, replay the exact artifacts through the current watcher code in an isolated temp state directory before confirming the result.
+
 ## Config File / Gold Disk Mapping
 - `cypress.atvm-config-gold.ts` -> Gold Disk 1
 - `cypress.atvm-config-gold-2.ts` -> Gold Disk 2
diff --git a/atvm/docs/automation/run-learnings.md b/atvm/docs/automation/run-learnings.md
index b815e24..9b16328 100644
--- a/atvm/docs/automation/run-learnings.md
+++ b/atvm/docs/automation/run-learnings.md
@@ -401,3 +401,28 @@ This file stores run-specific examples only when a run produced a new learning r
 - Do not classify a reporter TXT artifact as failed just because it contains the word `error`.
 - For TXT fallback, require explicit terminal failure markers such as `cy:command error`, `cy:task error`, or real `Error:`/`AssertionError:`/timeout text.
 - Prefer the parent run summary when available, because it is less prone to false failure signals than raw per-step console text.
+
+## Run Learning: 2026-03-30 (Replay exact artifacts before assuming a thin closed-run detail is a current watcher bug)
+- Observed failure mode:
+  - The saved controller state for `reboot-redhat8.10-both` still showed only `1 failures` under the host detail, even though the launch log contained the full md5sum failure text.
+  - Replaying the exact launch log and reporter artifacts through the currently installed watcher produced the correct host detail with `57 tests, 1 failures` and the failing testcase/error text.
+- Action for future runs:
+  - Before patching the watcher again for a thin closed-run detail, replay the exact run artifacts through the currently installed watcher code.
+  - Treat a mismatch between saved state and current replay as evidence of a stale in-memory watcher instance or stale deployment, not automatically as a parser regression.
+  - Use an isolated temp state directory or other no-post path for that replay so historical validation does not repost results.
+
+## Run Learning: 2026-03-30 (Red Hat 8.10 Pure both failure on step 38 was a missing FC reboot-validation artifact with concurrent storage instability)
+- Observed failure mode:
+  - The failing testcase was `38. Verify diskname2Reboot file is the same as diskname2Reboot’s source (Reboot test)`.
+  - The concrete error was `md5sum: /root/tmp/fcDisk/diskname2Reboot.md5: No such file or directory`.
+  - On the target after the run, `/root/tmp/fcDisk` contained `diskname2Disk` and `diskname2Disk.md5`, but not `diskname2Reboot.md5`.
+- Additional host findings:
+  - The target showed repeated iSCSI authorization failures and later `Could not log into all portals`.
+  - `mtdi-driver.service` started at `17:30:26 EDT`.
+  - `iscsid.service` / `Open-iSCSI` started at `17:30:30 EDT`.
+  - `iscsi.service`, `mtdi-daemon.service`, and `galaxy-migrate.service` reached active state at `17:32:45 EDT`.
+  - Repeated multipath reinitialization and `failed to get ... uid` messages continued through the run window.
+- Action for future runs:
+  - If this failure recurs, treat it as a host/storage investigation first, not just a watcher-formatting issue.
+  - Check whether the FC reboot-validation step actually created `diskname2Reboot.md5` on `/root/tmp/fcDisk` before the md5 verification step ran.
+  - Check whether repeated iSCSI auth failures or multipath churn during the same boot window are interfering with the expected disk/file state.
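
Reviewer note: the completed-run verification order added to `AGENTS.md` and `guide.md` could be sketched as a small decision function. This is only an illustration, not the watcher's actual code. `RunEvidence`, `confirm_completed_run`, and the plain status strings are hypothetical names, and each field is assumed to hold a status already extracted from its source. Treating a disagreement among the primary sources themselves as also requiring a replay is an assumption; the docs only spell out the saved-state case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunEvidence:
    """Hypothetical container: one status string per source, None if absent."""
    launch_log: Optional[str] = None
    reporter_artifacts: Optional[str] = None
    cloud_run_summary: Optional[str] = None
    saved_state: Optional[str] = None  # cached state.json, comparison layer only
    replay: Optional[str] = None       # replay through the current watcher code


def confirm_completed_run(ev: RunEvidence) -> dict:
    # Primary sources, in the documented order of trust.
    primary = [
        s
        for s in (ev.launch_log, ev.reporter_artifacts, ev.cloud_run_summary)
        if s is not None
    ]
    if not primary:
        # Never confirm from state.json alone.
        return {"confirmed": False, "status": None,
                "reason": "no primary evidence; state.json alone is not enough"}

    observed = set(primary)
    if ev.saved_state is not None:
        observed.add(ev.saved_state)

    if len(observed) == 1:
        return {"confirmed": True, "status": primary[0],
                "reason": "all available sources agree"}

    if ev.replay is None:
        # Any disagreement requires a replay before confirming.
        return {"confirmed": False, "status": None,
                "reason": "sources disagree; replay the artifacts first"}

    stale = ev.saved_state is not None and ev.saved_state != ev.replay
    return {"confirmed": True, "status": ev.replay,
            "reason": "saved state is stale" if stale
            else "replay resolved the disagreement"}
```

With this sketch, `confirm_completed_run(RunEvidence(saved_state="PASS"))` refuses to confirm, and a `FAIL` launch log against a `PASS` saved state stays unconfirmed until a replay status is supplied.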