Tighten ATVM completed-run status verification

2026-03-30 20:12:56 -04:00
parent b45375dbbc
commit dec13a4667
3 changed files with 44 additions and 0 deletions
--- a/atvm/AGENTS.md
+++ b/atvm/AGENTS.md
@@ -74,6 +74,10 @@ This file defines how to operate and maintain the ATVM workspace in `/home/aw/co
 - When the watcher is requested, start the watcher before `run-sorry-cypress.py`.
 - Do not start the runner before the watcher, because the watcher helper clears stale `/tmp/<build-name>.log` and can delete the fresh live runner log if the runner starts first.
 - For host-level test detail and failed-test investigation, use `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`, especially `logs/`, `xml/`, and `mochawesome/`.
 - Treat `/var/lib/atvm-run-watcher/<build>/state.json` as cached watcher output, not the source of truth for a completed-run confirmation.
 - Before confirming a completed ATVM run status, verify in this order: live launch log, matching reporter artifacts, `Cloud Run Finished` summary / Currents URL, then compare against saved watcher state.
 - If saved watcher state disagrees with the launch log or a replay of the exact artifacts through the current watcher code, treat the saved state as stale and do not report from it.
 - Never confirm a completed ATVM run from `state.json` alone.
 - If the operator asks for ATVM run status without mentioning Mattermost, respond locally only and do not post externally.
 - If the operator asks to send ATVM run status to Mattermost, use `MATTERMOST_ATVM_WEBHOOK` and `MATTERMOST_ATVM_CHANNEL` from `/home/aw/code/cds/.env.credentials.local` by default and send the final status only after the run has fully completed, whether the run passed or failed.
 - Do not call out expected, harmless `systemctl reset-failed ... unit not loaded` output in routine run updates; mention it only if it blocks startup or matters for debugging.
--- a/atvm/docs/automation/guide.md
+++ b/atvm/docs/automation/guide.md
@@ -76,6 +76,14 @@ Run ATVM CMC automation tests on the designated automation VM without unintended
 - Do not treat the existence of a per-host reporter artifact by itself as proof that the host passed.
 - For categorized grouped recovery, prefer the matching per-host reporter JSON or mochawesome result and carry through the real `failures`, `pending`, and failure message instead of assuming `PASS completed`.
 - If grouped XML only contains `check-xml-files.ts`, cross-check the grouped result against the per-host reporter artifacts before posting or repeating status for that grouped sub-run.
 - Treat saved watcher state under `/var/lib/atvm-run-watcher/<build>/state.json` as cached status only.
 - For completed-run verification, confirm in this order:
  - launch log under `/tmp/<build>.launch.log`
  - matching `cmcReporter` artifacts
  - `Cloud Run Finished` summary and Currents URL
  - saved watcher state only as a comparison layer
 - If saved watcher state disagrees with the launch log or with a replay of the exact artifacts through the current watcher code, treat the saved state as stale and do not use it as the reported result.
 - Never confirm a completed run from `state.json` alone.
 Typical sequence:
 1. Build the exact `cmc-templates.py` and `run-sorry-cypress.py` commands for the request.
@@ -88,6 +96,13 @@ Typical sequence:
 8. If the watcher is approved, start the watcher before launching `run-sorry-cypress.py`.
 9. Run `run-sorry-cypress.py` with the matching approved config and build name.
 Completed-run verification sequence:
 1. Read the launch log for the build.
 2. Inspect the matching reporter artifacts for the relevant host(s).
 3. Use the `Cloud Run Finished` summary and Currents URL as the final parent-run check when present.
 4. Compare that result against saved watcher state.
 5. If there is any disagreement, replay the exact artifacts through the current watcher code in an isolated temp state directory before confirming the result.
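The verification sequence above can be sketched as a small decision function. This is only an illustration under stated assumptions: `RunResult`, `confirm_completed_run`, and the idea that each evidence layer can be reduced to a comparable test/failure count are hypothetical and not part of the ATVM tooling. Refusing to confirm when the primary sources disagree with each other is also an assumption, since the guide only specifies what to do when saved watcher state disagrees.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RunResult:
    """Hypothetical normalized result extracted from one evidence layer."""
    tests: int
    failures: int


def confirm_completed_run(
    launch_log: Optional[RunResult],
    reporter: Optional[RunResult],
    cloud_summary: Optional[RunResult],
    saved_state: Optional[RunResult],
    replay: Optional[RunResult] = None,
) -> Tuple[Optional[RunResult], str]:
    """Return (result, note); result is None when the run cannot be confirmed."""
    # Steps 1-2: the launch log and reporter artifacts are mandatory,
    # so saved watcher state alone can never confirm a run.
    if launch_log is None or reporter is None:
        return None, "cannot confirm: launch log or reporter artifacts missing"

    # Step 3: the Cloud Run Finished summary is used only when present.
    primary = [r for r in (launch_log, reporter, cloud_summary) if r is not None]
    if any(r != primary[0] for r in primary):
        # Assumption: disagreement among primary sources blocks confirmation.
        return None, "primary sources disagree; investigate before reporting"
    result = primary[0]

    # Step 4: saved watcher state is only a comparison layer.
    if saved_state is None or saved_state == result:
        return result, "confirmed"

    # Step 5: on disagreement, require a replay of the exact artifacts
    # (run elsewhere, in an isolated temp state directory) before confirming.
    if replay is None:
        return None, "saved state disagrees; replay artifacts before confirming"
    if replay == result:
        return result, "confirmed; saved watcher state is stale"
    return None, "replay disagrees with primary sources; investigate"
```

The function never returns the saved state as the reported result; it either confirms what the launch log and reporter artifacts show or declines to confirm.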
 ## Config File / Gold Disk Mapping
 - `cypress.atvm-config-gold.ts` -> Gold Disk 1
 - `cypress.atvm-config-gold-2.ts` -> Gold Disk 2
--- a/atvm/docs/automation/run-learnings.md
+++ b/atvm/docs/automation/run-learnings.md
@@ -401,3 +401,28 @@ This file stores run-specific examples only when a run produced a new learning r
  - Do not classify a reporter TXT artifact as failed just because it contains the word `error`.
  - For TXT fallback, require explicit terminal failure markers such as `cy:command error`, `cy:task error`, or real `Error:`/`AssertionError:`/timeout text.
  - Prefer the parent run summary when available, because it is less prone to false failure signals than raw per-step console text.
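A minimal sketch of the TXT fallback rule above, assuming the markers appear literally in the artifact text. The function name `txt_artifact_failed` is hypothetical, and the timeout pattern (`Timed out` / `timed out`) is an assumed wording, since the learning only says "timeout text"; adjust it to whatever the real artifacts contain.

```python
import re

# Explicit terminal failure markers. Matching is case-sensitive on purpose,
# so a bare lowercase "error" in ordinary log text does not count.
FAILURE_MARKERS = re.compile(
    r"cy:command error"      # failed Cypress command line
    r"|cy:task error"        # failed Cypress task line
    r"|\b\w*Error:"          # Error:, AssertionError:, and similar
    r"|[Tt]imed out"         # assumed timeout wording
)


def txt_artifact_failed(text: str) -> bool:
    """Return True only if the TXT artifact contains an explicit failure marker."""
    return any(FAILURE_MARKERS.search(line) for line in text.splitlines())
```

This is a fallback only: when the parent run summary is available, prefer it over this per-line check.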
 ## Run Learning: 2026-03-30 (Replay exact artifacts before assuming a thin closed-run detail is a current watcher bug)
 - Observed failure mode:
  - The saved controller state for `reboot-redhat8.10-both` still showed only `1 failures` under the host detail, even though the launch log contained the full md5sum failure text.
  - Replaying the exact launch log and reporter artifacts through the currently installed watcher produced the correct host detail with `57 tests, 1 failures` and the failing testcase/error text.
 - Action for future runs:
  - Before patching the watcher again for a thin closed-run detail, replay the exact run artifacts through the currently installed watcher code.
  - Treat a mismatch between saved state and current replay as evidence of a stale in-memory watcher instance or stale deployment, not automatically as a parser regression.
  - Use an isolated temp state directory or other no-post path for that replay so historical validation does not repost results.
 ## Run Learning: 2026-03-30 (Red Hat 8.10 Pure both failure on step 38 was a missing FC reboot-validation artifact with concurrent storage instability)
 - Observed failure mode:
  - The failing testcase was `38. Verify diskname2Reboot file is the same as diskname2Reboot’s source (Reboot test)`.
  - The concrete error was `md5sum: /root/tmp/fcDisk/diskname2Reboot.md5: No such file or directory`.
  - On the target after the run, `/root/tmp/fcDisk` contained `diskname2Disk` and `diskname2Disk.md5`, but not `diskname2Reboot.md5`.
 - Additional host findings:
  - The target showed repeated iSCSI authorization failures and later `Could not log into all portals`.
  - `mtdi-driver.service` started at `17:30:26 EDT`.
  - `iscsid.service` / `Open-iSCSI` started at `17:30:30 EDT`.
  - `iscsi.service`, `mtdi-daemon.service`, and `galaxy-migrate.service` reached active state at `17:32:45 EDT`.
  - Repeated multipath reinitialization and `failed to get ... uid` messages continued through the run window.
 - Action for future runs:
  - If this failure recurs, treat it as a host/storage investigation first, not just a watcher-formatting issue.
  - Check whether the FC reboot-validation step actually created `diskname2Reboot.md5` on `/root/tmp/fcDisk` before the md5 verification step ran.
  - Check whether repeated iSCSI auth failures or multipath churn during the same boot window are interfering with the expected disk/file state.
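The artifact check in the action list above could be scripted along these lines when run on the target host. This is a sketch only: `missing_artifacts` is a hypothetical helper, and the list of expected file names is whatever the operator chooses to pass in.

```python
from pathlib import Path
from typing import Iterable, List


def missing_artifacts(directory: str, expected: Iterable[str]) -> List[str]:
    """Return the expected file names that do not exist as files in directory."""
    base = Path(directory)
    return [name for name in expected if not (base / name).is_file()]
```

For the failure described here, a call such as `missing_artifacts("/root/tmp/fcDisk", ["diskname2Disk.md5", "diskname2Reboot.md5"])` would show whether `diskname2Reboot.md5` was ever created before the md5 verification step ran.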