Tighten ATVM completed-run status verification

2026-03-30 20:12:56 -04:00
parent b45375dbbc
commit dec13a4667
3 changed files with 44 additions and 0 deletions
--- a/atvm/docs/automation/guide.md
+++ b/atvm/docs/automation/guide.md
@@ -76,6 +76,14 @@ Run ATVM CMC automation tests on the designated automation VM without unintended
 - Do not treat the existence of a per-host reporter artifact by itself as proof that the host passed.
 - For categorized grouped recovery, prefer the matching per-host reporter JSON or mochawesome result and carry through the real `failures`, `pending`, and failure message instead of assuming `PASS completed`.
 - If grouped XML only contains `check-xml-files.ts`, cross-check the grouped result against the per-host reporter artifacts before posting or repeating status for that grouped sub-run.
+- Treat saved watcher state under `/var/lib/atvm-run-watcher/<build>/state.json` as cached status only.
+- For completed-run verification, confirm in this order:
+  - launch log under `/tmp/<build>.launch.log`
+  - matching `cmcReporter` artifacts
+  - `Cloud Run Finished` summary and Currents URL
+  - saved watcher state only as a comparison layer
+- If saved watcher state disagrees with the launch log or with a replay of the exact artifacts through the current watcher code, treat the saved state as stale and do not use it as the reported result.
+- Never confirm a completed run from `state.json` alone.

 Typical sequence:
 1. Build the exact `cmc-templates.py` and `run-sorry-cypress.py` commands for the request.
@@ -88,6 +96,13 @@ Typical sequence:
 8. If the watcher is approved, start the watcher before launching `run-sorry-cypress.py`.
 9. Run `run-sorry-cypress.py` with the matching approved config and build name.

+Completed-run verification sequence:
+1. Read the launch log for the build.
+2. Inspect the matching reporter artifacts for the relevant host(s).
+3. Use the `Cloud Run Finished` summary and Currents URL as the final parent-run check when present.
+4. Compare that result against saved watcher state.
+5. If there is any disagreement, replay the exact artifacts through the current watcher code in an isolated temp state directory before confirming the result.
+
 ## Config File / Gold Disk Mapping
 - `cypress.atvm-config-gold.ts` -> Gold Disk 1
 - `cypress.atvm-config-gold-2.ts` -> Gold Disk 2
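
The completed-run verification sequence added to `guide.md` in the hunks above can be read as a small decision procedure. The following is a minimal Python sketch under stated assumptions: `HostResult`, `verify_completed_run`, and the returned dictionary keys are hypothetical names used for illustration and are not part of the ATVM watcher. Parsing each source into a result is left out, and the parent-run summary is assumed to be comparable to the per-host result.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HostResult:
    """Hypothetical summary of one outcome (not a real watcher type)."""
    tests: int
    failures: int


def verify_completed_run(
    launch_log: Optional[HostResult],
    reporter: Optional[HostResult],
    cloud_summary: Optional[HostResult],
    saved_state: Optional[HostResult],
) -> dict:
    """Check sources in the documented order; saved state is comparison only."""
    # Primary evidence, in order: launch log, reporter artifacts,
    # then the parent-run summary when present.
    primary = [r for r in (launch_log, reporter, cloud_summary) if r is not None]
    if not primary:
        # state.json alone never confirms a completed run.
        return {"confirmed": False, "result": None, "replay_needed": False}
    if len(set(primary)) > 1 or (saved_state is not None and saved_state != primary[0]):
        # Any disagreement: replay the exact artifacts in an isolated temp
        # state directory before confirming; never report the saved state.
        return {"confirmed": False, "result": None, "replay_needed": True}
    return {"confirmed": True, "result": primary[0], "replay_needed": False}
```

In this sketch a run is confirmed only when the primary sources agree with each other and with any saved state; a lone `state.json` result is never enough.
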
--- a/atvm/docs/automation/run-learnings.md
+++ b/atvm/docs/automation/run-learnings.md
@@ -401,3 +401,28 @@ This file stores run-specific examples only when a run produced a new learning r
  - Do not classify a reporter TXT artifact as failed just because it contains the word `error`.
  - For TXT fallback, require explicit terminal failure markers such as `cy:command error`, `cy:task error`, or real `Error:`/`AssertionError:`/timeout text.
  - Prefer the parent run summary when available, because it is less prone to false failure signals than raw per-step console text.
+
+## Run Learning: 2026-03-30 (Replay exact artifacts before assuming a thin closed-run detail is a current watcher bug)
+- Observed failure mode:
+  - The saved controller state for `reboot-redhat8.10-both` still showed only `1 failures` under the host detail, even though the launch log contained the full md5sum failure text.
+  - Replaying the exact launch log and reporter artifacts through the currently installed watcher produced the correct host detail with `57 tests, 1 failures` and the failing testcase/error text.
+- Action for future runs:
+  - Before patching the watcher again for a thin closed-run detail, replay the exact run artifacts through the currently installed watcher code.
+  - Treat a mismatch between saved state and current replay as evidence of a stale in-memory watcher instance or stale deployment, not automatically as a parser regression.
+  - Use an isolated temp state directory or other no-post path for that replay so historical validation does not repost results.
+
+## Run Learning: 2026-03-30 (Red Hat 8.10 Pure both failure on step 38 was a missing FC reboot-validation artifact with concurrent storage instability)
+- Observed failure mode:
+  - The failing testcase was `38. Verify diskname2Reboot file is the same as diskname2Reboot’s source (Reboot test)`.
+  - The concrete error was `md5sum: /root/tmp/fcDisk/diskname2Reboot.md5: No such file or directory`.
+  - On the target after the run, `/root/tmp/fcDisk` contained `diskname2Disk` and `diskname2Disk.md5`, but not `diskname2Reboot.md5`.
+- Additional host findings:
+  - The target showed repeated iSCSI authorization failures and later `Could not log into all portals`.
+  - `mtdi-driver.service` started at `17:30:26 EDT`.
+  - `iscsid.service` / `Open-iSCSI` started at `17:30:30 EDT`.
+  - `iscsi.service`, `mtdi-daemon.service`, and `galaxy-migrate.service` reached active state at `17:32:45 EDT`.
+  - Repeated multipath reinitialization and `failed to get ... uid` messages continued through the run window.
+- Action for future runs:
+  - If this failure recurs, treat it as a host/storage investigation first, not just a watcher-formatting issue.
+  - Check whether the FC reboot-validation step actually created `diskname2Reboot.md5` on `/root/tmp/fcDisk` before the md5 verification step ran.
+  - Check whether repeated iSCSI auth failures or multipath churn during the same boot window are interfering with the expected disk/file state.
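
The replay guidance in the first run learning above (replay in an isolated temp state directory, then compare against saved state) could be sketched as follows. This is an illustrative outline only: `replay_and_compare` and the `replay` callable are hypothetical, and the real watcher entry point and its arguments are not shown in this commit.

```python
import tempfile
from pathlib import Path
from typing import Callable


def replay_and_compare(
    replay: Callable[[Path], str],
    saved_detail: str,
) -> dict:
    """Run a replay against a throwaway state directory and compare details."""
    with tempfile.TemporaryDirectory(prefix="atvm-replay-") as tmp:
        # The replay callable is handed the throwaway directory instead of
        # the live state directory, so historical validation writes no
        # state there. Suppressing reposts is still up to the callable.
        replayed_detail = replay(Path(tmp))
    return {
        "replayed_detail": replayed_detail,
        # A mismatch points at stale saved state, not necessarily a parser bug.
        "saved_state_stale": replayed_detail != saved_detail,
    }
```

With the values from the `reboot-redhat8.10-both` example, a saved detail of `1 failures` and a replayed detail of `57 tests, 1 failures` would mark the saved state as stale.
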