Tighten ATVM completed-run status verification
@@ -74,6 +74,10 @@ This file defines how to operate and maintain the ATVM workspace in `/home/aw/co
 - When the watcher is requested, start the watcher before `run-sorry-cypress.py`.
 - Do not start the runner before the watcher, because the watcher helper clears stale `/tmp/<build-name>.log` and can delete the fresh live runner log if the runner starts first.
 - For host-level test detail and failed-test investigation, use `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`, especially `logs/`, `xml/`, and `mochawesome/`.
+- Treat `/var/lib/atvm-run-watcher/<build>/state.json` as cached watcher output, not the source of truth for a completed-run confirmation.
+- Before confirming a completed ATVM run status, verify in this order: live launch log, matching reporter artifacts, `Cloud Run Finished` summary / Currents URL, then compare against saved watcher state.
+- If saved watcher state disagrees with the launch log or a replay of the exact artifacts through the current watcher code, treat the saved state as stale and do not report from it.
+- Never confirm a completed ATVM run from `state.json` alone.
 - If the operator asks for ATVM run status without mentioning Mattermost, respond locally only and do not post externally.
 - If the operator asks to send ATVM run status to Mattermost, use `MATTERMOST_ATVM_WEBHOOK` and `MATTERMOST_ATVM_CHANNEL` from `/home/aw/code/cds/.env.credentials.local` by default and send the final status only after the run has fully completed, whether the run passed or failed.
 - Do not call out expected, harmless `systemctl reset-failed ... unit not loaded` output in routine run updates; mention it only if it blocks startup or matters for debugging.
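The rule added in the hunk above (launch log, reporter artifacts, and `Cloud Run Finished` summary are authoritative; `state.json` is only a comparison layer) might be sketched roughly as follows. This is a minimal illustration, not the watcher's actual code: the function name, the layer labels, and the status strings are all hypothetical, and it assumes each layer has already been reduced to a single status string (or `None` when that layer is unavailable).

```python
from typing import Dict, Optional

# Hypothetical labels for the authoritative layers, in the order the rule lists them.
AUTHORITATIVE_ORDER = ("launch_log", "reporter_artifacts", "cloud_run_summary")


def resolve_completed_status(observed: Dict[str, Optional[str]],
                             saved_state: Optional[str] = None) -> dict:
    """Confirm a completed run only from authoritative layers, never from saved state alone."""
    seen = [(name, observed[name]) for name in AUTHORITATIVE_ORDER
            if observed.get(name) is not None]
    if not seen:
        # Only state.json (or nothing) is available: refuse to confirm.
        return {"confirmed": False, "status": None, "saved_state_stale": False,
                "reason": "no authoritative source available"}
    if len({status for _, status in seen}) > 1:
        # The authoritative layers disagree with each other: investigate first.
        return {"confirmed": False, "status": None, "saved_state_stale": False,
                "reason": "authoritative sources disagree"}
    status = seen[0][1]
    # Saved watcher state is compared last and flagged as stale on any mismatch.
    stale = saved_state is not None and saved_state != status
    return {"confirmed": True, "status": status, "saved_state_stale": stale,
            "reason": "confirmed from " + ", ".join(name for name, _ in seen)}
```

In this sketch a mismatching saved state never changes the reported status; it only sets `saved_state_stale`, which mirrors the instruction not to report from stale state.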
@@ -76,6 +76,14 @@ Run ATVM CMC automation tests on the designated automation VM without unintended
 - Do not treat the existence of a per-host reporter artifact by itself as proof that the host passed.
 - For categorized grouped recovery, prefer the matching per-host reporter JSON or mochawesome result and carry through the real `failures`, `pending`, and failure message instead of assuming `PASS completed`.
 - If grouped XML only contains `check-xml-files.ts`, cross-check the grouped result against the per-host reporter artifacts before posting or repeating status for that grouped sub-run.
+- Treat saved watcher state under `/var/lib/atvm-run-watcher/<build>/state.json` as cached status only.
+- For completed-run verification, confirm in this order:
+  - launch log under `/tmp/<build>.launch.log`
+  - matching `cmcReporter` artifacts
+  - `Cloud Run Finished` summary and Currents URL
+  - saved watcher state only as a comparison layer
+- If saved watcher state disagrees with the launch log or with a replay of the exact artifacts through the current watcher code, treat the saved state as stale and do not use it as the reported result.
+- Never confirm a completed run from `state.json` alone.

 Typical sequence:
 1. Build the exact `cmc-templates.py` and `run-sorry-cypress.py` commands for the request.
@@ -88,6 +96,13 @@ Typical sequence:
 8. If the watcher is approved, start the watcher before launching `run-sorry-cypress.py`.
 9. Run `run-sorry-cypress.py` with the matching approved config and build name.

+Completed-run verification sequence:
+1. Read the launch log for the build.
+2. Inspect the matching reporter artifacts for the relevant host(s).
+3. Use the `Cloud Run Finished` summary and Currents URL as the final parent-run check when present.
+4. Compare that result against saved watcher state.
+5. If there is any disagreement, replay the exact artifacts through the current watcher code in an isolated temp state directory before confirming the result.
+
 ## Config File / Gold Disk Mapping
 - `cypress.atvm-config-gold.ts` -> Gold Disk 1
 - `cypress.atvm-config-gold-2.ts` -> Gold Disk 2
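Step 5 of the verification sequence above (replaying artifacts in an isolated temp state directory) could look something like the sketch below. The `replay` callable is a hypothetical stand-in for whatever entry point the current watcher code exposes; the diff does not show that interface, so the sketch only demonstrates the isolation and comparison, under the assumption that a replay returns a JSON-comparable dict.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


def replay_in_isolation(replay: Callable[[Path], dict],
                        saved_state_path: Path) -> dict:
    """Replay into a throwaway state directory and compare with saved watcher state."""
    # The temp directory keeps the replay away from the real watcher state,
    # and it is removed as soon as the replay returns.
    with tempfile.TemporaryDirectory(prefix="atvm-replay-") as tmp:
        replayed = replay(Path(tmp))

    saved: Optional[dict] = None
    if saved_state_path.exists():
        saved = json.loads(saved_state_path.read_text())

    return {
        "replayed": replayed,
        "saved": saved,
        # A mismatch suggests the saved state is stale, not that the replay is wrong.
        "saved_is_stale": saved is not None and saved != replayed,
    }
```

The replayed result, not the saved one, would be the candidate for reporting; the saved file is read but never written.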
@@ -401,3 +401,28 @@ This file stores run-specific examples only when a run produced a new learning r
 - Do not classify a reporter TXT artifact as failed just because it contains the word `error`.
 - For TXT fallback, require explicit terminal failure markers such as `cy:command error`, `cy:task error`, or real `Error:`/`AssertionError:`/timeout text.
 - Prefer the parent run summary when available, because it is less prone to false failure signals than raw per-step console text.
+
+## Run Learning: 2026-03-30 (Replay exact artifacts before assuming a thin closed-run detail is a current watcher bug)
+- Observed failure mode:
+  - The saved controller state for `reboot-redhat8.10-both` still showed only `1 failures` under the host detail, even though the launch log contained the full md5sum failure text.
+  - Replaying the exact launch log and reporter artifacts through the currently installed watcher produced the correct host detail with `57 tests, 1 failures` and the failing testcase/error text.
+- Action for future runs:
+  - Before patching the watcher again for a thin closed-run detail, replay the exact run artifacts through the currently installed watcher code.
+  - Treat a mismatch between saved state and current replay as evidence of a stale in-memory watcher instance or stale deployment, not automatically as a parser regression.
+  - Use an isolated temp state directory or other no-post path for that replay so historical validation does not repost results.
+
+## Run Learning: 2026-03-30 (Red Hat 8.10 Pure both failure on step 38 was a missing FC reboot-validation artifact with concurrent storage instability)
+- Observed failure mode:
+  - The failing testcase was `38. Verify diskname2Reboot file is the same as diskname2Reboot’s source (Reboot test)`.
+  - The concrete error was `md5sum: /root/tmp/fcDisk/diskname2Reboot.md5: No such file or directory`.
+  - On the target after the run, `/root/tmp/fcDisk` contained `diskname2Disk` and `diskname2Disk.md5`, but not `diskname2Reboot.md5`.
+- Additional host findings:
+  - The target showed repeated iSCSI authorization failures and later `Could not log into all portals`.
+  - `mtdi-driver.service` started at `17:30:26 EDT`.
+  - `iscsid.service` / `Open-iSCSI` started at `17:30:30 EDT`.
+  - `iscsi.service`, `mtdi-daemon.service`, and `galaxy-migrate.service` reached active state at `17:32:45 EDT`.
+  - Repeated multipath reinitialization and `failed to get ... uid` messages continued through the run window.
+- Action for future runs:
+  - If this failure recurs, treat it as a host/storage investigation first, not just a watcher-formatting issue.
+  - Check whether the FC reboot-validation step actually created `diskname2Reboot.md5` on `/root/tmp/fcDisk` before the md5 verification step ran.
+  - Check whether repeated iSCSI auth failures or multipath churn during the same boot window are interfering with the expected disk/file state.
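The TXT fallback rule in the context lines of this hunk (require explicit terminal failure markers instead of the bare word `error`) can be illustrated with a small classifier. This is a hedged sketch only: the function name is hypothetical, the first three alternatives come from the rule text, and the `timed out` pattern is an assumption about what timeout text looks like in the reporter output.

```python
import re

# Explicit failure markers. `\b\w*Error:` covers `Error:` and `AssertionError:`;
# the case-insensitive `timed out` alternative is an assumed timeout wording.
FAILURE_MARKERS = re.compile(
    r"cy:command error|cy:task error|\b\w*Error:|(?i:timed out)"
)


def txt_artifact_failed(text: str) -> bool:
    """Return True only when the TXT artifact contains an explicit failure marker."""
    return any(FAILURE_MARKERS.search(line) for line in text.splitlines())
```

A line such as `no error found in config` is not flagged under this sketch, because the word `error` alone matches none of the markers.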