Fix categorized ATVM watcher host result recovery

2026-03-30 14:02:32 -04:00
parent 89f558bd39
commit 1405a2e879
4 changed files with 117 additions and 17 deletions


@@ -73,6 +73,9 @@ Run ATVM CMC automation tests on the designated automation VM without unintended
- per-run HTML reports
- When a machine fails, use the matching `logs/` entry first to capture the detailed failure context for that host.
- When reconstructing historical status, prefer `cmcReporter` artifacts over less-specific runner output because they preserve per-host results after the live run has ended.
- Do not treat the existence of a per-host reporter artifact by itself as proof that the host passed.
- For categorized grouped recovery, prefer the matching per-host reporter JSON or mochawesome result and carry through the real `failures`, `pending`, and failure message instead of assuming `PASS completed`.
- If grouped XML only contains `check-xml-files.ts`, cross-check the grouped result against the per-host reporter artifacts before posting or repeating status for that grouped sub-run.
Typical sequence:
1. Build the exact `cmc-templates.py` and `run-sorry-cypress.py` commands for the request.
@@ -81,8 +84,9 @@ Typical sequence:
4. Run `cmc-templates.py` with the approved options.
5. Wait for `cmc-templates.py` to fully finish and confirm success.
6. Verify the generated `.ts` files and the config `specPattern` include every requested VM before starting the runner.
-7. If the watcher is approved, start the watcher before launching `run-sorry-cypress.py`.
-8. Run `run-sorry-cypress.py` with the matching approved config and build name.
+7. If the watcher is approved, make sure the controller's deployed watcher code is the intended version before relying on its posts.
+8. If the watcher is approved, start the watcher before launching `run-sorry-cypress.py`.
+9. Run `run-sorry-cypress.py` with the matching approved config and build name.
## Config File / Gold Disk Mapping
- `cypress.atvm-config-gold.ts` -> Gold Disk 1


@@ -356,3 +356,20 @@ This file stores run-specific examples only when a run produced a new learning r
- Keep the maintained `--exclude_partial_match` list for broad selectors such as `--containsVm` or `--randomize`.
- When the operator uses `--specify_vms`, do not auto-add the blacklist unless they explicitly request it.
- Even when the operator uses `--specify_vms`, first check whether any requested VM is on the maintained blacklist and stop instead of launching it if one is included.
## Run Learning: 2026-03-30 (Controller watcher deployment must match the repo watcher before trusting live posts)
- Observed failure mode:
- The repo watcher had the corrected `cmc-reboot` flow, but the controller install at `/opt/atvm-watcher-service/atvm_run_watcher.py` still had the old generic 5-step fallback.
- A live categorized reboot subrun therefore posted the stale 5-step `TEST FLOW:` even though the repo copy had already been fixed.
- Action for future runs:
- Before trusting watcher-generated live posts for new watcher behavior, verify that the controller install matches the intended repo watcher version.
- If the controller install is stale and the operator approves it, deploy the updated watcher code to `/opt/atvm-watcher-service` and restart only the watcher instance for the active build.
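
One way to check whether the controller install matches the repo watcher is to compare file digests. The sketch below is illustrative only: the helper names `file_sha256` and `watcher_is_current` are hypothetical, and the repo-side path in the usage note is an assumption. Only `/opt/atvm-watcher-service/atvm_run_watcher.py` comes from the observed run.

```python
import hashlib
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def watcher_is_current(repo_copy, deployed_copy):
    """Return True only when the deployed watcher is byte-identical to the repo copy.

    A missing deployed file counts as stale rather than raising, so the caller
    can stop and ask the operator before trusting live posts.
    """
    deployed = Path(deployed_copy)
    if not deployed.is_file():
        return False
    return file_sha256(repo_copy) == file_sha256(deployed)
```

For example, `watcher_is_current("atvm_run_watcher.py", "/opt/atvm-watcher-service/atvm_run_watcher.py")` would return `False` for the stale install described above, assuming the first argument points at the corrected repo copy. A digest comparison only detects a mismatch; deploying and restarting still requires operator approval.
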
## Run Learning: 2026-03-30 (Categorized grouped recovery must parse real per-host reporter status, not assume pass)
- Observed failure mode:
- A categorized Red Hat reboot subrun posted both hosts as passed even though `atvm71-redhat9.1` actually failed during `1. Verifying set up`.
- The grouped XML only contained `check-xml-files.ts`, and the watcher incorrectly treated the presence of a per-host reporter artifact as `PASS completed`.
- Action for future runs:
- When grouped XML lacks explicit host testcase results, recover grouped host status from the per-host reporter JSON or equivalent detailed artifact.
- Carry through the real `failures`, `pending`, and failure message from that host artifact instead of assuming `PASS completed`.
- If a correction post is needed because stale or reconstructed state was wrong, mark it explicitly as a correction that supersedes the earlier result.
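
The recovery rule above can be sketched as follows. This is a minimal illustration, not the watcher's actual code: it assumes the per-host artifact follows a mochawesome-style layout (a top-level `stats` object with `failures` and `pending` counts, and nested `results` / `suites` / `tests` entries carrying `fail` and `err.message`). The function names and the `UNKNOWN` status label are hypothetical.

```python
import json
from pathlib import Path


def load_report(path):
    """Parse a per-host reporter JSON file (layout is an assumption, see above)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def first_failure_message(node):
    """Depth-first search for the first failed test's error message."""
    for test in node.get("tests", []):
        if test.get("fail"):
            return (test.get("err") or {}).get("message") or test.get("title", "")
    for suite in node.get("suites", []):
        message = first_failure_message(suite)
        if message:
            return message
    return None


def recover_host_status(report):
    """Derive host status from the parsed counts instead of assuming a pass."""
    stats = report.get("stats")
    if not stats:
        # The artifact exists but carries no counts: never report this as a pass.
        return {"status": "UNKNOWN", "failures": None, "pending": None, "message": None}
    failures = int(stats.get("failures", 0))
    pending = int(stats.get("pending", 0))
    message = None
    if failures:
        for root in report.get("results", []):
            message = first_failure_message(root)
            if message:
                break
    return {
        "status": "FAIL" if failures else "PASS",
        "failures": failures,
        "pending": pending,
        "message": message,
    }
```

Calling `recover_host_status(load_report(path))` on a host artifact returns the real `failures`, `pending`, and first failure message. The mere existence of the file never yields `PASS`, since an artifact without usable `stats` is reported as `UNKNOWN`.
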