Fix categorized ATVM watcher host result recovery

This commit is contained in:
2026-03-30 14:02:32 -04:00
parent 89f558bd39
commit 1405a2e879
4 changed files with 117 additions and 17 deletions

View File

@@ -356,3 +356,20 @@ This file stores run-specific examples only when a run produced a new learning r
- Keep the maintained `--exclude_partial_match` list for broad selectors such as `--containsVm` or `--randomize`.
- When the operator uses `--specify_vms`, do not auto-add the blacklist unless they explicitly request it.
- Even when the operator uses `--specify_vms`, first check whether any requested VM is on the maintained blacklist and stop instead of launching it if one is included.
## Run Learning: 2026-03-30 (Controller watcher deployment must match the repo watcher before trusting live posts)
- Observed failure mode:
- The repo watcher had the corrected `cmc-reboot` flow, but the controller install at `/opt/atvm-watcher-service/atvm_run_watcher.py` still had the old generic 5-step fallback.
- A live categorized reboot subrun therefore posted the stale 5-step `TEST FLOW:` even though the repo copy had already been fixed.
- Action for future runs:
- Before trusting watcher-generated live posts for new watcher behavior, verify that the controller install matches the intended repo watcher version.
- If the controller install is stale and the operator approves it, deploy the updated watcher code to `/opt/atvm-watcher-service` and restart only the watcher instance for the active build.
## Run Learning: 2026-03-30 (Categorized grouped recovery must parse real per-host reporter status, not assume pass)
- Observed failure mode:
- A categorized Red Hat reboot subrun posted both hosts as passed even though `atvm71-redhat9.1` actually failed during `1. Verifying set up`.
- The grouped XML only contained `check-xml-files.ts`, and the watcher incorrectly treated the presence of a per-host reporter artifact as `PASS completed`.
- Action for future runs:
- When grouped XML lacks explicit host testcase results, recover grouped host status from the per-host reporter JSON or equivalent detailed artifact.
- Carry through the real `failures`, `pending`, and failure message from that host artifact instead of assuming `PASS completed`.
- If a correction post is needed because stale or reconstructed state was wrong, mark it explicitly as a correction that supersedes the earlier result.