Improve ATVM failed-host detail recovery

2026-03-30 21:38:59 -04:00
parent d1a909f9ab
commit 18dcbc89f9
4 changed files with 209 additions and 20 deletions
--- a/atvm/docs/automation/guide.md
+++ b/atvm/docs/automation/guide.md
@@ -76,6 +76,12 @@ Run ATVM CMC automation tests on the designated automation VM without unintended
  - `mochawesome/`
    - per-run HTML reports
 - When a machine fails, use the matching `logs/` entry first to capture the detailed failure context for that host.
+- Apply the failed-host detail recovery path to every ATVM template type, not just reboot templates.
+- For any failed host, recover detail in this order when available:
+  - consolidated run log
+  - matching `mochawesome` HTML
+  - structured reporter artifacts such as per-host JSON or XML
+  - text reporter artifacts
 - When reconstructing historical status, prefer `cmcReporter` artifacts over less-specific runner output because they preserve per-host results after the live run has ended.
 - Do not treat the existence of a per-host reporter artifact by itself as proof that the host passed.
 - For categorized grouped recovery, prefer the matching per-host reporter JSON or mochawesome result and carry through the real `failures`, `pending`, and failure message instead of assuming `PASS completed`.
@@ -269,6 +275,8 @@ Status-report expectations:
 - Do not include generic watcher bookkeeping messages in `NOTES:` such as artifact-detection confirmations.
 - Do not include internal watcher fallback notes in `NOTES:` such as `check-xml-files.ts` validation confirmations or reporter-artifact recovery details.
 - The `HOSTS:` table includes `Host`, `Kernel`, `Status`, and `Detail` columns in that order.
+- For any failed host, keep the `Detail` column compact by showing the failing step plus a short error summary, not the full raw stack trace.
+- If richer failure text is available, put the longer trimmed excerpt in `NOTES:` so the result stays readable in Mattermost and local status output.
 - In `COVERAGE:`, describe the important `cmc-templates.py` command inputs such as template, categorize mode, datastore/config family, config filename, migration style, any real plugin/integration path, and other operator-relevant run options, but do not list target hosts there or include verbose prose scope descriptions.
 - Only include coverage fields that the template command actually used. Do not show empty or irrelevant fields such as an integration/plugin path for templates that did not use one.
 - If `categorize mode: enabled` is already shown in `COVERAGE:`, do not also repeat `--categorize` under `run options`.
--- a/atvm/docs/automation/run-learnings.md
+++ b/atvm/docs/automation/run-learnings.md
@@ -451,3 +451,30 @@ This file stores run-specific examples only when a run produced a new learning r
  - Resolve `TEST FLOW:` from the generated `.ts` spec for the actual run whenever that spec exists.
  - Extract the numbered `it(...)` steps from the generated spec referenced by the run's `specPattern`.
  - Only use template-level or static fallback flow definitions when the generated spec cannot be found or parsed.
+
+## Run Learning: 2026-03-30 (Event-log reporter JSON must not be ignored in non-categorized fallback)
+- Observed failure mode:
+  - A failed non-categorized run still posted/saved host detail as only `1 failures` even though the per-host reporter artifacts preserved the failing step.
+  - The per-host `.json` artifact used an event-log format with `metadata` plus `tests`, but no top-level `stats` block.
+  - The watcher ignored that JSON format, fell back to the matching `.txt` artifact, and lost the structured test counts and detail.
+- Action for future runs:
+  - Support the event-log JSON format directly when parsing per-host reporter artifacts.
+  - In non-categorized fallback, prefer the structured `.json` artifact over the matching `.txt` when they belong to the same run timestamp.
+  - Recover at least the failing testcase name and a nonzero test count from those artifacts even when the consolidated run log is missing.
+
+## Run Learning: 2026-03-30 (Use `mochawesome` as the rich fallback for host failure detail)
+- Observed failure mode:
+  - The full UI-visible Cypress error text for a failed ATVM host run existed in `cypress/cmcReporter/mochawesome/*.html`, but the lower-fidelity host-level `.json` and `.txt` reporter artifacts only preserved the failing step boundary.
+  - That made the host detail fall back to a thin summary even though a richer error payload was available on the controller.
+- Action for future runs:
+  - When the consolidated run log is missing, use `mochawesome` as the rich fallback source for per-host failure text before settling for lower-fidelity reporter artifacts.
+  - Keep the `HOSTS` table compact by showing the failing step plus a short error summary.
+  - Put the longer trimmed failure excerpt in `NOTES:` instead of dumping the full raw stack trace into the host-detail column.
+
+## Run Learning: 2026-03-30 (Apply rich failed-host detail recovery to every ATVM template)
+- Observed operator requirement:
+  - The same failed-host recovery and formatting rules should apply across all ATVM template runs, not only reboot scenarios.
+  - If any ATVM test template fails, the result should still recover the best available failure detail and present it consistently.
+- Action for future runs:
+  - Use the same failure-detail recovery order for every ATVM template: consolidated run log, `mochawesome`, structured reporter artifacts, then text reporter artifacts.
+  - Keep failed-host `Detail` compact and put the longer trimmed excerpt in `NOTES:` for every template type.
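
The three learnings above can be sketched together as a minimal Python example. It is not the watcher's actual code. The source labels in `RECOVERY_ORDER`, the per-test fields (`title`, `state`, `error`), the truncation limits, and the sample host name are all hypothetical. The docs only state that the event-log JSON has `metadata` plus `tests` and no top-level `stats` block.

```python
import json
from typing import Optional

# Recovery order from the guide, highest-fidelity source first.
# These keys are hypothetical labels, not names used by the real watcher.
RECOVERY_ORDER = ("run_log", "mochawesome", "structured", "text")


def pick_detail_source(available: dict) -> Optional[tuple]:
    """Return (label, content) for the first source in order that has content."""
    for label in RECOVERY_ORDER:
        content = available.get(label)
        if content:
            return label, content
    return None


def summarize_event_log(raw: str) -> dict:
    """Summarize an event-log JSON that has `metadata` and `tests` but no `stats`.

    The per-test fields (`title`, `state`, `error`) are assumed for illustration.
    """
    data = json.loads(raw)
    tests = data.get("tests", [])
    failed = [t for t in tests if t.get("state") == "failed"]
    first = failed[0] if failed else {}
    return {
        "tests": len(tests),
        "failures": len(failed),
        "failing_step": first.get("title"),
        "error": first.get("error", ""),
    }


def format_failure(step: str, error: str, detail_limit: int = 60, notes_limit: int = 300):
    """Return (detail, note): a compact Detail cell and a longer excerpt for NOTES."""
    lines = error.strip().splitlines()
    first_line = lines[0] if lines else ""
    detail = f"{step}: {first_line}"[:detail_limit]  # failing step plus short summary
    note = error.strip()[:notes_limit]               # longer trimmed excerpt
    return detail, note


# Usage: the run log is missing, so the structured JSON wins over the text artifact.
sample = json.dumps({
    "metadata": {"host": "host-a"},
    "tests": [
        {"title": "1. log in", "state": "passed"},
        {"title": "2. reboot host", "state": "failed",
         "error": "Timed out retrying\n    at step (spec.ts:10)"},
    ],
})
label, content = pick_detail_source(
    {"run_log": "", "structured": sample, "text": "1 failures"}
)
summary = summarize_event_log(content)
detail, note = format_failure(summary["failing_step"], summary["error"])
print(label, summary["tests"], summary["failures"])
print(detail)
```

Keeping the order in one tuple means every template type shares the same fallback path. Deriving counts from `tests` avoids depending on a `stats` block that the event-log format does not provide.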