Improve ATVM failed-host detail recovery

2026-03-30 21:38:59 -04:00
parent d1a909f9ab
commit 18dcbc89f9
4 changed files with 209 additions and 20 deletions
--- a/atvm/AGENTS.md
+++ b/atvm/AGENTS.md
@@ -78,6 +78,9 @@ This file defines how to operate and maintain the ATVM workspace in `/home/aw/co
 - When the watcher is requested, start the watcher before `run-sorry-cypress.py`.
 - Do not start the runner before the watcher, because the watcher helper clears stale `/tmp/<build-name>.log` and can delete the fresh live runner log if the runner starts first.
 - For host-level test detail and failed-test investigation, use `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`, especially `logs/`, `xml/`, and `mochawesome/`.
 - Apply failed-host detail recovery consistently for every ATVM template run, not just `cmc-reboot`.
 - For any failed ATVM host, recover failure detail in this order when available: consolidated run log, `mochawesome`, structured reporter artifacts (`json`/`xml`), then text reporter artifacts.
 - Keep the `HOSTS` detail column compact with the failing step plus a short error summary, and put the longer trimmed failure excerpt in `NOTES:`.
 - When reporting `TEST FLOW:` for an ATVM run, prefer the numbered steps extracted from the generated spec for that exact run.
 - If the generated spec exists, do not rely on a static template flow list for `TEST FLOW:`.
 - Only fall back to template-level or static flow definitions when the generated spec cannot be located or parsed.
--- a/atvm/docs/automation/guide.md
+++ b/atvm/docs/automation/guide.md
@@ -76,6 +76,12 @@ Run ATVM CMC automation tests on the designated automation VM without unintended
  - `mochawesome/`
    - per-run HTML reports
 - When a machine fails, use the matching `logs/` entry first to capture the detailed failure context for that host.
 - Apply the failed-host detail recovery path to every ATVM template type, not just reboot.
 - For any failed host, recover detail in this order when available:
  - consolidated run log
  - matching `mochawesome` HTML
  - structured reporter artifacts such as per-host JSON or XML
  - text reporter artifacts
 - When reconstructing historical status, prefer `cmcReporter` artifacts over less-specific runner output because they preserve per-host results after the live run has ended.
 - Do not treat the existence of a per-host reporter artifact by itself as proof that the host passed.
 - For categorized grouped recovery, prefer the matching per-host reporter JSON or mochawesome result and carry through the real `failures`, `pending`, and failure message instead of assuming `PASS completed`.
@@ -269,6 +275,8 @@ Status-report expectations:
 - Do not include generic watcher bookkeeping messages in `NOTES:` such as artifact-detection confirmations.
 - Do not include internal watcher fallback notes in `NOTES:` such as `check-xml-files.ts` validation confirmations or reporter-artifact recovery details.
 - The `HOSTS:` table includes `Host`, `Kernel`, `Status`, and `Detail` columns in that order.
 - For any failed host, keep the `Detail` column compact by showing the failing step plus a short error summary, not the full raw stack trace.
 - If richer failure text is available, put the longer trimmed excerpt in `NOTES:` so the result stays readable in Mattermost and local status output.
 - In `COVERAGE:`, describe the important `cmc-templates.py` command inputs such as template, categorize mode, datastore/config family, config filename, migration style, any real plugin/integration path, and other operator-relevant run options, but do not list target hosts there or include verbose prose scope descriptions.
 - Only include coverage fields that the template command actually used. Do not show empty or irrelevant fields such as an integration/plugin path for templates that did not use one.
 - If `categorize mode: enabled` is already shown in `COVERAGE:`, do not also repeat `--categorize` under `run options`.
--- a/atvm/docs/automation/run-learnings.md
+++ b/atvm/docs/automation/run-learnings.md
@@ -451,3 +451,30 @@ This file stores run-specific examples only when a run produced a new learning r
  - Resolve `TEST FLOW:` from the generated `.ts` spec for the actual run whenever that spec exists.
  - Extract the numbered `it(...)` steps from the generated spec referenced by the run's `specPattern`.
  - Only use template-level or static fallback flow definitions when the generated spec cannot be found or parsed.
 ## Run Learning: 2026-03-30 (Event-log reporter JSON must not be ignored in non-categorized fallback)
 - Observed failure mode:
  - A failed non-categorized run still posted/saved host detail as only `1 failures` even though the per-host reporter artifacts preserved the failing step.
  - The per-host `.json` artifact used an event-log format with `metadata` plus `tests`, but no top-level `stats` block.
  - The watcher ignored that JSON format, fell back to the `.txt`, and lost structured test counts/detail.
 - Action for future runs:
  - Support the event-log JSON format directly when parsing per-host reporter artifacts.
  - In non-categorized fallback, prefer the structured `.json` artifact over the matching `.txt` when they belong to the same run timestamp.
  - Recover at least the failing testcase name and a nonzero test count from those artifacts even when the consolidated run log is missing.
 ## Run Learning: 2026-03-30 (Use `mochawesome` as the rich fallback for host failure detail)
 - Observed failure mode:
  - The full UI-visible Cypress error text for a failed ATVM host run existed in `cypress/cmcReporter/mochawesome/*.html`, but the lower-fidelity host-level `.json` and `.txt` reporter artifacts only preserved the failing step boundary.
  - That made the host detail fall back to a thin summary even though a richer error payload was available on the controller.
 - Action for future runs:
  - When the consolidated run log is missing, use `mochawesome` as the rich fallback source for per-host failure text before settling for lower-fidelity reporter artifacts.
  - Keep the `HOSTS` table compact by showing the failing step plus a short error summary.
  - Put the longer trimmed failure excerpt in `NOTES:` instead of dumping the full raw stack trace into the host-detail column.
 ## Run Learning: 2026-03-30 (Apply rich failed-host detail recovery to every ATVM template)
 - Observed operator requirement:
  - The same failed-host recovery and formatting rules should apply across all ATVM template runs, not only reboot scenarios.
  - If any ATVM test template fails, the result should still recover the best available failure detail and present it consistently.
 - Action for future runs:
  - Use the same failure-detail recovery order for every ATVM template: consolidated run log, `mochawesome`, structured reporter artifacts, then text reporter artifacts.
  - Keep failed-host `Detail` compact and put the longer trimmed excerpt in `NOTES:` for every template type.
--- a/atvm/watcher-service/atvm_run_watcher.py
+++ b/atvm/watcher-service/atvm_run_watcher.py
@@ -3,6 +3,7 @@ from __future__ import annotations
 import argparse
 import ast
 import html
 import json
 import os
 import re
@@ -439,6 +440,12 @@ def append_failure_detail(detail: str, failure_detail: Optional[str]) -> str:
    return f"{detail} - {failure_detail}"
 def concise_testcase_name(raw: str) -> str:
    if "->" in raw:
        return raw.split("->", 1)[1].strip()
    return raw.strip()
 def extract_failure_detail_from_xml_suite(suite: ET.Element) -> Optional[str]:
    for testcase in suite.findall("testcase"):
        failure = testcase.find("failure")
@@ -470,6 +477,39 @@ def extract_failure_detail_from_text_blob(text: str) -> Optional[str]:
    return None
 def extract_failure_from_reporter_events(testcase_name: str, testcase_events: object) -> Optional[str]:
    if not isinstance(testcase_events, list):
        return None
    for event in testcase_events:
        if not isinstance(event, list) or len(event) < 3:
            continue
        event_type = str(event[0]).lower()
        message_value = str(event[1]) if len(event) > 1 else ""
        status_value = str(event[2]).lower()
        if event_type in {"cy:command", "cy:task"} and status_value in {"failed", "fail", "error"}:
            return compact_failure_detail(f"{concise_testcase_name(testcase_name)} - {message_value}")
    return None
 def extract_failure_from_reporter_text(text: str) -> Optional[str]:
    sections = re.split(r"^=== (.+?) ===\s*$", text, flags=re.M)
    if len(sections) < 3:
        return None
    for index in range(1, len(sections), 2):
        testcase_name = sections[index]
        section_body = sections[index + 1] if index + 1 < len(sections) else ""
        for line in section_body.splitlines():
            parts = line.split("\t")
            if len(parts) < 3:
                continue
            event_type = parts[0].strip().lower()
            status_value = parts[1].strip().lower()
            message_value = "\t".join(parts[2:]).strip()
            if event_type in {"cy:command", "cy:task"} and status_value in {"failed", "fail", "error"}:
                return compact_failure_detail(f"{concise_testcase_name(testcase_name)} - {message_value}")
    return None
 def extract_failure_detail_from_log_text(log_text: str, host: str) -> Optional[str]:
    pattern = (
        rf"\d+\)\s+Testing .*?{re.escape(host)}.*?\n"
@@ -486,6 +526,66 @@ def extract_failure_detail_from_log_text(log_text: str, host: str) -> Optional[s
    return compact_failure_detail(testcase)
 def decode_json_string_fragment(raw: str) -> str:
    try:
        return json.loads(f'"{raw}"')
    except json.JSONDecodeError:
        return raw
 def extract_first_json_string(block: str, key: str) -> Optional[str]:
    match = re.search(rf'"{re.escape(key)}":"((?:\\.|[^"])*)"', block, re.S)
    if not match:
        return None
    return decode_json_string_fragment(match.group(1))
 def extract_failure_from_mochawesome(
    reporter_root: Path,
    build_name: str,
    host: str,
 ) -> Optional[Tuple[str, str, str]]:
    mochawesome_dir = reporter_root / "mochawesome"
    if not mochawesome_dir.exists():
        return None
    candidates = sorted(
        mochawesome_dir.glob(f"*{build_name}*.html"),
        key=lambda path: (path.stat().st_mtime, 0 if path.name.endswith("_001.html") else 1),
        reverse=True,
    )
    host_pattern = re.escape(host)
    for html_path in candidates:
        try:
            text = html.unescape(html_path.read_text(encoding="utf-8", errors="replace"))
        except OSError:
            continue
        for match in re.finditer(r'"fullTitle":"((?:\\.|[^"])*)"', text, re.S):
            full_title = decode_json_string_fragment(match.group(1))
            if host not in full_title:
                continue
            block_start = max(0, match.start() - 1200)
            block_end = min(len(text), match.end() + 8000)
            block = text[block_start:block_end]
            if not re.search(r'"state":"failed"', block):
                continue
            testcase = extract_first_json_string(block, "title") or "failed testcase"
            message = extract_first_json_string(block, "message") or ""
            estack = extract_first_json_string(block, "estack") or ""
            if testcase or message or estack:
                return testcase, message, estack
    return None
 def summarize_host_detail_with_mochawesome(detail: str, testcase: str, message: str) -> str:
    prefix_match = re.match(r"^(\d+ tests, \d+ failures(?:, \d+ pending)?)", detail)
    prefix = prefix_match.group(1) if prefix_match else detail
    message_summary = compact_failure_detail(message or testcase, limit=260)
    testcase_summary = compact_failure_detail(testcase, limit=140)
    return f"{prefix} - {testcase_summary} - {message_summary}"
 def extract_host_results_from_run_finished_segment(segment_text: str, inventory: Dict[str, str]) -> Dict[str, HostResult]:
    host_results: Dict[str, HostResult] = {}
    normalized = re.sub(r"\n\s*│\s*s\s*│", "s", segment_text)
@@ -637,7 +737,7 @@ def collect_latest_host_reporter_artifact(
    if not logs_dir.exists():
        return None
-    latest: Optional[Tuple[str, HostResult]] = None
+    latest: Optional[Tuple[str, HostResult, datetime, str]] = None
    for host_dir in sorted(logs_dir.iterdir()):
        if not host_dir.is_dir():
            continue
@@ -661,14 +761,23 @@ def collect_latest_host_reporter_artifact(
                continue
            artifact_ts = result.timestamp or reporter_artifact_run_timestamp(artifact_path) or artifact_mtime
            result.timestamp = artifact_ts
-            candidate = (host, result)
+            candidate = (host, result, artifact_ts, artifact_path.suffix)
            if latest is None:
                latest = candidate
                continue
-            latest_ts = latest[1].timestamp or datetime.fromtimestamp(0, tz=timezone.utc)
+
-            if artifact_mtime >= latest_ts:
+            latest_result_ts = latest[2]
            latest_suffix = latest[3]
            # Prefer the newest logical run artifact, and prefer JSON over TXT
            # when both artifacts represent the same run timestamp.
            if artifact_ts > latest_result_ts:
                latest = candidate
-    return latest
+                continue
            if artifact_ts == latest_result_ts and artifact_path.suffix == ".json" and latest_suffix != ".json":
                latest = candidate
    if latest is None:
        return None
    return latest[0], latest[1]
 def parse_host_reporter_json(artifact_path: Path, host: str, kernels: Dict[str, str]) -> Optional[HostResult]:
@@ -681,8 +790,42 @@ def parse_host_reporter_json(artifact_path: Path, host: str, kernels: Dict[str,
    stats = payload.get("stats")
    metadata = payload.get("metadata")
    tests_payload = payload.get("tests")
    if not isinstance(stats, dict):
        if not isinstance(tests_payload, dict):
            return None
        tests = len([name for name, events in tests_payload.items() if isinstance(name, str) and isinstance(events, list)])
        failures = 0
        pending = 0
        duration_ms = 0
        failure_detail = None
        for testcase_name, testcase_events in tests_payload.items():
            current_failure = extract_failure_from_reporter_events(testcase_name, testcase_events)
            if current_failure:
                failures += 1
                if failure_detail is None:
                    failure_detail = current_failure
        timestamp = None
        if isinstance(metadata, dict):
            timestamp = parse_reporter_metadata_timestamp(metadata.get("timestamp"))
        if timestamp is None:
            timestamp = reporter_artifact_run_timestamp(artifact_path)
        detail_parts = [f"{tests} tests", f"{failures} failures"]
        detail = ", ".join(detail_parts)
        if failures:
            detail = append_failure_detail(detail, failure_detail)
        status = "FAIL" if failures else "PASS"
        return HostResult(
            host=host,
            kernel=kernels.get(host, "unknown"),
            status=status,
            detail=detail,
            tests=tests,
            failures=failures,
            duration_seconds=None,
            timestamp=timestamp,
        )
    tests = int(stats.get("tests", 0) or 0)
    failures = int(stats.get("failures", 0) or 0)
@@ -699,20 +842,10 @@ def parse_host_reporter_json(artifact_path: Path, host: str, kernels: Dict[str,
        detail_parts.append(f"{pending} pending")
    failure_detail = None
    tests_payload = payload.get("tests")
    if failures and isinstance(tests_payload, dict):
        for testcase_name, testcase_events in tests_payload.items():
-            if not isinstance(testcase_name, str) or not isinstance(testcase_events, list):
+            failure_detail = extract_failure_from_reporter_events(testcase_name, testcase_events)
-                continue
+            if failure_detail is not None:
            for event in testcase_events:
                if not isinstance(event, list) or len(event) < 3:
                    continue
                status_value = str(event[2]).lower()
                message_value = str(event[1]) if len(event) > 1 else ""
                if status_value in {"failed", "fail", "error"} or re.search(r"\b(Error|AssertionError|Timed out!)\b", message_value):
                    failure_detail = compact_failure_detail(f"{testcase_name} - {message_value}")
                    break
            if failure_detail:
                break
    if failures:
@@ -751,11 +884,13 @@ def parse_host_reporter_artifact(artifact_path: Path, host: str, kernels: Dict[s
    except OSError:
        text = ""
-    failure_detail = extract_failure_detail_from_text_blob(text)
+    sectioned_failure_detail = extract_failure_from_reporter_text(text)
    failure_detail = extract_failure_detail_from_text_blob(text) or sectioned_failure_detail
    structured_failure = re.search(r"^(?:cy:command|cy:task)\terror\t", text, re.I | re.M)
    failures = 1 if failure_detail or structured_failure else 0
    tests = len(re.findall(r"^=== .+? ===\s*$", text, re.M))
    status = "FAIL" if failures else "PASS"
-    detail = "1 failures" if failures else "completed"
+    detail = f"{tests} tests, 1 failures" if failures and tests else ("1 failures" if failures else "completed")
    if failures:
        detail = append_failure_detail(detail, failure_detail)
    return HostResult(
@@ -763,6 +898,7 @@ def parse_host_reporter_artifact(artifact_path: Path, host: str, kernels: Dict[s
        kernel=kernels.get(host, "unknown"),
        status=status,
        detail=detail,
        tests=tests,
        failures=failures,
        timestamp=artifact_ts,
    )
@@ -1044,6 +1180,20 @@ def build_status_markdown(
    longest = max((h for h in ordered_hosts if h.duration_seconds is not None), key=lambda h: h.duration_seconds, default=None)
    average = (sum(durations) / len(durations)) if durations else None
    additional_failure_notes: List[str] = []
    for host in ordered_hosts:
        if host.status != "FAIL":
            continue
        mochawesome_failure = extract_failure_from_mochawesome(reporter_root, build_name, host.host)
        if not mochawesome_failure:
            continue
        testcase, message, estack = mochawesome_failure
        host.detail = summarize_host_detail_with_mochawesome(host.detail, testcase, message)
        failure_excerpt_source = estack or message
        additional_failure_notes.append(
            f"{host.host} failure excerpt: `{compact_failure_detail(failure_excerpt_source, limit=420)}`"
        )
    host_lines = ["| Host | Kernel | Status | Detail |", "| --- | --- | --- | --- |"]
    for host in ordered_hosts:
        icon = {
@@ -1070,6 +1220,7 @@ def build_status_markdown(
        notes = notes + [
            "Both iscsi and fc disks were used for the reboot test. As a result, iscsi disks may not have attached before the mtdi started. So if the test failed, that is most likely the issue."
        ]
    notes = notes + additional_failure_notes
    notes_block = "\n".join(f"- {note}" for note in notes) if notes else "- none"
    resolved_flow = extract_test_flow_from_generated_spec(reporter_root, log_text) or get_test_flow(metadata.get("template"))