Improve ATVM failed-host detail recovery

2026-03-30 21:38:59 -04:00
parent d1a909f9ab
commit 18dcbc89f9
4 changed files with 209 additions and 20 deletions

View File

@@ -78,6 +78,9 @@ This file defines how to operate and maintain the ATVM workspace in `/home/aw/co
 - When the watcher is requested, start the watcher before `run-sorry-cypress.py`.
 - Do not start the runner before the watcher, because the watcher helper clears stale `/tmp/<build-name>.log` and can delete the fresh live runner log if the runner starts first.
 - For host-level test detail and failed-test investigation, use `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`, especially `logs/`, `xml/`, and `mochawesome/`.
+- Apply failed-host detail recovery consistently for every ATVM template run, not just `cmc-reboot`.
+- For any failed ATVM host, recover failure detail in this order when available: consolidated run log, `mochawesome`, structured reporter artifacts (`json`/`xml`), then text reporter artifacts.
+- Keep the `HOSTS` detail column compact with the failing step plus a short error summary, and put the longer trimmed failure excerpt in `NOTES:`.
 - When reporting `TEST FLOW:` for an ATVM run, prefer the numbered steps extracted from the generated spec for that exact run.
 - If the generated spec exists, do not rely on a static template flow list for `TEST FLOW:`.
 - Only fall back to template-level or static flow definitions when the generated spec cannot be located or parsed.
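As an illustration of the `TEST FLOW:` rule above, numbered steps could be pulled from a generated spec with a small regex pass. This is a minimal sketch, not the watcher's actual `extract_test_flow_from_generated_spec` logic: the function name `extract_numbered_steps`, the regex, and the assumption that step titles begin with a number such as `1.` or `2)` are all hypothetical.

```python
import re
from typing import List

# Assumption: generated specs title each step like it("1. Open CMC", ...).
# The pattern accepts single, double, or backtick quotes around the title.
IT_TITLE = re.compile(r"""\bit\(\s*(['"`])(\d+[.)]\s*.+?)\1""")


def extract_numbered_steps(spec_source: str) -> List[str]:
    """Return the numbered it(...) titles in the order they appear."""
    return [match.group(2).strip() for match in IT_TITLE.finditer(spec_source)]


spec = (
    'describe("reboot", () => {\n'
    '  it("1. Open CMC", () => {});\n'
    "  it('2) Reboot host', () => {});\n"
    '  it("unnumbered helper check", () => {});\n'
    "});\n"
)
steps = extract_numbered_steps(spec)
```

When `steps` comes back empty, that would be the signal to fall back to the template-level flow list.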

View File

@@ -76,6 +76,12 @@ Run ATVM CMC automation tests on the designated automation VM without unintended
 - `mochawesome/`
 - per-run HTML reports
 - When a machine fails, use the matching `logs/` entry first to capture the detailed failure context for that host.
+- Apply the failed-host detail recovery path to every ATVM template type, not just reboot.
+- For any failed host, recover detail in this order when available:
+  - consolidated run log
+  - matching `mochawesome` HTML
+  - structured reporter artifacts such as per-host JSON or XML
+  - text reporter artifacts
 - When reconstructing historical status, prefer `cmcReporter` artifacts over less-specific runner output because they preserve per-host results after the live run has ended.
 - Do not treat the existence of a per-host reporter artifact by itself as proof that the host passed.
 - For categorized grouped recovery, prefer the matching per-host reporter JSON or mochawesome result and carry through the real `failures`, `pending`, and failure message instead of assuming `PASS completed`.
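The recovery order listed above amounts to a first-match fallback chain. The sketch below is one hypothetical way to express it: `recover_failure_detail` and the `(label, loader)` pairing are illustrative names, and each loader is assumed to return `None` when its artifact is missing or yields nothing useful.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical shape: a label plus a zero-argument loader that returns
# failure text, or None when that artifact is missing or unusable.
Source = Tuple[str, Callable[[], Optional[str]]]


def recover_failure_detail(sources: Sequence[Source]) -> Optional[Tuple[str, str]]:
    """Return (label, detail) from the first source that yields text."""
    for label, loader in sources:
        detail = loader()
        if detail:
            return label, detail
    return None


# Sources are listed in the documented priority order.
sources = [
    ("consolidated run log", lambda: None),  # log missing in this example
    ("mochawesome", lambda: "AssertionError: expected host to be up"),
    ("structured reporter artifacts", lambda: "2. Reboot - failed"),
    ("text reporter artifacts", lambda: "1 failures"),
]
recovered = recover_failure_detail(sources)
```

Keeping the order in one list means every template type goes through the same chain, which is the point of the rule.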
@@ -269,6 +275,8 @@ Status-report expectations:
 - Do not include generic watcher bookkeeping messages in `NOTES:` such as artifact-detection confirmations.
 - Do not include internal watcher fallback notes in `NOTES:` such as `check-xml-files.ts` validation confirmations or reporter-artifact recovery details.
 - The `HOSTS:` table includes `Host`, `Kernel`, `Status`, and `Detail` columns in that order.
+- For any failed host, keep the `Detail` column compact by showing the failing step plus a short error summary, not the full raw stack trace.
+- If richer failure text is available, put the longer trimmed excerpt in `NOTES:` so the result stays readable in Mattermost and local status output.
 - In `COVERAGE:`, describe the important `cmc-templates.py` command inputs such as template, categorize mode, datastore/config family, config filename, migration style, any real plugin/integration path, and other operator-relevant run options, but do not list target hosts there or include verbose prose scope descriptions.
 - Only include coverage fields that the template command actually used. Do not show empty or irrelevant fields such as an integration/plugin path for templates that did not use one.
 - If `categorize mode: enabled` is already shown in `COVERAGE:`, do not also repeat `--categorize` under `run options`.
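The split between a compact `Detail` cell and a longer `NOTES:` excerpt can be sketched as follows. This is an assumption-laden illustration rather than the watcher's code: `compact`, `split_detail_and_note`, and the 120-character detail limit are hypothetical, while the 420-character excerpt limit and the `failure excerpt:` wording mirror the values used in the Python change in this commit.

```python
from typing import Tuple


def compact(text: str, limit: int) -> str:
    """Collapse all whitespace runs and truncate to `limit` characters."""
    flat = " ".join(text.split())
    if len(flat) <= limit:
        return flat
    return flat[: limit - 3].rstrip() + "..."


def split_detail_and_note(
    host: str,
    step: str,
    error: str,
    stack: str,
    detail_limit: int = 120,  # hypothetical limit for the table cell
    note_limit: int = 420,
) -> Tuple[str, str]:
    """Return (compact Detail cell, longer NOTES line) for one failed host."""
    lines = error.strip().splitlines()
    first_line = lines[0] if lines else ""
    detail = compact(f"{step} - {first_line}" if first_line else step, detail_limit)
    note = f"{host} failure excerpt: `{compact(stack or error, note_limit)}`"
    return detail, note


detail, note = split_detail_and_note(
    "host-a",
    "3. Reboot host",
    "Timed out retrying\n    at foo",
    "Timed out retrying\n    at foo (bar.js:1)",
)
```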

View File

@@ -451,3 +451,30 @@ This file stores run-specific examples only when a run produced a new learning r
 - Resolve `TEST FLOW:` from the generated `.ts` spec for the actual run whenever that spec exists.
 - Extract the numbered `it(...)` steps from the generated spec referenced by the run's `specPattern`.
 - Only use template-level or static fallback flow definitions when the generated spec cannot be found or parsed.
+
+## Run Learning: 2026-03-30 (Event-log reporter JSON must not be ignored in non-categorized fallback)
+- Observed failure mode:
+  - A failed non-categorized run still posted/saved host detail as only `1 failures` even though the per-host reporter artifacts preserved the failing step.
+  - The per-host `.json` artifact used an event-log format with `metadata` plus `tests`, but no top-level `stats` block.
+  - The watcher ignored that JSON format, fell back to the `.txt`, and lost structured test counts/detail.
+- Action for future runs:
+  - Support the event-log JSON format directly when parsing per-host reporter artifacts.
+  - In non-categorized fallback, prefer the structured `.json` artifact over the matching `.txt` when they belong to the same run timestamp.
+  - Recover at least the failing testcase name and a nonzero test count from those artifacts even when the consolidated run log is missing.
+
+## Run Learning: 2026-03-30 (Use `mochawesome` as the rich fallback for host failure detail)
+- Observed failure mode:
+  - The full UI-visible Cypress error text for a failed ATVM host run existed in `cypress/cmcReporter/mochawesome/*.html`, but the lower-fidelity host-level `.json` and `.txt` reporter artifacts only preserved the failing step boundary.
+  - That made the host detail fall back to a thin summary even though a richer error payload was available on the controller.
+- Action for future runs:
+  - When the consolidated run log is missing, use `mochawesome` as the rich fallback source for per-host failure text before settling for lower-fidelity reporter artifacts.
+  - Keep the `HOSTS` table compact by showing the failing step plus a short error summary.
+  - Put the longer trimmed failure excerpt in `NOTES:` instead of dumping the full raw stack trace into the host-detail column.
+
+## Run Learning: 2026-03-30 (Apply rich failed-host detail recovery to every ATVM template)
+- Observed operator requirement:
+  - The same failed-host recovery and formatting rules should apply across all ATVM template runs, not only reboot scenarios.
+  - If any ATVM test template fails, the result should still recover the best available failure detail and present it consistently.
+- Action for future runs:
+  - Use the same failure-detail recovery order for every ATVM template: consolidated run log, `mochawesome`, structured reporter artifacts, then text reporter artifacts.
+  - Keep failed-host `Detail` compact and put the longer trimmed excerpt in `NOTES:` for every template type.
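The event-log JSON shape described in the first learning above (a `metadata` block plus a `tests` mapping, with no top-level `stats`) can be summarized roughly as follows. This is a minimal sketch, assuming each test maps to a list of `[event_type, message, status]` entries, which is the field order the watcher change in this commit reads. The name `summarize_event_log` and the sample payload, including its timestamp format, are hypothetical.

```python
import json
from typing import Optional, Tuple

FAILED_STATES = {"failed", "fail", "error"}
FAILING_EVENT_TYPES = {"cy:command", "cy:task"}


def summarize_event_log(raw_json: str) -> Optional[Tuple[int, int, Optional[str]]]:
    """Return (tests, failures, first failure text), or None for other formats."""
    payload = json.loads(raw_json)
    tests = payload.get("tests") if isinstance(payload, dict) else None
    if not isinstance(tests, dict):
        return None
    total = failures = 0
    first_failure: Optional[str] = None
    for name, events in tests.items():
        if not isinstance(events, list):
            continue
        total += 1
        for event in events:
            if not isinstance(event, list) or len(event) < 3:
                continue
            event_type, message, status = (str(part) for part in event[:3])
            if event_type.lower() in FAILING_EVENT_TYPES and status.lower() in FAILED_STATES:
                failures += 1
                if first_failure is None:
                    first_failure = f"{name} - {message}"
                break  # count each test at most once
    return total, failures, first_failure


# Hypothetical payload: note there is no top-level "stats" block.
sample = json.dumps({
    "metadata": {"timestamp": "2026-03-30T21:00:00Z"},
    "tests": {
        "Testing host-a -> 1. Login": [["cy:command", "visit /login", "passed"]],
        "Testing host-a -> 2. Reboot": [["cy:command", "Timed out waiting for host", "failed"]],
    },
})
summary = summarize_event_log(sample)
```

Deriving the counts from `tests` itself is what lets a nonzero test count survive even when `stats` is absent.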

View File

@@ -3,6 +3,7 @@ from __future__ import annotations
 import argparse
 import ast
+import html
 import json
 import os
 import re
@@ -439,6 +440,12 @@ def append_failure_detail(detail: str, failure_detail: Optional[str]) -> str:
     return f"{detail} - {failure_detail}"


+def concise_testcase_name(raw: str) -> str:
+    if "->" in raw:
+        return raw.split("->", 1)[1].strip()
+    return raw.strip()
+
+
 def extract_failure_detail_from_xml_suite(suite: ET.Element) -> Optional[str]:
     for testcase in suite.findall("testcase"):
         failure = testcase.find("failure")
@@ -470,6 +477,39 @@ def extract_failure_detail_from_text_blob(text: str) -> Optional[str]:
     return None


+def extract_failure_from_reporter_events(testcase_name: str, testcase_events: object) -> Optional[str]:
+    if not isinstance(testcase_events, list):
+        return None
+    for event in testcase_events:
+        if not isinstance(event, list) or len(event) < 3:
+            continue
+        event_type = str(event[0]).lower()
+        message_value = str(event[1]) if len(event) > 1 else ""
+        status_value = str(event[2]).lower()
+        if event_type in {"cy:command", "cy:task"} and status_value in {"failed", "fail", "error"}:
+            return compact_failure_detail(f"{concise_testcase_name(testcase_name)} - {message_value}")
+    return None
+
+
+def extract_failure_from_reporter_text(text: str) -> Optional[str]:
+    sections = re.split(r"^=== (.+?) ===\s*$", text, flags=re.M)
+    if len(sections) < 3:
+        return None
+    for index in range(1, len(sections), 2):
+        testcase_name = sections[index]
+        section_body = sections[index + 1] if index + 1 < len(sections) else ""
+        for line in section_body.splitlines():
+            parts = line.split("\t")
+            if len(parts) < 3:
+                continue
+            event_type = parts[0].strip().lower()
+            status_value = parts[1].strip().lower()
+            message_value = "\t".join(parts[2:]).strip()
+            if event_type in {"cy:command", "cy:task"} and status_value in {"failed", "fail", "error"}:
+                return compact_failure_detail(f"{concise_testcase_name(testcase_name)} - {message_value}")
+    return None
+
+
 def extract_failure_detail_from_log_text(log_text: str, host: str) -> Optional[str]:
     pattern = (
         rf"\d+\)\s+Testing .*?{re.escape(host)}.*?\n"
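The sectioned text parsing in this hunk relies on `re.split` with a capturing group, which interleaves the captured section titles with the section bodies. The sketch below illustrates that alternation on a hypothetical reporter `.txt` sample. The `=== title ===` headings and the tab-separated `type`, `status`, `message` field order follow the code above; the sample content and the helper name `first_failed_event` are invented for illustration.

```python
import re
from typing import Optional, Tuple

FAILED_STATES = {"failed", "fail", "error"}


def first_failed_event(text: str) -> Optional[Tuple[str, str]]:
    """Return (section title, message) for the first failed event line."""
    # With a capturing group, re.split returns
    # [preamble, title1, body1, title2, body2, ...], so the length is always odd.
    sections = re.split(r"^=== (.+?) ===\s*$", text, flags=re.M)
    for index in range(1, len(sections), 2):
        title, body = sections[index], sections[index + 1]
        for line in body.splitlines():
            parts = line.split("\t")
            if len(parts) >= 3 and parts[1].strip().lower() in FAILED_STATES:
                return title, "\t".join(parts[2:]).strip()
    return None


# Hypothetical artifact content.
sample = (
    "=== Testing host-a -> 1. Login ===\n"
    "cy:command\tpassed\tvisit /login\n"
    "=== Testing host-a -> 2. Reboot ===\n"
    "cy:command\tfailed\tTimed out waiting for host\n"
)
found = first_failed_event(sample)
section_titles = re.split(r"^=== (.+?) ===\s*$", sample, flags=re.M)[1::2]
```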
@@ -486,6 +526,66 @@ def extract_failure_detail_from_log_text(log_text: str, host: str) -> Optional[s
     return compact_failure_detail(testcase)


+def decode_json_string_fragment(raw: str) -> str:
+    try:
+        return json.loads(f'"{raw}"')
+    except json.JSONDecodeError:
+        return raw
+
+
+def extract_first_json_string(block: str, key: str) -> Optional[str]:
+    match = re.search(rf'"{re.escape(key)}":"((?:\\.|[^"])*)"', block, re.S)
+    if not match:
+        return None
+    return decode_json_string_fragment(match.group(1))
+
+
+def extract_failure_from_mochawesome(
+    reporter_root: Path,
+    build_name: str,
+    host: str,
+) -> Optional[Tuple[str, str, str]]:
+    mochawesome_dir = reporter_root / "mochawesome"
+    if not mochawesome_dir.exists():
+        return None
+
+    candidates = sorted(
+        mochawesome_dir.glob(f"*{build_name}*.html"),
+        key=lambda path: (path.stat().st_mtime, 0 if path.name.endswith("_001.html") else 1),
+        reverse=True,
+    )
+    host_pattern = re.escape(host)
+
+    for html_path in candidates:
+        try:
+            text = html.unescape(html_path.read_text(encoding="utf-8", errors="replace"))
+        except OSError:
+            continue
+        for match in re.finditer(r'"fullTitle":"((?:\\.|[^"])*)"', text, re.S):
+            full_title = decode_json_string_fragment(match.group(1))
+            if host not in full_title:
+                continue
+            block_start = max(0, match.start() - 1200)
+            block_end = min(len(text), match.end() + 8000)
+            block = text[block_start:block_end]
+            if not re.search(r'"state":"failed"', block):
+                continue
+            testcase = extract_first_json_string(block, "title") or "failed testcase"
+            message = extract_first_json_string(block, "message") or ""
+            estack = extract_first_json_string(block, "estack") or ""
+            if testcase or message or estack:
+                return testcase, message, estack
+    return None
+
+
+def summarize_host_detail_with_mochawesome(detail: str, testcase: str, message: str) -> str:
+    prefix_match = re.match(r"^(\d+ tests, \d+ failures(?:, \d+ pending)?)", detail)
+    prefix = prefix_match.group(1) if prefix_match else detail
+    message_summary = compact_failure_detail(message or testcase, limit=260)
+    testcase_summary = compact_failure_detail(testcase, limit=140)
+    return f"{prefix} - {testcase_summary} - {message_summary}"
+
+
 def extract_host_results_from_run_finished_segment(segment_text: str, inventory: Dict[str, str]) -> Dict[str, HostResult]:
     host_results: Dict[str, HostResult] = {}
     normalized = re.sub(r"\n\s*│\s*s\s*│", "s", segment_text)
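To see what the `extract_first_json_string` helper in this hunk recovers, the snippet below restates it together with `decode_json_string_fragment` and runs both on a hypothetical fragment of embedded report JSON. The sample `block` is invented for illustration; real mochawesome HTML embeds a much larger document and its exact layout is not shown in this commit.

```python
import json
import re
from typing import Optional


def decode_json_string_fragment(raw: str) -> str:
    # Reuse the JSON parser to resolve escapes such as \" and \n.
    try:
        return json.loads(f'"{raw}"')
    except json.JSONDecodeError:
        return raw


def extract_first_json_string(block: str, key: str) -> Optional[str]:
    # Match "key":"value", letting the value contain backslash-escaped characters.
    match = re.search(rf'"{re.escape(key)}":"((?:\\.|[^"])*)"', block, re.S)
    if not match:
        return None
    return decode_json_string_fragment(match.group(1))


# Hypothetical fragment with an escaped quote and an escaped newline.
block = (
    '{"title":"3. Reboot","fullTitle":"Testing host-a 3. Reboot","state":"failed",'
    '"err":{"message":"Timed out: expected \\"up\\"\\nline2"}}'
)
title = extract_first_json_string(block, "title")
message = extract_first_json_string(block, "message")
missing = extract_first_json_string(block, "estack")
```

Because the lookup is a plain text search for `"title":"`, it does not match inside `"fullTitle":"`, and an absent key simply yields `None`.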
@@ -637,7 +737,7 @@ def collect_latest_host_reporter_artifact(
     if not logs_dir.exists():
         return None

-    latest: Optional[Tuple[str, HostResult]] = None
+    latest: Optional[Tuple[str, HostResult, datetime, str]] = None
     for host_dir in sorted(logs_dir.iterdir()):
         if not host_dir.is_dir():
             continue
@@ -661,14 +761,23 @@ def collect_latest_host_reporter_artifact(
                 continue
             artifact_ts = result.timestamp or reporter_artifact_run_timestamp(artifact_path) or artifact_mtime
             result.timestamp = artifact_ts
-            candidate = (host, result)
+            candidate = (host, result, artifact_ts, artifact_path.suffix)
             if latest is None:
                 latest = candidate
                 continue
-            latest_ts = latest[1].timestamp or datetime.fromtimestamp(0, tz=timezone.utc)
-            if artifact_mtime >= latest_ts:
+            latest_result_ts = latest[2]
+            latest_suffix = latest[3]
+            # Prefer the newest logical run artifact, and prefer JSON over TXT
+            # when both artifacts represent the same run timestamp.
+            if artifact_ts > latest_result_ts:
                 latest = candidate
-    return latest
+                continue
+            if artifact_ts == latest_result_ts and artifact_path.suffix == ".json" and latest_suffix != ".json":
+                latest = candidate
+    if latest is None:
+        return None
+    return latest[0], latest[1]


 def parse_host_reporter_json(artifact_path: Path, host: str, kernels: Dict[str, str]) -> Optional[HostResult]:
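The selection rule added in this hunk (newest logical run timestamp first, then `.json` over `.txt` at equal timestamps) can also be expressed as a single sort key. The sketch below is an alternative formulation for illustration only, not the code in this commit; `Artifact`, `pick_latest`, and the sample values are hypothetical.

```python
from datetime import datetime, timezone
from typing import List, Optional, Tuple

# Hypothetical record shape: (host, logical run timestamp, file suffix).
Artifact = Tuple[str, datetime, str]


def pick_latest(artifacts: List[Artifact]) -> Optional[Artifact]:
    """Newest run timestamp wins; at equal timestamps, .json beats anything else."""
    if not artifacts:
        return None
    # Tuples compare element by element, and False sorts before True.
    return max(artifacts, key=lambda item: (item[1], item[2] == ".json"))


earlier = datetime(2026, 3, 30, 20, 0, tzinfo=timezone.utc)
later = datetime(2026, 3, 30, 21, 0, tzinfo=timezone.utc)
chosen = pick_latest([
    ("host-a", later, ".txt"),
    ("host-a", later, ".json"),
    ("host-b", earlier, ".json"),
])
```

Comparing on the logical run timestamp rather than file mtime is what keeps a `.txt` written a moment after its matching `.json` from winning.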
@@ -681,8 +790,42 @@ def parse_host_reporter_json(artifact_path: Path, host: str, kernels: Dict[str,
     stats = payload.get("stats")
     metadata = payload.get("metadata")
+    tests_payload = payload.get("tests")
     if not isinstance(stats, dict):
+        if not isinstance(tests_payload, dict):
             return None
+        tests = len([name for name, events in tests_payload.items() if isinstance(name, str) and isinstance(events, list)])
+        failures = 0
+        pending = 0
+        duration_ms = 0
+        failure_detail = None
+        for testcase_name, testcase_events in tests_payload.items():
+            current_failure = extract_failure_from_reporter_events(testcase_name, testcase_events)
+            if current_failure:
+                failures += 1
+                if failure_detail is None:
+                    failure_detail = current_failure
+
+        timestamp = None
+        if isinstance(metadata, dict):
+            timestamp = parse_reporter_metadata_timestamp(metadata.get("timestamp"))
+        if timestamp is None:
+            timestamp = reporter_artifact_run_timestamp(artifact_path)
+
+        detail_parts = [f"{tests} tests", f"{failures} failures"]
+        detail = ", ".join(detail_parts)
+        if failures:
+            detail = append_failure_detail(detail, failure_detail)
+        status = "FAIL" if failures else "PASS"
+        return HostResult(
+            host=host,
+            kernel=kernels.get(host, "unknown"),
+            status=status,
+            detail=detail,
+            tests=tests,
+            failures=failures,
+            duration_seconds=None,
+            timestamp=timestamp,
+        )

     tests = int(stats.get("tests", 0) or 0)
     failures = int(stats.get("failures", 0) or 0)
@@ -699,20 +842,10 @@ def parse_host_reporter_json(artifact_path: Path, host: str, kernels: Dict[str,
         detail_parts.append(f"{pending} pending")

     failure_detail = None
-    tests_payload = payload.get("tests")
     if failures and isinstance(tests_payload, dict):
         for testcase_name, testcase_events in tests_payload.items():
-            if not isinstance(testcase_name, str) or not isinstance(testcase_events, list):
-                continue
-            for event in testcase_events:
-                if not isinstance(event, list) or len(event) < 3:
-                    continue
-                status_value = str(event[2]).lower()
-                message_value = str(event[1]) if len(event) > 1 else ""
-                if status_value in {"failed", "fail", "error"} or re.search(r"\b(Error|AssertionError|Timed out!)\b", message_value):
-                    failure_detail = compact_failure_detail(f"{testcase_name} - {message_value}")
-                    break
-            if failure_detail:
+            failure_detail = extract_failure_from_reporter_events(testcase_name, testcase_events)
+            if failure_detail is not None:
                 break

     if failures:
@@ -751,11 +884,13 @@ def parse_host_reporter_artifact(artifact_path: Path, host: str, kernels: Dict[s
     except OSError:
         text = ""

-    failure_detail = extract_failure_detail_from_text_blob(text)
+    sectioned_failure_detail = extract_failure_from_reporter_text(text)
+    failure_detail = extract_failure_detail_from_text_blob(text) or sectioned_failure_detail
     structured_failure = re.search(r"^(?:cy:command|cy:task)\terror\t", text, re.I | re.M)
     failures = 1 if failure_detail or structured_failure else 0
+    tests = len(re.findall(r"^=== .+? ===\s*$", text, re.M))
     status = "FAIL" if failures else "PASS"
-    detail = "1 failures" if failures else "completed"
+    detail = f"{tests} tests, 1 failures" if failures and tests else ("1 failures" if failures else "completed")
     if failures:
         detail = append_failure_detail(detail, failure_detail)
     return HostResult(
@@ -763,6 +898,7 @@ def parse_host_reporter_artifact(artifact_path: Path, host: str, kernels: Dict[s
         kernel=kernels.get(host, "unknown"),
         status=status,
         detail=detail,
+        tests=tests,
         failures=failures,
         timestamp=artifact_ts,
     )
@@ -1044,6 +1180,20 @@ def build_status_markdown(
     longest = max((h for h in ordered_hosts if h.duration_seconds is not None), key=lambda h: h.duration_seconds, default=None)
     average = (sum(durations) / len(durations)) if durations else None

+    additional_failure_notes: List[str] = []
+    for host in ordered_hosts:
+        if host.status != "FAIL":
+            continue
+        mochawesome_failure = extract_failure_from_mochawesome(reporter_root, build_name, host.host)
+        if not mochawesome_failure:
+            continue
+        testcase, message, estack = mochawesome_failure
+        host.detail = summarize_host_detail_with_mochawesome(host.detail, testcase, message)
+        failure_excerpt_source = estack or message
+        additional_failure_notes.append(
+            f"{host.host} failure excerpt: `{compact_failure_detail(failure_excerpt_source, limit=420)}`"
+        )
+
     host_lines = ["| Host | Kernel | Status | Detail |", "| --- | --- | --- | --- |"]
     for host in ordered_hosts:
         icon = {
@@ -1070,6 +1220,7 @@ def build_status_markdown(
         notes = notes + [
             "Both iscsi and fc disks were used for the reboot test. As a result, iscsi disks may not have attached before the mtdi started. So if the test failed, that is most likely the issue."
         ]
+    notes = notes + additional_failure_notes

     notes_block = "\n".join(f"- {note}" for note in notes) if notes else "- none"
     resolved_flow = extract_test_flow_from_generated_spec(reporter_root, log_text) or get_test_flow(metadata.get("template"))