Fix categorized ATVM watcher host result recovery
@@ -73,6 +73,9 @@ Run ATVM CMC automation tests on the designated automation VM without unintended
 - per-run HTML reports
 - When a machine fails, use the matching `logs/` entry first to capture the detailed failure context for that host.
 - When reconstructing historical status, prefer `cmcReporter` artifacts over less-specific runner output because they preserve per-host results after the live run has ended.
+- Do not treat the existence of a per-host reporter artifact by itself as proof that the host passed.
+- For categorized grouped recovery, prefer the matching per-host reporter JSON or mochawesome result and carry through the real `failures`, `pending`, and failure message instead of assuming `PASS completed`.
+- If grouped XML only contains `check-xml-files.ts`, cross-check the grouped result against the per-host reporter artifacts before posting or repeating status for that grouped sub-run.
 
 Typical sequence:
 1. Build the exact `cmc-templates.py` and `run-sorry-cypress.py` commands for the request.
@@ -81,8 +84,9 @@ Typical sequence:
 4. Run `cmc-templates.py` with the approved options.
 5. Wait for `cmc-templates.py` to fully finish and confirm success.
 6. Verify the generated `.ts` files and the config `specPattern` include every requested VM before starting the runner.
-7. If the watcher is approved, start the watcher before launching `run-sorry-cypress.py`.
-8. Run `run-sorry-cypress.py` with the matching approved config and build name.
+7. If the watcher is approved, make sure the controller's deployed watcher code is the intended version before relying on its posts.
+8. If the watcher is approved, start the watcher before launching `run-sorry-cypress.py`.
+9. Run `run-sorry-cypress.py` with the matching approved config and build name.
 
 ## Config File / Gold Disk Mapping
 - `cypress.atvm-config-gold.ts` -> Gold Disk 1
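
The guidance added in the hunks above asks for a cross-check whenever grouped XML only contains `check-xml-files.ts`. A minimal sketch of that detection step follows. It assumes the grouped XML is JUnit-style with a `file` attribute on each `<testsuite>`; that layout, the function name `grouped_xml_has_only_sentinel`, and the sample paths are illustrative assumptions, not taken from the watcher code.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# Spec name taken from the guidance above; everything else is assumed.
SENTINEL_SPEC = "check-xml-files.ts"


def grouped_xml_has_only_sentinel(xml_text: str) -> bool:
    """Return True when the only spec file named in the XML is the sentinel.

    Assumes JUnit-style output where each <testsuite> may carry a `file`
    attribute holding the spec path (an assumption about the report layout).
    """
    root = ET.fromstring(xml_text)
    spec_names = {
        PurePosixPath(suite.get("file")).name
        for suite in root.iter("testsuite")
        if suite.get("file")
    }
    # An XML with no named specs at all is not treated as "sentinel only".
    return spec_names == {SENTINEL_SPEC}


sentinel_only = (
    '<testsuites><testsuite name="Root Suite" '
    'file="cypress/e2e/check-xml-files.ts"><testcase name="x"/></testsuite></testsuites>'
)
print(grouped_xml_has_only_sentinel(sentinel_only))
```

When this returns `True`, the grouped XML carries no host testcase results, so the per-host reporter artifacts are the place to look before posting status.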
@@ -356,3 +356,20 @@ This file stores run-specific examples only when a run produced a new learning r
 - Keep the maintained `--exclude_partial_match` list for broad selectors such as `--containsVm` or `--randomize`.
 - When the operator uses `--specify_vms`, do not auto-add the blacklist unless they explicitly request it.
 - Even when the operator uses `--specify_vms`, first check whether any requested VM is on the maintained blacklist and stop instead of launching it if one is included.
+
+## Run Learning: 2026-03-30 (Controller watcher deployment must match the repo watcher before trusting live posts)
+- Observed failure mode:
+- The repo watcher had the corrected `cmc-reboot` flow, but the controller install at `/opt/atvm-watcher-service/atvm_run_watcher.py` still had the old generic 5-step fallback.
+- A live categorized reboot subrun therefore posted the stale 5-step `TEST FLOW:` even though the repo copy had already been fixed.
+- Action for future runs:
+- Before trusting watcher-generated live posts for new watcher behavior, verify that the controller install matches the intended repo watcher version.
+- If the controller install is stale and the operator approves it, deploy the updated watcher code to `/opt/atvm-watcher-service` and restart only the watcher instance for the active build.
+
+## Run Learning: 2026-03-30 (Categorized grouped recovery must parse real per-host reporter status, not assume pass)
+- Observed failure mode:
+- A categorized Red Hat reboot subrun posted both hosts as passed even though `atvm71-redhat9.1` actually failed during `1. Verifying set up`.
+- The grouped XML only contained `check-xml-files.ts`, and the watcher incorrectly treated the presence of a per-host reporter artifact as `PASS completed`.
+- Action for future runs:
+- When grouped XML lacks explicit host testcase results, recover grouped host status from the per-host reporter JSON or equivalent detailed artifact.
+- Carry through the real `failures`, `pending`, and failure message from that host artifact instead of assuming `PASS completed`.
+- If a correction post is needed because stale or reconstructed state was wrong, mark it explicitly as a correction that supersedes the earlier result.
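
Carrying through the real `failures`, `pending`, and failure message could look like the sketch below. It assumes a mochawesome-style layout: a `stats` object plus nested `results`, `suites`, and `tests`, with `fail` and `err.message` on each test. The actual `cmcReporter` JSON schema is not shown in this commit, so the field names other than `stats`, the helper names, and the sample payload are all assumptions.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_tests(suite: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every test dict in a suite, descending into nested suites."""
    for test in suite.get("tests") or []:
        if isinstance(test, dict):
            yield test
    for child in suite.get("suites") or []:
        if isinstance(child, dict):
            yield from iter_tests(child)


def summarize_host_payload(payload: Dict[str, Any]) -> Tuple[str, str, List[str]]:
    """Return (status, detail, failure_messages) for one host payload."""
    stats = payload.get("stats")
    if not isinstance(stats, dict):
        stats = {}
    tests = int(stats.get("tests", 0) or 0)
    failures = int(stats.get("failures", 0) or 0)
    pending = int(stats.get("pending", 0) or 0)

    messages: List[str] = []
    for root in payload.get("results") or []:
        if not isinstance(root, dict):
            continue
        for test in iter_tests(root):
            if test.get("fail"):
                err = test.get("err") or {}
                title = test.get("fullTitle") or test.get("title") or "unnamed test"
                messages.append(f"{title}: {err.get('message', 'no message recorded')}")

    # Same precedence as the guidance: failures first, then pending, else pass.
    if failures or messages:
        status = "FAIL"
    elif pending:
        status = "RUN"
    else:
        status = "PASS"

    detail = f"{tests} tests, {failures} failures"
    if pending:
        detail += f", {pending} pending"
    return status, detail, messages


# Made-up sample payload; the failure text is a placeholder.
sample = {
    "stats": {"tests": 3, "failures": 1, "pending": 0},
    "results": [{"suites": [{"tests": [
        {"title": "1. Verifying set up", "fail": True, "err": {"message": "sample failure text"}},
        {"title": "2. Another step", "fail": False, "err": {}},
    ]}]}],
}
print(summarize_host_payload(sample))
```

Returning the messages alongside the status keeps the failing step name available for the post, which is the part that a bare `PASS completed` fallback loses.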
@@ -129,5 +129,8 @@ This writes a cancellation marker, updates `state.json` to `CANCELLED`, and stop
 - Categorized execution is treated as sequential grouped ATVM sub-runs, not as one parent run with internal phases.
 - In categorized mode, the watcher writes per-subrun state under `subruns/` and posts each completed grouped run separately.
 - In categorized mode, if the child build id label does not match the host/spec actually being executed, the watcher reports the grouped run using the inferred host-based group name instead of trusting the raw child build id label.
+- In categorized mode, grouped XML can finish with only `check-xml-files.ts`; when that happens, the watcher must recover per-host results from the matching host reporter artifacts.
+- Do not infer `PASS completed` from host artifact presence alone. Parse the per-host reporter result and preserve real `FAIL` and `RUN/pending` state when reconstructing grouped results.
+- When the repo copy of the watcher changes, the controller install under `/opt/atvm-watcher-service` must be updated before expecting the new reporting behavior from live runs.
 - Best-practice controller install path: `/opt/atvm-watcher-service`.
 - This package is local-only right now. Nothing here is installed on the controller yet.
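
One way to confirm that a controller install matches the repo copy before trusting live posts is to compare file digests. This is a minimal sketch and not part of the watcher: the function names are hypothetical, and the repo-side path in the usage note is a placeholder for wherever the repo copy lives.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def watcher_matches_repo(repo_copy: Path, deployed_copy: Path) -> bool:
    """True only when the deployed file exists and is byte-identical to the repo copy."""
    if not deployed_copy.is_file():
        # Nothing installed yet counts as "not matching".
        return False
    return file_sha256(repo_copy) == file_sha256(deployed_copy)
```

A call such as `watcher_matches_repo(Path("<repo>/atvm_run_watcher.py"), Path("/opt/atvm-watcher-service/atvm_run_watcher.py"))` returns `False` both for a stale install and for a controller where nothing has been installed, so either case stops before live posts are trusted.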
@@ -607,6 +607,78 @@ def collect_latest_host_reporter_artifact(
     return latest
 
 
+def parse_host_reporter_json(artifact_path: Path, host: str, kernels: Dict[str, str]) -> Optional[HostResult]:
+    try:
+        payload = json.loads(artifact_path.read_text(encoding="utf-8"))
+    except (json.JSONDecodeError, OSError):
+        return None
+    if not isinstance(payload, dict):
+        return None
+
+    stats = payload.get("stats")
+    metadata = payload.get("metadata")
+    if not isinstance(stats, dict):
+        return None
+
+    tests = int(stats.get("tests", 0) or 0)
+    failures = int(stats.get("failures", 0) or 0)
+    pending = int(stats.get("pending", 0) or 0)
+    duration_ms = stats.get("duration", 0) or 0
+    timestamp = None
+    if isinstance(metadata, dict):
+        timestamp = parse_reporter_metadata_timestamp(metadata.get("timestamp"))
+    if timestamp is None:
+        timestamp = reporter_artifact_run_timestamp(artifact_path)
+
+    detail_parts = [f"{tests} tests", f"{failures} failures"]
+    if pending:
+        detail_parts.append(f"{pending} pending")
+
+    if failures:
+        status = "FAIL"
+    elif pending:
+        status = "RUN"
+    else:
+        status = "PASS"
+
+    return HostResult(
+        host=host,
+        kernel=kernels.get(host, "unknown"),
+        status=status,
+        detail=", ".join(detail_parts),
+        tests=tests,
+        failures=failures,
+        duration_seconds=float(duration_ms) / 1000 if duration_ms else None,
+        timestamp=timestamp,
+    )
+
+
+def parse_host_reporter_artifact(artifact_path: Path, host: str, kernels: Dict[str, str]) -> Optional[HostResult]:
+    if artifact_path.suffix == ".json":
+        return parse_host_reporter_json(artifact_path, host, kernels)
+
+    artifact_ts = reporter_artifact_run_timestamp(artifact_path)
+    if artifact_ts is None:
+        artifact_ts = datetime.fromtimestamp(artifact_path.stat().st_mtime, tz=timezone.utc)
+
+    try:
+        text = artifact_path.read_text(encoding="utf-8", errors="replace")
+    except OSError:
+        text = ""
+
+    failures = 1 if re.search(r"\bTimed out!\b|\bAssertionError\b|\bFAIL\b|\berror\b", text, re.I) else 0
+    status = "FAIL" if failures else "PASS"
+    detail = "1 failures" if failures else "completed"
+    return HostResult(
+        host=host,
+        kernel=kernels.get(host, "unknown"),
+        status=status,
+        detail=detail,
+        failures=failures,
+        timestamp=artifact_ts,
+    )
+
+
 def collect_group_host_reporter_artifacts(
     reporter_root: Path,
     group_label: Optional[str],
@@ -632,7 +704,7 @@ def collect_group_host_reporter_artifacts(
             continue
 
         latest_artifact_mtime: Optional[datetime] = None
-        latest_run_ts: Optional[datetime] = None
+        latest_result: Optional[HostResult] = None
         for artifact_path in sorted(host_dir.iterdir()):
             if artifact_path.suffix not in {".txt", ".json"}:
                 continue
@@ -641,20 +713,24 @@ def collect_group_host_reporter_artifacts(
                 continue
             if run_ended_at and artifact_mtime >= run_ended_at:
                 continue
-            if latest_artifact_mtime is None or artifact_mtime >= latest_artifact_mtime:
-                latest_artifact_mtime = artifact_mtime
-                latest_run_ts = reporter_artifact_run_timestamp(artifact_path) or artifact_mtime
 
-        if latest_artifact_mtime is None:
-            continue
+            parsed_result = parse_host_reporter_artifact(artifact_path, host, kernels)
+            if parsed_result is None:
+                continue
 
-        results[host] = HostResult(
-            host=host,
-            kernel=kernels.get(host, "unknown"),
-            status="PASS",
-            detail="completed",
-            timestamp=latest_run_ts or latest_artifact_mtime,
-        )
+            # Prefer the newest artifact, and prefer JSON over TXT when timestamps tie.
+            if latest_artifact_mtime is None or artifact_mtime > latest_artifact_mtime or (
+                artifact_mtime == latest_artifact_mtime and artifact_path.suffix == ".json"
+            ):
+                latest_artifact_mtime = artifact_mtime
+                latest_result = parsed_result
+
+        if latest_artifact_mtime is None or latest_result is None:
+            continue
+
+        if latest_result.timestamp is None:
+            latest_result.timestamp = latest_artifact_mtime
+        results[host] = latest_result
     return results
 
 
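
The selection rule in the rewritten loop (the newest artifact wins, and JSON beats TXT on an exact timestamp tie) can be checked in isolation. The sketch below re-implements only that comparison over `(path, mtime)` pairs; `pick_preferred_artifact` and the sample file names are hypothetical, and no artifact is actually parsed.

```python
from datetime import datetime, timedelta, timezone
from pathlib import PurePosixPath
from typing import Iterable, Optional, Tuple

Candidate = Tuple[PurePosixPath, datetime]


def pick_preferred_artifact(candidates: Iterable[Candidate]) -> Optional[PurePosixPath]:
    """Pick the newest artifact; on an exact mtime tie, a .json candidate wins."""
    best_path: Optional[PurePosixPath] = None
    best_mtime: Optional[datetime] = None
    # Visit candidates in path order, as sorted(host_dir.iterdir()) does in the diff.
    for path, mtime in sorted(candidates, key=lambda item: item[0]):
        if best_mtime is None or mtime > best_mtime or (
            mtime == best_mtime and path.suffix == ".json"
        ):
            best_path, best_mtime = path, mtime
    return best_path


t0 = datetime(2026, 3, 30, 12, 0, tzinfo=timezone.utc)
tie = [(PurePosixPath("run.txt"), t0), (PurePosixPath("run.json"), t0)]
newer_txt = [
    (PurePosixPath("run.json"), t0),
    (PurePosixPath("run.txt"), t0 + timedelta(seconds=5)),
]
print(pick_preferred_artifact(tie))        # run.json
print(pick_preferred_artifact(newer_txt))  # run.txt
```

One consequence of this rule is that a newer `.txt` still wins over an older `.json`, so the keyword-based text heuristic can override structured results when the text file was written later.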