atvm: fail runs explicitly on hang-kill and runner non-zero exit

2026-05-07 13:34:37 -04:00
parent e3497111dd
commit 65330ee9f8
3 changed files with 65 additions and 0 deletions

@@ -63,6 +63,8 @@ Run ATVM CMC automation tests on the designated automation VM without unintended
- If monitoring is requested, allow long runtime windows (15-30+ minutes) and continue until completion unless operator instructs otherwise.
- Report command errors immediately.
- `sshpass` may be used where password-based SSH automation is required.
- Treat runner hang-kill events (`Sending SIGKILL ... due to no change` / `Max hang retries reached`) as explicit `FAILED` outcomes, not `RUNNING` or ambiguous termination.
- For manual `run-sorry-cypress.py` execution, treat `ATVM_HANG_FAIL ...` log markers and `/tmp/atvm-runner-state-<build>.json` terminal state files as the source of truth for hang-failure terminal status.
## Core Scripts
- Template prep: `/root/cdc-e2e-cyp-12.17.4/cmc-templates.py`

@@ -33,6 +33,20 @@ This file stores run-specific examples only when a run produced a new learning r
- For Windows-involved ATVM automation runs, add `--hang_retries 0` to `run-sorry-cypress.py` by default unless the operator explicitly requests a different value.
- Keep this as an operator-default behavior even though the underlying runner option is generic and not Windows-only in code.
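
The operator default above can be sketched as a small argument helper. This is a minimal illustration, not part of `run-sorry-cypress.py`: the function name `with_default_hang_retries` and its parameters are hypothetical, and only the `--hang_retries` flag itself comes from the notes above.

```python
from typing import List, Optional

FLAG = "--hang_retries"  # flag name taken from the notes above


def with_default_hang_retries(
    args: List[str],
    windows_involved: bool,
    operator_value: Optional[int] = None,
) -> List[str]:
    """Return a copy of args with the hang-retry default applied.

    Hypothetical helper: an explicit operator value always wins; otherwise
    Windows-involved runs get `--hang_retries 0` unless the flag is already set.
    """
    result = list(args)
    already_set = any(a == FLAG or a.startswith(FLAG + "=") for a in result)
    if operator_value is not None:
        if not already_set:
            result += [FLAG, str(operator_value)]
        return result
    if windows_involved and not already_set:
        result += [FLAG, "0"]
    return result
```

Keeping the default in the caller rather than in the runner matches the note that the runner option itself stays generic.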
## Run Learning: 2026-05-07 (Treat hang-kill as explicit failure)
- Observed failure mode:
  - A run can stall long enough for `run-sorry-cypress.py` to force-kill Cypress (`Sending SIGKILL ... due to no change`) and yet still be reported with an ambiguous terminated state.
- Action for future runs:
  - When run logs contain hang-kill markers (`Sending SIGKILL ... due to no change` and `Max hang retries reached.`), classify the run as `FAILED`.
  - When the runner service exits non-zero, classify the run as `FAILED` instead of a generic terminated state.
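
The classification rule above could be expressed roughly as follows. This is a sketch under stated assumptions: the function name `classify_as_failed` is hypothetical, either marker alone is treated as sufficient, and the elided middle of the SIGKILL log line is matched loosely instead of being guessed.

```python
import re
from typing import Optional

# Marker text is quoted from the run logs described above. The "..." portion
# of the SIGKILL line is unknown, so it is matched with a wildcard.
HANG_KILL_PATTERNS = (
    re.compile(r"Sending SIGKILL .* due to no change"),
    re.compile(r"Max hang retries reached"),
)


def classify_as_failed(log_text: str, exit_code: Optional[int]) -> bool:
    """Return True when the run should be reported as FAILED.

    exit_code is None while the runner has not exited yet (assumption).
    """
    # Any hang-kill marker in the log is an explicit failure.
    if any(p.search(log_text) for p in HANG_KILL_PATTERNS):
        return True
    # A non-zero runner exit is also an explicit failure.
    return exit_code is not None and exit_code != 0
```

A run with no markers and no exit code yet returns `False`, which leaves the decision to whatever other status checks are in place.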
## Run Learning: 2026-05-07 (Manual runner emits explicit hang-fail markers and terminal state)
- Observed failure mode:
  - Manual `run-sorry-cypress.py` execution can appear "still running" after hang-kill handling because the failure state was not emitted as a machine-readable terminal marker.
- Action for future runs:
  - `run-sorry-cypress.py` now emits `ATVM_HANG_FAIL ...` on hang-kill paths and writes terminal state JSON under `/tmp/atvm-runner-state-<build>.json`.
  - Max hang-retry exhaustion now writes terminal failure state before exiting non-zero, in both categorized and non-categorized flows.
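
One way to collect this evidence for a manual run is sketched below. The helper name `hang_failure_evidence` and its return shape are hypothetical. The JSON schema of the state file is not described here, so the sketch loads the file without assuming any keys.

```python
import json
from pathlib import Path
from typing import Any, Dict


def hang_failure_evidence(build: str, log_text: str, state_dir: str = "/tmp") -> Dict[str, Any]:
    """Gather ATVM_HANG_FAIL log lines and the terminal state file for a build.

    Hypothetical helper: the state JSON is returned as parsed, with no
    assumptions about its fields.
    """
    state_path = Path(state_dir) / f"atvm-runner-state-{build}.json"
    marker_lines = [line for line in log_text.splitlines() if "ATVM_HANG_FAIL" in line]

    state = None
    if state_path.is_file():
        try:
            state = json.loads(state_path.read_text())
        except ValueError:
            # Unreadable or partially written JSON: report it as missing state.
            state = None

    return {
        "marker_lines": marker_lines,
        "state_path": str(state_path),
        "state": state,
    }
```

If `marker_lines` is non-empty or `state` is present, the run has reached a terminal state and should not be reported as still running.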
## Run Learning: 2026-05-02 (Do not reuse the previous controller status check for a new ATVM request)
- Observed failure mode:
  - A later ATVM run request was blocked because the assistant reused the immediately previous controller status result instead of performing a fresh live running-state check at request time.