Keep categorized ATVM watcher alive until parent run finishes
- update the watcher to treat categorized parent-run activity as the authoritative signal for whether the overall request is still running
- prevent the watcher from exiting early just because one categorized grouped sub-run completed and wrote artifacts
- document that categorized watcher instances must remain alive between grouped runs until the parent request has actually gone inactive past the grace window
- update the ATVM guide, watcher design, and install docs to reflect the stricter categorized parent-run completion rule
@@ -45,6 +45,8 @@ Run ATVM CMC automation tests on the designated automation VM without unintended
 - Treat `approve with watcher` as approval to run and also start the per-run watcher service for that build.
 - When `--categorize` is used with watcher enabled, treat the watcher as a sequential grouped-run watcher:
   - it must post one final Mattermost status per completed categorized group/sub-run
+  - it must stay active between grouped sub-runs while the parent categorized request is still running
+  - it must not stop after the first grouped run simply because one grouped run completed
   - it must not wait and replace those with one single parent-only post
 - After execution, report immediate success/failure only.
 - Do not actively monitor completion unless explicitly requested.
@@ -42,6 +42,8 @@ A categorized run must be treated differently:
 - the watcher must wait for that grouped sub-run to complete
 - then send that grouped sub-run's final Mattermost status
 - then continue watching for the next grouped sub-run
+- the watcher must remain alive while the parent categorized request or related child Cypress process is still active
+- one completed grouped sub-run must not be treated as proof that the parent categorized request is finished
 - the watcher must not wait until the very end to send one single parent-only post
 
 Evidence sources:
@@ -120,6 +120,7 @@ Recommended permissions:
 - if the run uses `--categorize`, also pass `--categorize` to the watcher start helper
 - confirm final Mattermost delivery for a completed run
 - confirm categorized execution sends one post per completed grouped sub-run
+- confirm the watcher stays alive between categorized grouped runs while the parent request is still active
 - confirm reused parent build names do not inherit stale `cancelled.marker`, `posted.marker`, or `subruns/` state from older runs
 
 ## Recommended Validation Commands
@@ -191,6 +192,7 @@ The cancel helper should:
 - This is not a daemon.
 - One watcher instance is started per ATVM run.
 - Categorized execution is treated as one watcher instance tracking sequential grouped ATVM sub-runs.
+- In categorized execution, the watcher must remain alive until the parent request has actually gone inactive past the grace window, even if one grouped sub-run already completed.
 - The watcher exits after the run reaches a terminal state.
 - The watcher writes state under `/var/lib/atvm-run-watcher/<build-name>`.
 - The watcher prevents duplicate Mattermost posts by writing posted markers.
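The posted-marker rule in the hunk above can be sketched as follows. This is a minimal illustration, not the watcher's actual code: the helper name `post_once`, the `subruns/<name>/posted.marker` layout, and the `send` callback are assumptions made for the example, and a temporary directory stands in for `/var/lib/atvm-run-watcher/<build-name>`.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def post_once(state_dir: Path, subrun: str, send: Callable[[str], None]) -> bool:
    """Send a status for `subrun` unless a marker shows it was already posted.

    Hypothetical helper: the marker path layout is assumed for illustration.
    """
    marker = state_dir / "subruns" / subrun / "posted.marker"
    if marker.exists():
        return False
    send(subrun)
    # Write the marker only after a successful send, so a failed send is retried.
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()
    return True


sent: List[str] = []
with tempfile.TemporaryDirectory() as tmp:
    state_dir = Path(tmp) / "demo-build"  # stand-in for the real state directory
    first = post_once(state_dir, "group-a", sent.append)
    second = post_once(state_dir, "group-a", sent.append)  # duplicate, skipped
    third = post_once(state_dir, "group-b", sent.append)

print(first, second, third, sent)
```

Because the marker lives on disk, the check would also hold across watcher restarts, which is presumably why stale markers from a reused build name need to be cleared.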
@@ -65,6 +65,8 @@ Typical workflow:
 - detect each grouped sub-run in sequence from the parent run log
 - wait for that grouped sub-run to finish
 - send one Mattermost post for that grouped sub-run if it reached `COMPLETED` or `FAILED`
+- keep the watcher alive while the parent categorized runner or related child Cypress process is still active
+- do not treat one completed grouped sub-run as proof that the whole parent request is finished
 - continue to the next grouped sub-run
 - exit after the parent request reaches a terminal state
 
@@ -108,6 +108,16 @@ def process_active(build_name: str) -> bool:
     return False
 
 
+def related_process_active(build_name: str) -> bool:
+    output = run_ps()
+    for line in output.splitlines():
+        if build_name not in line:
+            continue
+        if any(token in line for token in ("run-sorry-cypress.py", "cypress-cloud", "node ")):
+            return True
+    return False
+
+
 def extract_active_subrun_build(build_name: str) -> Optional[str]:
     output = run_ps()
     matches: List[str] = []
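To illustrate the matching rule that `related_process_active` applies, here is a minimal standalone sketch. It takes the process listing as a string instead of calling `run_ps()`, so it can be run without the watcher. The function name `related_process_active_from`, the sample process lines, and the build names are hypothetical.

```python
from typing import Tuple

# Same tokens as in the hunk above.
RELATED_TOKENS: Tuple[str, ...] = ("run-sorry-cypress.py", "cypress-cloud", "node ")


def related_process_active_from(ps_output: str, build_name: str) -> bool:
    """Return True if any listed process mentions the build name and a related token."""
    for line in ps_output.splitlines():
        if build_name not in line:
            continue
        if any(token in line for token in RELATED_TOKENS):
            return True
    return False


# Hypothetical process listing; real `ps` output and command lines will differ.
SAMPLE_PS = "\n".join([
    "1201 python3 run-sorry-cypress.py demo-build",
    "1377 node /opt/app/cypress-cloud other-build",
    "1502 sshd: automation",
])

print(related_process_active_from(SAMPLE_PS, "demo-build"))
print(related_process_active_from(SAMPLE_PS, "missing-build"))
```

Both checks are plain substring tests, so a build name that is a prefix of another build name would also match that other build's processes.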
@@ -713,7 +723,7 @@ def determine_state(
 ) -> Tuple[str, List[Dict[str, object]], Dict[str, HostResult], Optional[datetime], Optional[datetime], Optional[str], List[str]]:
     cancelled_marker = build_dir / "cancelled.marker"
     log_text = read_text(run_log)
-    active = process_active(build_name)
+    active = related_process_active(build_name) if metadata.get("categorized") else process_active(build_name)
     cancelled = cancelled_marker.exists()
     notes: List[str] = []
     subrun_states: List[Dict[str, object]] = []
@@ -776,6 +786,10 @@ def determine_state(
             return "HUNG", subrun_states, parent_host_results, start_ts, end_ts, currents_url, notes
         return "RUNNING", subrun_states, parent_host_results, start_ts, end_ts, currents_url, notes
 
+    if metadata.get("categorized") and process_gone_since and (now_utc() - process_gone_since).total_seconds() < process_exit_grace_seconds:
+        notes.append("Categorized parent runner has not been gone long enough to treat the request as finished.")
+        return "RUNNING", subrun_states, parent_host_results, start_ts, end_ts, currents_url, notes
+
     terminal_subruns = [subrun for subrun in subrun_states if subrun["state"] in {"COMPLETED", "FAILED"}]
     if terminal_subruns:
         state = "FAILED" if any(result.failures for result in parent_host_results.values()) else "COMPLETED"
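The grace-window condition added in the hunk above can be isolated as a small predicate. The sketch below is an illustration under stated assumptions: `within_exit_grace` is a hypothetical helper name, the 120-second grace value is made up, and the current time is passed in explicitly rather than read from `now_utc()`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def within_exit_grace(
    process_gone_since: Optional[datetime],
    now: datetime,
    grace_seconds: float,
) -> bool:
    """True while the parent runner has been gone for less than the grace window."""
    if process_gone_since is None:
        # No recorded disappearance time, so the grace rule does not apply.
        return False
    return (now - process_gone_since).total_seconds() < grace_seconds


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
gone_30s = now - timedelta(seconds=30)
gone_5m = now - timedelta(minutes=5)

# Assumed grace window of 120 seconds, for illustration only.
print(within_exit_grace(gone_30s, now, grace_seconds=120))
print(within_exit_grace(gone_5m, now, grace_seconds=120))
print(within_exit_grace(None, now, grace_seconds=120))
```

While this predicate holds for a categorized run, the hunk returns `RUNNING` before the terminal sub-run check is reached, so a completed grouped sub-run alone cannot end the watcher.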