Update ATVM watcher for categorized sub-run posting

- update the watcher design and automation guide to treat --categorize as sequential ATVM sub-runs rather than one parent run with internal phases
- document that categorized runs should send one Mattermost status per completed grouped sub-run instead of one parent-only final post
- add a --categorize option to the watcher start helper so categorized mode is explicit in watcher startup
- update the watcher implementation to track categorized sub-runs separately, write per-subrun state, and post each completed grouped run once
2026-03-26 11:00:39 -04:00
parent 68cd428733
commit d60b8b9b18
6 changed files with 399 additions and 89 deletions
--- a/atvm/docs/automation/guide.md
+++ b/atvm/docs/automation/guide.md
@@ -43,6 +43,9 @@ Run ATVM CMC automation tests on the designated automation VM without unintended
 - Execute ATVM run commands only after explicit approval.
 - Treat `approve` as approval to run without the watcher service.
 - Treat `approve with watcher` as approval to run and also start the per-run watcher service for that build.
 - When `--categorize` is used with watcher enabled, treat the watcher as a sequential grouped-run watcher:
  - it must post one final Mattermost status per completed categorized group/sub-run
  - it must not wait and replace those with one single parent-only post
 - After execution, report immediate success/failure only.
 - Do not actively monitor completion unless explicitly requested.
 - If monitoring is requested, allow long runtime windows (15-30+ minutes) and continue until completion unless operator instructs otherwise.
@@ -154,13 +157,14 @@ Before any new automation request:
 4. When the watcher is available, present the watcher-start command separately from the core run commands.
 5. Treat `approve` as approval to execute the ATVM run without starting the watcher.
 6. Treat `approve with watcher` as approval to execute the ATVM run and start the watcher for that build.
-7. Run only approved command(s), no extra options and no silent substitutions.
+7. If the run uses `--categorize` and the watcher is requested, include `--categorize` on the watcher start command too so the watcher tracks sequential categorized sub-runs correctly.
-8. When both template generation and the Cypress runner are requested, run them sequentially, not in parallel.
+8. Run only approved command(s), no extra options and no silent substitutions.
-9. Do not launch `run-sorry-cypress.py` until `cmc-templates.py` has exited successfully and finished updating the intended config/spec files.
+9. When both template generation and the Cypress runner are requested, run them sequentially, not in parallel.
-10. Treat displayed commands as a review gate: do not execute either command until the operator has had a chance to review them and explicitly approve.
+10. Do not launch `run-sorry-cypress.py` until `cmc-templates.py` has exited successfully and finished updating the intended config/spec files.
-11. If the operator asks to change plugin, config, filters, build name, Gold Disk, or scope after commands are shown, discard the old plan, show the revised commands, and wait for new approval.
+11. Treat displayed commands as a review gate: do not execute either command until the operator has had a chance to review them and explicitly approve.
-12. If monitoring was not requested, report immediate success/failure for each command.
+12. If the operator asks to change plugin, config, filters, build name, Gold Disk, or scope after commands are shown, discard the old plan, show the revised commands, and wait for new approval.
-13. If monitoring was requested, keep monitoring until completion and report final outcome.
+13. If monitoring was not requested, report immediate success/failure for each command.
 14. If monitoring was requested, keep monitoring until completion and report final outcome.
 ## Requested Test Style
 When asked for one VM or a VM set:
@@ -193,6 +197,7 @@ When asked for one VM or a VM set:
 - Use the same ATVM status layout that would be shown to the operator locally when posting to Mattermost.
 - Default status template: `/home/aw/code/cds/atvm/docs/automation/status-template.md`
 - Do not post to Mattermost unless the operator explicitly asks for the run status to be sent there.
 - For categorized execution with watcher enabled, send one Mattermost status per completed categorized sub-run/group after that grouped run fully finishes.
 ## Status Reporting Format
 When the operator asks for the status of an ATVM automation run, report in this order:
--- a/atvm/docs/automation/mattermost-watcher-design.md
+++ b/atvm/docs/automation/mattermost-watcher-design.md
@@ -1,7 +1,7 @@
 # ATVM Mattermost Watcher Design
 ## Purpose
-Design a controller-local watcher on the ATVM Cypress machine (`192.168.3.190`) that monitors an ATVM automation run and posts the final run status to Mattermost only after the run has fully completed.
+Design a controller-local watcher on the ATVM Cypress machine (`192.168.3.190`) that monitors an ATVM automation run and posts final run status to Mattermost only after the watched scope has fully completed.
 This watcher must continue working even if the local operator machine is offline.
@@ -9,9 +9,10 @@ This watcher must continue working even if the local operator machine is offline
 Use a `systemd`-managed watcher on the ATVM Cypress controller.
 Recommended structure:
- one watcher script that evaluates the state of a specific ATVM run
+- one watcher script that evaluates a specific ATVM run request
 - one `systemd` service to execute the watcher
- optionally one `systemd` timer for periodic polling if the watcher is not implemented as a long-running process
+- no always-on daemon
 - for categorized ATVM runs, one watcher instance tracks the parent request and posts each categorized sub-run separately as those grouped runs complete
 Preferred deployment target:
 - controller host: `192.168.3.190`
@@ -26,14 +27,23 @@ Expected variables:
 - `MATTERMOST_ATVM_CHANNEL`
 ## Run Completion Rule
-The watcher must send Mattermost results only after the ATVM run has fully completed.
+The watcher must send Mattermost results only after the watched scope has fully completed.
-A run is considered fully completed only when:
+A non-categorized run is considered fully completed only when:
 - there are no active runner processes for the run
 - the expected machine scope has final result artifacts
 - no machine remains in `RUNNING` or `NOT STARTED`
 - final reporter artifacts confirm the run has ended
 A categorized run must be treated differently:
 - `--categorize` splits the request into sequential ATVM sub-runs
 - each categorized group is its own run/job
 - the watcher must detect each grouped sub-run in order
 - the watcher must wait for that grouped sub-run to complete
 - then send that grouped sub-run's final Mattermost status
 - then continue watching for the next grouped sub-run
 - the watcher must not wait until the very end to send one single parent-only post
 Evidence sources:
 - live runner processes on `192.168.3.190`
 - `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/logs/`
@@ -70,7 +80,7 @@ Definitions:
  - the run is still active and not yet complete
 ## Mattermost Posting Rule
-Post to Mattermost only when the run has fully completed.
+Post to Mattermost only when the watched scope has fully completed.
 Send Mattermost status for:
 - `COMPLETED`
@@ -86,6 +96,9 @@ Do not send Mattermost status for:
 Important clarification:
 - a completed run with failed hosts should still be posted
 - a cancelled, terminated, hung, or unknown run should not be posted
 - for categorized execution, this rule applies per categorized sub-run
 - one categorized group completion should produce one Mattermost post
 - do not send one parent-level aggregate post in place of the per-group posts
 ## Required Cancellation / Termination Handling
 If a run is cancelled or terminated, the watcher must:
@@ -106,33 +119,47 @@ For each run, keep durable state such as:
 - last observed machine summary
 - timestamps for first seen, last seen, closed
 For categorized runs, keep durable state for:
 - the parent request build name
 - each detected categorized sub-run
 - whether each categorized sub-run has already been posted
 ## Duplicate-Post Prevention
 The watcher must prevent duplicate Mattermost posts.
 Required behavior:
- only one final post per run
+- for non-categorized execution, only one final post per run
- if a run is already marked as posted, do not send again
+- for categorized execution, only one final post per categorized sub-run
- if a run is marked `CANCELLED`, `TERMINATED`, `HUNG`, or `UNKNOWN`, do not later convert it into a posted completion unless explicitly reset by an operator workflow
+- if a watched scope is already marked as posted, do not send again
 - if a run or categorized sub-run is marked `CANCELLED`, `TERMINATED`, `HUNG`, or `UNKNOWN`, do not later convert it into a posted completion unless explicitly reset by an operator workflow
 ## Recommended State Files
 Use a durable controller-local state directory, for example:
 - `/var/lib/atvm-run-watcher/`
 Possible contents:
- one state file per run id
+- one parent state file per requested build name
- one posted marker per run id
+- one posted marker per non-categorized run
- one cancellation marker per run id
+- one subdirectory per categorized sub-run with its own state and posted marker
 - one cancellation marker per parent run id
 - optional lock file to prevent multiple watcher instances from racing
 ## Recommended Operator Workflow
 Normal completion workflow:
 1. ATVM run starts.
-2. Watcher tracks the run id / build name.
+2. Watcher tracks the requested build name.
 3. Watcher polls run state and artifacts.
-4. Run fully completes.
+4. For non-categorized execution:
-5. Watcher builds final status summary.
+   - wait for the run to fully complete
-6. Watcher posts final status to Mattermost once.
+   - build one final status summary
-7. Watcher marks the run as posted and closed.
+   - post one final Mattermost status
 5. For categorized execution:
   - detect each grouped sub-run in order
   - wait for that grouped sub-run to fully complete
   - build that grouped sub-run's final status summary
   - post that grouped sub-run's final Mattermost status
   - continue to the next grouped sub-run
 6. Watcher marks the completed watched scope as posted and closed.
 Cancellation / termination workflow:
 1. Operator stops the ATVM run.
@@ -173,7 +200,9 @@ This watcher design must satisfy all of the following:
 - survive local operator machine downtime
 - use `systemd`
 - distinguish run states clearly
- send Mattermost only after full completion
+- send Mattermost only after full completion of the watched scope
 - send completion results whether hosts passed or failed
 - never send Mattermost for cancelled, terminated, hung, or unknown runs
 - prevent duplicate or misleading posts
 - treat `--categorize` as sequential ATVM sub-runs, not as one parent run with internal phases
 - send one Mattermost post per completed categorized sub-run
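
The duplicate-post and state-file rules in this design could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the watcher's actual code: the `posted.marker` file name and the `scope_dir`, `should_post`, and `mark_posted` helpers are hypothetical, and the state root is passed in rather than hard-coded to `/var/lib/atvm-run-watcher/`.

```python
from pathlib import Path
from typing import Optional

POSTABLE_STATES = {"COMPLETED", "FAILED"}  # states the design allows to be posted


def scope_dir(state_root: Path, build_name: str, subrun_key: Optional[str] = None) -> Path:
    """Return the state directory for a run, or for one categorized sub-run."""
    base = state_root / build_name
    return base / "subruns" / subrun_key if subrun_key else base


def should_post(directory: Path, state: str) -> bool:
    """Post only for postable states, and only if no marker exists yet."""
    # "posted.marker" is an assumed file name for this sketch.
    return state in POSTABLE_STATES and not (directory / "posted.marker").exists()


def mark_posted(directory: Path) -> None:
    """Record that this watched scope has already been posted."""
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "posted.marker").write_text("posted\n")
```

Keeping one marker per watched scope means a restarted watcher can re-evaluate every sub-run without re-sending a status that was already delivered.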
--- a/atvm/watcher-service/INSTALL.md
+++ b/atvm/watcher-service/INSTALL.md
@@ -8,8 +8,9 @@ This is a deployment plan only. It does not perform the installation.
 Install the local watcher package so the controller can:
- watch one ATVM run per watcher instance
+- watch one requested ATVM run per watcher instance
- send final Mattermost status only for `COMPLETED` or `FAILED`
+- for non-categorized runs, send one final Mattermost status only for `COMPLETED` or `FAILED`
 - for categorized runs, send one final Mattermost status per completed categorized sub-run/group
 - suppress Mattermost posts for `CANCELLED`, `TERMINATED`, `HUNG`, and `UNKNOWN`
 - stop automatically after the watched run reaches a terminal state
@@ -116,7 +117,9 @@ Recommended permissions:
 9. Do a real ATVM run test.
   - launch a real run
   - start the watcher for that build name
   - if the run uses `--categorize`, also pass `--categorize` to the watcher start helper
   - confirm final Mattermost delivery for a completed run
   - confirm categorized execution sends one post per completed grouped sub-run
 ## Recommended Validation Commands
@@ -163,6 +166,7 @@ Example:
  --config-family gold \
  --migration-style "ATVM end-to-end migration validation" \
  --integration-plugin "pure with fc" \
  --categorize \
  --scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"
 ```
@@ -184,9 +188,11 @@ The cancel helper should:
 - This is not a daemon.
 - One watcher instance is started per ATVM run.
 - Categorized execution is treated as one watcher instance tracking sequential grouped ATVM sub-runs.
 - The watcher exits after the run reaches a terminal state.
 - The watcher writes state under `/var/lib/atvm-run-watcher/<build-name>`.
- The watcher prevents duplicate Mattermost posts by writing a posted marker.
+- The watcher prevents duplicate Mattermost posts by writing posted markers.
 - Categorized sub-run state is written under `/var/lib/atvm-run-watcher/<build-name>/subruns/<subrun-key>/`.
 ## Failure Handling
@@ -200,6 +206,10 @@ Expected terminal behavior:
  - post to Mattermost
  - verify `ok`
  - exit
 - categorized `COMPLETED` / `FAILED`
  - post once for that grouped sub-run
  - verify `ok`
  - continue until the parent request finishes
 - `CANCELLED`
  - write final `CANCELLED` state to `state.json`
  - do not post
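
The "post once, verify `ok`, continue" behavior for categorized sub-runs might look something like the sketch below. It assumes sub-run records shaped like the `key` / `display_name` / `state` dictionaries in this commit; the `post_finished_subruns` helper, the injected `post_fn`, and the one-line message text are hypothetical stand-ins (the real watcher builds its message from the status template).

```python
from typing import Callable, Dict, List, Set

POSTABLE_STATES = {"COMPLETED", "FAILED"}


def post_finished_subruns(
    subruns: List[Dict[str, str]],
    already_posted: Set[str],
    post_fn: Callable[[str], str],
) -> List[str]:
    """Post each finished, not-yet-posted sub-run once; return the confirmed keys."""
    confirmed: List[str] = []
    for subrun in subruns:
        key, state = subrun["key"], subrun["state"]
        if state not in POSTABLE_STATES or key in already_posted:
            continue  # suppressed state, still running, or already delivered
        # Placeholder message text for this sketch only.
        response = post_fn(f"{subrun['display_name']}: {state}")
        if response == "ok":  # only a verified delivery counts as posted
            already_posted.add(key)
            confirmed.append(key)
    return confirmed
```

Calling this on every poll is safe in this sketch: sub-runs that are still running are skipped, and a sub-run whose post did not return `ok` stays eligible for the next attempt.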
--- a/atvm/watcher-service/README.md
+++ b/atvm/watcher-service/README.md
@@ -4,10 +4,14 @@ This folder contains a per-run ATVM watcher service package that is intended to
 ## Purpose
-Watch a single ATVM automation run until it reaches a terminal state, then:
+Watch an ATVM automation request until it reaches a terminal state, then:
- post the final status to Mattermost if the run state is `COMPLETED` or `FAILED`
+- for non-categorized runs:
- verify the Mattermost post succeeded
+  - post one final status to Mattermost if the run state is `COMPLETED` or `FAILED`
 - for categorized runs:
  - detect each sequential categorized sub-run
  - post one final status per completed categorized sub-run if that grouped run state is `COMPLETED` or `FAILED`
 - verify each Mattermost post succeeded
 - write durable watcher state
 - exit cleanly so the service stops
@@ -38,14 +42,14 @@ Do not treat `/root/atvm-watcher-service` as the preferred long-term install loc
 ## Per-Run Behavior
-Each watcher instance is tied to one build name.
+Each watcher instance is tied to one requested build name.
 Typical workflow:
 1. Launch the ATVM run.
 2. Start the watcher for that run.
 3. The watcher polls the run log, process state, and `cmcReporter` artifacts.
-4. When the run reaches a terminal state:
+4. For non-categorized runs, when the run reaches a terminal state:
   - `COMPLETED` or `FAILED`
     - build the final ATVM status
     - send the status to Mattermost
@@ -56,6 +60,12 @@ Typical workflow:
     - do not post
     - mark the final state
     - exit
 5. For categorized runs:
   - detect each grouped sub-run in sequence from the parent run log
   - wait for that grouped sub-run to finish
   - send one Mattermost post for that grouped sub-run if it reached `COMPLETED` or `FAILED`
   - continue to the next grouped sub-run
   - exit after the parent request reaches a terminal state
 ## Required Environment
@@ -71,6 +81,7 @@ Optional metadata for better status formatting:
 - `ATVM_WATCHER_MIGRATION_STYLE`
 - `ATVM_WATCHER_INTEGRATION_PLUGIN`
 - `ATVM_WATCHER_SCOPE_DESCRIPTION`
 - `ATVM_WATCHER_CATEGORIZED`
 ## Start Example
@@ -83,6 +94,7 @@ This helper writes a per-run environment file and starts the matching instance:
  --config-family gold \
  --migration-style "ATVM end-to-end migration validation" \
  --integration-plugin "pure with fc" \
  --categorize \
  --scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"
 ```
@@ -105,5 +117,7 @@ This writes a cancellation marker, updates `state.json` to `CANCELLED`, and stop
 - The watcher uses the same ATVM status layout documented in `atvm/docs/automation/status-template.md`.
 - Kernel values are resolved from `atvm/inventory/vm-inventory.md`.
 - Categorized execution is treated as sequential grouped ATVM sub-runs, not as one parent run with internal phases.
 - In categorized mode, the watcher writes per-subrun state under `subruns/` and posts each completed grouped run separately.
 - Best-practice controller install path: `/opt/atvm-watcher-service`.
 - This package is local-only right now. Nothing here is installed on the controller yet.
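
Detecting sequential grouped sub-runs from the parent run log, as the README describes, can be illustrated with a short sketch. It reuses the `Extracted specPattern:` log-line pattern that `atvm_run_watcher.py` matches in this commit; the `split_segments` helper and its tuple return shape are simplifications invented for this example.

```python
import re
from typing import List, Tuple

# A timestamped "Extracted specPattern:" INFO line marks the start of a grouped sub-run.
SEGMENT_START = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - INFO - Extracted specPattern:",
    re.M,
)


def split_segments(log_text: str) -> List[Tuple[str, str, bool]]:
    """Return (start_timestamp, segment_text, completed) for each grouped sub-run."""
    starts = [(m.start(), m.group(1)) for m in SEGMENT_START.finditer(log_text)]
    segments = []
    for i, (offset, stamp) in enumerate(starts):
        end = starts[i + 1][0] if i + 1 < len(starts) else len(log_text)
        # A segment counts as completed once a later segment has started.
        segments.append((stamp, log_text[offset:end], i + 1 < len(starts)))
    return segments
```

Under this approach the last segment is never marked completed by the log alone, so its final state has to come from other evidence such as the runner process exiting and the reporter artifacts.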
--- a/atvm/watcher-service/atvm_run_watcher.py
+++ b/atvm/watcher-service/atvm_run_watcher.py
@@ -40,6 +40,17 @@ class HostResult:
    timestamp: Optional[datetime] = None
@dataclass
 class SubRun:
    key: str
    display_name: str
    started_at: datetime
    expected_hosts: List[str]
    completed: bool
    currents_url: Optional[str]
    notes: List[str]
 def now_utc() -> datetime:
    return datetime.now(timezone.utc)
@@ -152,6 +163,13 @@ def parse_xml_timestamp(raw: Optional[str]) -> Optional[datetime]:
        return None
 def parse_log_timestamp(raw: str) -> Optional[datetime]:
    try:
        return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S,%f").replace(tzinfo=timezone.utc)
    except ValueError:
        return None
 def parse_host_xml(xml_path: Path) -> Optional[Tuple[str, HostResult]]:
    try:
        tree = ET.parse(xml_path)
@@ -194,6 +212,7 @@ def collect_host_results(
    expected_hosts: List[str],
    kernels: Dict[str, str],
    run_started_at: datetime,
    run_ended_at: Optional[datetime] = None,
 ) -> Dict[str, HostResult]:
    xml_dir = reporter_root / "xml"
    results: Dict[str, HostResult] = {}
@@ -203,6 +222,8 @@ def collect_host_results(
        xml_mtime = datetime.fromtimestamp(xml_path.stat().st_mtime, tz=timezone.utc)
        if xml_mtime < run_started_at:
            continue
        if run_ended_at and xml_mtime >= run_ended_at:
            continue
        parsed = parse_host_xml(xml_path)
        if not parsed:
            continue
@@ -214,21 +235,46 @@ def collect_host_results(
    return results
-def find_current_running_host(log_text: str, completed_hosts: List[str]) -> Optional[str]:
+def find_check_xml_end(
-    matches = re.findall(r"Running:\s+(?:cypress/cmcRegressionTest/)?(atvm[^/\s]+)\.ts", log_text)
+    reporter_root: Path,
-    for host in reversed(matches):
+    started_at: datetime,
-        if host not in completed_hosts:
+    ended_at: Optional[datetime] = None,
-            return host
+) -> Optional[datetime]:
-    return None
+    xml_dir = reporter_root / "xml"
    if not xml_dir.exists():
        return None
    latest: Optional[datetime] = None
    for xml_path in sorted(xml_dir.glob("test-result-*.xml"), key=lambda p: p.stat().st_mtime):
        xml_mtime = datetime.fromtimestamp(xml_path.stat().st_mtime, tz=timezone.utc)
        if xml_mtime < started_at:
            continue
        if ended_at and xml_mtime >= ended_at:
            continue
        text = read_text(xml_path)
        if "check-xml-files.ts" not in text:
            continue
        try:
            tree = ET.parse(xml_path)
            root = tree.getroot()
            suite = root.find("testsuite")
            if suite is None:
                continue
            ts = parse_xml_timestamp(suite.attrib.get("timestamp"))
            if ts:
                latest = ts
        except ET.ParseError:
            continue
    return latest
-def infer_metadata() -> Dict[str, str]:
+def infer_metadata() -> Dict[str, object]:
    return {
        "template": os.environ.get("ATVM_WATCHER_TEMPLATE", "unknown"),
        "config_family": os.environ.get("ATVM_WATCHER_CONFIG_FAMILY", "unknown"),
        "migration_style": os.environ.get("ATVM_WATCHER_MIGRATION_STYLE", "ATVM automation validation"),
        "integration_plugin": os.environ.get("ATVM_WATCHER_INTEGRATION_PLUGIN", "unknown"),
        "scope_description": os.environ.get("ATVM_WATCHER_SCOPE_DESCRIPTION", "requested ATVM run scope"),
        "categorized": os.environ.get("ATVM_WATCHER_CATEGORIZED", "false").lower() == "true",
    }
@@ -253,7 +299,7 @@ def format_timestamp_local(ts: Optional[datetime]) -> str:
 def build_status_markdown(
    build_name: str,
-    metadata: Dict[str, str],
+    metadata: Dict[str, object],
    host_results: Dict[str, HostResult],
    run_state: str,
    currents_url: Optional[str],
@@ -348,80 +394,225 @@ def post_to_mattermost(text: str) -> str:
        return response.read().decode().strip()
 def sanitize_key(raw: str) -> str:
    return re.sub(r"[^A-Za-z0-9_.-]+", "-", raw).strip("-") or "subrun"
 def infer_group_label(hosts: List[str], index: int) -> str:
    if not hosts:
        return f"group{index}"
    labels: List[str] = []
    for host in hosts:
        short = host.split("-", 1)[-1]
        if short.startswith("w2k"):
            label = "windows"
        else:
            label = re.sub(r"\d.*$", "", short) or short
        if label not in labels:
            labels.append(label)
    return "-".join(labels) if labels else f"group{index}"
 def extract_segment_build_name(segment_text: str, parent_build_name: str) -> Optional[str]:
    patterns = [
        rf"({re.escape(parent_build_name)}-[A-Za-z0-9_.-]*batch\d+_\d+)",
        r"([A-Za-z0-9_.-]+-batch\d+_\d+)",
    ]
    for pattern in patterns:
        match = re.search(pattern, segment_text)
        if match:
            return match.group(1)
    return None
 def split_log_segments(log_text: str, parent_build_name: str, categorized: bool, default_started_at: datetime) -> List[SubRun]:
    if not categorized:
        return [
            SubRun(
                key=sanitize_key(parent_build_name),
                display_name=parent_build_name,
                started_at=default_started_at,
                expected_hosts=extract_expected_hosts(log_text),
                completed=False,
                currents_url=extract_currents_url(log_text),
                notes=[],
            )
        ]
    segment_starts: List[Tuple[int, Optional[datetime]]] = []
    for match in re.finditer(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - INFO - Extracted specPattern:", log_text, re.M):
        segment_starts.append((match.start(), parse_log_timestamp(match.group(1))))
    if not segment_starts:
        return [
            SubRun(
                key=sanitize_key(parent_build_name),
                display_name=parent_build_name,
                started_at=default_started_at,
                expected_hosts=extract_expected_hosts(log_text),
                completed=False,
                currents_url=extract_currents_url(log_text),
                notes=["Categorized mode was requested but no sub-run segment has appeared in the log yet."],
            )
        ]
    segments: List[SubRun] = []
    for index, (start_offset, start_ts) in enumerate(segment_starts, start=1):
        end_offset = segment_starts[index][0] if index < len(segment_starts) else len(log_text)
        segment_text = log_text[start_offset:end_offset]
        expected_hosts = extract_expected_hosts(segment_text)
        display_name = extract_segment_build_name(segment_text, parent_build_name)
        if not display_name:
            display_name = f"{parent_build_name}-{infer_group_label(expected_hosts, index)}"
        segments.append(
            SubRun(
                key=sanitize_key(display_name),
                display_name=display_name,
                started_at=start_ts or default_started_at,
                expected_hosts=expected_hosts,
                completed=index < len(segment_starts),
                currents_url=extract_currents_url(segment_text),
                notes=[f"Categorized sub-run {index} of {len(segment_starts)}."],
            )
        )
    return segments
 def evaluate_subrun(
    subrun: SubRun,
    reporter_root: Path,
    inventory: Dict[str, str],
    end_boundary: Optional[datetime],
    parent_active: bool,
    cancelled: bool,
 ) -> Tuple[str, Dict[str, HostResult], Optional[datetime], Optional[datetime], Optional[str], List[str]]:
    notes = list(subrun.notes)
    host_results = collect_host_results(
        reporter_root=reporter_root,
        expected_hosts=subrun.expected_hosts,
        kernels=inventory,
        run_started_at=subrun.started_at,
        run_ended_at=end_boundary,
    )
    check_end = find_check_xml_end(reporter_root, subrun.started_at, end_boundary)
    start_candidates = [result.timestamp for result in host_results.values() if result.timestamp]
    end_candidates = [result.timestamp for result in host_results.values() if result.timestamp]
    if check_end:
        end_candidates.append(check_end)
    start_ts = min(start_candidates) if start_candidates else subrun.started_at
    end_ts = max(end_candidates) if end_candidates else None
    if cancelled:
        notes.append("Cancellation marker detected.")
        return "CANCELLED", host_results, start_ts, end_ts, subrun.currents_url, notes
    if subrun.completed:
        if not host_results:
            notes.append("This categorized sub-run ended but no host results were detected.")
            return "UNKNOWN", host_results, start_ts, end_ts, subrun.currents_url, notes
        notes.append("Categorized sub-run completed and the next grouped run was launched.")
        if check_end:
            notes.append("Final `check-xml-files.ts` validation passed.")
        state = "FAILED" if any(result.failures for result in host_results.values()) else "COMPLETED"
        return state, host_results, start_ts, end_ts, subrun.currents_url, notes
    if parent_active:
        current_host = next((host for host in subrun.expected_hosts if host not in host_results), None)
        if current_host and current_host not in host_results:
            host_results[current_host] = HostResult(
                host=current_host,
                kernel=inventory.get(current_host, "unknown"),
                status="RUN",
                detail="in progress",
            )
        return "RUNNING", host_results, start_ts, end_ts, subrun.currents_url, notes
    if host_results:
        notes.append("Categorized sub-run completed after the parent runner exited.")
        if check_end:
            notes.append("Final `check-xml-files.ts` validation passed.")
        state = "FAILED" if any(result.failures for result in host_results.values()) else "COMPLETED"
        return state, host_results, start_ts, end_ts, subrun.currents_url, notes
    notes.append("Parent run exited before this categorized sub-run produced host results.")
    return "TERMINATED", host_results, start_ts, end_ts, subrun.currents_url, notes
 def determine_state(
    build_name: str,
    build_dir: Path,
    run_log: Path,
    reporter_root: Path,
    inventory: Dict[str, str],
    metadata: Dict[str, object],
    started_at: datetime,
    process_gone_since: Optional[datetime],
    process_exit_grace_seconds: int,
-) -> Tuple[str, Dict[str, HostResult], str, Optional[datetime], Optional[datetime], Optional[str], List[str]]:
+) -> Tuple[str, List[Dict[str, object]], Dict[str, HostResult], Optional[datetime], Optional[datetime], Optional[str], List[str]]:
    cancelled_marker = build_dir / "cancelled.marker"
    log_text = read_text(run_log)
    expected_hosts = extract_expected_hosts(log_text)
    host_results = collect_host_results(reporter_root, expected_hosts, inventory, started_at)
    active = process_active(build_name)
-    currents_url = extract_currents_url(log_text)
+    cancelled = cancelled_marker.exists()
    notes: List[str] = []
    subrun_states: List[Dict[str, object]] = []
    parent_host_results: Dict[str, HostResult] = {}
-    current_host = find_current_running_host(log_text, list(host_results.keys()))
+    subruns = split_log_segments(log_text, build_name, bool(metadata.get("categorized")), started_at)
-    if current_host and current_host not in host_results:
+    for index, subrun in enumerate(subruns):
-        host_results[current_host] = HostResult(
+        next_started_at = subruns[index + 1].started_at if index + 1 < len(subruns) else None
-            host=current_host,
+        state, host_results, start_ts, end_ts, currents_url, subrun_notes = evaluate_subrun(
-            kernel=inventory.get(current_host, "unknown"),
+            subrun=subrun,
-            status="RUN",
+            reporter_root=reporter_root,
-            detail="in progress",
+            inventory=inventory,
            end_boundary=next_started_at,
            parent_active=active,
            cancelled=cancelled,
        )
        for host, result in host_results.items():
            parent_host_results[host] = result
        subrun_states.append(
            {
                "key": subrun.key,
                "display_name": subrun.display_name,
                "state": state,
                "host_results": host_results,
                "start_ts": start_ts,
                "end_ts": end_ts,
                "currents_url": currents_url,
                "notes": subrun_notes,
            }
        )
-    start_candidates = [result.timestamp for result in host_results.values() if result.timestamp]
+    parent_start_candidates = [subrun["start_ts"] for subrun in subrun_states if subrun["start_ts"]]
-    end_candidates = [result.timestamp for result in host_results.values() if result.timestamp]
+    parent_end_candidates = [subrun["end_ts"] for subrun in subrun_states if subrun["end_ts"]]
-    check_xml = reporter_root / "xml"
+    start_ts = min(parent_start_candidates) if parent_start_candidates else started_at
-    for xml_path in sorted(check_xml.glob("test-result-*.xml"), key=lambda p: p.stat().st_mtime, reverse=True):
+    end_ts = max(parent_end_candidates) if parent_end_candidates else find_check_xml_end(reporter_root, started_at)
-        xml_mtime = datetime.fromtimestamp(xml_path.stat().st_mtime, tz=timezone.utc)
+    currents_url = extract_currents_url(log_text)
-        if xml_mtime < started_at:
-            continue
-        text = read_text(xml_path)
-        if "check-xml-files.ts" in text:
-            try:
-                tree = ET.parse(xml_path)
-                root = tree.getroot()
-                suite = root.find("testsuite")
-                if suite is not None:
-                    ts = parse_xml_timestamp(suite.attrib.get("timestamp"))
-                    if ts:
-                        end_candidates.append(ts)
-            except ET.ParseError:
-                pass
-            break
-    start_ts = min(start_candidates) if start_candidates else started_at
+    if cancelled:
-    end_ts = max(end_candidates) if end_candidates else None
-    if cancelled_marker.exists():
        notes.append("Cancellation marker detected.")
-        return "CANCELLED", host_results, log_text, start_ts, end_ts, currents_url, notes
+        return "CANCELLED", subrun_states, parent_host_results, start_ts, end_ts, currents_url, notes
    if active:
        elapsed = (now_utc() - started_at).total_seconds()
        if elapsed > args.max_watch_seconds:
            notes.append("Watcher exceeded max watch duration while the run still appears active.")
-            return "HUNG", host_results, log_text, start_ts, end_ts, currents_url, notes
+            return "HUNG", subrun_states, parent_host_results, start_ts, end_ts, currents_url, notes
-        return "RUNNING", host_results, log_text, start_ts, end_ts, currents_url, notes
+        return "RUNNING", subrun_states, parent_host_results, start_ts, end_ts, currents_url, notes
-    if "Cloud Run Finished" in log_text or currents_url:
-        state = "FAILED" if any(result.failures for result in host_results.values()) else "COMPLETED"
-        notes.append("Run finished and final reporting artifacts were detected.")
-        if any("check-xml-files.ts" in line for line in log_text.splitlines()):
-            notes.append("Final `check-xml-files.ts` validation passed.")
-        return state, host_results, log_text, start_ts, end_ts, currents_url, notes
+    terminal_subruns = [subrun for subrun in subrun_states if subrun["state"] in {"COMPLETED", "FAILED"}]
+    if terminal_subruns:
+        state = "FAILED" if any(result.failures for result in parent_host_results.values()) else "COMPLETED"
+        notes.append("Run finished and one or more sub-run result artifacts were detected.")
+        if end_ts:
+            notes.append("Final reporting artifacts were detected.")
+        return state, subrun_states, parent_host_results, start_ts, end_ts, currents_url, notes
    if process_gone_since and (now_utc() - process_gone_since).total_seconds() >= process_exit_grace_seconds:
        notes.append("Run process exited without a clean completion signal.")
-        return "TERMINATED", host_results, log_text, start_ts, end_ts, currents_url, notes
+        return "TERMINATED", subrun_states, parent_host_results, start_ts, end_ts, currents_url, notes
-    return "RUNNING", host_results, log_text, start_ts, end_ts, currents_url, notes
+    return "RUNNING", subrun_states, parent_host_results, start_ts, end_ts, currents_url, notes
 if __name__ == "__main__":
@@ -455,12 +646,13 @@ if __name__ == "__main__":
        if active:
            process_gone_since = None
-        run_state, host_results, log_text, start_ts, end_ts, currents_url, notes = determine_state(
+        run_state, subrun_states, host_results, start_ts, end_ts, currents_url, notes = determine_state(
            build_name=build_name,
            build_dir=build_dir,
            run_log=run_log,
            reporter_root=reporter_root,
            inventory=inventory,
            metadata=metadata,
            started_at=started_at,
            process_gone_since=process_gone_since,
            process_exit_grace_seconds=args.process_exit_grace_seconds,
@@ -478,8 +670,64 @@ if __name__ == "__main__":
            }
            for host, result in host_results.items()
        }
+        state["subruns"] = {
+            subrun["display_name"]: {
+                "state": subrun["state"],
+                "hosts": sorted(subrun["host_results"].keys()),
+                "start_ts": subrun["start_ts"].isoformat() if subrun["start_ts"] else None,
+                "end_ts": subrun["end_ts"].isoformat() if subrun["end_ts"] else None,
+                "currents_url": subrun["currents_url"],
+                "notes": subrun["notes"],
+            }
+            for subrun in subrun_states
+        }
        write_state(state_file, state)
+        for subrun in subrun_states:
+            subrun_dir = build_dir / "subruns" / subrun["key"]
+            ensure_dir(subrun_dir)
+            subrun_state_file = subrun_dir / "state.json"
+            subrun_posted_marker = subrun_dir / "posted.marker"
+            subrun_state = {
+                "display_name": subrun["display_name"],
+                "last_state": subrun["state"],
+                "last_seen_at": now_utc().isoformat(),
+                "host_results": {
+                    host: {
+                        "status": result.status,
+                        "detail": result.detail,
+                        "kernel": result.kernel,
+                        "tests": result.tests,
+                        "failures": result.failures,
+                    }
+                    for host, result in subrun["host_results"].items()
+                },
+                "notes": subrun["notes"],
+                "currents_url": subrun["currents_url"],
+                "started_at": subrun["start_ts"].isoformat() if subrun["start_ts"] else None,
+                "ended_at": subrun["end_ts"].isoformat() if subrun["end_ts"] else None,
+            }
+            if subrun["state"] in {"COMPLETED", "FAILED"} and not subrun_posted_marker.exists():
+                status_text = build_status_markdown(
+                    build_name=subrun["display_name"],
+                    metadata=metadata,
+                    host_results=dict(sorted(subrun["host_results"].items())),
+                    run_state=subrun["state"],
+                    currents_url=subrun["currents_url"],
+                    start_ts=subrun["start_ts"],
+                    end_ts=subrun["end_ts"],
+                    notes=subrun["notes"],
+                )
+                print(status_text)
+                response = post_to_mattermost(status_text)
+                if response != "ok":
+                    raise SystemExit(f"Mattermost webhook did not return ok for {subrun['display_name']}: {response!r}")
+                subrun_posted_marker.write_text("ok\n", encoding="utf-8")
+                subrun_state["mattermost_posted"] = True
+                subrun_state["mattermost_response"] = response
+                print(f"[watcher] Mattermost post confirmed for {subrun['display_name']}.")
+            write_state(subrun_state_file, subrun_state)
        if run_state == "RUNNING":
            print(f"[watcher] {build_name}: RUNNING")
            time.sleep(args.poll_interval)
@@ -497,7 +745,7 @@ if __name__ == "__main__":
        )
        print(status_text)
-        if run_state in {"COMPLETED", "FAILED"} and not posted_marker.exists():
+        if not metadata.get("categorized") and run_state in {"COMPLETED", "FAILED"} and not posted_marker.exists():
            response = post_to_mattermost(status_text)
            if response != "ok":
                raise SystemExit(f"Mattermost webhook did not return ok: {response!r}")
--- a/atvm/watcher-service/start-atvm-run-watcher.sh
+++ b/atvm/watcher-service/start-atvm-run-watcher.sh
@@ -13,6 +13,7 @@ Options:
  --migration-style <text>
  --integration-plugin <text>
  --scope-description <text>
+  --categorize
  --state-root <path>   Default: /var/lib/atvm-run-watcher
 EOF
 }
@@ -23,6 +24,7 @@ CONFIG_FAMILY=""
 MIGRATION_STYLE=""
 INTEGRATION_PLUGIN=""
 SCOPE_DESCRIPTION=""
+WATCHER_CATEGORIZED="false"
 STATE_ROOT="/var/lib/atvm-run-watcher"
 while [[ $# -gt 0 ]]; do
@@ -33,6 +35,7 @@ while [[ $# -gt 0 ]]; do
    --migration-style) MIGRATION_STYLE="${2:-}"; shift 2 ;;
    --integration-plugin) INTEGRATION_PLUGIN="${2:-}"; shift 2 ;;
    --scope-description) SCOPE_DESCRIPTION="${2:-}"; shift 2 ;;
+    --categorize) WATCHER_CATEGORIZED="true"; shift ;;
    --state-root) STATE_ROOT="${2:-}"; shift 2 ;;
    -h|--help) usage; exit 0 ;;
    *) echo "Unknown argument: $1" >&2; usage >&2; exit 1 ;;
@@ -54,6 +57,7 @@ ATVM_WATCHER_CONFIG_FAMILY=${CONFIG_FAMILY@Q}
 ATVM_WATCHER_MIGRATION_STYLE=${MIGRATION_STYLE@Q}
 ATVM_WATCHER_INTEGRATION_PLUGIN=${INTEGRATION_PLUGIN@Q}
 ATVM_WATCHER_SCOPE_DESCRIPTION=${SCOPE_DESCRIPTION@Q}
+ATVM_WATCHER_CATEGORIZED=${WATCHER_CATEGORIZED@Q}
 EOF
 systemctl start "atvm-run-watcher@${BUILD_NAME}.service"
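
Reviewer note: the per-subrun posting in the watcher hunk above relies on a "post once, then write a marker" pattern. The following is a minimal standalone sketch of that pattern, not the watcher's actual code. The function name `post_once` and its `post` callable parameter are hypothetical and exist only for illustration; the only behavior taken from the diff is that the marker is written after the webhook returns `"ok"`.

```python
from pathlib import Path
from typing import Callable


def post_once(marker: Path, text: str, post: Callable[[str], str]) -> bool:
    """Send `text` via `post` unless `marker` already exists.

    Returns True if a post was sent on this call, False if it was skipped.
    `post` is a hypothetical stand-in for a webhook call that returns "ok" on success.
    """
    if marker.exists():
        # A previous poll already posted this sub-run; do nothing.
        return False
    response = post(text)
    if response != "ok":
        # No marker is written, so a later attempt can post again.
        raise RuntimeError(f"webhook did not return ok: {response!r}")
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text("ok\n", encoding="utf-8")
    return True
```

Because the marker is written only after a confirmed post, a failed webhook call leaves the sub-run eligible for posting on the next attempt, while a confirmed post is never repeated across polling iterations.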