Update ATVM watcher for categorized sub-run posting

- update the watcher design and automation guide to treat --categorize as sequential ATVM sub-runs rather than one parent run with internal phases
- document that categorized runs should send one Mattermost status per completed grouped sub-run instead of one parent-only final post
- add a --categorize option to the watcher start helper so categorized mode is explicit in watcher startup
- update the watcher implementation to track categorized sub-runs separately, write per-subrun state, and post each completed grouped run once
2026-03-26 11:00:39 -04:00
parent 68cd428733
commit d60b8b9b18
6 changed files with 399 additions and 89 deletions
--- a/atvm/docs/automation/guide.md
+++ b/atvm/docs/automation/guide.md
@@ -43,6 +43,9 @@ Run ATVM CMC automation tests on the designated automation VM without unintended
 - Execute ATVM run commands only after explicit approval.
 - Treat `approve` as approval to run without the watcher service.
 - Treat `approve with watcher` as approval to run and also start the per-run watcher service for that build.
+- When `--categorize` is used with watcher enabled, treat the watcher as a sequential grouped-run watcher:
+  - it must post one final Mattermost status per completed categorized group/sub-run
+  - it must not hold those posts back and replace them with a single parent-only post
 - After execution, report immediate success/failure only.
 - Do not actively monitor completion unless explicitly requested.
 - If monitoring is requested, allow long runtime windows (15-30+ minutes) and continue until completion unless operator instructs otherwise.
@@ -154,13 +157,14 @@ Before any new automation request:
 4. When the watcher is available, present the watcher-start command separately from the core run commands.
 5. Treat `approve` as approval to execute the ATVM run without starting the watcher.
 6. Treat `approve with watcher` as approval to execute the ATVM run and start the watcher for that build.
-7. Run only approved command(s), no extra options and no silent substitutions.
-8. When both template generation and the Cypress runner are requested, run them sequentially, not in parallel.
-9. Do not launch `run-sorry-cypress.py` until `cmc-templates.py` has exited successfully and finished updating the intended config/spec files.
-10. Treat displayed commands as a review gate: do not execute either command until the operator has had a chance to review them and explicitly approve.
-11. If the operator asks to change plugin, config, filters, build name, Gold Disk, or scope after commands are shown, discard the old plan, show the revised commands, and wait for new approval.
-12. If monitoring was not requested, report immediate success/failure for each command.
-13. If monitoring was requested, keep monitoring until completion and report final outcome.
+7. If the run uses `--categorize` and the watcher is requested, include `--categorize` on the watcher start command too so the watcher tracks sequential categorized sub-runs correctly.
+8. Run only approved command(s), no extra options and no silent substitutions.
+9. When both template generation and the Cypress runner are requested, run them sequentially, not in parallel.
+10. Do not launch `run-sorry-cypress.py` until `cmc-templates.py` has exited successfully and finished updating the intended config/spec files.
+11. Treat displayed commands as a review gate: do not execute either command until the operator has had a chance to review them and explicitly approve.
+12. If the operator asks to change plugin, config, filters, build name, Gold Disk, or scope after commands are shown, discard the old plan, show the revised commands, and wait for new approval.
+13. If monitoring was not requested, report immediate success/failure for each command.
+14. If monitoring was requested, keep monitoring until completion and report final outcome.

 ## Requested Test Style
 When asked for one VM or a VM set:
@@ -193,6 +197,7 @@ When asked for one VM or a VM set:
 - Use the same ATVM status layout that would be shown to the operator locally when posting to Mattermost.
 - Default status template: `/home/aw/code/cds/atvm/docs/automation/status-template.md`
 - Do not post to Mattermost unless the operator explicitly asks for the run status to be sent there.
+- For categorized execution with watcher enabled, send one Mattermost status per completed categorized sub-run/group after that grouped run fully finishes.

 ## Status Reporting Format
 When the operator asks for the status of an ATVM automation run, report in this order:
--- a/atvm/docs/automation/mattermost-watcher-design.md
+++ b/atvm/docs/automation/mattermost-watcher-design.md
@@ -1,7 +1,7 @@
 # ATVM Mattermost Watcher Design

 ## Purpose
-Design a controller-local watcher on the ATVM Cypress machine (`192.168.3.190`) that monitors an ATVM automation run and posts the final run status to Mattermost only after the run has fully completed.
+Design a controller-local watcher on the ATVM Cypress machine (`192.168.3.190`) that monitors an ATVM automation run and posts final run status to Mattermost only after the watched scope has fully completed.

 This watcher must continue working even if the local operator machine is offline.

@@ -9,9 +9,10 @@ This watcher must continue working even if the local operator machine is offline
 Use a `systemd`-managed watcher on the ATVM Cypress controller.

 Recommended structure:
-- one watcher script that evaluates the state of a specific ATVM run
+- one watcher script that evaluates a specific ATVM run request
 - one `systemd` service to execute the watcher
-- optionally one `systemd` timer for periodic polling if the watcher is not implemented as a long-running process
+- no always-on daemon
+- for categorized ATVM runs, one watcher instance tracks the parent request and posts each categorized sub-run separately as those grouped runs complete

 Preferred deployment target:
 - controller host: `192.168.3.190`
@@ -26,14 +27,23 @@ Expected variables:
 - `MATTERMOST_ATVM_CHANNEL`

 ## Run Completion Rule
-The watcher must send Mattermost results only after the ATVM run has fully completed.
+The watcher must send Mattermost results only after the watched scope has fully completed.

-A run is considered fully completed only when:
+A non-categorized run is considered fully completed only when:
 - there are no active runner processes for the run
 - the expected machine scope has final result artifacts
 - no machine remains in `RUNNING` or `NOT STARTED`
 - final reporter artifacts confirm the run has ended

+A categorized run must be treated differently:
+- `--categorize` splits the request into sequential ATVM sub-runs
+- each categorized group is its own run/job
+- the watcher must detect each grouped sub-run in order
+- the watcher must wait for that grouped sub-run to complete
+- then send that grouped sub-run's final Mattermost status
+- then continue watching for the next grouped sub-run
+- the watcher must not wait until the very end to send a single parent-only post
+
 Evidence sources:
 - live runner processes on `192.168.3.190`
 - `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/logs/`
@@ -70,7 +80,7 @@ Definitions:
  - the run is still active and not yet complete

 ## Mattermost Posting Rule
-Post to Mattermost only when the run has fully completed.
+Post to Mattermost only when the watched scope has fully completed.

 Send Mattermost status for:
 - `COMPLETED`
@@ -86,6 +96,9 @@ Do not send Mattermost status for:
 Important clarification:
 - a completed run with failed hosts should still be posted
 - a cancelled, terminated, hung, or unknown run should not be posted
+- for categorized execution, this rule applies per categorized sub-run
+- one categorized group completion should produce one Mattermost post
+- do not send one parent-level aggregate post in place of the per-group posts

 ## Required Cancellation / Termination Handling
 If a run is cancelled or terminated, the watcher must:
@@ -106,33 +119,47 @@ For each run, keep durable state such as:
 - last observed machine summary
 - timestamps for first seen, last seen, closed

+For categorized runs, keep durable state for:
+- the parent request build name
+- each detected categorized sub-run
+- whether each categorized sub-run has already been posted
+
 ## Duplicate-Post Prevention
 The watcher must prevent duplicate Mattermost posts.

 Required behavior:
-- only one final post per run
-- if a run is already marked as posted, do not send again
-- if a run is marked `CANCELLED`, `TERMINATED`, `HUNG`, or `UNKNOWN`, do not later convert it into a posted completion unless explicitly reset by an operator workflow
+- for non-categorized execution, only one final post per run
+- for categorized execution, only one final post per categorized sub-run
+- if a watched scope is already marked as posted, do not send again
+- if a run or categorized sub-run is marked `CANCELLED`, `TERMINATED`, `HUNG`, or `UNKNOWN`, do not later convert it into a posted completion unless explicitly reset by an operator workflow

 ## Recommended State Files
 Use a durable controller-local state directory, for example:
 - `/var/lib/atvm-run-watcher/`

 Possible contents:
-- one state file per run id
-- one posted marker per run id
-- one cancellation marker per run id
+- one parent state file per requested build name
+- one posted marker per non-categorized run
+- one subdirectory per categorized sub-run with its own state and posted marker
+- one cancellation marker per parent run id
 - optional lock file to prevent multiple watcher instances from racing

 ## Recommended Operator Workflow
 Normal completion workflow:
 1. ATVM run starts.
-2. Watcher tracks the run id / build name.
+2. Watcher tracks the requested build name.
 3. Watcher polls run state and artifacts.
-4. Run fully completes.
-5. Watcher builds final status summary.
-6. Watcher posts final status to Mattermost once.
-7. Watcher marks the run as posted and closed.
+4. For non-categorized execution:
+   - wait for the run to fully complete
+   - build one final status summary
+   - post one final Mattermost status
+5. For categorized execution:
+   - detect each grouped sub-run in order
+   - wait for that grouped sub-run to fully complete
+   - build that grouped sub-run's final status summary
+   - post that grouped sub-run's final Mattermost status
+   - continue to the next grouped sub-run
+6. Watcher marks the completed watched scope as posted and closed.

 Cancellation / termination workflow:
 1. Operator stops the ATVM run.
@@ -173,7 +200,9 @@ This watcher design must satisfy all of the following:
 - survive local operator machine downtime
 - use `systemd`
 - distinguish run states clearly
-- send Mattermost only after full completion
+- send Mattermost only after full completion of the watched scope
 - send completion results whether hosts passed or failed
 - never send Mattermost for cancelled, terminated, hung, or unknown runs
 - prevent duplicate or misleading posts
+- treat `--categorize` as sequential ATVM sub-runs, not as one parent run with internal phases
+- send one Mattermost post per completed categorized sub-run
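
The duplicate-post rule in the design doc above can be sketched as a small marker-file check. This is only an illustration under stated assumptions: the names `POSTABLE_STATES`, `posted_marker_path`, `should_post`, and `mark_posted`, and the marker filename `posted.marker`, are hypothetical and are not taken from the watcher implementation. Only the `subruns/<subrun-key>/` layout comes from this commit.

```python
from pathlib import Path
from typing import Optional

# States the design doc allows to be posted; all other states are suppressed.
POSTABLE_STATES = {"COMPLETED", "FAILED"}


def posted_marker_path(build_dir: Path, subrun_key: Optional[str] = None) -> Path:
    """Return the marker path for a whole run or for one categorized sub-run."""
    scope_dir = build_dir if subrun_key is None else build_dir / "subruns" / subrun_key
    return scope_dir / "posted.marker"  # hypothetical filename


def should_post(build_dir: Path, state: str, subrun_key: Optional[str] = None) -> bool:
    """Post only postable terminal states, and only once per watched scope."""
    if state not in POSTABLE_STATES:
        return False
    return not posted_marker_path(build_dir, subrun_key).exists()


def mark_posted(build_dir: Path, subrun_key: Optional[str] = None) -> None:
    """Record that the watched scope has already been posted."""
    marker = posted_marker_path(build_dir, subrun_key)
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()
```

Keeping one marker per sub-run directory means a restarted watcher can skip groups it has already posted without consulting any parent-level state.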
--- a/atvm/watcher-service/INSTALL.md
+++ b/atvm/watcher-service/INSTALL.md
@@ -8,8 +8,9 @@ This is a deployment plan only. It does not perform the installation.

 Install the local watcher package so the controller can:

-- watch one ATVM run per watcher instance
-- send final Mattermost status only for `COMPLETED` or `FAILED`
+- watch one requested ATVM run per watcher instance
+- for non-categorized runs, send one final Mattermost status only for `COMPLETED` or `FAILED`
+- for categorized runs, send one final Mattermost status per completed categorized sub-run/group
 - suppress Mattermost posts for `CANCELLED`, `TERMINATED`, `HUNG`, and `UNKNOWN`
 - stop automatically after the watched run reaches a terminal state

@@ -116,7 +117,9 @@ Recommended permissions:
 9. Do a real ATVM run test.
   - launch a real run
   - start the watcher for that build name
+   - if the run uses `--categorize`, also pass `--categorize` to the watcher start helper
   - confirm final Mattermost delivery for a completed run
+   - confirm categorized execution sends one post per completed grouped sub-run

 ## Recommended Validation Commands

@@ -163,6 +166,7 @@ Example:
  --config-family gold \
  --migration-style "ATVM end-to-end migration validation" \
  --integration-plugin "pure with fc" \
+  --categorize \
  --scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"
 ```

@@ -184,9 +188,11 @@ The cancel helper should:

 - This is not a daemon.
 - One watcher instance is started per ATVM run.
+- In categorized execution, one watcher instance tracks the sequential grouped ATVM sub-runs.
 - The watcher exits after the run reaches a terminal state.
 - The watcher writes state under `/var/lib/atvm-run-watcher/<build-name>`.
-- The watcher prevents duplicate Mattermost posts by writing a posted marker.
+- The watcher prevents duplicate Mattermost posts by writing posted markers.
+- Categorized sub-run state is written under `/var/lib/atvm-run-watcher/<build-name>/subruns/<subrun-key>/`.

 ## Failure Handling

@@ -200,6 +206,10 @@ Expected terminal behavior:
  - post to Mattermost
  - verify `ok`
  - exit
+- categorized `COMPLETED` / `FAILED`
+  - post once for that grouped sub-run
+  - verify `ok`
+  - continue until the parent request finishes
 - `CANCELLED`
  - write final `CANCELLED` state to `state.json`
  - do not post
--- a/atvm/watcher-service/README.md
+++ b/atvm/watcher-service/README.md
@@ -4,10 +4,14 @@ This folder contains a per-run ATVM watcher service package that is intended to

 ## Purpose

-Watch a single ATVM automation run until it reaches a terminal state, then:
+Watch an ATVM automation request until it reaches a terminal state, then:

-- post the final status to Mattermost if the run state is `COMPLETED` or `FAILED`
-- verify the Mattermost post succeeded
+- for non-categorized runs:
+  - post one final status to Mattermost if the run state is `COMPLETED` or `FAILED`
+- for categorized runs:
+  - detect each sequential categorized sub-run
+  - post one final status per completed categorized sub-run if that grouped run state is `COMPLETED` or `FAILED`
+- verify each Mattermost post succeeded
 - write durable watcher state
 - exit cleanly so the service stops

@@ -38,14 +42,14 @@ Do not treat `/root/atvm-watcher-service` as the preferred long-term install loc

 ## Per-Run Behavior

-Each watcher instance is tied to one build name.
+Each watcher instance is tied to one requested build name.

 Typical workflow:

 1. Launch the ATVM run.
 2. Start the watcher for that run.
 3. The watcher polls the run log, process state, and `cmcReporter` artifacts.
-4. When the run reaches a terminal state:
+4. For non-categorized runs, when the run reaches a terminal state:
   - `COMPLETED` or `FAILED`
     - build the final ATVM status
     - send the status to Mattermost
@@ -56,6 +60,12 @@ Typical workflow:
     - do not post
     - mark the final state
     - exit
+5. For categorized runs:
+   - detect each grouped sub-run in sequence from the parent run log
+   - wait for that grouped sub-run to finish
+   - send one Mattermost post for that grouped sub-run if it reached `COMPLETED` or `FAILED`
+   - continue to the next grouped sub-run
+   - exit after the parent request reaches a terminal state

 ## Required Environment

@@ -71,6 +81,7 @@ Optional metadata for better status formatting:
 - `ATVM_WATCHER_MIGRATION_STYLE`
 - `ATVM_WATCHER_INTEGRATION_PLUGIN`
 - `ATVM_WATCHER_SCOPE_DESCRIPTION`
+- `ATVM_WATCHER_CATEGORIZED`

 ## Start Example

@@ -83,6 +94,7 @@ This helper writes a per-run environment file and starts the matching instance:
  --config-family gold \
  --migration-style "ATVM end-to-end migration validation" \
  --integration-plugin "pure with fc" \
+  --categorize \
  --scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"
 ```

@@ -105,5 +117,7 @@ This writes a cancellation marker, updates `state.json` to `CANCELLED`, and stop

 - The watcher uses the same ATVM status layout documented in `atvm/docs/automation/status-template.md`.
 - Kernel values are resolved from `atvm/inventory/vm-inventory.md`.
+- Categorized execution is treated as sequential grouped ATVM sub-runs, not as one parent run with internal phases.
+- In categorized mode, the watcher writes per-subrun state under `subruns/` and posts each completed grouped run separately.
 - Best-practice controller install path: `/opt/atvm-watcher-service`.
 - This package is local-only right now. Nothing here is installed on the controller yet.
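
The way categorized sub-runs are detected from the parent run log can be sketched as follows. This is a simplified illustration, not the watcher's actual function: `split_segments` and the sample log lines are invented for the example, and it assumes log timestamps can be read as UTC. The marker pattern and timestamp format match the ones used by the implementation change in this commit.

```python
import re
from datetime import datetime, timezone
from typing import List, Optional, Tuple

# Marker line that opens each categorized sub-run segment in the run log.
SEGMENT_START = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - INFO - Extracted specPattern:",
    re.M,
)


def parse_log_timestamp(raw: str) -> Optional[datetime]:
    try:
        # Assumption: log timestamps are interpreted as UTC.
        return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S,%f").replace(tzinfo=timezone.utc)
    except ValueError:
        return None


def split_segments(log_text: str) -> List[Tuple[Optional[datetime], str, bool]]:
    """Return (start time, segment text, finished) for each sub-run segment."""
    starts = [(m.start(), parse_log_timestamp(m.group(1))) for m in SEGMENT_START.finditer(log_text)]
    segments = []
    for i, (offset, ts) in enumerate(starts):
        is_last = i == len(starts) - 1
        end = len(log_text) if is_last else starts[i + 1][0]
        # A segment counts as finished once a later segment has started.
        segments.append((ts, log_text[offset:end], not is_last))
    return segments


# Invented sample log lines, for illustration only.
sample_log = (
    "2026-03-26 10:00:00,123 - INFO - Extracted specPattern: group-a\n"
    "2026-03-26 10:00:05,000 - INFO - running\n"
    "2026-03-26 10:30:00,456 - INFO - Extracted specPattern: group-b\n"
)
for ts, text, finished in split_segments(sample_log):
    print(ts, finished, len(text.splitlines()))
```

Only the last segment stays open, so its completion still has to be decided from process state and result artifacts rather than from the log alone.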
--- a/atvm/watcher-service/atvm_run_watcher.py
+++ b/atvm/watcher-service/atvm_run_watcher.py
@@ -40,6 +40,17 @@ class HostResult:
    timestamp: Optional[datetime] = None


+@dataclass
+class SubRun:
+    key: str
+    display_name: str
+    started_at: datetime
+    expected_hosts: List[str]
+    completed: bool
+    currents_url: Optional[str]
+    notes: List[str]
+
+
 def now_utc() -> datetime:
    return datetime.now(timezone.utc)

@@ -152,6 +163,13 @@ def parse_xml_timestamp(raw: Optional[str]) -> Optional[datetime]:
        return None


+def parse_log_timestamp(raw: str) -> Optional[datetime]:
+    try:
+        return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S,%f").replace(tzinfo=timezone.utc)
+    except ValueError:
+        return None
+
+
 def parse_host_xml(xml_path: Path) -> Optional[Tuple[str, HostResult]]:
    try:
        tree = ET.parse(xml_path)
@@ -194,6 +212,7 @@ def collect_host_results(
    expected_hosts: List[str],
    kernels: Dict[str, str],
    run_started_at: datetime,
+    run_ended_at: Optional[datetime] = None,
 ) -> Dict[str, HostResult]:
    xml_dir = reporter_root / "xml"
    results: Dict[str, HostResult] = {}
@@ -203,6 +222,8 @@ def collect_host_results(
        xml_mtime = datetime.fromtimestamp(xml_path.stat().st_mtime, tz=timezone.utc)
        if xml_mtime < run_started_at:
            continue
+        if run_ended_at and xml_mtime >= run_ended_at:
+            continue
        parsed = parse_host_xml(xml_path)
        if not parsed:
            continue
@@ -214,21 +235,46 @@ def collect_host_results(
    return results


-def find_current_running_host(log_text: str, completed_hosts: List[str]) -> Optional[str]:
-    matches = re.findall(r"Running:\s+(?:cypress/cmcRegressionTest/)?(atvm[^/\s]+)\.ts", log_text)
-    for host in reversed(matches):
-        if host not in completed_hosts:
-            return host
-    return None
+def find_check_xml_end(
+    reporter_root: Path,
+    started_at: datetime,
+    ended_at: Optional[datetime] = None,
+) -> Optional[datetime]:
+    xml_dir = reporter_root / "xml"
+    if not xml_dir.exists():
+        return None
+    latest: Optional[datetime] = None
+    for xml_path in sorted(xml_dir.glob("test-result-*.xml"), key=lambda p: p.stat().st_mtime):
+        xml_mtime = datetime.fromtimestamp(xml_path.stat().st_mtime, tz=timezone.utc)
+        if xml_mtime < started_at:
+            continue
+        if ended_at and xml_mtime >= ended_at:
+            continue
+        text = read_text(xml_path)
+        if "check-xml-files.ts" not in text:
+            continue
+        try:
+            tree = ET.parse(xml_path)
+            root = tree.getroot()
+            suite = root.find("testsuite")
+            if suite is None:
+                continue
+            ts = parse_xml_timestamp(suite.attrib.get("timestamp"))
+            if ts:
+                latest = ts
+        except ET.ParseError:
+            continue
+    return latest


-def infer_metadata() -> Dict[str, str]:
+def infer_metadata() -> Dict[str, object]:
    return {
        "template": os.environ.get("ATVM_WATCHER_TEMPLATE", "unknown"),
        "config_family": os.environ.get("ATVM_WATCHER_CONFIG_FAMILY", "unknown"),
        "migration_style": os.environ.get("ATVM_WATCHER_MIGRATION_STYLE", "ATVM automation validation"),
        "integration_plugin": os.environ.get("ATVM_WATCHER_INTEGRATION_PLUGIN", "unknown"),
        "scope_description": os.environ.get("ATVM_WATCHER_SCOPE_DESCRIPTION", "requested ATVM run scope"),
+        "categorized": os.environ.get("ATVM_WATCHER_CATEGORIZED", "false").lower() == "true",
    }


@@ -253,7 +299,7 @@ def format_timestamp_local(ts: Optional[datetime]) -> str:

 def build_status_markdown(
    build_name: str,
-    metadata: Dict[str, str],
+    metadata: Dict[str, object],
    host_results: Dict[str, HostResult],
    run_state: str,
    currents_url: Optional[str],
@@ -348,80 +394,225 @@ def post_to_mattermost(text: str) -> str:
        return response.read().decode().strip()


+def sanitize_key(raw: str) -> str:
+    return re.sub(r"[^A-Za-z0-9_.-]+", "-", raw).strip("-") or "subrun"
+
+
+def infer_group_label(hosts: List[str], index: int) -> str:
+    if not hosts:
+        return f"group{index}"
+    labels: List[str] = []
+    for host in hosts:
+        short = host.split("-", 1)[-1]
+        if short.startswith("w2k"):
+            label = "windows"
+        else:
+            label = re.sub(r"\d.*$", "", short) or short
+        if label not in labels:
+            labels.append(label)
+    return "-".join(labels) if labels else f"group{index}"
+
+
+def extract_segment_build_name(segment_text: str, parent_build_name: str) -> Optional[str]:
+    patterns = [
+        rf"({re.escape(parent_build_name)}-[A-Za-z0-9_.-]*batch\d+_\d+)",
+        r"([A-Za-z0-9_.-]+-batch\d+_\d+)",
+    ]
+    for pattern in patterns:
+        match = re.search(pattern, segment_text)
+        if match:
+            return match.group(1)
+    return None
+
+
+def split_log_segments(log_text: str, parent_build_name: str, categorized: bool, default_started_at: datetime) -> List[SubRun]:
+    if not categorized:
+        return [
+            SubRun(
+                key=sanitize_key(parent_build_name),
+                display_name=parent_build_name,
+                started_at=default_started_at,
+                expected_hosts=extract_expected_hosts(log_text),
+                completed=False,
+                currents_url=extract_currents_url(log_text),
+                notes=[],
+            )
+        ]
+
+    segment_starts: List[Tuple[int, Optional[datetime]]] = []
+    for match in re.finditer(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - INFO - Extracted specPattern:", log_text, re.M):
+        segment_starts.append((match.start(), parse_log_timestamp(match.group(1))))
+
+    if not segment_starts:
+        return [
+            SubRun(
+                key=sanitize_key(parent_build_name),
+                display_name=parent_build_name,
+                started_at=default_started_at,
+                expected_hosts=extract_expected_hosts(log_text),
+                completed=False,
+                currents_url=extract_currents_url(log_text),
+                notes=["Categorized mode was requested but no sub-run segment has appeared in the log yet."],
+            )
+        ]
+
+    segments: List[SubRun] = []
+    for index, (start_offset, start_ts) in enumerate(segment_starts, start=1):
+        end_offset = segment_starts[index][0] if index < len(segment_starts) else len(log_text)
+        segment_text = log_text[start_offset:end_offset]
+        expected_hosts = extract_expected_hosts(segment_text)
+        display_name = extract_segment_build_name(segment_text, parent_build_name)
+        if not display_name:
+            display_name = f"{parent_build_name}-{infer_group_label(expected_hosts, index)}"
+        segments.append(
+            SubRun(
+                key=sanitize_key(display_name),
+                display_name=display_name,
+                started_at=start_ts or default_started_at,
+                expected_hosts=expected_hosts,
+                completed=index < len(segment_starts),
+                currents_url=extract_currents_url(segment_text),
+                notes=[f"Categorized sub-run {index} of {len(segment_starts)}."],
+            )
+        )
+    return segments
+
+
+def evaluate_subrun(
+    subrun: SubRun,
+    reporter_root: Path,
+    inventory: Dict[str, str],
+    end_boundary: Optional[datetime],
+    parent_active: bool,
+    cancelled: bool,
+) -> Tuple[str, Dict[str, HostResult], Optional[datetime], Optional[datetime], Optional[str], List[str]]:
+    notes = list(subrun.notes)
+    host_results = collect_host_results(
+        reporter_root=reporter_root,
+        expected_hosts=subrun.expected_hosts,
+        kernels=inventory,
+        run_started_at=subrun.started_at,
+        run_ended_at=end_boundary,
+    )
+    check_end = find_check_xml_end(reporter_root, subrun.started_at, end_boundary)
+    start_candidates = [result.timestamp for result in host_results.values() if result.timestamp]
+    end_candidates = [result.timestamp for result in host_results.values() if result.timestamp]
+    if check_end:
+        end_candidates.append(check_end)
+    start_ts = min(start_candidates) if start_candidates else subrun.started_at
+    end_ts = max(end_candidates) if end_candidates else None
+
+    if cancelled:
+        notes.append("Cancellation marker detected.")
+        return "CANCELLED", host_results, start_ts, end_ts, subrun.currents_url, notes
+
+    if subrun.completed:
+        if not host_results:
+            notes.append("This categorized sub-run ended but no host results were detected.")
+            return "UNKNOWN", host_results, start_ts, end_ts, subrun.currents_url, notes
+        notes.append("Categorized sub-run completed and the next grouped run was launched.")
+        if check_end:
+            notes.append("Final `check-xml-files.ts` validation passed.")
+        state = "FAILED" if any(result.failures for result in host_results.values()) else "COMPLETED"
+        return state, host_results, start_ts, end_ts, subrun.currents_url, notes
+
+    if parent_active:
+        current_host = next((host for host in subrun.expected_hosts if host not in host_results), None)
+        if current_host:
+            host_results[current_host] = HostResult(
+                host=current_host,
+                kernel=inventory.get(current_host, "unknown"),
+                status="RUN",
+                detail="in progress",
+            )
+        return "RUNNING", host_results, start_ts, end_ts, subrun.currents_url, notes
+
+    if host_results:
+        notes.append("Categorized sub-run completed after the parent runner exited.")
+        if check_end:
+            notes.append("Final `check-xml-files.ts` validation passed.")
+        state = "FAILED" if any(result.failures for result in host_results.values()) else "COMPLETED"
+        return state, host_results, start_ts, end_ts, subrun.currents_url, notes
+
+    notes.append("Parent run exited before this categorized sub-run produced host results.")
+    return "TERMINATED", host_results, start_ts, end_ts, subrun.currents_url, notes
+
+
 def determine_state(
    build_name: str,
    build_dir: Path,
    run_log: Path,
    reporter_root: Path,
    inventory: Dict[str, str],
+    metadata: Dict[str, object],
    started_at: datetime,
    process_gone_since: Optional[datetime],
    process_exit_grace_seconds: int,
-) -> Tuple[str, Dict[str, HostResult], str, Optional[datetime], Optional[datetime], Optional[str], List[str]]:
+) -> Tuple[str, List[Dict[str, object]], Dict[str, HostResult], Optional[datetime], Optional[datetime], Optional[str], List[str]]:
    cancelled_marker = build_dir / "cancelled.marker"
    log_text = read_text(run_log)
-    expected_hosts = extract_expected_hosts(log_text)
-    host_results = collect_host_results(reporter_root, expected_hosts, inventory, started_at)
    active = process_active(build_name)
-    currents_url = extract_currents_url(log_text)
+    cancelled = cancelled_marker.exists()
    notes: List[str] = []
+    subrun_states: List[Dict[str, object]] = []
+    parent_host_results: Dict[str, HostResult] = {}

-    current_host = find_current_running_host(log_text, list(host_results.keys()))
-    if current_host and current_host not in host_results:
-        host_results[current_host] = HostResult(
-            host=current_host,
-            kernel=inventory.get(current_host, "unknown"),
-            status="RUN",
-            detail="in progress",
+    subruns = split_log_segments(log_text, build_name, bool(metadata.get("categorized")), started_at)
+    for index, subrun in enumerate(subruns):
+        next_started_at = subruns[index + 1].started_at if index + 1 < len(subruns) else None
+        state, host_results, start_ts, end_ts, currents_url, subrun_notes = evaluate_subrun(
+            subrun=subrun,
+            reporter_root=reporter_root,
+            inventory=inventory,
+            end_boundary=next_started_at,
+            parent_active=active,
+            cancelled=cancelled,
+        )
+        for host, result in host_results.items():
+            parent_host_results[host] = result
+        subrun_states.append(
+            {
+                "key": subrun.key,
+                "display_name": subrun.display_name,
+                "state": state,
+                "host_results": host_results,
+                "start_ts": start_ts,
+                "end_ts": end_ts,
+                "currents_url": currents_url,
+                "notes": subrun_notes,
+            }
        )

-    start_candidates = [result.timestamp for result in host_results.values() if result.timestamp]
-    end_candidates = [result.timestamp for result in host_results.values() if result.timestamp]
-    check_xml = reporter_root / "xml"
-    for xml_path in sorted(check_xml.glob("test-result-*.xml"), key=lambda p: p.stat().st_mtime, reverse=True):
-        xml_mtime = datetime.fromtimestamp(xml_path.stat().st_mtime, tz=timezone.utc)
-        if xml_mtime < started_at:
-            continue
-        text = read_text(xml_path)
-        if "check-xml-files.ts" in text:
-            try:
-                tree = ET.parse(xml_path)
-                root = tree.getroot()
-                suite = root.find("testsuite")
-                if suite is not None:
-                    ts = parse_xml_timestamp(suite.attrib.get("timestamp"))
-                    if ts:
-                        end_candidates.append(ts)
-            except ET.ParseError:
-                pass
-            break
+    parent_start_candidates = [subrun["start_ts"] for subrun in subrun_states if subrun["start_ts"]]
+    parent_end_candidates = [subrun["end_ts"] for subrun in subrun_states if subrun["end_ts"]]
+    start_ts = min(parent_start_candidates) if parent_start_candidates else started_at
+    end_ts = max(parent_end_candidates) if parent_end_candidates else find_check_xml_end(reporter_root, started_at)
+    currents_url = extract_currents_url(log_text)

-    start_ts = min(start_candidates) if start_candidates else started_at
-    end_ts = max(end_candidates) if end_candidates else None
-
-    if cancelled_marker.exists():
+    if cancelled:
        notes.append("Cancellation marker detected.")
-        return "CANCELLED", host_results, log_text, start_ts, end_ts, currents_url, notes
+        return "CANCELLED", subrun_states, parent_host_results, start_ts, end_ts, currents_url, notes

    if active:
        elapsed = (now_utc() - started_at).total_seconds()
        if elapsed > args.max_watch_seconds:
            notes.append("Watcher exceeded max watch duration while the run still appears active.")
-            return "HUNG", host_results, log_text, start_ts, end_ts, currents_url, notes
-        return "RUNNING", host_results, log_text, start_ts, end_ts, currents_url, notes
+            return "HUNG", subrun_states, parent_host_results, start_ts, end_ts, currents_url, notes
+        return "RUNNING", subrun_states, parent_host_results, start_ts, end_ts, currents_url, notes

-    if "Cloud Run Finished" in log_text or currents_url:
-        state = "FAILED" if any(result.failures for result in host_results.values()) else "COMPLETED"
-        notes.append("Run finished and final reporting artifacts were detected.")
-        if any("check-xml-files.ts" in line for line in log_text.splitlines()):
-            notes.append("Final `check-xml-files.ts` validation passed.")
-        return state, host_results, log_text, start_ts, end_ts, currents_url, notes
+    terminal_subruns = [subrun for subrun in subrun_states if subrun["state"] in {"COMPLETED", "FAILED"}]
+    if terminal_subruns:
+        state = "FAILED" if any(result.failures for result in parent_host_results.values()) else "COMPLETED"
+        notes.append("Run finished and one or more sub-run result artifacts were detected.")
+        if end_ts:
+            notes.append("Final reporting artifacts were detected.")
+        return state, subrun_states, parent_host_results, start_ts, end_ts, currents_url, notes

    if process_gone_since and (now_utc() - process_gone_since).total_seconds() >= process_exit_grace_seconds:
        notes.append("Run process exited without a clean completion signal.")
-        return "TERMINATED", host_results, log_text, start_ts, end_ts, currents_url, notes
+        return "TERMINATED", subrun_states, parent_host_results, start_ts, end_ts, currents_url, notes

-    return "RUNNING", host_results, log_text, start_ts, end_ts, currents_url, notes
+    return "RUNNING", subrun_states, parent_host_results, start_ts, end_ts, currents_url, notes


 if __name__ == "__main__":
@@ -455,12 +646,13 @@ if __name__ == "__main__":
        if active:
            process_gone_since = None

-        run_state, host_results, log_text, start_ts, end_ts, currents_url, notes = determine_state(
+        run_state, subrun_states, host_results, start_ts, end_ts, currents_url, notes = determine_state(
            build_name=build_name,
            build_dir=build_dir,
            run_log=run_log,
            reporter_root=reporter_root,
            inventory=inventory,
+            metadata=metadata,
            started_at=started_at,
            process_gone_since=process_gone_since,
            process_exit_grace_seconds=args.process_exit_grace_seconds,
@@ -478,8 +670,64 @@ if __name__ == "__main__":
            }
            for host, result in host_results.items()
        }
+        state["subruns"] = {
+            subrun["display_name"]: {
+                "state": subrun["state"],
+                "hosts": sorted(subrun["host_results"].keys()),
+                "start_ts": subrun["start_ts"].isoformat() if subrun["start_ts"] else None,
+                "end_ts": subrun["end_ts"].isoformat() if subrun["end_ts"] else None,
+                "currents_url": subrun["currents_url"],
+                "notes": subrun["notes"],
+            }
+            for subrun in subrun_states
+        }
        write_state(state_file, state)

+        for subrun in subrun_states:
+            subrun_dir = build_dir / "subruns" / subrun["key"]
+            ensure_dir(subrun_dir)
+            subrun_state_file = subrun_dir / "state.json"
+            subrun_posted_marker = subrun_dir / "posted.marker"
+            subrun_state = {
+                "display_name": subrun["display_name"],
+                "last_state": subrun["state"],
+                "last_seen_at": now_utc().isoformat(),
+                "host_results": {
+                    host: {
+                        "status": result.status,
+                        "detail": result.detail,
+                        "kernel": result.kernel,
+                        "tests": result.tests,
+                        "failures": result.failures,
+                    }
+                    for host, result in subrun["host_results"].items()
+                },
+                "notes": subrun["notes"],
+                "currents_url": subrun["currents_url"],
+                "started_at": subrun["start_ts"].isoformat() if subrun["start_ts"] else None,
+                "ended_at": subrun["end_ts"].isoformat() if subrun["end_ts"] else None,
+            }
+            if subrun["state"] in {"COMPLETED", "FAILED"} and not subrun_posted_marker.exists():
+                status_text = build_status_markdown(
+                    build_name=subrun["display_name"],
+                    metadata=metadata,
+                    host_results=dict(sorted(subrun["host_results"].items())),
+                    run_state=subrun["state"],
+                    currents_url=subrun["currents_url"],
+                    start_ts=subrun["start_ts"],
+                    end_ts=subrun["end_ts"],
+                    notes=subrun["notes"],
+                )
+                print(status_text)
+                response = post_to_mattermost(status_text)
+                if response != "ok":
+                    raise SystemExit(f"Mattermost webhook did not return ok for {subrun['display_name']}: {response!r}")
+                subrun_posted_marker.write_text("ok\n", encoding="utf-8")
+                subrun_state["mattermost_posted"] = True
+                subrun_state["mattermost_response"] = response
+                print(f"[watcher] Mattermost post confirmed for {subrun['display_name']}.")
+            write_state(subrun_state_file, subrun_state)
+
        if run_state == "RUNNING":
            print(f"[watcher] {build_name}: RUNNING")
            time.sleep(args.poll_interval)
@@ -497,7 +745,7 @@ if __name__ == "__main__":
        )
        print(status_text)

-        if run_state in {"COMPLETED", "FAILED"} and not posted_marker.exists():
+        if not metadata.get("categorized") and run_state in {"COMPLETED", "FAILED"} and not posted_marker.exists():
            response = post_to_mattermost(status_text)
            if response != "ok":
                raise SystemExit(f"Mattermost webhook did not return ok: {response!r}")
--- a/atvm/watcher-service/start-atvm-run-watcher.sh
+++ b/atvm/watcher-service/start-atvm-run-watcher.sh
@@ -13,6 +13,7 @@ Options:
  --migration-style <text>
  --integration-plugin <text>
  --scope-description <text>
+  --categorize
  --state-root <path>   Default: /var/lib/atvm-run-watcher
 EOF
 }
@@ -23,6 +24,7 @@ CONFIG_FAMILY=""
 MIGRATION_STYLE=""
 INTEGRATION_PLUGIN=""
 SCOPE_DESCRIPTION=""
+WATCHER_CATEGORIZED="false"
 STATE_ROOT="/var/lib/atvm-run-watcher"

 while [[ $# -gt 0 ]]; do
@@ -33,6 +35,7 @@ while [[ $# -gt 0 ]]; do
    --migration-style) MIGRATION_STYLE="${2:-}"; shift 2 ;;
    --integration-plugin) INTEGRATION_PLUGIN="${2:-}"; shift 2 ;;
    --scope-description) SCOPE_DESCRIPTION="${2:-}"; shift 2 ;;
+    --categorize) WATCHER_CATEGORIZED="true"; shift ;;
    --state-root) STATE_ROOT="${2:-}"; shift 2 ;;
    -h|--help) usage; exit 0 ;;
    *) echo "Unknown argument: $1" >&2; usage >&2; exit 1 ;;
@@ -54,6 +57,7 @@ ATVM_WATCHER_CONFIG_FAMILY=${CONFIG_FAMILY@Q}
 ATVM_WATCHER_MIGRATION_STYLE=${MIGRATION_STYLE@Q}
 ATVM_WATCHER_INTEGRATION_PLUGIN=${INTEGRATION_PLUGIN@Q}
 ATVM_WATCHER_SCOPE_DESCRIPTION=${SCOPE_DESCRIPTION@Q}
+ATVM_WATCHER_CATEGORIZED=${WATCHER_CATEGORIZED@Q}
 EOF

 systemctl start "atvm-run-watcher@${BUILD_NAME}.service"
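
Reviewer note: the per-sub-run posting loop in the watcher diff above relies on a `posted.marker` file under `subruns/<key>/` so that each completed categorized group is posted once across polling passes. The following is a minimal sketch of that pattern in isolation, not the watcher's actual code. `post_completed_subruns` and `fake_post` are hypothetical names introduced for illustration, and the message format is simplified. The directory layout, the marker name, and the `"ok"` response check are taken from the diff.

```python
import tempfile
from pathlib import Path

TERMINAL_STATES = {"COMPLETED", "FAILED"}


def post_completed_subruns(subruns, build_dir, post):
    """Post each terminal sub-run once; return the keys posted in this pass.

    `post` is any callable that takes the message text and returns "ok" on
    success (a stand-in for the real Mattermost call; hypothetical here).
    """
    posted = []
    for subrun in subruns:
        if subrun["state"] not in TERMINAL_STATES:
            continue  # still running: nothing to post yet
        subrun_dir = Path(build_dir) / "subruns" / subrun["key"]
        subrun_dir.mkdir(parents=True, exist_ok=True)
        marker = subrun_dir / "posted.marker"
        if marker.exists():
            continue  # already posted on an earlier polling pass
        response = post(f"{subrun['display_name']}: {subrun['state']}")
        if response != "ok":
            raise RuntimeError(f"post failed for {subrun['display_name']}: {response!r}")
        # Write the marker only after a confirmed post, so a failed post is retried.
        marker.write_text("ok\n", encoding="utf-8")
        posted.append(subrun["key"])
    return posted


# Demo with an in-memory stand-in for the webhook (assumed, not the real API).
sent = []


def fake_post(text):
    sent.append(text)
    return "ok"


with tempfile.TemporaryDirectory() as tmp:
    subruns = [
        {"key": "group-a", "display_name": "build-1 [group-a]", "state": "COMPLETED"},
        {"key": "group-b", "display_name": "build-1 [group-b]", "state": "RUNNING"},
    ]
    first_pass = post_completed_subruns(subruns, tmp, fake_post)
    subruns[1]["state"] = "FAILED"  # the second group finishes later
    second_pass = post_completed_subruns(subruns, tmp, fake_post)
```

Writing the marker only after the post is confirmed means a failed post is retried on the next pass rather than silently skipped. The trade-off is that a crash between the post and the marker write could produce a duplicate post.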