Add ATVM watcher service and explicit watcher approval flow

- add the per-run ATVM watcher service package under atvm/watcher-service, including the Python watcher, systemd template unit, helper scripts, and deployment docs
- document the watcher-service install and operating model, including one-run-per-instance behavior, Mattermost posting rules, and the best-practice /opt/atvm-watcher-service install path
- clarify ATVM run approval semantics so `approve` means run without watcher and `approve with watcher` means run and start the watcher
- update the ATVM automation guide and AGENTS rules so watcher usage and approval behavior are explicit and consistent
2026-03-25 17:41:50 -04:00
parent fe228ff0e9
commit ba8354b95c
9 changed files with 962 additions and 8 deletions
--- a/atvm/AGENTS.md
+++ b/atvm/AGENTS.md
@@ -58,6 +58,7 @@ This file defines how to operate and maintain the ATVM workspace in `/home/aw/co
 - Before any automation run, always check whether automation is already running.
 - Always show exact planned ATVM commands before execution.
 - Never execute setup or automation commands that require approval until the operator explicitly approves them.
+- For ATVM run approvals, treat `approve` as run-without-watcher and `approve with watcher` as run-with-watcher.
 - For host-level test detail and failed-test investigation, use `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`, especially `logs/`, `xml/`, and `mochawesome/`.
 - If the operator asks for ATVM run status without mentioning Mattermost, respond locally only and do not post externally.
 - If the operator asks to send ATVM run status to Mattermost, use `MATTERMOST_ATVM_WEBHOOK` and `MATTERMOST_ATVM_CHANNEL` from `/home/aw/code/cds/.env.credentials.local` by default and send the final status only after the run has fully completed, whether the run passed or failed.
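
A minimal sketch of the approval rule added to `AGENTS.md` above, assuming the operator's reply arrives as a plain string. The names `Approval` and `parse_approval` are hypothetical and not part of the ATVM tooling. Case-insensitive matching and whitespace normalization are also assumptions; any reply other than the two phrases is treated as no approval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approval:
    run: bool            # execute the displayed ATVM run commands
    start_watcher: bool  # also start the per-run watcher for that build


def parse_approval(reply: str) -> Approval:
    """Map an operator reply to an approval decision (hypothetical helper)."""
    # Collapse whitespace and ignore case (an assumption, not a documented rule).
    normalized = " ".join(reply.strip().lower().split())
    if normalized == "approve with watcher":
        return Approval(run=True, start_watcher=True)
    if normalized == "approve":
        return Approval(run=True, start_watcher=False)
    # Anything else is not explicit approval, so nothing runs.
    return Approval(run=False, start_watcher=False)
```

Returning an explicit "nothing approved" value for unrecognized replies keeps the default on the safe side of the review gate.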
--- a/atvm/docs/automation/guide.md
+++ b/atvm/docs/automation/guide.md
@@ -38,7 +38,9 @@ Run ATVM CMC automation tests on the designated automation VM without unintended
 - Never execute `cmc-templates.py`, `run-sorry-cypress.py`, or any other ATVM run command until the operator explicitly approves the displayed command(s).
 - Approval is required even for preparation-only steps such as template generation.
 - If the operator changes any part of the request after commands are displayed, rebuild the commands, show the updated commands, and wait for fresh approval before executing anything.
-- Execute only after explicit approval (for example `approve`).
+- Execute ATVM run commands only after explicit approval.
+- Treat `approve` as approval to run without the watcher service.
+- Treat `approve with watcher` as approval to run and also start the per-run watcher service for that build.
 - After execution, report immediate success/failure only.
 - Do not actively monitor completion unless explicitly requested.
 - If monitoring is requested, allow long runtime windows (15-30+ minutes) and continue until completion unless operator instructs otherwise.
@@ -147,13 +149,16 @@ Before any new automation request:
 1. Build exact command(s) for the request.
 2. Present them verbatim as planned commands before running anything.
 3. Wait for explicit approval.
-4. Run only approved command(s), no extra options and no silent substitutions.
-5. When both template generation and the Cypress runner are requested, run them sequentially, not in parallel.
-6. Do not launch `run-sorry-cypress.py` until `cmc-templates.py` has exited successfully and finished updating the intended config/spec files.
-7. Treat displayed commands as a review gate: do not execute either command until the operator has had a chance to review them and explicitly approve.
-8. If the operator asks to change plugin, config, filters, build name, Gold Disk, or scope after commands are shown, discard the old plan, show the revised commands, and wait for new approval.
-9. If monitoring was not requested, report immediate success/failure for each command.
-10. If monitoring was requested, keep monitoring until completion and report final outcome.
+4. When the watcher is available, present the watcher-start command separately from the core run commands.
+5. Treat `approve` as approval to execute the ATVM run without starting the watcher.
+6. Treat `approve with watcher` as approval to execute the ATVM run and start the watcher for that build.
+7. Run only approved command(s), no extra options and no silent substitutions.
+8. When both template generation and the Cypress runner are requested, run them sequentially, not in parallel.
+9. Do not launch `run-sorry-cypress.py` until `cmc-templates.py` has exited successfully and finished updating the intended config/spec files.
+10. Treat displayed commands as a review gate: do not execute either command until the operator has had a chance to review them and explicitly approve.
+11. If the operator asks to change plugin, config, filters, build name, Gold Disk, or scope after commands are shown, discard the old plan, show the revised commands, and wait for new approval.
+12. If monitoring was not requested, report immediate success/failure for each command.
+13. If monitoring was requested, keep monitoring until completion and report final outcome.

 ## Requested Test Style
 When asked for one VM or a VM set:
--- a/atvm/watcher-service/INSTALL.md
+++ b/atvm/watcher-service/INSTALL.md
@@ -0,0 +1,220 @@
+# ATVM Watcher Service Install Plan
+
+This document describes how to deploy the ATVM per-run watcher service to the ATVM Cypress controller at `192.168.3.190`.
+
+This is a deployment plan only. It does not perform the installation.
+
+## Goal
+
+Install the local watcher package so the controller can:
+
+- watch one ATVM run per watcher instance
+- send final Mattermost status only for `COMPLETED` or `FAILED`
+- suppress Mattermost posts for `CANCELLED`, `TERMINATED`, `HUNG`, and `UNKNOWN`
+- stop automatically after the watched run reaches a terminal state
+
+## Controller Target Layout
+
+Recommended controller paths:
+
+- package root:
+  - `/opt/atvm-watcher-service`
+- service unit:
+  - `/etc/systemd/system/atvm-run-watcher@.service`
+- global environment file:
+  - `/etc/atvm-run-watcher.env`
+- state root:
+  - `/var/lib/atvm-run-watcher`
+- ATVM automation root:
+  - `/root/cdc-e2e-cyp-12.17.4`
+
+Best-practice rule:
+
+- install the watcher service package under `/opt/atvm-watcher-service`
+- do not use `/root/atvm-watcher-service` as the standard install location
+- if a temporary `/root/atvm-watcher-service` install exists, replace it with a clean `/opt/atvm-watcher-service` install
+
+## Files To Install
+
+From the local workspace:
+
+- `/home/aw/code/cds/atvm/watcher-service/atvm_run_watcher.py`
+- `/home/aw/code/cds/atvm/watcher-service/atvm-run-watcher@.service`
+- `/home/aw/code/cds/atvm/watcher-service/start-atvm-run-watcher.sh`
+- `/home/aw/code/cds/atvm/watcher-service/cancel-atvm-run-watcher.sh`
+- `/home/aw/code/cds/atvm/inventory/vm-inventory.md`
+
+Optional reference docs:
+
+- `/home/aw/code/cds/atvm/watcher-service/README.md`
+- `/home/aw/code/cds/atvm/watcher-service/INSTALL.md`
+
+## Required Controller Environment
+
+The controller must have:
+
+- `python3`
+- `systemd`
+- outbound network access to the Mattermost webhook
+- read access to:
+  - `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`
+  - `/tmp/<build-name>.log`
+
+## Required Secrets
+
+The controller needs a watcher environment file with:
+
+- `MATTERMOST_ATVM_WEBHOOK`
+- `MATTERMOST_ATVM_CHANNEL`
+
+Recommended file:
+
+- `/etc/atvm-run-watcher.env`
+
+Recommended permissions:
+
+- owner: `root`
+- mode: `0600`
+
+## Deployment Steps
+
+1. Create controller directories.
+   - `/opt/atvm-watcher-service`
+   - `/var/lib/atvm-run-watcher`
+
+2. Copy package files to the controller.
+   - copy the Python watcher
+   - copy the `systemd` unit file
+   - copy the helper scripts
+   - copy `vm-inventory.md`
+
+3. Set executable permissions.
+   - `atvm_run_watcher.py`
+   - `start-atvm-run-watcher.sh`
+   - `cancel-atvm-run-watcher.sh`
+
+4. Create `/etc/atvm-run-watcher.env`.
+   - add Mattermost webhook/channel
+   - keep permissions restricted
+
+5. Install the `systemd` unit file.
+   - copy to `/etc/systemd/system/atvm-run-watcher@.service`
+
+6. Reload `systemd`.
+   - `systemctl daemon-reload`
+
+7. Run a syntax/smoke validation.
+   - check Python import/launch
+   - check helper script usage
+   - verify the unit resolves
+
+8. Do a non-production test.
+   - start a watcher for a fake or completed build name
+   - confirm state directory creation
+   - confirm the watcher exits as expected
+
+9. Do a real ATVM run test.
+   - launch a real run
+   - start the watcher for that build name
+   - confirm final Mattermost delivery for a completed run
+
+## Recommended Validation Commands
+
+Examples for later execution on the controller:
+
+```bash
+mkdir -p /opt/atvm-watcher-service /var/lib/atvm-run-watcher
+```
+
+```bash
+chmod 755 /opt/atvm-watcher-service/atvm_run_watcher.py
+chmod 755 /opt/atvm-watcher-service/start-atvm-run-watcher.sh
+chmod 755 /opt/atvm-watcher-service/cancel-atvm-run-watcher.sh
+```
+
+```bash
+systemctl daemon-reload
+systemctl cat atvm-run-watcher@.service
+```
+
+```bash
+python3 /opt/atvm-watcher-service/atvm_run_watcher.py --help
+```
+
+```bash
+/opt/atvm-watcher-service/start-atvm-run-watcher.sh --help
+```
+
+## Per-Run Usage After Install
+
+Once installed, the intended workflow is:
+
+1. Launch the ATVM run as usual.
+2. Start the watcher for that build name.
+3. Let the watcher run on the controller.
+4. The watcher exits on terminal state.
+
+Example:
+
+```bash
+/opt/atvm-watcher-service/start-atvm-run-watcher.sh \
+  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
+  --template cmc-e2e \
+  --config-family gold \
+  --migration-style "ATVM end-to-end migration validation" \
+  --integration-plugin "pure with fc" \
+  --scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"
+```
+
+Cancel example:
+
+```bash
+/opt/atvm-watcher-service/cancel-atvm-run-watcher.sh \
+  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc
+```
+
+## Operational Notes
+
+- This is not a daemon.
+- One watcher instance is started per ATVM run.
+- The watcher exits after the run reaches a terminal state.
+- The watcher writes state under `/var/lib/atvm-run-watcher/<build-name>`.
+- The watcher prevents duplicate Mattermost posts by writing a posted marker.
+
+## Failure Handling
+
+Expected terminal behavior:
+
+- `COMPLETED`
+  - post to Mattermost
+  - verify `ok`
+  - exit
+- `FAILED`
+  - post to Mattermost
+  - verify `ok`
+  - exit
+- `CANCELLED`
+  - do not post
+  - exit
+- `TERMINATED`
+  - do not post
+  - exit
+- `HUNG`
+  - do not post
+  - exit
+- `UNKNOWN`
+  - do not post
+  - exit
+
+## Answer To "Do We Need An Installer README?"
+
+Not strictly, but yes, it is useful.
+
+Why:
+
+- it gives a repeatable controller deployment procedure
+- it separates local package design from controller installation steps
+- it makes later install/reinstall safer
+- it gives you a review checkpoint before anything is installed on `192.168.3.190`
+
+That is the purpose of this file.
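
The "Failure Handling" table in `INSTALL.md` above can be sketched as a small decision function. This is an illustration only: `terminal_action` and `post_succeeded` are hypothetical names, and the actual watcher may structure this logic differently. The state names and the `ok` check come from the document.

```python
# States that end the watch and trigger a Mattermost post.
POSTING_STATES = {"COMPLETED", "FAILED"}
# States that end the watch without posting.
SILENT_STATES = {"CANCELLED", "TERMINATED", "HUNG", "UNKNOWN"}


def terminal_action(run_state: str) -> str:
    """Return 'post', 'silent', or 'continue' for a run state (hypothetical helper)."""
    if run_state in POSTING_STATES:
        return "post"
    if run_state in SILENT_STATES:
        return "silent"
    # Any other state (for example RUNNING) means the watcher keeps polling.
    return "continue"


def post_succeeded(response_body: str) -> bool:
    """Treat the post as delivered only when the webhook response body is 'ok'."""
    return response_body.strip() == "ok"
```

Both `post` and `silent` end the watcher; only `post` sends anything to Mattermost.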
--- a/atvm/watcher-service/README.md
+++ b/atvm/watcher-service/README.md
@@ -0,0 +1,109 @@
+# ATVM Watcher Service
+
+This folder contains the per-run ATVM watcher service package. Review it locally first; install it on the ATVM Cypress controller only when explicitly requested.
+
+## Purpose
+
+Watch a single ATVM automation run until it reaches a terminal state, then:
+
+- post the final status to Mattermost if the run state is `COMPLETED` or `FAILED`
+- verify the Mattermost post succeeded
+- write durable watcher state
+- exit cleanly so the service stops
+
+The watcher does not run indefinitely. It is designed for one run per service instance.
+
+## Files
+
+- `atvm_run_watcher.py`
+  - main watcher implementation
+- `atvm-run-watcher@.service`
+  - `systemd` template unit for one watcher instance per build name
+- `start-atvm-run-watcher.sh`
+  - helper to write per-run environment data and start a watcher instance
+- `cancel-atvm-run-watcher.sh`
+  - helper to mark a run cancelled and stop the watcher instance
+
+## Intended Controller Paths
+
+These are the default install targets assumed by the included unit file:
+
+- service package root: `/opt/atvm-watcher-service`
+- watcher state root: `/var/lib/atvm-run-watcher`
+- controller ATVM automation root: `/root/cdc-e2e-cyp-12.17.4`
+- watcher environment file: `/etc/atvm-run-watcher.env`
+
+Use `/opt/atvm-watcher-service` as the controller install root for future installs and reinstalls.
+Do not treat `/root/atvm-watcher-service` as the preferred long-term install location.
+
+## Per-Run Behavior
+
+Each watcher instance is tied to one build name.
+
+Typical workflow:
+
+1. Launch the ATVM run.
+2. Start the watcher for that run.
+3. The watcher polls the run log, process state, and `cmcReporter` artifacts.
+4. When the run reaches a terminal state:
+   - `COMPLETED` or `FAILED`
+     - build the final ATVM status
+     - send the status to Mattermost
+     - verify Mattermost returned `ok`
+     - mark the run as posted
+     - exit
+   - `CANCELLED`, `TERMINATED`, `HUNG`, or `UNKNOWN`
+     - do not post
+     - mark the final state
+     - exit
+
+## Required Environment
+
+The service expects these values from the local credentials file to be available in the service environment on the controller:
+
+- `MATTERMOST_ATVM_WEBHOOK`
+- `MATTERMOST_ATVM_CHANNEL`
+
+Optional metadata for better status formatting:
+
+- `ATVM_WATCHER_TEMPLATE`
+- `ATVM_WATCHER_CONFIG_FAMILY`
+- `ATVM_WATCHER_MIGRATION_STYLE`
+- `ATVM_WATCHER_INTEGRATION_PLUGIN`
+- `ATVM_WATCHER_SCOPE_DESCRIPTION`
+
+## Start Example
+
+This helper writes a per-run environment file and starts the matching instance:
+
+```bash
+./start-atvm-run-watcher.sh \
+  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
+  --template cmc-e2e \
+  --config-family gold \
+  --migration-style "ATVM end-to-end migration validation" \
+  --integration-plugin "pure with fc" \
+  --scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"
+```
+
+That results in:
+
+- state dir:
+  - `/var/lib/atvm-run-watcher/e2e-redhat9.6-ubuntu24.04-w2k25-fc`
+- service instance:
+  - `atvm-run-watcher@e2e-redhat9.6-ubuntu24.04-w2k25-fc.service`
+
+## Cancel Example
+
+```bash
+./cancel-atvm-run-watcher.sh --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc
+```
+
+This writes a cancellation marker and stops the watcher instance. The watcher will not send Mattermost results for that run.
+
+## Notes
+
+- The watcher uses the same ATVM status layout documented in `atvm/docs/automation/status-template.md`.
+- Kernel values are resolved from `atvm/inventory/vm-inventory.md`.
+- Best-practice controller install path: `/opt/atvm-watcher-service`.
+- This package is local-only right now. Nothing here is installed on the controller yet.
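
As a rough illustration of the "Start Example" in `README.md` above, the per-run paths and the service instance name can all be derived from the build name. The helper `watcher_paths` is hypothetical; the path layout is taken from the README and the unit file in this commit. The sketch assumes the build name contains only characters that are valid in a systemd unit name (no `/`, for example).

```python
from pathlib import PurePosixPath

STATE_ROOT = PurePosixPath("/var/lib/atvm-run-watcher")


def watcher_paths(build_name: str) -> dict:
    """Derive per-run watcher locations from a build name (hypothetical helper)."""
    state_dir = STATE_ROOT / build_name
    return {
        "state_dir": state_dir,
        # Per-run environment file read by the template unit.
        "watch_env": state_dir / "watch.env",
        # Run log location passed to the watcher via --run-log.
        "run_log": PurePosixPath("/tmp") / f"{build_name}.log",
        # systemd template instance for this build.
        "unit": f"atvm-run-watcher@{build_name}.service",
    }
```

For example, `watcher_paths("e2e-redhat9.6-ubuntu24.04-w2k25-fc")["unit"]` gives the service instance name listed in the README.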
--- a/atvm/watcher-service/__pycache__/atvm_run_watcher.cpython-312.pyc
+++ b/atvm/watcher-service/__pycache__/atvm_run_watcher.cpython-312.pyc
--- a/atvm/watcher-service/atvm-run-watcher@.service
+++ b/atvm/watcher-service/atvm-run-watcher@.service
@@ -0,0 +1,15 @@
+[Unit]
+Description=ATVM run watcher for %i
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=simple
+WorkingDirectory=/opt/atvm-watcher-service
+EnvironmentFile=-/etc/atvm-run-watcher.env
+EnvironmentFile=-/var/lib/atvm-run-watcher/%i/watch.env
+ExecStart=/usr/bin/env python3 /opt/atvm-watcher-service/atvm_run_watcher.py --build-name %i --run-log /tmp/%i.log --reporter-root /root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter --inventory-file /opt/atvm-watcher-service/vm-inventory.md --state-dir /var/lib/atvm-run-watcher
+Restart=no
+
+[Install]
+WantedBy=multi-user.target
--- a/atvm/watcher-service/atvm_run_watcher.py
+++ b/atvm/watcher-service/atvm_run_watcher.py
@@ -0,0 +1,512 @@
+#!/usr/bin/env python3
+from __future__ import annotations
+
+import argparse
+import ast
+import json
+import os
+import re
+import subprocess
+import sys
+import time
+import urllib.request
+import xml.etree.ElementTree as ET
+from dataclasses import dataclass
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Dict, List, Optional, Tuple
+
+
+RUN_STATES = {
+    "COMPLETED",
+    "FAILED",
+    "CANCELLED",
+    "TERMINATED",
+    "HUNG",
+    "UNKNOWN",
+    "RUNNING",
+}
+
+
+@dataclass
+class HostResult:
+    host: str
+    kernel: str
+    status: str
+    detail: str
+    tests: int = 0
+    failures: int = 0
+    duration_seconds: Optional[float] = None
+    timestamp: Optional[datetime] = None
+
+
+def now_utc() -> datetime:
+    return datetime.now(timezone.utc)
+
+
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--build-name", required=True)
+    parser.add_argument("--run-log", required=True)
+    parser.add_argument("--reporter-root", required=True)
+    parser.add_argument("--inventory-file", required=True)
+    parser.add_argument("--state-dir", required=True)
+    parser.add_argument("--poll-interval", type=int, default=30)
+    parser.add_argument("--max-watch-seconds", type=int, default=6 * 60 * 60)
+    parser.add_argument("--process-exit-grace-seconds", type=int, default=120)
+    return parser.parse_args()
+
+
+def ensure_dir(path: Path) -> None:
+    path.mkdir(parents=True, exist_ok=True)
+
+
+def load_inventory(path: Path) -> Dict[str, str]:
+    kernels: Dict[str, str] = {}
+    if not path.exists():
+        return kernels
+    for line in path.read_text(encoding="utf-8").splitlines():
+        if not line.startswith("|"):
+            continue
+        parts = [part.strip() for part in line.strip().strip("|").split("|")]
+        if len(parts) < 3 or parts[0] in {"OS", "---"}:
+            continue
+        host = parts[1]
+        kernel = parts[2]
+        kernels[host] = kernel or "unknown"
+    return kernels
+
+
+def run_ps() -> str:
+    proc = subprocess.run(
+        ["ps", "-eo", "pid=,args="],
+        capture_output=True,
+        text=True,
+        check=True,
+    )
+    return proc.stdout
+
+
+def process_active(build_name: str) -> bool:
+    output = run_ps()
+    for line in output.splitlines():
+        if "run-sorry-cypress.py" in line and f"--build_name {build_name}" in line:
+            return True
+    return False
+
+
+def read_text(path: Path) -> str:
+    try:
+        return path.read_text(encoding="utf-8", errors="replace")
+    except FileNotFoundError:
+        return ""
+
+
+def extract_expected_hosts(log_text: str) -> List[str]:
+    hosts: List[str] = []
+    spec_match = re.search(r'Extracted specPattern:\s*(\[[^\n]+\])', log_text)
+    if spec_match:
+        try:
+            spec_list = ast.literal_eval(spec_match.group(1))
+        except (SyntaxError, ValueError):
+            spec_list = []
+        for entry in spec_list:
+            if not isinstance(entry, str):
+                continue
+            match = re.search(r"(atvm[^/\s]+)\.ts$", entry)
+            if match:
+                host = match.group(1)
+                if host not in hosts:
+                    hosts.append(host)
+    for match in re.finditer(r"Running:\s+(?:cypress/cmcRegressionTest/)?(atvm[^/\s]+)\.ts", log_text):
+        host = match.group(1)
+        if host not in hosts:
+            hosts.append(host)
+    return hosts
+
+
+def extract_currents_url(log_text: str) -> Optional[str]:
+    match = re.search(r"(https://\S+/run/\S+)", log_text)
+    return match.group(1) if match else None
+
+
+def load_state(state_file: Path) -> Dict[str, object]:
+    if not state_file.exists():
+        return {}
+    try:
+        return json.loads(state_file.read_text(encoding="utf-8"))
+    except json.JSONDecodeError:
+        return {}
+
+
+def write_state(state_file: Path, state: Dict[str, object]) -> None:
+    state_file.write_text(json.dumps(state, indent=2, sort_keys=True), encoding="utf-8")
+
+
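+def parse_xml_timestamp(raw: Optional[str]) -> Optional[datetime]:
+    if not raw:
+        return None
+    try:
+        parsed = datetime.fromisoformat(raw)
+    except (TypeError, ValueError):
+        return None
+    # Treat naive timestamps as UTC; convert aware ones instead of overwriting their offset.
+    if parsed.tzinfo is None:
+        return parsed.replace(tzinfo=timezone.utc)
+    return parsed.astimezone(timezone.utc)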
+
+
+def parse_host_xml(xml_path: Path) -> Optional[Tuple[str, HostResult]]:
+    try:
+        tree = ET.parse(xml_path)
+    except ET.ParseError:
+        return None
+    root = tree.getroot()
+    suites = root.findall("testsuite")
+    file_name = None
+    tests = int(float(root.attrib.get("tests", "0")))
+    failures = int(float(root.attrib.get("failures", "0")))
+    total_time = float(root.attrib.get("time", "0"))
+    timestamp = None
+    for suite in suites:
+        file_attr = suite.attrib.get("file", "")
+        if file_attr.startswith("cypress/cmcRegressionTest/atvm") and file_attr.endswith(".ts"):
+            file_name = Path(file_attr).stem
+            timestamp = parse_xml_timestamp(suite.attrib.get("timestamp"))
+            tests = int(float(suite.attrib.get("tests", root.attrib.get("tests", "0"))))
+            failures = int(float(suite.attrib.get("failures", root.attrib.get("failures", "0"))))
+            total_time = float(suite.attrib.get("time", root.attrib.get("time", "0")))
+            break
+    if not file_name:
+        return None
+    detail = f"{tests} tests, {failures} failures"
+    status = "FAIL" if failures else "PASS"
+    return file_name, HostResult(
+        host=file_name,
+        kernel="unknown",
+        status=status,
+        detail=detail,
+        tests=tests,
+        failures=failures,
+        duration_seconds=total_time,
+        timestamp=timestamp,
+    )
+
+
+def collect_host_results(
+    reporter_root: Path,
+    expected_hosts: List[str],
+    kernels: Dict[str, str],
+    run_started_at: datetime,
+) -> Dict[str, HostResult]:
+    xml_dir = reporter_root / "xml"
+    results: Dict[str, HostResult] = {}
+    if not xml_dir.exists():
+        return results
+    for xml_path in sorted(xml_dir.glob("test-result-*.xml"), key=lambda p: p.stat().st_mtime):
+        xml_mtime = datetime.fromtimestamp(xml_path.stat().st_mtime, tz=timezone.utc)
+        if xml_mtime < run_started_at:
+            continue
+        parsed = parse_host_xml(xml_path)
+        if not parsed:
+            continue
+        host, result = parsed
+        if expected_hosts and host not in expected_hosts:
+            continue
+        result.kernel = kernels.get(host, "unknown")
+        results[host] = result
+    return results
+
+
+def find_current_running_host(log_text: str, completed_hosts: List[str]) -> Optional[str]:
+    matches = re.findall(r"Running:\s+(?:cypress/cmcRegressionTest/)?(atvm[^/\s]+)\.ts", log_text)
+    for host in reversed(matches):
+        if host not in completed_hosts:
+            return host
+    return None
+
+
+def infer_metadata() -> Dict[str, str]:
+    return {
+        "template": os.environ.get("ATVM_WATCHER_TEMPLATE", "unknown"),
+        "config_family": os.environ.get("ATVM_WATCHER_CONFIG_FAMILY", "unknown"),
+        "migration_style": os.environ.get("ATVM_WATCHER_MIGRATION_STYLE", "ATVM automation validation"),
+        "integration_plugin": os.environ.get("ATVM_WATCHER_INTEGRATION_PLUGIN", "unknown"),
+        "scope_description": os.environ.get("ATVM_WATCHER_SCOPE_DESCRIPTION", "requested ATVM run scope"),
+    }
+
+
+def format_duration(seconds: Optional[float]) -> str:
+    if seconds is None:
+        return "n/a"
+    minutes, secs = divmod(seconds, 60)
+    hours, minutes = divmod(int(minutes), 60)
+    if hours:
+        return f"{hours}h {minutes:02d}m {secs:05.2f}s"
+    if minutes:
+        return f"{minutes}m {secs:05.2f}s"
+    return f"{secs:.3f}s"
+
+
+def format_timestamp_local(ts: Optional[datetime]) -> str:
+    if not ts:
+        return "n/a"
+    local = ts.astimezone()
+    return local.strftime("%Y-%m-%d %H:%M:%S %Z")
+
+
+def build_status_markdown(
+    build_name: str,
+    metadata: Dict[str, str],
+    host_results: Dict[str, HostResult],
+    run_state: str,
+    currents_url: Optional[str],
+    start_ts: Optional[datetime],
+    end_ts: Optional[datetime],
+    notes: List[str],
+) -> str:
+    ordered_hosts = list(host_results.values())
+    finished = len([h for h in ordered_hosts if h.status in {"PASS", "FAIL"}])
+    passed = len([h for h in ordered_hosts if h.status == "PASS"])
+    failed = len([h for h in ordered_hosts if h.status == "FAIL"])
+    skipped = len([h for h in ordered_hosts if h.status == "SKIP"])
+    durations = [h.duration_seconds for h in ordered_hosts if h.duration_seconds is not None]
+    quickest = min((h for h in ordered_hosts if h.duration_seconds is not None), key=lambda h: h.duration_seconds, default=None)
+    longest = max((h for h in ordered_hosts if h.duration_seconds is not None), key=lambda h: h.duration_seconds, default=None)
+    average = (sum(durations) / len(durations)) if durations else None
+
+    host_lines = ["| Host | Kernel | Status | Detail |", "| --- | --- | --- | --- |"]
+    for host in ordered_hosts:
+        icon = {
+            "PASS": "✅ PASS",
+            "FAIL": "⚠️ FAIL",
+            "RUN": "⏳ RUN",
+            "SKIP": "⏭️ SKIP",
+            "NOT STARTED": "⏳ RUN",
+        }.get(host.status, host.status)
+        host_lines.append(f"| {host.host} | {host.kernel} | {icon} | {host.detail} |")
+
+    if currents_url:
+        notes = notes + [f"Currents recorded run: `{currents_url}`"]
+
+    notes_block = "\n".join(f"- {note}" for note in notes) if notes else "- none"
+
+    lines = [
+        "## ATVM Run Status",
+        f"### {build_name}",
+        "",
+        "**COVERAGE:**",
+        f"- template: `{metadata['template']}`",
+        f"- datastore/config family: `{metadata['config_family']}`",
+        f"- migration style: {metadata['migration_style']}",
+        f"- integration/plugin path: `{metadata['integration_plugin']}`",
+        f"- scope of this run: {metadata['scope_description']}",
+        "",
+        "**FUNCTIONALLY:**",
+        "- verify VM setup and power state",
+        "- power on, obtain IP address, and verify hostname reachability",
+        "- uninstall existing CMC if present",
+        "- prepare source and destination disks and validate source-side data",
+        "- install CMC and execute the requested ATVM migration workflow",
+        "- finalize reporting, cleanup, and the final `check-xml-files.ts` validation step",
+        "",
+        "**SUMMARY:**",
+        "",
+        "| Metric | Value |",
+        "| --- | --- |",
+        f"| finished | {finished} |",
+        f"| passed | {passed} |",
+        f"| failed | {failed} |",
+        f"| skipped | {skipped} |",
+        "",
+        "**HOSTS:**",
+        "",
+        *host_lines,
+        "",
+        "**TIMING:**",
+        "",
+        "| Metric | Value |",
+        "| --- | --- |",
+        f"| start | {format_timestamp_local(start_ts)} |",
+        f"| end | {format_timestamp_local(end_ts)} |",
+        f"| total | {format_duration((end_ts - start_ts).total_seconds()) if start_ts and end_ts else 'n/a'} |",
+        f"| quickest | {f'{quickest.host} - {format_duration(quickest.duration_seconds)}' if quickest else 'n/a'} |",
+        f"| longest | {f'{longest.host} - {format_duration(longest.duration_seconds)}' if longest else 'n/a'} |",
+        f"| average | {format_duration(average) if average is not None else 'n/a'} |",
+        "",
+        "**NOTES:**",
+        notes_block,
+    ]
+    return "\n".join(lines)
+
+
+def post_to_mattermost(text: str) -> str:
+    webhook = os.environ["MATTERMOST_ATVM_WEBHOOK"]
+    payload = {"text": text}
+    channel = os.environ.get("MATTERMOST_ATVM_CHANNEL")
+    if channel:
+        payload["channel"] = channel
+    data = json.dumps(payload).encode()
+    request = urllib.request.Request(webhook, data=data, headers={"Content-Type": "application/json"})
+    with urllib.request.urlopen(request) as response:
+        return response.read().decode().strip()
+
+
+def determine_state(
+    build_name: str,
+    build_dir: Path,
+    run_log: Path,
+    reporter_root: Path,
+    inventory: Dict[str, str],
+    started_at: datetime,
+    process_gone_since: Optional[datetime],
+    process_exit_grace_seconds: int,
+) -> Tuple[str, Dict[str, HostResult], str, Optional[datetime], Optional[datetime], Optional[str], List[str]]:
+    cancelled_marker = build_dir / "cancelled.marker"
+    log_text = read_text(run_log)
+    expected_hosts = extract_expected_hosts(log_text)
+    host_results = collect_host_results(reporter_root, expected_hosts, inventory, started_at)
+    active = process_active(build_name)
+    currents_url = extract_currents_url(log_text)
+    notes: List[str] = []
+
+    current_host = find_current_running_host(log_text, list(host_results.keys()))
+    if current_host and current_host not in host_results:
+        host_results[current_host] = HostResult(
+            host=current_host,
+            kernel=inventory.get(current_host, "unknown"),
+            status="RUN",
+            detail="in progress",
+        )
+
+    start_candidates = [result.timestamp for result in host_results.values() if result.timestamp]
+    end_candidates = [result.timestamp for result in host_results.values() if result.timestamp]
+    check_xml = reporter_root / "xml"
+    for xml_path in sorted(check_xml.glob("test-result-*.xml"), key=lambda p: p.stat().st_mtime, reverse=True):
+        xml_mtime = datetime.fromtimestamp(xml_path.stat().st_mtime, tz=timezone.utc)
+        if xml_mtime < started_at:
+            continue
+        text = read_text(xml_path)
+        if "check-xml-files.ts" in text:
+            try:
+                tree = ET.parse(xml_path)
+                root = tree.getroot()
+                suite = root.find("testsuite")
+                if suite is not None:
+                    ts = parse_xml_timestamp(suite.attrib.get("timestamp"))
+                    if ts:
+                        end_candidates.append(ts)
+            except ET.ParseError:
+                pass
+            break
+
+    start_ts = min(start_candidates) if start_candidates else started_at
+    end_ts = max(end_candidates) if end_candidates else None
+
+    if cancelled_marker.exists():
+        notes.append("Cancellation marker detected.")
+        return "CANCELLED", host_results, log_text, start_ts, end_ts, currents_url, notes
+
+    if active:
+        elapsed = (now_utc() - started_at).total_seconds()
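+        # Relies on the module-level `args` assigned under `__main__`; pass the limit in explicitly if this function is imported elsewhere.
+        if elapsed > args.max_watch_seconds: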
+            notes.append("Watcher exceeded max watch duration while the run still appears active.")
+            return "HUNG", host_results, log_text, start_ts, end_ts, currents_url, notes
+        return "RUNNING", host_results, log_text, start_ts, end_ts, currents_url, notes
+
+    if "Cloud Run Finished" in log_text or currents_url:
+        state = "FAILED" if any(result.failures for result in host_results.values()) else "COMPLETED"
+        notes.append("Run finished and final reporting artifacts were detected.")
+        if any("check-xml-files.ts" in line for line in log_text.splitlines()):
+            notes.append("Final `check-xml-files.ts` validation passed.")
+        return state, host_results, log_text, start_ts, end_ts, currents_url, notes
+
+    if process_gone_since and (now_utc() - process_gone_since).total_seconds() >= process_exit_grace_seconds:
+        notes.append("Run process exited without a clean completion signal.")
+        return "TERMINATED", host_results, log_text, start_ts, end_ts, currents_url, notes
+
+    return "RUNNING", host_results, log_text, start_ts, end_ts, currents_url, notes
+
+
+if __name__ == "__main__":
+    args = parse_args()
+    build_name = args.build_name
+    run_log = Path(args.run_log)
+    reporter_root = Path(args.reporter_root)
+    inventory_file = Path(args.inventory_file)
+    state_root = Path(args.state_dir)
+    build_dir = state_root / build_name
+    ensure_dir(build_dir)
+    state_file = build_dir / "state.json"
+    posted_marker = build_dir / "posted.marker"
+
+    inventory = load_inventory(inventory_file)
+    metadata = infer_metadata()
+
+    state = load_state(state_file)
+    default_started_at = datetime.fromtimestamp(run_log.stat().st_mtime, tz=timezone.utc) if run_log.exists() else now_utc()
+    started_at = parse_xml_timestamp(state.get("started_at")) or default_started_at
+    state.setdefault("build_name", build_name)
+    state.setdefault("started_at", started_at.isoformat())
+    write_state(state_file, state)
+
+    process_gone_since: Optional[datetime] = None
+
+    while True:
+        active = process_active(build_name)
+        if not active and process_gone_since is None:
+            process_gone_since = now_utc()
+        if active:
+            process_gone_since = None
+
+        run_state, host_results, log_text, start_ts, end_ts, currents_url, notes = determine_state(
+            build_name=build_name,
+            build_dir=build_dir,
+            run_log=run_log,
+            reporter_root=reporter_root,
+            inventory=inventory,
+            started_at=started_at,
+            process_gone_since=process_gone_since,
+            process_exit_grace_seconds=args.process_exit_grace_seconds,
+        )
+
+        state["last_state"] = run_state
+        state["last_seen_at"] = now_utc().isoformat()
+        state["host_results"] = {
+            host: {
+                "status": result.status,
+                "detail": result.detail,
+                "kernel": result.kernel,
+                "tests": result.tests,
+                "failures": result.failures,
+            }
+            for host, result in host_results.items()
+        }
+        write_state(state_file, state)
+
+        if run_state == "RUNNING":
+            print(f"[watcher] {build_name}: RUNNING", flush=True)
+            time.sleep(args.poll_interval)
+            continue
+
+        status_text = build_status_markdown(
+            build_name=build_name,
+            metadata=metadata,
+            host_results=dict(sorted(host_results.items())),
+            run_state=run_state,
+            currents_url=currents_url,
+            start_ts=start_ts,
+            end_ts=end_ts,
+            notes=notes,
+        )
+        print(status_text)
+
+        if run_state in {"COMPLETED", "FAILED"} and not posted_marker.exists():
+            response = post_to_mattermost(status_text)
+            if response != "ok":
+                raise SystemExit(f"Mattermost webhook did not return ok: {response!r}")
+            posted_marker.write_text("ok\n", encoding="utf-8")
+            state["mattermost_posted"] = True
+            state["mattermost_response"] = response
+            write_state(state_file, state)
+            print(f"[watcher] Mattermost post confirmed for {build_name}.")
+
+        state["closed_at"] = now_utc().isoformat()
+        write_state(state_file, state)
+        sys.exit(0)
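Review note: the tail of `determine_state` above resolves the run state in a fixed order of precedence (cancellation marker, then live process, then completion signal, then exit grace period). The sketch below restates that order as a pure function so it can be exercised without a log file or process table. `RunSignals` and `classify` are hypothetical names introduced only for illustration; they are not part of the watcher, and the fields are assumptions about which inputs matter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunSignals:
    """Hypothetical container for the inputs the state decision depends on."""
    cancelled: bool                 # cancelled.marker exists in the run directory
    active: bool                    # the run process is still alive
    elapsed_seconds: float          # time since started_at
    finished_in_log: bool           # "Cloud Run Finished" seen, or a Currents URL found
    any_failures: bool              # at least one host reported failures
    gone_seconds: Optional[float]   # how long the process has been absent, if at all


def classify(sig: RunSignals, max_watch_seconds: float, exit_grace_seconds: float) -> str:
    # 1. An explicit cancellation always wins.
    if sig.cancelled:
        return "CANCELLED"
    # 2. A live process is RUNNING until the watch budget is exceeded.
    if sig.active:
        return "HUNG" if sig.elapsed_seconds > max_watch_seconds else "RUNNING"
    # 3. With the process gone, a completion signal decides pass or fail.
    if sig.finished_in_log:
        return "FAILED" if sig.any_failures else "COMPLETED"
    # 4. No completion signal: wait out the grace period before giving up.
    if sig.gone_seconds is not None and sig.gone_seconds >= exit_grace_seconds:
        return "TERMINATED"
    # 5. Still inside the grace period, so keep polling.
    return "RUNNING"
```

One consequence of this ordering is that a run whose process has exited but whose log has not yet flushed its final lines stays `RUNNING` until the grace period elapses, which is why the main loop tracks `process_gone_since` across iterations.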
--- a/atvm/watcher-service/cancel-atvm-run-watcher.sh
+++ b/atvm/watcher-service/cancel-atvm-run-watcher.sh
@@ -0,0 +1,32 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+usage() {
+  cat <<'EOF'
+Usage:
+  cancel-atvm-run-watcher.sh --build-name <name> [--state-root <path>]
+EOF
+}
+
+BUILD_NAME=""
+STATE_ROOT="/var/lib/atvm-run-watcher"
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --build-name) BUILD_NAME="${2:-}"; shift 2 ;;
+    --state-root) STATE_ROOT="${2:-}"; shift 2 ;;
+    -h|--help) usage; exit 0 ;;
+    *) echo "Unknown argument: $1" >&2; usage >&2; exit 1 ;;
+  esac
+done
+
+if [[ -z "$BUILD_NAME" ]]; then
+  echo "--build-name is required" >&2
+  usage >&2
+  exit 1
+fi
+
+RUN_DIR="${STATE_ROOT}/${BUILD_NAME}"
+mkdir -p "$RUN_DIR"
+touch "${RUN_DIR}/cancelled.marker"
+systemctl stop "atvm-run-watcher@${BUILD_NAME}.service" || true
--- a/atvm/watcher-service/start-atvm-run-watcher.sh
+++ b/atvm/watcher-service/start-atvm-run-watcher.sh
@@ -0,0 +1,60 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+usage() {
+  cat <<'EOF'
+Usage:
+  start-atvm-run-watcher.sh --build-name <name> [options]
+
+Options:
+  --build-name <name>
+  --template <name>
+  --config-family <name>
+  --migration-style <text>
+  --integration-plugin <text>
+  --scope-description <text>
+  --state-root <path>   Default: /var/lib/atvm-run-watcher
+EOF
+}
+
+BUILD_NAME=""
+TEMPLATE=""
+CONFIG_FAMILY=""
+MIGRATION_STYLE=""
+INTEGRATION_PLUGIN=""
+SCOPE_DESCRIPTION=""
+STATE_ROOT="/var/lib/atvm-run-watcher"
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --build-name) BUILD_NAME="${2:-}"; shift 2 ;;
+    --template) TEMPLATE="${2:-}"; shift 2 ;;
+    --config-family) CONFIG_FAMILY="${2:-}"; shift 2 ;;
+    --migration-style) MIGRATION_STYLE="${2:-}"; shift 2 ;;
+    --integration-plugin) INTEGRATION_PLUGIN="${2:-}"; shift 2 ;;
+    --scope-description) SCOPE_DESCRIPTION="${2:-}"; shift 2 ;;
+    --state-root) STATE_ROOT="${2:-}"; shift 2 ;;
+    -h|--help) usage; exit 0 ;;
+    *) echo "Unknown argument: $1" >&2; usage >&2; exit 1 ;;
+  esac
+done
+
+if [[ -z "$BUILD_NAME" ]]; then
+  echo "--build-name is required" >&2
+  usage >&2
+  exit 1
+fi
+
+RUN_DIR="${STATE_ROOT}/${BUILD_NAME}"
+mkdir -p "$RUN_DIR"
+
+cat >"${RUN_DIR}/watch.env" <<EOF
+ATVM_WATCHER_TEMPLATE=${TEMPLATE@Q}
+ATVM_WATCHER_CONFIG_FAMILY=${CONFIG_FAMILY@Q}
+ATVM_WATCHER_MIGRATION_STYLE=${MIGRATION_STYLE@Q}
+ATVM_WATCHER_INTEGRATION_PLUGIN=${INTEGRATION_PLUGIN@Q}
+ATVM_WATCHER_SCOPE_DESCRIPTION=${SCOPE_DESCRIPTION@Q}
+EOF
+
+systemctl start "atvm-run-watcher@${BUILD_NAME}.service"
+systemctl status --no-pager "atvm-run-watcher@${BUILD_NAME}.service" || true
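Review note: `start-atvm-run-watcher.sh` writes `watch.env` with bash's `${VAR@Q}` expansion, so every value arrives shell-quoted (an empty option becomes `''`). How the file is consumed is not shown in this part of the diff. As a minimal sketch, assuming a Python reader wants the unquoted values back, the standard `shlex` module can undo the quoting for the common single-quoted form. `parse_watch_env` is a hypothetical helper, and the sample values are invented for illustration.

```python
import shlex
from typing import Dict


def parse_watch_env(text: str) -> Dict[str, str]:
    """Parse KEY='value' lines as written by the start script's heredoc."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, quoted = line.partition("=")
        # shlex.split removes POSIX shell quoting; '' yields one empty token.
        parts = shlex.split(quoted)
        values[key] = parts[0] if parts else ""
    return values


# Hypothetical file content in the shape the heredoc produces.
sample = "\n".join([
    "ATVM_WATCHER_TEMPLATE='rhel9'",
    "ATVM_WATCHER_SCOPE_DESCRIPTION='nightly smoke'",
    "ATVM_WATCHER_MIGRATION_STYLE=''",
    "ATVM_WATCHER_INTEGRATION_PLUGIN='it'\\''s-plugin'",
])
parsed = parse_watch_env(sample)
```

This only covers the plain single-quoted output of `@Q`. Values containing control characters such as newlines are emitted by bash in `$'...'` form, which `shlex` does not decode, so such values would need separate handling.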