Add ATVM watcher service and explicit watcher approval flow
- add the per-run ATVM watcher service package under atvm/watcher-service, including the Python watcher, systemd template unit, helper scripts, and deployment docs
- document the watcher-service install and operating model, including one-run-per-instance behavior, Mattermost posting rules, and the best-practice /opt/atvm-watcher-service install path
- clarify ATVM run approval semantics so `approve` means run without watcher and `approve with watcher` means run and start the watcher
- update the ATVM automation guide and AGENTS rules so watcher usage and approval behavior are explicit and consistent
220
atvm/watcher-service/INSTALL.md
Normal file
@@ -0,0 +1,220 @@
# ATVM Watcher Service Install Plan

This document describes how to deploy the ATVM per-run watcher service to the ATVM Cypress controller at `192.168.3.190`.

This is a deployment plan only. It does not perform the installation.

## Goal

Install the local watcher package so the controller can:

- watch one ATVM run per watcher instance
- send final Mattermost status only for `COMPLETED` or `FAILED`
- suppress Mattermost posts for `CANCELLED`, `TERMINATED`, `HUNG`, and `UNKNOWN`
- stop automatically after the watched run reaches a terminal state

## Controller Target Layout

Recommended controller paths:

- package root:
  - `/opt/atvm-watcher-service`
- service unit:
  - `/etc/systemd/system/atvm-run-watcher@.service`
- global environment file:
  - `/etc/atvm-run-watcher.env`
- state root:
  - `/var/lib/atvm-run-watcher`
- ATVM automation root:
  - `/root/cdc-e2e-cyp-12.17.4`

Best-practice rule:

- install the watcher service package under `/opt/atvm-watcher-service`
- do not use `/root/atvm-watcher-service` as the standard install location
- if a temporary `/root/atvm-watcher-service` install exists, replace it with a clean `/opt/atvm-watcher-service` install

## Files To Install

From the local workspace:

- `/home/aw/code/cds/atvm/watcher-service/atvm_run_watcher.py`
- `/home/aw/code/cds/atvm/watcher-service/atvm-run-watcher@.service`
- `/home/aw/code/cds/atvm/watcher-service/start-atvm-run-watcher.sh`
- `/home/aw/code/cds/atvm/watcher-service/cancel-atvm-run-watcher.sh`
- `/home/aw/code/cds/atvm/inventory/vm-inventory.md`

Optional reference docs:

- `/home/aw/code/cds/atvm/watcher-service/README.md`
- `/home/aw/code/cds/atvm/watcher-service/INSTALL.md`

## Required Controller Environment

The controller must have:

- `python3`
- `systemd`
- outbound network access to the Mattermost webhook
- read access to:
  - `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`
  - `/tmp/<build-name>.log`
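
The local prerequisites above could be checked with a small preflight script before installing. This is only a sketch: `preflight` is a hypothetical helper that is not part of the watcher package, and it does not test outbound access to the Mattermost webhook.

```python
import os
import shutil
from typing import Dict


def preflight(reporter_root: str, run_log: str) -> Dict[str, bool]:
    """Report which local controller prerequisites appear to be satisfied.

    Hypothetical helper for illustration; it is not shipped with the watcher.
    """
    return {
        # Both tools must be on PATH for the service to start.
        "python3": shutil.which("python3") is not None,
        "systemctl": shutil.which("systemctl") is not None,
        # os.access returns False for paths that are missing or unreadable.
        "reporter_root_readable": os.access(reporter_root, os.R_OK),
        "run_log_readable": os.access(run_log, os.R_OK),
    }


if __name__ == "__main__":
    checks = preflight(
        "/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter",
        "/tmp/<build-name>.log",  # substitute the real build name
    )
    for name, ok in checks.items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

The run log only exists once a run has been launched, so a `MISSING` result for it before the first run is expected.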

## Required Secrets

The controller needs a watcher environment file with:

- `MATTERMOST_ATVM_WEBHOOK`
- `MATTERMOST_ATVM_CHANNEL`

Recommended file:

- `/etc/atvm-run-watcher.env`

Recommended permissions:

- owner: `root`
- mode: `0600`
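
The recommended ownership and mode can be verified after the file is created. The following is a minimal sketch; `env_file_problems` is a hypothetical helper, not something the package provides.

```python
import stat
from pathlib import Path
from typing import List


def env_file_problems(path: Path) -> List[str]:
    """Return human-readable problems with the watcher environment file.

    Hypothetical helper for illustration; an empty list means the file
    matches the recommended owner (root) and mode (0600).
    """
    try:
        info = path.stat()
    except FileNotFoundError:
        return [f"{path} does not exist"]
    problems: List[str] = []
    if info.st_uid != 0:
        problems.append("owner is not root")
    mode = stat.S_IMODE(info.st_mode)  # permission bits only
    if mode != 0o600:
        problems.append(f"mode is {mode:04o}, expected 0600")
    return problems


if __name__ == "__main__":
    for problem in env_file_problems(Path("/etc/atvm-run-watcher.env")):
        print(problem)
```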

## Deployment Steps

1. Create controller directories.
   - `/opt/atvm-watcher-service`
   - `/var/lib/atvm-run-watcher`

2. Copy package files to the controller.
   - copy the Python watcher
   - copy the `systemd` unit file
   - copy the helper scripts
   - copy `vm-inventory.md`

3. Set executable permissions.
   - `atvm_run_watcher.py`
   - `start-atvm-run-watcher.sh`
   - `cancel-atvm-run-watcher.sh`

4. Create `/etc/atvm-run-watcher.env`.
   - add the Mattermost webhook and channel
   - restrict permissions to owner `root`, mode `0600`

5. Install the `systemd` unit file.
   - copy to `/etc/systemd/system/atvm-run-watcher@.service`

6. Reload `systemd`.
   - `systemctl daemon-reload`

7. Run a syntax/smoke validation.
   - check Python import/launch
   - check helper script usage
   - verify the unit resolves

8. Do a non-production test.
   - start a watcher for a fake or completed build name
   - confirm state directory creation
   - confirm the watcher exits as expected

9. Do a real ATVM run test.
   - launch a real run
   - start the watcher for that build name
   - confirm final Mattermost delivery for a completed run

## Recommended Validation Commands

Examples for later execution on the controller:

```bash
mkdir -p /opt/atvm-watcher-service /var/lib/atvm-run-watcher
```

```bash
chmod 755 /opt/atvm-watcher-service/atvm_run_watcher.py
chmod 755 /opt/atvm-watcher-service/start-atvm-run-watcher.sh
chmod 755 /opt/atvm-watcher-service/cancel-atvm-run-watcher.sh
```

```bash
systemctl daemon-reload
systemctl cat atvm-run-watcher@.service
```

```bash
python3 /opt/atvm-watcher-service/atvm_run_watcher.py --help
```

```bash
/opt/atvm-watcher-service/start-atvm-run-watcher.sh --help
```

## Per-Run Usage After Install

Once installed, the intended workflow is:

1. Launch the ATVM run as usual.
2. Start the watcher for that build name.
3. Let the watcher run on the controller.
4. The watcher exits on terminal state.

Example:

```bash
/opt/atvm-watcher-service/start-atvm-run-watcher.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
  --template cmc-e2e \
  --config-family gold \
  --migration-style "ATVM end-to-end migration validation" \
  --integration-plugin "pure with fc" \
  --scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"
```

Cancel example:

```bash
/opt/atvm-watcher-service/cancel-atvm-run-watcher.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc
```

## Operational Notes

- The watcher is not a long-running daemon; each instance lives only for the run it watches.
- One watcher instance is started per ATVM run.
- The watcher exits after the run reaches a terminal state.
- The watcher writes state under `/var/lib/atvm-run-watcher/<build-name>`.
- The watcher prevents duplicate Mattermost posts by writing a posted marker.
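
The posted-marker rule can be illustrated as follows. This is a sketch of the idea rather than the watcher's exact code: the `posted.marker` file name matches the watcher source, while `post_once` and its `send` callback are hypothetical names introduced here.

```python
from pathlib import Path
from typing import Callable


def post_once(build_dir: Path, send: Callable[[], str]) -> bool:
    """Send the final status unless a marker shows it was already sent.

    Returns True when a post was made and False when it was skipped.
    `send` is a hypothetical callback that returns the webhook response.
    """
    marker = build_dir / "posted.marker"
    if marker.exists():
        return False  # a previous watcher start already posted
    response = send()
    if response != "ok":
        # Leave the marker absent so a later start can retry the post.
        raise RuntimeError(f"webhook did not return ok: {response!r}")
    marker.write_text("ok\n", encoding="utf-8")
    return True
```

Because the marker is written only after the webhook confirms `ok`, restarting a watcher for the same build name after a successful post does not produce a second message.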

## Failure Handling

Expected terminal behavior:

- `COMPLETED`
  - post to Mattermost
  - verify `ok`
  - exit
- `FAILED`
  - post to Mattermost
  - verify `ok`
  - exit
- `CANCELLED`
  - do not post
  - exit
- `TERMINATED`
  - do not post
  - exit
- `HUNG`
  - do not post
  - exit
- `UNKNOWN`
  - do not post
  - exit
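
The table of terminal behaviors above can be expressed as a small lookup. This is an illustrative sketch only; `terminal_actions` and the two set names are hypothetical and do not appear in the watcher.

```python
from typing import List

# States that produce a Mattermost post, per the list above.
POSTING_STATES = frozenset({"COMPLETED", "FAILED"})
# Terminal states that end the watcher without posting.
SILENT_STATES = frozenset({"CANCELLED", "TERMINATED", "HUNG", "UNKNOWN"})


def terminal_actions(run_state: str) -> List[str]:
    """Return the ordered actions for a terminal run state."""
    if run_state in POSTING_STATES:
        return ["post", "verify ok", "exit"]
    if run_state in SILENT_STATES:
        return ["exit"]
    # RUNNING (or anything unrecognized) is not a terminal state.
    raise ValueError(f"not a terminal state: {run_state!r}")
```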

## Answer To "Do We Need An Installer README?"

Not strictly, but yes, it is useful.

Why:

- it gives a repeatable controller deployment procedure
- it separates local package design from controller installation steps
- it makes later install/reinstall safer
- it gives you a review checkpoint before anything is installed on `192.168.3.190`

That is the purpose of this file.
109
atvm/watcher-service/README.md
Normal file
@@ -0,0 +1,109 @@
# ATVM Watcher Service

This folder contains a per-run ATVM watcher service package. It is intended to be reviewed locally first and installed on the ATVM Cypress controller only when explicitly requested.

## Purpose

Watch a single ATVM automation run until it reaches a terminal state, then:

- post the final status to Mattermost if the run state is `COMPLETED` or `FAILED`
- verify the Mattermost post succeeded
- write durable watcher state
- exit cleanly so the service stops

The watcher does not run indefinitely. It is designed for one run per service instance.

## Files

- `atvm_run_watcher.py`
  - main watcher implementation
- `atvm-run-watcher@.service`
  - `systemd` template unit for one watcher instance per build name
- `start-atvm-run-watcher.sh`
  - helper to write per-run environment data and start a watcher instance
- `cancel-atvm-run-watcher.sh`
  - helper to mark a run cancelled and stop the watcher instance

## Intended Controller Paths

These are the default install targets assumed by the included unit file:

- service package root: `/opt/atvm-watcher-service`
- watcher state root: `/var/lib/atvm-run-watcher`
- controller ATVM automation root: `/root/cdc-e2e-cyp-12.17.4`
- watcher environment file: `/etc/atvm-run-watcher.env`

Use `/opt/atvm-watcher-service` as the controller install root for future installs and reinstalls.
Do not treat `/root/atvm-watcher-service` as the preferred long-term install location.

## Per-Run Behavior

Each watcher instance is tied to one build name.

Typical workflow:

1. Launch the ATVM run.
2. Start the watcher for that run.
3. The watcher polls the run log, process state, and `cmcReporter` artifacts.
4. When the run reaches a terminal state:
   - `COMPLETED` or `FAILED`
     - build the final ATVM status
     - send the status to Mattermost
     - verify Mattermost returned `ok`
     - mark the run as posted
     - exit
   - `CANCELLED`, `TERMINATED`, `HUNG`, or `UNKNOWN`
     - do not post
     - mark the final state
     - exit
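
The poll-until-terminal workflow above can be sketched as a short loop. This is a simplified illustration, not the watcher's implementation: `watch_until_terminal` and its `poll` callback are hypothetical, and the 30-second default mirrors the watcher's `--poll-interval` default.

```python
import time
from typing import Callable

TERMINAL_STATES = frozenset(
    {"COMPLETED", "FAILED", "CANCELLED", "TERMINATED", "HUNG", "UNKNOWN"}
)


def watch_until_terminal(
    poll: Callable[[], str],
    interval_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `poll` until it reports a terminal state, then return that state.

    `poll` is a hypothetical callback standing in for the real inspection of
    the run log, process list, and cmcReporter artifacts. `sleep` is
    injectable so the loop can be exercised without waiting.
    """
    while True:
        state = poll()
        if state in TERMINAL_STATES:
            return state
        sleep(interval_seconds)
```

The returned state then decides whether a Mattermost post is made before the process exits.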

## Required Environment

The service expects these values from the local credentials file to be available on the controller through the service environment:

- `MATTERMOST_ATVM_WEBHOOK`
- `MATTERMOST_ATVM_CHANNEL`

Optional metadata for better status formatting:

- `ATVM_WATCHER_TEMPLATE`
- `ATVM_WATCHER_CONFIG_FAMILY`
- `ATVM_WATCHER_MIGRATION_STYLE`
- `ATVM_WATCHER_INTEGRATION_PLUGIN`
- `ATVM_WATCHER_SCOPE_DESCRIPTION`
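
As a rough illustration of how these variables might be consumed, the sketch below reads them from a mapping such as `os.environ`. The default strings are the ones used in `atvm_run_watcher.py`; `read_watcher_settings` itself is a hypothetical name, and treating an empty value like an unset one is an assumption of this sketch. The watcher code treats the channel as optional, which the sketch mirrors.

```python
from typing import Dict, Mapping, Optional

# Fallbacks taken from the watcher source for the optional metadata keys.
OPTIONAL_DEFAULTS = {
    "ATVM_WATCHER_TEMPLATE": "unknown",
    "ATVM_WATCHER_CONFIG_FAMILY": "unknown",
    "ATVM_WATCHER_MIGRATION_STYLE": "ATVM automation validation",
    "ATVM_WATCHER_INTEGRATION_PLUGIN": "unknown",
    "ATVM_WATCHER_SCOPE_DESCRIPTION": "requested ATVM run scope",
}


def read_watcher_settings(env: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Collect watcher settings from an environment-like mapping."""
    settings: Dict[str, Optional[str]] = {
        # The webhook is mandatory: a missing key raises KeyError.
        "webhook": env["MATTERMOST_ATVM_WEBHOOK"],
        "channel": env.get("MATTERMOST_ATVM_CHANNEL") or None,
    }
    for key, default in OPTIONAL_DEFAULTS.items():
        short_name = key[len("ATVM_WATCHER_"):].lower()
        # Assumption: an empty string counts as "not provided".
        settings[short_name] = env.get(key) or default
    return settings
```

In practice the mapping would be `os.environ`, populated by `systemd` from the environment files.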

## Start Example

This helper writes a per-run environment file and starts the matching instance:

```bash
./start-atvm-run-watcher.sh \
  --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc \
  --template cmc-e2e \
  --config-family gold \
  --migration-style "ATVM end-to-end migration validation" \
  --integration-plugin "pure with fc" \
  --scope-description "mixed Linux and Windows FC E2E validation on the gold datastore set"
```

That results in:

- state dir:
  - `/var/lib/atvm-run-watcher/e2e-redhat9.6-ubuntu24.04-w2k25-fc`
- service instance:
  - `atvm-run-watcher@e2e-redhat9.6-ubuntu24.04-w2k25-fc.service`
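
The naming scheme above follows directly from the build name. A minimal sketch of the mapping, using a hypothetical `watcher_paths` helper (the real work is done by the shell helper and the `systemd` template unit):

```python
from pathlib import PurePosixPath
from typing import Dict


def watcher_paths(
    build_name: str, state_root: str = "/var/lib/atvm-run-watcher"
) -> Dict[str, str]:
    """Derive the per-run state paths and unit name from a build name."""
    state_dir = PurePosixPath(state_root) / build_name
    return {
        "state_dir": str(state_dir),
        # Per-run environment file written by the start helper.
        "env_file": str(state_dir / "watch.env"),
        # Instance of the atvm-run-watcher@.service template unit.
        "unit": f"atvm-run-watcher@{build_name}.service",
    }


if __name__ == "__main__":
    for key, value in watcher_paths("e2e-redhat9.6-ubuntu24.04-w2k25-fc").items():
        print(f"{key}: {value}")
```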

## Cancel Example

```bash
./cancel-atvm-run-watcher.sh --build-name e2e-redhat9.6-ubuntu24.04-w2k25-fc
```

This writes a cancellation marker and stops the watcher instance. The watcher will not send Mattermost results for that run.

## Notes

- The watcher uses the same ATVM status layout documented in `atvm/docs/automation/status-template.md`.
- Kernel values are resolved from `atvm/inventory/vm-inventory.md`.
- Best-practice controller install path: `/opt/atvm-watcher-service`.
- This package is local-only right now. Nothing here is installed on the controller yet.
Binary file not shown.
15
atvm/watcher-service/atvm-run-watcher@.service
Normal file
@@ -0,0 +1,15 @@
[Unit]
Description=ATVM run watcher for %i
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
WorkingDirectory=/opt/atvm-watcher-service
EnvironmentFile=-/etc/atvm-run-watcher.env
EnvironmentFile=-/var/lib/atvm-run-watcher/%i/watch.env
ExecStart=/usr/bin/env python3 /opt/atvm-watcher-service/atvm_run_watcher.py --build-name %i --run-log /tmp/%i.log --reporter-root /root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter --inventory-file /opt/atvm-watcher-service/vm-inventory.md --state-dir /var/lib/atvm-run-watcher
Restart=no

[Install]
WantedBy=multi-user.target
512
atvm/watcher-service/atvm_run_watcher.py
Normal file
@@ -0,0 +1,512 @@
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import ast
import json
import os
import re
import subprocess
import sys
import time
import urllib.request
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List, Optional, Tuple


RUN_STATES = {
    "COMPLETED",
    "FAILED",
    "CANCELLED",
    "TERMINATED",
    "HUNG",
    "UNKNOWN",
    "RUNNING",
}


@dataclass
class HostResult:
    host: str
    kernel: str
    status: str
    detail: str
    tests: int = 0
    failures: int = 0
    duration_seconds: Optional[float] = None
    timestamp: Optional[datetime] = None


def now_utc() -> datetime:
    return datetime.now(timezone.utc)


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--build-name", required=True)
    parser.add_argument("--run-log", required=True)
    parser.add_argument("--reporter-root", required=True)
    parser.add_argument("--inventory-file", required=True)
    parser.add_argument("--state-dir", required=True)
    parser.add_argument("--poll-interval", type=int, default=30)
    parser.add_argument("--max-watch-seconds", type=int, default=6 * 60 * 60)
    parser.add_argument("--process-exit-grace-seconds", type=int, default=120)
    return parser.parse_args()


def ensure_dir(path: Path) -> None:
    path.mkdir(parents=True, exist_ok=True)


def load_inventory(path: Path) -> Dict[str, str]:
    kernels: Dict[str, str] = {}
    if not path.exists():
        return kernels
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.startswith("|"):
            continue
        parts = [part.strip() for part in line.strip().strip("|").split("|")]
        if len(parts) < 3 or parts[0] in {"OS", "---"}:
            continue
        host = parts[1]
        kernel = parts[2]
        kernels[host] = kernel or "unknown"
    return kernels


def run_ps() -> str:
    proc = subprocess.run(
        ["ps", "-eo", "pid=,args="],
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout


def process_active(build_name: str) -> bool:
    output = run_ps()
    for line in output.splitlines():
        if "run-sorry-cypress.py" in line and f"--build_name {build_name}" in line:
            return True
    return False


def read_text(path: Path) -> str:
    try:
        return path.read_text(encoding="utf-8", errors="replace")
    except FileNotFoundError:
        return ""


def extract_expected_hosts(log_text: str) -> List[str]:
    hosts: List[str] = []
    spec_match = re.search(r'Extracted specPattern:\s*(\[[^\n]+\])', log_text)
    if spec_match:
        try:
            spec_list = ast.literal_eval(spec_match.group(1))
        except (SyntaxError, ValueError):
            spec_list = []
        for entry in spec_list:
            if not isinstance(entry, str):
                continue
            match = re.search(r"(atvm[^/\s]+)\.ts$", entry)
            if match:
                host = match.group(1)
                if host not in hosts:
                    hosts.append(host)
    for match in re.finditer(r"Running:\s+(?:cypress/cmcRegressionTest/)?(atvm[^/\s]+)\.ts", log_text):
        host = match.group(1)
        if host not in hosts:
            hosts.append(host)
    return hosts


def extract_currents_url(log_text: str) -> Optional[str]:
    match = re.search(r"(https://\S+/run/\S+)", log_text)
    return match.group(1) if match else None


def load_state(state_file: Path) -> Dict[str, object]:
    if not state_file.exists():
        return {}
    try:
        return json.loads(state_file.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return {}


def write_state(state_file: Path, state: Dict[str, object]) -> None:
    state_file.write_text(json.dumps(state, indent=2, sort_keys=True), encoding="utf-8")


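def parse_xml_timestamp(raw: Optional[str]) -> Optional[datetime]:
    if not raw:
        return None
    try:
        parsed = datetime.fromisoformat(raw)
    except ValueError:
        return None
    # Naive timestamps are treated as UTC. Aware timestamps are converted to
    # UTC instead of being relabelled, so a non-UTC offset is not lost.
    if parsed.tzinfo is None:
        return parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

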
def parse_host_xml(xml_path: Path) -> Optional[Tuple[str, HostResult]]:
    try:
        tree = ET.parse(xml_path)
    except ET.ParseError:
        return None
    root = tree.getroot()
    suites = root.findall("testsuite")
    file_name = None
    tests = int(float(root.attrib.get("tests", "0")))
    failures = int(float(root.attrib.get("failures", "0")))
    total_time = float(root.attrib.get("time", "0"))
    timestamp = None
    for suite in suites:
        file_attr = suite.attrib.get("file", "")
        if file_attr.startswith("cypress/cmcRegressionTest/atvm") and file_attr.endswith(".ts"):
            file_name = Path(file_attr).stem
            timestamp = parse_xml_timestamp(suite.attrib.get("timestamp"))
            tests = int(float(suite.attrib.get("tests", root.attrib.get("tests", "0"))))
            failures = int(float(suite.attrib.get("failures", root.attrib.get("failures", "0"))))
            total_time = float(suite.attrib.get("time", root.attrib.get("time", "0")))
            break
    if not file_name:
        return None
    detail = f"{tests} tests, {failures} failures"
    status = "FAIL" if failures else "PASS"
    return file_name, HostResult(
        host=file_name,
        kernel="unknown",
        status=status,
        detail=detail,
        tests=tests,
        failures=failures,
        duration_seconds=total_time,
        timestamp=timestamp,
    )


def collect_host_results(
    reporter_root: Path,
    expected_hosts: List[str],
    kernels: Dict[str, str],
    run_started_at: datetime,
) -> Dict[str, HostResult]:
    xml_dir = reporter_root / "xml"
    results: Dict[str, HostResult] = {}
    if not xml_dir.exists():
        return results
    for xml_path in sorted(xml_dir.glob("test-result-*.xml"), key=lambda p: p.stat().st_mtime):
        xml_mtime = datetime.fromtimestamp(xml_path.stat().st_mtime, tz=timezone.utc)
        if xml_mtime < run_started_at:
            continue
        parsed = parse_host_xml(xml_path)
        if not parsed:
            continue
        host, result = parsed
        if expected_hosts and host not in expected_hosts:
            continue
        result.kernel = kernels.get(host, "unknown")
        results[host] = result
    return results


def find_current_running_host(log_text: str, completed_hosts: List[str]) -> Optional[str]:
    matches = re.findall(r"Running:\s+(?:cypress/cmcRegressionTest/)?(atvm[^/\s]+)\.ts", log_text)
    for host in reversed(matches):
        if host not in completed_hosts:
            return host
    return None


def infer_metadata() -> Dict[str, str]:
    # The start helper writes every ATVM_WATCHER_* key to watch.env, so an
    # option that was not passed may arrive as an empty string. Using `or`
    # applies the fallback to empty values as well as unset ones.
    env = os.environ
    return {
        "template": env.get("ATVM_WATCHER_TEMPLATE") or "unknown",
        "config_family": env.get("ATVM_WATCHER_CONFIG_FAMILY") or "unknown",
        "migration_style": env.get("ATVM_WATCHER_MIGRATION_STYLE") or "ATVM automation validation",
        "integration_plugin": env.get("ATVM_WATCHER_INTEGRATION_PLUGIN") or "unknown",
        "scope_description": env.get("ATVM_WATCHER_SCOPE_DESCRIPTION") or "requested ATVM run scope",
    }


def format_duration(seconds: Optional[float]) -> str:
    if seconds is None:
        return "n/a"
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(int(minutes), 60)
    if hours:
        return f"{hours}h {minutes:02d}m {secs:05.2f}s"
    if minutes:
        return f"{minutes}m {secs:05.2f}s"
    return f"{secs:.3f}s"


def format_timestamp_local(ts: Optional[datetime]) -> str:
    if not ts:
        return "n/a"
    local = ts.astimezone()
    return local.strftime("%Y-%m-%d %H:%M:%S %Z")


def build_status_markdown(
    build_name: str,
    metadata: Dict[str, str],
    host_results: Dict[str, HostResult],
    run_state: str,
    currents_url: Optional[str],
    start_ts: Optional[datetime],
    end_ts: Optional[datetime],
    notes: List[str],
) -> str:
    ordered_hosts = list(host_results.values())
    finished = len([h for h in ordered_hosts if h.status in {"PASS", "FAIL"}])
    passed = len([h for h in ordered_hosts if h.status == "PASS"])
    failed = len([h for h in ordered_hosts if h.status == "FAIL"])
    skipped = len([h for h in ordered_hosts if h.status == "SKIP"])
    durations = [h.duration_seconds for h in ordered_hosts if h.duration_seconds is not None]
    quickest = min((h for h in ordered_hosts if h.duration_seconds is not None), key=lambda h: h.duration_seconds, default=None)
    longest = max((h for h in ordered_hosts if h.duration_seconds is not None), key=lambda h: h.duration_seconds, default=None)
    average = (sum(durations) / len(durations)) if durations else None

    host_lines = ["| Host | Kernel | Status | Detail |", "| --- | --- | --- | --- |"]
    for host in ordered_hosts:
        icon = {
            "PASS": "✅ PASS",
            "FAIL": "⚠️ FAIL",
            "RUN": "⏳ RUN",
            "SKIP": "⏭️ SKIP",
            "NOT STARTED": "⏳ RUN",
        }.get(host.status, host.status)
        host_lines.append(f"| {host.host} | {host.kernel} | {icon} | {host.detail} |")

    if currents_url:
        notes = notes + [f"Currents recorded run: `{currents_url}`"]

    notes_block = "\n".join(f"- {note}" for note in notes) if notes else "- none"

    lines = [
        "## ATVM Run Status",
        f"### {build_name}",
        "",
        "**COVERAGE:**",
        f"- template: `{metadata['template']}`",
        f"- datastore/config family: `{metadata['config_family']}`",
        f"- migration style: {metadata['migration_style']}",
        f"- integration/plugin path: `{metadata['integration_plugin']}`",
        f"- scope of this run: {metadata['scope_description']}",
        "",
        "**FUNCTIONALLY:**",
        "- verify VM setup and power state",
        "- power on, obtain IP address, and verify hostname reachability",
        "- uninstall existing CMC if present",
        "- prepare source and destination disks and validate source-side data",
        "- install CMC and execute the requested ATVM migration workflow",
        "- finalize reporting, cleanup, and the final `check-xml-files.ts` validation step",
        "",
        "**SUMMARY:**",
        "",
        "| Metric | Value |",
        "| --- | --- |",
        f"| finished | {finished} |",
        f"| passed | {passed} |",
        f"| failed | {failed} |",
        f"| skipped | {skipped} |",
        "",
        "**HOSTS:**",
        "",
        *host_lines,
        "",
        "**TIMING:**",
        "",
        "| Metric | Value |",
        "| --- | --- |",
        f"| start | {format_timestamp_local(start_ts)} |",
        f"| end | {format_timestamp_local(end_ts)} |",
        f"| total | {format_duration((end_ts - start_ts).total_seconds()) if start_ts and end_ts else 'n/a'} |",
        f"| quickest | {f'{quickest.host} - {format_duration(quickest.duration_seconds)}' if quickest else 'n/a'} |",
        f"| longest | {f'{longest.host} - {format_duration(longest.duration_seconds)}' if longest else 'n/a'} |",
        f"| average | {format_duration(average) if average is not None else 'n/a'} |",
        "",
        "**NOTES:**",
        notes_block,
    ]
    return "\n".join(lines)


def post_to_mattermost(text: str) -> str:
    webhook = os.environ["MATTERMOST_ATVM_WEBHOOK"]
    payload = {"text": text}
    channel = os.environ.get("MATTERMOST_ATVM_CHANNEL")
    if channel:
        payload["channel"] = channel
    data = json.dumps(payload).encode()
    request = urllib.request.Request(webhook, data=data, headers={"Content-Type": "application/json"})
    # Bound the request so an unresponsive webhook cannot block the watcher
    # indefinitely. The 30-second value is an arbitrary choice.
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode().strip()


def determine_state(
    build_name: str,
    build_dir: Path,
    run_log: Path,
    reporter_root: Path,
    inventory: Dict[str, str],
    started_at: datetime,
    process_gone_since: Optional[datetime],
    process_exit_grace_seconds: int,
    max_watch_seconds: int,
) -> Tuple[str, Dict[str, HostResult], str, Optional[datetime], Optional[datetime], Optional[str], List[str]]:
    cancelled_marker = build_dir / "cancelled.marker"
    log_text = read_text(run_log)
    expected_hosts = extract_expected_hosts(log_text)
    host_results = collect_host_results(reporter_root, expected_hosts, inventory, started_at)
    active = process_active(build_name)
    currents_url = extract_currents_url(log_text)
    notes: List[str] = []

    current_host = find_current_running_host(log_text, list(host_results.keys()))
    if current_host and current_host not in host_results:
        host_results[current_host] = HostResult(
            host=current_host,
            kernel=inventory.get(current_host, "unknown"),
            status="RUN",
            detail="in progress",
        )

    start_candidates = [result.timestamp for result in host_results.values() if result.timestamp]
    end_candidates = [result.timestamp for result in host_results.values() if result.timestamp]
    check_xml = reporter_root / "xml"
    for xml_path in sorted(check_xml.glob("test-result-*.xml"), key=lambda p: p.stat().st_mtime, reverse=True):
        xml_mtime = datetime.fromtimestamp(xml_path.stat().st_mtime, tz=timezone.utc)
        if xml_mtime < started_at:
            continue
        text = read_text(xml_path)
        if "check-xml-files.ts" in text:
            try:
                tree = ET.parse(xml_path)
                root = tree.getroot()
                suite = root.find("testsuite")
                if suite is not None:
                    ts = parse_xml_timestamp(suite.attrib.get("timestamp"))
                    if ts:
                        end_candidates.append(ts)
            except ET.ParseError:
                pass
            break

    start_ts = min(start_candidates) if start_candidates else started_at
    end_ts = max(end_candidates) if end_candidates else None

    if cancelled_marker.exists():
        notes.append("Cancellation marker detected.")
        return "CANCELLED", host_results, log_text, start_ts, end_ts, currents_url, notes

    if active:
        elapsed = (now_utc() - started_at).total_seconds()
        # The limit is passed in explicitly instead of being read from the
        # module-level `args`, so this function does not depend on globals.
        if elapsed > max_watch_seconds:
            notes.append("Watcher exceeded max watch duration while the run still appears active.")
            return "HUNG", host_results, log_text, start_ts, end_ts, currents_url, notes
        return "RUNNING", host_results, log_text, start_ts, end_ts, currents_url, notes

    if "Cloud Run Finished" in log_text or currents_url:
        state = "FAILED" if any(result.failures for result in host_results.values()) else "COMPLETED"
        notes.append("Run finished and final reporting artifacts were detected.")
        if any("check-xml-files.ts" in line for line in log_text.splitlines()):
            notes.append("Final `check-xml-files.ts` validation passed.")
        return state, host_results, log_text, start_ts, end_ts, currents_url, notes

    if process_gone_since and (now_utc() - process_gone_since).total_seconds() >= process_exit_grace_seconds:
        notes.append("Run process exited without a clean completion signal.")
        return "TERMINATED", host_results, log_text, start_ts, end_ts, currents_url, notes

    return "RUNNING", host_results, log_text, start_ts, end_ts, currents_url, notes


if __name__ == "__main__":
    args = parse_args()
    build_name = args.build_name
    run_log = Path(args.run_log)
    reporter_root = Path(args.reporter_root)
    inventory_file = Path(args.inventory_file)
    state_root = Path(args.state_dir)
    build_dir = state_root / build_name
    ensure_dir(build_dir)
    state_file = build_dir / "state.json"
    posted_marker = build_dir / "posted.marker"

    inventory = load_inventory(inventory_file)
    metadata = infer_metadata()

    state = load_state(state_file)
    default_started_at = datetime.fromtimestamp(run_log.stat().st_mtime, tz=timezone.utc) if run_log.exists() else now_utc()
    started_at = parse_xml_timestamp(state.get("started_at")) or default_started_at
    state.setdefault("build_name", build_name)
    state.setdefault("started_at", started_at.isoformat())
    write_state(state_file, state)

    process_gone_since: Optional[datetime] = None

    while True:
        active = process_active(build_name)
        if not active and process_gone_since is None:
            process_gone_since = now_utc()
        if active:
            process_gone_since = None

        run_state, host_results, log_text, start_ts, end_ts, currents_url, notes = determine_state(
            build_name=build_name,
            build_dir=build_dir,
            run_log=run_log,
            reporter_root=reporter_root,
            inventory=inventory,
            started_at=started_at,
            process_gone_since=process_gone_since,
            process_exit_grace_seconds=args.process_exit_grace_seconds,
            max_watch_seconds=args.max_watch_seconds,
        )

        state["last_state"] = run_state
        state["last_seen_at"] = now_utc().isoformat()
        state["host_results"] = {
            host: {
                "status": result.status,
                "detail": result.detail,
                "kernel": result.kernel,
                "tests": result.tests,
                "failures": result.failures,
            }
            for host, result in host_results.items()
        }
        write_state(state_file, state)

        if run_state == "RUNNING":
            print(f"[watcher] {build_name}: RUNNING")
            time.sleep(args.poll_interval)
            continue

        status_text = build_status_markdown(
            build_name=build_name,
            metadata=metadata,
            host_results=dict(sorted(host_results.items())),
            run_state=run_state,
            currents_url=currents_url,
            start_ts=start_ts,
            end_ts=end_ts,
            notes=notes,
        )
        print(status_text)

        if run_state in {"COMPLETED", "FAILED"} and not posted_marker.exists():
            response = post_to_mattermost(status_text)
            if response != "ok":
                raise SystemExit(f"Mattermost webhook did not return ok: {response!r}")
            posted_marker.write_text("ok\n", encoding="utf-8")
            state["mattermost_posted"] = True
            state["mattermost_response"] = response
            write_state(state_file, state)
            print(f"[watcher] Mattermost post confirmed for {build_name}.")

        state["closed_at"] = now_utc().isoformat()
        write_state(state_file, state)
        sys.exit(0)
32
atvm/watcher-service/cancel-atvm-run-watcher.sh
Normal file
@@ -0,0 +1,32 @@
#!/usr/bin/env bash
set -euo pipefail

usage() {
  cat <<'EOF'
Usage:
  cancel-atvm-run-watcher.sh --build-name <name> [--state-root <path>]
EOF
}

BUILD_NAME=""
STATE_ROOT="/var/lib/atvm-run-watcher"

while [[ $# -gt 0 ]]; do
  case "$1" in
    --build-name) BUILD_NAME="${2:-}"; shift 2 ;;
    --state-root) STATE_ROOT="${2:-}"; shift 2 ;;
    -h|--help) usage; exit 0 ;;
    *) echo "Unknown argument: $1" >&2; usage >&2; exit 1 ;;
  esac
done

if [[ -z "$BUILD_NAME" ]]; then
  echo "--build-name is required" >&2
  usage >&2
  exit 1
fi

RUN_DIR="${STATE_ROOT}/${BUILD_NAME}"
mkdir -p "$RUN_DIR"
touch "${RUN_DIR}/cancelled.marker"
systemctl stop "atvm-run-watcher@${BUILD_NAME}.service" || true
60
atvm/watcher-service/start-atvm-run-watcher.sh
Normal file
@@ -0,0 +1,60 @@
#!/usr/bin/env bash
set -euo pipefail

usage() {
  cat <<'EOF'
Usage:
  start-atvm-run-watcher.sh --build-name <name> [options]

Options:
  --build-name <name>
  --template <name>
  --config-family <name>
  --migration-style <text>
  --integration-plugin <text>
  --scope-description <text>
  --state-root <path>          Default: /var/lib/atvm-run-watcher
EOF
}

BUILD_NAME=""
TEMPLATE=""
CONFIG_FAMILY=""
MIGRATION_STYLE=""
INTEGRATION_PLUGIN=""
SCOPE_DESCRIPTION=""
STATE_ROOT="/var/lib/atvm-run-watcher"

while [[ $# -gt 0 ]]; do
  case "$1" in
    --build-name) BUILD_NAME="${2:-}"; shift 2 ;;
    --template) TEMPLATE="${2:-}"; shift 2 ;;
    --config-family) CONFIG_FAMILY="${2:-}"; shift 2 ;;
    --migration-style) MIGRATION_STYLE="${2:-}"; shift 2 ;;
    --integration-plugin) INTEGRATION_PLUGIN="${2:-}"; shift 2 ;;
    --scope-description) SCOPE_DESCRIPTION="${2:-}"; shift 2 ;;
    --state-root) STATE_ROOT="${2:-}"; shift 2 ;;
    -h|--help) usage; exit 0 ;;
    *) echo "Unknown argument: $1" >&2; usage >&2; exit 1 ;;
  esac
done

if [[ -z "$BUILD_NAME" ]]; then
  echo "--build-name is required" >&2
  usage >&2
  exit 1
fi

RUN_DIR="${STATE_ROOT}/${BUILD_NAME}"
mkdir -p "$RUN_DIR"

cat >"${RUN_DIR}/watch.env" <<EOF
ATVM_WATCHER_TEMPLATE=${TEMPLATE@Q}
ATVM_WATCHER_CONFIG_FAMILY=${CONFIG_FAMILY@Q}
ATVM_WATCHER_MIGRATION_STYLE=${MIGRATION_STYLE@Q}
ATVM_WATCHER_INTEGRATION_PLUGIN=${INTEGRATION_PLUGIN@Q}
ATVM_WATCHER_SCOPE_DESCRIPTION=${SCOPE_DESCRIPTION@Q}
EOF

systemctl start "atvm-run-watcher@${BUILD_NAME}.service"
systemctl status --no-pager "atvm-run-watcher@${BUILD_NAME}.service" || true