Update ATVM status reporting and credential handling docs

- change ATVM status formatting to the approved Markdown-table template with SUMMARY:, HOSTS:, TIMING:, and NOTES:
- document that normal status requests print locally only unless explicitly asked to send to Mattermost
- document Mattermost defaults and posting rules, including only sending after full run completion
- document the controller-side systemd watcher design for future automation
- add the secrets migration/cleanup review doc
- ignore .env.credentials.local in git and reflect the move toward using that local credentials file instead of hardcoded secrets
2026-03-24 14:24:10 -04:00
parent b1960b7dd4
commit fa97ce5ad0
6 changed files with 450 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,2 @@
 log/
+.env.credentials.local
--- a/atvm/AGENTS.md
+++ b/atvm/AGENTS.md
@@ -47,6 +47,7 @@ This file defines how to operate and maintain the ATVM workspace in `/home/aw/co
 - Controller IP: `192.168.3.190`
 - Controller credentials: `root / atvmcdsi2012`
 - Detailed test artifact root on controller: `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`
+- Default Mattermost status destination config: `/home/aw/code/cds/.env.credentials.local`
 - Default plugin: `--use_specified_plugin iscsi`
 - Always include `--ignore_force_shutdown` unless explicitly told not to.
 - Default config family: `gold`
@@ -58,6 +59,8 @@ This file defines how to operate and maintain the ATVM workspace in `/home/aw/co
 - Always show exact planned ATVM commands before execution.
 - Never execute setup or automation commands that require approval until the operator explicitly approves them.
 - For host-level test detail and failed-test investigation, use `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`, especially `logs/`, `xml/`, and `mochawesome/`.
+- If the operator asks for ATVM run status without mentioning Mattermost, respond locally only and do not post externally.
+- If the operator asks to send ATVM run status to Mattermost, use `MATTERMOST_ATVM_WEBHOOK` and `MATTERMOST_ATVM_CHANNEL` from `/home/aw/code/cds/.env.credentials.local` by default and send the final status only after the run has fully completed, whether the run passed or failed.
 - Treat `docs/automation/examples.md` as reference-only, not default operator intent.
 - Put reusable workflow rules in `guide.md` files.
 - Put dated lessons only in `run-learnings.md` files.
--- a/atvm/docs/automation/guide.md
+++ b/atvm/docs/automation/guide.md
@@ -174,6 +174,19 @@ When asked for one VM or a VM set:
 - If monitoring was not requested, run commands and report execution success/failure and any errors.
 - If monitoring was requested, do not terminate processes automatically; only terminate if the operator explicitly instructs termination.

+## Mattermost Status Posting
+- Treat a normal ATVM status request as local-only output by default.
+- When the operator asks to send ATVM automation run status to Mattermost, use the local defaults from `/home/aw/code/cds/.env.credentials.local`.
+- Default Mattermost variables:
+  - `MATTERMOST_ATVM_WEBHOOK`
+  - `MATTERMOST_ATVM_CHANNEL`
+- Treat these as the default destination for ATVM automation run-status posts unless the operator explicitly overrides them.
+- Send the final ATVM run status only after the run has fully completed, regardless of whether the run passed or failed.
+- Do not send interim or in-progress ATVM run status updates to Mattermost unless the operator explicitly asks for that.
+- When posting to Mattermost, use the same ATVM status layout that would be shown to the operator locally.
+- Default status template: `/home/aw/code/cds/atvm/docs/automation/status-template.md`
+- Do not post to Mattermost unless the operator explicitly asks for the run status to be sent there.
+
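A minimal sketch of the posting step, assuming `curl`, `sed`, and `awk` are available and that the webhook is a standard Mattermost incoming webhook accepting a JSON body with `text` and `channel` fields. The function names `json_escape`, `build_payload`, and `post_status` are illustrative and not part of the ATVM tooling:

```bash
# Escape backslashes, double quotes, and newlines for use inside a JSON string.
# Other control characters (tabs, carriage returns) are not handled in this sketch.
json_escape() {
  printf '%s' "$1" |
    sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' |
    awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 }'
}

# Build the webhook body from a channel name and the status text.
build_payload() {
  printf '{"channel":"%s","text":"%s"}' "$(json_escape "$1")" "$(json_escape "$2")"
}

# Post a finished status file. Expects MATTERMOST_ATVM_WEBHOOK and
# MATTERMOST_ATVM_CHANNEL to be set already, for example by sourcing
# /home/aw/code/cds/.env.credentials.local.
post_status() {
  local status_file="$1"
  curl --fail --silent --show-error \
    -H 'Content-Type: application/json' \
    --data "$(build_payload "$MATTERMOST_ATVM_CHANNEL" "$(cat "$status_file")")" \
    "$MATTERMOST_ATVM_WEBHOOK"
}
```

Calling `post_status` with the file that holds the final status text would send the same content that was printed locally, which matches the rule that both outputs share one layout.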
 ## Status Reporting Format
 When the operator asks for the status of an ATVM automation run, report in this order:
 1. Heading/title using the run `build_name`.
@@ -193,8 +206,11 @@ When the operator asks for the status of an ATVM automation run, report in this

 Status-report expectations:
 - Use the same display layout for every ATVM automation status response regardless of test type (`e2e`, `systemOS`, `reboot`, `migrateops`, and others).
+- Use `/home/aw/code/cds/atvm/docs/automation/status-template.md` as the default template for both local status output and Mattermost status posts.
+- The default ATVM status template uses Markdown tables for `SUMMARY:`, `HOSTS:`, and `TIMING:` and uses `NOTES:` for flat operator-facing notes.
 - Treat references to the "ATVM automation run" or "automation run" as referring to this ATVM folder workflow and the automation VM at `192.168.3.190`, not to Cirrus project operations such as the `atvm - cypress` project.
 - Treat a status request as a request for live status by default.
+- Unless the operator explicitly asks to send the status to Mattermost, print the status only in the local terminal response.
 - Use the live automation VM state when available.
 - If no automation is currently running, fall back to the most recent historical run artifacts and logs.
 - Prefer local automation evidence in this order: active runner processes, live automation-VM files, shell history for the last launch command, then historical reporter artifacts.
@@ -228,3 +244,4 @@ Status-report expectations:
 - Use `Notes` for extra context beyond the machine-specific same-line failure description.
 - Base the completion estimate on the full remaining machine count and recent per-machine runtime visible in the run log.
 - Make the estimate explicitly refer to completion of the entire remaining run, not only the current machine/spec.
+- When the operator also asks to send the status to Mattermost, send this same final status output to the configured Mattermost destination only after the run has fully completed.
--- a/atvm/docs/automation/mattermost-watcher-design.md
+++ b/atvm/docs/automation/mattermost-watcher-design.md
@@ -0,0 +1,179 @@
+# ATVM Mattermost Watcher Design
+
+## Purpose
+Design a controller-local watcher on the ATVM Cypress machine (`192.168.3.190`) that monitors an ATVM automation run and posts the final run status to Mattermost only after the run has fully completed.
+
+This watcher must continue working even if the local operator machine is offline.
+
+## Implementation Approach
+Use a `systemd`-managed watcher on the ATVM Cypress controller.
+
+Recommended structure:
+- one watcher script that evaluates the state of a specific ATVM run
+- one `systemd` service to execute the watcher
+- optionally one `systemd` timer for periodic polling if the watcher is not implemented as a long-running process
+
+Preferred deployment target:
+- controller host: `192.168.3.190`
+- ATVM automation root: `/root/cdc-e2e-cyp-12.17.4`
+
+## Mattermost Destination
+Use the local credential file in this workspace as the source of defaults:
+- `/home/aw/code/cds/.env.credentials.local`
+
+Expected variables:
+- `MATTERMOST_ATVM_WEBHOOK`
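+- `MATTERMOST_ATVM_CHANNEL`
+
+Because the watcher runs on the controller and must work while the operator machine is offline, these values need to be readable on the controller itself; how they are provisioned there is not defined in this design.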
+
+## Run Completion Rule
+The watcher must send Mattermost results only after the ATVM run has fully completed.
+
+A run is considered fully completed only when:
+- there are no active runner processes for the run
+- every machine in the expected scope has final result artifacts
+- no machine remains in `RUNNING` or `NOT STARTED`
+- final reporter artifacts confirm the run has ended
+
+Evidence sources:
+- live runner processes on `192.168.3.190`
+- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/logs/`
+- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/xml/`
+- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/mochawesome/`
+
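A minimal sketch of the first three completion conditions, assuming the caller has already counted the live runner processes and can produce one `<host> <STATUS>` line per machine. That input format and the function name `run_is_complete` are hypothetical; the check against final reporter artifacts is left out:

```bash
# run_is_complete RUNNER_COUNT EXPECTED_HOSTS < host_status_lines
# stdin: one "<host> <STATUS>" line per machine (hypothetical format).
run_is_complete() {
  local runner_count="$1" expected_hosts="$2"
  local finished=0 host status
  # Any live runner process means the run is still going.
  [ "$runner_count" -eq 0 ] || return 1
  while read -r host status || [ -n "$host" ]; do
    [ -n "$host" ] || continue
    case "$status" in
      RUNNING | "NOT STARTED") return 1 ;;
      *) finished=$((finished + 1)) ;;
    esac
  done
  # Every machine in scope must have reported a final result.
  [ "$finished" -eq "$expected_hosts" ]
}
```

The function only returns success when all conditions hold at once, so a partially reported run is treated the same as a run that is still active.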
+## Required Run States
+The watcher must distinguish these run-level states:
+- `COMPLETED`
+- `FAILED`
+- `CANCELLED`
+- `TERMINATED`
+- `HUNG`
+- `UNKNOWN`
+- `RUNNING`
+
+Definitions:
+- `COMPLETED`
+  - the run finished normally
+  - all machines have final results
+  - no run-level failure state blocks completion
+- `FAILED`
+  - the run finished, but one or more hosts failed
+  - this is still a completed run
+- `CANCELLED`
+  - the run was intentionally cancelled through an explicit cancellation path
+- `TERMINATED`
+  - the run was manually killed or stopped before normal completion
+- `HUNG`
+  - the run appears stuck and does not meet completion rules within the expected policy window
+- `UNKNOWN`
+  - the watcher cannot safely determine the true state
+- `RUNNING`
+  - the run is still active and not yet complete
+
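The definitions above can be sketched as a single classification function. The six `yes`/`no` flags, their order, and the precedence between states are assumptions made for illustration; how each flag is detected on the controller is left open:

```bash
# classify_run CANCELLED KILLED RUNNERS_ACTIVE ALL_FINAL ANY_FAILED PAST_DEADLINE
# Each argument is "yes" or "no". Prints exactly one run-level state.
classify_run() {
  local cancelled="$1" killed="$2" active="$3"
  local all_final="$4" any_failed="$5" past_deadline="$6"
  if [ "$cancelled" = yes ]; then
    echo CANCELLED
  elif [ "$killed" = yes ]; then
    echo TERMINATED
  elif [ "$active" = no ] && [ "$all_final" = yes ]; then
    # The run finished; host failures decide between the two final states.
    if [ "$any_failed" = yes ]; then echo FAILED; else echo COMPLETED; fi
  elif [ "$past_deadline" = yes ]; then
    echo HUNG
  elif [ "$active" = yes ]; then
    echo RUNNING
  else
    # Runners are gone but results are incomplete: do not guess.
    echo UNKNOWN
  fi
}
```

Checking cancellation and termination first means an explicit stop always wins over whatever artifacts happen to exist at that moment.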
+## Mattermost Posting Rule
+Post to Mattermost only when the run has fully completed.
+
+Send Mattermost status for:
+- `COMPLETED`
+- `FAILED`
+
+Do not send Mattermost status for:
+- `CANCELLED`
+- `TERMINATED`
+- `HUNG`
+- `UNKNOWN`
+- `RUNNING`
+
+Important clarification:
+- a completed run with failed hosts should still be posted
+- a cancelled, terminated, hung, or unknown run should not be posted
+
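As a sketch, the posting rule reduces to an allow-list over the run-level states. The function name `should_post` is illustrative:

```bash
# Succeeds only for states that are allowed to produce a Mattermost post.
should_post() {
  case "$1" in
    COMPLETED | FAILED) return 0 ;;
    # CANCELLED, TERMINATED, HUNG, UNKNOWN, RUNNING, and anything unrecognized.
    *) return 1 ;;
  esac
}
```

Using an allow-list rather than a deny-list means a state added later stays silent until it is explicitly approved for posting.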
+## Required Cancellation / Termination Handling
+If a run is cancelled or terminated, the watcher must:
+- detect that the run was cancelled or manually killed
+- stop waiting for normal completion
+- mark the run as closed without posting final Mattermost status
+- prevent any later success/failure post for that same run
+
+## State Tracking Requirements
+The watcher must track each monitored run by run id or build name.
+
+For each run, keep durable state such as:
+- tracked run id / build name
+- controller-side watcher state
+- completion marker
+- cancellation / termination marker
+- Mattermost posted marker
+- last observed machine summary
+- timestamps for first seen, last seen, closed
+
+## Duplicate-Post Prevention
+The watcher must prevent duplicate Mattermost posts.
+
+Required behavior:
+- only one final post per run
+- if a run is already marked as posted, do not send again
+- if a run is marked `CANCELLED`, `TERMINATED`, `HUNG`, or `UNKNOWN`, do not later convert it into a posted completion unless explicitly reset by an operator workflow
+
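One way to enforce a single final post is to make creation of the posted marker the claim itself. This sketch assumes a marker file per run in the state directory; the file names `<run_id>.posted` and `<run_id>.closed-no-post` and the function name `claim_final_post` are hypothetical:

```bash
# claim_final_post STATE_DIR RUN_ID
# Succeeds for the first caller only; later callers and closed runs get a failure.
claim_final_post() {
  local state_dir="$1" run_id="$2"
  mkdir -p "$state_dir" || return 1
  # A run closed as cancelled, terminated, hung, or unknown is never posted later.
  [ ! -e "$state_dir/$run_id.closed-no-post" ] || return 1
  # With noclobber set, ">" refuses to overwrite an existing file,
  # so only one caller can create the marker.
  ( set -o noclobber
    date -u +%Y-%m-%dT%H:%M:%SZ > "$state_dir/$run_id.posted" ) 2>/dev/null
}
```

A watcher would call `claim_final_post` immediately before sending and skip the send when it fails. Claiming before sending favors "at most once" delivery; whether a failed send should release the marker is a policy decision this sketch does not make.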
+## Recommended State Files
+Use a durable controller-local state directory, for example:
+- `/var/lib/atvm-run-watcher/`
+
+Possible contents:
+- one state file per run id
+- one posted marker per run id
+- one cancellation marker per run id
+- optional lock file to prevent multiple watcher instances from racing
+
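A minimal sketch of a per-run state file, assuming a plain `key=value` layout and a `<run_id>.state` file name, both of which are illustrative choices rather than a defined format:

```bash
# Read one field from a key=value state file.
load_run_field() {
  sed -n "s/^$2=//p" "$1"
}

# Write the state file for a run, keeping first_seen from any earlier write.
save_run_state() {
  local state_dir="$1" run_id="$2" state="$3"
  local file="$state_dir/$run_id.state"
  local now first_seen="" tmp
  now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  mkdir -p "$state_dir" || return 1
  if [ -f "$file" ]; then
    first_seen=$(load_run_field "$file" first_seen)
  fi
  [ -n "$first_seen" ] || first_seen="$now"
  # Write to a temp file in the same directory, then rename, so a reader
  # never sees a half-written state file.
  tmp=$(mktemp "$state_dir/.$run_id.XXXXXX") || return 1
  printf 'run_id=%s\nstate=%s\nfirst_seen=%s\nlast_seen=%s\n' \
    "$run_id" "$state" "$first_seen" "$now" > "$tmp"
  mv -f "$tmp" "$file"
}
```

Pointing `save_run_state` at `/var/lib/atvm-run-watcher/` on each poll would keep the last observed state durable across watcher restarts.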
+## Recommended Operator Workflow
+Normal completion workflow:
+1. ATVM run starts.
+2. Watcher tracks the run id / build name.
+3. Watcher polls run state and artifacts.
+4. Run fully completes.
+5. Watcher builds final status summary.
+6. Watcher posts final status to Mattermost once.
+7. Watcher marks the run as posted and closed.
+
+Cancellation / termination workflow:
+1. Operator stops the ATVM run.
+2. Watcher detects cancellation / termination, or an explicit cancellation marker is written.
+3. Watcher marks the run `CANCELLED` or `TERMINATED`.
+4. Watcher exits cleanly without posting to Mattermost.
+5. Watcher prevents later duplicate or misleading final-post behavior.
+
+## Failure Semantics
+Host-level failures do not suppress Mattermost posting.
+
+If:
+- the run has fully completed
+- and one or more hosts failed
+
+Then:
+- final Mattermost status should still be sent
+- final run-level state should be `FAILED` (completed with failures)
+
+## Hang / Unknown Semantics
+If the run cannot be safely classified as completed, failed, cancelled, or terminated:
+- classify it as `HUNG` or `UNKNOWN`
+- do not post to Mattermost
+- require operator review
+
+## Logging Requirements
+The watcher should log:
+- the run id / build name being monitored
+- each state transition
+- posting decisions
+- reasons for suppressing a Mattermost post
+- duplicate-post prevention decisions
+- final closed state
+
+## Summary
+This watcher design must satisfy all of the following:
+- run on the ATVM Cypress controller
+- survive local operator machine downtime
+- use `systemd`
+- distinguish run states clearly
+- send Mattermost only after full completion
+- send completion results whether hosts passed or failed
+- never send Mattermost for cancelled, terminated, hung, or unknown runs
+- prevent duplicate or misleading posts
--- a/atvm/docs/automation/status-template.md
+++ b/atvm/docs/automation/status-template.md
@@ -0,0 +1,62 @@
+# ATVM Status Template
+
+Use this as the default ATVM automation run-status template for:
+- local status responses in the terminal
+- Mattermost status posts after a completed run
+
+## Layout
+
+```md
+## ATVM Run Status
+### <build_name>
+
+**SUMMARY:**
+
+| Metric | Value |
+|---|---:|
+| finished | <n> |
+| passed | <n> |
+| failed | <n> |
+| skipped | <n> |
+
+**HOSTS:**
+
+| Host | Status | Detail |
+|---|---|---|
+| <host-name> | ✅ PASS | completed |
+| <host-name> | ⚠️ FAIL | <useful failure description> |
+| <host-name> | ⏳ RUN | in progress |
+| <host-name> | ⏭️ SKIP | <skip reason> |
+
+**TIMING:**
+
+| Metric | Value |
+|---|---|
+| start | <start time> |
+| end | <end time or n/a> |
+| total | <total or elapsed runtime> |
+| quickest | <host> - <runtime> or n/a |
+| longest | <host> - <runtime> or n/a |
+| average | <runtime> or n/a |
+
+**NOTES:**
+- <note>
+- <note>
+```
+
+## Rules
+- Keep `SUMMARY:`, `HOSTS:`, `TIMING:`, and `NOTES:` in that order.
+- Use the title format:
+  - `## ATVM Run Status`
+  - `### <build_name>`
+- Use Markdown tables for `SUMMARY:`, `HOSTS:`, and `TIMING:`.
+- Use one host per row in the `HOSTS:` section.
+- For completed hosts, prefer:
+  - `✅ PASS`
+  - `⚠️ FAIL`
+- For in-progress or skipped hosts, use:
+  - `⏳ RUN`
+  - `⏭️ SKIP`
+- Keep `Detail` concise.
+- Put broader context under `NOTES:`, not in the host table.
+- Use the same template for Mattermost and local operator-visible status output.
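A minimal sketch of rendering the `HOSTS:` table from the template, assuming each host arrives on stdin as a `host|STATUS|detail` line. That input format and the function name `render_host_rows` are hypothetical:

```bash
# Render the HOSTS table. stdin: "host|STATUS|detail" lines,
# where STATUS is PASS, FAIL, RUN, or SKIP.
render_host_rows() {
  local host status detail label
  printf '| Host | Status | Detail |\n|---|---|---|\n'
  while IFS='|' read -r host status detail || [ -n "$host" ]; do
    [ -n "$host" ] || continue
    case "$status" in
      PASS) label='✅ PASS' ;;
      FAIL) label='⚠️ FAIL' ;;
      RUN)  label='⏳ RUN' ;;
      SKIP) label='⏭️ SKIP' ;;
      *)    label="$status" ;;
    esac
    # One host per row, as the template rules require.
    printf '| %s | %s | %s |\n' "$host" "$label" "$detail"
  done
}
```

Because the same function output can be printed locally or sent to Mattermost, the two destinations cannot drift apart in layout.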
--- a/atvm/docs/workflow/secrets-migration-and-cleanup.md
+++ b/atvm/docs/workflow/secrets-migration-and-cleanup.md
@@ -0,0 +1,188 @@
+# Secrets Migration And Cleanup
+
+## Purpose
+This document explains:
+- whether the workspace can be cleaned up to stop storing credentials and tokens in tracked files
+- how `.env.credentials.local` should be used
+- what has to happen to remove already-committed secrets from git history and the remote repository
+
+## 1. Can the workspace be cleaned up to stop referencing raw secrets in tracked files?
+Yes.
+
+The intended cleanup is:
+- remove hardcoded credentials, API tokens, webhook URLs, and similar secrets from tracked docs and files
+- replace those values with references to `/home/aw/code/cds/.env.credentials.local`
+- keep only non-secret metadata in tracked files, such as:
+  - hostnames
+  - IPs
+  - usernames when acceptable
+  - variable names
+  - usage instructions
+
+Examples of what tracked docs should say instead of storing raw values:
+- `Use ATVM_CONTROLLER_PASSWORD from /home/aw/code/cds/.env.credentials.local`
+- `Use VCENTER_USER and VCENTER_PASSWORD from /home/aw/code/cds/.env.credentials.local`
+- `Use MATTERMOST_ATVM_WEBHOOK from /home/aw/code/cds/.env.credentials.local`
+
+Recommended scope of cleanup:
+- `atvm/inventory/accounts-and-credentials.md`
+- `atvm/inventory/infrastructure.md`
+- any other tracked docs or scripts that contain:
+  - passwords
+  - API tokens
+  - TOTP secrets
+  - webhook URLs
+  - install codes or secrets that should not remain in git
+
+## 2. What do I need to do for the assistant to use `.env.credentials.local`?
+The file exists on disk, but the assistant does not automatically import shell environment files unless one of the following is done:
+
+### Option A: Explicitly source it in the shell session
+Example:
+
+```bash
+source /home/aw/code/cds/.env.credentials.local
+```
+
+This is the simplest and most reliable option for interactive terminal work.
+
+### Option B: Scripts explicitly read it
+A script can do:
+
+```bash
+source /home/aw/code/cds/.env.credentials.local
+```
+
+before using any secret-backed variables.
+
+### Option C: The workflow documentation tells the assistant to load it
+The workspace docs can instruct the assistant to use `/home/aw/code/cds/.env.credentials.local` when credentials are required, but the assistant still needs an execution path that actually loads those variables into the shell or reads them directly from the file.
+
+## Practical rule
+If you want the assistant to reliably use these values during execution, the safest approach is:
+- either explicitly source the file first
+- or instruct the assistant to source it as part of the command/script it runs
+
+## Important limitation
+The existence of `.env.credentials.local` does not automatically make every shell command aware of those variables.
+
+The assistant needs one of these:
+- the current shell environment already contains the exported variables
+- the command explicitly sources the file
+- the script being executed explicitly sources the file
+
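A minimal sketch of a loader that a script could call before using any secret-backed variable. It assumes the credentials file contains plain `KEY=value` shell assignments; the function names `load_credentials` and `require_vars` are illustrative:

```bash
# Source the credentials file and export everything it assigns,
# so child processes started afterwards can see the values too.
load_credentials() {
  local file="${1:-/home/aw/code/cds/.env.credentials.local}"
  if [ ! -r "$file" ]; then
    echo "credentials file not readable: $file" >&2
    return 1
  fi
  set -a   # auto-export every variable assigned while sourcing
  . "$file"
  set +a
}

# Fail early if any named variable is unset or empty.
# Variable names are passed by the script author, not taken from user input.
require_vars() {
  local name value
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "required variable is unset or empty: $name" >&2
      return 1
    fi
  done
}
```

The `set -a` wrapper matters when the file does not use `export` itself: a plain `source` would set the variables in the current shell only, and commands launched from it would not inherit them.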
+## 3. What do I need to do if secrets were already committed and pushed to the remote repository?
+If secrets were already committed to git history and pushed, `.gitignore` does not fix that.
+
+You need to treat those secrets as exposed.
+
+## Required response
+Do these in this order:
+
+### Step 1: Rotate or revoke the exposed secrets
+This is the most important step.
+
+Examples:
+- regenerate Mattermost webhook URLs
+- replace API tokens
+- rotate passwords
+- regenerate TOTP/shared secrets if applicable
+- replace any service registration or install tokens that should be considered exposed
+
+Even if you later remove them from git history, assume they were already copied.
+
+### Step 2: Remove secrets from the current tracked files
+Edit the tracked docs and scripts so they no longer contain raw secrets.
+
+Replace them with:
+- references to `.env.credentials.local`
+- redacted placeholders
+- variable names
+
+### Step 3: Rewrite git history to remove the secrets from all commits
+This is a history-rewrite operation.
+
+Typical tools:
+- `git filter-repo` (preferred)
+- BFG Repo-Cleaner
+
+High-level workflow:
+1. identify all tracked files and literal secrets that must be removed
+2. rewrite repository history to remove or replace them
+3. verify the secrets no longer exist in any commit
+4. force-push the rewritten history to the remote
+
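The rewrite and verification steps above can be sketched as two small helpers. This assumes `git filter-repo` is installed and is run in a fresh clone; the function names and the idea of a replacements file path passed as an argument are illustrative. The replacements file lists one literal secret per line, which `--replace-text` substitutes with `***REMOVED***` by default:

```bash
# Rewrite all history, replacing every literal listed in the given file.
# Destructive: review the exact command before running it.
rewrite_history() {
  git filter-repo --replace-text "$1"
}

# Succeeds if the literal string was added or removed by any commit on any ref.
# After a clean rewrite this should fail for every purged secret.
secret_in_history() {
  [ -n "$(git log --all --oneline -S"$1")" ]
}
```

Running `secret_in_history` for each purged value before the force push gives a concrete pass/fail check for step 3. `git filter-repo` removes the `origin` remote as a safety measure, so the remote has to be added back before pushing.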
+### Step 4: Force-push the cleaned history
+After rewriting history, the remote must be updated with a force push.
+
+That usually means:
+- `git push --force-with-lease origin <branch>`
+
+### Step 5: Coordinate with anyone else using the repo
+Anyone with an old clone will still have the old history unless they reset or reclone.
+
+They need instructions to:
+- stop using the old history
+- fetch the rewritten branch
+- hard reset or reclone as appropriate
+
+## Important caution about remote cleanup
+Cleaning the git remote history does not guarantee that every copy is gone.
+
+Secrets may still exist in:
+- old clones
+- forks
+- CI logs
+- code review systems
+- backups
+- screenshots or pasted chat logs
+
+That is why secret rotation must happen first.
+
+## Recommended cleanup policy for this workspace
+For this workspace, the correct policy should be:
+- keep real secrets only in `/home/aw/code/cds/.env.credentials.local`
+- keep that file gitignored
+- remove raw secrets from tracked docs
+- document variable names and usage instead of values
+- rotate any secrets that were ever committed
+- rewrite history if the repository should no longer retain those secret values
+
+## Proposed next implementation work
+When approved, the cleanup work would likely be:
+1. inventory all tracked files containing secrets
+2. patch those files to reference `.env.credentials.local`
+3. update docs so the credential source is explicit
+4. prepare a history-rewrite plan
+5. prepare exact git commands for review before any destructive git action
+
+## Git-history cleanup note
+History rewriting is disruptive and should not be done casually.
+
+Before doing it, prepare:
+- the list of files and secrets to purge
+- the exact rewrite tool and command
+- the exact verification commands
+- the exact force-push command
+- the operator communication plan for other users of the repo
+
+## Summary
+Answers to the three direct questions:
+
+### Question 1
+Yes, the workspace can be cleaned up to stop storing secrets in tracked files and instead reference `/home/aw/code/cds/.env.credentials.local`.
+
+### Question 2
+To have the assistant reliably use `.env.credentials.local`, either:
+- explicitly source it
+- or ensure the script/command being run sources it
+
+The assistant does not automatically inherit its contents just because the file exists.
+
+### Question 3
+If secrets were already committed and pushed:
+- rotate them first
+- remove them from current files
+- rewrite git history
+- force-push the cleaned history
+- coordinate with anyone else who has a clone