Update ATVM status reporting and credential handling docs

- change ATVM status formatting to the approved template: Markdown tables for SUMMARY:, HOSTS:, and TIMING:, plus a flat NOTES: list
- document that normal status requests print locally only unless explicitly asked to send to Mattermost
- document Mattermost defaults and posting rules, including only sending after full run completion
- document the controller-side systemd watcher design for future automation
- add the secrets migration/cleanup review doc
- ignore .env.credentials.local in git and reflect the move toward using that local credentials file instead of hardcoded secrets
2026-03-24 14:24:10 -04:00
parent b1960b7dd4
commit fa97ce5ad0
6 changed files with 450 additions and 0 deletions

.gitignore

@@ -1 +1,2 @@
log/
.env.credentials.local


@@ -47,6 +47,7 @@ This file defines how to operate and maintain the ATVM workspace in `/home/aw/co
- Controller IP: `192.168.3.190`
- Controller credentials: `root / atvmcdsi2012`
- Detailed test artifact root on controller: `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`
- Default Mattermost status destination config: `/home/aw/code/cds/.env.credentials.local`
- Default plugin: `--use_specified_plugin iscsi`
- Always include `--ignore_force_shutdown` unless explicitly told not to.
- Default config family: `gold`
@@ -58,6 +59,8 @@ This file defines how to operate and maintain the ATVM workspace in `/home/aw/co
- Always show exact planned ATVM commands before execution.
- Never execute setup or automation commands that require approval until the operator explicitly approves them.
- For host-level test detail and failed-test investigation, use `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`, especially `logs/`, `xml/`, and `mochawesome/`.
- If the operator asks for ATVM run status without mentioning Mattermost, respond locally only and do not post externally.
- If the operator asks to send ATVM run status to Mattermost, use `MATTERMOST_ATVM_WEBHOOK` and `MATTERMOST_ATVM_CHANNEL` from `/home/aw/code/cds/.env.credentials.local` by default and send the final status only after the run has fully completed, whether the run passed or failed.
- Treat `docs/automation/examples.md` as reference-only, not default operator intent.
- Put reusable workflow rules in `guide.md` files.
- Put dated lessons only in `run-learnings.md` files.


@@ -174,6 +174,19 @@ When asked for one VM or a VM set:
- If monitoring was not requested, run commands and report execution success/failure and any errors.
- If monitoring was requested, do not terminate processes automatically; only terminate if the operator explicitly instructs termination.
## Mattermost Status Posting
- Treat a normal ATVM status request as local-only output by default.
- When the operator asks to send ATVM automation run status to Mattermost, use the local defaults from `/home/aw/code/cds/.env.credentials.local`.
- Default Mattermost variables:
  - `MATTERMOST_ATVM_WEBHOOK`
  - `MATTERMOST_ATVM_CHANNEL`
- Treat these as the default destination for ATVM automation run-status posts unless the operator explicitly overrides them.
- Send the final ATVM run status only after the run has fully completed, regardless of whether the run passed or failed.
- Do not send interim or in-progress ATVM run status updates to Mattermost unless the operator explicitly asks for that.
- Use the same ATVM status layout that would be shown to the operator locally when posting to Mattermost.
- Default status template: `/home/aw/code/cds/atvm/docs/automation/status-template.md`
- Do not post to Mattermost unless the operator explicitly asks for the run status to be sent there.
## Status Reporting Format
When the operator asks for the status of an ATVM automation run, report in this order:
1. Heading/title using the run `build_name`.
@@ -193,8 +206,11 @@ When the operator asks for the status of an ATVM automation run, report in this
Status-report expectations:
- Use the same display layout for every ATVM automation status response regardless of test type (`e2e`, `systemOS`, `reboot`, `migrateops`, and others).
- Use `/home/aw/code/cds/atvm/docs/automation/status-template.md` as the default template for both local status output and Mattermost status posts.
- The default ATVM status template uses Markdown tables for `SUMMARY:`, `HOSTS:`, and `TIMING:` and uses `NOTES:` for flat operator-facing notes.
- Treat references to the "ATVM automation run" or "automation run" as referring to this ATVM folder workflow and the automation VM at `192.168.3.190`, not to Cirrus project operations such as the `atvm - cypress` project.
- Treat a status request as a request for live status by default.
- Unless the operator explicitly asks to send the status to Mattermost, print the status only in the local terminal response.
- Use the live automation VM state when available.
- If no automation is currently running, fall back to the most recent historical run artifacts and logs.
- Prefer local automation evidence in this order: active runner processes, live automation-VM files, shell history for the last launch command, then historical reporter artifacts.
@@ -228,3 +244,4 @@ Status-report expectations:
- Use `Notes` for extra context beyond the machine-specific same-line failure description.
- Base the completion estimate on the full remaining machine count and recent per-machine runtime visible in the run log.
- Make the estimate explicitly refer to completion of the entire remaining run, not only the current machine/spec.
- When the operator also asks to send the status to Mattermost, send this same final status output to the configured Mattermost destination only after the run has fully completed.


@@ -0,0 +1,179 @@
# ATVM Mattermost Watcher Design
## Purpose
Design a controller-local watcher on the ATVM Cypress machine (`192.168.3.190`) that monitors an ATVM automation run and posts the final run status to Mattermost only after the run has fully completed.
This watcher must continue working even if the local operator machine is offline.
## Implementation Approach
Use a `systemd`-managed watcher on the ATVM Cypress controller.
Recommended structure:
- one watcher script that evaluates the state of a specific ATVM run
- one `systemd` service to execute the watcher
- optionally one `systemd` timer for periodic polling if the watcher is not implemented as a long-running process
Preferred deployment target:
- controller host: `192.168.3.190`
- ATVM automation root: `/root/cdc-e2e-cyp-12.17.4`
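A minimal sketch of the service-plus-timer variant described above, written as shell functions that print the unit text. The unit names (`atvm-run-watcher.service`, `atvm-run-watcher.timer`), the script path `/usr/local/bin/atvm-run-watcher`, and the 2-minute polling interval are assumptions for illustration, not part of this design.

```bash
#!/usr/bin/env bash
# Hypothetical unit names, script path, and interval; adjust before use.

# Print a oneshot service that runs a single watcher poll.
print_service_unit() {
  cat <<'EOF'
[Unit]
Description=ATVM run watcher (single poll)

[Service]
Type=oneshot
ExecStart=/usr/local/bin/atvm-run-watcher
EOF
}

# Print a timer that re-triggers the service periodically.
print_timer_unit() {
  cat <<'EOF'
[Unit]
Description=Poll ATVM run state periodically

[Timer]
OnBootSec=2min
OnUnitActiveSec=2min
Unit=atvm-run-watcher.service

[Install]
WantedBy=timers.target
EOF
}

# Example installation on the controller (requires approval before running):
#   print_service_unit > /etc/systemd/system/atvm-run-watcher.service
#   print_timer_unit   > /etc/systemd/system/atvm-run-watcher.timer
#   systemctl daemon-reload
#   systemctl enable --now atvm-run-watcher.timer
```

A timer-driven oneshot keeps each poll short-lived, which fits the "not implemented as a long-running process" option; a long-running watcher would instead use a single service with a restart policy.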
## Mattermost Destination
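Use the local credential file in this workspace as the source of defaults:
- `/home/aw/code/cds/.env.credentials.local`
Because the watcher runs on the controller and must keep working while the operator machine is offline, these values must be available on the controller itself before the run starts. How they are provisioned there is not specified in this design.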
Expected variables:
- `MATTERMOST_ATVM_WEBHOOK`
- `MATTERMOST_ATVM_CHANNEL`
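As an illustration of how these two variables might be used, the sketch below builds a JSON payload with `channel` and `text` fields and posts it with `curl`. The helper names (`json_escape`, `build_payload`, `post_status`) are hypothetical, and it assumes `MATTERMOST_ATVM_WEBHOOK` holds an incoming-webhook URL and `MATTERMOST_ATVM_CHANNEL` holds a channel name.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of the watcher design.

# Escape backslashes, double quotes, and newlines for use inside a JSON string.
# Tabs and other control characters are NOT handled here.
json_escape() {
  printf '%s' "$1" \
    | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
    | awk 'BEGIN { ORS = "" } NR > 1 { print "\\n" } { print }'
}

# build_payload CHANNEL TEXT: print a webhook payload as one JSON object.
build_payload() {
  printf '{"channel":"%s","text":"%s"}' "$(json_escape "$1")" "$(json_escape "$2")"
}

# post_status TEXT: send the final status using the two default variables.
post_status() {
  if [ -z "${MATTERMOST_ATVM_WEBHOOK:-}" ] || [ -z "${MATTERMOST_ATVM_CHANNEL:-}" ]; then
    echo "Mattermost variables are not set" >&2
    return 1
  fi
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "$(build_payload "$MATTERMOST_ATVM_CHANNEL" "$1")" \
    "$MATTERMOST_ATVM_WEBHOOK"
}

# Example (only after the run has fully completed):
#   post_status "$(cat /path/to/rendered-status.md)"
```

If `jq` is available on the controller, using it to build the payload would be more robust than the hand-rolled escaping shown here.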
## Run Completion Rule
The watcher must send Mattermost results only after the ATVM run has fully completed.
A run is considered fully completed only when:
- there are no active runner processes for the run
- the expected machine scope has final result artifacts
- no machine remains in `RUNNING` or `NOT STARTED`
- final reporter artifacts confirm the run has ended
Evidence sources:
- live runner processes on `192.168.3.190`
- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/logs/`
- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/xml/`
- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/mochawesome/`
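The first two completion conditions could be checked roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the runner process pattern is supplied by the caller, and result artifacts are assumed to be named `<host>.<extension>` inside one evidence directory, which this design does not confirm. It does not cover the `RUNNING` / `NOT STARTED` check or the final reporter confirmation.

```bash
#!/usr/bin/env bash
# Sketch only: function names and the "<host>.<ext>" artifact naming are assumptions.

# runner_active PATTERN: succeed while a process command line matches PATTERN.
runner_active() {
  pgrep -f -- "$1" >/dev/null
}

# hosts_missing_results DIR HOST...: print each host with no artifact in DIR.
hosts_missing_results() {
  local dir="$1" host f found
  shift
  for host in "$@"; do
    found=0
    for f in "$dir/$host".*; do
      # An unmatched glob stays literal, so confirm the file really exists.
      if [ -e "$f" ]; then found=1; break; fi
    done
    if [ "$found" -eq 0 ]; then echo "$host"; fi
  done
}

# run_fully_completed PATTERN DIR HOST...: succeed only when no runner is
# active and every expected host has at least one result artifact.
run_fully_completed() {
  local pattern="$1" dir="$2"
  shift 2
  if runner_active "$pattern"; then return 1; fi
  [ -z "$(hosts_missing_results "$dir" "$@")" ]
}

# Example:
#   run_fully_completed '<runner pattern>' \
#     /root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/xml host-a host-b
```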
## Required Run States
The watcher must distinguish these run-level states:
- `COMPLETED`
- `FAILED`
- `CANCELLED`
- `TERMINATED`
- `HUNG`
- `UNKNOWN`
- `RUNNING`
Definitions:
- `COMPLETED`
  - the run finished normally
  - all machines have final results
  - no run-level failure state blocks completion
- `FAILED`
  - the run finished, but one or more hosts failed
  - this is still a completed run
- `CANCELLED`
  - the run was intentionally cancelled through an explicit cancellation path
- `TERMINATED`
  - the run was manually killed or stopped before normal completion
- `HUNG`
  - the run appears stuck and does not meet completion rules within the expected policy window
- `UNKNOWN`
  - the watcher cannot safely determine the true state
- `RUNNING`
  - the run is still active and not yet complete
## Mattermost Posting Rule
Post to Mattermost only when the run has fully completed.
Send Mattermost status for:
- `COMPLETED`
- `FAILED`
Do not send Mattermost status for:
- `CANCELLED`
- `TERMINATED`
- `HUNG`
- `UNKNOWN`
- `RUNNING`
Important clarification:
- a completed run with failed hosts should still be posted
- a cancelled, terminated, hung, or unknown run should not be posted
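The posting rule above reduces to a small state-to-decision mapping. A minimal sketch, assuming a hypothetical helper named `should_post` that receives the run-level state as its only argument:

```bash
#!/usr/bin/env bash
# Sketch only: should_post is a hypothetical helper name.

# should_post STATE: print "post" for states that fully completed,
# "suppress" for everything else.
should_post() {
  case "$1" in
    COMPLETED|FAILED)
      echo "post" ;;
    CANCELLED|TERMINATED|HUNG|UNKNOWN|RUNNING)
      echo "suppress" ;;
    *)
      # Unrecognized input is handled like UNKNOWN: never post.
      echo "suppress" ;;
  esac
}

# Example:
#   if [ "$(should_post "$state")" = "post" ]; then ...; fi
```

Defaulting unrecognized states to `suppress` keeps the watcher on the safe side of the rule: a missed post can be sent manually, while a misleading post cannot be unsent.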
## Required Cancellation / Termination Handling
If a run is cancelled or terminated, the watcher must:
- detect that the run was cancelled or manually killed
- stop waiting for normal completion
- mark the run as closed without posting final Mattermost status
- prevent any later success/failure post for that same run
## State Tracking Requirements
The watcher must track each monitored run by run id or build name.
For each run, keep durable state such as:
- tracked run id / build name
- controller-side watcher state
- completion marker
- cancellation / termination marker
- Mattermost posted marker
- last observed machine summary
- timestamps for first seen, last seen, closed
## Duplicate-Post Prevention
The watcher must prevent duplicate Mattermost posts.
Required behavior:
- only one final post per run
- if a run is already marked as posted, do not send again
- if a run is marked `CANCELLED`, `TERMINATED`, `HUNG`, or `UNKNOWN`, do not later convert it into a posted completion unless explicitly reset by an operator workflow
## Recommended State Files
Use a durable controller-local state directory, for example:
- `/var/lib/atvm-run-watcher/`
Possible contents:
- one state file per run id
- one posted marker per run id
- one cancellation marker per run id
- optional lock file to prevent multiple watcher instances from racing
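One way the posted and cancellation markers could enforce a single final post per run is sketched below. The marker file names (`<run_id>.posted`, `<run_id>.cancelled`), the `claim_post` helper, and the `ATVM_WATCHER_STATE_DIR` override are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: marker names, claim_post, and ATVM_WATCHER_STATE_DIR are hypothetical.

STATE_DIR="${ATVM_WATCHER_STATE_DIR:-/var/lib/atvm-run-watcher}"

# claim_post RUN_ID: succeed exactly once per run id.
# Fails if the run has a cancellation marker or was already claimed.
claim_post() {
  local run_id="$1"
  mkdir -p "$STATE_DIR" || return 2
  if [ -e "$STATE_DIR/$run_id.cancelled" ]; then
    return 1
  fi
  # With noclobber, the redirect fails if the marker already exists,
  # so two racing watcher instances cannot both claim the same run.
  ( set -o noclobber
    date -u +%Y-%m-%dT%H:%M:%SZ > "$STATE_DIR/$run_id.posted" ) 2>/dev/null
}

# Example:
#   if claim_post "$run_id"; then
#     send_final_status "$run_id"   # hypothetical sender
#   fi
```

This sketch claims the marker before posting, which favors "never post twice" over "always post once"; if the send fails, an operator workflow would need to remove the marker to allow a retry.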
## Recommended Operator Workflow
Normal completion workflow:
1. ATVM run starts.
2. Watcher tracks the run id / build name.
3. Watcher polls run state and artifacts.
4. Run fully completes.
5. Watcher builds final status summary.
6. Watcher posts final status to Mattermost once.
7. Watcher marks the run as posted and closed.
Cancellation / termination workflow:
1. Operator stops the ATVM run.
2. Watcher detects cancellation / termination, or an explicit cancellation marker is written.
3. Watcher marks the run `CANCELLED` or `TERMINATED`.
4. Watcher exits cleanly without posting to Mattermost.
5. Watcher prevents later duplicate or misleading final-post behavior.
## Failure Semantics
Host-level failures do not suppress Mattermost posting.
If:
- the run has fully completed
- and one or more hosts failed
Then:
- final Mattermost status should still be sent
- final run-level state should be `FAILED`, which still counts as a completed run
## Hang / Unknown Semantics
If the run cannot be safely classified as completed, failed, cancelled, or terminated:
- classify it as `HUNG` or `UNKNOWN`
- do not post to Mattermost
- require operator review
## Logging Requirements
The watcher should log:
- the run id / build name being monitored
- each state transition
- posting decisions
- reasons for suppressing a Mattermost post
- duplicate-post prevention decisions
- final closed state
## Summary
This watcher design must satisfy all of the following:
- run on the ATVM Cypress controller
- survive local operator machine downtime
- use `systemd`
- distinguish run states clearly
- send Mattermost only after full completion
- send completion results whether hosts passed or failed
- never send Mattermost for cancelled, terminated, hung, or unknown runs
- prevent duplicate or misleading posts


@@ -0,0 +1,62 @@
# ATVM Status Template
Use this as the default ATVM automation run-status template for:
- local status responses in the terminal
- Mattermost status posts after a completed run
## Layout
```md
## ATVM Run Status
### <build_name>
**SUMMARY:**
| Metric | Value |
|---|---:|
| finished | <n> |
| passed | <n> |
| failed | <n> |
| skipped | <n> |
**HOSTS:**
| Host | Status | Detail |
|---|---|---|
| <host-name> | ✅ PASS | completed |
| <host-name> | ⚠️ FAIL | <useful failure description> |
| <host-name> | ⏳ RUN | in progress |
| <host-name> | ⏭️ SKIP | <skip reason> |
**TIMING:**
| Metric | Value |
|---|---|
| start | <start time> |
| end | <end time or n/a> |
| total | <total or elapsed runtime> |
| quickest | <host> - <runtime> or n/a |
| longest | <host> - <runtime> or n/a |
| average | <runtime> or n/a |
**NOTES:**
- <note>
- <note>
```
## Rules
- Keep `SUMMARY:`, `HOSTS:`, `TIMING:`, and `NOTES:` in that order.
- Use the title format:
  - `## ATVM Run Status`
  - `### <build_name>`
- Use Markdown tables for `SUMMARY:`, `HOSTS:`, and `TIMING:`.
- Use one host per row in the `HOSTS:` section.
- For completed hosts, prefer:
  - `✅ PASS`
  - `⚠️ FAIL`
- For in-progress or skipped hosts, use:
  - `⏳ RUN`
  - `⏭️ SKIP`
- Keep `Detail` concise.
- Put broader context under `NOTES:`, not in the host table.
- Use the same template for Mattermost and local operator-visible status output.


@@ -0,0 +1,188 @@
# Secrets Migration And Cleanup
## Purpose
This document explains:
- whether the workspace can be cleaned up to stop storing credentials and tokens in tracked files
- how `.env.credentials.local` should be used
- what has to happen to remove already-committed secrets from git history and the remote repository
## 1. Can the workspace be cleaned up to stop referencing raw secrets in tracked files?
Yes.
The intended cleanup is:
- remove hardcoded credentials, API tokens, webhook URLs, and similar secrets from tracked docs and files
- replace those values with references to `/home/aw/code/cds/.env.credentials.local`
- keep only non-secret metadata in tracked files, such as:
  - hostnames
  - IPs
  - usernames when acceptable
  - variable names
  - usage instructions
Examples of what tracked docs should say instead of storing raw values:
- `Use ATVM_CONTROLLER_PASSWORD from /home/aw/code/cds/.env.credentials.local`
- `Use VCENTER_USER and VCENTER_PASSWORD from /home/aw/code/cds/.env.credentials.local`
- `Use MATTERMOST_ATVM_WEBHOOK from /home/aw/code/cds/.env.credentials.local`
Recommended scope of cleanup:
- `atvm/inventory/accounts-and-credentials.md`
- `atvm/inventory/infrastructure.md`
- any other tracked docs or scripts that contain:
  - passwords
  - API tokens
  - TOTP secrets
  - webhook URLs
  - install codes or secrets that should not remain in git
## 2. What do I need to do for the assistant to use `.env.credentials.local`?
The file exists on disk, but the assistant does not load shell environment files automatically. One of the following must happen first:
### Option A: Explicitly source it in the shell session
Example:
```bash
source /home/aw/code/cds/.env.credentials.local
```
This is the simplest and most reliable option for interactive terminal work.
### Option B: Scripts explicitly read it
A script can do:
```bash
source /home/aw/code/cds/.env.credentials.local
```
before using any secret-backed variables.
### Option C: The workflow documentation tells the assistant to load it
The workspace docs can instruct the assistant to use `/home/aw/code/cds/.env.credentials.local` when credentials are required, but the assistant still needs an execution path that actually loads those variables into the shell or reads them directly from the file.
## Practical rule
If you want the assistant to reliably use these values during execution, the safest approach is:
- either explicitly source the file first
- or instruct the assistant to source it as part of the command/script it runs
## Important limitation
The existence of `.env.credentials.local` does not automatically make every shell command aware of those variables.
The assistant needs one of these:
- the current shell environment already contains the exported variables
- the command explicitly sources the file
- the script being executed explicitly sources the file
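One detail worth illustrating: a plain `source` sets variables in the current shell, but child processes see them only if they are exported. The sketch below assumes the file contains simple `KEY=value` shell assignments (its actual format is not shown in this document); `load_credentials` and `require_vars` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; assumes KEY=value shell syntax.

# load_credentials [FILE]: source FILE and export every variable it sets.
load_credentials() {
  local file="${1:-/home/aw/code/cds/.env.credentials.local}"
  if [ ! -r "$file" ]; then
    echo "cannot read credentials file: $file" >&2
    return 1
  fi
  set -a          # auto-export everything assigned while sourcing
  . "$file"
  set +a
}

# require_vars NAME...: fail if any named variable is unset, empty,
# or not exported to child processes.
require_vars() {
  local name missing=0
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing or not exported: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   load_credentials && require_vars MATTERMOST_ATVM_WEBHOOK MATTERMOST_ATVM_CHANNEL
```

Checking with `printenv` rather than a shell variable test confirms the value is actually visible to commands the script launches, which is the failure mode the limitation above describes.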
## 3. What do I need to do if secrets were already committed and pushed to the remote repository?
If secrets were already committed to git history and pushed, `.gitignore` does not fix that: it only keeps untracked files from being added and removes nothing from existing commits.
You need to treat those secrets as exposed.
## Required response
Do these in this order:
### Step 1: Rotate or revoke the exposed secrets
This is the most important step.
Examples:
- regenerate Mattermost webhook URLs
- replace API tokens
- rotate passwords
- regenerate TOTP/shared secrets if applicable
- replace any service registration or install tokens that should be considered exposed
Even if you later remove them from git history, assume they were already copied.
### Step 2: Remove secrets from the current tracked files
Edit the tracked docs and scripts so they no longer contain raw secrets.
Replace them with:
- references to `.env.credentials.local`
- redacted placeholders
- variable names
### Step 3: Rewrite git history to remove the secrets from all commits
This is a history-rewrite operation.
Typical tools:
- `git filter-repo` (preferred)
- BFG Repo-Cleaner
High-level workflow:
1. identify all tracked files and literal secrets that must be removed
2. rewrite repository history to remove or replace them
3. verify the secrets no longer exist in any commit
4. force-push the rewritten history to the remote
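For the verification step in the workflow above, a pickaxe search across all refs is one option. This is a sketch, not a complete audit: `secret_in_history` is a hypothetical helper, and the commented rewrite command assumes `git filter-repo` is installed and run on a fresh clone with a reviewed replacements file.

```bash
#!/usr/bin/env bash
# Sketch only: secret_in_history is a hypothetical helper name.

# Rewrite step, shown as comments because it is destructive and needs approval.
# Each line of replacements.txt is "literal==>replacement":
#   printf '%s\n' 'the-literal-secret==>REDACTED' > replacements.txt
#   git filter-repo --replace-text replacements.txt

# secret_in_history REPO LITERAL: succeed if any commit reachable from any
# ref added or removed an occurrence of LITERAL.
secret_in_history() {
  [ -n "$(git -C "$1" log --all --format=%h -S"$2")" ]
}

# Example verification after the rewrite:
#   if secret_in_history . 'the-literal-secret'; then
#     echo "secret still present in history" >&2
#   fi
```

Run the check once per literal secret before the force-push; a clean result here says nothing about copies outside the repository.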
### Step 4: Force-push the cleaned history
After rewriting history, the remote must be updated with a force push.
That usually means:
- `git push --force-with-lease origin <branch>`
### Step 5: Coordinate with anyone else using the repo
Anyone with an old clone will still have the old history unless they reset or reclone.
They need instructions to:
- stop using the old history
- fetch the rewritten branch
- hard reset or reclone as appropriate
## Important caution about remote cleanup
Cleaning the git remote history does not guarantee that every copy is gone.
Secrets may still exist in:
- old clones
- forks
- CI logs
- code review systems
- backups
- screenshots or pasted chat logs
That is why secret rotation must happen first.
## Recommended cleanup policy for this workspace
For this workspace, the recommended policy is:
- keep real secrets only in `/home/aw/code/cds/.env.credentials.local`
- keep that file gitignored
- remove raw secrets from tracked docs
- document variable names and usage instead of values
- rotate any secrets that were ever committed
- rewrite history if the repository should no longer retain those secret values
## Proposed next implementation work
When approved, the cleanup work would likely be:
1. inventory all tracked files containing secrets
2. patch those files to reference `.env.credentials.local`
3. update docs so the credential source is explicit
4. prepare a history-rewrite plan
5. prepare exact git commands for review before any destructive git action
## Git-history cleanup note
History rewriting is disruptive and should not be done casually.
Before doing it, prepare:
- the list of files and secrets to purge
- the exact rewrite tool and command
- the exact verification commands
- the exact force-push command
- the operator communication plan for other users of the repo
## Summary
Answers to the three direct questions:
### Question 1
Yes, the workspace can be cleaned up to stop storing secrets in tracked files and instead reference `/home/aw/code/cds/.env.credentials.local`.
### Question 2
To have the assistant reliably use `.env.credentials.local`, either:
- explicitly source it
- or ensure the script/command being run sources it
The assistant does not automatically inherit its contents just because the file exists.
### Question 3
If secrets were already committed and pushed:
- rotate them first
- remove them from current files
- rewrite git history
- force-push the cleaned history
- coordinate with anyone else who has a clone