diff --git a/.gitignore b/.gitignore
index 80dd262..5612761 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,2 @@
 log/
+.env.credentials.local
diff --git a/atvm/AGENTS.md b/atvm/AGENTS.md
index 7cb4469..c7f0a6e 100644
--- a/atvm/AGENTS.md
+++ b/atvm/AGENTS.md
@@ -47,6 +47,7 @@ This file defines how to operate and maintain the ATVM workspace in `/home/aw/co
 - Controller IP: `192.168.3.190`
 - Controller credentials: `root / atvmcdsi2012`
 - Detailed test artifact root on controller: `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`
+- Default Mattermost status destination config: `/home/aw/code/cds/.env.credentials.local`
 - Default plugin: `--use_specified_plugin iscsi`
 - Always include `--ignore_force_shutdown` unless explicitly told not to.
 - Default config family: `gold`
@@ -58,6 +59,8 @@ This file defines how to operate and maintain the ATVM workspace in `/home/aw/co
 - Always show exact planned ATVM commands before execution.
 - Never execute setup or automation commands that require approval until the operator explicitly approves them.
 - For host-level test detail and failed-test investigation, use `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter`, especially `logs/`, `xml/`, and `mochawesome/`.
+- If the operator asks for ATVM run status without mentioning Mattermost, respond locally only and do not post externally.
+- If the operator asks to send ATVM run status to Mattermost, use `MATTERMOST_ATVM_WEBHOOK` and `MATTERMOST_ATVM_CHANNEL` from `/home/aw/code/cds/.env.credentials.local` by default and send the final status only after the run has fully completed, whether the run passed or failed.
 - Treat `docs/automation/examples.md` as reference-only, not default operator intent.
 - Put reusable workflow rules in `guide.md` files.
 - Put dated lessons only in `run-learnings.md` files.
diff --git a/atvm/docs/automation/guide.md b/atvm/docs/automation/guide.md
index 8abb633..3b10b9f 100644
--- a/atvm/docs/automation/guide.md
+++ b/atvm/docs/automation/guide.md
@@ -174,6 +174,19 @@ When asked for one VM or a VM set:
 - If monitoring was not requested, run commands and report execution success/failure and any errors.
 - If monitoring was requested, do not terminate processes automatically; only terminate if the operator explicitly instructs termination.
 
+## Mattermost Status Posting
+- Treat a normal ATVM status request as local-only output by default.
+- When the operator asks to send ATVM automation run status to Mattermost, use the local defaults from `/home/aw/code/cds/.env.credentials.local`.
+- Default Mattermost variables:
+  - `MATTERMOST_ATVM_WEBHOOK`
+  - `MATTERMOST_ATVM_CHANNEL`
+- Treat these as the default destination for ATVM automation run-status posts unless the operator explicitly overrides them.
+- Send the final ATVM run status only after the run has fully completed, regardless of whether the run passed or failed.
+- Do not send interim or in-progress ATVM run status updates to Mattermost unless the operator explicitly asks for that.
+- Use the same ATVM status layout that would be shown to the operator locally when posting to Mattermost.
+- Default status template: `/home/aw/code/cds/atvm/docs/automation/status-template.md`
+- Do not post to Mattermost unless the operator explicitly asks for the run status to be sent there.
+
 ## Status Reporting Format
 When the operator asks for the status of an ATVM automation run, report in this order:
 1. Heading/title using the run `build_name`.
@@ -193,8 +206,11 @@ When the operator asks for the status of an ATVM automation run, report in this
 
 Status-report expectations:
 - Use the same display layout for every ATVM automation status response regardless of test type (`e2e`, `systemOS`, `reboot`, `migrateops`, and others).
+- Use `/home/aw/code/cds/atvm/docs/automation/status-template.md` as the default template for both local status output and Mattermost status posts.
+- The default ATVM status template uses Markdown tables for `SUMMARY:`, `HOSTS:`, and `TIMING:` and uses `NOTES:` for flat operator-facing notes.
 - Treat references to the "ATVM automation run" or "automation run" as referring to this ATVM folder workflow and the automation VM at `192.168.3.190`, not to Cirrus project operations such as the `atvm - cypress` project.
 - Treat a status request as a request for live status by default.
+- Unless the operator explicitly asks to send the status to Mattermost, print the status only in the local terminal response.
 - Use the live automation VM state when available.
 - If no automation is currently running, fall back to the most recent historical run artifacts and logs.
 - Prefer local automation evidence in this order: active runner processes, live automation-VM files, shell history for the last launch command, then historical reporter artifacts.
@@ -228,3 +244,4 @@ Status-report expectations:
 - Use `Notes` for extra context beyond the machine-specific same-line failure description.
 - Base the completion estimate on the full remaining machine count and recent per-machine runtime visible in the run log.
 - Make the estimate explicitly refer to completion of the entire remaining run, not only the current machine/spec.
+- When the operator also asks to send the status to Mattermost, send this same final status output to the configured Mattermost destination only after the run has fully completed.
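The posting rules that the `guide.md` hunks introduce (post only on explicit operator request, only after the run has fully completed, regardless of pass or fail) can be sketched in a few lines. This is a minimal illustration and not part of the commit: the function names `should_post_to_mattermost` and `build_webhook_payload` are hypothetical, and the payload shape assumes a standard Mattermost incoming webhook that accepts a JSON body with `text` and an optional `channel`.

```python
import json


def should_post_to_mattermost(operator_requested: bool, run_fully_completed: bool) -> bool:
    """Gate from guide.md: post only on explicit request and only after full completion.

    Pass/fail is deliberately not an input; a completed run is posted either way.
    """
    return operator_requested and run_fully_completed


def build_webhook_payload(status_markdown: str, env: dict) -> tuple:
    """Return (webhook_url, json_body) using the two default variable names.

    `env` is any mapping that already holds the values loaded from
    .env.credentials.local; this sketch never reads the file itself.
    """
    url = env["MATTERMOST_ATVM_WEBHOOK"]
    channel = env["MATTERMOST_ATVM_CHANNEL"]
    # Assumed payload shape for an incoming webhook: "text" plus optional "channel".
    body = json.dumps({"channel": channel, "text": status_markdown})
    return url, body
```

Keeping the gate separate from the payload builder means the "local-only by default" rule can be checked before any credential is touched.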
diff --git a/atvm/docs/automation/mattermost-watcher-design.md b/atvm/docs/automation/mattermost-watcher-design.md
new file mode 100644
index 0000000..6375041
--- /dev/null
+++ b/atvm/docs/automation/mattermost-watcher-design.md
@@ -0,0 +1,179 @@
+# ATVM Mattermost Watcher Design
+
+## Purpose
+Design a controller-local watcher on the ATVM Cypress machine (`192.168.3.190`) that monitors an ATVM automation run and posts the final run status to Mattermost only after the run has fully completed.
+
+This watcher must continue working even if the local operator machine is offline.
+
+## Implementation Approach
+Use a `systemd`-managed watcher on the ATVM Cypress controller.
+
+Recommended structure:
+- one watcher script that evaluates the state of a specific ATVM run
+- one `systemd` service to execute the watcher
+- optionally one `systemd` timer for periodic polling if the watcher is not implemented as a long-running process
+
+Preferred deployment target:
+- controller host: `192.168.3.190`
+- ATVM automation root: `/root/cdc-e2e-cyp-12.17.4`
+
+## Mattermost Destination
+Use the local credential file in this workspace as the source of defaults:
+- `/home/aw/code/cds/.env.credentials.local`
+
+Expected variables:
+- `MATTERMOST_ATVM_WEBHOOK`
+- `MATTERMOST_ATVM_CHANNEL`
+
+## Run Completion Rule
+The watcher must send Mattermost results only after the ATVM run has fully completed.
+
+A run is considered fully completed only when:
+- there are no active runner processes for the run
+- the expected machine scope has final result artifacts
+- no machine remains in `RUNNING` or `NOT STARTED`
+- final reporter artifacts confirm the run has ended
+
+Evidence sources:
+- live runner processes on `192.168.3.190`
+- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/logs/`
+- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/xml/`
+- `/root/cdc-e2e-cyp-12.17.4/cypress/cmcReporter/mochawesome/`
+
+## Required Run States
+The watcher must distinguish these run-level states:
+- `COMPLETED`
+- `FAILED`
+- `CANCELLED`
+- `TERMINATED`
+- `HUNG`
+- `UNKNOWN`
+- `RUNNING`
+
+Definitions:
+- `COMPLETED`
+  - the run finished normally
+  - all machines have final results
+  - no run-level failure state blocks completion
+- `FAILED`
+  - the run finished, but one or more hosts failed
+  - this is still a completed run
+- `CANCELLED`
+  - the run was intentionally cancelled through an explicit cancellation path
+- `TERMINATED`
+  - the run was manually killed or stopped before normal completion
+- `HUNG`
+  - the run appears stuck and does not meet completion rules within the expected policy window
+- `UNKNOWN`
+  - the watcher cannot safely determine the true state
+- `RUNNING`
+  - the run is still active and not yet complete
+
+## Mattermost Posting Rule
+Post to Mattermost only when the run has fully completed.
+
+Send Mattermost status for:
+- `COMPLETED`
+- `FAILED`
+
+Do not send Mattermost status for:
+- `CANCELLED`
+- `TERMINATED`
+- `HUNG`
+- `UNKNOWN`
+- `RUNNING`
+
+Important clarification:
+- a completed run with failed hosts should still be posted
+- a cancelled, terminated, hung, or unknown run should not be posted
+
+## Required Cancellation / Termination Handling
+If a run is cancelled or terminated, the watcher must:
+- detect that the run was cancelled or manually killed
+- stop waiting for normal completion
+- mark the run as closed without posting final Mattermost status
+- prevent any later success/failure post for that same run
+
+## State Tracking Requirements
+The watcher must track each monitored run by run id or build name.
+
+For each run, keep durable state such as:
+- tracked run id / build name
+- controller-side watcher state
+- completion marker
+- cancellation / termination marker
+- Mattermost posted marker
+- last observed machine summary
+- timestamps for first seen, last seen, closed
+
+## Duplicate-Post Prevention
+The watcher must prevent duplicate Mattermost posts.
+
+Required behavior:
+- only one final post per run
+- if a run is already marked as posted, do not send again
+- if a run is marked `CANCELLED`, `TERMINATED`, `HUNG`, or `UNKNOWN`, do not later convert it into a posted completion unless explicitly reset by an operator workflow
+
+## Recommended State Files
+Use a durable controller-local state directory, for example:
+- `/var/lib/atvm-run-watcher/`
+
+Possible contents:
+- one state file per run id
+- one posted marker per run id
+- one cancellation marker per run id
+- optional lock file to prevent multiple watcher instances from racing
+
+## Recommended Operator Workflow
+Normal completion workflow:
+1. ATVM run starts.
+2. Watcher tracks the run id / build name.
+3. Watcher polls run state and artifacts.
+4. Run fully completes.
+5. Watcher builds final status summary.
+6. Watcher posts final status to Mattermost once.
+7. Watcher marks the run as posted and closed.
+
+Cancellation / termination workflow:
+1. Operator stops the ATVM run.
+2. Watcher detects cancellation / termination, or an explicit cancellation marker is written.
+3. Watcher marks the run `CANCELLED` or `TERMINATED`.
+4. Watcher exits cleanly without posting to Mattermost.
+5. Watcher prevents later duplicate or misleading final-post behavior.
+
+## Failure Semantics
+Host-level failures do not suppress Mattermost posting.
+
+If:
+- the run has fully completed
+- and one or more hosts failed
+
+Then:
+- final Mattermost status should still be sent
+- final run-level state should be treated as completed-with-failures
+
+## Hang / Unknown Semantics
+If the run cannot be safely classified as completed, failed, cancelled, or terminated:
+- classify it as `HUNG` or `UNKNOWN`
+- do not post to Mattermost
+- require operator review
+
+## Logging Requirements
+The watcher should log:
+- the run id / build name being monitored
+- each state transition
+- posting decisions
+- reasons for suppressing a Mattermost post
+- duplicate-post prevention decisions
+- final closed state
+
+## Summary
+This watcher design must satisfy all of the following:
+- run on the ATVM Cypress controller
+- survive local operator machine downtime
+- use `systemd`
+- distinguish run states clearly
+- send Mattermost only after full completion
+- send completion results whether hosts passed or failed
+- never send Mattermost for cancelled, terminated, hung, or unknown runs
+- prevent duplicate or misleading posts
diff --git a/atvm/docs/automation/status-template.md b/atvm/docs/automation/status-template.md
new file mode 100644
index 0000000..5b7da5f
--- /dev/null
+++ b/atvm/docs/automation/status-template.md
@@ -0,0 +1,62 @@
+# ATVM Status Template
+
+Use this as the default ATVM automation run-status template for:
+- local status responses in the terminal
+- Mattermost status posts after a completed run
+
+## Layout
+
+```md
+## ATVM Run Status
+### 
+
+**SUMMARY:**
+
+| Metric | Value |
+|---|---:|
+| finished | |
+| passed | |
+| failed | |
+| skipped | |
+
+**HOSTS:**
+
+| Host | Status | Detail |
+|---|---|---|
+| | ✅ PASS | completed |
+| | ⚠️ FAIL | |
+| | ⏳ RUN | in progress |
+| | ⏭️ SKIP | |
+
+**TIMING:**
+
+| Metric | Value |
+|---|---|
+| start | |
+| end | |
+| total | |
+| quickest | - or n/a |
+| longest | - or n/a |
+| average | or n/a |
+
+**NOTES:**
+- 
+- 
+```
+
+## Rules
+- Keep `SUMMARY:`, `HOSTS:`, `TIMING:`, and `NOTES:` in that order.
+- Use the title format:
+  - `## ATVM Run Status`
+  - `### `
+- Use Markdown tables for `SUMMARY:`, `HOSTS:`, and `TIMING:`.
+- Use one host per row in the `HOSTS:` section.
+- For completed hosts, prefer:
+  - `✅ PASS`
+  - `⚠️ FAIL`
+- For in-progress or skipped hosts, use:
+  - `⏳ RUN`
+  - `⏭️ SKIP`
+- Keep `Detail` concise.
+- Put broader context under `NOTES:`, not in the host table.
+- Use the same template for Mattermost and local operator-visible status output.
diff --git a/atvm/docs/workflow/secrets-migration-and-cleanup.md b/atvm/docs/workflow/secrets-migration-and-cleanup.md
new file mode 100644
index 0000000..2961a50
--- /dev/null
+++ b/atvm/docs/workflow/secrets-migration-and-cleanup.md
@@ -0,0 +1,188 @@
+# Secrets Migration And Cleanup
+
+## Purpose
+This document explains:
+- whether the workspace can be cleaned up to stop storing credentials and tokens in tracked files
+- how `.env.credentials.local` should be used
+- what has to happen to remove already-committed secrets from git history and the remote repository
+
+## 1. Can the workspace be cleaned up to stop referencing raw secrets in tracked files?
+Yes.
+
+The intended cleanup is:
+- remove hardcoded credentials, API tokens, webhook URLs, and similar secrets from tracked docs and files
+- replace those values with references to `/home/aw/code/cds/.env.credentials.local`
+- keep only non-secret metadata in tracked files, such as:
+  - hostnames
+  - IPs
+  - usernames when acceptable
+  - variable names
+  - usage instructions
+
+Examples of what tracked docs should say instead of storing raw values:
+- `Use ATVM_CONTROLLER_PASSWORD from /home/aw/code/cds/.env.credentials.local`
+- `Use VCENTER_USER and VCENTER_PASSWORD from /home/aw/code/cds/.env.credentials.local`
+- `Use MATTERMOST_ATVM_WEBHOOK from /home/aw/code/cds/.env.credentials.local`
+
+Recommended scope of cleanup:
+- `atvm/inventory/accounts-and-credentials.md`
+- `atvm/inventory/infrastructure.md`
+- any other tracked docs or scripts that contain:
+  - passwords
+  - API tokens
+  - TOTP secrets
+  - webhook URLs
+  - install codes or secrets that should not remain in git
+
+## 2. What do I need to do for the assistant to use `.env.credentials.local`?
+The file exists on disk, but the assistant does not automatically import shell environment files unless one of the following is done:
+
+### Option A: Explicitly source it in the shell session
+Example:
+
+```bash
+source /home/aw/code/cds/.env.credentials.local
+```
+
+This is the simplest and most reliable option for interactive terminal work.
+
+### Option B: Scripts explicitly read it
+A script can do:
+
+```bash
+source /home/aw/code/cds/.env.credentials.local
+```
+
+before using any secret-backed variables.
+
+### Option C: The workflow documentation tells the assistant to load it
+The workspace docs can instruct the assistant to use `/home/aw/code/cds/.env.credentials.local` when credentials are required, but the assistant still needs an execution path that actually loads those variables into the shell or reads them directly from the file.
+
+## Practical rule
+If you want the assistant to reliably use these values during execution, the safest approach is:
+- either explicitly source the file first
+- or instruct the assistant to source it as part of the command/script it runs
+
+## Important limitation
+The existence of `.env.credentials.local` does not automatically make every shell command aware of those variables.
+
+The assistant needs one of these:
+- the current shell environment already contains the exported variables
+- the command explicitly sources the file
+- the script being executed explicitly sources the file
+
+## 3. What do I need to do if secrets were already committed and pushed to the remote repository?
+If secrets were already committed to git history and pushed, `.gitignore` does not fix that.
+
+You need to treat those secrets as exposed.
+
+## Required response
+Do these in this order:
+
+### Step 1: Rotate or revoke the exposed secrets
+This is the most important step.
+
+Examples:
+- regenerate Mattermost webhook URLs
+- replace API tokens
+- rotate passwords
+- regenerate TOTP/shared secrets if applicable
+- replace any service registration or install tokens that should be considered exposed
+
+Even if you later remove them from git history, assume they were already copied.
+
+### Step 2: Remove secrets from the current tracked files
+Edit the tracked docs and scripts so they no longer contain raw secrets.
+
+Replace them with:
+- references to `.env.credentials.local`
+- redacted placeholders
+- variable names
+
+### Step 3: Rewrite git history to remove the secrets from all commits
+This is a history-rewrite operation.
+
+Typical tools:
+- `git filter-repo` (preferred)
+- BFG Repo-Cleaner
+
+High-level workflow:
+1. identify all tracked files and literal secrets that must be removed
+2. rewrite repository history to remove or replace them
+3. verify the secrets no longer exist in any commit
+4. force-push the rewritten history to the remote
+
+### Step 4: Force-push the cleaned history
+After rewriting history, the remote must be updated with a force push.
+
+That usually means:
+- `git push --force-with-lease origin `
+
+### Step 5: Coordinate with anyone else using the repo
+Anyone with an old clone will still have the old history unless they reset or reclone.
+
+They need instructions to:
+- stop using the old history
+- fetch the rewritten branch
+- hard reset or reclone as appropriate
+
+## Important caution about remote cleanup
+Cleaning the git remote history does not guarantee that every copy is gone.
+
+Secrets may still exist in:
+- old clones
+- forks
+- CI logs
+- code review systems
+- backups
+- screenshots or pasted chat logs
+
+That is why secret rotation must happen first.
+
+## Recommended cleanup policy for this workspace
+For this workspace, the correct policy should be:
+- keep real secrets only in `/home/aw/code/cds/.env.credentials.local`
+- keep that file gitignored
+- remove raw secrets from tracked docs
+- document variable names and usage instead of values
+- rotate any secrets that were ever committed
+- rewrite history if the repository should no longer retain those secret values
+
+## Proposed next implementation work
+When approved, the cleanup work would likely be:
+1. inventory all tracked files containing secrets
+2. patch those files to reference `.env.credentials.local`
+3. update docs so the credential source is explicit
+4. prepare a history-rewrite plan
+5. prepare exact git commands for review before any destructive git action
+
+## Git-history cleanup note
+History rewriting is disruptive and should not be done casually.
+
+Before doing it, prepare:
+- the list of files and secrets to purge
+- the exact rewrite tool and command
+- the exact verification commands
+- the exact force-push command
+- the operator communication plan for other users of the repo
+
+## Summary
+Answers to the three direct questions:
+
+### Question 1
+Yes, the workspace can be cleaned up to stop storing secrets in tracked files and instead reference `/home/aw/code/cds/.env.credentials.local`.
+
+### Question 2
+To have the assistant reliably use `.env.credentials.local`, either:
+- explicitly source it
+- or ensure the script/command being run sources it
+
+The assistant does not automatically inherit its contents just because the file exists.
+
+### Question 3
+If secrets were already committed and pushed:
+- rotate them first
+- remove them from current files
+- rewrite git history
+- force-push the cleaned history
+- coordinate with anyone else who has a clone
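The limitation stated for Question 2 (the file existing does not mean its variables are loaded) suggests a small pre-flight check before any command that depends on those values. The sketch below is not part of the commit. It assumes `.env.credentials.local` holds plain `KEY=VALUE` lines, optionally prefixed with `export`; the diff does not show the file's actual format. The helper names `parse_env_file` and `missing_variables` are hypothetical. The check reports variable names only and never prints values.

```python
def parse_env_file(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments.

    Assumption: no multi-line values and no shell expansion, unlike `source`.
    """
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment; ignore
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def missing_variables(values: dict, required: list) -> list:
    """Return the required names that are absent or empty, in the given order."""
    return [name for name in required if not values.get(name)]


REQUIRED = ["MATTERMOST_ATVM_WEBHOOK", "MATTERMOST_ATVM_CHANNEL"]
sample = (
    "export MATTERMOST_ATVM_WEBHOOK='https://example.invalid/hooks/x'\n"
    "# channel intentionally left empty in this sample\n"
    "MATTERMOST_ATVM_CHANNEL=\n"
)
print(missing_variables(parse_env_file(sample), REQUIRED))
```

In real use the text would come from reading the credentials file rather than from `sample`. A non-empty result is a signal to stop and fix the file before running anything that posts to Mattermost.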