# Run ATVM Automation Runs

This file stores run-specific examples only when a run produced a new learning relevant to future automation tasks.

## Entry Rule

- Add an entry only when a run changed workflow behavior, exposed a failure mode, or confirmed a required new check.
- Do not add routine runs with no new learning.

## Current State

- No run-learning entries recorded yet from `guide.md` source material.

## Run Learning: 2026-03-08 (E2E redhat9.7, pure/fc)

- Request:
  - template: `cmc-e2e`
  - filter: `--containsVm redhat9.7`
  - integration: `--integration_type pure`
  - plugin: `--use_specified_plugin fc`
- Observed result:
  - Cypress spec execution passed (`1` test, `1` passing, `0` failing).
  - Cloud run URL was produced and marked uploaded.
  - `run-sorry-cypress.py` remained running afterward with a defunct `npm exec cypress-cloud` child process and did not exit cleanly on its own.
- Action for future runs:
  - If pass/upload is confirmed but `run-sorry-cypress.py` does not exit, treat it as a runner hang condition.
  - Capture run URL and pass/fail status first, then terminate the stuck runner process cleanly.

## Run Learning: 2026-03-09 (Blacklist handling and status format)

- Observed requirement:
  - Some ATVM machines must be skipped even when a broad selector such as `--containsVm` or `--randomize` would otherwise include them.
  - Machines to blacklist via `--exclude_partial_match`:
    - `BLACKLISTED: CMC INSTALL - CAN'T COMPILE`:
      - `atvm6-centos6.0`
      - `atvm41-redhat6.0`
      - `atvm73-oracle6.0`
    - `BLACKLISTED: SUPPORT REQUEST - WAITING`:
      - `atvm113-debian9.0.0`
      - `atvm115-debian9.1.0`
      - `atvm116-debian9.2.0`
    - `BLACKLISTED: RE-CREATE MIGHT BE NEEDED`:
      - `atvm156-debian9.3.0`
- Action for future runs:
  - Add these machine names to `--exclude_partial_match` when building broad-scope automation commands.
  - When reporting run status, include skipped blacklisted machines separately with their reason, in addition to completed and remaining machines.
  - Use the run `build_name` as the heading/title for status responses so the test type is obvious.
  - For failed machines in status responses, include the failure reason taken from the run log.
  - Include timing details in status responses: start time, end time when complete, and total or elapsed runtime.
  - Also include timing stats in status responses: quickest completed test runtime, longest completed test runtime, and average completed test runtime.

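The blacklist above can be kept as data so that both the exclude list and the "skipped with reason" part of a status response come from one place. The following is a minimal sketch, not the workflow's actual implementation: `BLACKLIST`, `exclude_names`, and `split_scope` are hypothetical names, and the contents reflect the blacklist as of this entry. How the names are passed to `--exclude_partial_match` (separator, quoting) is not shown in this file, so the sketch stops at producing the list.

```python
# Maintained blacklist as of this entry: reason label -> machine names.
BLACKLIST = {
    "BLACKLISTED: CMC INSTALL - CAN'T COMPILE": [
        "atvm6-centos6.0",
        "atvm41-redhat6.0",
        "atvm73-oracle6.0",
    ],
    "BLACKLISTED: SUPPORT REQUEST - WAITING": [
        "atvm113-debian9.0.0",
        "atvm115-debian9.1.0",
        "atvm116-debian9.2.0",
    ],
    "BLACKLISTED: RE-CREATE MIGHT BE NEEDED": [
        "atvm156-debian9.3.0",
    ],
}


def exclude_names(blacklist=BLACKLIST):
    """Flatten the blacklist into the machine names to exclude."""
    return [name for names in blacklist.values() for name in names]


def split_scope(scope, blacklist=BLACKLIST):
    """Partition a machine scope into (runnable, skipped-with-reason)."""
    reason_by_name = {
        name: reason for reason, names in blacklist.items() for name in names
    }
    runnable = [m for m in scope if m not in reason_by_name]
    skipped = [(m, reason_by_name[m]) for m in scope if m in reason_by_name]
    return runnable, skipped
```

The `skipped` pairs carry the reason alongside each machine, which is what the status rule above asks to report separately from completed and remaining machines.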
## Run Learning: 2026-03-11 (Machine-first status lines and whole-run ETA)

- Observed requirement:
  - Status output must list each machine first and then its status, rather than leading with the status label.
  - Estimated completion time must refer to the entire remaining automation run, not only the currently running machine.
- Action for future runs:
  - Format machine entries as `machine-name - STATUS`.
  - Keep failure reasons after the machine/status entry when a machine failed.
  - When giving ETA, explicitly state it is the estimate for completion of the full remaining run.

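One way to produce a whole-run ETA is to reuse the timing stats already required in status responses (quickest, longest, average completed runtime). The sketch below is illustrative only. It assumes machines run one at a time and that each remaining machine takes roughly the average completed runtime; neither assumption is stated in this file, and `timing_stats` and `whole_run_eta` are hypothetical names.

```python
from datetime import datetime, timedelta


def timing_stats(runtimes):
    """Quickest, longest, and average runtime of completed machines."""
    if not runtimes:
        return None
    total = sum(runtimes, timedelta())
    return {
        "quickest": min(runtimes),
        "longest": max(runtimes),
        "average": total / len(runtimes),
    }


def whole_run_eta(now, runtimes, remaining_count, running_elapsed=None):
    """Estimate completion of the full remaining run, not just the live machine.

    remaining_count is the number of machines that have not started yet.
    running_elapsed is how long the live machine has been running, if any.
    """
    stats = timing_stats(runtimes)
    if stats is None:
        return None  # no completed machine yet, so no basis for an estimate
    left = stats["average"] * remaining_count
    if running_elapsed is not None:
        # Add the expected remainder of the live machine, never negative.
        left += max(stats["average"] - running_elapsed, timedelta())
    return now + left
```

With completed runtimes of 30 and 50 minutes, two machines not yet started, and a live machine 10 minutes in, the estimate is 110 minutes after `now`.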
## Run Learning: 2026-03-11 (Categorized run status must be reconstructed across batches)

- Observed failure mode:
  - `run-sorry-cypress.py --categorize` mutates the active config to the current category batch, so live files such as `specPattern`, `current_vm`, and the newest `/tmp` Cypress JSON only describe the current category, not the full automation run.
  - Answering from only the current live batch underreports the run and misses already-finished machines from earlier category batches.
- Action for future runs:
  - Reconstruct whole-run status from the generated machine scope plus all machine result artifacts written since the run start time.
  - Use the current batch only to identify the live `RUNNING` machine and immediate next machine(s), not as the full run scope.
  - Do not answer status requests for categorized runs until earlier category results have been checked as part of the same run.

## Run Learning: 2026-03-11 (Hash-named XML files still belong to machine runs)

- Observed failure mode:
  - Same-run JUnit output is not consistently named `test-result-atvm...xml`.
  - Many machine results for the same automation run were written as hash-named files such as `test-result-01fe412894862398d06d9cc4bc7e81a0.xml`.
  - Limiting status reconstruction to machine-named XML files causes major undercounting of completed machines.
- Action for future runs:
  - Parse all `test-result-*.xml` files written since the run start time, not only `test-result-atvm*.xml`.
  - Extract the machine name from XML contents such as `testsuite file=`, `testsuite name=`, or `testcase name=` when the filename does not include the machine name.
  - Treat `check-xml-files.ts` XML outputs as bookkeeping steps, not machine results.
  - Prefer the most recently written same-run XML per machine when multiple XML files exist for that machine.

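The four actions above can be sketched as a single pass over the result directory. This is a rough illustration under stated assumptions, not the actual reconstruction code: the machine-name pattern is inferred from names such as `atvm6-centos6.0`, bookkeeping files are recognized by the string `check-xml-files` appearing in a `file` or `name` attribute, and file modification time stands in for "written since the run start time". The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed machine-name shape, inferred from names such as atvm6-centos6.0.
MACHINE_RE = re.compile(r"atvm\d+-[A-Za-z]+\d+(?:\.\d+)*")


def machine_from_xml(path):
    """Return the machine name found in the XML attributes, or None."""
    root = ET.parse(path).getroot()
    values = [
        elem.get(attr, "")
        for elem in root.iter()
        if elem.tag in ("testsuite", "testcase")
        for attr in ("file", "name")
    ]
    # check-xml-files.ts outputs are bookkeeping steps, not machine results.
    if any("check-xml-files" in value for value in values):
        return None
    for value in values:
        match = MACHINE_RE.search(value)
        if match:
            return match.group(0)
    return None


def latest_result_per_machine(result_dir, run_start_ts):
    """Map machine name -> most recently written same-run XML file."""
    latest = {}
    # Glob every test-result-*.xml, including hash-named files.
    for path in Path(result_dir).glob("test-result-*.xml"):
        mtime = path.stat().st_mtime
        if mtime < run_start_ts:
            continue  # written before this run started
        machine = machine_from_xml(path)
        if machine is None:
            continue
        if machine not in latest or mtime > latest[machine][1]:
            latest[machine] = (path, mtime)
    return {machine: path for machine, (path, _) in latest.items()}
```

Reading the machine name from the XML contents rather than the filename is what keeps hash-named files from being dropped.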
## Run Learning: 2026-03-12 (Status output must be one machine per line with notes separated)

- Observed requirement:
  - Listing multiple completed machines on one line makes run status harder to scan and does not meet the expected reporting format.
  - Failure reasons and extra context should be separated from the machine status list so the list stays clean.
- Action for future runs:
  - Under completed, skipped, and remaining sections, put exactly one machine status on each line.
  - Add a `Notes` section after completed machines for failure reasons, anomalies, and other operator-relevant context.
  - Keep completed machine lines in the form `machine-name - STATUS` and avoid appending long explanations inline.

## Run Learning: 2026-03-12 (Add suse15.0 machine to blacklist)

- Observed requirement:
  - `atvm144-suse15.0` must be excluded from automation runs because it crashes while creating the migration session.
- Action for future runs:
  - Add `atvm144-suse15.0` to the maintained blacklist.
  - Record the reason as `CRASHES WHEN CREATING MIGRATION SESSION - BUG`.
  - Include it in reusable `--exclude_partial_match` command examples.

## Run Learning: 2026-03-12 (Default to gold-named ATVM config files)

- Observed requirement:
  - The automation VM does not reliably have `cypress.atvm-config.ts`, and defaulting to that filename can break runs before they start.
  - Operator preference is to use ATVM config files with `gold` in the filename unless explicitly told otherwise.
- Action for future runs:
  - Do not reference `cypress.atvm-config.ts` by default in commands or examples.
  - Default to `cypress.atvm-config-gold.ts` unless the operator explicitly requests another config.

## Run Learning: 2026-03-12 (Examples are reference-only, not default intent)

- Observed requirement:
  - Reusable examples may contain extra excludes or options that the operator did not ask for.
  - Carrying those example details into a new run without confirmation can change the requested scope.
- Action for future runs:
  - Treat `examples.md` as reference-only.
  - Use only the options the operator explicitly requested, plus maintained mandatory blacklist handling.
  - Do not assume extra example exclusions such as distro filters are desired unless the operator asks for them.

## Run Learning: 2026-03-12 (Use one status format for all automation run types)

- Observed requirement:
  - The operator wants the same ATVM run status display every time, regardless of whether the run is `e2e`, `systemOS`, `reboot`, or another template.
  - Changing the display style between run types makes the status harder to scan and compare.
- Action for future runs:
  - Use one consistent ATVM status layout for all automation status responses.
  - Keep the order the same: build name, completed machines, notes, skipped machines, remaining machines, summary, timing, estimated completion time.
  - Keep machine entries one per line as `machine-name - STATUS` regardless of test type.

## Run Learning: 2026-03-13 (Put longer failure description on failed machine line)

- Observed requirement:
  - Failed machines are easier to scan when the failure description appears directly on the same line as the machine status.
  - A longer same-line description works better than a very short label when the extra detail helps explain what actually failed.
- Action for future runs:
  - Format failed machine lines as `machine-name - FAIL - <failure description>`.
  - Prefer the longer same-line description when it adds useful operator-facing context.
  - Keep `Notes` for broader context, anomalies, and extra follow-up detail beyond the machine-specific failure description.

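The line format above is small enough to pin down in a helper. A minimal sketch follows; `format_host_line` is a hypothetical name, and only the `FAIL` and `RUNNING` status labels appear in this file, so any other label passed in is the caller's choice.

```python
def format_host_line(machine, status, failure=None):
    """Render one machine per line as `machine-name - STATUS`.

    A failure description is appended on the same line only for FAIL.
    """
    line = f"{machine} - {status}"
    if status == "FAIL" and failure:
        line += f" - {failure}"
    return line
```

Broader context and anomalies would still go to the notes section rather than onto the machine line.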
## Run Learning: 2026-03-14 (Missing requested ATVM config must fail fast)

- Observed requirement:
  - If the operator asks for a specific ATVM config file and that file is missing on the automation VM, looking for other config files or substituting a different one creates the wrong next step.
  - The operator wants to decide what to do after a missing-config failure.
- Action for future runs:
  - If the requested config file is missing, stop immediately and report the missing filename.
  - Do not search the automation VM for alternate config files.
  - Do not switch to another config unless the operator explicitly instructs it.

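The fail-fast rule, combined with the gold-named default from the 2026-03-12 entry, can be sketched as follows. This is an illustration only: `resolve_config` and the `config_dir` parameter are hypothetical, and the directory that holds the config files on the automation VM is not named in this file.

```python
from pathlib import Path

# Default from the 2026-03-12 entry on gold-named config files.
DEFAULT_CONFIG = "cypress.atvm-config-gold.ts"


def resolve_config(config_dir, requested=None):
    """Return the config path to use, or fail fast if it is missing."""
    name = requested or DEFAULT_CONFIG
    path = Path(config_dir) / name
    if not path.is_file():
        # No search for alternates and no substitution: the operator decides.
        raise FileNotFoundError(f"ATVM config file not found: {name}")
    return path
```

The function deliberately has no fallback branch, so a missing file always surfaces to the operator with the filename that was asked for.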
## Run Learning: 2026-03-16 (Status requests default to live view with whole-run historical fallback)

- Observed requirement:
  - When the operator asks for ATVM automation run status, they want live status by default.
  - If no automation is currently running, the status response must fall back to the most recent historical run.
  - For categorized runs, the response must still cover the entire run rather than only the latest category batch or cloud sub-run.
- Action for future runs:
  - Treat every ATVM status request as a request for live run status unless the operator explicitly asks for something else.
  - If no automation is active, reconstruct status from the most recent historical run artifacts and logs.
  - For categorized runs, always aggregate all same-run category batches so the response covers the full run scope.

## Run Learning: 2026-03-17 (Default ignore-force-shutdown and iscsi plugin)

- Observed requirement:
  - The operator wants `--ignore_force_shutdown` included on every ATVM automation run by default.
  - The operator wants plugin selection to default to `--use_specified_plugin iscsi` unless a different plugin is explicitly requested.
- Action for future runs:
  - Add `--ignore_force_shutdown` to every `cmc-templates.py` command unless the operator explicitly asks not to use it.
  - Default plugin-bearing ATVM automation commands to `--use_specified_plugin iscsi`.
  - Only switch away from `iscsi` when the operator explicitly requests `fc`, `both`, or another applicable override.

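These defaults can be expressed as a small option builder. The sketch below only assembles the two options named in this entry; `default_options` is a hypothetical helper, it assumes the command is plugin-bearing, and the rest of the `cmc-templates.py` command line is out of scope here.

```python
def default_options(plugin=None, ignore_force_shutdown=True):
    """Build the default option list for a plugin-bearing ATVM command.

    plugin: explicit operator override such as "fc" or "both"; None means
    the iscsi default applies.
    ignore_force_shutdown: set to False only on explicit operator request.
    """
    options = []
    if ignore_force_shutdown:
        options.append("--ignore_force_shutdown")
    options += ["--use_specified_plugin", plugin or "iscsi"]
    return options
```

Keeping the override as an explicit argument makes it visible in the displayed command whether `iscsi` was a default or a request.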
## Run Learning: 2026-03-18 (ATVM status requests must resolve from the local ATVM workflow, not Cirrus project operations)

- Observed failure mode:
  - Interpreting "status of the ATVM automation run" as a request about Cirrus project operations can return the wrong source entirely.
  - The operator uses "ATVM automation" to mean the automation contained in the local `atvm` folder and the corresponding automation VM workflow.
- Action for future runs:
  - Resolve ATVM status requests from the local ATVM workflow first.
  - Check the automation VM at `192.168.3.190` for live runner processes and live files before looking at historical artifacts.
  - If no automation is active, reconstruct the most recent historical run from the automation VM shell history and reporter artifacts.
  - Do not use Cirrus project operations such as `atvm - cypress` as the source for ATVM automation status unless the operator explicitly asks for project-operation status.

## Run Learning: 2026-03-20 (Display exact ATVM commands and wait for approval before any execution)

- Observed failure mode:
  - ATVM run commands were executed before the operator had a chance to review and approve them.
  - This happened even though the operator expects a review gate before any ATVM automation command is launched.
- Action for future runs:
  - Always display the exact planned ATVM commands before execution.
  - Do not run `cmc-templates.py` until the operator explicitly approves the displayed commands.
  - Do not run `run-sorry-cypress.py` until the operator explicitly approves the displayed commands.
  - Treat template generation as execution that also requires operator approval.
  - If any requested option changes after commands are displayed, rebuild and redisplay the commands and wait for fresh approval.

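The review gate can be modeled as approval bound to the exact command text that was displayed, so that any change invalidates it. The following is a minimal sketch under that assumption; `ApprovalGate` is a hypothetical class, and hashing the command list is just one way to detect that the commands changed.

```python
import hashlib


class ApprovalGate:
    """Tracks whether the exact displayed commands have been approved."""

    def __init__(self):
        self._displayed = None  # fingerprint of the last displayed commands
        self._approved = None   # fingerprint the operator approved

    @staticmethod
    def _fingerprint(commands):
        return hashlib.sha256("\n".join(commands).encode()).hexdigest()

    def display(self, commands):
        """Record what was shown; any earlier approval is discarded."""
        self._displayed = self._fingerprint(commands)
        self._approved = None
        return "\n".join(commands)

    def approve(self):
        """Operator approved the most recently displayed commands."""
        if self._displayed is None:
            raise RuntimeError("no commands have been displayed yet")
        self._approved = self._displayed

    def may_execute(self, commands):
        """True only if these exact commands were displayed and approved."""
        return (
            self._approved is not None
            and self._approved == self._fingerprint(commands)
        )
```

Because `display` clears the approval, rebuilding the commands after an option change forces a fresh approval, as the last action above requires.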
## Run Learning: 2026-03-26 (Verify generated specs directly on the controller before launching the runner)

- Observed failure mode:
  - `cmc-templates.py` can successfully generate the requested `.ts` files, but a subsequent run can still start with an incomplete or stale `specPattern` if the runner is launched too early or the verification step is too fragile.
  - Shell-escaped regex one-liners used over SSH can fail even when the controller config is actually correct, which makes the verification gate unreliable.
- Action for future runs:
  - After `cmc-templates.py`, verify both the generated `.ts` files and the controller config `specPattern` before launching `run-sorry-cypress.py`.
  - Prefer direct controller-side inspection of the config block and file presence rather than fragile shell-escaped regex checks.
  - If the requested VM list is not visibly present in both places, stop and report the mismatch instead of starting the runner.

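A controller-side check along these lines avoids shell-escaped regexes entirely by using plain substring tests. This is a sketch under assumptions: it presumes each generated spec filename contains the VM name, that the config text passed in is the controller config contents (or just its `specPattern` block), and `verify_generated_scope` is a hypothetical name.

```python
from pathlib import Path


def verify_generated_scope(requested_vms, spec_dir, config_text):
    """Return (missing_files, missing_in_config) for the requested VMs.

    Both lists must be empty before the runner is started.
    """
    spec_names = [p.name for p in Path(spec_dir).glob("*.ts")]
    missing_files = [
        vm for vm in requested_vms
        if not any(vm in name for name in spec_names)
    ]
    missing_in_config = [vm for vm in requested_vms if vm not in config_text]
    return missing_files, missing_in_config
```

If either list is non-empty, the mismatch is reported and the runner is not launched.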
## Run Learning: 2026-03-26 (Do not repeat harmless reset-failed watcher noise)

- Observed requirement:
  - `systemctl reset-failed atvm-run-watcher@...` often reports that the unit was not loaded.
  - In normal watcher startup this has been harmless and does not change the run outcome.
  - Repeating that note in routine run confirmations adds noise without helping the operator.
- Action for future runs:
  - Do not mention expected, harmless `reset-failed` output in routine run updates.
  - Only mention it if it actually prevents watcher startup or becomes relevant to debugging.

## Run Learning: 2026-03-27 (Replace FUNCTIONALLY with TEST FLOW in status output)

- Observed requirement:
  - The operator wants the status format to show the full numbered ATVM test flow for the active template rather than a vague high-level `FUNCTIONALLY:` summary.
  - Each ATVM template can have its own test-flow step list.
  - The step list should appear once for the whole run, not repeated per host.
- Action for future runs:
  - Replace the `FUNCTIONALLY:` section with `TEST FLOW:` in ATVM status output.
  - Resolve `TEST FLOW:` from the ATVM template name instead of hardcoding one shared list for every template.
  - For `cmc-e2e`, use this numbered run flow:
    - `1. Verifying set up`
    - `2. Power on and obtain ip address and host name`
    - `3. Uninstall CMC if still exists`
    - `4. Setting up disk on the host`
    - `5. Copy CMC install command from GUI`
    - `6. Install CMC`
    - `7. Create migration session`
    - `8. Tracking Changes`
    - `9. Trigger cmotion and do I/O test before actual cutover`
    - `10. Verify data for cmotion`
    - `11. Trigger revert cmotion and do I/O test before and during cmotion`
    - `12. Verify data for revert cmotion`
    - `13. Trigger cmotion again`
    - `14. Finalize cutover`
    - `15. Create migration report`
    - `16. Delete migration session`
    - `17. Verify local destination disk`
    - `18. Remove enabled FC integration`
    - `19. Remove host and volumes`
    - `20. Uninstall CMC`
    - `21. Clean up iSCSI targets`
    - `22. Power off`

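Resolving `TEST FLOW:` from the template name amounts to a lookup keyed by template. A minimal sketch follows, with the `cmc-e2e` steps copied from the list above; `TEST_FLOWS` and `resolve_test_flow` are hypothetical names, and raising on an unknown template (rather than falling back to another template's list) is an assumption.

```python
# Per-template step lists. Only cmc-e2e is documented in this entry.
TEST_FLOWS = {
    "cmc-e2e": [
        "Verifying set up",
        "Power on and obtain ip address and host name",
        "Uninstall CMC if still exists",
        "Setting up disk on the host",
        "Copy CMC install command from GUI",
        "Install CMC",
        "Create migration session",
        "Tracking Changes",
        "Trigger cmotion and do I/O test before actual cutover",
        "Verify data for cmotion",
        "Trigger revert cmotion and do I/O test before and during cmotion",
        "Verify data for revert cmotion",
        "Trigger cmotion again",
        "Finalize cutover",
        "Create migration report",
        "Delete migration session",
        "Verify local destination disk",
        "Remove enabled FC integration",
        "Remove host and volumes",
        "Uninstall CMC",
        "Clean up iSCSI targets",
        "Power off",
    ],
}


def resolve_test_flow(template):
    """Return the TEST FLOW section lines for a template, once per run."""
    steps = TEST_FLOWS.get(template)
    if steps is None:
        # Do not borrow another template's list for an unknown template.
        raise KeyError(f"no TEST FLOW defined for template {template!r}")
    return ["TEST FLOW:"] + [f"{i}. {step}" for i, step in enumerate(steps, 1)]
```

Numbering is generated from list position, so adding or removing a step cannot leave the displayed numbers out of sync with the list.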
## Run Learning: 2026-03-27 (Start watcher before runner when watcher is requested)

- Observed failure mode:
  - Starting `run-sorry-cypress.py` before the watcher can race with the watcher helper's stale-log cleanup.
  - The watcher helper clears stale `/tmp/<build-name>.log` before startup.
  - If the runner has already opened the new log, the helper can delete that live log path, leaving the watcher unable to read the run by filename.
- Action for future runs:
  - When the watcher is approved, start the watcher before `run-sorry-cypress.py`.
  - Keep the order as: template generation, verification, watcher start, runner start.
  - Do not launch the runner first when the watcher is part of the approved command set.

## Run Learning: 2026-03-27 (Watcher must recover when the consolidated run log is missing)

- Observed failure mode:
  - A non-categorized watcher run can finish without posting Mattermost even when the ATVM test itself passed.
  - In this case the watcher service expected `/tmp/<build-name>.log`, but that consolidated run log was never written.
  - The run still produced the final `check-xml-files.ts` XML and fresh per-host reporter artifacts under `cmcReporter/logs/<host>/`.
- Action for future runs:
  - Do not rely only on `/tmp/<build-name>.log` for non-categorized watcher result recovery.
  - When final `check-xml-files.ts` validation is present but host XML is absent, recover host completion from the latest matching per-host reporter artifact within the run window.
  - Keep non-categorized watcher notes accurate; do not describe that failure as a categorized sub-run issue.

## Run Learning: 2026-03-27 (Non-categorized watcher runs must post once and show the full 22-step E2E flow)

- Observed failure mode:
  - A non-categorized watcher run for `cmc-e2e` sent two Mattermost posts for the same build.
  - The posted `TEST FLOW:` list only showed 18 steps even though the current `cmc-e2e` ATVM flow has 22 steps.
- Action for future runs:
  - For non-categorized runs, post only the parent run status and do not also post the single synthetic subrun.
  - Keep the static `cmc-e2e` watcher flow aligned with the current 22-step ATVM E2E sequence.

## Run Learning: 2026-03-27 (Use summary-first status layout for ATVM run results)

- Observed requirement:
  - The operator wants ATVM run results ordered as `SUMMARY:`, `HOSTS:`, `TIMING:`, `COVERAGE:`, `TEST FLOW:`, then `NOTES:`.
- Action for future runs:
  - Render ATVM status output in that section order for both local output and Mattermost posts.

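A single renderer with a fixed section order keeps local output and Mattermost posts identical. The sketch below is illustrative: `render_status` is a hypothetical name, the blank line between sections is a formatting choice, and omitting sections that have no lines is an assumption not stated in this file.

```python
SECTION_ORDER = ["SUMMARY:", "HOSTS:", "TIMING:", "COVERAGE:", "TEST FLOW:", "NOTES:"]


def render_status(sections):
    """Render status text; `sections` maps a section header to its lines."""
    unknown = set(sections) - set(SECTION_ORDER)
    if unknown:
        raise ValueError(f"unexpected sections: {sorted(unknown)}")
    out = []
    for header in SECTION_ORDER:
        lines = sections.get(header)
        if not lines:
            continue  # assumption: empty sections are omitted
        out.append(header)
        out.extend(lines)
        out.append("")  # blank line between sections
    return "\n".join(out).rstrip()
```

Because the order lives in one list, the caller can supply sections in any order and still get the summary-first layout.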
## Run Learning: 2026-03-27 (Persist the Currents run URL outside the transient runner log)

- Observed failure mode:
  - The watcher can include the Currents run URL in `NOTES:`, but only if it can still read the URL from live runner output or a consolidated run log.
  - In practice, `/tmp/<build-name>.log` is not guaranteed to exist, and the host reporter artifacts do not preserve the final Currents run URL.
- Action for future runs:
  - Persist the Currents `Recorded Run` URL as soon as `run-sorry-cypress.py` sees it.
  - Store it under the watcher state directory for the parent build so it survives runner exit and missing log files.
  - Prefer the persisted Currents URL store over transient log scraping when building the final `NOTES:` section.

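Persisting the URL can be as simple as one small file per parent build. The following is a rough sketch, not the watcher's actual storage format: the `currents-run-url.txt` filename, the per-build subdirectory layout, and the function names are all hypothetical, and the location of the watcher state directory is not given in this file.

```python
from pathlib import Path


def url_store(state_dir, build_name):
    """Path of the persisted URL file for a parent build (assumed layout)."""
    return Path(state_dir) / build_name / "currents-run-url.txt"


def persist_run_url(state_dir, build_name, url):
    """Write the URL as soon as the runner sees it."""
    path = url_store(state_dir, build_name)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(url + "\n")
    tmp.replace(path)  # rename into place so readers never see a partial file
    return path


def load_run_url(state_dir, build_name, log_scrape=None):
    """Prefer the persisted URL; fall back to log scraping only if absent."""
    path = url_store(state_dir, build_name)
    if path.is_file():
        return path.read_text().strip()
    return log_scrape() if log_scrape else None
```

`log_scrape` is passed in as a callable so the transient log is only touched when the persisted store has nothing.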
## Run Learning: 2026-03-27 (Keep ATVM notes meaningful and remove generic artifact-detected lines)

- Observed requirement:
  - Generic watcher bookkeeping notes such as "Run finished and one or more sub-run result artifacts were detected." and "Final reporting artifacts were detected." do not add operator value in ATVM status posts.
- Action for future runs:
  - Reserve `NOTES:` for meaningful operator-facing content such as the Currents run URL, real anomalies, failure context, and important fallback behavior.
  - Do not include generic artifact-detection confirmations in the posted `NOTES:` section.

## Run Learning: 2026-03-27 (Default ATVM approval should include the watcher)

- Observed requirement:
  - The operator wants `approve` to mean run with watcher by default.
  - The explicit no-watcher override should be `approve without watcher`.
- Action for future runs:
  - Treat `approve` as approval to run and start the watcher.
  - Treat `approve without watcher` as approval to run without starting the watcher.

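The two approval phrases map onto a pair of booleans. A minimal sketch is shown below; `parse_approval` is a hypothetical helper, and treating the reply case-insensitively and ignoring extra whitespace are assumptions. Anything other than the two exact phrases is treated as not approved.

```python
def parse_approval(reply):
    """Return (approved, with_watcher) for an operator reply."""
    text = " ".join(reply.strip().lower().split())
    if text == "approve":
        return True, True
    if text == "approve without watcher":
        return True, False
    # Any other reply is not an approval; nothing should be executed.
    return False, False
```

Matching whole phrases rather than prefixes avoids reading a reply such as "approve after changing the plugin" as approval.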
## Run Learning: 2026-03-27 (Do not auto-add blacklist excludes for explicitly specified VMs)

- Observed requirement:
  - When the operator explicitly specifies the VM or VM list to run, they do not want the maintained `--exclude_partial_match` blacklist added automatically.
- Action for future runs:
  - Keep the maintained `--exclude_partial_match` list for broad selectors such as `--containsVm` or `--randomize`.
  - When the operator uses `--specify_vms`, do not auto-add the blacklist unless they explicitly request it.
  - Even when the operator uses `--specify_vms`, first check whether any requested VM is on the maintained blacklist and stop instead of launching it if one is included.

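The three rules above reduce to one decision per request. The sketch below is illustrative only: `plan_blacklist_handling` and its return values are hypothetical, the selector is assumed to be passed as the flag name, and the blacklist is supplied by the caller as a list of machine names.

```python
def plan_blacklist_handling(selector, requested_vms, blacklist_names):
    """Decide how the maintained blacklist applies to a run request.

    Returns ("exclude", names) for broad selectors, ("run", []) for an
    explicit VM list with no blacklisted entries, and ("stop", hits) when
    an explicitly requested VM is on the blacklist.
    """
    if selector in ("--containsVm", "--randomize"):
        # Broad scope: pass the maintained list to --exclude_partial_match.
        return "exclude", list(blacklist_names)
    if selector == "--specify_vms":
        hits = [vm for vm in requested_vms if vm in blacklist_names]
        if hits:
            return "stop", hits  # report to the operator instead of launching
        return "run", []  # explicit list: no auto-added excludes
    raise ValueError(f"unhandled selector: {selector}")
```

An explicit operator request to add the blacklist with `--specify_vms` is not modeled here and would need its own flag on the function.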