Run ATVM Automation Runs
This file stores run-specific examples only when a run produced a new learning relevant to future automation tasks.
Entry Rule
- Add an entry only when a run changed workflow behavior, exposed a failure mode, or confirmed a required new check.
- Do not add routine runs with no new learning.
Current State
- No run-learning entries recorded yet from `guide.md` source material.
Run Learning: 2026-03-08 (E2E redhat9.7, pure/fc)
- Request:
- template: `cmc-e2e`
- filter: `--containsVm redhat9.7`
- integration: `--integration_type pure`
- plugin: `--use_specified_plugin fc`
- Observed result:
- Cypress spec execution passed (1 test, 1 passing, 0 failing).
- Cloud run URL was produced and marked uploaded.
- `run-sorry-cypress.py` remained running afterward with a defunct `npm exec cypress-cloud` child process and did not exit cleanly on its own.
- Action for future runs:
- If pass/upload is confirmed but `run-sorry-cypress.py` does not exit, treat it as a runner hang condition.
- Capture run URL and pass/fail status first, then terminate the stuck runner process cleanly.
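The hang handling above can be sketched as follows, assuming the runner was launched as a child process from Python. The helper name `stop_hung_runner` and the grace period are illustrative assumptions, not part of the ATVM tooling.

```python
import subprocess


def stop_hung_runner(proc: subprocess.Popen, grace_seconds: float = 10.0) -> int:
    """Terminate a runner that is still alive after its results were captured.

    Call this only after the run URL and pass/fail status have been recorded.
    """
    if proc.poll() is not None:
        return proc.returncode  # runner already exited on its own
    proc.terminate()  # polite stop first (SIGTERM on POSIX)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # escalate only if the polite stop was ignored
        return proc.wait()
```

Terminating first and killing only after the grace period gives the runner a chance to clean up before it is forced down.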
Run Learning: 2026-03-09 (Blacklist handling and status format)
- Observed requirement:
- Some ATVM machines must be skipped even when a broad selector such as `--containsVm` or `--randomize` would otherwise include them.
- Machines to blacklist via `--exclude_partial_match`:
- BLACKLISTED: CMC INSTALL - CAN'T COMPILE: `atvm6-centos6.0`, `atvm41-redhat6.0`, `atvm73-oracle6.0`
- BLACKLISTED: SUPPORT REQUEST - WAITING: `atvm113-debian9.0.0`, `atvm115-debian9.1.0`, `atvm116-debian9.2.0`
- BLACKLISTED: RE-CREATE MIGHT BE NEEDED: `atvm156-debian9.3.0`
- Action for future runs:
- Add these machine names to `--exclude_partial_match` when building broad-scope automation commands.
- When reporting run status, include skipped blacklisted machines separately with their reason, in addition to completed and remaining machines.
- Use the run `build_name` as the heading/title for status responses so the test type is obvious.
- For failed machines in status responses, include the failure reason taken from the run log.
- Include timing details in status responses: start time, end time when complete, and total or elapsed runtime.
- Also include timing stats in status responses: quickest completed test runtime, longest completed test runtime, and average completed test runtime.
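The timing stats requested above could be computed along these lines. This is a minimal sketch: the function name `timing_stats` and the input shape (machine name mapped to completed runtime) are assumptions for illustration.

```python
from datetime import timedelta
from statistics import mean


def timing_stats(runtimes: dict[str, timedelta]) -> dict[str, timedelta]:
    """Return quickest, longest, and average runtime over completed machines."""
    if not runtimes:
        raise ValueError("no completed machines yet")
    seconds = [rt.total_seconds() for rt in runtimes.values()]
    return {
        "quickest": timedelta(seconds=min(seconds)),
        "longest": timedelta(seconds=max(seconds)),
        "average": timedelta(seconds=mean(seconds)),
    }
```

Only completed machines should be passed in, so that a still-running machine does not skew the average.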
Run Learning: 2026-03-11 (Machine-first status lines and whole-run ETA)
- Observed requirement:
- Status output must list each machine first and then its status, rather than leading with the status label.
- Estimated completion time must refer to the entire remaining automation run, not only the currently running machine.
- Action for future runs:
- Format machine entries as `machine-name - STATUS`.
- Keep failure reasons after the machine/status entry when a machine failed.
- When giving ETA, explicitly state it is the estimate for completion of the full remaining run.
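One way to estimate completion of the full remaining run, rather than of the current machine only, is sketched below. It assumes machines run one at a time and that each takes roughly the average completed runtime; both are simplifying assumptions, and `whole_run_eta` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def whole_run_eta(now: datetime, running_elapsed: timedelta,
                  average_runtime: timedelta, remaining_count: int) -> datetime:
    """Estimate when the full remaining run finishes, not just the live machine."""
    # Time still expected on the machine that is running right now.
    left_on_running = max(average_runtime - running_elapsed, timedelta(0))
    # Add one average runtime for every machine that has not started yet.
    return now + left_on_running + remaining_count * average_runtime
```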
Run Learning: 2026-03-11 (Categorized run status must be reconstructed across batches)
- Observed failure mode:
- `run-sorry-cypress.py --categorize` mutates the active config to the current category batch, so live files such as `specPattern`, `current_vm`, and the newest `/tmp` Cypress JSON only describe the current category, not the full automation run.
- Answering from only the current live batch underreports the run and misses already-finished machines from earlier category batches.
- Action for future runs:
- Reconstruct whole-run status from the generated machine scope plus all machine result artifacts written since the run start time.
- Use the current batch only to identify the live `RUNNING` machine and immediate next machine(s), not as the full run scope.
- Do not answer status requests for categorized runs until earlier category results have been checked as part of the same run.
Run Learning: 2026-03-11 (Hash-named XML files still belong to machine runs)
- Observed failure mode:
- Same-run JUnit output is not consistently named `test-result-atvm...xml`.
- Many machine results for the same automation run were written as hash-named files such as `test-result-01fe412894862398d06d9cc4bc7e81a0.xml`.
- Limiting status reconstruction to machine-named XML files causes major undercounting of completed machines.
- Action for future runs:
- Parse all `test-result-*.xml` files written since the run start time, not only `test-result-atvm*.xml`.
- Extract the machine name from XML contents such as `testsuite file=`, `testsuite name=`, or `testcase name=` when the filename does not include the machine name.
- Treat `check-xml-files.ts` XML outputs as bookkeeping steps, not machine results.
- Prefer the most recently written same-run XML per machine when multiple XML files exist for that machine.
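The parsing rules above might look like this in practice. It is a sketch under stated assumptions: the machine-name pattern in `MACHINE_RE` is inferred from names such as `atvm144-suse15.0`, file modification time stands in for "written since the run start time", and both function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Optional

# Assumed machine-name shape, e.g. "atvm144-suse15.0"; adjust to the real naming.
MACHINE_RE = re.compile(r"atvm\d+-[A-Za-z]+\d+(?:\.\d+)*")


def machine_from_xml(path: Path) -> Optional[str]:
    """Find the machine name in the filename or in testsuite/testcase attributes."""
    match = MACHINE_RE.search(path.name)
    if match:
        return match.group(0)
    for elem in ET.parse(path).iter():
        if elem.tag not in ("testsuite", "testcase"):
            continue
        for attr in ("file", "name"):
            match = MACHINE_RE.search(elem.get(attr, ""))
            if match:
                return match.group(0)
    return None  # no machine found, e.g. check-xml-files.ts bookkeeping output


def latest_xml_per_machine(result_dir: Path, run_start: float) -> dict[str, Path]:
    """Map each machine to its newest test-result-*.xml written since run_start."""
    latest: dict[str, Path] = {}
    for path in result_dir.glob("test-result-*.xml"):
        mtime = path.stat().st_mtime
        if mtime < run_start:
            continue  # belongs to an earlier run
        machine = machine_from_xml(path)
        if machine is None:
            continue  # bookkeeping XML is not a machine result
        if machine not in latest or mtime > latest[machine].stat().st_mtime:
            latest[machine] = path
    return latest
```

Globbing on `test-result-*.xml` rather than `test-result-atvm*.xml` is what picks up the hash-named files.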
Run Learning: 2026-03-12 (Status output must be one machine per line with notes separated)
- Observed requirement:
- Listing multiple completed machines on one line makes run status harder to scan and does not meet the expected reporting format.
- Failure reasons and extra context should be separated from the machine status list so the list stays clean.
- Action for future runs:
- Under completed, skipped, and remaining sections, put exactly one machine status on each line.
- Add a `Notes` section after completed machines for failure reasons, anomalies, and other operator-relevant context.
- Keep completed machine lines in the form `machine-name - STATUS` and avoid appending long explanations inline.
Run Learning: 2026-03-12 (Add suse15.0 machine to blacklist)
- Observed requirement:
- `atvm144-suse15.0` must be excluded from automation runs because it crashes while creating the migration session.
- Action for future runs:
- Add `atvm144-suse15.0` to the maintained blacklist.
- Record the reason as `CRASHES WHEN CREATING MIGRATION SESSION - BUG`.
- Include it in reusable `--exclude_partial_match` command examples.
Run Learning: 2026-03-12 (Default to gold-named ATVM config files)
- Observed requirement:
- The automation VM does not reliably have `cypress.atvm-config.ts`, and defaulting to that filename can break runs before they start.
- Operator preference is to use ATVM config files with `gold` in the filename unless explicitly told otherwise.
- Action for future runs:
- Do not reference `cypress.atvm-config.ts` by default in commands or examples.
- Default to `cypress.atvm-config-gold.ts` unless the operator explicitly requests another config.
Run Learning: 2026-03-12 (Examples are reference-only, not default intent)
- Observed requirement:
- Reusable examples may contain extra excludes or options that the operator did not ask for.
- Carrying those example details into a new run without confirmation can change the requested scope.
- Action for future runs:
- Treat `examples.md` as reference-only.
- Use only the options the operator explicitly requested, plus maintained mandatory blacklist handling.
- Do not assume extra example exclusions such as distro filters are desired unless the operator asks for them.
Run Learning: 2026-03-12 (Use one status format for all automation run types)
- Observed requirement:
- The operator wants the same ATVM run status display every time, regardless of whether the run is `e2e`, `systemOS`, `reboot`, or another template.
- Changing the display style between run types makes the status harder to scan and compare.
- Action for future runs:
- Use one consistent ATVM status layout for all automation status responses.
- Keep the order the same: build name, completed machines, notes, skipped machines, remaining machines, summary, timing, estimated completion time.
- Keep machine entries one per line as `machine-name - STATUS` regardless of test type.
Run Learning: 2026-03-13 (Put longer failure description on failed machine line)
- Observed requirement:
- Failed machines are easier to scan when the failure description appears directly on the same line as the machine status.
- A longer same-line description works better than a very short label when the extra detail helps explain what actually failed.
- Action for future runs:
- Format failed machine lines as `machine-name - FAIL - <failure description>`.
- Prefer the longer same-line description when it adds useful operator-facing context.
- Keep `Notes` for broader context, anomalies, and extra follow-up detail beyond the machine-specific failure description.
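The line format above can be sketched as a small helper. The name `machine_line` and the `PASS` label in the usage note are assumptions; only `FAIL` and `RUNNING` appear as status labels in these notes.

```python
from typing import Optional


def machine_line(machine: str, status: str, failure: Optional[str] = None) -> str:
    """Render one machine per line, machine name first."""
    line = f"{machine} - {status}"
    if status == "FAIL" and failure:
        line += f" - {failure}"  # failure description stays on the same line
    return line
```

For example, `machine_line("atvm6-centos6.0", "PASS")` yields `atvm6-centos6.0 - PASS`, and a failed machine carries its description after a second ` - `.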
Run Learning: 2026-03-14 (Missing requested ATVM config must fail fast)
- Observed requirement:
- If the operator asks for a specific ATVM config file and that file is missing on the automation VM, looking for other config files or substituting a different one creates the wrong next step.
- The operator wants to decide what to do after a missing-config failure.
- Action for future runs:
- If the requested config file is missing, stop immediately and report the missing filename.
- Do not search the automation VM for alternate config files.
- Do not switch to another config unless the operator explicitly instructs it.
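A fail-fast config lookup consistent with the rules above might look like this. It is a sketch only: `resolve_config` and the `config_dir` parameter are hypothetical, and the default filename is the gold config named elsewhere in these notes.

```python
from pathlib import Path
from typing import Optional

DEFAULT_CONFIG = "cypress.atvm-config-gold.ts"


def resolve_config(config_dir: Path, requested: Optional[str] = None) -> Path:
    """Return the config to use, or fail fast if it is missing.

    Deliberately does not search for or substitute another config file.
    """
    path = config_dir / (requested or DEFAULT_CONFIG)
    if not path.is_file():
        raise FileNotFoundError(f"ATVM config not found: {path.name}")
    return path
```

Raising instead of falling back leaves the next step to the operator.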
Run Learning: 2026-03-16 (Status requests default to live view with whole-run historical fallback)
- Observed requirement:
- When the operator asks for ATVM automation run status, they want live status by default.
- If no automation is currently running, the status response must fall back to the most recent historical run.
- For categorized runs, the response must still cover the entire run rather than only the latest category batch or cloud sub-run.
- Action for future runs:
- Treat every ATVM status request as a request for live run status unless the operator explicitly asks for something else.
- If no automation is active, reconstruct status from the most recent historical run artifacts and logs.
- For categorized runs, always aggregate all same-run category batches so the response covers the full run scope.
Run Learning: 2026-03-17 (Default ignore-force-shutdown and iscsi plugin)
- Observed requirement:
- The operator wants `--ignore_force_shutdown` included on every ATVM automation run by default.
- The operator wants plugin selection to default to `--use_specified_plugin iscsi` unless a different plugin is explicitly requested.
- Action for future runs:
- Add `--ignore_force_shutdown` to every `cmc-templates.py` command unless the operator explicitly asks not to use it.
- Default plugin-bearing ATVM automation commands to `--use_specified_plugin iscsi`.
- Only switch away from `iscsi` when the operator explicitly requests `fc`, `both`, or another applicable override.
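These defaults can be sketched as a helper that builds the option tail of a `cmc-templates.py` command. The function name `default_run_options` is hypothetical, and the sketch covers only the two flags discussed here, not the full command line.

```python
from typing import Optional


def default_run_options(plugin: Optional[str] = None,
                        ignore_force_shutdown: bool = True) -> list[str]:
    """Build the default option tail for a cmc-templates.py command."""
    # iscsi is the default; fc, both, etc. only on explicit operator request.
    options = ["--use_specified_plugin", plugin or "iscsi"]
    if ignore_force_shutdown:
        options.append("--ignore_force_shutdown")
    return options
```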
Run Learning: 2026-03-18 (ATVM status requests must resolve from the local ATVM workflow, not Cirrus project operations)
- Observed failure mode:
- Interpreting "status of the ATVM automation run" as a request about Cirrus project operations can return the wrong source entirely.
- The operator uses "ATVM automation" to mean the automation contained in the local `atvm` folder and the corresponding automation VM workflow.
- Action for future runs:
- Resolve ATVM status requests from the local ATVM workflow first.
- Check the automation VM at `192.168.3.190` for live runner processes and live files before looking at historical artifacts.
- If no automation is active, reconstruct the most recent historical run from the automation VM shell history and reporter artifacts.
- Do not use Cirrus project operations such as `atvm - cypress` as the source for ATVM automation status unless the operator explicitly asks for project-operation status.
Run Learning: 2026-03-20 (Display exact ATVM commands and wait for approval before any execution)
- Observed failure mode:
- ATVM run commands were executed before the operator had a chance to review and approve them.
- This happened even though the operator expects a review gate before any ATVM automation command is launched.
- Action for future runs:
- Always display the exact planned ATVM commands before execution.
- Do not run `cmc-templates.py` until the operator explicitly approves the displayed commands.
- Do not run `run-sorry-cypress.py` until the operator explicitly approves the displayed commands.
- Treat template generation as execution that also requires operator approval.
- If any requested option changes after commands are displayed, rebuild and redisplay the commands and wait for fresh approval.
Run Learning: 2026-03-26 (Verify generated specs directly on the controller before launching the runner)
- Observed failure mode:
- `cmc-templates.py` can successfully generate the requested `.ts` files, but a subsequent run can still start with an incomplete or stale `specPattern` if the runner is launched too early or the verification step is too fragile.
- Shell-escaped regex one-liners used over SSH can fail even when the controller config is actually correct, which makes the verification gate unreliable.
- Action for future runs:
- After `cmc-templates.py`, verify both the generated `.ts` files and the controller config `specPattern` before launching `run-sorry-cypress.py`.
- Prefer direct controller-side inspection of the config block and file presence rather than fragile shell-escaped regex checks.
- If the requested VM list is not visibly present in both places, stop and report the mismatch instead of starting the runner.
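A controller-side check of this kind could be sketched as below. It assumes each requested VM name appears as a substring of its generated `.ts` filename and of the config text that holds `specPattern`; the real layout may differ, and `missing_vms` is a hypothetical helper.

```python
from pathlib import Path


def missing_vms(requested: list[str], spec_dir: Path,
                config_text: str) -> dict[str, list[str]]:
    """List requested VMs absent from generated .ts files or from the config text."""
    spec_names = [p.name for p in spec_dir.glob("*.ts")]
    return {
        "spec_files": [vm for vm in requested
                       if not any(vm in name for name in spec_names)],
        "spec_pattern": [vm for vm in requested if vm not in config_text],
    }
```

If either list is non-empty, the runner should not be started and the mismatch should be reported. Plain substring checks run on the controller avoid the shell-escaping problems of regex one-liners over SSH.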
Run Learning: 2026-03-26 (Do not repeat harmless reset-failed watcher noise)
- Observed requirement:
- `systemctl reset-failed atvm-run-watcher@...` often reports that the unit was not loaded.
- In normal watcher startup this has been harmless and does not change the run outcome.
- Repeating that note in routine run confirmations adds noise without helping the operator.
- Action for future runs:
- Do not mention expected, harmless `reset-failed` output in routine run updates.
- Only mention it if it actually prevents watcher startup or becomes relevant to debugging.
Run Learning: 2026-03-27 (Replace FUNCTIONALLY with TEST FLOW in status output)
- Observed requirement:
- The operator wants the status format to show the full numbered ATVM test flow for the active template rather than a vague high-level `FUNCTIONALLY:` summary.
- Each ATVM template can have its own test-flow step list.
- The step list should appear once for the whole run, not repeated per host.
- Action for future runs:
- Replace the `FUNCTIONALLY:` section with `TEST FLOW:` in ATVM status output.
- Resolve `TEST FLOW:` from the ATVM template name instead of hardcoding one shared list for every template.
- For `cmc-e2e`, use this numbered run flow:
1. Verifying set up
2. Power on and obtain ip address and host name
3. Uninstall CMC if still exists
4. Setting up disk on the host
5. Copy CMC install command from GUI
6. Install CMC
7. Create migration session
8. Tracking Changes
9. Trigger cmotion and do I/O test before actual cutover
10. Verify data for cmotion
11. Trigger revert cmotion and do I/O test before and during cmotion
12. Verify data for revert cmotion
13. Trigger cmotion again
14. Finalize cutover
15. Create migration report
16. Delete migration session
17. Verify local destination disk
18. Remove enabled FC integration
19. Remove host and volumes
20. Uninstall CMC
21. Clean up iSCSI targets
22. Power off
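Resolving `TEST FLOW:` from the template name could be sketched with a per-template lookup table. The names `TEST_FLOWS` and `render_test_flow` are hypothetical; the `cmc-e2e` steps are copied from the flow recorded in this entry, and other templates would need their own lists.

```python
TEST_FLOWS = {
    "cmc-e2e": [
        "Verifying set up",
        "Power on and obtain ip address and host name",
        "Uninstall CMC if still exists",
        "Setting up disk on the host",
        "Copy CMC install command from GUI",
        "Install CMC",
        "Create migration session",
        "Tracking Changes",
        "Trigger cmotion and do I/O test before actual cutover",
        "Verify data for cmotion",
        "Trigger revert cmotion and do I/O test before and during cmotion",
        "Verify data for revert cmotion",
        "Trigger cmotion again",
        "Finalize cutover",
        "Create migration report",
        "Delete migration session",
        "Verify local destination disk",
        "Remove enabled FC integration",
        "Remove host and volumes",
        "Uninstall CMC",
        "Clean up iSCSI targets",
        "Power off",
    ],
}


def render_test_flow(template: str) -> str:
    """Render the numbered TEST FLOW block once for the whole run."""
    steps = TEST_FLOWS.get(template)
    if steps is None:
        raise KeyError(f"no TEST FLOW defined for template {template!r}")
    lines = ["TEST FLOW:"]
    lines += [f"{n}. {step}" for n, step in enumerate(steps, start=1)]
    return "\n".join(lines)
```

Raising on an unknown template avoids silently reusing the E2E list for a template that has a different flow.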
Run Learning: 2026-03-27 (Start watcher before runner when watcher is requested)
- Observed failure mode:
- Starting `run-sorry-cypress.py` before the watcher can race with the watcher helper's stale-log cleanup.
- The watcher helper clears stale `/tmp/<build-name>.log` before startup.
- If the runner has already opened the new log, the helper can delete that live log path, leaving the watcher unable to read the run by filename.
- Action for future runs:
- When the watcher is approved, start the watcher before `run-sorry-cypress.py`.
- Keep the order as: template generation, verification, watcher start, runner start.
- Do not launch the runner first when the watcher is part of the approved command set.
Run Learning: 2026-03-27 (Watcher must recover when the consolidated run log is missing)
- Observed failure mode:
- A non-categorized watcher run can finish without posting Mattermost even when the ATVM test itself passed.
- In this case the watcher service expected `/tmp/<build-name>.log`, but that consolidated run log was never written.
- The run still produced the final `check-xml-files.ts` XML and fresh per-host reporter artifacts under `cmcReporter/logs/<host>/`.
- Action for future runs:
- Do not rely only on `/tmp/<build-name>.log` for non-categorized watcher result recovery.
- When final `check-xml-files.ts` validation is present but host XML is absent, recover host completion from the latest matching per-host reporter artifact within the run window.
- Keep non-categorized watcher notes accurate; do not describe that failure as a categorized sub-run issue.
Run Learning: 2026-03-27 (Non-categorized watcher runs must post once and show the full 22-step E2E flow)
- Observed failure mode:
- A non-categorized watcher run for `cmc-e2e` sent two Mattermost posts for the same build.
- The posted `TEST FLOW:` list only showed 18 steps even though the current `cmc-e2e` ATVM flow has 22 steps.
- Action for future runs:
- For non-categorized runs, post only the parent run status and do not also post the single synthetic subrun.
- Keep the static `cmc-e2e` watcher flow aligned with the current 22-step ATVM E2E sequence.
Run Learning: 2026-03-27 (Use summary-first status layout for ATVM run results)
- Observed requirement:
- The operator wants ATVM run results ordered as `SUMMARY:`, `HOSTS:`, `TIMING:`, `COVERAGE:`, `TEST FLOW:`, then `NOTES:`.
- Action for future runs:
- Render ATVM status output in that section order for both local output and Mattermost posts.
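A renderer that enforces this section order for both local output and Mattermost posts might be sketched as follows. `render_status` and the input shape are hypothetical. Using the build name as the heading follows the earlier status-format entry, and skipping empty sections is an assumption.

```python
SECTION_ORDER = ["SUMMARY:", "HOSTS:", "TIMING:", "COVERAGE:", "TEST FLOW:", "NOTES:"]


def render_status(build_name: str, sections: dict[str, list[str]]) -> str:
    """Render status sections in the fixed order, skipping empty ones."""
    lines = [build_name]
    for title in SECTION_ORDER:
        body = sections.get(title)
        if not body:
            continue
        lines.append(title)
        lines.extend(body)
    return "\n".join(lines)
```

Because the order lives in one list, every output path that calls the renderer produces the same layout regardless of how the caller assembled the sections.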
Run Learning: 2026-03-27 (Persist the Currents run URL outside the transient runner log)
- Observed failure mode:
- The watcher can include the Currents run URL in `NOTES:`, but only if it can still read the URL from live runner output or a consolidated run log.
- In practice, `/tmp/<build-name>.log` is not guaranteed to exist, and the host reporter artifacts do not preserve the final Currents run URL.
- Action for future runs:
- Persist the Currents `Recorded Run` URL as soon as `run-sorry-cypress.py` sees it.
- Store it under the watcher state directory for the parent build so it survives runner exit and missing log files.
- Prefer the persisted Currents URL store over transient log scraping when building the final `NOTES:` section.
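Persisting the URL as soon as it appears could be sketched like this. The exact shape of the `Recorded Run` output line, the state-directory layout, and the filename `currents-run-url.txt` are all assumptions for illustration.

```python
import re
from pathlib import Path
from typing import Optional

URL_RE = re.compile(r"https?://\S+")
URL_FILE = "currents-run-url.txt"  # hypothetical filename inside the state dir


def persist_run_url(log_line: str, state_dir: Path) -> Optional[str]:
    """Save the URL from a 'Recorded Run' line; return it, or None if absent."""
    if "Recorded Run" not in log_line:
        return None
    match = URL_RE.search(log_line)
    if match is None:
        return None
    state_dir.mkdir(parents=True, exist_ok=True)
    (state_dir / URL_FILE).write_text(match.group(0) + "\n")
    return match.group(0)


def load_run_url(state_dir: Path) -> Optional[str]:
    """Read the persisted URL back, independent of any runner log."""
    path = state_dir / URL_FILE
    return path.read_text().strip() if path.is_file() else None
```

The watcher can then call `load_run_url` when building `NOTES:` even after the runner has exited and the log is gone.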
Run Learning: 2026-03-27 (Keep ATVM notes meaningful and remove generic artifact-detected lines)
- Observed requirement:
- Generic watcher bookkeeping notes such as "Run finished and one or more sub-run result artifacts were detected." and "Final reporting artifacts were detected." do not add operator value in ATVM status posts.
- Action for future runs:
- Reserve `NOTES:` for meaningful operator-facing content such as the Currents run URL, real anomalies, failure context, and important fallback behavior.
- Do not include generic artifact-detection confirmations in the posted `NOTES:` section.
- Do not include internal fallback notes such as "`check-xml-files.ts` validation passed" or "host details were derived from reporter artifacts" in the posted `NOTES:` section.
Run Learning: 2026-03-27 (Categorized grouped XML may need host recovery from the subrun's per-host artifact)
- Observed failure mode:
- A categorized subrun can finish and write its grouped `test-result-<build>.xml`, but that XML may only contain `check-xml-files.ts`.
- In that case the watcher may know the grouped batch completed and even know its Currents URL, but still miss the host result unless it recovers the host from the matching per-host reporter artifact.
- Action for future runs:
- For categorized runs, when grouped XML only shows `check-xml-files.ts`, infer the subrun host from the categorized build id and recover the result from the latest matching per-host reporter artifact within the grouped completion window.
- Do not keep a completed grouped subrun in `RUNNING` just because the grouped XML lacked a host testcase entry.
Run Learning: 2026-03-27 (Categorized batch results must aggregate all hosts in the group and use the earliest grouped host timestamp)
- Observed failure mode:
- A categorized grouped batch can post with only one host even when the batch actually ran multiple hosts of the same distro group.
- This also causes the grouped `start` and `total` timing values to collapse to the last recovered host artifact instead of the full grouped batch duration.
- Action for future runs:
- For categorized grouped batches, recover all matching per-host reporter artifacts for the distro group within the grouped completion window, not only the latest host.
- Derive the grouped `start` time from the earliest recovered host run timestamp and the grouped `end` time from the grouped finalization timestamp.
- Prefer the reporter JSON metadata timestamp or artifact filename timestamp over file write time when reconstructing grouped host timing, because file mtime reflects artifact completion rather than run start.
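The grouped timing rule can be sketched as below, assuming the per-host run timestamps have already been recovered from reporter metadata or artifact filenames. The function name `grouped_timing` and the input shape are hypothetical.

```python
from datetime import datetime


def grouped_timing(host_starts: dict[str, datetime],
                   finalized_at: datetime) -> dict[str, object]:
    """Derive grouped start/end/total from all recovered hosts in the batch."""
    if not host_starts:
        raise ValueError("no host artifacts recovered for this batch")
    start = min(host_starts.values())  # earliest host run timestamp in the group
    return {"start": start, "end": finalized_at, "total": finalized_at - start}
```

Taking the minimum across all hosts keeps `start` and `total` from collapsing to the last recovered host.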
Run Learning: 2026-03-27 (Default ATVM approval should include the watcher)
- Observed requirement:
- The operator wants `approve` to mean run with watcher by default.
- The explicit no-watcher override should be `approve without watcher`.
- Action for future runs:
- Treat `approve` as approval to run and start the watcher.
- Treat `approve without watcher` as approval to run without starting the watcher.
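The two approval phrases could be interpreted as in the sketch below. The helper name `parse_approval` is hypothetical, and treating any other reply as "not approved" is an assumption that matches the review gate described earlier.

```python
from typing import Optional


def parse_approval(reply: str) -> Optional[dict[str, bool]]:
    """Interpret the operator reply; None means the commands are not approved."""
    text = " ".join(reply.strip().lower().split())  # normalize case and spacing
    if text == "approve":
        return {"run": True, "watcher": True}
    if text == "approve without watcher":
        return {"run": True, "watcher": False}
    return None
```

Exact matching is deliberate: a reply such as "approve but change the plugin" should trigger a rebuild and redisplay, not a launch.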
Run Learning: 2026-03-27 (Expand coverage details with operator-relevant run options)
- Observed requirement:
- The operator wants `COVERAGE:` to include more than template and datastore family.
- Useful additions include the config filename and important flags such as `--ignore_force_shutdown`.
- Explicit VM names do not need to be repeated there because the host listing already shows them.
- Action for future runs:
- Include the ATVM config filename in `COVERAGE:`.
- Include important operator-relevant run options such as `--ignore_force_shutdown` in `COVERAGE:`.
- Keep `COVERAGE:` focused on run intent and options, not the explicit target-host list.
- Do not include verbose prose lines such as `scope of this run: ...` in `COVERAGE:`.
- Treat `COVERAGE:` as a concise reflection of the important `cmc-templates.py` command inputs.
Run Learning: 2026-03-27 (Log the exact template command in NOTES)
- Observed requirement:
- The operator wants `NOTES:` to include the exact `cmc-templates.py` command that triggered the ATVM run.
- The outer `sshpass`/`ssh` wrapper should be omitted, but the command itself should not be trimmed even when long.
- Action for future runs:
- Store and display the exact `cmc-templates.py` command in `NOTES:`.
- Omit only the outer remote-execution wrapper.
Run Learning: 2026-03-27 (Do not auto-add blacklist excludes for explicitly specified VMs)
- Observed requirement:
- When the operator explicitly specifies the VM or VM list to run, they do not want the maintained `--exclude_partial_match` blacklist added automatically.
- Action for future runs:
- Keep the maintained `--exclude_partial_match` list for broad selectors such as `--containsVm` or `--randomize`.
- When the operator uses `--specify_vms`, do not auto-add the blacklist unless they explicitly request it.
- Even when the operator uses `--specify_vms`, first check whether any requested VM is on the maintained blacklist and stop instead of launching it if one is included.
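The blacklist rules for broad and explicit selections can be sketched together. The machine names and reasons come from the entries in this file; the `BLACKLIST` structure, the function name `blacklist_excludes`, and exact-name matching for explicit VMs are assumptions. How the returned names are passed to `--exclude_partial_match` is left out because the argument format is not recorded here.

```python
BLACKLIST = {
    "atvm6-centos6.0": "CMC INSTALL - CAN'T COMPILE",
    "atvm41-redhat6.0": "CMC INSTALL - CAN'T COMPILE",
    "atvm73-oracle6.0": "CMC INSTALL - CAN'T COMPILE",
    "atvm113-debian9.0.0": "SUPPORT REQUEST - WAITING",
    "atvm115-debian9.1.0": "SUPPORT REQUEST - WAITING",
    "atvm116-debian9.2.0": "SUPPORT REQUEST - WAITING",
    "atvm156-debian9.3.0": "RE-CREATE MIGHT BE NEEDED",
    "atvm144-suse15.0": "CRASHES WHEN CREATING MIGRATION SESSION - BUG",
}


def blacklist_excludes(specified_vms=None):
    """Return machine names to exclude, or stop if an explicit VM is blacklisted."""
    if specified_vms:
        blocked = [vm for vm in specified_vms if vm in BLACKLIST]
        if blocked:
            # Stop instead of launching a blacklisted machine.
            raise ValueError(f"blacklisted VM requested: {', '.join(blocked)}")
        return []  # explicit VM list: do not auto-add the blacklist
    return list(BLACKLIST)  # broad selector: exclude every blacklisted machine
```

Keeping the reason next to each name also lets status output list skipped machines with their reason.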