cds-ai/atvm/docs/automation/run-learnings.md

ATVM Automation Run Learnings

This file stores run-specific examples only when a run produced a new learning relevant to future automation tasks.

Entry Rule

  • Add an entry only when a run changed workflow behavior, exposed a failure mode, or confirmed a required new check.
  • Do not add routine runs with no new learning.

Current State

  • No run-learning entries recorded yet from guide.md source material.

Run Learning: 2026-03-08 (E2E redhat9.7, pure/fc)

  • Request:
    • template: cmc-e2e
    • filter: --containsVm redhat9.7
    • integration: --integration_type pure
    • plugin: --use_specified_plugin fc
  • Observed result:
    • Cypress spec execution passed (1 test, 1 passing, 0 failing).
    • Cloud run URL was produced and marked uploaded.
    • run-sorry-cypress.py remained running afterward with a defunct npm exec cypress-cloud child process and did not exit cleanly on its own.
  • Action for future runs:
    • If pass/upload is confirmed but run-sorry-cypress.py does not exit, treat it as a runner hang condition.
    • Capture run URL and pass/fail status first, then terminate the stuck runner process cleanly.
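
The capture-then-terminate order above can be sketched as follows. This is a minimal illustration, assuming the runner was started as a `subprocess.Popen` handle; `RunOutcome` and `finalize_hung_run` are hypothetical names, not part of the ATVM tooling.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class RunOutcome:
    run_url: str
    passed: bool
    runner_exit_code: int


def finalize_hung_run(proc: subprocess.Popen, run_url: str, passed: bool,
                      grace_seconds: float = 10.0) -> RunOutcome:
    """Record the confirmed result first, then stop the stuck runner."""
    # The run URL and pass/fail status arrive as arguments, so they are
    # already captured before the process is touched.
    if proc.poll() is None:
        proc.terminate()  # polite request (SIGTERM on POSIX)
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()  # escalate only if the request is ignored
            proc.wait()
    return RunOutcome(run_url, passed, proc.returncode)
```

The grace period of 10 seconds is an arbitrary assumption; the point is that termination is requested first and forced only as a fallback.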

Run Learning: 2026-03-09 (Blacklist handling and status format)

  • Observed requirement:
    • Some ATVM machines must be skipped even when a broad selector such as --containsVm or --randomize would otherwise include them.
  • Machines to blacklist via --exclude_partial_match:
    • BLACKLISTED: CMC INSTALL - CAN'T COMPILE:
      • atvm6-centos6.0
      • atvm41-redhat6.0
      • atvm73-oracle6.0
    • BLACKLISTED: SUPPORT REQUEST - WAITING:
      • atvm113-debian9.0.0
      • atvm115-debian9.1.0
      • atvm116-debian9.2.0
    • BLACKLISTED: RE-CREATE MIGHT BE NEEDED:
      • atvm156-debian9.3.0
  • Action for future runs:
    • Add these machine names to --exclude_partial_match when building broad-scope automation commands.
    • When reporting run status, include skipped blacklisted machines separately with their reason, in addition to completed and remaining machines.
    • Use the run build_name as the heading/title for status responses so the test type is obvious.
    • For failed machines in status responses, include the failure reason taken from the run log.
    • Include timing details in status responses: start time, end time when complete, and total or elapsed runtime.
    • Also include timing stats in status responses: quickest completed test runtime, longest completed test runtime, and average completed test runtime.
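
A minimal sketch of the blacklist bookkeeping and timing stats described above. The machine names and reasons come from this entry; `split_scope`, `exclude_values`, and `timing_stats` are hypothetical helpers, and substring matching is an assumption based on the flag name `--exclude_partial_match`. The exact value format the flag expects is not recorded here, so the sketch returns only the list of names.

```python
from statistics import mean

# Reasons and machine names as recorded in this entry.
BLACKLIST = {
    "CMC INSTALL - CAN'T COMPILE": [
        "atvm6-centos6.0", "atvm41-redhat6.0", "atvm73-oracle6.0",
    ],
    "SUPPORT REQUEST - WAITING": [
        "atvm113-debian9.0.0", "atvm115-debian9.1.0", "atvm116-debian9.2.0",
    ],
    "RE-CREATE MIGHT BE NEEDED": ["atvm156-debian9.3.0"],
}


def exclude_values() -> list[str]:
    """Flat list of names to pass to --exclude_partial_match."""
    return [name for names in BLACKLIST.values() for name in names]


def split_scope(machines: list[str]) -> tuple[list[str], list[tuple[str, str]]]:
    """Separate runnable machines from skipped ones, keeping the skip reason."""
    kept, skipped = [], []
    for machine in machines:
        reason = next(
            (r for r, names in BLACKLIST.items()
             if any(name in machine for name in names)),  # assumed partial match
            None,
        )
        if reason is None:
            kept.append(machine)
        else:
            skipped.append((machine, reason))
    return kept, skipped


def timing_stats(runtimes_seconds: list[float]) -> dict[str, float]:
    """Quickest, longest, and average runtime of completed tests."""
    return {
        "quickest": min(runtimes_seconds),
        "longest": max(runtimes_seconds),
        "average": mean(runtimes_seconds),
    }
```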

Run Learning: 2026-03-11 (Machine-first status lines and whole-run ETA)

  • Observed requirement:
    • Status output must list each machine first and then its status, rather than leading with the status label.
    • Estimated completion time must refer to the entire remaining automation run, not only the currently running machine.
  • Action for future runs:
    • Format machine entries as machine-name - STATUS.
    • Keep failure reasons after the machine/status entry when a machine failed.
    • When giving ETA, explicitly state it is the estimate for completion of the full remaining run.
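
One way to compute a whole-run ETA is sketched below. It assumes machines run one at a time and that the average runtime of completed machines is a fair predictor for the rest; neither assumption is confirmed by this entry, and `whole_run_eta` and `eta_line` are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def whole_run_eta(now: datetime, completed: Sequence[timedelta],
                  remaining_count: int,
                  running_elapsed: Optional[timedelta] = None) -> Optional[datetime]:
    """Estimate when the entire remaining run finishes, not just the live machine."""
    if not completed:
        return None  # nothing to average yet
    average = sum(completed, timedelta()) / len(completed)
    left = average * remaining_count  # machines that have not started
    if running_elapsed is not None:
        # Time the currently running machine is still expected to need.
        left += max(average - running_elapsed, timedelta())
    return now + left


def eta_line(eta: Optional[datetime]) -> str:
    """State explicitly that the estimate covers the full remaining run."""
    prefix = "Estimated completion of the full remaining run: "
    if eta is None:
        return prefix + "not enough completed machines to estimate"
    return prefix + f"{eta:%Y-%m-%d %H:%M}"
```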

Run Learning: 2026-03-11 (Categorized run status must be reconstructed across batches)

  • Observed failure mode:
    • run-sorry-cypress.py --categorize mutates the active config to the current category batch, so live state such as the config specPattern, current_vm, and the newest /tmp Cypress JSON describes only the current category, not the full automation run.
    • Answering from only the current live batch underreports the run and misses already-finished machines from earlier category batches.
  • Action for future runs:
    • Reconstruct whole-run status from the generated machine scope plus all machine result artifacts written since the run start time.
    • Use the current batch only to identify the live RUNNING machine and immediate next machine(s), not as the full run scope.
    • Do not answer status requests for categorized runs until earlier category results have been checked as part of the same run.

Run Learning: 2026-03-11 (Hash-named XML files still belong to machine runs)

  • Observed failure mode:
    • Same-run JUnit output is not consistently named test-result-atvm...xml.
    • Many machine results for the same automation run were written as hash-named files such as test-result-01fe412894862398d06d9cc4bc7e81a0.xml.
    • Limiting status reconstruction to machine-named XML files causes major undercounting of completed machines.
  • Action for future runs:
    • Parse all test-result-*.xml files written since the run start time, not only test-result-atvm*.xml.
    • Extract the machine name from XML contents such as testsuite file=, testsuite name=, or testcase name= when the filename does not include the machine name.
    • Treat check-xml-files.ts XML outputs as bookkeeping steps, not machine results.
    • Prefer the most recently written same-run XML per machine when multiple XML files exist for that machine.
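
A minimal sketch of the XML handling above, using only the standard library. The machine-name pattern is inferred from names such as atvm6-centos6.0 and is an assumption, as is detecting bookkeeping output by looking for the text check-xml-files.ts inside the file. `machine_from_xml` and `latest_results` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Optional

# Assumed naming pattern, e.g. atvm6-centos6.0 or atvm113-debian9.0.0.
MACHINE_RE = re.compile(r"atvm\d+-[a-z]+\d+(?:\.\d+)*")
BOOKKEEPING_MARKER = "check-xml-files.ts"  # assumed to appear in bookkeeping XML


def machine_from_xml(path: Path) -> Optional[str]:
    """Find the machine name in the filename, else in the XML attributes."""
    match = MACHINE_RE.search(path.name)
    if match:
        return match.group(0)
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        return None  # unreadable or partially written file
    for elem in root.iter():
        if elem.tag == "testsuite":
            attrs = ("file", "name")
        elif elem.tag == "testcase":
            attrs = ("name",)
        else:
            continue
        for attr in attrs:
            match = MACHINE_RE.search(elem.get(attr, ""))
            if match:
                return match.group(0)
    return None


def latest_results(results_dir: Path, run_start: float) -> dict[str, Path]:
    """Newest same-run result XML per machine, hash-named files included."""
    newest: dict[str, tuple[float, Path]] = {}
    for path in results_dir.glob("test-result-*.xml"):
        mtime = path.stat().st_mtime
        if mtime < run_start:
            continue  # written before this run started
        if BOOKKEEPING_MARKER in path.read_text(errors="replace"):
            continue  # bookkeeping step, not a machine result
        machine = machine_from_xml(path)
        if machine is None:
            continue
        if machine not in newest or mtime > newest[machine][0]:
            newest[machine] = (mtime, path)
    return {machine: path for machine, (_, path) in newest.items()}
```

`run_start` is a POSIX timestamp; comparing against file modification time is one possible reading of "written since the run start time".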

Run Learning: 2026-03-12 (Status output must be one machine per line with notes separated)

  • Observed requirement:
    • Listing multiple completed machines on one line makes run status harder to scan and does not meet the expected reporting format.
    • Failure reasons and extra context should be separated from the machine status list so the list stays clean.
  • Action for future runs:
    • Under completed, skipped, and remaining sections, put exactly one machine status on each line.
    • Add a Notes section after completed machines for failure reasons, anomalies, and other operator-relevant context.
    • Keep completed machine lines in the form machine-name - STATUS and avoid appending long explanations inline.

Run Learning: 2026-03-12 (Add suse15.0 machine to blacklist)

  • Observed requirement:
    • atvm144-suse15.0 must be excluded from automation runs because it crashes while creating the migration session.
  • Action for future runs:
    • Add atvm144-suse15.0 to the maintained blacklist.
    • Record the reason as CRASHES WHEN CREATING MIGRATION SESSION - BUG.
    • Include it in reusable --exclude_partial_match command examples.

Run Learning: 2026-03-12 (Default to gold-named ATVM config files)

  • Observed requirement:
    • The automation VM does not reliably have cypress.atvm-config.ts, and defaulting to that filename can break runs before they start.
    • Operator preference is to use ATVM config files with gold in the filename unless explicitly told otherwise.
  • Action for future runs:
    • Do not reference cypress.atvm-config.ts by default in commands or examples.
    • Default to cypress.atvm-config-gold.ts unless the operator explicitly requests another config.

Run Learning: 2026-03-12 (Examples are reference-only, not default intent)

  • Observed requirement:
    • Reusable examples may contain extra excludes or options that the operator did not ask for.
    • Carrying those example details into a new run without confirmation can change the requested scope.
  • Action for future runs:
    • Treat examples.md as reference-only.
    • Use only the options the operator explicitly requested, plus maintained mandatory blacklist handling.
    • Do not assume extra example exclusions such as distro filters are desired unless the operator asks for them.

Run Learning: 2026-03-12 (Use one status format for all automation run types)

  • Observed requirement:
    • The operator wants the same ATVM run status display every time, regardless of whether the run is e2e, systemOS, reboot, or another template.
    • Changing the display style between run types makes the status harder to scan and compare.
  • Action for future runs:
    • Use one consistent ATVM status layout for all automation status responses.
    • Keep the order the same: build name, completed machines, notes, skipped machines, remaining machines, summary, timing, estimated completion time.
    • Keep machine entries one per line as machine-name - STATUS regardless of test type.
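
The fixed section order above could be enforced with a small renderer like the one below. This is a sketch: the upper-case headings with a trailing colon are an assumption, and `render_status` is a hypothetical helper.

```python
# Section order as listed in this entry; the build name is always the title.
SECTION_ORDER = [
    "completed machines",
    "notes",
    "skipped machines",
    "remaining machines",
    "summary",
    "timing",
    "estimated completion time",
]


def render_status(build_name: str, sections: dict[str, list[str]]) -> str:
    """Render one status layout for every template (e2e, systemOS, reboot, ...)."""
    lines = [build_name]
    for title in SECTION_ORDER:
        lines.append("")
        lines.append(title.upper() + ":")  # heading style is an assumption
        lines.extend(sections.get(title, []))  # one entry per line
    return "\n".join(lines)
```

Because the order lives in one constant, a new template cannot accidentally introduce a different layout.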

Run Learning: 2026-03-13 (Put longer failure description on failed machine line)

  • Observed requirement:
    • Failed machines are easier to scan when the failure description appears directly on the same line as the machine status.
    • A longer same-line description works better than a very short label when the extra detail helps explain what actually failed.
  • Action for future runs:
    • Format failed machine lines as machine-name - FAIL - <failure description>.
    • Prefer the longer same-line description when it adds useful operator-facing context.
    • Keep Notes for broader context, anomalies, and extra follow-up detail beyond the machine-specific failure description.
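
The line format above can be sketched as a single formatter. `machine_line` is a hypothetical helper, and any status label other than FAIL is an assumption.

```python
from typing import Optional


def machine_line(machine: str, status: str, failure: Optional[str] = None) -> str:
    """Machine first, then status; failed machines carry their description inline."""
    line = f"{machine} - {status}"
    if status == "FAIL" and failure:
        line += f" - {failure}"
    return line
```

For example, `machine_line("atvm144-suse15.0", "FAIL", "crashed while creating the migration session")` yields `atvm144-suse15.0 - FAIL - crashed while creating the migration session`.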

Run Learning: 2026-03-14 (Missing requested ATVM config must fail fast)

  • Observed requirement:
    • If the operator asks for a specific ATVM config file and that file is missing on the automation VM, searching for other config files or substituting a different one is the wrong next step.
    • The operator wants to decide what to do after a missing-config failure.
  • Action for future runs:
    • If the requested config file is missing, stop immediately and report the missing filename.
    • Do not search the automation VM for alternate config files.
    • Do not switch to another config unless the operator explicitly instructs it.
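
A minimal fail-fast sketch of the rule above, combined with the gold default from the 2026-03-12 entry. `resolve_config` and `MissingConfigError` are hypothetical names, and the config directory is passed in because its real location is not recorded here.

```python
from pathlib import Path
from typing import Optional

DEFAULT_CONFIG = "cypress.atvm-config-gold.ts"


class MissingConfigError(FileNotFoundError):
    """Raised when the requested ATVM config file is not on the automation VM."""


def resolve_config(config_dir: Path, requested: Optional[str] = None) -> Path:
    """Return the requested (or default gold) config, or stop immediately."""
    name = requested or DEFAULT_CONFIG
    path = config_dir / name
    if not path.is_file():
        # No search for alternates and no substitution: report and stop.
        raise MissingConfigError(f"Requested ATVM config file is missing: {name}")
    return path
```

The function deliberately has no fallback branch, so the decision after a missing-config failure stays with the operator.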

Run Learning: 2026-03-16 (Status requests default to live view with whole-run historical fallback)

  • Observed requirement:
    • When the operator asks for ATVM automation run status, they want live status by default.
    • If no automation is currently running, the status response must fall back to the most recent historical run.
    • For categorized runs, the response must still cover the entire run rather than only the latest category batch or cloud sub-run.
  • Action for future runs:
    • Treat every ATVM status request as a request for live run status unless the operator explicitly asks for something else.
    • If no automation is active, reconstruct status from the most recent historical run artifacts and logs.
    • For categorized runs, always aggregate all same-run category batches so the response covers the full run scope.

Run Learning: 2026-03-17 (Default ignore-force-shutdown and iscsi plugin)

  • Observed requirement:
    • The operator wants --ignore_force_shutdown included on every ATVM automation run by default.
    • The operator wants plugin selection to default to --use_specified_plugin iscsi unless a different plugin is explicitly requested.
  • Action for future runs:
    • Add --ignore_force_shutdown to every cmc-templates.py command unless the operator explicitly asks not to use it.
    • Default plugin-bearing ATVM automation commands to --use_specified_plugin iscsi.
    • Only switch away from iscsi when the operator explicitly requests fc, both, or another applicable override.
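
The defaults above can be sketched as a small option builder. Only the option list is built; how cmc-templates.py receives the template name is not recorded in this file, so that part is left out. `with_default_options` and its parameters are hypothetical.

```python
from typing import Optional, Sequence

DEFAULT_PLUGIN = "iscsi"


def with_default_options(requested: Sequence[str],
                         plugin: Optional[str] = None,
                         ignore_force_shutdown: bool = True,
                         takes_plugin: bool = True) -> list[str]:
    """Add the standing defaults to the options the operator asked for."""
    options = list(requested)
    if ignore_force_shutdown and "--ignore_force_shutdown" not in options:
        options.append("--ignore_force_shutdown")
    if takes_plugin:
        # Use the explicit plugin only when the operator named one.
        options += ["--use_specified_plugin", plugin or DEFAULT_PLUGIN]
    return options
```

Passing `plugin="fc"` or `ignore_force_shutdown=False` represents an explicit operator override; nothing else changes the defaults.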

Run Learning: 2026-03-18 (ATVM status requests must resolve from the local ATVM workflow, not Cirrus project operations)

  • Observed failure mode:
    • Interpreting "status of the ATVM automation run" as a request about Cirrus project operations can return the wrong source entirely.
    • The operator uses "ATVM automation" to mean the automation contained in the local atvm folder and the corresponding automation VM workflow.
  • Action for future runs:
    • Resolve ATVM status requests from the local ATVM workflow first.
    • Check the automation VM at 192.168.3.190 for live runner processes and live files before looking at historical artifacts.
    • If no automation is active, reconstruct the most recent historical run from the automation VM shell history and reporter artifacts.
    • Do not use Cirrus project operations such as atvm - cypress as the source for ATVM automation status unless the operator explicitly asks for project-operation status.

Run Learning: 2026-03-20 (Display exact ATVM commands and wait for approval before any execution)

  • Observed failure mode:
    • ATVM run commands were executed before the operator had a chance to review and approve them.
    • This happened even though the operator expects a review gate before any ATVM automation command is launched.
  • Action for future runs:
    • Always display the exact planned ATVM commands before execution.
    • Do not run cmc-templates.py until the operator explicitly approves the displayed commands.
    • Do not run run-sorry-cypress.py until the operator explicitly approves the displayed commands.
    • Treat template generation as execution that also requires operator approval.
    • If any requested option changes after commands are displayed, rebuild and redisplay the commands and wait for fresh approval.
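
A minimal sketch of the review gate above. The `ask`, `show`, and `execute` callables are assumptions that keep the sketch testable; `approved` and `run_with_gate` are hypothetical names.

```python
from typing import Callable, Sequence


def approved(commands: Sequence[str],
             ask: Callable[[str], str] = input,
             show: Callable[[str], None] = print) -> bool:
    """Display the exact commands and require an explicit 'yes'."""
    show("Planned ATVM commands:")
    for command in commands:
        show(f"  {command}")
    answer = ask("Approve these exact commands? [yes/no] ")
    return answer.strip().lower() == "yes"


def run_with_gate(commands: Sequence[str],
                  execute: Callable[[str], None],
                  ask: Callable[[str], str] = input,
                  show: Callable[[str], None] = print) -> bool:
    """Run nothing, template generation included, until the list is approved."""
    if not approved(commands, ask, show):
        return False
    for command in commands:
        execute(command)
    return True
```

Approval applies to the list that was displayed. If any option changes, the caller rebuilds the list and calls `run_with_gate` again, which redisplays it and asks for fresh approval.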

Run Learning: 2026-03-26 (Verify generated specs directly on the controller before launching the runner)

  • Observed failure mode:
    • cmc-templates.py can generate the requested .ts files successfully, yet a subsequent run can still start with an incomplete or stale specPattern if the runner is launched too early or the verification step is too fragile.
    • Shell-escaped regex one-liners used over SSH can fail even when the controller config is actually correct, which makes the verification gate unreliable.
  • Action for future runs:
    • After cmc-templates.py, verify both the generated .ts files and the controller config specPattern before launching run-sorry-cypress.py.
    • Prefer direct controller-side inspection of the config block and file presence rather than fragile shell-escaped regex checks.
    • If the requested VM list is not visibly present in both places, stop and report the mismatch instead of starting the runner.
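
The verification gate above can be sketched with plain file and text checks run on the controller, avoiding shell-escaped regex entirely. This assumes generated spec filenames contain the machine name and that the config is a readable text file; it checks the whole config text rather than parsing the specPattern block. `verify_generated_specs` is a hypothetical helper.

```python
from pathlib import Path


def verify_generated_specs(requested_vms: list[str], spec_dir: Path,
                           config_path: Path) -> list[str]:
    """Return mismatch messages; an empty list means the runner may start."""
    config_text = config_path.read_text()
    spec_names = [p.name for p in spec_dir.glob("*.ts")]
    problems = []
    for vm in requested_vms:
        if not any(vm in name for name in spec_names):
            problems.append(f"{vm}: no generated .ts file")
        if vm not in config_text:  # plain substring test, no regex
            problems.append(f"{vm}: not listed in the controller config")
    return problems
```

A non-empty result is the signal to stop and report the mismatch instead of starting the runner.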

Run Learning: 2026-03-26 (Do not repeat harmless reset-failed watcher noise)

  • Observed requirement:
    • systemctl reset-failed atvm-run-watcher@... often reports that the unit was not loaded.
    • In normal watcher startup this has been harmless and does not change the run outcome.
    • Repeating that note in routine run confirmations adds noise without helping the operator.
  • Action for future runs:
    • Do not mention expected, harmless reset-failed output in routine run updates.
    • Only mention it if it actually prevents watcher startup or becomes relevant to debugging.

Run Learning: 2026-03-27 (Replace FUNCTIONALLY with TEST FLOW in status output)

  • Observed requirement:
    • The operator wants the status format to show the full numbered ATVM test flow for the active template rather than a vague high-level FUNCTIONALLY: summary.
    • Each ATVM template can have its own test-flow step list.
    • The step list should appear once for the whole run, not repeated per host.
  • Action for future runs:
    • Replace the FUNCTIONALLY: section with TEST FLOW: in ATVM status output.
    • Resolve TEST FLOW: from the ATVM template name instead of hardcoding one shared list for every template.
    • For cmc-e2e, use this numbered run flow:
      • 1. Verifying set up
      • 2. Power on and obtain ip address and host name
      • 3. Uninstall CMC if still exists
      • 4. Setting up disk
      • 5. Copy CMC install command from GUI
      • 6. Install CMC
      • 7. Create migration session
      • 8. Tracking Changes
      • 9. Trigger cmotion and do I/O test before actual cutover
      • 10. Verify migration remains healthy during I/O activity
      • 11. Prepare for cutover
      • 12. Stop application / stop test I/O
      • 13. Run final sync
      • 14. Confirm destination is fully up to date
      • 15. Perform cutover
      • 16. Validate destination host / disk state
      • 17. Run post-cutover checks
      • 18. Power off
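
Resolving TEST FLOW: from the template name could look like the sketch below. The cmc-e2e steps are copied from the list above; `TEST_FLOWS` and `render_test_flow` are hypothetical names, and no other template's steps are recorded here.

```python
# Step lists keyed by ATVM template name; only cmc-e2e is recorded so far.
TEST_FLOWS = {
    "cmc-e2e": [
        "Verifying set up",
        "Power on and obtain ip address and host name",
        "Uninstall CMC if still exists",
        "Setting up disk",
        "Copy CMC install command from GUI",
        "Install CMC",
        "Create migration session",
        "Tracking Changes",
        "Trigger cmotion and do I/O test before actual cutover",
        "Verify migration remains healthy during I/O activity",
        "Prepare for cutover",
        "Stop application / stop test I/O",
        "Run final sync",
        "Confirm destination is fully up to date",
        "Perform cutover",
        "Validate destination host / disk state",
        "Run post-cutover checks",
        "Power off",
    ],
}


def render_test_flow(template: str) -> str:
    """Render the numbered flow once per run, for the active template only."""
    steps = TEST_FLOWS.get(template)
    if steps is None:
        # Do not borrow another template's list.
        raise KeyError(f"No TEST FLOW recorded for template {template!r}")
    numbered = [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    return "\n".join(["TEST FLOW:"] + numbered)
```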

Run Learning: 2026-03-27 (Start watcher before runner when watcher is requested)

  • Observed failure mode:
    • Starting run-sorry-cypress.py before the watcher can race with the watcher helper's stale-log cleanup.
    • The watcher helper clears stale /tmp/<build-name>.log before startup.
    • If the runner has already opened the new log, the helper can delete that live log path, leaving the watcher unable to read the run by filename.
  • Action for future runs:
    • When the watcher is approved, start the watcher before run-sorry-cypress.py.
    • Keep the order as: template generation, verification, watcher start, runner start.
    • Do not launch the runner first when the watcher is part of the approved command set.
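
The launch order above can be sketched as a fixed sequence that stops at the first failing step. The step callables are placeholders for the real actions, and `run_in_order` is a hypothetical helper.

```python
from typing import Callable

# Order as stated in this entry.
LAUNCH_ORDER = ("template generation", "verification", "watcher start", "runner start")


def run_in_order(actions: dict[str, Callable[[], bool]],
                 use_watcher: bool = True) -> list[str]:
    """Run each step in the fixed order; return the steps that succeeded."""
    done = []
    for step in LAUNCH_ORDER:
        if step == "watcher start" and not use_watcher:
            continue  # watcher was not part of the approved command set
        if not actions[step]():
            break  # never reach the runner after a failed earlier step
        done.append(step)
    return done
```

Because the runner is always last, it cannot open the new log before the watcher helper has finished clearing the stale one.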

Run Learning: 2026-03-27 (Watcher must recover when the consolidated run log is missing)

  • Observed failure mode:
    • A non-categorized watcher run can finish without posting to Mattermost even when the ATVM test itself passed.
    • In this case the watcher service expected /tmp/<build-name>.log, but that consolidated run log was never written.
    • The run still produced the final check-xml-files.ts XML and fresh per-host reporter artifacts under cmcReporter/logs/<host>/.
  • Action for future runs:
    • Do not rely only on /tmp/<build-name>.log for non-categorized watcher result recovery.
    • When final check-xml-files.ts validation is present but host XML is absent, recover host completion from the latest matching per-host reporter artifact within the run window.
    • Keep non-categorized watcher notes accurate; do not describe that failure as a categorized sub-run issue.
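
A minimal sketch of the fallback lookup above. It assumes per-host artifacts are regular files directly under cmcReporter/logs/<host>/ and that file modification time is an acceptable way to test the run window; `latest_host_artifact` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def latest_host_artifact(logs_root: Path, host: str,
                         run_start: float, run_end: float) -> Optional[Path]:
    """Newest per-host reporter artifact written inside the run window."""
    host_dir = logs_root / host
    if not host_dir.is_dir():
        return None
    in_window = [
        p for p in host_dir.iterdir()
        if p.is_file() and run_start <= p.stat().st_mtime <= run_end
    ]
    # default=None covers a host directory with no artifact from this run.
    return max(in_window, key=lambda p: p.stat().st_mtime, default=None)
```

A `None` result means host completion could not be recovered from reporter artifacts either, which is worth reporting as such rather than guessing.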