anthony.wen 3431c40af7 Document ATVM spec verification lesson in run learnings

- add a 2026-03-26 run learning that explains how cmc-templates.py can generate the requested spec files while a fragile verification step still misses them
- document that shell-escaped regex one-liners over SSH are not a reliable way to validate the controller specPattern
- record the preferred future workflow: verify generated .ts files and the config specPattern directly on the controller before launching run-sorry-cypress.py

2026-03-26 19:48:16 -04:00

Run ATVM Automation Runs

This file stores run-specific examples only when a run produced a new learning relevant to future automation tasks.

Entry Rule

Add an entry only when a run changed workflow behavior, exposed a failure mode, or confirmed a required new check.
Do not add routine runs with no new learning.

Current State

No run-learning entries recorded yet from guide.md source material.

Run Learning: 2026-03-08 (E2E redhat9.7, pure/fc)

Request:
- template: cmc-e2e
- filter: --containsVm redhat9.7
- integration: --integration_type pure
- plugin: --use_specified_plugin fc
Observed result:
- Cypress spec execution passed (1 test, 1 passing, 0 failing).
- Cloud run URL was produced and marked uploaded.
- run-sorry-cypress.py remained running afterward with a defunct npm exec cypress-cloud child process and did not exit cleanly on its own.
Action for future runs:
- If pass/upload is confirmed but run-sorry-cypress.py does not exit, treat it as a runner hang condition.
- Capture run URL and pass/fail status first, then terminate the stuck runner process cleanly.
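
The order of operations above (record the result first, then stop the runner) can be sketched as follows. This is a minimal sketch that assumes the runner was started from Python as a `subprocess.Popen` child, which the notes do not confirm. The helper name `finish_hung_run` and the shape of the returned record are hypothetical.

```python
import subprocess
import sys


def finish_hung_run(proc, run_url, passed, grace_seconds=10.0):
    """Record the run result, then stop a runner that did not exit on its own.

    Assumption: `proc` is the Popen handle for the runner process.
    """
    # Capture the outcome before touching the process, so it survives
    # even if termination goes badly.
    record = {"run_url": run_url, "passed": passed}

    if proc.poll() is None:  # runner is still alive
        proc.terminate()     # polite request first
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()      # escalate only if the polite request was ignored
            proc.wait()

    record["runner_exit_code"] = proc.returncode
    return record
```

Waiting on the process after terminating it also reaps it, so the runner itself does not linger as a defunct entry.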

Run Learning: 2026-03-09 (Blacklist handling and status format)

Observed requirement:
- Some ATVM machines must be skipped even when a broad selector such as --containsVm or --randomize would otherwise include them.
Machines to blacklist via --exclude_partial_match:
- BLACKLISTED: CMC INSTALL - CAN'T COMPILE:
  - atvm6-centos6.0
  - atvm41-redhat6.0
  - atvm73-oracle6.0
- BLACKLISTED: SUPPORT REQUEST - WAITING:
  - atvm113-debian9.0.0
  - atvm115-debian9.1.0
  - atvm116-debian9.2.0
- BLACKLISTED: RE-CREATE MIGHT BE NEEDED:
  - atvm156-debian9.3.0
Action for future runs:
- Add these machine names to --exclude_partial_match when building broad-scope automation commands.
- When reporting run status, include skipped blacklisted machines separately with their reason, in addition to completed and remaining machines.
- Use the run build_name as the heading/title for status responses so the test type is obvious.
- For failed machines in status responses, include the failure reason taken from the run log.
- Include timing details in status responses: start time, end time when complete, and total or elapsed runtime.
- Also include timing stats in status responses: quickest completed test runtime, longest completed test runtime, and average completed test runtime.
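
A small sketch of how the blacklist and the timing stats above might be kept in code. The machine names and reasons are copied from this entry. The function names are hypothetical, and how `--exclude_partial_match` expects multiple names to be passed is not stated here, so the sketch only produces the flat list of names.

```python
from statistics import mean

# Reasons and machine names copied from the entry above.
BLACKLIST = {
    "CMC INSTALL - CAN'T COMPILE": [
        "atvm6-centos6.0",
        "atvm41-redhat6.0",
        "atvm73-oracle6.0",
    ],
    "SUPPORT REQUEST - WAITING": [
        "atvm113-debian9.0.0",
        "atvm115-debian9.1.0",
        "atvm116-debian9.2.0",
    ],
    "RE-CREATE MIGHT BE NEEDED": ["atvm156-debian9.3.0"],
}


def blacklisted_machines(blacklist=BLACKLIST):
    """Flat list of machine names to hand to --exclude_partial_match."""
    return [name for names in blacklist.values() for name in names]


def skipped_with_reasons(blacklist=BLACKLIST):
    """(machine, reason) pairs for the separate skipped section of a status."""
    return [(name, reason) for reason, names in blacklist.items() for name in names]


def timing_stats(runtimes_seconds):
    """Quickest, longest, and average runtime over completed tests only."""
    if not runtimes_seconds:
        return None  # nothing has completed yet
    return {
        "quickest": min(runtimes_seconds),
        "longest": max(runtimes_seconds),
        "average": mean(runtimes_seconds),
    }
```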

Run Learning: 2026-03-11 (Machine-first status lines and whole-run ETA)

Observed requirement:
- Status output must list each machine first and then its status, rather than leading with the status label.
- Estimated completion time must refer to the entire remaining automation run, not only the currently running machine.
Action for future runs:
- Format machine entries as machine-name - STATUS.
- Keep failure reasons after the machine/status entry when a machine failed.
- When giving ETA, explicitly state it is the estimate for completion of the full remaining run.

Run Learning: 2026-03-11 (Categorized run status must be reconstructed across batches)

Observed failure mode:
- run-sorry-cypress.py --categorize rewrites the active config for each category batch. Live state such as specPattern, current_vm, and the newest /tmp Cypress JSON therefore describes only the current category, not the full automation run.
- Answering from only the current live batch underreports the run and misses already-finished machines from earlier category batches.
Action for future runs:
- Reconstruct whole-run status from the generated machine scope plus all machine result artifacts written since the run start time.
- Use the current batch only to identify the live RUNNING machine and immediate next machine(s), not as the full run scope.
- Do not answer status requests for categorized runs until earlier category results have been checked as part of the same run.
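
The reconstruction rule above can be illustrated with a short sketch. It assumes the generated machine scope is available as an ordered list and that each earlier category batch has already been reduced to a machine-to-status mapping. Both data shapes and the function name are hypothetical.

```python
def whole_run_status(scope, results_by_batch, running=None):
    """Merge per-batch results into one view of the full run.

    scope: every machine generated for the run, in order.
    results_by_batch: one {machine: status} dict per category batch,
        earliest batch first.
    running: the machine the current live batch reports as RUNNING, if any.
    """
    merged = {}
    for batch in results_by_batch:
        merged.update(batch)  # later batches override earlier ones

    completed = [(m, merged[m]) for m in scope if m in merged]
    remaining = [m for m in scope if m not in merged and m != running]
    return {"completed": completed, "running": running, "remaining": remaining}
```

The live batch contributes only the `running` value. The completed and remaining lists come from the full scope, so machines finished in earlier categories are not lost.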

Run Learning: 2026-03-11 (Hash-named XML files still belong to machine runs)

Observed failure mode:
- Same-run JUnit output is not consistently named test-result-atvm...xml.
- Many machine results for the same automation run were written as hash-named files such as test-result-01fe412894862398d06d9cc4bc7e81a0.xml.
- Limiting status reconstruction to machine-named XML files causes major undercounting of completed machines.
Action for future runs:
- Parse all test-result-*.xml files written since the run start time, not only test-result-atvm*.xml.
- Extract the machine name from XML contents such as testsuite file=, testsuite name=, or testcase name= when the filename does not include the machine name.
- Treat check-xml-files.ts XML outputs as bookkeeping steps, not machine results.
- Prefer the most recently written same-run XML per machine when multiple XML files exist for that machine.
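
One possible shape for this parsing step, using only the standard library. The machine-name pattern is an assumption inferred from names such as atvm6-centos6.0 and atvm113-debian9.0.0, and the sketch uses file modification time as a stand-in for "written since the run start time". The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed naming pattern, inferred from the machine names in these notes.
MACHINE_RE = re.compile(r"atvm\d+-[a-z]+\d+(?:\.\d+)*")
BOOKKEEPING_SPEC = "check-xml-files.ts"


def machine_from_xml(path):
    """Return the machine name for one result file, or None if there is none."""
    match = MACHINE_RE.search(path.name)
    if match:
        return match.group(0)  # machine-named file, no need to open it
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        return None
    fields = []
    for elem in root.iter():
        if elem.tag == "testsuite":
            fields += [elem.get("file"), elem.get("name")]
        elif elem.tag == "testcase":
            fields.append(elem.get("name"))
    fields = [f for f in fields if f]
    if any(BOOKKEEPING_SPEC in f for f in fields):
        return None  # bookkeeping step, not a machine result
    for field in fields:
        match = MACHINE_RE.search(field)
        if match:
            return match.group(0)
    return None


def latest_result_per_machine(results_dir, run_start):
    """Map each machine to its newest same-run XML file."""
    latest = {}
    for path in Path(results_dir).glob("test-result-*.xml"):
        mtime = path.stat().st_mtime
        if mtime < run_start:
            continue  # written before this run started
        machine = machine_from_xml(path)
        if machine is None:
            continue
        if machine not in latest or mtime > latest[machine][0]:
            latest[machine] = (mtime, path)
    return {machine: path for machine, (_, path) in latest.items()}
```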

Run Learning: 2026-03-12 (Status output must be one machine per line with notes separated)

Observed requirement:
- Listing multiple completed machines on one line makes run status harder to scan and does not meet the expected reporting format.
- Failure reasons and extra context should be separated from the machine status list so the list stays clean.
Action for future runs:
- Under completed, skipped, and remaining sections, put exactly one machine status on each line.
- Add a Notes section after completed machines for failure reasons, anomalies, and other operator-relevant context.
- Keep completed machine lines in the form machine-name - STATUS and avoid appending long explanations inline.

Run Learning: 2026-03-12 (Add suse15.0 machine to blacklist)

Observed requirement:
- atvm144-suse15.0 must be excluded from automation runs because it crashes while creating the migration session.
Action for future runs:
- Add atvm144-suse15.0 to the maintained blacklist.
- Record the reason as CRASHES WHEN CREATING MIGRATION SESSION - BUG.
- Include it in reusable --exclude_partial_match command examples.

Run Learning: 2026-03-12 (Default to gold-named ATVM config files)

Observed requirement:
- The automation VM does not reliably have cypress.atvm-config.ts, and defaulting to that filename can break runs before they start.
- Operator preference is to use ATVM config files with gold in the filename unless explicitly told otherwise.
Action for future runs:
- Do not reference cypress.atvm-config.ts by default in commands or examples.
- Default to cypress.atvm-config-gold.ts unless the operator explicitly requests another config.

Run Learning: 2026-03-12 (Examples are reference-only, not default intent)

Observed requirement:
- Reusable examples may contain extra excludes or options that the operator did not ask for.
- Carrying those example details into a new run without confirmation can change the requested scope.
Action for future runs:
- Treat examples.md as reference-only.
- Use only the options the operator explicitly requested, plus maintained mandatory blacklist handling.
- Do not assume extra example exclusions such as distro filters are desired unless the operator asks for them.

Run Learning: 2026-03-12 (Use one status format for all automation run types)

Observed requirement:
- The operator wants the same ATVM run status display every time, regardless of whether the run is e2e, systemOS, reboot, or another template.
- Changing the display style between run types makes the status harder to scan and compare.
Action for future runs:
- Use one consistent ATVM status layout for all automation status responses.
- Keep the order the same: build name, completed machines, notes, skipped machines, remaining machines, summary, timing, estimated completion time.
- Keep machine entries one per line as machine-name - STATUS regardless of test type.
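
A sketch of a single renderer that keeps the section order fixed for every run type. The section titles, the SKIPPED label, and the function name are hypothetical. Only the ordering and the machine-name - STATUS line form come from these notes.

```python
def render_status(build_name, completed, notes, skipped, remaining,
                  summary, timing, eta):
    """Render one status response with a fixed section order.

    completed, remaining: lists of (machine, status) pairs.
    skipped: list of (machine, reason) pairs.
    notes: list of strings; summary, timing, eta: single strings.
    """
    def section(title, body):
        # Every section is always present so the layout never shifts.
        return [title] + (list(body) or ["(none)"]) + [""]

    lines = [build_name, ""]
    lines += section("Completed machines:", [f"{m} - {s}" for m, s in completed])
    lines += section("Notes:", [f"- {n}" for n in notes])
    lines += section("Skipped machines:", [f"{m} - SKIPPED - {r}" for m, r in skipped])
    lines += section("Remaining machines:", [f"{m} - {s}" for m, s in remaining])
    lines += section("Summary:", [summary])
    lines += section("Timing:", [timing])
    lines += section("Estimated completion time (full remaining run):", [eta])
    return "\n".join(lines).rstrip()
```

Because the renderer takes no run-type argument, an e2e run and a reboot run cannot drift into different layouts.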

Run Learning: 2026-03-13 (Put longer failure description on failed machine line)

Observed requirement:
- Failed machines are easier to scan when the failure description appears directly on the same line as the machine status.
- A longer same-line description works better than a very short label when the extra detail helps explain what actually failed.
Action for future runs:
- Format failed machine lines as machine-name - FAIL - <failure description>.
- Prefer the longer same-line description when it adds useful operator-facing context.
- Keep Notes for broader context, anomalies, and extra follow-up detail beyond the machine-specific failure description.
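
The failed-machine line format can be captured in a small helper. This sketch assumes the failure description has already been pulled from the run log. The helper name and the choice to reject a FAIL without a description are hypothetical.

```python
def machine_line(machine, status, failure=None):
    """Format one status line; FAIL lines carry their description inline."""
    if status == "FAIL":
        if not failure:
            raise ValueError(f"{machine}: FAIL needs a failure description")
        # Collapse newlines and runs of spaces so the line stays a single line.
        detail = " ".join(failure.split())
        return f"{machine} - FAIL - {detail}"
    return f"{machine} - {status}"
```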

Run Learning: 2026-03-14 (Missing requested ATVM config must fail fast)

Observed requirement:
- If the operator asks for a specific ATVM config file and that file is missing on the automation VM, searching for other config files or substituting a different one is the wrong next step.
- The operator wants to decide what to do after a missing-config failure.
Action for future runs:
- If the requested config file is missing, stop immediately and report the missing filename.
- Do not search the automation VM for alternate config files.
- Do not switch to another config unless the operator explicitly instructs it.
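
A minimal sketch of the fail-fast rule, combined with the gold default recorded on 2026-03-12. It assumes the check runs on the automation VM against a known config directory. The function and exception names are hypothetical.

```python
from pathlib import Path

DEFAULT_CONFIG = "cypress.atvm-config-gold.ts"


class MissingConfigError(FileNotFoundError):
    """Raised when the requested ATVM config file is not present."""


def resolve_config(config_dir, requested=None):
    """Return the config path, or stop with the missing filename.

    Deliberately does not look for alternatives: the operator decides
    what happens after a missing-config failure.
    """
    name = requested or DEFAULT_CONFIG
    path = Path(config_dir) / name
    if not path.is_file():
        raise MissingConfigError(f"Requested ATVM config file is missing: {name}")
    return path
```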

Run Learning: 2026-03-16 (Status requests default to live view with whole-run historical fallback)

Observed requirement:
- When the operator asks for ATVM automation run status, they want live status by default.
- If no automation is currently running, the status response must fall back to the most recent historical run.
- For categorized runs, the response must still cover the entire run rather than only the latest category batch or cloud sub-run.
Action for future runs:
- Treat every ATVM status request as a request for live run status unless the operator explicitly asks for something else.
- If no automation is active, reconstruct status from the most recent historical run artifacts and logs.
- For categorized runs, always aggregate all same-run category batches so the response covers the full run scope.

Run Learning: 2026-03-17 (Default ignore-force-shutdown and iscsi plugin)

Observed requirement:
- The operator wants --ignore_force_shutdown included on every ATVM automation run by default.
- The operator wants plugin selection to default to --use_specified_plugin iscsi unless a different plugin is explicitly requested.
Action for future runs:
- Add --ignore_force_shutdown to every cmc-templates.py command unless the operator explicitly asks not to use it.
- Default plugin-bearing ATVM automation commands to --use_specified_plugin iscsi.
- Only switch away from iscsi when the operator explicitly requests fc, both, or another applicable override.
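
These defaults could be applied when assembling the option list for cmc-templates.py. The sketch below builds only the options named in these notes and leaves the rest of the command line to the caller. The function name and parameter names are hypothetical.

```python
def template_options(plugin=None, ignore_force_shutdown=True, extra=()):
    """Build the option list, applying the operator's defaults.

    plugin: None means "not specified", which falls back to iscsi.
    ignore_force_shutdown: pass False only on explicit operator request.
    extra: options the operator explicitly asked for, already tokenized.
    """
    options = list(extra)
    if ignore_force_shutdown:
        options.append("--ignore_force_shutdown")
    options += ["--use_specified_plugin", plugin or "iscsi"]
    return options
```

For example, `template_options(plugin="fc", extra=["--containsVm", "redhat9.7"])` keeps the requested filter and switches the plugin, while `template_options()` yields only the two defaults.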

Run Learning: 2026-03-18 (ATVM status requests must resolve from the local ATVM workflow, not Cirrus project operations)

Observed failure mode:
- Interpreting "status of the ATVM automation run" as a request about Cirrus project operations can return the wrong source entirely.
- The operator uses "ATVM automation" to mean the automation contained in the local atvm folder and the corresponding automation VM workflow.
Action for future runs:
- Resolve ATVM status requests from the local ATVM workflow first.
- Check the automation VM at 192.168.3.190 for live runner processes and live files before looking at historical artifacts.
- If no automation is active, reconstruct the most recent historical run from the automation VM shell history and reporter artifacts.
- Do not use Cirrus project operations such as atvm - cypress as the source for ATVM automation status unless the operator explicitly asks for project-operation status.

Run Learning: 2026-03-20 (Display exact ATVM commands and wait for approval before any execution)

Observed failure mode:
- ATVM run commands were executed before the operator had a chance to review and approve them.
- This happened even though the operator expects a review gate before any ATVM automation command is launched.
Action for future runs:
- Always display the exact planned ATVM commands before execution.
- Do not run cmc-templates.py until the operator explicitly approves the displayed commands.
- Do not run run-sorry-cypress.py until the operator explicitly approves the displayed commands.
- Treat template generation as execution that also requires operator approval.
- If any requested option changes after commands are displayed, rebuild and redisplay the commands and wait for fresh approval.
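
One way to make the review gate hard to bypass is to tie approval to the exact command text, so that any change invalidates it. This is a sketch under that assumption. The class and method names are hypothetical.

```python
import hashlib


def fingerprint(commands):
    """Stable digest of the exact command strings, in order."""
    joined = "\n".join(commands)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


class ApprovalGate:
    """Allows execution only of commands the operator has approved verbatim."""

    def __init__(self):
        self._approved = None

    def display(self, commands):
        # Text to show the operator; displaying does not grant approval.
        return "\n".join(commands)

    def approve(self, commands):
        # Call only after the operator explicitly approves what was displayed.
        self._approved = fingerprint(commands)

    def check(self, commands):
        # Call immediately before running anything, template generation included.
        if self._approved != fingerprint(commands):
            raise PermissionError("Commands changed or were never approved; redisplay them.")
```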

Run Learning: 2026-03-26 (Verify generated specs directly on the controller before launching the runner)

Observed failure mode:
- cmc-templates.py can successfully generate the requested .ts files, yet a subsequent run can still start with an incomplete or stale specPattern. This happens when the runner is launched too early or the verification step is too fragile to catch the mismatch.
- Shell-escaped regex one-liners used over SSH can fail even when the controller config is actually correct, which makes the verification gate unreliable.
Action for future runs:
- After cmc-templates.py, verify both the generated .ts files and the controller config specPattern before launching run-sorry-cypress.py.
- Prefer direct controller-side inspection of the config block and file presence rather than fragile shell-escaped regex checks.
- If the requested VM list is not visibly present in both places, stop and report the mismatch instead of starting the runner.
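
The final check above can be expressed with plain substring tests instead of regexes. This sketch assumes the generated filenames and the specPattern entries have already been read on the controller and that each contains the VM name. Neither assumption is confirmed by these notes, and the function name is hypothetical.

```python
def missing_vms(requested_vms, generated_files, spec_pattern):
    """Return {vm: [where it is absent]} for VMs not visible in both places.

    generated_files: filenames produced by template generation.
    spec_pattern: the specPattern entries from the controller config.
    An empty result means the runner may be started.
    """
    problems = {}
    for vm in requested_vms:
        absent = []
        if not any(vm in name and name.endswith(".ts") for name in generated_files):
            absent.append("generated .ts files")
        if not any(vm in entry for entry in spec_pattern):
            absent.append("specPattern")
        if absent:
            problems[vm] = absent
    return problems
```

Substring matching avoids shell escaping entirely, at the cost of treating a name that is a prefix of another (for example a 6.0 machine next to a 6.0.1 machine) as present, so closely related names still deserve a visual check.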
