cds-ai/atvm/docs/automation/run-learnings.md
anthony.wen 2e0acb69c1 fix watcher failure detection for host reporter json
Handle dict-shaped reporter events when deriving watcher host failures.

- parse reporter JSON events with type/message/severity fields
- preserve existing support for list-shaped event records
- record the false-PASS failure mode in ATVM automation run learnings
2026-04-29 12:37:48 -04:00

Run Learnings: ATVM Automation Runs

This file stores run-specific examples only when a run produced a new learning relevant to future automation tasks.

Entry Rule

  • Add an entry only when a run changed workflow behavior, exposed a failure mode, or confirmed a required new check.
  • Do not add routine runs with no new learning.

Current State

  • No run-learning entries recorded yet from guide.md source material.

Run Learning: 2026-04-29 (Combined watcher wrapper must execute template generation before runner startup)

  • Observed failure mode:
    • A watcher-backed start-atvm-run.sh launch for cmc-migrateops-compute-migration started run-sorry-cypress.py without ever running the approved cmc-templates.py command.
    • The wrapper passed --template-command into watcher metadata only, so the runner consumed stale controller config state and started against a previous specPattern pointing at atvm121-ubuntu24.04.
  • Action for future runs:
    • The combined watcher wrapper must execute --template-command synchronously before watcher and runner startup.
    • Write the template phase output to /tmp/<build>.launch.log so template activity is preserved separately from the live runner log.
    • If the template step fails, stop immediately and do not start the watcher or the runner.
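The ordering above can be sketched in Python. This is a minimal illustration, not the actual start-atvm-run.sh logic: the function names `run_template_phase` and `launch`, and the `start_watcher` / `start_runner` callables, are hypothetical stand-ins for whatever the wrapper really invokes.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence


def run_template_phase(template_command: Sequence[str], launch_log: Path) -> int:
    """Run the template command to completion and keep its output in launch_log."""
    with launch_log.open("w") as log:
        # subprocess.run blocks until the command exits, so this step is synchronous.
        result = subprocess.run(template_command, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


def launch(
    template_command: Sequence[str],
    launch_log: Path,
    start_watcher: Callable[[], None],  # hypothetical hook
    start_runner: Callable[[], None],   # hypothetical hook
) -> bool:
    """Template generation first; watcher and runner only start if it succeeded."""
    if run_template_phase(template_command, launch_log) != 0:
        return False  # stop immediately: neither watcher nor runner is started
    start_watcher()
    start_runner()
    return True
```

Passing the watcher and runner as callables keeps the ordering rule testable without touching the real services.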

Run Learning: 2026-04-29 (Watcher host-artifact parser must handle dict-shaped reporter events)

  • Observed failure mode:
    • A non-categorized ATVM compute-migration run failed in the host reporter artifacts, but the watcher posted PASS.
    • The watcher fell back to the per-host JSON artifact after check-xml-files.ts, but extract_failure_from_reporter_events() only recognized the older list-shaped event format.
    • Current reporter JSON stores events as dicts with fields such as type, message, and severity, so the parser missed severity: error and incorrectly returned 0 failures.
  • Action for future runs:
    • Treat both list-shaped and dict-shaped reporter event records as valid inputs when extracting failure details from host JSON artifacts.
    • Continue treating host reporter artifacts as authoritative fallback evidence when final XML only contains check-xml-files.ts.
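A hedged sketch of accepting both record shapes follows. The dict fields (type, message, severity) come from the entry above; the positional layout of the older list-shaped record (`[type, message, severity]`) is an assumption made for illustration, and `collect_error_messages` is a hypothetical name, not the real extract_failure_from_reporter_events().

```python
from typing import Any, Iterable, List, Optional, Tuple


def event_fields(event: Any) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Return (type, message, severity) for either record shape."""
    if isinstance(event, dict):
        return event.get("type"), event.get("message"), event.get("severity")
    if isinstance(event, (list, tuple)):
        # ASSUMPTION: list-shaped records are positional [type, message, severity].
        padded = list(event) + [None, None, None]
        return padded[0], padded[1], padded[2]
    return None, None, None


def collect_error_messages(events: Iterable[Any]) -> List[str]:
    """Collect the message of every event whose severity is 'error'."""
    failures = []
    for event in events:
        _type, message, severity = event_fields(event)
        if str(severity).lower() == "error":
            failures.append(str(message))
    return failures
```

Normalizing each record to one tuple first means the failure check is written once, so a new record shape only needs a new branch in `event_fields`.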

Run Learning: 2026-04-24 (Categorized watcher false-PASS guardrail)

  • Observed failure mode:
    • A categorized compute-migration run was incorrectly reported as PASS for atvm121-ubuntu24.04 even though the actual Ubuntu grouped sub-run failed.
    • The false PASS came from cached watcher host_results plus a grouped XML that only contained check-xml-files.ts with failures="0".
    • The authoritative launch log and Cloud Run Finished summary for that same child run showed 1 failing.
  • Action for future runs:
    • Do not report a categorized grouped sub-run as PASS from watcher state, grouped XML, or check-xml-files.ts alone.
    • Before reporting a categorized grouped sub-run as PASS, confirm that the matching child batch also passed in the live launch log or the final Cloud Run Finished summary for that child run.
    • If watcher state or grouped XML disagrees with the launch log or child-run summary, treat the cached/grouped result as stale and report from the launch log plus per-host artifacts instead.
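The guardrail can be expressed as a small decision function. This is a sketch only: `grouped_subrun_status` is a hypothetical name, and it assumes the failing count has already been read from the launch log or the Cloud Run Finished summary (passed as `None` when neither was found).

```python
from typing import Optional, Tuple


def grouped_subrun_status(cached_status: str, child_failing: Optional[int]) -> Tuple[str, bool]:
    """Return (status_to_report, cached_state_is_stale).

    cached_status: what watcher state / grouped XML claims ("PASS" or "FAIL").
    child_failing: failing count from the authoritative child-run evidence.
    """
    if child_failing is None:
        # No authoritative confirmation yet: never report PASS from cache alone.
        return "UNCONFIRMED", False
    authoritative = "FAIL" if child_failing > 0 else "PASS"
    return authoritative, cached_status != authoritative
```

For the run described above, `grouped_subrun_status("PASS", 1)` yields `("FAIL", True)`: the cached PASS is flagged as stale and the child-run result wins.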

Run Learning: 2026-03-08 (E2E redhat9.7, pure/fc)

  • Request:
    • template: cmc-e2e
    • filter: --containsVm redhat9.7
    • integration: --integration_type pure
    • plugin: --use_specified_plugin fc
  • Observed result:
    • Cypress spec execution passed (1 test, 1 passing, 0 failing).
    • Cloud run URL was produced and marked uploaded.
    • run-sorry-cypress.py remained running afterward with a defunct npm exec cypress-cloud child process and did not exit cleanly on its own.
  • Action for future runs:
    • If pass/upload is confirmed but run-sorry-cypress.py does not exit, treat it as a runner hang condition.
    • Capture run URL and pass/fail status first, then terminate the stuck runner process cleanly.

Run Learning: 2026-03-09 (Blacklist handling and status format)

  • Observed requirement:
    • Some ATVM machines must be skipped even when a broad selector such as --containsVm or --randomize would otherwise include them.
  • Machines to blacklist via --exclude_partial_match:
    • BLACKLISTED: CMC INSTALL - CAN'T COMPILE:
      • atvm6-centos6.0
      • atvm41-redhat6.0
      • atvm73-oracle6.0
    • BLACKLISTED: SUPPORT REQUEST - WAITING:
      • atvm113-debian9.0.0
      • atvm115-debian9.1.0
      • atvm116-debian9.2.0
    • BLACKLISTED: RE-CREATE MIGHT BE NEEDED:
      • atvm156-debian9.3.0
  • Action for future runs:
    • Add these machine names to --exclude_partial_match when building broad-scope automation commands.
    • When reporting run status, include skipped blacklisted machines separately with their reason, in addition to completed and remaining machines.
    • Use the run build_name as the heading/title for status responses so the test type is obvious.
    • For failed machines in status responses, include the failure reason taken from the run log.
    • Include timing details in status responses: start time, end time when complete, and total or elapsed runtime.
    • Also include timing stats in status responses: quickest completed test runtime, longest completed test runtime, and average completed test runtime.
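The three timing stats are straightforward to compute once per-machine runtimes are known. A minimal sketch, assuming runtimes are available in seconds for completed machines only (`timing_stats` is a hypothetical helper name):

```python
from statistics import mean
from typing import Dict, Optional, Sequence


def timing_stats(durations_seconds: Sequence[float]) -> Optional[Dict[str, float]]:
    """Quickest, longest, and average runtime over completed machines."""
    if not durations_seconds:
        return None  # nothing completed yet, so there is nothing to summarize
    return {
        "quickest": min(durations_seconds),
        "longest": max(durations_seconds),
        "average": mean(durations_seconds),
    }
```

Returning `None` for an empty list lets the caller decide how to render the missing case instead of dividing by zero.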

Run Learning: 2026-03-11 (Machine-first status lines and whole-run ETA)

  • Observed requirement:
    • Status output must list each machine first and then its status, rather than leading with the status label.
    • Estimated completion time must refer to the entire remaining automation run, not only the currently running machine.
  • Action for future runs:
    • Format machine entries as machine-name - STATUS.
    • Keep failure reasons after the machine/status entry when a machine failed.
    • When giving ETA, explicitly state it is the estimate for completion of the full remaining run.

Run Learning: 2026-03-11 (Categorized run status must be reconstructed across batches)

  • Observed failure mode:
    • run-sorry-cypress.py --categorize mutates the active config to the current category batch, so live files such as specPattern, current_vm, and the newest /tmp Cypress JSON only describe the current category, not the full automation run.
    • Answering from only the current live batch underreports the run and misses already-finished machines from earlier category batches.
  • Action for future runs:
    • Reconstruct whole-run status from the generated machine scope plus all machine result artifacts written since the run start time.
    • Use the current batch only to identify the live RUNNING machine and immediate next machine(s), not as the full run scope.
    • Do not answer status requests for categorized runs until earlier category results have been checked as part of the same run.

Run Learning: 2026-03-11 (Hash-named XML files still belong to machine runs)

  • Observed failure mode:
    • Same-run JUnit output is not consistently named test-result-atvm...xml.
    • Many machine results for the same automation run were written as hash-named files such as test-result-01fe412894862398d06d9cc4bc7e81a0.xml.
    • Limiting status reconstruction to machine-named XML files causes major undercounting of completed machines.
  • Action for future runs:
    • Parse all test-result-*.xml files written since the run start time, not only test-result-atvm*.xml.
    • Extract the machine name from XML contents such as testsuite file=, testsuite name=, or testcase name= when the filename does not include the machine name.
    • Treat check-xml-files.ts XML outputs as bookkeeping steps, not machine results.
    • Prefer the most recently written same-run XML per machine when multiple XML files exist for that machine.
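The reconstruction rules above might look like the following sketch. It assumes machine names follow the `atvmNN-distroX.Y` pattern seen elsewhere in this file and that file modification time is an acceptable proxy for "written since the run start"; both are assumptions, and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict, Optional

# ASSUMPTION: machine names look like atvm121-ubuntu24.04 or atvm113-debian9.0.0.
MACHINE_RE = re.compile(r"atvm\d+-[a-z]+\d+(?:\.\d+)*")


def machine_from_xml(root: ET.Element) -> Optional[str]:
    """Look for a machine name in testsuite/testcase file= and name= attributes."""
    for elem in root.iter():
        if elem.tag not in ("testsuite", "testcase"):
            continue
        for attr in ("file", "name"):
            match = MACHINE_RE.search(elem.get(attr, ""))
            if match:
                return match.group(0)
    return None


def latest_result_per_machine(result_dir: Path, run_start: float) -> Dict[str, Path]:
    """Map each machine to its newest same-run test-result-*.xml file."""
    latest = {}
    for path in result_dir.glob("test-result-*.xml"):
        mtime = path.stat().st_mtime
        if mtime < run_start:
            continue  # written before this run started
        match = MACHINE_RE.search(path.name)
        if match:
            machine = match.group(0)
        else:
            try:
                machine = machine_from_xml(ET.parse(path).getroot())
            except ET.ParseError:
                continue  # unreadable XML is skipped, not counted
        if machine is None:
            continue  # e.g. check-xml-files.ts bookkeeping output
        if machine not in latest or mtime > latest[machine][0]:
            latest[machine] = (mtime, path)
    return {machine: path for machine, (_, path) in latest.items()}
```

Files that name no machine at all, such as check-xml-files.ts outputs, fall out naturally because no machine name is found in them.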

Run Learning: 2026-03-12 (Status output must be one machine per line with notes separated)

  • Observed requirement:
    • Listing multiple completed machines on one line makes run status harder to scan and does not meet the expected reporting format.
    • Failure reasons and extra context should be separated from the machine status list so the list stays clean.
  • Action for future runs:
    • Under completed, skipped, and remaining sections, put exactly one machine status on each line.
    • Add a Notes section after completed machines for failure reasons, anomalies, and other operator-relevant context.
    • Keep completed machine lines in the form machine-name - STATUS and avoid appending long explanations inline.

Run Learning: 2026-03-12 (Add suse15.0 machine to blacklist)

  • Observed requirement:
    • atvm144-suse15.0 must be excluded from automation runs because it crashes while creating the migration session.
  • Action for future runs:
    • Add atvm144-suse15.0 to the maintained blacklist.
    • Record the reason as CRASHES WHEN CREATING MIGRATION SESSION - BUG.
    • Include it in reusable --exclude_partial_match command examples.

Run Learning: 2026-03-12 (Default to gold-named ATVM config files)

  • Observed requirement:
    • The automation VM does not reliably have cypress.atvm-config.ts, and defaulting to that filename can break runs before they start.
    • Operator preference is to use ATVM config files with gold in the filename unless explicitly told otherwise.
  • Action for future runs:
    • Do not reference cypress.atvm-config.ts by default in commands or examples.
    • Default to cypress.atvm-config-gold.ts unless the operator explicitly requests another config.

Run Learning: 2026-03-12 (Examples are reference-only, not default intent)

  • Observed requirement:
    • Reusable examples may contain extra excludes or options that the operator did not ask for.
    • Carrying those example details into a new run without confirmation can change the requested scope.
  • Action for future runs:
    • Treat examples.md as reference-only.
    • Use only the options the operator explicitly requested, plus maintained mandatory blacklist handling.
    • Do not assume extra example exclusions such as distro filters are desired unless the operator asks for them.

Run Learning: 2026-03-12 (Use one status format for all automation run types)

  • Observed requirement:
    • The operator wants the same ATVM run status display every time, regardless of whether the run is e2e, systemOS, reboot, or another template.
    • Changing the display style between run types makes the status harder to scan and compare.
  • Action for future runs:
    • Use one consistent ATVM status layout for all automation status responses.
    • Keep the order the same: build name, completed machines, notes, skipped machines, remaining machines, summary, timing, estimated completion time.
    • Keep machine entries one per line as machine-name - STATUS regardless of test type.

Run Learning: 2026-03-13 (Put longer failure description on failed machine line)

  • Observed requirement:
    • Failed machines are easier to scan when the failure description appears directly on the same line as the machine status.
    • A longer same-line description works better than a very short label when the extra detail helps explain what actually failed.
  • Action for future runs:
    • Format failed machine lines as machine-name - FAIL - <failure description>.
    • Prefer the longer same-line description when it adds useful operator-facing context.
    • Keep Notes for broader context, anomalies, and extra follow-up detail beyond the machine-specific failure description.
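The line format above is small enough to capture in one helper. A sketch, with `format_machine_line` as a hypothetical name:

```python
from typing import Optional


def format_machine_line(machine: str, status: str, failure: Optional[str] = None) -> str:
    """Render 'machine-name - STATUS', adding the failure description for FAIL."""
    if status == "FAIL" and failure:
        return f"{machine} - FAIL - {failure}"
    return f"{machine} - {status}"
```

For example, `format_machine_line("atvm71-redhat9.1", "FAIL", "failed during Verifying set up")` produces `atvm71-redhat9.1 - FAIL - failed during Verifying set up`, while passing machines stay in the short form.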

Run Learning: 2026-03-14 (Missing requested ATVM config must fail fast)

  • Observed requirement:
    • If the operator asks for a specific ATVM config file and that file is missing on the automation VM, searching for other config files or substituting a different one is the wrong next step.
    • The operator wants to decide what to do after a missing-config failure.
  • Action for future runs:
    • If the requested config file is missing, stop immediately and report the missing filename.
    • Do not search the automation VM for alternate config files.
    • Do not switch to another config unless the operator explicitly instructs it.
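A fail-fast check along these lines could be sketched as follows. `require_config` and `MissingConfigError` are hypothetical names, and the sketch assumes the check runs where the config directory is directly readable.

```python
from pathlib import Path


class MissingConfigError(FileNotFoundError):
    """Raised when the operator-requested config file does not exist."""


def require_config(config_dir: Path, requested: str) -> Path:
    """Return the requested config path, or fail without looking for alternatives."""
    path = Path(config_dir) / requested
    if not path.is_file():
        # Deliberately no fallback search: report the missing filename and stop.
        raise MissingConfigError(f"Requested ATVM config not found: {requested}")
    return path
```

The function never lists the directory, so it cannot accidentally pick a different config.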

Run Learning: 2026-03-16 (Status requests default to live view with whole-run historical fallback)

  • Observed requirement:
    • When the operator asks for ATVM automation run status, they want live status by default.
    • If no automation is currently running, the status response must fall back to the most recent historical run.
    • For categorized runs, the response must still cover the entire run rather than only the latest category batch or cloud sub-run.
  • Action for future runs:
    • Treat every ATVM status request as a request for live run status unless the operator explicitly asks for something else.
    • If no automation is active, reconstruct status from the most recent historical run artifacts and logs.
    • For categorized runs, always aggregate all same-run category batches so the response covers the full run scope.

Run Learning: 2026-03-17 (Default ignore-force-shutdown and iscsi plugin)

  • Observed requirement:
    • The operator wants --ignore_force_shutdown included on every ATVM automation run by default.
    • The operator wants plugin selection to default to --use_specified_plugin iscsi unless a different plugin is explicitly requested.
  • Action for future runs:
    • Add --ignore_force_shutdown to every cmc-templates.py command unless the operator explicitly asks not to use it.
    • Default plugin-bearing ATVM automation commands to --use_specified_plugin iscsi.
    • Only switch away from iscsi when the operator explicitly requests fc, both, or another applicable override.
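One way to apply these defaults when assembling cmc-templates.py arguments is sketched below. The two flags come from this entry; the helper name `with_run_defaults` and its keyword parameters are hypothetical.

```python
from typing import List, Sequence


def with_run_defaults(
    args: Sequence[str],
    plugin_bearing: bool = True,
    ignore_force_shutdown: bool = True,
) -> List[str]:
    """Add the default flags unless the operator already chose or declined them."""
    result = list(args)
    if ignore_force_shutdown and "--ignore_force_shutdown" not in result:
        result.append("--ignore_force_shutdown")
    # Only default the plugin when the command uses one and none was requested.
    if plugin_bearing and "--use_specified_plugin" not in result:
        result += ["--use_specified_plugin", "iscsi"]
    return result
```

An explicit `--use_specified_plugin fc` in `args` is left untouched, which matches the rule that iscsi is only a default.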

Run Learning: 2026-03-18 (ATVM status requests must resolve from the local ATVM workflow, not Cirrus project operations)

  • Observed failure mode:
    • Interpreting "status of the ATVM automation run" as a request about Cirrus project operations can return the wrong source entirely.
    • The operator uses "ATVM automation" to mean the automation contained in the local atvm folder and the corresponding automation VM workflow.
  • Action for future runs:
    • Resolve ATVM status requests from the local ATVM workflow first.
    • Check the automation VM at 192.168.3.190 for live runner processes and live files before looking at historical artifacts.
    • If no automation is active, reconstruct the most recent historical run from the automation VM shell history and reporter artifacts.
    • Do not use Cirrus project operations such as atvm - cypress as the source for ATVM automation status unless the operator explicitly asks for project-operation status.

Run Learning: 2026-03-20 (Display exact ATVM commands and wait for approval before any execution)

  • Observed failure mode:
    • ATVM run commands were executed before the operator had a chance to review and approve them.
    • This happened even though the operator expects a review gate before any ATVM automation command is launched.
  • Action for future runs:
    • Always display the exact planned ATVM commands before execution.
    • Do not run cmc-templates.py until the operator explicitly approves the displayed commands.
    • Do not run run-sorry-cypress.py until the operator explicitly approves the displayed commands.
    • Treat template generation as execution that also requires operator approval.
    • If any requested option changes after commands are displayed, rebuild and redisplay the commands and wait for fresh approval.

Run Learning: 2026-03-26 (Verify generated specs directly on the controller before launching the runner)

  • Observed failure mode:
    • cmc-templates.py can successfully generate the requested .ts files, but a subsequent run can still start with an incomplete or stale specPattern if the runner is launched too early or the verification step is too fragile.
    • Shell-escaped regex one-liners used over SSH can fail even when the controller config is actually correct, which makes the verification gate unreliable.
  • Action for future runs:
    • After cmc-templates.py, verify both the generated .ts files and the controller config specPattern before launching run-sorry-cypress.py.
    • Prefer direct controller-side inspection of the config block and file presence rather than fragile shell-escaped regex checks.
    • If the requested VM list is not visibly present in both places, stop and report the mismatch instead of starting the runner.

Run Learning: 2026-03-26 (Do not repeat harmless reset-failed watcher noise)

  • Observed requirement:
    • systemctl reset-failed atvm-run-watcher@... often reports that the unit was not loaded.
    • In normal watcher startup this has been harmless and does not change the run outcome.
    • Repeating that note in routine run confirmations adds noise without helping the operator.
  • Action for future runs:
    • Do not mention expected, harmless reset-failed output in routine run updates.
    • Only mention it if it actually prevents watcher startup or becomes relevant to debugging.

Run Learning: 2026-03-27 (Replace FUNCTIONALLY with TEST FLOW in status output)

  • Observed requirement:
    • The operator wants the status format to show the full numbered ATVM test flow for the active template rather than a vague high-level FUNCTIONALLY: summary.
    • Each ATVM template can have its own test-flow step list.
    • The step list should appear once for the whole run, not repeated per host.
  • Action for future runs:
    • Replace the FUNCTIONALLY: section with TEST FLOW: in ATVM status output.
    • Resolve TEST FLOW: from the ATVM template name instead of hardcoding one shared list for every template.
    • For cmc-e2e, use this numbered run flow:
      • 1. Verifying set up
      • 2. Power on and obtain ip address and host name
      • 3. Uninstall CMC if still exists
      • 4. Setting up disk on the host
      • 5. Copy CMC install command from GUI
      • 6. Install CMC
      • 7. Create migration session
      • 8. Tracking Changes
      • 9. Trigger cmotion and do I/O test before actual cutover
      • 10. Verify data for cmotion
      • 11. Trigger revert cmotion and do I/O test before and during cmotion
      • 12. Verify data for revert cmotion
      • 13. Trigger cmotion again
      • 14. Finalize cutover
      • 15. Create migration report
      • 16. Delete migration session
      • 17. Verify local destination disk
      • 18. Remove enabled FC integration
      • 19. Remove host and volumes
      • 20. Uninstall CMC
      • 21. Clean up iSCSI targets
      • 22. Power off

Run Learning: 2026-03-27 (Template-specific coverage fields and systemOS flow)

  • Observed requirement:
    • COVERAGE: should only show fields that were actually present in the cmc-templates.py command for that template.
    • Showing an empty integration/plugin path on a template that does not use one adds noise and misleads the reader.
    • cmc-systemOS needs its own full numbered TEST FLOW: list rather than falling back to the generic short placeholder flow.
    • NOTES: should stay consistent across templates and should not include internal parent-summary recovery notes for cmc-systemOS.
  • Action for future runs:
    • Render COVERAGE: from the actual template command inputs used for that run.
    • Omit integration/plugin coverage lines when the template command did not use them.
    • Use the 21-step cmc-systemOS flow from status-template.md.
    • Keep NOTES: template-consistent and operator-facing, without parent-log-summary recovery notes.

Run Learning: 2026-03-27 (Start watcher before runner when watcher is requested)

  • Observed failure mode:
    • Starting run-sorry-cypress.py before the watcher can race with the watcher helper's stale-log cleanup.
    • The watcher helper clears stale /tmp/<build-name>.log before startup.
    • If the runner has already opened the new log, the helper can delete that live log path, leaving the watcher unable to read the run by filename.
  • Action for future runs:
    • When the watcher is approved, start the watcher before run-sorry-cypress.py.
    • Keep the order as: template generation, verification, watcher start, runner start.
    • Do not launch the runner first when the watcher is part of the approved command set.

Run Learning: 2026-03-27 (Watcher must recover when the consolidated run log is missing)

  • Observed failure mode:
    • A non-categorized watcher run can finish without posting Mattermost even when the ATVM test itself passed.
    • In this case the watcher service expected /tmp/<build-name>.log, but that consolidated run log was never written.
    • The run still produced the final check-xml-files.ts XML and fresh per-host reporter artifacts under cmcReporter/logs/<host>/.
  • Action for future runs:
    • Do not rely only on /tmp/<build-name>.log for non-categorized watcher result recovery.
    • When final check-xml-files.ts validation is present but host XML is absent, recover host completion from the latest matching per-host reporter artifact within the run window.
    • Keep non-categorized watcher notes accurate; do not describe that failure as a categorized sub-run issue.

Run Learning: 2026-03-27 (Non-categorized watcher runs must post once and show the full 22-step E2E flow)

  • Observed failure mode:
    • A non-categorized watcher run for cmc-e2e sent two Mattermost posts for the same build.
    • The posted TEST FLOW: list only showed 18 steps even though the current cmc-e2e ATVM flow has 22 steps.
  • Action for future runs:
    • For non-categorized runs, post only the parent run status and do not also post the single synthetic subrun.
    • Keep the static cmc-e2e watcher flow aligned with the current 22-step ATVM E2E sequence.

Run Learning: 2026-03-27 (Use summary-first status layout for ATVM run results)

  • Observed requirement:
    • The operator wants ATVM run results ordered as SUMMARY:, HOSTS:, TIMING:, COVERAGE:, TEST FLOW:, then NOTES:.
  • Action for future runs:
    • Render ATVM status output in that section order for both local output and Mattermost posts.

Run Learning: 2026-03-30 (Give cmc-reboot a full template-specific test flow)

  • Observed failure mode:
    • cmc-reboot status output fell back to the generic 5-step placeholder flow.
    • The actual reboot workflow is substantially longer and includes reboot-specific validation around cmotion, revert cmotion, and post-reboot disk verification.
  • Action for future runs:
    • Define a dedicated cmc-reboot TEST FLOW: in the watcher and status template.
    • Keep the reboot flow aligned with the generated reboot Cypress spec rather than the generic fallback list.

Run Learning: 2026-03-27 (Persist the Currents run URL outside the transient runner log)

  • Observed failure mode:
    • The watcher can include the Currents run URL in NOTES:, but only if it can still read the URL from live runner output or a consolidated run log.
    • In practice, /tmp/<build-name>.log is not guaranteed to exist, and the host reporter artifacts do not preserve the final Currents run URL.
  • Action for future runs:
    • Persist the Currents Recorded Run URL as soon as run-sorry-cypress.py sees it.
    • Store it under the watcher state directory for the parent build so it survives runner exit and missing log files.
    • Prefer the persisted Currents URL store over transient log scraping when building the final NOTES: section.
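Persisting the URL can be as simple as writing one small file per parent build. In this sketch the file name `currents_run_url.txt` and the per-build subdirectory layout are assumptions, not the watcher's actual state format.

```python
from pathlib import Path
from typing import Optional

URL_FILENAME = "currents_run_url.txt"  # hypothetical file name


def persist_run_url(state_dir: Path, build_name: str, url: str) -> Path:
    """Store the run URL under the build's state directory as soon as it is seen."""
    build_dir = Path(state_dir) / build_name
    build_dir.mkdir(parents=True, exist_ok=True)
    target = build_dir / URL_FILENAME
    target.write_text(url.strip() + "\n")
    return target


def load_run_url(state_dir: Path, build_name: str) -> Optional[str]:
    """Read the persisted URL, or None if the runner never recorded one."""
    target = Path(state_dir) / build_name / URL_FILENAME
    if not target.is_file():
        return None
    return target.read_text().strip() or None
```

Because the file lives outside /tmp/<build-name>.log, it survives both runner exit and a missing consolidated log.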

Run Learning: 2026-03-27 (Keep ATVM notes meaningful and remove generic artifact-detected lines)

  • Observed requirement:
    • Generic watcher bookkeeping notes such as "Run finished and one or more sub-run result artifacts were detected." and "Final reporting artifacts were detected." do not add operator value in ATVM status posts.
  • Action for future runs:
    • Reserve NOTES: for meaningful operator-facing content such as the Currents run URL, real anomalies, failure context, and important fallback behavior.
    • Do not include generic artifact-detection confirmations in the posted NOTES: section.
    • Do not include internal fallback notes such as "check-xml-files.ts validation passed" or "host details were derived from reporter artifacts" in the posted NOTES: section.

Run Learning: 2026-03-27 (Categorized grouped XML may need host recovery from the subrun's per-host artifact)

  • Observed failure mode:
    • A categorized subrun can finish and write its grouped test-result-<build>.xml, but that XML may only contain check-xml-files.ts.
    • In that case the watcher may know the grouped batch completed and even know its Currents URL, but still miss the host result unless it recovers the host from the matching per-host reporter artifact.
  • Action for future runs:
    • For categorized runs, when grouped XML only shows check-xml-files.ts, infer the subrun host from the categorized build id and recover the result from the latest matching per-host reporter artifact within the grouped completion window.
    • Do not keep a completed grouped subrun in RUNNING just because the grouped XML lacked a host testcase entry.

Run Learning: 2026-03-27 (Categorized batch results must aggregate all hosts in the group and use the earliest grouped host timestamp)

  • Observed failure mode:
    • A categorized grouped batch can post with only one host even when the batch actually ran multiple hosts of the same distro group.
    • This also causes the grouped start and total timing values to collapse to the last recovered host artifact instead of the full grouped batch duration.
  • Action for future runs:
    • For categorized grouped batches, recover all matching per-host reporter artifacts for the distro group within the grouped completion window, not only the latest host.
    • Derive the grouped start time from the earliest recovered host run timestamp and the grouped end time from the grouped finalization timestamp.
    • Prefer the reporter JSON metadata timestamp or artifact filename timestamp over file write time when reconstructing grouped host timing, because file mtime reflects artifact completion rather than run start.

Run Learning: 2026-03-27 (Default ATVM approval should include the watcher)

  • Observed requirement:
    • The operator wants approve to mean run with watcher by default.
    • The explicit no-watcher override should be approve without watcher.
  • Action for future runs:
    • Treat approve as approval to run and start the watcher.
    • Treat approve without watcher as approval to run without starting the watcher.

Run Learning: 2026-03-27 (Expand coverage details with operator-relevant run options)

  • Observed requirement:
    • The operator wants COVERAGE: to include more than template and datastore family.
    • Useful additions include the config filename and important flags such as --ignore_force_shutdown.
    • Explicit VM names do not need to be repeated there because the host listing already shows them.
  • Action for future runs:
    • Include the ATVM config filename in COVERAGE:.
    • Include important operator-relevant run options such as --ignore_force_shutdown in COVERAGE:.
    • Keep COVERAGE: focused on run intent and options, not the explicit target-host list.
    • Do not include verbose prose lines such as scope of this run: ... in COVERAGE:.
    • Treat COVERAGE: as a concise reflection of the important cmc-templates.py command inputs.

Run Learning: 2026-03-27 (Log the exact template command in NOTES)

  • Observed requirement:
    • The operator wants NOTES: to include the exact cmc-templates.py command that triggered the ATVM run.
    • The outer sshpass/ssh wrapper should be omitted, but the command itself should not be trimmed even when long.
  • Action for future runs:
    • Store and display the exact cmc-templates.py command in NOTES:.
    • Omit only the outer remote-execution wrapper.

Run Learning: 2026-03-27 (Avoid redundant categorize flags and infer grouped timing stats)

  • Observed requirement:
    • When categorize mode: enabled is already shown in COVERAGE:, repeating --categorize under run options is redundant.
    • Grouped categorized results should still show quickest, longest, and average when those values can be inferred from recovered host timing.
  • Action for future runs:
    • Do not repeat --categorize under run options when categorize mode is already shown separately.
    • When grouped host results are reconstructed from reporter artifacts, infer per-host durations from the recovered host timestamp sequence and grouped end time so grouped timing stats do not default to n/a unnecessarily.
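One possible reading of the duration inference is sketched below. It assumes hosts in a grouped batch run one after another, so each host's run ends when the next host starts and the last host ends at the grouped end time; that sequencing is an assumption, and `infer_host_durations` is a hypothetical name.

```python
from datetime import datetime, timedelta
from typing import Dict


def infer_host_durations(
    host_starts: Dict[str, datetime], grouped_end: datetime
) -> Dict[str, timedelta]:
    """Estimate per-host durations from recovered start times and the grouped end."""
    ordered = sorted(host_starts.items(), key=lambda item: item[1])
    durations = {}
    for index, (host, start) in enumerate(ordered):
        # ASSUMPTION: hosts run sequentially, so the next start bounds this host.
        if index + 1 < len(ordered):
            end = ordered[index + 1][1]
        else:
            end = grouped_end
        durations[host] = end - start
    return durations
```

With durations recovered this way, quickest, longest, and average can be computed for the group instead of falling back to n/a.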

Run Learning: 2026-03-27 (Do not auto-add blacklist excludes for explicitly specified VMs)

  • Observed requirement:
    • When the operator explicitly specifies the VM or VM list to run, they do not want the maintained --exclude_partial_match blacklist added automatically.
  • Action for future runs:
    • Keep the maintained --exclude_partial_match list for broad selectors such as --containsVm or --randomize.
    • When the operator uses --specify_vms, do not auto-add the blacklist unless they explicitly request it.
    • Even when the operator uses --specify_vms, first check whether any requested VM is on the maintained blacklist and stop instead of launching it if one is included.
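The two selector cases can be sketched together. The machine names and reasons below are copied from the blacklist entries in this file; the helper `plan_excludes` is hypothetical, and how the returned names are passed to --exclude_partial_match is left to the caller.

```python
from typing import Iterable, List

# Maintained blacklist, as recorded in the entries above.
BLACKLIST = {
    "atvm6-centos6.0": "CMC INSTALL - CAN'T COMPILE",
    "atvm41-redhat6.0": "CMC INSTALL - CAN'T COMPILE",
    "atvm73-oracle6.0": "CMC INSTALL - CAN'T COMPILE",
    "atvm113-debian9.0.0": "SUPPORT REQUEST - WAITING",
    "atvm115-debian9.1.0": "SUPPORT REQUEST - WAITING",
    "atvm116-debian9.2.0": "SUPPORT REQUEST - WAITING",
    "atvm156-debian9.3.0": "RE-CREATE MIGHT BE NEEDED",
    "atvm144-suse15.0": "CRASHES WHEN CREATING MIGRATION SESSION - BUG",
}


def plan_excludes(broad_selector: bool, requested_vms: Iterable[str] = ()) -> List[str]:
    """Return machine names to exclude, or stop if a blacklisted VM was requested."""
    if broad_selector:
        # --containsVm / --randomize style runs get the full maintained list.
        return sorted(BLACKLIST)
    blocked = [vm for vm in requested_vms if vm in BLACKLIST]
    if blocked:
        details = ", ".join(f"{vm} ({BLACKLIST[vm]})" for vm in blocked)
        raise ValueError(f"Requested VM is blacklisted: {details}")
    return []  # explicit VM list: no automatic excludes
```

Raising instead of silently dropping the VM keeps the decision with the operator.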

Run Learning: 2026-03-30 (Controller watcher deployment must match the repo watcher before trusting live posts)

  • Observed failure mode:
    • The repo watcher had the corrected cmc-reboot flow, but the controller install at /opt/atvm-watcher-service/atvm_run_watcher.py still had the old generic 5-step fallback.
    • A live categorized reboot subrun therefore posted the stale 5-step TEST FLOW: even though the repo copy had already been fixed.
  • Action for future runs:
    • Before trusting watcher-generated live posts for new watcher behavior, verify that the controller install matches the intended repo watcher version.
    • If the controller install is stale and the operator approves it, deploy the updated watcher code to /opt/atvm-watcher-service and restart only the watcher instance for the active build.

Run Learning: 2026-03-30 (Categorized grouped recovery must parse real per-host reporter status, not assume pass)

  • Observed failure mode:
    • A categorized Red Hat reboot subrun posted both hosts as passed even though atvm71-redhat9.1 actually failed during step 1 (Verifying set up).
    • The grouped XML only contained check-xml-files.ts, and the watcher incorrectly treated the presence of a per-host reporter artifact as PASS completed.
  • Action for future runs:
    • When grouped XML lacks explicit host testcase results, recover grouped host status from the per-host reporter JSON or equivalent detailed artifact.
    • Carry through the real failures, pending, and failure message from that host artifact instead of assuming PASS completed.
    • If a correction post is needed because stale or reconstructed state was wrong, mark it explicitly as a correction that supersedes the earlier result.

Run Learning: 2026-03-30 (Git push must stay manual even after commit approval)

  • Observed failure mode:
    • After creating a requested local commit, the assistant treated a later approve as permission to run git push.
    • The operator expectation was stricter: the assistant should stop at the local commit and only provide the manual push command reference.
  • Action for future runs:
    • Treat commit creation and push as separate gates.
    • Never execute git push for this workspace unless the operator explicitly overrides the workspace rule.
    • After creating a local commit, provide the manual push command reference only, defaulting to git push origin main unless the operator explicitly asks for a different remote or branch.
    • Do not interpret a generic approve after a commit as push approval.

Run Learning: 2026-03-30 (Do not infer plugin execution from generated spec text alone)

  • Observed failure mode:
    • A generated reboot spec for Pure still contained both iSCSI and FC code blocks, and that was incorrectly treated as proof that both plugin paths would run.
    • In this template, the generated file includes both branches, but runtime execution is gated by Cypress.env("pure_plugin_type").
  • Action for future runs:
    • Do not treat the presence of plugin-specific strings or code blocks in the generated .ts file as proof that those plugin steps will execute.
    • For plugin-specific questions, determine expected behavior from the template/runtime gate and only call it a mismatch if the runtime logic would execute the wrong plugin path.
    • Continue verifying that the requested VM set is present in the generated files and specPattern, but keep plugin-path validation separate from simple text-presence checks.

Run Learning: 2026-03-30 (Do not classify reporter TXT logs as failed from generic error words)

  • Observed failure mode:
    • A completed reboot-redhat8.10-iscsi run actually passed in the launch log and Cloud Run Finished table, but the watcher saved it as failed.
    • The TXT fallback matched generic strings such as auth error encountered and treated them as proof of host failure.
  • Action for future runs:
    • Do not classify a reporter TXT artifact as failed just because it contains the word error.
    • For TXT fallback, require explicit terminal failure markers such as cy:command error, cy:task error, or real Error:/AssertionError:/timeout text.
    • Prefer the parent run summary when available, because it is less prone to false failure signals than raw per-step console text.
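
The marker rule above can be sketched as a small classifier. This is a minimal illustration, not the watcher's actual code: the function name `txt_indicates_failure` is hypothetical, and the `timed out` pattern is an assumption, since the entry only says "timeout text".

```python
import re

# Explicit terminal failure markers named in this entry.
# The "timed out" pattern is an assumption standing in for "timeout text".
FAILURE_MARKERS = [
    re.compile(r"cy:command\s+error"),
    re.compile(r"cy:task\s+error"),
    re.compile(r"\b\w*Error:"),  # Error:, AssertionError:, TypeError:, ...
    re.compile(r"\btimed out\b", re.IGNORECASE),
]


def txt_indicates_failure(text: str) -> bool:
    """Return True only when a line carries an explicit failure marker."""
    return any(
        marker.search(line)
        for line in text.splitlines()
        for marker in FAILURE_MARKERS
    )
```

A line such as `auth error encountered` does not match any marker here, because the lowercase word `error` without a trailing colon is not treated as a terminal failure.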

Run Learning: 2026-03-30 (Replay exact artifacts before assuming a thin closed-run detail is a current watcher bug)

  • Observed failure mode:
    • The saved controller state for reboot-redhat8.10-both still showed only 1 failures under the host detail, even though the launch log contained the full md5sum failure text.
    • Replaying the exact launch log and reporter artifacts through the currently installed watcher produced the correct host detail with 57 tests, 1 failures and the failing testcase/error text.
  • Action for future runs:
    • Before patching the watcher again for a thin closed-run detail, replay the exact run artifacts through the currently installed watcher code.
    • Treat a mismatch between saved state and current replay as evidence of a stale in-memory watcher instance or stale deployment, not automatically as a parser regression.
    • Use an isolated temp state directory or other no-post path for that replay so historical validation does not repost results.

Run Learning: 2026-03-30 (Red Hat 8.10 Pure both failure on step 38 was a missing FC reboot-validation artifact with concurrent storage instability)

  • Observed failure mode:
    • The failing testcase was 38. Verify diskname2Reboot file is the same as diskname2Reboots source (Reboot test).
    • The concrete error was md5sum: /root/tmp/fcDisk/diskname2Reboot.md5: No such file or directory.
    • On the target after the run, /root/tmp/fcDisk contained diskname2Disk and diskname2Disk.md5, but not diskname2Reboot.md5.
  • Additional host findings:
    • The target showed repeated iSCSI authorization failures and later Could not log into all portals.
    • mtdi-driver.service started at 17:30:26 EDT.
    • iscsid.service / Open-iSCSI started at 17:30:30 EDT.
    • iscsi.service, mtdi-daemon.service, and galaxy-migrate.service reached active state at 17:32:45 EDT.
    • Repeated multipath reinitialization and failed to get ... uid messages continued through the run window.
  • Action for future runs:
    • If this failure recurs, treat it as a host/storage investigation first, not just a watcher-formatting issue.
    • Check whether the FC reboot-validation step actually created diskname2Reboot.md5 on /root/tmp/fcDisk before the md5 verification step ran.
    • Check whether repeated iSCSI auth failures or multipath churn during the same boot window are interfering with the expected disk/file state.

Run Learning: 2026-03-30 (cmc-reboot with Pure both needs an explicit warning/confirmation gate)

  • Observed operator requirement:
    • For reboot runs, using both FC and iSCSI together is not a normal default choice.
    • There may be a chicken-and-egg timing problem where iSCSI disks are not attached before mTDI / CMC services start.
    • The operator wants both on cmc-reboot to trigger a warning and an explicit reconfirmation instead of being treated like a routine plugin selection.
  • Action for future runs:
    • If a planned cmc-reboot command includes --use_specified_plugin both, call out the FC+iSCSI timing risk before execution.
    • Ask the operator to explicitly confirm that both is really intended for that reboot run.
    • Otherwise prefer a single plugin (fc or iscsi) rather than both.
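
A pre-execution check for this gate might look like the sketch below. The function name `needs_both_confirmation` is hypothetical, and accepting the `--use_specified_plugin=both` spelling is an assumption about how the command line may be written.

```python
from typing import Sequence


def needs_both_confirmation(template: str, argv: Sequence[str]) -> bool:
    """Return True when a cmc-reboot command selects the `both` plugin."""
    if template != "cmc-reboot":
        return False
    for index, arg in enumerate(argv):
        # Space-separated form: --use_specified_plugin both
        if arg == "--use_specified_plugin":
            if index + 1 < len(argv) and argv[index + 1] == "both":
                return True
        # Assumed alternative spelling: --use_specified_plugin=both
        if arg == "--use_specified_plugin=both":
            return True
    return False
```

When this returns True, the planned command would be held until the operator has seen the FC+iSCSI timing warning and reconfirmed.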

Run Learning: 2026-03-30 (Default --test_partition on ATVM template commands)

  • Observed operator requirement:
    • The operator wants --test_partition included on ATVM test-template commands by default unless they explicitly say otherwise.
  • Action for future runs:
    • Add --test_partition to cmc-templates.py commands by default.
    • Omit it only when the operator explicitly asks not to use it.

Run Learning: 2026-03-30 (Use generated spec as the source of truth for TEST FLOW:)

  • Observed operator requirement:
    • The operator wants the current full workflow steps for the actual test template/run, not a stale hand-maintained flow list.
  • Action for future runs:
    • Resolve TEST FLOW: from the generated .ts spec for the actual run whenever that spec exists.
    • Extract the numbered it(...) steps from the generated spec referenced by the run's specPattern.
    • Only use template-level or static fallback flow definitions when the generated spec cannot be found or parsed.
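
Extracting the numbered `it(...)` steps can be sketched with a regular expression. This is an illustration under assumptions: `extract_test_flow` is a hypothetical name, step titles are assumed to start with a number followed by a period, and titles containing escaped quotes are not handled.

```python
import re
from typing import List

# Matches it("12. Some title", ...) with single, double, or backtick quotes.
# The backreference \1 requires the closing quote to match the opening one,
# so an apostrophe inside a double-quoted title does not end the match.
IT_STEP = re.compile(r"""\bit\(\s*(['"`])(\d+\.\s.*?)\1""")


def extract_test_flow(spec_text: str) -> List[str]:
    """Return the numbered step titles in the order they appear in the spec."""
    return [match.group(2) for match in IT_STEP.finditer(spec_text)]
```

This sketch lists every numbered step in the file; it does not evaluate runtime gates, which other entries in this file cover separately.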

Run Learning: 2026-03-30 (Event-log reporter JSON must not be ignored in non-categorized fallback)

  • Observed failure mode:
    • A failed non-categorized run still posted/saved host detail as only 1 failures even though the per-host reporter artifacts preserved the failing step.
    • The per-host .json artifact used an event-log format with metadata plus tests, but no top-level stats block.
    • The watcher ignored that JSON format, fell back to the .txt, and lost structured test counts/detail.
  • Action for future runs:
    • Support the event-log JSON format directly when parsing per-host reporter artifacts.
    • In non-categorized fallback, prefer the structured .json artifact over the matching .txt when they belong to the same run timestamp.
    • Recover at least the failing testcase name and a nonzero test count from those artifacts even when the consolidated run log is missing.

Run Learning: 2026-03-30 (Use mochawesome as the rich fallback for host failure detail)

  • Observed failure mode:
    • The full UI-visible Cypress error text for a failed ATVM host run existed in cypress/cmcReporter/mochawesome/*.html, but the lower-fidelity host-level .json and .txt reporter artifacts only preserved the failing step boundary.
    • That made the host detail fall back to a thin summary even though a richer error payload was available on the controller.
  • Action for future runs:
    • When the consolidated run log is missing, use mochawesome as the rich fallback source for per-host failure text before settling for lower-fidelity reporter artifacts.
    • Keep the HOSTS table compact by showing the failing step plus a short error summary.
    • Put the longer trimmed failure excerpt in NOTES: instead of dumping the full raw stack trace into the host-detail column.

Run Learning: 2026-03-30 (Apply rich failed-host detail recovery to every ATVM template)

  • Observed operator requirement:
    • The same failed-host recovery and formatting rules should apply across all ATVM template runs, not only reboot scenarios.
    • If any ATVM test template fails, the result should still recover the best available failure detail and present it consistently.
  • Action for future runs:
    • Use the same failure-detail recovery order for every ATVM template: consolidated run log, mochawesome, structured reporter artifacts, then text reporter artifacts.
    • Keep failed-host Detail compact and put the longer trimmed excerpt in FAILURE NOTES: for every template type.
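
The recovery order above can be expressed as an ordered list of loaders, taking the first one that returns non-empty text. This is a minimal sketch: `recover_failure_detail` and the loader callables are hypothetical, and how each artifact is actually read is left out.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each source is (label, loader). A loader returns the failure text or None.
Source = Tuple[str, Callable[[], Optional[str]]]


def recover_failure_detail(sources: Sequence[Source]) -> Optional[Tuple[str, str]]:
    """Return (label, detail) from the first source with non-empty detail."""
    for label, load in sources:
        detail = load()
        if detail and detail.strip():
            return label, detail.strip()
    return None
```

The caller would pass the sources in the documented order: consolidated run log, mochawesome, structured reporter artifacts, then text reporter artifacts.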

Run Learning: 2026-03-30 (Separate failure detail from general notes in ATVM status output)

  • Observed operator requirement:
    • The HOSTS detail column should stay short and scannable.
    • Detailed per-host error text should not crowd the host table or mix with general NOTES:.
  • Action for future runs:
    • Keep HOSTS detail to the failing step plus a short error summary only.
    • Put richer per-host error excerpts in FAILURE NOTES:.
    • Reserve NOTES: for non-failure context such as template command, Currents URL, and operator-facing caveats.

Run Learning: 2026-03-31 (Mochawesome failure parsing must stay within one testcase object)

  • Observed failure mode:
    • A reboot failure post showed step 36 with an empty FAILURE NOTES: excerpt even though the real failure remained step 38 and the mochawesome HTML contained the full sshpass / md5sum error text.
    • The parser was scanning beyond the current mochawesome testcase object, so it paired one step title with another step's later failed-state/message fields.
    • Empty mochawesome message / estack values must not be accepted as valid failure detail.
  • Action for future runs:
    • Parse mochawesome one testcase object at a time and do not cross object boundaries when matching title, fullTitle, state, message, and estack.
    • Only use mochawesome to enrich host detail when it returns a non-empty failure payload.
    • If mochawesome and structured reporter artifacts disagree on the step number, keep the structured reporter step as the safer fallback for the host detail.
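
One way to stay inside a single testcase object is to work on the parsed report data and read the title, state, and error fields from the same dict. The sketch below assumes the mochawesome data has already been decoded into Python objects and that each test object carries `title`, `state`, and an `err` dict with `message` and `estack`; the function names are hypothetical.

```python
from typing import Any, Dict, Iterator, Optional


def iter_tests(node: Any) -> Iterator[Dict[str, Any]]:
    """Yield every dict that looks like a testcase (has title and state)."""
    if isinstance(node, dict):
        if "title" in node and "state" in node:
            yield node
        for value in node.values():
            yield from iter_tests(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_tests(item)


def first_failure(report: Any) -> Optional[Dict[str, str]]:
    """Return the first failed test that has a non-empty failure payload."""
    for test in iter_tests(report):
        if test.get("state") != "failed":
            continue
        err = test.get("err") or {}
        message = (err.get("message") or "").strip()
        estack = (err.get("estack") or "").strip()
        if message or estack:  # empty payloads are not valid failure detail
            return {"title": test["title"], "message": message, "estack": estack}
    return None
```

Because every field is read from one dict, a step title can never be paired with another step's state or message.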

Run Learning: 2026-03-31 (Generated-spec TEST FLOW must not depend only on log-scoped specPattern)

  • Observed failure mode:
    • A completed e2e-redhat8.10-both run posted the static 22-step cmc-e2e flow even though the generated spec for that exact run contained a longer flow.
    • The watcher only extracted generated-spec flow when it could find Extracted specPattern: in the available log text.
    • When that log-scoped specPattern line was unavailable at final render time, the watcher silently fell back to the static template flow.
  • Action for future runs:
    • Resolve generated-spec TEST FLOW from the active config file's specPattern when the required log line is missing.
    • Treat the static template flow as a last-resort fallback only after both log-derived and config-derived specPattern resolution fail.
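
The resolution order can be sketched as a three-step fallback. This is an illustration only: `resolve_spec_pattern` is a hypothetical name, and the config regex assumes the active config file contains a quoted `specPattern: "..."` entry, which may not match the real file layout.

```python
import re
from typing import Optional, Tuple

LOG_LINE = re.compile(r"Extracted specPattern:\s*(\S+)")
# Assumed config shape: specPattern: "path/to/spec.ts"
CONFIG_KEY = re.compile(r"""specPattern\s*:\s*['"]([^'"]+)['"]""")


def resolve_spec_pattern(
    log_text: Optional[str], config_text: Optional[str]
) -> Tuple[str, Optional[str]]:
    """Return (source, specPattern); source is 'log', 'config', or 'static'."""
    log_matches = LOG_LINE.findall(log_text or "")
    if log_matches:
        return "log", log_matches[-1]  # most recent line wins
    config_match = CONFIG_KEY.search(config_text or "")
    if config_match:
        return "config", config_match.group(1)
    return "static", None  # last resort: static template flow
```

Returning the source label alongside the value makes a fallback to the static flow visible to the caller instead of silent.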

Run Learning: 2026-04-16 (Generated-spec TEST FLOW must honor test-install-only gates)

  • Observed failure mode:
    • An install-only cmc-e2e run for atvm5-ubuntu22.04 posted the full 22-step TEST FLOW: to Mattermost even though the generated spec for that run only executed the shorter install-only path.
    • The watcher already used the generated spec as the source of truth, but its gate evaluator did not understand if (Cypress.env("test-install-only") == true/false).
    • That left both the install-only branch and the normal post-install branch visible to the flow extractor.
  • Action for future runs:
    • When extracting TEST FLOW: from a generated spec, evaluate test-install-only gates the same way plugin and cutover gates are evaluated.
    • For install-only runs, exclude the normal post-install branch and report only the actual numbered install-only steps from the generated spec.

Run Learning: 2026-03-31 (Default vmware compute-migration options for ATVM)

  • Observed operator requirement:
    • For cmc-migrateops-compute-migration runs to VMware, the operator wants a stable default option set instead of having to restate the same platform flags each time.
  • Action for future runs:
    • Default VMware compute-migration runs to:
      • --ignore_force_shutdown
      • --vm_platforms vmware
      • --test_partition
      • --set_static_ip_dest
    • Only omit or change those options when the operator explicitly overrides them.
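
As a sketch, the default option set can be kept in one place and adjusted only by explicit operator overrides. The names `VMWARE_COMPUTE_MIGRATION_DEFAULTS` and `default_options` are hypothetical, and treating every option except `--vm_platforms` as a value-less flag is an assumption based on how the options are listed above.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

# (flag, value) pairs; value None means the flag is assumed to take no value.
VMWARE_COMPUTE_MIGRATION_DEFAULTS: Tuple[Tuple[str, Optional[str]], ...] = (
    ("--ignore_force_shutdown", None),
    ("--vm_platforms", "vmware"),
    ("--test_partition", None),
    ("--set_static_ip_dest", None),
)


def default_options(
    omit: FrozenSet[str] = frozenset(),
    overrides: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build the option list, applying only explicit operator overrides."""
    overrides = overrides or {}
    argv: List[str] = []
    for flag, value in VMWARE_COMPUTE_MIGRATION_DEFAULTS:
        if flag in omit:
            continue
        value = overrides.get(flag, value)
        argv.append(flag)
        if value is not None:
            argv.append(value)
    return argv
```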

Run Learning: 2026-04-14 (Generated-spec TEST FLOW must honor the selected plugin branch)

  • Observed failure mode:
    • A Pure FC cmc-e2e run posted a 39-step TEST FLOW: even though the actual FC path for that template uses 22 steps.
    • The generated spec contained both if(useFCPlugin) and if(useIscsiPlugin) blocks, and the watcher counted every it(...) step without applying the runtime plugin gate.
  • Action for future runs:
    • When extracting TEST FLOW: from a generated spec, include common steps plus only the runtime-gated plugin branch selected for that run.
    • Use watcher metadata such as the approved integration/plugin path to decide whether to include FC steps, iSCSI steps, or both.
    • Do not count every plugin-gated branch in the generated spec just because the text is present.

Run Learning: 2026-04-14 (cmc-systemOS should not carry plugin or integration arguments)

  • Observed operator requirement:
    • cmc-systemOS runs should not be planned with --use_specified_plugin, --integration_type, or watcher integration/plugin metadata.
    • Treating cmc-systemOS like a plugin-bearing template adds incorrect command arguments and misleading status metadata.
  • Action for future runs:
    • Plan cmc-systemOS template commands without plugin-selection or integration-type arguments.
    • When watcher-backed execution is used for cmc-systemOS, omit watcher integration/plugin metadata too.
    • Keep plugin defaults scoped to templates that actually use plugin selection.

Run Learning: 2026-04-14 (Plugin-gated TEST FLOW filtering must match reboot and other template gate names too)

  • Observed failure mode:
    • A Pure FC cmc-reboot run still posted the combined FC+iSCSI step count even after the earlier cmc-e2e fix.
    • The watcher only recognized if(useFCPlugin) / if(useIscsiPlugin) gates, while the reboot templates use names such as if(usePureFCPlugin) / if(usePureIscsiPlugin).
  • Action for future runs:
    • Match plugin-gated generated-spec branches generically by plugin-bearing gate variable name instead of hardcoding only one template's variable names.
    • Apply the same plugin-branch filtering logic across ATVM templates so new templates do not need one-off watcher fixes.
    • Validate generated-spec TEST FLOW against the selected runtime plugin path for reboot and other templates before assuming the generic fix is complete.
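
Generic gate matching can be sketched by classifying the gate variable name rather than comparing against a fixed list. This is a minimal illustration: the function names are hypothetical, and the assumption is that plugin gate variables follow a `use...Plugin` naming pattern containing `FC` or `Iscsi`, as in the examples above.

```python
import re
from typing import Dict, Optional

# Matches if(useFCPlugin), if (usePureIscsiPlugin), and similar gate names.
GATE = re.compile(r"if\s*\(\s*(use\w*Plugin)\s*\)")


def gate_plugin(var_name: str) -> Optional[str]:
    """Map a gate variable name to 'fc', 'iscsi', or None if unrecognized."""
    lowered = var_name.lower()
    if "iscsi" in lowered:
        return "iscsi"
    if "fc" in lowered:
        return "fc"
    return None


def gates_in_spec(spec_text: str) -> Dict[str, Optional[str]]:
    """Return every plugin gate variable found in the spec with its plugin."""
    return {name: gate_plugin(name) for name in GATE.findall(spec_text)}


def gate_selected(var_name: str, selected: str) -> bool:
    """Decide whether steps under this gate belong in TEST FLOW."""
    plugin = gate_plugin(var_name)
    return plugin is None or selected == "both" or plugin == selected
```

With this approach, a new template that introduces another `use...Plugin` gate name would be classified without a one-off watcher change, as long as the name still contains the plugin type.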

Run Learning: 2026-04-15 (Parent Cloud Run Finished parsing must tolerate late host rows after Recorded Run detection)

  • Observed failure mode:
    • A non-categorized watcher run tested three VMs, but the Mattermost status only showed two hosts.
    • In the launch log, the parent Cloud Run Finished summary printed one host row, then logged Detected 'Recorded Run' after 'Cloud Run Finished' - results uploaded successfully., then printed the remaining host rows.
    • The watcher treated that detection log line as the end of the summary block, so the merged parent-run summary dropped the later host row.
  • Action for future runs:
    • Do not stop parent summary parsing at the Recorded Run detection log line.
    • Bound each Cloud Run Finished block by the next run boundary such as the next Extracted specPattern: or the next Cloud Run Finished, then parse all host rows inside that block.
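
The block-bounding rule can be sketched as follows. This is an illustration, not the watcher's parser: `cloud_run_finished_blocks` is a hypothetical name, and matching is done on plain substrings. Note that the detection log line itself contains the text `Cloud Run Finished`, so it has to be excluded when looking for block starts.

```python
from typing import Iterable, List

DETECTION_MARKER = "Detected 'Recorded Run'"


def cloud_run_finished_blocks(lines: Iterable[str]) -> List[List[str]]:
    """Split a launch log into Cloud Run Finished blocks.

    A block starts at a Cloud Run Finished line and ends at the next
    Extracted specPattern: line or the next Cloud Run Finished line.
    The Recorded Run detection log line does not end a block.
    """
    blocks: List[List[str]] = []
    current = None
    for line in lines:
        is_detection = DETECTION_MARKER in line
        if "Cloud Run Finished" in line and not is_detection:
            if current is not None:
                blocks.append(current)
            current = [line]
        elif "Extracted specPattern:" in line:
            if current is not None:
                blocks.append(current)
            current = None
        elif current is not None:
            current.append(line)
    if current is not None:
        blocks.append(current)
    return blocks
```

Host rows printed after the detection line stay inside the same block, so the later hosts are still available to the row parser.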

Run Learning: 2026-04-16 (Categorized Cloud Run Finished parsing must stop at the Recorded Run URL for each grouped batch)

  • Observed failure mode:
    • A categorized ATVM run completed its Windows batch in the Cypress launch log, but the watcher posted only the earlier grouped results and never sent a separate Windows Mattermost status.
    • The watcher let one categorized Cloud Run Finished block run forward into the next grouped batch because the next grouped run did not present a fresh Extracted specPattern: boundary before the next runner output.
    • That let host-row parsing drift across grouped runs, which caused the Windows batch XML to be relabeled under the wrong subrun and left the real Windows subrun stuck in RUNNING.
  • Action for future runs:
    • For categorized grouped recovery, stop each Cloud Run Finished block at that grouped run's 🏁 Recorded Run: line when it is present.
    • Do not let categorized summary parsing continue into the next grouped batch's runner output.
    • Keep grouped host-row parsing scoped to the actual summary table rows for that grouped run only.
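
For the categorized case, the stop rule can be sketched as a truncation step applied to each block. The name `truncate_at_recorded_run` is hypothetical, and the sketch matches on the substring `Recorded Run:` (with the colon), which the `Detected 'Recorded Run'` log line does not contain.

```python
from typing import Iterable, List


def truncate_at_recorded_run(block_lines: Iterable[str]) -> List[str]:
    """Keep lines up to and including the first 'Recorded Run:' line.

    If no such line is present, the block is returned unchanged.
    """
    kept: List[str] = []
    for line in block_lines:
        kept.append(line)
        if "Recorded Run:" in line:
            break
    return kept
```

Anything after the `Recorded Run:` line, such as the next grouped batch's runner output, is left out of host-row parsing for that grouped run.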

Run Learning: 2026-04-22 (Wrapped duration rows in parent Cloud Run Finished tables must not drop hosts)

  • Observed failure mode:
    • A non-categorized cmc-migrateops run completed four hosts, and the launch log's parent Cloud Run Finished table showed all four host rows.
    • The saved watcher state still only kept two hosts.
    • The Currents summary wrapped the trailing s in long duration values such as 16m 13.9s onto its own continuation row.
    • The watcher normalization appended that standalone s to the far end of the host row, which broke the host-row regex for those wrapped rows.
  • Action for future runs:
    • When parsing parent Cloud Run Finished tables, treat standalone wrapped s rows as duration-cell continuations and remove those rows instead of appending s to the end of the host line.
    • Rely on the existing duration parser to accept wrapped values without the trailing s.
    • Replay the exact launch log through the current watcher code after this fix before trusting a corrected host count.
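
Dropping the wrapped continuation rows can be sketched as a filter over the table lines. This is an illustration under assumptions: `drop_wrapped_s_rows` is a hypothetical name, and the table is assumed to use `│` (or `|`) as the cell separator, which may differ from the real log layout.

```python
from typing import Iterable, List


def drop_wrapped_s_rows(lines: Iterable[str]) -> List[str]:
    """Remove rows whose only non-empty cell is a standalone 's'."""
    kept: List[str] = []
    for line in lines:
        normalized = line.replace("|", "│").strip().strip("│")
        cells = [cell.strip() for cell in normalized.split("│")]
        non_empty = [cell for cell in cells if cell]
        if non_empty == ["s"]:
            continue  # duration-cell continuation; do not append to host row
        kept.append(line)
    return kept
```

The host row keeps its duration value without the trailing `s`, which matches the rule that the existing duration parser accepts wrapped values in that form.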