Tighten CMC kernel upgrade test procedure

Cirrus Codex
2026-05-20 21:27:49 -04:00
parent 3c87b2e0b2
commit 511c3b7401

@@ -6,33 +6,29 @@ Validate CMC behavior across staged kernel upgrades on a cloned VM, including re
## Scope
- Run per source host provided by operator.
- Work only on the cloned VM created for this test.
- If the operator asks to run `tests/cmc-upgrade-kernel-test.md` or any variation of the "cmc upgrade kernel test" for an ATVM host, treat that request as referring to this file and this workflow only.
- Treat this file as the source of truth for this test and ignore unrelated workflow references unless the operator explicitly asks to incorporate them for the current request.
- If the operator asks to run `tests/cmc-upgrade-kernel-test.md` or any variation of the "cmc upgrade kernel test" for an ATVM host, treat that request as referring to this file only.
- Treat this file as the source of truth for this test and ignore unrelated procedure references unless the operator explicitly asks to incorporate them for the current request.
## Inputs
- Source VM hostname: `<atvmxxx-...>`
- vCenter target/source location: `<cluster/datastore/folder>`
- vCenter source VM inventory location: `<cluster/folder/resource context>`
- Required clone datastore: `AutomatedTest-UnitTesting`
- Default clone ESXi host: `CDS1-ESX165` / `192.168.1.165` unless the operator explicitly specifies otherwise
- Initial clone access host/IP: `<INITIAL_CLONE_HOST_OR_IP>`
- SSH username variable: `<SSH_USER_VAR>`
- SSH password variable: `<SSH_PASSWORD_VAR>`
- Cirrus profile/project: `gcstage` / `skidamarink`
## Credential Source
## Credentials
- Use credentials from: `/home/cirrus/cds/.env.credentials.local`
- Do not hardcode usernames/passwords in test records or commands.
- Before any vCenter, SSH, Red Hat subscription, or CMC action, load credentials with `set -a; source /home/cirrus/cds/.env.credentials.local; set +a`.
- Verify required credential variable names are present without printing secret values.
- Verify required credential variable names are present without printing secret values: `VCENTER_HOST`, `VCENTER_USER`, `VCENTER_PASSWORD`, `ATVM_TARGET_USER`, `ATVM_TARGET_PASSWORD`, `REDHAT_SUBSCRIPTION_USER`, `REDHAT_SUBSCRIPTION_PASSWORD`, `CMC_GCSTAGE_URL`, `CMC_GCSTAGE_REGISTRATION_CODE`, and `CIRRUS_API_TOKEN`.
- Do not parse the credential file with `grep`/`awk` as the authority; source it and inspect the environment because entries may use `export KEY=...`.
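The presence check above can be sketched as follows. This is a minimal sketch, not the procedure's own tooling: `missing_vars` and `REQUIRED_VARS` are hypothetical names introduced here, and the variable list is copied from the bullet above. It reports only variable names, never values.

```shell
# Sketch: list required credential variables that are unset or empty.
# Only names are printed; secret values are never echoed.

# Names copied from the credential checklist (space-separated for POSIX sh).
REQUIRED_VARS="VCENTER_HOST VCENTER_USER VCENTER_PASSWORD \
ATVM_TARGET_USER ATVM_TARGET_PASSWORD \
REDHAT_SUBSCRIPTION_USER REDHAT_SUBSCRIPTION_PASSWORD \
CMC_GCSTAGE_URL CMC_GCSTAGE_REGISTRATION_CODE CIRRUS_API_TOKEN"

# missing_vars NAME... (hypothetical helper)
# Prints each NAME whose variable is unset or empty in the current shell.
missing_vars() {
  for mv_name in "$@"; do
    # Indirect lookup via eval; safe here because names come from a fixed list.
    eval "mv_value=\${$mv_name:-}"
    [ -n "$mv_value" ] || printf '%s\n' "$mv_name"
  done
  unset mv_value
}

# Usage (left commented so sourcing this sketch has no side effects):
#   set -a; . /home/cirrus/cds/.env.credentials.local; set +a
#   missing="$(missing_vars $REQUIRED_VARS)"
#   [ -z "$missing" ] || { echo "missing credential variables: $missing"; exit 1; }
```

Because the file is sourced first and the environment is inspected afterwards, entries written as `export KEY=...` are handled the same way as plain `KEY=...` entries.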
## CMC Tooling Rule (Global)
## CMC Tooling
- For all CMC-related actions in this test, use the `cirrusdata` skill/CLI path.
- Exception: offline-host cleanup is not handled by that skill yet; use the MCP connection for offline-host removal.
- Apply this rule to every relevant step in this procedure.
- Exception: host cleanup is not handled by that skill yet; use the Cirrus Data MCP tools for offline-host cleanup and cloned-host cleanup.
- For every CMC install/reinstall command in this test, always include installer option: `-no-prebuilt-mtdi-nexus`.
## Kernel Package Matching Rule (Global)
## Kernel Package Rules
- For every planned kernel upgrade, verify matching development/header packages are available for the exact target kernel version before installing that kernel.
- On Red Hat-family systems, verify `kernel-devel-<target>` and `kernel-headers-<target>` availability (or documented distro-equivalent package names where applicable).
- The first kernel upgrade attempt must not use the latest kernel in the filtered candidate list; reserve the latest kernel for the final kernel-upgrade stage.
@@ -40,35 +36,33 @@ Validate CMC behavior across staged kernel upgrades on a cloned VM, including re
- On Red Hat-family systems that use `grubby` (including Oracle Linux), explicitly set the selected kernel as the default before rebooting, then verify `grubby --default-kernel` returns the selected `/boot/vmlinuz-<target>` path. If the default does not match, stop before reboot as blocker-fail.
- After each kernel upgrade and reboot, verify running kernel version and installed dev/header package versions all match.
- If kernel and dev/header package versions are mismatched at any point, stop immediately as blocker-fail and do not continue with remediation by assumption.
- Before any kernel candidate discovery step on any distro, force a fresh package metadata refresh on the live host before evaluating available kernel builds. Use the distro's normal refresh command for the installed package manager (for example `dnf makecache`, `yum makecache`, or `zypper refresh`). For APT-based distros, use a hard APT refresh so stale or empty package-list files are rebuilt: `rm -rf /var/lib/apt/lists/* && apt-get clean && apt-get update`. If the refreshed view differs from a prior result, trust the refreshed live metadata and record that the earlier view was stale.
- Before any kernel candidate discovery step, force a fresh package metadata refresh on the live host before evaluating available kernel builds. Use the distro command set in the checklist for RHEL-family and APT-based hosts. If the refreshed view differs from a prior result, trust the refreshed live metadata and record that the earlier view was stale.
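The two exact-match gates in this section (default boot entry before reboot, and kernel versus dev/header packages after reboot) could be expressed as simple string comparisons. This is a sketch under assumptions: `default_kernel_ok` and `kernel_pkgs_match` are hypothetical helper names, and the commented usage assumes a Red Hat-family host where `uname -r` has the form `VERSION-RELEASE.ARCH`.

```shell
# default_kernel_ok TARGET DEFAULT_PATH (hypothetical helper)
# Succeeds only if the bootloader default is exactly /boot/vmlinuz-TARGET.
default_kernel_ok() {
  [ "$2" = "/boot/vmlinuz-$1" ]
}

# kernel_pkgs_match RUNNING DEVEL HEADERS (hypothetical helper)
# Succeeds only if all three version strings are identical.
kernel_pkgs_match() {
  [ "$1" = "$2" ] && [ "$1" = "$3" ]
}

# Usage on a grubby-based host (commented; requires root and a real target):
#   target="<target>"
#   grubby --set-default "/boot/vmlinuz-$target"
#   default_kernel_ok "$target" "$(grubby --default-kernel)" \
#     || { echo "blocker-fail: default kernel mismatch"; exit 1; }
#
# After reboot, compare the running kernel with the installed packages:
#   running="$(uname -r)"
#   qf='%{VERSION}-%{RELEASE}.%{ARCH}\n'
#   kernel_pkgs_match "$running" \
#     "$(rpm -q --qf "$qf" "kernel-devel-$running")" \
#     "$(rpm -q --qf "$qf" "kernel-headers-$running")" \
#     || { echo "blocker-fail: kernel/dev/header mismatch"; exit 1; }
```

If a package is not installed, `rpm -q` prints a "not installed" message instead of a version, so the comparison fails and the run stops rather than continuing by assumption.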
## Red Hat Preflight (Global, Manual Tasks Only)
## Red Hat Preflight
- Apply this section only when the test target is an actual Red Hat subscription-managed machine and the run is manually executed.
- Do not apply this section to CentOS, Oracle Linux, Rocky, Alma, or other RHEL-derived distributions unless the operator explicitly says the machine should be treated as Red Hat-managed for this run.
- If the target is not actual RHEL, skip this preflight entirely and do not attempt `subscription-manager`.
- Do not apply this section to ATVM automation runs that already handle subscription flow.
- Before running test steps on Red Hat, run:
- After sourcing credentials and before running test steps on Red Hat, run:
- `subscription-manager remove --all`
- `subscription-manager unregister`
- `subscription-manager clean`
- `subscription-manager register --username "$REDHAT_SUBSCRIPTION_USER" --password "$REDHAT_SUBSCRIPTION_PASSWORD"`
- Source credentials from `/home/cirrus/cds/.env.credentials.local`.
## SUSE Exclusion Rule (Global)
- Do not run this test against SUSE/SLES ATVM machines.
## SUSE Exclusion
- Do not run this test against SUSE/SLES ATVM machines; stop before source power-on or clone creation and report that SUSE is excluded for this test.
- SUSE ATVM machines use a local offline DVD/vault repository for packages.
- Kernel upgrade discovery is not valid for this test unless the machine can access official SUSE repositories, which requires a SUSE subscription.
- If the operator requests this test against any SUSE/SLES machine, stop immediately before source power-on or clone creation and report that SUSE is excluded for this test because it uses the local offline repository.
## Execution Mode (Global)
## Execution Mode
- Run this test in continuous execution mode.
- Do not pause for additional operator prompts between steps.
- Keep monitoring and continue automatically until the test reaches a terminal outcome (`PASS` or `FAIL`) and all required cleanup/reporting steps are completed.
- Keep monitoring and continue automatically until the test reaches a terminal outcome (`PASS`, `FAIL`, or operator-directed `PARTIAL`) and all required cleanup/reporting steps are completed.
- Only stop early if a true blocker prevents safe continuation, and still complete required cleanup/reporting before returning control.
- Time every step explicitly.
- If any single step takes longer than 10 minutes, hard stop the test and treat it as a blocker-fail.
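One way to enforce the per-step timing rule is to wrap each step in a timed runner. This is a sketch assuming GNU coreutils `timeout` is available on the machine driving the test; `run_step` and `STEP_LIMIT_SECONDS` are hypothetical names, not part of the procedure.

```shell
# Per-step limit in seconds; 600 matches the 10-minute rule above.
STEP_LIMIT_SECONDS="${STEP_LIMIT_SECONDS:-600}"

# run_step LABEL COMMAND [ARGS...] (hypothetical helper)
# Runs an external command under the time limit, logs elapsed time to stderr,
# and returns the command's exit status (124 means the limit was exceeded).
run_step() {
  rs_label="$1"; shift
  rs_start=$(date -u +%s)
  rs_status=0
  timeout "$STEP_LIMIT_SECONDS" "$@" || rs_status=$?
  rs_elapsed=$(( $(date -u +%s) - rs_start ))
  if [ "$rs_status" -eq 124 ]; then
    echo "BLOCKER-FAIL: step '$rs_label' exceeded ${STEP_LIMIT_SECONDS}s" >&2
  else
    echo "step '$rs_label' finished in ${rs_elapsed}s (exit $rs_status)" >&2
  fi
  return "$rs_status"
}

# Usage (commented): time a metadata refresh as one step.
#   run_step "refresh metadata" dnf makecache || exit 1
```

`timeout` can only run external commands, not shell functions, so a step implemented as a function would need to be wrapped in a script or invoked through `sh -c`.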
## Naming Rule
## Naming
- Base clone VM name in vCenter: `aw999-[source hostname without atvmxxx- prefix]`
- Before cloning, verify the clone VM name is not already in use.
- If already in use, append a numeric suffix to the base name: `-1`, `-2`, ... `-N` until an unused name is found.
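The naming rule can be sketched as two small functions. This is illustrative only: `clone_base_name`, `name_in_use`, and `next_clone_name` are hypothetical helpers, the `atvmxxx-` prefix is assumed to mean `atvm` followed by a host identifier and a hyphen, and the list of existing VM names is assumed to come from a vCenter inventory query that is not shown here.

```shell
# clone_base_name HOSTNAME (hypothetical helper)
# Strips a leading "atvm<id>-" prefix and prepends "aw999-".
clone_base_name() {
  printf 'aw999-%s\n' "${1#atvm*-}"
}

# name_in_use WANTED EXISTING... (hypothetical helper)
# Succeeds if WANTED appears among the EXISTING names.
name_in_use() {
  niu_wanted="$1"; shift
  for niu_name in "$@"; do
    [ "$niu_name" = "$niu_wanted" ] && return 0
  done
  return 1
}

# next_clone_name BASE EXISTING... (hypothetical helper)
# Prints BASE if unused, otherwise the first unused of BASE-1, BASE-2, ...
next_clone_name() {
  ncn_base="$1"; shift
  ncn_candidate="$ncn_base"
  ncn_n=0
  while name_in_use "$ncn_candidate" "$@"; do
    ncn_n=$((ncn_n + 1))
    ncn_candidate="$ncn_base-$ncn_n"
  done
  printf '%s\n' "$ncn_candidate"
}

# Usage (commented): existing names would come from vCenter inventory.
#   base="$(clone_base_name "atvm123-rhel9-test")"
#   next_clone_name "$base" "aw999-rhel9-test" "aw999-rhel9-test-1"
```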
@@ -83,14 +77,14 @@ Validate CMC behavior across staged kernel upgrades on a cloned VM, including re
- Do not power off, delete, or otherwise tear down the clone until the final latest-kernel migration/session validation is complete and recorded. The latest-kernel reboot or reinstall is not the end of the test.
## Execution Checklist
- Treat this checklist as the run ledger for the test. Figuratively check off the items in the checklist to ensure we do and confirm each step.
- Treat this checklist as the run ledger for the test; check each item as it is completed and confirmed.
- Do not skip ahead, collapse, or reorder checklist items.
- Do not begin teardown until every item below is checked complete.
- If any checklist item cannot be checked, stop the test and record the blocker.
- [ ] 0. Source `/home/cirrus/cds/.env.credentials.local` and verify required credential variables are present without printing secret values.
- [ ] 1. Confirm the requested source host is not a SUSE/SLES machine; if it is SUSE/SLES, hard stop before source power-on or clone creation.
- [ ] 2. Remove offline hosts in `skidamarink` using MCP offline-host cleanup.
- [ ] 2. Remove offline hosts in `skidamarink` using Cirrus Data MCP tools for offline-host cleanup.
- [ ] 3. From vCenter, confirm source host is powered on for the inspection phase; power it on if it is not already powered on.
- [ ] 4. From vCenter, query guest-tools for the live source host IP address.
- [ ] 5. SSH to the source host IP address found in step 4 using credentials from `/home/cirrus/cds/.env.credentials.local`.
@@ -122,7 +116,7 @@ Validate CMC behavior across staged kernel upgrades on a cloned VM, including re
- [ ] 31. If the clone still reports the previous static IP, fix config cleanup and repeat steps 26-30.
- [ ] 32. Continue all remaining steps using the live DHCP IP confirmed in step 30.
- [ ] 33. On the clone, wipe `/dev/sdb` once and verify no filesystem or partition signatures remain.
- [ ] 34. Using the cirrusdata skill, reinstall CMC on the clone in the `skidamarink` project with `-no-prebuilt-mtdi-nexus`.
- [ ] 34. Using the cirrusdata skill, install CMC on the clone in the `skidamarink` project with `-no-prebuilt-mtdi-nexus`.
- [ ] 35. Using the cirrusdata skill, create the first local migration from the 10 GB source disk to the 11 GB destination disk in the `skidamarink` project.
- [ ] 36. If migration session creation fails, hard stop as blocker-fail.
- [ ] 37. Using the cirrusdata skill, wait for initial sync completion in the `skidamarink` project.
@@ -141,8 +135,8 @@ Validate CMC behavior across staged kernel upgrades on a cloned VM, including re
- [ ] 50. On the clone, write sample data to the source 10 GB disk.
- [ ] 51. Using the cirrusdata skill, trigger sync and confirm tracking status in the `skidamarink` project.
- [ ] 52. Using the cirrusdata skill, uninstall CMC from the clone in the `skidamarink` project.
- [ ] 53. Using MCP, run host cleanup for `skidamarink` and remove the cloned host entry for this test clone only, regardless of online/offline status.
- [ ] 54. Using MCP, verify the cloned host entry and all migration sessions for the cloned host are gone from `skidamarink` before continuing.
- [ ] 53. Using Cirrus Data MCP tools, run host cleanup for `skidamarink` and remove the cloned host entry for this test clone only, regardless of online/offline status.
- [ ] 54. Using Cirrus Data MCP tools, verify the cloned host entry and all migration sessions for the cloned host are gone from `skidamarink` before continuing.
- [ ] 55. SSH to the live DHCP clone IP confirmed in step 30, refresh package metadata, and check available kernels again using the full distro candidate listing: RHEL/Oracle/Rocky/Alma: `dnf makecache; dnf list --showduplicates kernel kernel-devel kernel-headers`; older RHEL/CentOS: `yum makecache; yum list --showduplicates kernel kernel-devel kernel-headers`; Debian/Ubuntu: `rm -rf /var/lib/apt/lists/* && apt-get clean && apt-get update; apt-cache madison linux-image-generic linux-headers-generic; apt list -a linux-image-generic linux-headers-generic`.
- [ ] 56. Select the latest-upgrade target kernel from the filtered candidate list; it must stay in the same major OS family and should use the latest available candidate in that scope. If no valid latest-upgrade target exists, hard stop as blocker-fail.
- [ ] 57. On the clone, verify matching dev/header packages are available for the exact latest-upgrade target.
@@ -153,7 +147,7 @@ Validate CMC behavior across staged kernel upgrades on a cloned VM, including re
- [ ] 62. SSH to the rebooted clone IP found in step 61.
- [ ] 63. On the clone, verify kernel plus dev/header package versions match the selected latest-upgrade version.
- [ ] 64. If versions do not match exactly, stop as blocker-fail.
- [ ] 65. Using the cirrusdata skill, reinstall CMC on the clone in the `skidamarink` project with `-no-prebuilt-mtdi-nexus` on the latest kernel.
- [ ] 65. Using the cirrusdata skill, install CMC again on the clone in the `skidamarink` project with `-no-prebuilt-mtdi-nexus` on the latest kernel.
- [ ] 66. Using the cirrusdata skill, create the second local migration from the 10 GB source disk to the 11 GB destination disk in the `skidamarink` project and wait for initial sync completion.
- [ ] 67. If migration session creation fails, hard stop as blocker-fail.
- [ ] 68. Using the cirrusdata skill, confirm the machine is online in the `skidamarink` project.
@@ -161,23 +155,25 @@ Validate CMC behavior across staged kernel upgrades on a cloned VM, including re
- [ ] 70. Only after steps 65-69 all pass, begin success-path cleanup.
- [ ] 71. From vCenter, power off the cloned machine.
- [ ] 72. From vCenter, delete the cloned VM and its disks from inventory.
- [ ] 73. Using MCP, run final host cleanup for `skidamarink`, remove the cloned host entry for this test clone only, and verify the cloned host entry plus all migration sessions for the cloned host are gone.
- [ ] 73. Using Cirrus Data MCP tools, run final host cleanup for `skidamarink`, remove the cloned host entry for this test clone only, and verify the cloned host entry plus all migration sessions for the cloned host are gone.
- [ ] 74. Blocker-fail path after clone creation, as an alternate to steps 70-73: leave the cloned VM powered on and present in inventory for manual inspection.
- [ ] 75. Append the current run to the summary and results files with the required host metadata, kernel progression, execution summary, final outcome, and total test duration.
## Stop Conditions
Stop immediately and record a blocker if any of these occur:
- Requested source host is a SUSE/SLES machine.
- Cannot verify clone identity.
- Cannot detach required FC PCI adapters.
- Clone cannot be created on datastore `AutomatedTest-UnitTesting`.
- FC passthrough adapters remain attached after the detach/verification step.
- DHCP transition cannot be completed (clone remains static at `<INITIAL_CLONE_HOST_OR_IP>`).
- DHCP transition cannot be completed because the clone still reports the previous static IP after cleanup and retry.
- Kernel upgrade candidate criteria not met.
- Migration session creation failed (including API/service errors such as HTTP 5xx or equivalent backend unavailability).
- Any critical migration/service validation failure that blocks continuation.
## Per-Host Test Result Record
Use one cumulative results file and append one new section per tested host.
Use one cumulative results file and append one new section per tested host. Keep the record concise but complete enough to reconstruct the run.
### Host Metadata
- Test start time (UTC):
@@ -206,8 +202,8 @@ Use one cumulative results file and append one new section per tested host.
### Execution Summary (Short Bullets)
- Clone created / FC PCI detached: `PASS|FAIL` - notes
- Hostname/IP DHCP conversion: `PASS|FAIL` - notes
- CMC reinstall #1: `PASS|FAIL` - notes
- 10 GB source disk prep before first CMC install: `PASS|FAIL` - notes
- CMC reinstall #1: `PASS|FAIL` - notes
- Local migration #1 (10GB -> 11GB) initial sync: `PASS|FAIL` - notes
- Step-up kernel upgrade: `PASS|FAIL` - notes
- Step-up dev/header package match check: `PASS|FAIL` - notes
@@ -232,35 +228,25 @@ Use one cumulative results file and append one new section per tested host.
- Blocking issue summary:
- Follow-up actions:
## Timestamp Standard
- All recorded test timestamps must use UTC.
- Format: `YYYY-MM-DD HH:MM UTC`
## Result Storage Location
Store and append all per-host results in:
- `/home/aw/code/cds/tmp/tests/cmc upgrade test/cmc-upgrade-kernel-test-results.md`
Also generate a run summary file in the same directory:
- `/home/aw/code/cds/tmp/tests/cmc upgrade test/cmc-upgrade-kernel-test-summary.md`
## Artifact Recording Rule
- Always append the latest run outcome to the results file and summary file at the end of each run.
- Do this for `PASS`, `FAIL`, and `PARTIAL` outcomes.
## Result Artifacts
- Results file: `/home/cirrus/cds/tmp/tests/cmc upgrade test/cmc-upgrade-kernel-test-results.md`
- Summary file: `/home/cirrus/cds/tmp/tests/cmc upgrade test/cmc-upgrade-kernel-test-summary.md`
- Result artifacts under `tmp/` are local run records only and must not be committed.
- Always append the latest run outcome to both files for `PASS`, `FAIL`, and `PARTIAL` outcomes.
- Do not leave a completed test run only in conversation; the artifact files are the source of record.
- Include the total test runtime in both artifact files for every run.
- All recorded timestamps must use UTC format: `YYYY-MM-DD HH:MM UTC`.
- Record the UTC start time when the run begins.
- Record the UTC end time when the run reaches a terminal outcome and cleanup/reporting is complete.
- Compute `Test duration` from the recorded start/end timestamps and include it in both files.
- If a run is still in progress when first recorded, update the runtime once the run reaches its terminal outcome.
- Use the `Per-Host Test Result Record` format for the results file.
Summary file requirements:
- Start the file with the test file name line: `Test file: cmc-upgrade-kernel-test.md`
- Title: `CMC Upgrade Kernel Test Summary`
- Include test start time, test end time, and total test duration for the run
- Include a short workflow summary (current kernel -> install CMC -> kernel upgrade -> uninstall CMC -> kernel upgrade -> install CMC)
- Include a short run summary (current kernel -> first CMC install phase -> kernel upgrade -> CMC uninstall -> kernel upgrade -> second CMC install phase)
- Include host tested, kernel progression (start, step-up, latest), and overall result
- Start each run section with a `##` heading that includes the OS family and the final outcome, for example: `## Amazon Linux 2023 - PASS`.
- Put the OS version and the rest of the run details under that heading so the heading stays the visible OS label above the test snippet.
### Duration Rule
- Record the UTC start time when the run begins.
- Record the UTC end time when the run reaches a terminal outcome and cleanup/reporting is complete.
- Compute `Test duration` from the recorded start/end timestamps.
- Backfill `Test duration` into the summary and results artifacts for any run where both timestamps are known.
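Computing `Test duration` from the two recorded timestamps could look like the sketch below. It assumes GNU `date` (for the `-d` option) and the `YYYY-MM-DD HH:MM UTC` format required above; `test_duration` is a hypothetical helper, and the `Hh MMm` output format is an assumption, since the procedure does not fix one.

```shell
# test_duration START END (hypothetical helper)
# START and END use the format 'YYYY-MM-DD HH:MM UTC'.
# Prints the elapsed time as '<hours>h <minutes>m'.
test_duration() {
  td_start=$(date -u -d "$1" +%s) || return 1
  td_end=$(date -u -d "$2" +%s) || return 1
  td_minutes=$(( (td_end - td_start) / 60 ))
  printf '%dh %02dm\n' $((td_minutes / 60)) $((td_minutes % 60))
}

# Usage (commented): record both timestamps, then derive the duration.
#   start="$(date -u '+%Y-%m-%d %H:%M UTC')"
#   ... run the test ...
#   end="$(date -u '+%Y-%m-%d %H:%M UTC')"
#   echo "Test duration: $(test_duration "$start" "$end")"
```

Deriving the duration from the stored timestamps, rather than from a separate timer, keeps the value reproducible when it is backfilled into the artifacts later.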