cds-ai/tests/cmc-upgrade-kernel-test.md
2026-05-20 21:27:49 -04:00

# CMC Upgrade Kernel Test Template
## Purpose
Validate CMC behavior across staged kernel upgrades on a cloned VM, including reinstall, migration health, service health, and cleanup.
## Scope
- Run per source host provided by operator.
- Work only on the cloned VM created for this test.
- If the operator asks to run `tests/cmc-upgrade-kernel-test.md` or any variation of the "cmc upgrade kernel test" for an ATVM host, treat that request as referring to this file only.
- Treat this file as the source of truth for this test and ignore unrelated procedure references unless the operator explicitly asks to incorporate them for the current request.
## Inputs
- Source VM hostname: `<atvmxxx-...>`
- vCenter source VM inventory location: `<cluster/folder/resource context>`
- Required clone datastore: `AutomatedTest-UnitTesting`
- Default clone ESXi host: `CDS1-ESX165` / `192.168.1.165` unless the operator explicitly specifies otherwise
- Cirrus profile/project: `gcstage` / `skidamarink`
## Credentials
- Use credentials from: `/home/cirrus/cds/.env.credentials.local`
- Do not hardcode usernames/passwords in test records or commands.
- Before any vCenter, SSH, Red Hat subscription, or CMC action, load credentials with `set -a; source /home/cirrus/cds/.env.credentials.local; set +a`.
- Verify required credential variable names are present without printing secret values: `VCENTER_HOST`, `VCENTER_USER`, `VCENTER_PASSWORD`, `ATVM_TARGET_USER`, `ATVM_TARGET_PASSWORD`, `REDHAT_SUBSCRIPTION_USER`, `REDHAT_SUBSCRIPTION_PASSWORD`, `CMC_GCSTAGE_URL`, `CMC_GCSTAGE_REGISTRATION_CODE`, and `CIRRUS_API_TOKEN`.
- Do not treat `grep`/`awk` parsing of the credential file as authoritative; source the file and inspect the environment instead, because entries may use `export KEY=...`.
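
The presence check described above can be sketched roughly as follows. This is a minimal illustration, not part of the existing tooling: `check_required_vars` and `REQUIRED_VARS` are hypothetical names, and the helper prints only variable names, never values.

```shell
# Required variable names, copied from the Credentials list above.
REQUIRED_VARS="VCENTER_HOST VCENTER_USER VCENTER_PASSWORD
ATVM_TARGET_USER ATVM_TARGET_PASSWORD
REDHAT_SUBSCRIPTION_USER REDHAT_SUBSCRIPTION_PASSWORD
CMC_GCSTAGE_URL CMC_GCSTAGE_REGISTRATION_CODE CIRRUS_API_TOKEN"

# check_required_vars (hypothetical helper): prints the NAME of every unset
# or empty variable and returns 1 if any are missing. Values are never echoed.
check_required_vars() {
  missing=0
  for name in $REQUIRED_VARS; do
    # Indirect lookup: copy the variable named by $name into $value.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "MISSING: $name"
      missing=1
    fi
  done
  unset value
  return "$missing"
}

# Typical use on the runner (not executed here):
#   set -a; source /home/cirrus/cds/.env.credentials.local; set +a
#   check_required_vars || echo "blocker: credential variables missing"
```

Because the check runs against the sourced environment rather than the file text, it behaves the same whether entries are written as `KEY=...` or `export KEY=...`.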
## CMC Tooling
- For all CMC-related actions in this test, use the `cirrusdata` skill/CLI path.
- Exception: host cleanup is not handled by that skill yet; use the Cirrus Data MCP tools for offline-host cleanup and cloned-host cleanup.
- For every CMC install/reinstall command in this test, always include installer option: `-no-prebuilt-mtdi-nexus`.
## Kernel Package Rules
- For every planned kernel upgrade, verify matching development/header packages are available for the exact target kernel version before installing that kernel.
- On Red Hat-family systems, verify `kernel-devel-<target>` and `kernel-headers-<target>` availability (or documented distro-equivalent package names where applicable).
- The first kernel upgrade attempt must not use the latest kernel in the filtered candidate list; reserve the latest kernel for the final kernel-upgrade stage.
- When upgrading kernel versions, also upgrade/install the matching development/header packages for that same version.
- On Red Hat-family systems that use `grubby` (including Oracle Linux), explicitly set the selected kernel as the default before rebooting, then verify `grubby --default-kernel` returns the selected `/boot/vmlinuz-<target>` path. If the default does not match, stop before reboot as blocker-fail.
- After each kernel upgrade and reboot, verify running kernel version and installed dev/header package versions all match.
- If kernel and dev/header package versions are mismatched at any point, stop immediately as blocker-fail and do not continue with remediation by assumption.
- Before every kernel candidate discovery step, force a fresh package metadata refresh on the live host, then evaluate the available kernel builds. Use the distro command set in the checklist for RHEL-family and APT-based hosts. If the refreshed view differs from a prior result, trust the refreshed live metadata and record that the earlier view was stale.
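
As a rough sketch of the selection and match rules above, the helpers below split a candidate list into valid first-upgrade targets and the reserved latest target, then compare versions. All function names are hypothetical. The sketch assumes the input list is already filtered by the scope rule and contains only upgrade candidates, and it relies on GNU `sort -V`, whose ordering may differ from the package manager's own version comparison in edge cases.

```shell
# sort_candidates: reads version strings on stdin, one per line, and prints
# them de-duplicated in ascending version order (GNU sort -V assumed).
sort_candidates() {
  sort -V -u
}

# first_upgrade_candidates: every candidate except the latest, because the
# latest kernel is reserved for the final kernel-upgrade stage.
first_upgrade_candidates() {
  sort_candidates | sed '$d'
}

# latest_candidate: the newest candidate in the filtered list.
latest_candidate() {
  sort_candidates | tail -n 1
}

# versions_match TARGET VERSION...: succeeds only if every VERSION (running
# kernel, devel package, headers package) equals TARGET exactly.
versions_match() {
  target=$1; shift
  for v in "$@"; do
    [ "$v" = "$target" ] || return 1
  done
  return 0
}

# default_kernel_ok TARGET DEFAULT_PATH: succeeds only if DEFAULT_PATH is the
# /boot/vmlinuz-<target> path for TARGET.
default_kernel_ok() {
  [ "$2" = "/boot/vmlinuz-$1" ]
}

# Typical use on a grubby-based host (not executed here):
#   grubby --set-default "/boot/vmlinuz-$target"
#   default_kernel_ok "$target" "$(grubby --default-kernel)" || echo "blocker-fail"
```

If `first_upgrade_candidates` prints nothing, no valid non-latest first-upgrade target exists and the run should stop as blocker-fail.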
## Red Hat Preflight
- Apply this section only when the test target is an actual Red Hat subscription-managed machine and the run is manually executed.
- Do not apply this section to CentOS, Oracle Linux, Rocky, Alma, or other RHEL-derived distributions unless the operator explicitly says the machine should be treated as Red Hat-managed for this run.
- If the target is not actual RHEL, skip this preflight entirely and do not attempt `subscription-manager`.
- Do not apply this section to ATVM automation runs that already handle subscription flow.
- After sourcing credentials and before running test steps on Red Hat, run:
  - `subscription-manager remove --all`
  - `subscription-manager unregister`
  - `subscription-manager clean`
  - `subscription-manager register --username "$REDHAT_SUBSCRIPTION_USER" --password "$REDHAT_SUBSCRIPTION_PASSWORD"`
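
One way to guard the preflight against RHEL-derived distributions is to check the `ID` field of `/etc/os-release` first. The sketch below is an assumption about how that gate could look, not an existing helper; `is_actual_rhel` is a hypothetical name. It covers only the distro check, so the manual-run and operator-override conditions above still need to be decided separately.

```shell
# is_actual_rhel [FILE]: succeeds only when the os-release file reports
# ID=rhel. CentOS, Oracle Linux, Rocky, and Alma report their own IDs
# (centos, ol, rocky, almalinux) even though ID_LIKE may mention rhel.
is_actual_rhel() {
  os_release_file=${1:-/etc/os-release}
  [ -r "$os_release_file" ] || return 1
  # Source inside a subshell so os-release variables do not leak out.
  ( . "$os_release_file" && [ "${ID:-}" = "rhel" ] )
}

# Typical use on the target (not executed here):
#   if is_actual_rhel; then
#     subscription-manager remove --all
#     ...
#   fi
```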
## SUSE Exclusion
- Do not run this test against SUSE/SLES ATVM machines; stop before source power-on or clone creation and report that SUSE is excluded for this test.
- SUSE ATVM machines use a local offline DVD/vault repository for packages.
- Kernel upgrade discovery is not valid for this test unless the machine can access official SUSE repositories, which requires a SUSE subscription.
## Execution Mode
- Run this test in continuous execution mode.
- Do not pause for additional operator prompts between steps.
- Keep monitoring and continue automatically until the test reaches a terminal outcome (`PASS`, `FAIL`, or operator-directed `PARTIAL`) and all required cleanup/reporting steps are completed.
- Only stop early if a true blocker prevents safe continuation, and still complete required cleanup/reporting before returning control.
- Time every step explicitly.
- If any single step takes longer than 10 minutes, hard stop the test and treat it as a blocker-fail.
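
The timing rule could be enforced with a small wrapper along these lines. This is a sketch under stated assumptions: `run_timed_step` and `STEP_LIMIT_SECONDS` are hypothetical names, GNU coreutils `timeout` is assumed to be installed, and `timeout` can only wrap external commands, not shell functions.

```shell
STEP_LIMIT_SECONDS=600   # 10-minute limit per step

# run_timed_step LABEL COMMAND [ARGS...]: runs COMMAND under the step limit,
# prints the elapsed seconds, and returns the command's exit status.
# GNU timeout exits with status 124 when the limit is exceeded.
run_timed_step() {
  label=$1; shift
  start=$(date +%s)
  rc=0
  timeout "$STEP_LIMIT_SECONDS" "$@" || rc=$?
  end=$(date +%s)
  echo "step=\"$label\" seconds=$((end - start)) rc=$rc"
  if [ "$rc" -eq 124 ]; then
    echo "blocker-fail: step \"$label\" exceeded ${STEP_LIMIT_SECONDS}s" >&2
  fi
  return "$rc"
}
```

A caller would treat a return status of 124 as the hard-stop condition and any other non-zero status as an ordinary step failure.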
## Naming
- Base clone VM name in vCenter: `aw999-[source hostname without atvmxxx- prefix]`
- Before cloning, verify the clone VM name is not already in use.
- If already in use, append a numeric suffix to the base name: `-1`, `-2`, ... `-N` until an unused name is found.
- Use plain VM name only (no `/CDSHQ-Eng/vm/` prefix) for clone destination name, and set folder separately if needed.
- OS hostname on clone: same clone name but replace `.` with `-`
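
The naming rules above can be sketched as three small helpers. The function names are hypothetical, and the sketch assumes the `atvmxxx-` prefix is everything from `atvm` through the first hyphen. The list of existing VM names is passed in as arguments; how that list is obtained from vCenter is outside this sketch.

```shell
# base_clone_name SOURCE: strip the leading "atvm<id>-" prefix and prepend
# "aw999-". Assumption: the prefix ends at the first hyphen.
base_clone_name() {
  echo "aw999-${1#atvm*-}"
}

# name_in_list NEEDLE NAME...: succeeds if NEEDLE equals one of the names.
name_in_list() {
  needle=$1; shift
  for item in "$@"; do
    [ "$item" = "$needle" ] && return 0
  done
  return 1
}

# next_free_name BASE EXISTING...: prints BASE if unused, otherwise the
# first unused name among BASE-1, BASE-2, ... BASE-N.
next_free_name() {
  base=$1; shift
  candidate=$base
  n=0
  while name_in_list "$candidate" "$@"; do
    n=$((n + 1))
    candidate="$base-$n"
  done
  echo "$candidate"
}

# os_hostname CLONE_NAME: clone name with every "." replaced by "-".
os_hostname() {
  echo "$1" | tr '.' '-'
}
```

For a hypothetical source host `atvm123-rhel9.4` whose base name and `-1` suffix are already taken, these helpers would yield the clone name `aw999-rhel9.4-2` and the OS hostname `aw999-rhel9-4-2`.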
## Safety Rules
- Delete only the clone created for this test.
- If the clone is missing or identity is uncertain, stop and do not delete any other VM.
- If any blocker occurs after clone creation, stop the test and leave the cloned VM powered on for manual inspection.
- Do not delete or power off the clone on blocker-fail outcomes.
- Do not power off, delete, or otherwise tear down the clone until the final latest-kernel migration/session validation is complete and recorded. The latest-kernel reboot or reinstall is not the end of the test.
## Execution Checklist
- Treat this checklist as the run ledger for the test; check each item as it is completed and confirmed.
- Do not skip ahead, collapse, or reorder checklist items.
- Do not begin teardown until every item below is checked complete.
- If any checklist item cannot be checked, stop the test and record the blocker.
- [ ] 0. Source `/home/cirrus/cds/.env.credentials.local` and verify required credential variables are present without printing secret values.
- [ ] 1. Confirm the requested source host is not a SUSE/SLES machine; if it is SUSE/SLES, hard stop before source power-on or clone creation.
- [ ] 2. Remove offline hosts in `skidamarink` using Cirrus Data MCP tools for offline-host cleanup.
- [ ] 3. From vCenter, confirm source host is powered on for the inspection phase; power it on if it is not already powered on.
- [ ] 4. From vCenter, query guest-tools for the live source host IP address.
- [ ] 5. SSH to the source host IP address found in step 4 using credentials from `/home/cirrus/cds/.env.credentials.local`.
- [ ] 6. On the source host, inspect distro repository files before listing available kernel builds and hard stop if any enabled/source repo points at `192.168.3.199` (`/etc/yum.repos.d/*.repo`, `/etc/apt/sources.list`, `/etc/apt/sources.list.d/*`, `/etc/zypp/repos.d/*.repo`, or equivalent files present on the host).
- [ ] 7. On the source host, record the current OS version and running kernel version before cloning.
- [ ] 8. On the source host, refresh package metadata and build the kernel candidate list from all available versions using the distro command set: RHEL/Oracle/Rocky/Alma: `dnf makecache; dnf list --showduplicates kernel kernel-devel kernel-headers`; older RHEL/CentOS: `yum makecache; yum list --showduplicates kernel kernel-devel kernel-headers`; Debian/Ubuntu: `rm -rf /var/lib/apt/lists/* && apt-get clean && apt-get update; apt-cache madison linux-image-generic linux-headers-generic; apt list -a linux-image-generic linux-headers-generic`. Do not run this test for SUSE/SLES; step 1 must stop those hosts before this point. On Ubuntu, inspect the generic track first, then confirm candidate availability with alternate package listing methods if needed before deciding whether the generic track is usable.
- [ ] 9. Apply the candidate scope rule: same major OS family only, with same minor stream preferred.
- [ ] 10. Verify at least 2 upgrade candidates exist in the filtered candidate list.
- [ ] 11. If fewer than 2 candidates exist, hard stop and end the run before clone creation.
- [ ] 12. Confirm steps 6-11 passed; if any stop condition was hit, do not clone.
- [ ] 13. From vCenter, issue the source-host power-off request and wait for `poweredOff`.
- [ ] 14. From vCenter, confirm the source host is still `poweredOff` immediately before cloning.
- [ ] 15. Determine the base clone name `aw999-[source-without-atvmxxx-]`.
- [ ] 16. From vCenter, check whether the base clone name already exists.
- [ ] 17. If needed, choose the next available suffixed clone name using `aw999-[source-without-atvmxxx-]-1`, then `-2`, then `-N` as needed.
- [ ] 18. From vCenter, clone the source VM to datastore `AutomatedTest-UnitTesting` using the resolved clone VM name from steps 15-17. Pass only the plain clone VM name as the destination, and place the clone on ESXi host `CDS1-ESX165` / `192.168.1.165` unless the operator overrides it.
- [ ] 19. From vCenter, detach the 2 FC PCI adapters from the cloned VM.
- [ ] 20. From vCenter, verify both FC passthrough devices are no longer present on the clone.
- [ ] 21. From vCenter, power on the clone.
- [ ] 22. From vCenter, query guest-tools for the live clone IP.
- [ ] 23. SSH to the live clone IP found in step 22 using credentials from `/home/cirrus/cds/.env.credentials.local`.
- [ ] 24. On the clone, change the OS hostname to the clone name with `.` replaced by `-`.
- [ ] 25. On the clone, convert networking from static IP to DHCP.
- [ ] 26. On the clone, remove/clean static IP configuration references.
- [ ] 27. On the clone, reboot the machine.
- [ ] 28. From vCenter, query guest-tools again for the new live clone IP.
- [ ] 29. SSH to the new live clone IP found in step 28.
- [ ] 30. On the clone, verify DHCP state.
- [ ] 31. If the clone still reports the previous static IP, fix config cleanup and repeat steps 26-30.
- [ ] 32. Continue all remaining steps using the live DHCP IP confirmed in step 30.
- [ ] 33. On the clone, wipe `/dev/sdb` once and verify no filesystem or partition signatures remain.
- [ ] 34. Using the cirrusdata skill, install CMC on the clone in the `skidamarink` project with `-no-prebuilt-mtdi-nexus`.
- [ ] 35. Using the cirrusdata skill, create the first local migration from the 10 GB source disk to the 11 GB destination disk in the `skidamarink` project.
- [ ] 36. If migration session creation fails, hard stop as blocker-fail.
- [ ] 37. Using the cirrusdata skill, wait for initial sync completion in the `skidamarink` project.
- [ ] 38. SSH to the live DHCP clone IP confirmed in step 30, refresh package metadata, and check available kernels again using the full distro candidate listing: RHEL/Oracle/Rocky/Alma: `dnf makecache; dnf list --showduplicates kernel kernel-devel kernel-headers`; older RHEL/CentOS: `yum makecache; yum list --showduplicates kernel kernel-devel kernel-headers`; Debian/Ubuntu: `rm -rf /var/lib/apt/lists/* && apt-get clean && apt-get update; apt-cache madison linux-image-generic linux-headers-generic; apt list -a linux-image-generic linux-headers-generic`.
- [ ] 39. Select the first-upgrade target from the filtered candidate list; it must stay in the same major OS family and must not be the latest candidate. If no valid non-latest first-upgrade target exists, hard stop as blocker-fail.
- [ ] 40. On the clone, verify matching dev/header packages are available for the exact first-upgrade target.
- [ ] 41. On the clone, install the first-upgrade kernel and matching dev/header packages without rebooting yet.
- [ ] 42. On Red Hat-family systems with `grubby` including Oracle Linux, set the first-upgrade kernel as the grubby default and verify `grubby --default-kernel` returns the selected `/boot/vmlinuz-<target>` path before reboot.
- [ ] 43. On the clone, reboot into the first-upgrade kernel.
- [ ] 44. From vCenter, query guest-tools again for the live clone IP after reboot.
- [ ] 45. SSH to the rebooted clone IP found in step 44.
- [ ] 46. On the clone, verify kernel plus dev/header package versions match the selected first-upgrade version.
- [ ] 47. If versions do not match exactly, stop as blocker-fail.
- [ ] 48. Using the cirrusdata skill, verify the clone is online in the `skidamarink` project.
- [ ] 49. On the clone, verify MTDI and Galaxy Migrate services/driver are up.
- [ ] 50. On the clone, write sample data to the source 10 GB disk.
- [ ] 51. Using the cirrusdata skill, trigger sync and confirm tracking status in the `skidamarink` project.
- [ ] 52. Using the cirrusdata skill, uninstall CMC from the clone in the `skidamarink` project.
- [ ] 53. Using Cirrus Data MCP tools, run host cleanup for `skidamarink` and remove the cloned host entry for this test clone only, regardless of online/offline status.
- [ ] 54. Using Cirrus Data MCP tools, verify the cloned host entry and all migration sessions for the cloned host are gone from `skidamarink` before continuing.
- [ ] 55. SSH to the live clone IP found in step 44, refresh package metadata, and check available kernels again using the full distro candidate listing: RHEL/Oracle/Rocky/Alma: `dnf makecache; dnf list --showduplicates kernel kernel-devel kernel-headers`; older RHEL/CentOS: `yum makecache; yum list --showduplicates kernel kernel-devel kernel-headers`; Debian/Ubuntu: `rm -rf /var/lib/apt/lists/* && apt-get clean && apt-get update; apt-cache madison linux-image-generic linux-headers-generic; apt list -a linux-image-generic linux-headers-generic`.
- [ ] 56. Select the latest-upgrade target kernel from the filtered candidate list; it must stay in the same major OS family and should use the latest available candidate in that scope. If no valid latest-upgrade target exists, hard stop as blocker-fail.
- [ ] 57. On the clone, verify matching dev/header packages are available for the exact latest-upgrade target.
- [ ] 58. On the clone, install the latest-upgrade kernel and matching dev/header packages without rebooting yet.
- [ ] 59. On Red Hat-family systems with `grubby` including Oracle Linux, set the latest-upgrade kernel as the grubby default and verify `grubby --default-kernel` returns the selected `/boot/vmlinuz-<target>` path before reboot.
- [ ] 60. On the clone, reboot into the latest-upgrade kernel.
- [ ] 61. From vCenter, query guest-tools again for the live clone IP after reboot.
- [ ] 62. SSH to the rebooted clone IP found in step 61.
- [ ] 63. On the clone, verify kernel plus dev/header package versions match the selected latest-upgrade version.
- [ ] 64. If versions do not match exactly, stop as blocker-fail.
- [ ] 65. Using the cirrusdata skill, install CMC again on the clone in the `skidamarink` project with `-no-prebuilt-mtdi-nexus` on the latest kernel.
- [ ] 66. Using the cirrusdata skill, create the second local migration from the 10 GB source disk to the 11 GB destination disk in the `skidamarink` project and wait for initial sync completion.
- [ ] 67. If migration session creation fails, hard stop as blocker-fail.
- [ ] 68. Using the cirrusdata skill, confirm the machine is online in the `skidamarink` project.
- [ ] 69. SSH to the live clone IP currently reported by vCenter and verify MTDI and Galaxy Migrate services/driver are up.
- [ ] 70. Only after steps 65-69 all pass, begin success-path cleanup.
- [ ] 71. From vCenter, power off the cloned machine.
- [ ] 72. From vCenter, delete the cloned VM and its disks from inventory.
- [ ] 73. Using Cirrus Data MCP tools, run final host cleanup for `skidamarink`, remove the cloned host entry for this test clone only, and verify the cloned host entry plus all migration sessions for the cloned host are gone.
- [ ] 74. Blocker-fail path after clone creation, as an alternate to steps 70-73: leave the cloned VM powered on and present in inventory for manual inspection.
- [ ] 75. Append the current run to the summary and results files with the required host metadata, kernel progression, execution summary, final outcome, and total test duration.
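
The repository inspection in step 6 could be approximated with a scan like the one below. This is a conservative sketch, not the procedure's defined method: `find_blocked_repo_refs` is a hypothetical helper, and it flags any non-comment line that mentions the address, including repositories that are present but disabled, so a hit still needs a quick manual look at the `enabled` setting.

```shell
BLOCKED_REPO_IP="192.168.3.199"

# find_blocked_repo_refs FILE...: prints "file:line:text" for every
# non-comment line that mentions the blocked address. Succeeds (status 0)
# if at least one such line exists, fails otherwise.
find_blocked_repo_refs() {
  for f in "$@"; do
    # Unmatched globs arrive as literal patterns; skip anything not a file.
    [ -f "$f" ] || continue
    # -H file name, -n line number, -F literal match (dots are not regex).
    grep -H -n -F "$BLOCKED_REPO_IP" "$f" || true
  done | grep -v -E '^[^:]*:[0-9]+:[[:space:]]*#'
}

# Typical use on the source host (not executed here):
#   if find_blocked_repo_refs /etc/yum.repos.d/*.repo /etc/apt/sources.list \
#        /etc/apt/sources.list.d/* /etc/zypp/repos.d/*.repo; then
#     echo "hard stop: repository points at $BLOCKED_REPO_IP"
#   fi
```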
## Stop Conditions
Stop immediately and record a blocker if any of these occur:
- Requested source host is a SUSE/SLES machine.
- Cannot verify clone identity.
- Cannot detach required FC PCI adapters.
- Clone cannot be created on datastore `AutomatedTest-UnitTesting`.
- FC passthrough adapters remain attached after the detach/verification step.
- DHCP transition cannot be completed because the clone still reports the previous static IP after cleanup and retry.
- Kernel upgrade candidate criteria not met.
- Migration session creation failed (including API/service errors such as HTTP 5xx or equivalent backend unavailability).
- Any critical migration/service validation failure that blocks continuation.
## Per-Host Test Result Record
Use one cumulative results file and append one new section per tested host. Keep the record concise but complete enough to reconstruct the run.
### Host Metadata
- Test start time (UTC):
- Test end time (UTC):
- Test duration:
- Operator:
- Source VM:
- Cloned VM name:
- Clone origin (vCenter path/folder/cluster):
- Final DHCP IP of clone:
### Kernel / OS Tracking
- Start OS version:
- Start kernel version:
- Kernel list before first upgrade (full candidate list, filtered by scope rule):
- Kernel selected for step-up upgrade:
- Matching dev/header packages for step-up target (availability check):
- Kernel after step-up reboot:
- Installed dev/header package versions after step-up:
- Kernel list before latest upgrade (full candidate list, filtered by scope rule):
- Kernel selected for latest upgrade:
- Matching dev/header packages for latest target (availability check):
- Kernel after latest reboot:
- Installed dev/header package versions after latest upgrade:
### Execution Summary (Short Bullets)
- Clone created / FC PCI detached: `PASS|FAIL` - notes
- Hostname/IP DHCP conversion: `PASS|FAIL` - notes
- 10 GB source disk prep before first CMC install: `PASS|FAIL` - notes
- CMC install #1 (first install on the clone): `PASS|FAIL` - notes
- Local migration #1 (10GB -> 11GB) initial sync: `PASS|FAIL` - notes
- Step-up kernel upgrade: `PASS|FAIL` - notes
- Step-up dev/header package match check: `PASS|FAIL` - notes
- Online in skidamarink after step-up: `PASS|FAIL` - notes
- MTDI/Galaxy Migrate service+driver health after step-up: `PASS|FAIL` - notes
- Write data + tracking status: `PASS|FAIL` - notes
- CMC uninstall: `PASS|FAIL` - notes
- Latest kernel upgrade: `PASS|FAIL` - notes
- Latest dev/header package match check: `PASS|FAIL` - notes
- CMC reinstall #2: `PASS|FAIL` - notes
- Local migration #2 (10GB -> 11GB) initial sync: `PASS|FAIL` - notes
- Online in skidamarink after latest upgrade: `PASS|FAIL` - notes
- MTDI/Galaxy Migrate service+driver health after latest upgrade: `PASS|FAIL` - notes
- Clone power off and deletion (success path only): `PASS|FAIL|N/A` - notes
### Final Outcome
- Overall result: `PASS|FAIL|PARTIAL`
- Outcome interpretation:
  - `PASS`: full planned test flow completed and core validation goals passed (CMC install/uninstall/reinstall, kernel step-up/latest upgrade, and post-upgrade service/driver health checks), even if non-blocking warnings occurred.
  - `FAIL`: a true blocker prevented completion of required validation goals.
  - `PARTIAL`: use only when execution stops early by operator choice or scope is intentionally reduced, not for non-blocking warnings in a completed run.
- Blocking issue summary:
- Follow-up actions:
## Result Artifacts
- Results file: `/home/cirrus/cds/tmp/tests/cmc upgrade test/cmc-upgrade-kernel-test-results.md`
- Summary file: `/home/cirrus/cds/tmp/tests/cmc upgrade test/cmc-upgrade-kernel-test-summary.md`
- Result artifacts under `tmp/` are local run records only and must not be committed.
- Always append the latest run outcome to both files for `PASS`, `FAIL`, and `PARTIAL` outcomes.
- Do not leave a completed test run only in conversation; the artifact files are the source of record.
- All recorded timestamps must use UTC format: `YYYY-MM-DD HH:MM UTC`.
- Record the UTC start time when the run begins.
- Record the UTC end time when the run reaches a terminal outcome and cleanup/reporting is complete.
- Compute `Test duration` from the recorded start/end timestamps and include it in both files.
- If a run is first recorded while still in progress, update the end time and `Test duration` once the run reaches its terminal outcome.
- Use the `Per-Host Test Result Record` format for the results file.
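
The timestamp and duration rules above might be implemented roughly like this. The helper names are hypothetical, GNU `date` is assumed for the `-d` option, and the `Nh NNm` duration layout is an illustrative choice because this template does not prescribe a duration format.

```shell
# utc_now: current time in the required `YYYY-MM-DD HH:MM UTC` format.
utc_now() {
  date -u '+%Y-%m-%d %H:%M UTC'
}

# format_duration SECONDS: prints hours and zero-padded minutes.
format_duration() {
  total=$1
  printf '%dh %02dm\n' "$(( total / 3600 ))" "$(( total % 3600 / 60 ))"
}

# duration_between START END: both arguments use the utc_now format.
# The trailing " UTC" is stripped and -u makes date read the rest as UTC.
duration_between() {
  start_s=$(date -u -d "${1% UTC}" +%s)
  end_s=$(date -u -d "${2% UTC}" +%s)
  format_duration "$(( end_s - start_s ))"
}
```

Recording `utc_now` at the start and end of the run and passing both strings to `duration_between` gives a `Test duration` value that can be written to both artifact files.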
Summary file requirements:
- Start the file with the test file name line: `Test file: cmc-upgrade-kernel-test.md`
- Title: `CMC Upgrade Kernel Test Summary`
- Include test start time, test end time, and total test duration for the run
- Include a short run summary (current kernel -> first CMC install phase -> kernel upgrade -> CMC uninstall -> kernel upgrade -> second CMC install phase)
- Include host tested, kernel progression (start, step-up, latest), and overall result
- Start each run section with a `##` heading that includes the OS family and the final outcome, for example: `## Amazon Linux 2023 - PASS`.
- Put the OS version and the rest of the run details under that heading so the heading stays the visible OS label above the test snippet.
- Backfill `Test duration` into the summary and results artifacts for any run where both timestamps are known.