CMC Upgrade Kernel Test Template
Purpose
Validate CMC behavior across staged kernel upgrades on a cloned VM, including reinstall, migration health, service health, and cleanup.
Scope
- Run per source host provided by operator.
- Work only on the cloned VM created for this test.
Inputs
- Source VM hostname: <atvmxxx-...>
- vCenter target/source location: <cluster/datastore/folder>
- Required clone datastore: AutomatedTest-UnitTesting
- Initial clone access host/IP: <INITIAL_CLONE_HOST_OR_IP>
- SSH username variable: <SSH_USER_VAR>
- SSH password variable: <SSH_PASSWORD_VAR>
- Cirrus profile/project: gcstage/skidamarink
Credential Source
- Use credentials from: /home/aw/code/cds/.env.credentials.local
- Do not hardcode usernames/passwords in test records or commands.
CMC Tooling Rule (Global)
- For all CMC-related actions in this test, use the cirrusdata skill/CLI path.
- Exception: offline-host cleanup is not handled by that skill yet; use the MCP connection for offline-host removal.
- Apply this rule to every relevant step in this procedure.
- For every CMC install/reinstall command in this test, always include installer option: -no-prebuilt-mtdi-nexus.
Kernel Package Matching Rule (Global)
- For every planned kernel upgrade, verify matching development/header packages are available for the exact target kernel version before installing that kernel.
- On Red Hat-family systems, verify kernel-devel-<target> and kernel-headers-<target> availability (or documented distro-equivalent package names where applicable).
- The first kernel upgrade attempt must not use the latest kernel in the filtered candidate list; reserve the latest kernel for the final kernel-upgrade stage.
- When upgrading kernel versions, also upgrade/install the matching development/header packages for that same version.
- After each kernel upgrade and reboot, verify running kernel version and installed dev/header package versions all match.
- If kernel and dev/header package versions are mismatched at any point, stop immediately as blocker-fail and do not continue with remediation by assumption.
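The post-reboot match check in this rule could be sketched as follows. This is a minimal illustration, not the test's actual tooling: the function name kernel_versions_match is hypothetical, and the commented rpm queries assume a Red Hat-family guest where the package version string has the same form as the output of uname -r.

```shell
# Return 0 only when every given package version equals the running kernel.
# $1 = running kernel version; remaining arguments = installed package versions.
# (Hypothetical helper; the name is not part of the test template.)
kernel_versions_match() {
    running=$1
    shift
    for pkg_version in "$@"; do
        if [ "$pkg_version" != "$running" ]; then
            echo "blocker-fail: running=$running package=$pkg_version" >&2
            return 1
        fi
    done
    return 0
}

# Example wiring on a Red Hat-family guest (assumption, not executed here):
#   running=$(uname -r)
#   devel=$(rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' "kernel-devel-$running")
#   headers=$(rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' "kernel-headers-$running")
#   kernel_versions_match "$running" "$devel" "$headers" || exit 1
```

If a package is not installed, rpm prints an error string instead of a version, so the comparison fails and the run stops rather than continuing on an assumption.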
Red Hat Preflight (Global, Manual Tasks Only)
- Apply this section only when the test target is an actual Red Hat subscription-managed machine and the run is manually executed.
- Do not apply this section to CentOS, Oracle Linux, Rocky, Alma, or other RHEL-derived distributions unless the operator explicitly says the machine should be treated as Red Hat-managed for this run.
- If the target is not actual RHEL, skip this preflight entirely and do not attempt subscription-manager.
- Do not apply this section to ATVM automation runs that already handle subscription flow.
- Before running test steps on Red Hat, run:
  - subscription-manager remove --all
  - subscription-manager unregister
  - subscription-manager clean
  - subscription-manager register --username "$REDHAT_SUBSCRIPTION_USER" --password "$REDHAT_SUBSCRIPTION_PASSWORD"
- Source credentials from /home/aw/code/cds/.env.credentials.local.
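One way to gate the preflight on "actual RHEL" is to check the ID field in /etc/os-release, which is rhel on Red Hat Enterprise Linux and something else (for example rocky, centos, almalinux, ol) on derivatives. The sketch below assumes that check is acceptable for this test; the function names are hypothetical, and REDHAT_SUBSCRIPTION_USER and REDHAT_SUBSCRIPTION_PASSWORD are expected to be exported already from the credentials file.

```shell
# Return 0 only when the os-release file identifies the distro as RHEL itself.
# $1 = path to an os-release file (defaults to /etc/os-release).
is_actual_rhel() {
    os_release=${1:-/etc/os-release}
    # Source the file in a subshell so its variables do not leak into the caller.
    ( . "$os_release" && [ "$ID" = "rhel" ] )
}

# Run the preflight commands only on actual RHEL; skip everywhere else.
redhat_preflight() {
    if ! is_actual_rhel "${1:-/etc/os-release}"; then
        echo "not RHEL: skipping subscription preflight"
        return 0
    fi
    subscription-manager remove --all
    subscription-manager unregister
    subscription-manager clean
    subscription-manager register \
        --username "$REDHAT_SUBSCRIPTION_USER" \
        --password "$REDHAT_SUBSCRIPTION_PASSWORD"
}
```

This sketch does not cover the operator override (treating a derivative as Red Hat-managed) or the ATVM automation exclusion; those remain manual decisions.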
Execution Mode (Global)
- Run this test in continuous execution mode.
- Do not pause for additional operator prompts between steps.
- Keep monitoring and continue automatically until the test reaches a terminal outcome (PASS or FAIL) and all required cleanup/reporting steps are completed.
- Only stop early if a true blocker prevents safe continuation, and still complete required cleanup/reporting before returning control.
Naming Rule
- Base clone VM name in vCenter: aw999-[source hostname without atvmxxx- prefix]
- Before cloning, verify the clone VM name is not already in use.
- If already in use, append a numeric suffix to the base name: -1, -2, ... -N until an unused name is found.
- Use plain VM name only (no /CDSHQ-Eng/vm/ prefix) for clone destination name, and set folder separately if needed.
- OS hostname on clone: same clone name but replace . with -
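The naming rule can be sketched as three small shell helpers. The function names are hypothetical, and the sketch assumes the atvmxxx- prefix is everything from "atvm" up to and including the first hyphen. The list of existing VM names is passed in as text, since how it is fetched from vCenter is outside this template.

```shell
# Base clone name: drop the leading "atvm<id>-" prefix and prepend "aw999-".
clone_base_name() {
    printf 'aw999-%s\n' "${1#atvm*-}"
}

# OS hostname on the clone: the clone name with every "." replaced by "-".
clone_os_hostname() {
    printf '%s\n' "$1" | tr '.' '-'
}

# First unused name: the base itself, then base-1, base-2, ... base-N.
# $1 = base name; $2 = newline-separated VM names already present in vCenter.
resolve_clone_name() {
    base=$1
    existing=$2
    candidate=$base
    n=0
    while printf '%s\n' "$existing" | grep -Fxq -- "$candidate"; do
        n=$((n + 1))
        candidate="$base-$n"
    done
    printf '%s\n' "$candidate"
}
```

For a source named atvm123-ubuntu24.04 (an illustrative hostname), the base clone name is aw999-ubuntu24.04 and the matching OS hostname is aw999-ubuntu24-04.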
Safety Rules
- Delete only the clone created for this test.
- If the clone is missing or identity is uncertain, stop and do not delete any other VM.
- If any blocker occurs after clone creation, stop the test and leave the cloned VM powered on for manual inspection.
- Do not delete or power off the clone on blocker-fail outcomes.
- After source-host kernel inspection is complete, power the source VM off and re-verify in vCenter that it is powered off before cloning.
- Detaching the 2 FC PCI passthrough adapters from the cloned VM is mandatory before any guest boot or guest-side change.
- Verify in vCenter that both FC passthrough devices are absent before proceeding past the clone-prep stage.
- Always use live vCenter guest-tools data to confirm the current clone IP before any SSH or polling attempt.
- Re-check live vCenter guest-tools IP after clone power-on, after switching networking from static to DHCP, and after any reboot before attempting SSH.
- Do not assume the previous IP is still valid after a reboot or network change.
- Cleanup actions that remove hosts from CMC must target only the cloned host used in the current test run.
- Treat migration session creation failures (for either migration #1 or migration #2) as blocker-fail events.
Test Procedure
- Remove offline hosts in skidamarink using MCP offline-host cleanup.
- Confirm source host is powered on for the inspection phase. If it is powered off, power it on.
- SSH to the source host and check available kernel versions on the source before cloning.
- Build source-host kernel candidate list from all available versions (include intermediate versions, not just the latest from check-update).
- Candidate scope rule:
  - Include only kernels in the same major OS family as the current machine (no major-version upgrades).
  - Prefer candidates within the same minor stream as current OS/kernel when available.
- Verify at least 2 upgrade candidates exist in the filtered candidate list.
- If fewer than 2 candidates: hard stop and end run before clone creation.
- Gate check:
  - If the candidate-count check above triggered a stop condition, execute no further steps.
  - If no stop condition was triggered, continue with the next step.
- After source-host inspection is complete, power the source VM off.
- Confirm in vCenter that the source host is powered off before cloning.
- Determine base clone name: aw999-[source-without-atvmxxx-].
- Before cloning, check whether that clone name already exists in vCenter.
- If the name exists, choose the next available suffixed name: aw999-[source-without-atvmxxx-]-1, then -2, then -N as needed.
- Clone source VM using the resolved unique clone name on datastore AutomatedTest-UnitTesting only.
- For the clone command destination name, pass only the VM name (for example aw999-ubuntu24.04-1), not an inventory path like /CDSHQ-Eng/vm/...; set folder separately if needed.
- Detach the 2 FC PCI adapters from the cloned VM.
- Verify in vCenter that both FC passthrough devices are no longer present on the clone.
- Power on clone.
- Query vCenter guest-tools for the live clone IP.
- SSH to the live clone IP using credentials from /home/aw/code/cds/.env.credentials.local.
- Change OS hostname to clone name, replacing . with -.
- Convert networking from static IP to DHCP.
- Remove/clean static IP configuration references.
- Reboot clone.
- Query vCenter guest-tools again for the new live clone IP.
- SSH to the new live clone IP and verify the DHCP state.
- If the clone still reports the previous static IP, fix static config cleanup and repeat reboot/verify.
- Continue all remaining steps using the live DHCP IP from vCenter and credentials from /home/aw/code/cds/.env.credentials.local.
- Before the first CMC install, wipe the 10GB source disk by zeroing its first 32 MiB with dd if=/dev/zero of=/dev/sdb bs=1M count=32 status=progress conv=fsync, then verify that no filesystem or partition signatures remain (wipefs -n /dev/sdb, blkid /dev/sdb, file -s /dev/sdb, lsblk -f /dev/sdb). This disk prep is one-time only and must not be repeated in later stages of the test.
- Using cirrusdata (gcstage, project skidamarink), reinstall CMC on clone, always adding -no-prebuilt-mtdi-nexus.
- Create local migration from 10GB source disk to 11GB destination disk using cirrusdata.
- If migration session creation fails (including API/service errors such as 5xx), hard stop as blocker-fail.
- Wait for initial sync completion.
- Check available kernels again using full candidate listing (not latest-only output).
- Select first-upgrade target from filtered candidate list (same major; same minor preferred), ensuring it is not the latest candidate.
- Verify matching dev/header packages for the selected first-upgrade target are available.
- Install selected first-upgrade kernel and matching dev/header packages, then reboot.
- Query vCenter guest-tools again for the live clone IP after reboot.
- SSH to the rebooted clone via the live vCenter IP and verify running kernel and installed dev/header package versions match the selected first-upgrade version.
- If versions do not match exactly, stop as blocker-fail.
- After reboot, verify clone is online in skidamarink using cirrusdata.
- SSH to clone and verify MTDI, Galaxy Migrate services/driver are up.
- Write sample data to source 10GB disk.
- Trigger sync and confirm tracking status using cirrusdata.
- Uninstall CMC.
- Post-uninstall cleanup checkpoint:
  - Run MCP offline-host cleanup for skidamarink.
  - If the cloned VM is still marked online after uninstall, remove that cloned VM host entry specifically via MCP (target only this test clone host).
  - Because CMC status can lag behind VM state, poll briefly for status transition; if still online, perform targeted MCP host removal for the tested clone.
- Check available kernels.
- Select latest-upgrade target kernel from the filtered candidate list (same major required; same minor preferred).
- Verify matching dev/header packages for the selected latest-upgrade target are available.
- Install selected latest-upgrade kernel and matching dev/header packages, then reboot.
- Query vCenter guest-tools again for the live clone IP after reboot.
- SSH to the rebooted clone via the live vCenter IP and verify running kernel and installed dev/header package versions match the selected latest-upgrade version.
- If versions do not match exactly, stop as blocker-fail.
- Reinstall CMC via cirrusdata (gcstage, skidamarink), always adding -no-prebuilt-mtdi-nexus.
- Create a local migration (10GB -> 11GB) via cirrusdata and wait for initial sync completion.
- If migration session creation fails (including API/service errors such as 5xx), hard stop as blocker-fail.
- Confirm machine is online in skidamarink using cirrusdata.
- SSH and verify MTDI, Galaxy Migrate services/driver are up.
- Success-path cleanup only: power off cloned machine.
- Success-path cleanup only: delete cloned VM and its disks from vCenter inventory.
- Success-path final cleanup checkpoint:
  - Run MCP offline-host cleanup for skidamarink.
  - If the cloned VM is still marked online at the end of the test, remove that cloned VM host entry specifically via MCP (target only this test clone host).
  - Because CMC status can lag behind VM deletion/power-off, wait/poll briefly first; if still online, perform targeted MCP host removal for the tested clone.
- Blocker-fail path after clone creation:
  - Stop test immediately after recording failure details.
  - Leave cloned VM powered on and present in inventory for manual inspection.
  - Do not run clone power-off/delete steps in blocker-fail path.
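The kernel target selection used in the procedure above (at least 2 candidates, first upgrade not the latest, final upgrade the latest) could be sketched like this. The function name pick_kernel_targets is hypothetical. The sketch assumes the input list has already been filtered by the candidate scope rule, that GNU sort with -V is available, and it adopts one possible policy for the step-up target: the oldest candidate newer than the current kernel.

```shell
# stdin: scope-filtered candidate kernel versions, one per line.
# $1 = currently running kernel version (same format as the candidates).
# Prints "stepup=<version>" and "latest=<version>"; returns 1 when fewer
# than 2 upgrade candidates exist (hard stop before clone creation).
pick_kernel_targets() {
    current=$1
    # Add the current version to the list, version-sort it, and keep only
    # the entries that sort after the current version.
    newer=$( { cat; printf '%s\n' "$current"; } | sort -u -V |
        awk -v cur="$current" 'seen { print } ($0 "") == (cur "") { seen = 1 }')
    count=$(printf '%s\n' "$newer" | grep -c .)
    if [ "$count" -lt 2 ]; then
        echo "stop: fewer than 2 upgrade candidates ($count found)" >&2
        return 1
    fi
    printf 'stepup=%s\n' "$(printf '%s\n' "$newer" | head -n 1)"
    printf 'latest=%s\n' "$(printf '%s\n' "$newer" | tail -n 1)"
}
```

Because at least two newer candidates are required, the step-up target can never equal the latest target. The package availability check for each target still has to be done separately.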
Stop Conditions
- Cannot verify clone identity.
- Cannot detach required FC PCI adapters.
- Clone cannot be created on datastore AutomatedTest-UnitTesting.
- FC passthrough adapters remain attached after the detach/verification step.
- DHCP transition cannot be completed (clone remains static at <INITIAL_CLONE_HOST_OR_IP>).
- Kernel upgrade candidate criteria not met.
- Migration session creation failed (including API/service errors such as HTTP 5xx or equivalent backend unavailability).
- Any critical migration/service validation failure that blocks continuation.
Per-Host Test Result Record
Use one cumulative results file and append one new section per tested host.
Host Metadata
- Test date/time (UTC):
- Operator:
- Source VM:
- Cloned VM name:
- Clone origin (vCenter path/folder/cluster):
- Final DHCP IP of clone:
Kernel / OS Tracking
- Start OS version:
- Start kernel version:
- Kernel list before first upgrade (full candidate list, filtered by scope rule):
- Kernel selected for step-up upgrade:
- Matching dev/header packages for step-up target (availability check):
- Kernel after step-up reboot:
- Installed dev/header package versions after step-up:
- Kernel list before latest upgrade (full candidate list, filtered by scope rule):
- Kernel selected for latest upgrade:
- Matching dev/header packages for latest target (availability check):
- Kernel after latest reboot:
- Installed dev/header package versions after latest upgrade:
Execution Summary (Short Bullets)
- Clone created / FC PCI detached: PASS|FAIL - notes
- Hostname/IP DHCP conversion: PASS|FAIL - notes
- CMC reinstall #1: PASS|FAIL - notes
- 10 GB source disk prep before first CMC install: PASS|FAIL - notes
- Local migration #1 (10GB -> 11GB) initial sync: PASS|FAIL - notes
- Step-up kernel upgrade: PASS|FAIL - notes
- Step-up dev/header package match check: PASS|FAIL - notes
- Online in skidamarink after step-up: PASS|FAIL - notes
- MTDI/Galaxy Migrate service+driver health after step-up: PASS|FAIL - notes
- Write data + tracking status: PASS|FAIL - notes
- CMC uninstall: PASS|FAIL - notes
- Latest kernel upgrade: PASS|FAIL - notes
- Latest dev/header package match check: PASS|FAIL - notes
- CMC reinstall #2: PASS|FAIL - notes
- Local migration #2 (10GB -> 11GB) initial sync: PASS|FAIL - notes
- Online in skidamarink after latest upgrade: PASS|FAIL - notes
- MTDI/Galaxy Migrate service+driver health after latest upgrade: PASS|FAIL - notes
- Clone power off and deletion (success path only): PASS|FAIL|N/A - notes
Final Outcome
- Overall result: PASS|FAIL|PARTIAL
- Outcome interpretation:
  - PASS: full planned test flow completed and core validation goals passed (CMC install/uninstall/reinstall, kernel step-up/latest upgrade, and post-upgrade service/driver health checks), even if non-blocking warnings occurred.
  - FAIL: a true blocker prevented completion of required validation goals.
  - PARTIAL: use only when execution stops early by operator choice or scope is intentionally reduced, not for non-blocking warnings in a completed run.
- Blocking issue summary:
- Follow-up actions:
Timestamp Standard
- All recorded test timestamps must use UTC.
- Format: YYYY-MM-DD HH:MM UTC
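A timestamp in this format can be produced with date in UTC mode. The wrapper name utc_timestamp is only illustrative.

```shell
# Print the current time in the record format: YYYY-MM-DD HH:MM UTC.
# "date -u" forces UTC regardless of the host's local timezone setting.
utc_timestamp() {
    date -u '+%Y-%m-%d %H:%M UTC'
}
```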
Result Storage Location
Store and append all per-host results in:
/home/aw/code/cds/tmp/tests/cmc upgrade test/cmc-upgrade-kernel-test-results.md
Also generate a run summary file in the same directory:
/home/aw/code/cds/tmp/tests/cmc upgrade test/cmc-upgrade-kernel-test-summary.md
Artifact Recording Rule
- Always append the latest run outcome to the results file and summary file at the end of each run.
- Do this for PASS, FAIL, and PARTIAL outcomes.
- Do not leave a completed test run only in conversation; the artifact files are the source of record.
Summary file requirements:
- Title: CMC Upgrade Kernel Test Summary
- Include UTC date/time for the run
- Include a short workflow summary (current kernel -> install CMC -> kernel upgrade -> uninstall CMC -> kernel upgrade -> install CMC)
- Include host tested, kernel progression (start, step-up, latest), and overall result
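Appending a run outcome to the results file might look like the sketch below. The function name append_run_outcome and the exact section layout are assumptions; only the field labels come from this template. The storage path contains spaces, so every expansion of it must stay quoted.

```shell
# Append one per-host section to a cumulative results file.
# $1 = results file path, $2 = cloned VM name, $3 = PASS, FAIL, or PARTIAL.
append_run_outcome() {
    results_file=$1
    host=$2
    outcome=$3
    case $outcome in
        PASS|FAIL|PARTIAL) ;;
        *) echo "invalid outcome: $outcome" >&2; return 1 ;;
    esac
    # Create the target directory if this is the first recorded run.
    mkdir -p "$(dirname "$results_file")"
    # ">>" appends, so earlier host sections are never overwritten.
    {
        printf '\nHost: %s\n' "$host"
        printf -- '- Test date/time (UTC): %s\n' "$(date -u '+%Y-%m-%d %H:%M UTC')"
        printf -- '- Cloned VM name: %s\n' "$host"
        printf -- '- Overall result: %s\n' "$outcome"
    } >> "$results_file"
}

# Example call (not executed here):
#   append_run_outcome "/home/aw/code/cds/tmp/tests/cmc upgrade test/cmc-upgrade-kernel-test-results.md" "$CLONE_NAME" PASS
```

CLONE_NAME in the example call is a placeholder variable. A real record would also carry the kernel tracking and execution summary fields from the template.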