# CMC Upgrade Kernel Test Template

## Purpose

Validate CMC behavior across staged kernel upgrades on a cloned VM, including reinstall, migration health, service health, and cleanup.

## Scope

- Run once per source host provided by the operator.
- Work only on the cloned VM created for this test.

## Inputs

- Source VM hostname: `<atvmxxx-...>`
- vCenter target/source location: `<cluster/datastore/folder>`
- Required clone datastore: `AutomatedTest-UnitTesting`
- Initial clone access host/IP: `<INITIAL_CLONE_HOST_OR_IP>`
- SSH username variable: `<SSH_USER_VAR>`
- SSH password variable: `<SSH_PASSWORD_VAR>`
- Cirrus profile/project: `gcstage` / `skidamarink`

## Credential Source

- Use credentials from: `/home/aw/code/cds/.env.credentials.local`
- Do not hardcode usernames/passwords in test records or commands.

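
One way to keep credentials out of commands and records is to load them at run time and refer to them only by variable name. The sketch below assumes the credentials file uses plain `KEY=VALUE` lines (dotenv style), which this template does not specify. The helpers `parse_credentials`, `load_credentials`, and `redact` are hypothetical names, not part of any existing tooling.

```python
from pathlib import Path


def parse_credentials(text: str) -> dict:
    """Parse KEY=VALUE lines (assumed dotenv-style format) into a dict."""
    creds = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        # Drop optional surrounding quotes from the value.
        creds[key] = value.strip().strip("'\"")
    return creds


def load_credentials(path: str) -> dict:
    """Read the credentials file from disk and parse it."""
    return parse_credentials(Path(path).read_text())


def redact(message: str, creds: dict) -> str:
    """Replace any credential value in a message with its variable name."""
    for key, value in creds.items():
        if value:
            message = message.replace(value, f"<{key}>")
    return message
```

Passing every log line or test-record entry through `redact` before writing it is one possible guard against leaking a value that was loaded from the file.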
## CMC Tooling Rule (Global)

- For all CMC-related actions in this test, use the `cirrusdata` skill/CLI path.
- Exception: offline-host cleanup is not handled by that skill yet; use the MCP connection for offline-host removal.
- Apply this rule to every relevant step in this procedure.
- For every CMC install/reinstall command in this test, always include the installer option `-no-prebuilt-mtdi-nexus`.

## Kernel Package Matching Rule (Global)

- For every planned kernel upgrade, verify that matching development/header packages are available for the exact target kernel version before installing that kernel.
- On Red Hat-family systems, verify `kernel-devel-<target>` and `kernel-headers-<target>` availability (or documented distro-equivalent package names where applicable).
- The first kernel upgrade attempt must not use the latest kernel in the filtered candidate list; reserve the latest kernel for the final kernel-upgrade stage.
- When upgrading kernel versions, also upgrade/install the matching development/header packages for that same version.
- After each kernel upgrade and reboot, verify that the running kernel version and the installed dev/header package versions all match.
- If kernel and dev/header package versions are mismatched at any point, stop immediately as blocker-fail; do not attempt remediation based on assumptions.

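
The post-reboot match check can be sketched as a pure comparison, shown here under the assumption that the caller has already collected the running kernel string and the set of installed package names from the guest. The package naming follows the `kernel-devel-<target>` / `kernel-headers-<target>` pattern given above; `required_packages` and `find_mismatches` are hypothetical helper names, and the version strings in the usage note are illustrative values only.

```python
def required_packages(target: str) -> list:
    """Package names that must accompany the target kernel (Red Hat-family naming)."""
    return [f"kernel-devel-{target}", f"kernel-headers-{target}"]


def find_mismatches(running: str, target: str, installed: set) -> list:
    """Return a list of problems; an empty list means everything matches."""
    problems = []
    if running != target:
        problems.append(f"running kernel {running} != target {target}")
    for package in required_packages(target):
        if package not in installed:
            problems.append(f"missing package {package}")
    return problems
```

Any non-empty result corresponds to the blocker-fail case: for example, a `running` value of `5.14.0-427.13.1.el9_4.x86_64` against a `target` of `5.14.0-427.18.1.el9_4.x86_64` yields one problem entry, and the run stops there.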
## Red Hat Preflight (Global, Manual Tasks Only)

- Apply this section only when the test target is an actual Red Hat subscription-managed machine and the run is manually executed.
- Do not apply this section to CentOS, Oracle Linux, Rocky, Alma, or other RHEL-derived distributions unless the operator explicitly says the machine should be treated as Red Hat-managed for this run.
- If the target is not actual RHEL, skip this preflight entirely and do not attempt `subscription-manager`.
- Do not apply this section to ATVM automation runs that already handle subscription flow.
- Before running test steps on Red Hat, run the following commands in order:
  - `subscription-manager remove --all`
  - `subscription-manager unregister`
  - `subscription-manager clean`
  - `subscription-manager register --username "$REDHAT_SUBSCRIPTION_USER" --password "$REDHAT_SUBSCRIPTION_PASSWORD"`
- Read `REDHAT_SUBSCRIPTION_USER` and `REDHAT_SUBSCRIPTION_PASSWORD` from `/home/aw/code/cds/.env.credentials.local`.

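
The applicability rules above reduce to a small decision, sketched here under the assumption that the contents of `/etc/os-release` have already been read from the target and that actual RHEL reports `ID=rhel` there. The function names and the `operator_override` flag are hypothetical; they only illustrate the gating logic, not an existing tool.

```python
def os_release_id(os_release_text: str) -> str:
    """Extract the ID field from /etc/os-release content (empty string if absent)."""
    for line in os_release_text.splitlines():
        line = line.strip()
        # "ID_LIKE=" does not start with "ID=", so it is not picked up here.
        if line.startswith("ID="):
            return line[len("ID="):].strip().strip("'\"").lower()
    return ""


def should_run_preflight(os_release_text: str, manual_run: bool,
                         operator_override: bool = False) -> bool:
    """Apply the preflight only to manual runs on actual RHEL (or by explicit operator request)."""
    if not manual_run:
        return False  # ATVM automation runs handle the subscription flow themselves
    return os_release_id(os_release_text) == "rhel" or operator_override
```

A derivative such as Rocky Linux reports a different `ID` value, so the preflight is skipped for it unless the operator explicitly asks for Red Hat-managed treatment.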
## Execution Mode (Global)

- Run this test in continuous execution mode.
- Do not pause for additional operator prompts between steps.
- Keep monitoring and continue automatically until the test reaches a terminal outcome (`PASS` or `FAIL`) and all required cleanup/reporting steps are completed.
- Only stop early if a true blocker prevents safe continuation, and still complete required cleanup/reporting before returning control.

## Naming Rule

- Base clone VM name in vCenter: `aw999-[source hostname without atvmxxx- prefix]`
- Before cloning, verify the clone VM name is not already in use.
- If already in use, append a numeric suffix to the base name: `-1`, `-2`, ... `-N` until an unused name is found.
- Use the plain VM name only (no `/CDSHQ-Eng/vm/` prefix) for the clone destination name, and set the folder separately if needed.
- OS hostname on clone: same as the clone name, with every `.` replaced by `-`.

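
The naming rule can be sketched as three small functions. This is a minimal illustration that assumes the `atvmxxx-` prefix is the literal text `atvm` followed by any characters up to the first hyphen, and that the set of existing VM names has already been fetched from vCenter; the function names and the sample hostname are hypothetical.

```python
import re

# Assumed shape of the source prefix: "atvm", then anything up to the first "-".
SOURCE_PREFIX = re.compile(r"^atvm[^-]*-")


def base_clone_name(source_hostname: str) -> str:
    """Build the base clone name: aw999- plus the source name without its prefix."""
    return "aw999-" + SOURCE_PREFIX.sub("", source_hostname, count=1)


def resolve_clone_name(source_hostname: str, existing_names: set) -> str:
    """Return the base name, or the first free -1, -2, ... -N variant."""
    base = base_clone_name(source_hostname)
    if base not in existing_names:
        return base
    suffix = 1
    while f"{base}-{suffix}" in existing_names:
        suffix += 1
    return f"{base}-{suffix}"


def os_hostname(clone_name: str) -> str:
    """Derive the guest OS hostname by replacing every '.' with '-'."""
    return clone_name.replace(".", "-")
```

For a hypothetical source named `atvm123-ubuntu24.04` with `aw999-ubuntu24.04` already taken, this resolves to `aw999-ubuntu24.04-1`, and the guest hostname becomes `aw999-ubuntu24-04-1`.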
## Safety Rules

- Delete only the clone created for this test.
- If the clone is missing or identity is uncertain, stop and do not delete any other VM.
- If any blocker occurs after clone creation, stop the test and leave the cloned VM powered on for manual inspection.
- Do not delete or power off the clone on blocker-fail outcomes.
- After source-host kernel inspection is complete, power the source VM off and re-verify in vCenter that it is powered off before cloning.
- Detaching the 2 FC PCI passthrough adapters from the cloned VM is mandatory before any guest boot or guest-side change.
- Verify in vCenter that both FC passthrough devices are absent before proceeding past the clone-prep stage.
- Always use live vCenter guest-tools data to confirm the current clone IP before any SSH or polling attempt.
- Re-check live vCenter guest-tools IP after clone power-on, after switching networking from static to DHCP, and after any reboot before attempting SSH.
- Do not assume the previous IP is still valid after a reboot or network change.
- Cleanup actions that remove hosts from CMC must target only the cloned host used in the current test run.
- Treat migration session creation failures (for either migration #1 or migration #2) as blocker-fail events.

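
The "always re-check the live IP" rule can be sketched as a bounded polling loop. The example assumes a caller-supplied `get_ip` callable that wraps whatever vCenter guest-tools query is in use and returns `None` while no address is reported; that callable, the function name, and the default attempt count and delay are all hypothetical.

```python
import time
from typing import Callable, Optional


def wait_for_live_ip(get_ip: Callable[[], Optional[str]],
                     attempts: int = 30,
                     delay_s: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_ip() until it reports an address, or raise after the last attempt."""
    for attempt in range(attempts):
        ip = get_ip()
        if ip:
            return ip
        # Wait between attempts, but not after the final one.
        if attempt < attempts - 1:
            sleep(delay_s)
    raise TimeoutError(f"no guest-tools IP reported after {attempts} attempts")
```

Calling this after every power-on, network change, and reboot, and using only its return value for SSH, avoids relying on a previously seen address.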
## Test Procedure

1. Remove offline hosts in `skidamarink` using MCP offline-host cleanup.
2. Confirm source host is powered on for the inspection phase. If it is powered off, power it on.
3. SSH to the source host and check available kernel versions on the source before cloning.
4. Build source-host kernel candidate list from all available versions (include intermediate versions, not just the latest from `check-update`).
5. Candidate scope rule:
    - Include only kernels in the same major OS family as the current machine (no major-version upgrades).
    - Prefer candidates within the same minor stream as current OS/kernel when available.
6. Verify at least 2 upgrade candidates exist in the filtered candidate list.
7. If fewer than 2 candidates: hard stop and end run before clone creation.
8. Gate check:
    - If step 7 triggered a stop condition, execute no further steps.
    - If no stop condition was triggered, continue with the next step.
9. After source-host inspection is complete, power the source VM off.
10. Confirm in vCenter that the source host is powered off before cloning.
11. Determine base clone name: `aw999-[source-without-atvmxxx-]`.
12. Before cloning, check whether that clone name already exists in vCenter.
13. If the name exists, choose the next available suffixed name: `aw999-[source-without-atvmxxx-]-1`, then `-2`, then `-N` as needed.
14. Clone source VM using the resolved unique clone name on datastore `AutomatedTest-UnitTesting` only.
15. For the clone command destination name, pass only the VM name (for example `aw999-ubuntu24.04-1`), not an inventory path like `/CDSHQ-Eng/vm/...`; set folder separately if needed.
16. Detach the 2 FC PCI adapters from the cloned VM.
17. Verify in vCenter that both FC passthrough devices are no longer present on the clone.
18. Power on clone.
19. Query vCenter guest-tools for the live clone IP.
20. SSH to the live clone IP using credentials from `/home/aw/code/cds/.env.credentials.local`.
21. Change OS hostname to clone name, replacing `.` with `-`.
22. Convert networking from static IP to DHCP.
23. Remove/clean static IP configuration references.
24. Reboot clone.
25. Query vCenter guest-tools again for the new live clone IP.
26. SSH to the new live clone IP and verify the DHCP state.
27. If the clone still reports the previous static IP, fix static config cleanup and repeat reboot/verify.
28. Continue all remaining steps using the live DHCP IP from vCenter and credentials from `/home/aw/code/cds/.env.credentials.local`.
29. Before the first CMC install, wipe the 10GB source disk with `dd if=/dev/zero of=/dev/sdb bs=1M count=32 status=progress conv=fsync`, then verify that no filesystem or partition signatures remain (`wipefs -n /dev/sdb`, `blkid /dev/sdb`, `file -s /dev/sdb`, `lsblk -f /dev/sdb`). This disk prep is one-time only and must not be repeated in later stages of the test.
30. Using `cirrusdata` (`gcstage`, project `skidamarink`), reinstall CMC on clone, always adding `-no-prebuilt-mtdi-nexus`.
31. Create local migration from 10GB source disk to 11GB destination disk using `cirrusdata`.
32. If migration session creation fails (including API/service errors such as 5xx), hard stop as blocker-fail.
33. Wait for initial sync completion.
34. Check available kernels again using full candidate listing (not latest-only output).
35. Select first-upgrade target from filtered candidate list (same major; same minor preferred), ensuring it is not the latest candidate.
36. Verify matching dev/header packages for the selected first-upgrade target are available.
37. Install selected first-upgrade kernel and matching dev/header packages, then reboot.
38. Query vCenter guest-tools again for the live clone IP after reboot.
39. SSH to the rebooted clone via the live vCenter IP and verify running kernel and installed dev/header package versions match the selected first-upgrade version.
40. If versions do not match exactly, stop as blocker-fail.
41. After reboot, verify clone is online in `skidamarink` using `cirrusdata`.
42. SSH to clone and verify that the MTDI and Galaxy Migrate services/driver are up.
43. Write sample data to source 10GB disk.
44. Trigger sync and confirm tracking status using `cirrusdata`.
45. Uninstall CMC.
46. Post-uninstall cleanup checkpoint:
    - Run MCP offline-host cleanup for `skidamarink`.
    - If the cloned VM is still marked online after uninstall, remove that cloned VM host entry specifically via MCP (target only this test clone host).
    - Because CMC status can lag behind VM state, poll briefly for status transition; if still online, perform targeted MCP host removal for the tested clone.
47. Check available kernels again using full candidate listing (not latest-only output).
48. Select latest-upgrade target kernel from the filtered candidate list (same major required; same minor preferred).
49. Verify matching dev/header packages for the selected latest-upgrade target are available.
50. Install selected latest-upgrade kernel and matching dev/header packages, then reboot.
51. Query vCenter guest-tools again for the live clone IP after reboot.
52. SSH to the rebooted clone via the live vCenter IP and verify running kernel and installed dev/header package versions match the selected latest-upgrade version.
53. If versions do not match exactly, stop as blocker-fail.
54. Reinstall CMC via `cirrusdata` (`gcstage`, `skidamarink`), always adding `-no-prebuilt-mtdi-nexus`.
55. Create a local migration (10GB -> 11GB) via `cirrusdata` and wait for initial sync completion.
56. If migration session creation fails (including API/service errors such as 5xx), hard stop as blocker-fail.
57. Confirm machine is online in `skidamarink` using `cirrusdata`.
58. SSH and verify that the MTDI and Galaxy Migrate services/driver are up.
59. Success-path cleanup only: power off cloned machine.
60. Success-path cleanup only: delete cloned VM and its disks from vCenter inventory.
61. Success-path final cleanup checkpoint:
    - Run MCP offline-host cleanup for `skidamarink`.
    - If the cloned VM is still marked online at the end of the test, remove that cloned VM host entry specifically via MCP (target only this test clone host).
    - Because CMC status can lag behind VM deletion/power-off, wait/poll briefly first; if still online, perform targeted MCP host removal for the tested clone.
62. Blocker-fail path after clone creation:
    - Stop test immediately after recording failure details.
    - Leave cloned VM powered on and present in inventory for manual inspection.
    - Do not run clone power-off/delete steps in blocker-fail path.

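
The candidate scope rule, the two-candidate gate, and the first/latest target selection from the procedure above can be sketched together. This is a rough illustration under several assumptions: version strings carry a Red Hat-style `.elN_M` tag that identifies the OS major and minor stream, ordering by the numeric fields before that tag is good enough (it is a simplification, not RPM's real version comparison), and the first target is taken as the lowest newer candidate, which the procedure does not mandate beyond "not the latest". All function names and sample versions are hypothetical.

```python
import re

EL_TAG = re.compile(r"\.el(\d+)(?:_(\d+))?")


def el_stream(version):
    """Return (major, minor) from an `.elN_M` tag, or None if the tag is absent."""
    match = EL_TAG.search(version)
    return (match.group(1), match.group(2)) if match else None


def sort_key(version):
    """Order by the numeric fields before the `.el` tag (simplified ordering)."""
    return tuple(int(n) for n in re.findall(r"\d+", version.split(".el")[0]))


def filter_candidates(current, available):
    """Keep newer kernels in the same OS major; prefer the same minor stream."""
    stream = el_stream(current)
    if stream is None:
        raise ValueError(f"cannot determine OS major from {current!r}")
    major, minor = stream
    newer = sorted(
        {
            v for v in available
            if (el_stream(v) or (None, None))[0] == major
            and sort_key(v) > sort_key(current)
        },
        key=sort_key,
    )
    same_minor = [v for v in newer if el_stream(v)[1] == minor]
    # Fall back to the whole major when the minor stream has too few candidates.
    return same_minor if len(same_minor) >= 2 else newer


def plan_upgrades(current, available):
    """Return (first_target, latest_target), or raise if the gate check fails."""
    candidates = filter_candidates(current, available)
    if len(candidates) < 2:
        raise RuntimeError("fewer than 2 upgrade candidates: hard stop before clone creation")
    return candidates[0], candidates[-1]
```

Because the gate requires at least two distinct candidates, the first target returned here can never be the latest one, which keeps the latest kernel reserved for the final upgrade stage.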
## Stop Conditions

- Cannot verify clone identity.
- Cannot detach required FC PCI adapters.
- Clone cannot be created on datastore `AutomatedTest-UnitTesting`.
- FC passthrough adapters remain attached after the detach/verification step.
- DHCP transition cannot be completed (clone remains static at `<INITIAL_CLONE_HOST_OR_IP>`).
- Kernel upgrade candidate criteria not met.
- Migration session creation failed (including API/service errors such as HTTP 5xx or equivalent backend unavailability).
- Any critical migration/service validation failure that blocks continuation.

## Per-Host Test Result Record

Use one cumulative results file and append one new section per tested host.

### Host Metadata

- Test date/time (UTC):
- Operator:
- Source VM:
- Cloned VM name:
- Clone origin (vCenter path/folder/cluster):
- Final DHCP IP of clone:

### Kernel / OS Tracking

- Start OS version:
- Start kernel version:
- Kernel list before first upgrade (full candidate list, filtered by scope rule):
- Kernel selected for step-up upgrade:
- Matching dev/header packages for step-up target (availability check):
- Kernel after step-up reboot:
- Installed dev/header package versions after step-up:
- Kernel list before latest upgrade (full candidate list, filtered by scope rule):
- Kernel selected for latest upgrade:
- Matching dev/header packages for latest target (availability check):
- Kernel after latest reboot:
- Installed dev/header package versions after latest upgrade:

### Execution Summary (Short Bullets)

- Clone created / FC PCI detached: `PASS|FAIL` - notes
- Hostname/IP DHCP conversion: `PASS|FAIL` - notes
- 10GB source disk prep before first CMC install: `PASS|FAIL` - notes
- CMC reinstall #1: `PASS|FAIL` - notes
- Local migration #1 (10GB -> 11GB) initial sync: `PASS|FAIL` - notes
- Step-up kernel upgrade: `PASS|FAIL` - notes
- Step-up dev/header package match check: `PASS|FAIL` - notes
- Online in `skidamarink` after step-up: `PASS|FAIL` - notes
- MTDI/Galaxy Migrate service+driver health after step-up: `PASS|FAIL` - notes
- Write data + tracking status: `PASS|FAIL` - notes
- CMC uninstall: `PASS|FAIL` - notes
- Latest kernel upgrade: `PASS|FAIL` - notes
- Latest dev/header package match check: `PASS|FAIL` - notes
- CMC reinstall #2: `PASS|FAIL` - notes
- Local migration #2 (10GB -> 11GB) initial sync: `PASS|FAIL` - notes
- Online in `skidamarink` after latest upgrade: `PASS|FAIL` - notes
- MTDI/Galaxy Migrate service+driver health after latest upgrade: `PASS|FAIL` - notes
- Clone power off and deletion (success path only): `PASS|FAIL|N/A` - notes

### Final Outcome

- Overall result: `PASS|FAIL|PARTIAL`
- Outcome interpretation:
  - `PASS`: full planned test flow completed and core validation goals passed (CMC install/uninstall/reinstall, kernel step-up/latest upgrade, and post-upgrade service/driver health checks), even if non-blocking warnings occurred.
  - `FAIL`: a true blocker prevented completion of required validation goals.
  - `PARTIAL`: use only when execution stops early by operator choice or scope is intentionally reduced, not for non-blocking warnings in a completed run.
- Blocking issue summary:
- Follow-up actions:

## Timestamp Standard

- All recorded test timestamps must use UTC.
- Format: `YYYY-MM-DD HH:MM UTC`

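
As a small illustration of the timestamp format, the sketch below converts a timezone-aware `datetime` to UTC and renders it as `YYYY-MM-DD HH:MM UTC`. The function name `format_utc` is hypothetical; rejecting naive datetimes is a design choice made here so that a local time is never silently recorded as UTC.

```python
from datetime import datetime, timezone


def format_utc(moment: datetime) -> str:
    """Render a timezone-aware datetime as 'YYYY-MM-DD HH:MM UTC'."""
    if moment.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")


def now_utc() -> str:
    """Current time in the record format."""
    return format_utc(datetime.now(timezone.utc))
```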
## Result Storage Location

Store and append all per-host results in:

- `/home/aw/code/cds/tmp/tests/cmc upgrade test/cmc-upgrade-kernel-test-results.md`

Also generate a run summary file in the same directory:

- `/home/aw/code/cds/tmp/tests/cmc upgrade test/cmc-upgrade-kernel-test-summary.md`

Summary file requirements:

- Title: `CMC Upgrade Kernel Test Summary`
- Include UTC date/time for the run
- Include a short workflow summary (current kernel -> install CMC -> kernel upgrade -> uninstall CMC -> kernel upgrade -> install CMC)
- Include host tested, kernel progression (start, step-up, latest), and overall result

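
The summary file requirements can be sketched as a small renderer that returns the Markdown text to write. The layout (a level-one heading followed by bullets), the field labels, and the function name `render_summary` are assumptions made for illustration; only the title, the workflow line, and the required fields come from the list above.

```python
WORKFLOW = ("current kernel -> install CMC -> kernel upgrade -> "
            "uninstall CMC -> kernel upgrade -> install CMC")
VALID_RESULTS = ("PASS", "FAIL", "PARTIAL")


def render_summary(run_time_utc: str, host: str, start_kernel: str,
                   step_up_kernel: str, latest_kernel: str, result: str) -> str:
    """Build the run summary as Markdown text."""
    if result not in VALID_RESULTS:
        raise ValueError(f"result must be one of {VALID_RESULTS}")
    lines = [
        "# CMC Upgrade Kernel Test Summary",
        "",
        f"- Run date/time: {run_time_utc}",
        f"- Workflow: {WORKFLOW}",
        f"- Host tested: {host}",
        f"- Kernel progression: {start_kernel} -> {step_up_kernel} -> {latest_kernel}",
        f"- Overall result: {result}",
    ]
    return "\n".join(lines) + "\n"
```

Since the target directory name contains spaces, any code or command that writes this text should pass the path as a single quoted argument or a path object rather than building it by string splitting.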