docs(test): rename CMC upgrade test and keep clone on blocker failures

2026-05-12 18:07:58 -04:00
parent 7477d18cff
commit 8025db4ea6


@@ -0,0 +1,187 @@
# CMC Upgrade Kernel Test Template
## Purpose
Validate CMC behavior across staged kernel upgrades on a cloned VM, including reinstall, migration health, service health, and cleanup.
## Scope
- Run per source host provided by operator.
- Work only on the cloned VM created for this test.
## Inputs
- Source VM hostname: `<atvmxxx-...>`
- vCenter target/source location: `<cluster/datastore/folder>`
- Required clone datastore: `AutomatedTest-UnitTesting`
- Initial clone access host/IP: `<INITIAL_CLONE_HOST_OR_IP>`
- SSH username variable: `<SSH_USER_VAR>`
- SSH password variable: `<SSH_PASSWORD_VAR>`
- Cirrus profile/project: `gcstage` / `skidamarink`
## Credential Source
- Use credentials from: `/home/aw/code/cds/.env.credentials.local`
- Do not hardcode usernames/passwords in test records or commands.
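
A minimal sketch of reading the credentials file at run time instead of embedding values in commands or records. It assumes the file holds plain `KEY=VALUE` lines (optionally prefixed with `export`), which this template does not state; the function names and the variable names in the usage note are hypothetical.

```python
from pathlib import Path

# Path taken from this template; the file format is an assumption.
CREDENTIALS_FILE = Path("/home/aw/code/cds/.env.credentials.local")


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def load_credentials(path=CREDENTIALS_FILE):
    """Read the credentials file and return its variables as a dict."""
    return parse_env_text(Path(path).read_text())
```

For example, `load_credentials()["SSH_USER"]` would return the SSH username if the file defined a variable of that (hypothetical) name. Record only the variable names in test results, never the values.
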
## CMC Tooling Rule (Global)
- For all CMC-related actions in this test, use the `cirrusdata` skill/CLI path.
- Exception: offline-host cleanup is not handled by that skill yet; use the MCP connection for offline-host removal.
- Apply this rule to every relevant step in this procedure.
## Red Hat Preflight (Global, Manual Tasks Only)
- Apply this section when the test target is a Red Hat machine and the run is manually executed.
- Do not apply this section to ATVM automation runs that already handle subscription flow.
- Before running test steps on Red Hat, run:
  - `subscription-manager remove --all`
  - `subscription-manager unregister`
  - `subscription-manager clean`
  - `subscription-manager register --username "$REDHAT_SUBSCRIPTION_USER" --password "$REDHAT_SUBSCRIPTION_PASSWORD"`
- Source credentials from `/home/aw/code/cds/.env.credentials.local`.
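
The preflight sequence above could be scripted roughly as follows. This is a sketch that assumes it runs as root on the Red Hat target with `REDHAT_SUBSCRIPTION_USER` and `REDHAT_SUBSCRIPTION_PASSWORD` already exported. The function names are hypothetical. The three cleanup commands may exit non-zero on a system that is not currently registered, so the sketch only enforces the exit status of the final `register`; treat that choice as an assumption, not a requirement of this template.

```python
import os
import subprocess


def redhat_preflight_commands(user, password):
    """Return the four preflight commands, in the order listed above."""
    return [
        ["subscription-manager", "remove", "--all"],
        ["subscription-manager", "unregister"],
        ["subscription-manager", "clean"],
        ["subscription-manager", "register",
         "--username", user, "--password", password],
    ]


def run_redhat_preflight(env=os.environ):
    """Run the preflight; only the final 'register' must succeed."""
    commands = redhat_preflight_commands(
        env["REDHAT_SUBSCRIPTION_USER"],
        env["REDHAT_SUBSCRIPTION_PASSWORD"],
    )
    for argv in commands[:-1]:
        # Cleanup steps: tolerate failures (assumption, see lead-in).
        subprocess.run(argv, check=False)
    subprocess.run(commands[-1], check=True)
```
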
## Execution Mode (Global)
- Run this test in continuous execution mode.
- Do not pause for additional operator prompts between steps.
- Keep monitoring and continue automatically until the test reaches a terminal outcome (`PASS` or `FAIL`) and all required cleanup/reporting steps are completed.
- Only stop early if a true blocker prevents safe continuation, and still complete required cleanup/reporting before returning control.
## Naming Rule
- Base clone VM name in vCenter: `aw999-[source hostname without atvmxxx- prefix]`
- Before cloning, verify the clone VM name is not already in use.
- If already in use, append a numeric suffix to the base name: `-1`, `-2`, ... `-N` until an unused name is found.
- Use plain VM name only (no `/CDSHQ-Eng/vm/` prefix) for clone destination name, and set folder separately if needed.
- OS hostname on clone: the clone name with every `.` replaced by `-` (for example, `aw999-ubuntu24.04-1` becomes `aw999-ubuntu24-04-1`).
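
The naming rule can be sketched as three small helpers. The regular expression for the `atvmxxx-` prefix is an assumption (it strips `atvm`, any non-hyphen characters, and the first hyphen). The function names are hypothetical, and `existing_names` stands in for whatever vCenter inventory lookup the run uses.

```python
import re

# Assumed shape of the source prefix: "atvm" + anything up to the first "-".
ATVM_PREFIX = re.compile(r"^atvm[^-]*-")


def base_clone_name(source_hostname):
    """aw999- plus the source hostname without its atvmxxx- prefix."""
    return "aw999-" + ATVM_PREFIX.sub("", source_hostname, count=1)


def resolve_clone_name(source_hostname, existing_names):
    """Return the base name, or the first unused -1, -2, ... -N variant."""
    base = base_clone_name(source_hostname)
    if base not in existing_names:
        return base
    suffix = 1
    while f"{base}-{suffix}" in existing_names:
        suffix += 1
    return f"{base}-{suffix}"


def clone_os_hostname(clone_name):
    """OS hostname: the clone name with every '.' replaced by '-'."""
    return clone_name.replace(".", "-")
```

With a source named `atvm123-ubuntu24.04` (a made-up example) and `aw999-ubuntu24.04` already present in vCenter, `resolve_clone_name` yields `aw999-ubuntu24.04-1`, and `clone_os_hostname` turns that into `aw999-ubuntu24-04-1`.
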
## Safety Rules
- Delete only the clone created for this test.
- If the clone is missing or identity is uncertain, stop and do not delete any other VM.
- If any blocker occurs after clone creation, stop the test and leave the cloned VM powered on for manual inspection.
- Do not delete or power off the clone on blocker-fail outcomes.
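
One way to keep these rules from being applied inconsistently is to route every end-of-run decision through a single function. The sketch below is illustrative only; the enum, its members, and the three boolean inputs are hypothetical names, not part of any existing tooling.

```python
from enum import Enum


class CleanupAction(Enum):
    NOTHING = "no clone was created; nothing to clean up"
    LEAVE_RUNNING = "leave the clone powered on for manual inspection"
    STOP = "stop; do not delete any VM"
    DELETE_CLONE = "power off and delete the clone created for this test"


def decide_cleanup(clone_created, identity_verified, blocker_failed):
    """Map the end-of-run state to the only cleanup action allowed."""
    if not clone_created:
        return CleanupAction.NOTHING
    if blocker_failed:
        # Blocker after clone creation: never power off or delete.
        return CleanupAction.LEAVE_RUNNING
    if not identity_verified:
        # Missing or uncertain clone identity: delete nothing at all.
        return CleanupAction.STOP
    return CleanupAction.DELETE_CLONE
```

Deletion is reachable only when the clone exists, its identity is verified, and no blocker occurred, which matches the success-path-only cleanup in the procedure.
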
## Test Procedure
1. Remove offline hosts in `skidamarink` using MCP offline-host cleanup.
2. Confirm source host is powered on. If it is powered off, power it on.
3. SSH to the source host and check available kernel versions on the source before cloning.
4. Build source-host kernel candidate list from all available versions (include intermediate versions, not just the latest from `check-update`).
5. Candidate scope rule:
    - Include only kernels in the same major OS family as the current machine (no major-version upgrades).
    - Prefer candidates within the same minor stream as current OS/kernel when available.
6. Verify at least 2 upgrade candidates exist in the filtered candidate list.
7. If fewer than 2 candidates exist, hard stop and end the run before clone creation.
8. Gate check:
    - If step 7 triggered a stop condition, execute no further steps.
    - If no stop condition was triggered, continue with the next step.
9. Confirm source host is powered off (required pre-clone state). If it is still powered on from step 2, shut it down before cloning.
10. Determine base clone name: `aw999-[source-without-atvmxxx-]`.
11. Before cloning, check whether that clone name already exists in vCenter.
12. If the name exists, choose the next available suffixed name: `aw999-[source-without-atvmxxx-]-1`, then `-2`, then `-N` as needed.
13. Clone source VM using the resolved unique clone name on datastore `AutomatedTest-UnitTesting` only.
14. For the clone command destination name, pass only the VM name (for example `aw999-ubuntu24.04-1`), not an inventory path like `/CDSHQ-Eng/vm/...`; set folder separately if needed.
15. Detach the 2 FC PCI adapters from the cloned VM.
16. Power on clone.
17. SSH to `<INITIAL_CLONE_HOST_OR_IP>` using credentials from `/home/aw/code/cds/.env.credentials.local`.
18. Change OS hostname to clone name, replacing `.` with `-`.
19. Convert networking from static IP to DHCP.
20. Remove/clean static IP configuration references.
21. Reboot clone.
22. Find DHCP address and verify it is not `<INITIAL_CLONE_HOST_OR_IP>`.
23. If still `<INITIAL_CLONE_HOST_OR_IP>`, fix static config cleanup and repeat reboot/verify.
24. Continue all remaining steps using DHCP IP and credentials from `/home/aw/code/cds/.env.credentials.local`.
25. Using `cirrusdata` (`gcstage`, project `skidamarink`), reinstall CMC on clone.
26. Create local migration from 10GB source disk to 11GB destination disk using `cirrusdata`.
27. Wait for initial sync completion.
28. Check available kernels again using full candidate listing (not latest-only output).
29. Select upgrade target one step above current kernel from the filtered candidate list (same major; same minor preferred).
30. Install selected kernel and reboot.
31. After reboot, verify clone is online in `skidamarink` using `cirrusdata`.
32. SSH to clone and verify that the MTDI and Galaxy Migrate services/driver are up.
33. Write sample data to source 10GB disk.
34. Trigger sync and confirm tracking status using `cirrusdata`.
35. Uninstall CMC.
36. Post-uninstall cleanup checkpoint:
    - Run MCP offline-host cleanup for `skidamarink`.
    - If the cloned VM is still marked online after uninstall, remove that cloned VM host entry specifically.
37. Check available kernels again using the full candidate listing (not latest-only output).
38. Select latest-upgrade target kernel from the filtered candidate list (same major required; same minor preferred).
39. Upgrade to selected latest target kernel and reboot.
40. Reinstall CMC via `cirrusdata` (`gcstage`, `skidamarink`).
41. Create a local migration (10GB -> 11GB) via `cirrusdata` and wait for initial sync completion.
42. Confirm machine is online in `skidamarink` using `cirrusdata`.
43. SSH to clone and verify that the MTDI and Galaxy Migrate services/driver are up.
44. Success-path cleanup only: power off cloned machine.
45. Success-path cleanup only: delete cloned VM and its disks from vCenter inventory.
46. Success-path final cleanup checkpoint:
    - Run MCP offline-host cleanup for `skidamarink`.
    - If the cloned VM is still marked online at the end of the test, remove that cloned VM host entry specifically.
47. Blocker-fail path after clone creation:
    - Stop test immediately after recording failure details.
    - Leave cloned VM powered on and present in inventory for manual inspection.
    - Do not run clone power-off/delete steps in blocker-fail path.
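
The kernel candidate logic in steps 4 to 7, 29, and 38 can be sketched as below. Several things here are assumptions: the caller supplies each kernel's OS major and minor stream (how to derive them is distro-specific), the ordering key is a naive digit-based comparison rather than real rpm or dpkg version ordering, and "prefer same minor" is read as "use the same-minor subset when it has at least 2 entries, otherwise fall back to all same-major candidates". All names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    version: str   # raw version string as reported by the package manager
    os_major: int  # OS major release the kernel belongs to (caller-supplied)
    os_minor: int  # OS minor stream the kernel belongs to (caller-supplied)


def version_key(version):
    """Naive ordering key: every run of digits in the string, as integers."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def filter_candidates(current, available):
    """Upgrade candidates in ascending order, filtered by the scope rule."""
    newer = sorted(
        (k for k in available
         if k.os_major == current.os_major
         and version_key(k.version) > version_key(current.version)),
        key=lambda k: version_key(k.version),
    )
    same_minor = [k for k in newer if k.os_minor == current.os_minor]
    # Assumed reading of "prefer same minor when available".
    return same_minor if len(same_minor) >= 2 else newer


def plan_upgrades(current, available):
    """Return (step_up, latest) or raise if fewer than 2 candidates exist."""
    candidates = filter_candidates(current, available)
    if len(candidates) < 2:
        raise RuntimeError("fewer than 2 upgrade candidates: stop before cloning")
    return candidates[0], candidates[-1]
```

In the actual procedure the candidate list is rebuilt after the step-up reboot (step 37), so `filter_candidates` would be called again with the new current kernel before choosing the latest target.
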
## Stop Conditions
- Cannot verify clone identity.
- Cannot detach required FC PCI adapters.
- Clone cannot be created on datastore `AutomatedTest-UnitTesting`.
- DHCP transition cannot be completed (clone remains static at `<INITIAL_CLONE_HOST_OR_IP>`).
- Kernel upgrade candidate criteria not met.
- Any critical migration/service validation failure that blocks continuation.
## Per-Host Test Result Record
Use one cumulative results file and append one new section per tested host.
### Host Metadata
- Test date/time (UTC):
- Operator:
- Source VM:
- Cloned VM name:
- Clone origin (vCenter path/folder/cluster):
- Final DHCP IP of clone:
### Kernel / OS Tracking
- Start OS version:
- Start kernel version:
- Kernel list before first upgrade (full candidate list, filtered by scope rule):
- Kernel selected for step-up upgrade:
- Kernel after step-up reboot:
- Kernel list before latest upgrade (full candidate list, filtered by scope rule):
- Kernel selected for latest upgrade:
- Kernel after latest reboot:
### Execution Summary (Short Bullets)
- Clone created / FC PCI detached: `PASS|FAIL` - notes
- Hostname/IP DHCP conversion: `PASS|FAIL` - notes
- CMC reinstall #1: `PASS|FAIL` - notes
- Local migration #1 (10GB -> 11GB) initial sync: `PASS|FAIL` - notes
- Step-up kernel upgrade: `PASS|FAIL` - notes
- Online in skidamarink after step-up: `PASS|FAIL` - notes
- MTDI/Galaxy Migrate service+driver health after step-up: `PASS|FAIL` - notes
- Write data + tracking status: `PASS|FAIL` - notes
- CMC uninstall: `PASS|FAIL` - notes
- Latest kernel upgrade: `PASS|FAIL` - notes
- CMC reinstall #2: `PASS|FAIL` - notes
- Local migration #2 (10GB -> 11GB) initial sync: `PASS|FAIL` - notes
- Online in skidamarink after latest upgrade: `PASS|FAIL` - notes
- MTDI/Galaxy Migrate service+driver health after latest upgrade: `PASS|FAIL` - notes
- Clone power off and deletion (success path only): `PASS|FAIL|N/A` - notes
### Final Outcome
- Overall result: `PASS|FAIL|PARTIAL`
- Outcome interpretation:
  - `PASS`: full planned test flow completed and core validation goals passed (CMC install/uninstall/reinstall, kernel step-up/latest upgrade, and post-upgrade service/driver health checks), even if non-blocking warnings occurred.
  - `FAIL`: a true blocker prevented completion of required validation goals.
  - `PARTIAL`: use only when execution stops early by operator choice or scope is intentionally reduced, not for non-blocking warnings in a completed run.
- Blocking issue summary:
- Follow-up actions:
## Timestamp Standard
- All recorded test timestamps must use UTC.
- Format: `YYYY-MM-DD HH:MM UTC`
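
A small helper can enforce this format. The sketch below uses only the Python standard library; the function name is hypothetical. It rejects naive datetimes, because `astimezone` would otherwise silently interpret them as local time.

```python
from datetime import datetime, timedelta, timezone


def format_utc(moment):
    """Render a timezone-aware datetime as 'YYYY-MM-DD HH:MM UTC'."""
    if moment.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")


# Example: 18:07:58 at UTC-04:00 is 22:07 UTC on the same day.
example = datetime(2026, 5, 12, 18, 7, 58, tzinfo=timezone(timedelta(hours=-4)))
print(format_utc(example))  # 2026-05-12 22:07 UTC
```
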
## Result Storage Location
Store and append all per-host results in:
- `/home/aw/code/cds/tmp/tests/cmc upgrade test/cmc-upgrade-kernel-test-results.md`
Also generate a run summary file in the same directory:
- `/home/aw/code/cds/tmp/tests/cmc upgrade test/cmc-upgrade-kernel-test-summary.md`
Summary file requirements:
- Title: `CMC Upgrade Kernel Test Summary`
- Include UTC date/time for the run
- Include a short workflow summary (current kernel -> install CMC -> kernel upgrade -> uninstall CMC -> kernel upgrade -> install CMC)
- Include host tested, kernel progression (start, step-up, latest), and overall result
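
The storage rules above could be implemented along these lines. The two file paths come from this template; the exact layout of the summary body and the function names are assumptions. Note that the directory name contains spaces, so it must be quoted if used in a shell command.

```python
from pathlib import Path

RESULTS_DIR = Path("/home/aw/code/cds/tmp/tests/cmc upgrade test")
RESULTS_FILE = RESULTS_DIR / "cmc-upgrade-kernel-test-results.md"
SUMMARY_FILE = RESULTS_DIR / "cmc-upgrade-kernel-test-summary.md"

WORKFLOW = ("current kernel -> install CMC -> kernel upgrade -> "
            "uninstall CMC -> kernel upgrade -> install CMC")


def render_summary(run_time_utc, host, start_kernel, step_up_kernel,
                   latest_kernel, result):
    """Build the summary text with the required title and fields."""
    lines = [
        "# CMC Upgrade Kernel Test Summary",
        f"- Run date/time: {run_time_utc}",
        f"- Workflow: {WORKFLOW}",
        f"- Host tested: {host}",
        f"- Kernel progression: {start_kernel} -> {step_up_kernel} -> {latest_kernel}",
        f"- Overall result: {result}",
    ]
    return "\n".join(lines) + "\n"


def append_host_section(section_text, path=RESULTS_FILE):
    """Append one per-host section to the cumulative results file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(section_text.rstrip("\n") + "\n\n")
```

Writing the summary is then `SUMMARY_FILE.write_text(render_summary(...))`, while per-host records always go through `append_host_section` so earlier results are never overwritten.
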