docs(test): enforce cirrusdata-vs-mcp CMC workflow and add skidamarink offline-host cleanup checkpoints
144
tests/cmc-upgrade-test.md
Normal file
@@ -0,0 +1,144 @@
# CMC Upgrade Test Template

## Purpose
Validate CMC behavior across staged kernel upgrades on a cloned VM, including reinstall, migration health, service health, and cleanup.

## Scope
- Run per source host provided by operator.
- Work only on the cloned VM created for this test.

## Inputs
- Source VM hostname: `<atvmxxx-...>`
- vCenter target/source location: `<cluster/datastore/folder>`
- Required clone datastore: `AutomatedTest-UnitTesting`
- Initial clone access host/IP: `<INITIAL_CLONE_HOST_OR_IP>`
- SSH username variable: `<SSH_USER_VAR>`
- SSH password variable: `<SSH_PASSWORD_VAR>`
- Cirrus profile/project: `gcstage` / `skidamarink`

## Credential Source
- Use credentials from: `/home/aw/code/cds/.env.credentials.local`
- Do not hardcode usernames/passwords in test records or commands.
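
One way to honor the no-hardcoding rule is to resolve the username and password at run time from the credentials file. The sketch below is illustrative only: it assumes the file holds simple `KEY=VALUE` lines (optionally prefixed with `export`), which this template does not confirm, and the helper names `load_env_file` and `get_credentials` are hypothetical.

```python
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines from an env-style file into a dict.

    Assumption: the credentials file uses plain KEY=VALUE lines; blank
    lines, comments, and lines without '=' are skipped.
    """
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        # Drop surrounding quotes so 'secret' and "secret" both yield secret.
        values[key] = value.strip().strip("'\"")
    return values


def get_credentials(env, user_var, password_var):
    """Return (username, password) looked up by variable name.

    user_var and password_var correspond to <SSH_USER_VAR> and
    <SSH_PASSWORD_VAR>; only the variable names appear in test records.
    """
    missing = [name for name in (user_var, password_var) if not env.get(name)]
    if missing:
        raise KeyError("missing credential variables: " + ", ".join(missing))
    return env[user_var], env[password_var]
```

A run would call `load_env_file("/home/aw/code/cds/.env.credentials.local")` once and pass the resulting values directly to the SSH client, so that no secret is written into a command line or a result record.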

## CMC Tooling Rule (Global)
- For all CMC-related actions in this test, use the `cirrusdata` skill/CLI path.
- Exception: offline-host cleanup is not handled by that skill yet; use the MCP connection for offline-host removal.
- Apply this rule to every relevant step in this procedure.

## Naming Rule
- Base clone VM name in vCenter: `aw999-[source hostname without atvmxxx- prefix]`
- Before cloning, verify the clone VM name is not already in use.
- If already in use, append a numeric suffix to the base name: `-1`, `-2`, ... `-N` until an unused name is found.
- Use plain VM name only (no `/CDSHQ-Eng/vm/` prefix) for clone destination name, and set folder separately if needed.
- OS hostname on clone: same clone name but replace `.` with `-`
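
The naming rule can be sketched as three small helpers. This is a minimal illustration, not part of any tooling named in this template: the function names are hypothetical, the set of existing VM names is assumed to come from a separate vCenter lookup, and the `atvmxxx-` prefix is assumed to be `atvm` followed by any non-hyphen characters and a hyphen.

```python
import re

CLONE_PREFIX = "aw999-"
# Assumption: "atvmxxx-" means "atvm", some non-hyphen characters, then "-".
SOURCE_PREFIX = re.compile(r"^atvm[^-]*-")


def base_clone_name(source_hostname):
    """Strip the atvmxxx- prefix and prepend aw999-."""
    return CLONE_PREFIX + SOURCE_PREFIX.sub("", source_hostname)


def resolve_clone_name(base, existing_names):
    """Return base if unused, else the first unused base-1, base-2, ..."""
    if base not in existing_names:
        return base
    suffix = 1
    while f"{base}-{suffix}" in existing_names:
        suffix += 1
    return f"{base}-{suffix}"


def os_hostname(clone_name):
    """OS hostname is the clone name with every '.' replaced by '-'."""
    return clone_name.replace(".", "-")


if __name__ == "__main__":
    base = base_clone_name("atvm123-ubuntu24.04")
    name = resolve_clone_name(base, {"aw999-ubuntu24.04"})
    print(name)               # aw999-ubuntu24.04-1
    print(os_hostname(name))  # aw999-ubuntu24-04-1
```

The resolved value is the plain VM name to pass as the clone destination; the folder, if needed, is set separately.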

## Safety Rules
- Delete only the clone created for this test.
- If the clone is missing or identity is uncertain, stop and do not delete any other VM.
- If kernel availability checks do not meet criteria, stop, power off clone, and remove clone/disks.

## Test Procedure
1. Remove offline hosts in `skidamarink` using MCP offline-host cleanup.
2. Confirm source host is powered off.
3. Determine base clone name: `aw999-[source-without-atvmxxx-]`.
4. Before cloning, check whether that clone name already exists in vCenter.
5. If the name exists, choose the next available suffixed name: `aw999-[source-without-atvmxxx-]-1`, then `-2`, then `-N` as needed.
6. Clone source VM using the resolved unique clone name on datastore `AutomatedTest-UnitTesting` only.
7. For the clone command destination name, pass only the VM name (for example `aw999-ubuntu24.04-1`), not an inventory path like `/CDSHQ-Eng/vm/...`; set folder separately if needed.
8. Detach the 2 FC PCI adapters from the cloned VM.
9. Power on clone.
10. SSH to `<INITIAL_CLONE_HOST_OR_IP>` using credentials from `/home/aw/code/cds/.env.credentials.local`.
11. Change OS hostname to clone name, replacing `.` with `-`.
12. Convert networking from static IP to DHCP.
13. Remove/clean static IP configuration references.
14. Reboot clone.
15. Find DHCP address and verify it is not `<INITIAL_CLONE_HOST_OR_IP>`.
16. If still `<INITIAL_CLONE_HOST_OR_IP>`, fix static config cleanup and repeat reboot/verify.
17. Continue all remaining steps using DHCP IP and credentials from `/home/aw/code/cds/.env.credentials.local`.
18. Check available kernel versions.
19. Verify at least 2 upgrade candidates exist.
20. If fewer than 2 candidates: stop test, power off clone, delete clone and its disks, end run.
21. Gate check:
   - If step 20 triggered a stop condition, execute no further steps.
   - If no stop condition was triggered, continue with the next step.
22. Using `cirrusdata` (`gcstage`, project `skidamarink`), reinstall CMC on clone.
23. Create local migration from 10GB source disk to 11GB destination disk using `cirrusdata`.
24. Wait for initial sync completion.
25. Check available kernels again.
26. Select upgrade target one step above current kernel (not latest).
27. If only 1 available version, stop test.
28. Install selected kernel and reboot.
29. After reboot, verify clone is online in `skidamarink` using `cirrusdata`.
30. SSH to clone and verify MTDI, Galaxy Migrate services/driver are up.
31. Write sample data to source 10GB disk.
32. Trigger sync and confirm tracking status using `cirrusdata`.
33. Uninstall CMC.
34. Post-uninstall cleanup checkpoint:
   - Run MCP offline-host cleanup for `skidamarink`.
   - If the cloned VM is still marked online after uninstall, remove that cloned VM host entry specifically.
35. Check available kernels.
36. Upgrade to latest kernel and reboot.
37. Reinstall CMC via `cirrusdata` (`gcstage`, `skidamarink`).
38. Recreate local migration (10GB -> 11GB) via `cirrusdata` and wait for initial sync completion.
39. Confirm machine is online in `skidamarink` using `cirrusdata`.
40. SSH and verify MTDI, Galaxy Migrate services/driver are up.
41. Power off cloned machine.
42. Delete cloned VM and its disks from vCenter inventory.
43. Final cleanup checkpoint:
   - Run MCP offline-host cleanup for `skidamarink`.
   - If the cloned VM is still marked online at the end of the test, remove that cloned VM host entry specifically.
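
The kernel gating and target selection in the procedure (at least 2 upgrade candidates, a step-up target that is not the latest, then the latest) can be sketched as below. This is an assumption-laden illustration: the names `StopTest`, `version_key`, `upgrade_candidates`, `select_step_up`, and `select_latest` are hypothetical, the list of available kernels is assumed to be gathered separately from the package manager, and versions are compared naively by their numeric fields only, which may need adjusting per distribution.

```python
import re


class StopTest(Exception):
    """Raised when a stop condition from the procedure is met."""


def version_key(version):
    """Naive sort key: the numeric fields of a kernel version string.

    Example: "5.15.0-91-generic" -> (5, 15, 0, 91). Non-numeric parts
    such as flavor names are ignored (an assumption, not a general rule).
    """
    return tuple(int(part) for part in re.findall(r"\d+", version))


def upgrade_candidates(current, available):
    """Kernels strictly newer than the current one, oldest first."""
    newer = {v for v in available if version_key(v) > version_key(current)}
    return sorted(newer, key=version_key)


def select_step_up(current, available):
    """Pick the kernel one step above current; require 2+ candidates."""
    candidates = upgrade_candidates(current, available)
    if len(candidates) < 2:
        raise StopTest("fewer than 2 upgrade candidates; stop and clean up clone")
    return candidates[0]


def select_latest(current, available):
    """Pick the newest kernel above current for the final upgrade."""
    candidates = upgrade_candidates(current, available)
    if not candidates:
        raise StopTest("no newer kernel available")
    return candidates[-1]
```

With at least two candidates, the step-up target is by construction never the latest one, which leaves a newer kernel for the second upgrade. A `StopTest` raised during the kernel availability check corresponds to the stop path of powering off the clone and removing the clone and its disks.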

## Stop Conditions
- Cannot verify clone identity.
- Cannot detach required FC PCI adapters.
- Clone cannot be created on datastore `AutomatedTest-UnitTesting`.
- DHCP transition cannot be completed (clone remains static at `<INITIAL_CLONE_HOST_OR_IP>`).
- Kernel upgrade candidate criteria not met.
- Any critical migration/service validation failure that blocks continuation.

## Per-Host Test Result Record
Create one report per tested host.

### Host Metadata
- Test date/time:
- Operator:
- Source VM:
- Cloned VM name:
- Clone origin (vCenter path/folder/cluster):
- Final DHCP IP of clone:

### Kernel / OS Tracking
- Start OS version:
- Start kernel version:
- Kernel list before first upgrade:
- Kernel selected for step-up upgrade:
- Kernel after step-up reboot:
- Kernel list before latest upgrade:
- Kernel selected for latest upgrade:
- Kernel after latest reboot:

### Execution Summary (Short Bullets)
- Clone created / FC PCI detached: `PASS|FAIL` - notes
- Hostname/IP DHCP conversion: `PASS|FAIL` - notes
- CMC reinstall #1: `PASS|FAIL` - notes
- Local migration #1 (10GB -> 11GB) initial sync: `PASS|FAIL` - notes
- Step-up kernel upgrade: `PASS|FAIL` - notes
- Online in skidamarink after step-up: `PASS|FAIL` - notes
- MTDI/Galaxy Migrate service+driver health after step-up: `PASS|FAIL` - notes
- Write data + tracking status: `PASS|FAIL` - notes
- CMC uninstall: `PASS|FAIL` - notes
- Latest kernel upgrade: `PASS|FAIL` - notes
- CMC reinstall #2: `PASS|FAIL` - notes
- Local migration #2 (10GB -> 11GB) initial sync: `PASS|FAIL` - notes
- Online in skidamarink after latest upgrade: `PASS|FAIL` - notes
- MTDI/Galaxy Migrate service+driver health after latest upgrade: `PASS|FAIL` - notes
- Clone power off and deletion: `PASS|FAIL` - notes

### Final Outcome
- Overall result: `PASS|FAIL|PARTIAL`
- Blocking issue summary:
- Follow-up actions:

## Result Storage Location
Store per-host test results under:
- `/home/aw/code/cds/tmp/tests/cmc upgrade test/`
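
Because the storage directory name contains spaces, paths built from it need care when passed to a shell. The sketch below shows one way to build a per-host report path and quote it safely. The file naming scheme (`<clone name>-<timestamp>.md`) and the helper names are assumptions for illustration; this template does not prescribe a file name.

```python
import shlex
from datetime import datetime
from pathlib import Path

RESULTS_DIR = Path("/home/aw/code/cds/tmp/tests/cmc upgrade test")


def report_path(clone_name, when, results_dir=RESULTS_DIR):
    """Build a per-host report path (hypothetical naming scheme)."""
    stamp = when.strftime("%Y%m%d-%H%M%S")
    return results_dir / f"{clone_name}-{stamp}.md"


def shell_quoted(path):
    """Quote a path for safe use in a shell command line."""
    return shlex.quote(str(path))


if __name__ == "__main__":
    path = report_path("aw999-ubuntu24.04", datetime.now())
    print(shell_quoted(path))
```

Using `pathlib` for the join and `shlex.quote` only at the shell boundary keeps the unquoted path usable for direct file operations such as `path.write_text(...)`.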