From 909e50828a00893957a4cb990e69da9af3aa8b2b Mon Sep 17 00:00:00 2001
From: "anthony.wen"
Date: Tue, 12 May 2026 11:16:27 -0400
Subject: [PATCH] docs(test): enforce cirrusdata-vs-mcp CMC workflow and add skidamarink offline-host cleanup checkpoints

---
 tests/cmc-upgrade-test.md | 144 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 144 insertions(+)
 create mode 100644 tests/cmc-upgrade-test.md

diff --git a/tests/cmc-upgrade-test.md b/tests/cmc-upgrade-test.md
new file mode 100644
index 0000000..2eebec8
--- /dev/null
+++ b/tests/cmc-upgrade-test.md
@@ -0,0 +1,144 @@
+# CMC Upgrade Test Template
+
+## Purpose
+Validate CMC behavior across staged kernel upgrades on a cloned VM, including reinstall, migration health, service health, and cleanup.
+
+## Scope
+- Run per source host provided by operator.
+- Work only on the cloned VM created for this test.
+
+## Inputs
+- Source VM hostname: ``
+- vCenter target/source location: ``
+- Required clone datastore: `AutomatedTest-UnitTesting`
+- Initial clone access host/IP: ``
+- SSH username variable: ``
+- SSH password variable: ``
+- Cirrus profile/project: `gcstage` / `skidamarink`
+
+## Credential Source
+- Use credentials from: `/home/aw/code/cds/.env.credentials.local`
+- Do not hardcode usernames/passwords in test records or commands.
+
+## CMC Tooling Rule (Global)
+- For all CMC-related actions in this test, use the `cirrusdata` skill/CLI path.
+- Exception: offline-host cleanup is not handled by that skill yet; use the MCP connection for offline-host removal.
+- Apply this rule to every relevant step in this procedure.
+
+## Naming Rule
+- Base clone VM name in vCenter: `aw999-[source hostname without atvmxxx- prefix]`
+- Before cloning, verify the clone VM name is not already in use.
+- If already in use, append a numeric suffix to the base name: `-1`, `-2`, ... `-N` until an unused name is found.
+- Use plain VM name only (no `/CDSHQ-Eng/vm/` prefix) for clone destination name, and set folder separately if needed.
+- OS hostname on clone: same clone name but replace `.` with `-`
+
+## Safety Rules
+- Delete only the clone created for this test.
+- If the clone is missing or identity is uncertain, stop and do not delete any other VM.
+- If kernel availability checks do not meet criteria, stop, power off clone, and remove clone/disks.
+
+## Test Procedure
+1. Remove offline hosts in `skidamarink` using MCP offline-host cleanup.
+2. Confirm source host is powered off.
+3. Determine base clone name: `aw999-[source-without-atvmxxx-]`.
+4. Before cloning, check whether that clone name already exists in vCenter.
+5. If the name exists, choose the next available suffixed name: `aw999-[source-without-atvmxxx-]-1`, then `-2`, then `-N` as needed.
+6. Clone source VM using the resolved unique clone name on datastore `AutomatedTest-UnitTesting` only.
+7. For the clone command destination name, pass only the VM name (for example `aw999-ubuntu24.04-1`), not an inventory path like `/CDSHQ-Eng/vm/...`; set folder separately if needed.
+8. Detach the 2 FC PCI adapters from the cloned VM.
+9. Power on clone.
+10. SSH to `` using credentials from `/home/aw/code/cds/.env.credentials.local`.
+11. Change OS hostname to clone name, replacing `.` with `-`.
+12. Convert networking from static IP to DHCP.
+13. Remove/clean static IP configuration references.
+14. Reboot clone.
+15. Find DHCP address and verify it is not ``.
+16. If still ``, fix static config cleanup and repeat reboot/verify.
+17. Continue all remaining steps using DHCP IP and credentials from `/home/aw/code/cds/.env.credentials.local`.
+18. Check available kernel versions.
+19. Verify at least 2 upgrade candidates exist.
+20. If fewer than 2 candidates: stop test, power off clone, delete clone and its disks, end run.
+21. Gate check:
+   - If step 20 triggered a stop condition, execute no further steps.
+   - If no stop condition was triggered, continue with the next step.
+22. Using `cirrusdata` (`gcstage`, project `skidamarink`), reinstall CMC on clone.
+23. Create local migration from 10GB source disk to 11GB destination disk using `cirrusdata`.
+24. Wait for initial sync completion.
+25. Check available kernels again.
+26. Select upgrade target one step above current kernel (not latest).
+27. If only 1 available version, stop test.
+28. Install selected kernel and reboot.
+29. After reboot, verify clone is online in `skidamarink` using `cirrusdata`.
+30. SSH to clone and verify MTDI, Galaxy Migrate services/driver are up.
+31. Write sample data to source 10GB disk.
+32. Trigger sync and confirm tracking status using `cirrusdata`.
+33. Uninstall CMC.
+34. Post-uninstall cleanup checkpoint:
+   - Run MCP offline-host cleanup for `skidamarink`.
+   - If the cloned VM is still marked online after uninstall, remove that cloned VM host entry specifically.
+35. Check available kernels.
+36. Upgrade to latest kernel and reboot.
+37. Reinstall CMC via `cirrusdata` (`gcstage`, `skidamarink`).
+38. Recreate local migration (10GB -> 11GB) via `cirrusdata` and wait for initial sync completion.
+39. Confirm machine is online in `skidamarink` using `cirrusdata`.
+40. SSH and verify MTDI, Galaxy Migrate services/driver are up.
+41. Power off cloned machine.
+42. Delete cloned VM and its disks from vCenter inventory.
+43. Final cleanup checkpoint:
+   - Run MCP offline-host cleanup for `skidamarink`.
+   - If the cloned VM is still marked online at the end of the test, remove that cloned VM host entry specifically.
+
+## Stop Conditions
+- Cannot verify clone identity.
+- Cannot detach required FC PCI adapters.
+- Clone cannot be created on datastore `AutomatedTest-UnitTesting`.
+- DHCP transition cannot be completed (clone remains static at ``).
+- Kernel upgrade candidate criteria not met.
+- Any critical migration/service validation failure that blocks continuation.
+
+## Per-Host Test Result Record
+Create one report per tested host.
+
+### Host Metadata
+- Test date/time:
+- Operator:
+- Source VM:
+- Cloned VM name:
+- Clone origin (vCenter path/folder/cluster):
+- Final DHCP IP of clone:
+
+### Kernel / OS Tracking
+- Start OS version:
+- Start kernel version:
+- Kernel list before first upgrade:
+- Kernel selected for step-up upgrade:
+- Kernel after step-up reboot:
+- Kernel list before latest upgrade:
+- Kernel selected for latest upgrade:
+- Kernel after latest reboot:
+
+### Execution Summary (Short Bullets)
+- Clone created / FC PCI detached: `PASS|FAIL` - notes
+- Hostname/IP DHCP conversion: `PASS|FAIL` - notes
+- CMC reinstall #1: `PASS|FAIL` - notes
+- Local migration #1 (10GB -> 11GB) initial sync: `PASS|FAIL` - notes
+- Step-up kernel upgrade: `PASS|FAIL` - notes
+- Online in skidamarink after step-up: `PASS|FAIL` - notes
+- MTDI/Galaxy Migrate service+driver health after step-up: `PASS|FAIL` - notes
+- Write data + tracking status: `PASS|FAIL` - notes
+- CMC uninstall: `PASS|FAIL` - notes
+- Latest kernel upgrade: `PASS|FAIL` - notes
+- CMC reinstall #2: `PASS|FAIL` - notes
+- Local migration #2 (10GB -> 11GB) initial sync: `PASS|FAIL` - notes
+- Online in skidamarink after latest upgrade: `PASS|FAIL` - notes
+- MTDI/Galaxy Migrate service+driver health after latest upgrade: `PASS|FAIL` - notes
+- Clone power off and deletion: `PASS|FAIL` - notes
+
+### Final Outcome
+- Overall result: `PASS|FAIL|PARTIAL`
+- Blocking issue summary:
+- Follow-up actions:
+
+## Result Storage Location
+Store per-host test results under:
+- `/home/aw/code/cds/tmp/tests/cmc upgrade test/`
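
The Naming Rule section and steps 3 to 5 of the procedure describe a small name-resolution routine. A minimal Python sketch of it follows. It assumes the set of VM names already in use in vCenter has been fetched by some other means (the `existing_names` argument and all function names here are hypothetical), and it assumes the `atvmxxx-` prefix means `atvm` followed by digits and a hyphen, which the template does not state explicitly.

```python
import re

# Assumption: the "atvmxxx-" prefix is "atvm" + digits + "-".
# Adjust the pattern if the real host naming scheme differs.
_PREFIX = re.compile(r"^atvm\d+-")


def base_clone_name(source_hostname: str) -> str:
    """Build the base clone name: aw999-[source hostname without atvmxxx- prefix]."""
    return "aw999-" + _PREFIX.sub("", source_hostname)


def resolve_clone_name(source_hostname: str, existing_names: set[str]) -> str:
    """Return the base name if unused, else the first unused -1, -2, ... -N variant."""
    base = base_clone_name(source_hostname)
    if base not in existing_names:
        return base
    n = 1
    while f"{base}-{n}" in existing_names:
        n += 1
    return f"{base}-{n}"


def os_hostname(clone_name: str) -> str:
    """OS hostname on the clone: the clone name with '.' replaced by '-'."""
    return clone_name.replace(".", "-")
```

The result of `resolve_clone_name` is a plain VM name, so it can be passed as the clone destination name without an inventory path, as step 7 requires.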
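
The kernel gate in steps 18 to 20 and the step-up selection in steps 25 to 27 can be sketched in Python as well. This is one possible reading, not the template's definition: it treats "upgrade candidates" as the available kernels newer than the current one, and "one step above current" as the oldest of those candidates. The version comparison assumes versions can be ordered by the runs of digits they contain (for example `5.15.0-94-generic`); all names are hypothetical.

```python
import re


def version_key(kernel: str) -> tuple[int, ...]:
    """Sort key: every run of digits in the version string, as integers."""
    return tuple(int(part) for part in re.findall(r"\d+", kernel))


def upgrade_candidates(current: str, available: list[str]) -> list[str]:
    """Available kernels newer than the current one, oldest first."""
    newer = {k for k in available if version_key(k) > version_key(current)}
    return sorted(newer, key=version_key)


def select_step_up(current: str, available: list[str]) -> str:
    """Pick the candidate one step above current (not the latest).

    Fewer than 2 candidates is treated as a stop condition.
    """
    candidates = upgrade_candidates(current, available)
    if len(candidates) < 2:
        raise RuntimeError("stop condition: fewer than 2 kernel upgrade candidates")
    return candidates[0]
```

Comparing integer tuples rather than raw strings matters here: a plain string sort would order `5.15.0-101` before `5.15.0-94`.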