CMC Upgrade Test Template

Purpose

Validate CMC behavior across staged kernel upgrades on a cloned VM, including reinstall, migration health, service health, and cleanup.

Scope

  • Run per source host provided by operator.
  • Work only on the cloned VM created for this test.

Inputs

  • Source VM hostname: <atvmxxx-...>
  • vCenter target/source location: <cluster/datastore/folder>
  • Required clone datastore: AutomatedTest-UnitTesting
  • Initial clone access host/IP: <INITIAL_CLONE_HOST_OR_IP>
  • SSH username variable: <SSH_USER_VAR>
  • SSH password variable: <SSH_PASSWORD_VAR>
  • Cirrus profile/project: gcstage / skidamarink

Credential Source

  • Use credentials from: /home/aw/code/cds/.env.credentials.local
  • Do not hardcode usernames/passwords in test records or commands.
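One way to honor this rule is to read the credentials at run time instead of pasting them into commands. The sketch below assumes the credentials file uses dotenv-style `KEY=VALUE` lines, which this template does not confirm. `load_env_file` is a hypothetical helper, not part of any existing tooling.

```python
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines into a dict (assumed dotenv-style format)."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Tolerate an optional shell-style "export " prefix.
        if key.startswith("export "):
            key = key[len("export "):].strip()
        # Drop surrounding whitespace and quotes from the value.
        values[key] = value.strip().strip("'\"")
    return values
```

Look up the variable names recorded under Inputs (`<SSH_USER_VAR>`, `<SSH_PASSWORD_VAR>`) in the returned dict, and keep the values themselves out of test records.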

CMC Tooling Rule (Global)

  • For all CMC-related actions in this test, use the cirrusdata skill/CLI path.
  • Exception: offline-host cleanup is not handled by that skill yet; use the MCP connection for offline-host removal.
  • Apply this rule to every relevant step in this procedure.

Naming Rule

  • Base clone VM name in vCenter: aw999-[source hostname without atvmxxx- prefix]
  • Before cloning, verify the clone VM name is not already in use.
  • If already in use, append a numeric suffix to the base name: -1, -2, ... -N until an unused name is found.
  • Use plain VM name only (no /CDSHQ-Eng/vm/ prefix) for clone destination name, and set folder separately if needed.
  • OS hostname on clone: same as the clone name, with every . replaced by -
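The naming rule above can be sketched as three small functions. This is an illustration under stated assumptions: the source prefix is taken to be `atvm` followed by letters or digits and a hyphen, the list of existing VM names is supplied by the caller (how it is queried from vCenter is out of scope here), and all function names are hypothetical.

```python
import re


def base_clone_name(source_hostname):
    """Build aw999-[source hostname without its atvmxxx- prefix]."""
    # Assumption: the prefix looks like "atvm" + alphanumerics + "-".
    stripped = re.sub(
        r"^atvm[0-9a-z]*-", "", source_hostname, count=1, flags=re.IGNORECASE
    )
    return f"aw999-{stripped}"


def resolve_clone_name(source_hostname, existing_names):
    """Return the base name, or the first unused -1, -2, ... -N variant."""
    base = base_clone_name(source_hostname)
    taken = set(existing_names)
    if base not in taken:
        return base
    suffix = 1
    while f"{base}-{suffix}" in taken:
        suffix += 1
    return f"{base}-{suffix}"


def os_hostname(clone_name):
    """OS hostname is the clone name with every '.' replaced by '-'."""
    return clone_name.replace(".", "-")
```

For a hypothetical source host `atvm123-ubuntu24.04` whose base name is already taken, `resolve_clone_name` yields `aw999-ubuntu24.04-1`, and `os_hostname` turns that into `aw999-ubuntu24-04-1`.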

Safety Rules

  • Delete only the clone created for this test.
  • If the clone is missing or identity is uncertain, stop and do not delete any other VM.
  • If kernel availability checks do not meet criteria, stop, power off clone, and remove clone/disks.
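The first two safety rules amount to a guard that must pass before any delete call. The sketch below is one possible shape for that guard, assuming identity is checked by VM name only. The template does not say how identity is verified, and a stronger check (for example a VM UUID recorded at clone time) may be preferable. `safe_to_delete` is a hypothetical helper.

```python
def safe_to_delete(candidate_name, created_clone_name, inventory_names):
    """Return True only when the candidate is exactly the clone this run created."""
    if not created_clone_name:
        return False  # identity uncertain: this run has no recorded clone
    if candidate_name != created_clone_name:
        return False  # never delete any other VM
    # The clone must be present exactly once; missing or ambiguous means stop.
    return list(inventory_names).count(created_clone_name) == 1
```

Any `False` result should be treated as a stop condition rather than retried against a different VM.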

Test Procedure

  1. Remove offline hosts in skidamarink using MCP offline-host cleanup.
  2. Confirm source host is powered off.
  3. Determine base clone name: aw999-[source-without-atvmxxx-].
  4. Before cloning, check whether that clone name already exists in vCenter.
  5. If the name exists, choose the next available suffixed name: aw999-[source-without-atvmxxx-]-1, then -2, then -N as needed.
  6. Clone source VM using the resolved unique clone name on datastore AutomatedTest-UnitTesting only.
  7. For the clone command destination name, pass only the VM name (for example aw999-ubuntu24.04-1), not an inventory path like /CDSHQ-Eng/vm/...; set folder separately if needed.
  8. Detach the 2 FC PCI adapters from the cloned VM.
  9. Power on clone.
  10. SSH to <INITIAL_CLONE_HOST_OR_IP> using credentials from /home/aw/code/cds/.env.credentials.local.
  11. Change OS hostname to clone name, replacing . with -.
  12. Convert networking from static IP to DHCP.
  13. Remove/clean static IP configuration references.
  14. Reboot clone.
  15. Find DHCP address and verify it is not <INITIAL_CLONE_HOST_OR_IP>.
  16. If still <INITIAL_CLONE_HOST_OR_IP>, fix static config cleanup and repeat reboot/verify.
  17. Continue all remaining steps using DHCP IP and credentials from /home/aw/code/cds/.env.credentials.local.
  18. Check available kernel versions.
  19. Verify at least 2 upgrade candidates exist.
  20. If fewer than 2 candidates: stop test, power off clone, delete clone and its disks, end run.
  21. Gate check:
      • If step 20 triggered a stop condition, execute no further steps.
      • If no stop condition was triggered, continue with the next step.
  22. Using cirrusdata (gcstage, project skidamarink), reinstall CMC on clone.
  23. Create local migration from 10GB source disk to 11GB destination disk using cirrusdata.
  24. Wait for initial sync completion.
  25. Check available kernels again.
  26. Select upgrade target one step above current kernel (not latest).
  27. If only 1 available version, stop test.
  28. Install selected kernel and reboot.
  29. After reboot, verify clone is online in skidamarink using cirrusdata.
  30. SSH to clone and verify MTDI, Galaxy Migrate services/driver are up.
  31. Write sample data to source 10GB disk.
  32. Trigger sync and confirm tracking status using cirrusdata.
  33. Uninstall CMC.
  34. Post-uninstall cleanup checkpoint:
      • Run MCP offline-host cleanup for skidamarink.
      • If the cloned VM is still marked online after uninstall, remove that cloned VM host entry specifically.
  35. Check available kernels.
  36. Upgrade to latest kernel and reboot.
  37. Reinstall CMC via cirrusdata (gcstage, skidamarink).
  38. Recreate local migration (10GB -> 11GB) via cirrusdata and wait for initial sync completion.
  39. Confirm machine is online in skidamarink using cirrusdata.
  40. SSH and verify MTDI, Galaxy Migrate services/driver are up.
  41. Power off cloned machine.
  42. Delete cloned VM and its disks from vCenter inventory.
  43. Final cleanup checkpoint:
      • Run MCP offline-host cleanup for skidamarink.
      • If the cloned VM is still marked online at the end of the test, remove that cloned VM host entry specifically.
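The kernel gate and the two upgrade targets in the procedure above can be sketched as follows. This assumes the available versions have already been collected as plain strings, and it compares them naively by their numeric fields. Real version ordering is distro-specific, so the distribution's own package tooling should be preferred where available. All function names are hypothetical.

```python
import re


def version_key(version):
    """Sort key: the numeric fields of a kernel version string, in order."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def upgrade_candidates(current, available):
    """Return available kernels newer than the current one, oldest first."""
    newer = {v for v in available if version_key(v) > version_key(current)}
    return sorted(newer, key=version_key)


def plan_upgrades(current, available):
    """Return (step_up, latest), or None when fewer than 2 candidates exist."""
    candidates = upgrade_candidates(current, available)
    if len(candidates) < 2:
        return None  # stop condition: kernel upgrade candidate criteria not met
    # Step-up target is the next version above current, not the latest.
    return candidates[0], candidates[-1]
```

Comparing numeric fields rather than raw strings matters here: with hypothetical versions `5.15.0-94-generic` and `5.15.0-101-generic`, a plain string sort would place 101 before 94 and pick the wrong step-up target.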

Stop Conditions

  • Cannot verify clone identity.
  • Cannot detach required FC PCI adapters.
  • Clone cannot be created on datastore AutomatedTest-UnitTesting.
  • DHCP transition cannot be completed (clone remains static at <INITIAL_CLONE_HOST_OR_IP>).
  • Kernel upgrade candidate criteria not met.
  • Any critical migration/service validation failure that blocks continuation.

Per-Host Test Result Record

Create one report per tested host.

Host Metadata

  • Test date/time:
  • Operator:
  • Source VM:
  • Cloned VM name:
  • Clone origin (vCenter path/folder/cluster):
  • Final DHCP IP of clone:

Kernel / OS Tracking

  • Start OS version:
  • Start kernel version:
  • Kernel list before first upgrade:
  • Kernel selected for step-up upgrade:
  • Kernel after step-up reboot:
  • Kernel list before latest upgrade:
  • Kernel selected for latest upgrade:
  • Kernel after latest reboot:

Execution Summary (Short Bullets)

  • Clone created / FC PCI detached: PASS|FAIL - notes
  • Hostname/IP DHCP conversion: PASS|FAIL - notes
  • CMC reinstall #1: PASS|FAIL - notes
  • Local migration #1 (10GB -> 11GB) initial sync: PASS|FAIL - notes
  • Step-up kernel upgrade: PASS|FAIL - notes
  • Online in skidamarink after step-up: PASS|FAIL - notes
  • MTDI/Galaxy Migrate service+driver health after step-up: PASS|FAIL - notes
  • Write data + tracking status: PASS|FAIL - notes
  • CMC uninstall: PASS|FAIL - notes
  • Latest kernel upgrade: PASS|FAIL - notes
  • CMC reinstall #2: PASS|FAIL - notes
  • Local migration #2 (10GB -> 11GB) initial sync: PASS|FAIL - notes
  • Online in skidamarink after latest upgrade: PASS|FAIL - notes
  • MTDI/Galaxy Migrate service+driver health after latest upgrade: PASS|FAIL - notes
  • Clone power off and deletion: PASS|FAIL - notes

Final Outcome

  • Overall result: PASS|FAIL|PARTIAL
  • Blocking issue summary:
  • Follow-up actions:

Result Storage Location

Store per-host test results under:

  • /home/aw/code/cds/tmp/tests/cmc upgrade test/
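Because this directory name contains spaces, building report paths with a path library avoids shell-quoting mistakes. The sketch below shows one way to do that. The file naming scheme (clone name plus timestamp, `.md` extension) is an assumption for illustration and is not defined by this template.

```python
from datetime import datetime
from pathlib import Path

RESULTS_DIR = Path("/home/aw/code/cds/tmp/tests/cmc upgrade test")


def report_path(clone_name, when, results_dir=RESULTS_DIR):
    """Build a per-host report path; the file naming scheme is hypothetical."""
    stamp = when.strftime("%Y%m%d-%H%M%S")
    return results_dir / f"{clone_name}_{stamp}.md"
```

A caller would typically pass `datetime.now()` as `when` and create the directory with `path.parent.mkdir(parents=True, exist_ok=True)` before writing the report.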