cds-ai/cdssync/migration-test-manifest.md
anthony.wen bb1cb37dc2 Add cdssync migration test dataset tooling
Add the cdssync migration test dataset manifest, generator script,
workspace instructions, and gitignore.

This sets the default workflow to:
- generate the dataset locally
- copy it to the test machine with metadata preserved
- verify the copied data before migration testing
2026-04-20 11:49:41 -04:00


Migration Test Dataset Manifest

This manifest defines a compact, high-value filesystem test set for validating file migration behavior. It is intended to cover common file-content, naming, metadata, and directory edge cases without generating an unnecessarily large corpus. The dataset is laid out under the following directories:

  • regular/
  • hidden/
  • spaces in name/
  • deep/tree/level1/level2/level3/
  • readonly-dir/
  • links/
  • metadata/
  • empty-dirs/

Test Objects

Regular Files

  • regular/text_1mb_644.txt
  • regular/text_3mb_600.txt
  • regular/text_5mb_755.txt
  • regular/random_1mb_600.bin
  • regular/random_3mb_644.bin
  • regular/random_5mb_755.bin
  • regular/compressible_1mb_644.log
  • regular/compressible_3mb_600.log
  • regular/compressible_5mb_755.log
  • regular/script_1mb_755.sh
  • regular/script_3mb_700.sh
  • regular/script_5mb_755.sh
  • regular/sparse_1mb_600.img
  • regular/sparse_3mb_600.img
  • regular/sparse_5mb_600.img
  • regular/empty_000_644.txt
  • regular/empty_001_600.txt
  • regular/empty_002_755.txt
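
The file names above appear to encode content kind, size, and permission mode. As a rough illustration only (not the repository's actual generator script), a file of each kind could be produced as sketched below. The function name `write_test_file`, the filler text, and the choice of `truncate` for sparse files are assumptions; the script kind is omitted for brevity.

```python
import os
from pathlib import Path

MIB = 1024 * 1024


def write_test_file(path: Path, kind: str, size_mib: int, mode: int) -> None:
    """Create one test file with the given content kind, logical size, and mode.

    Hypothetical helper: the kind names mirror the manifest's file-name prefixes.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    size = size_mib * MIB
    with open(path, "wb") as f:
        if kind == "random":
            f.write(os.urandom(size))  # effectively incompressible bytes
        elif kind == "compressible":
            f.write(b"A" * size)  # highly repetitive, compresses well
        elif kind == "text":
            line = b"The quick brown fox jumps over the lazy dog.\n"
            f.write((line * (size // len(line) + 1))[:size])
        elif kind == "sparse":
            # Sets the logical size without writing data; whether this
            # produces a hole depends on the filesystem.
            f.truncate(size)
        elif kind == "empty":
            pass  # zero-byte file
        else:
            raise ValueError(f"unknown kind: {kind}")
    os.chmod(path, mode)  # explicit chmod is not affected by the umask
```

For example, `write_test_file(root / "regular/text_1mb_644.txt", "text", 1, 0o644)` would create the first entry in the list, where `root` is whatever dataset root you choose.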

Hidden Files

  • hidden/.hidden_text_1mb_644.txt
  • hidden/.hidden_random_3mb_600.bin
  • hidden/.hidden_script_1mb_755.sh
  • hidden/.hidden_empty_644
  • hidden/.hidden_sparse_5mb_600.img

Files With Spaces

  • spaces in name/file with spaces text 1mb 644.txt
  • spaces in name/file with spaces random 3mb 600.bin
  • spaces in name/file with spaces script 1mb 755.sh
  • spaces in name/file with spaces empty 644
  • spaces in name/file with spaces sparse 5mb 600.img

Long-Name Files

  • regular/longname_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa_text_1mb_644.txt
  • regular/longname_bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb_random_3mb_600.bin
  • regular/longname_cccccccccccccccccccccccccccccccc_compressible_5mb_755.log

Deep Path Files

  • deep/tree/level1/level2/level3/deep_text_1mb_644.txt
  • deep/tree/level1/level2/level3/deep_random_3mb_600.bin
  • deep/tree/level1/level2/level3/deep_script_1mb_755.sh
  • deep/tree/level1/level2/level3/deep_sparse_5mb_600.img

Duplicate-Content Cases

  • regular/dup_source_text_3mb_644.txt
  • regular/dup_copy_a_text_3mb_600.txt
  • deep/tree/level1/level2/dup_copy_b_text_3mb_755.txt
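
The names suggest these three files hold identical bytes while differing in mode and location; that reading is an assumption. Under it, a content hash is one simple way to confirm the duplicates still match after migration. The helper names `sha256_of` and `same_content` are illustrative.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_content(paths: Iterable[Path]) -> bool:
    """True if every listed file has the same digest."""
    return len({sha256_of(p) for p in paths}) == 1
```

Running `same_content` on the three paths before and after migration shows whether any deduplication or copy step altered one of the copies.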

Timestamp Variants

  • regular/old_text_1mb_644.txt
  • regular/recent_text_1mb_644.txt
  • regular/futureish_text_1mb_644.txt
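
The manifest does not state which timestamps these variants use. As a sketch under assumed offsets (for instance about ten years back for `old`, one day back for `recent`, and a week ahead for `futureish`), modification and access times can be set with `os.utime`. The helper name and the offsets are hypothetical.

```python
import os
import time
from typing import Optional

DAY = 86400  # seconds


def set_mtime_offset(path: str, days_offset: float, now: Optional[float] = None) -> float:
    """Set atime and mtime to `now + days_offset` days and return the target time."""
    now = time.time() if now is None else now
    target = now + days_offset * DAY
    os.utime(path, (target, target))  # (atime, mtime) in seconds since the epoch
    return target
```

A call such as `set_mtime_offset("regular/old_text_1mb_644.txt", -3650)` would backdate the file by roughly ten years; comparing `st_mtime` after migration then shows whether timestamps survived.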

Read-Only Or Awkward Placement Cases

  • readonly-dir/locked_text_1mb_444.txt
  • readonly-dir/locked_random_3mb_400.bin
  • readonly-dir/locked_script_1mb_500.sh
  • links/symlink_to_text_1mb_644.txt
  • links/symlink_to_deep_random_3mb_600.bin
  • links/symlink_to_hidden_file
  • links/hardlink_to_random_3mb_644.bin
  • links/hardlink_to_compressible_5mb_755.log
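
For the link entries, a minimal sketch of creating and classifying them might look like the following. It assumes symlinks are stored as relative paths and that hard links live on the same filesystem as their targets; neither detail is stated in the manifest, and the function names are illustrative.

```python
import os
import stat
from pathlib import Path


def make_links(target: Path, symlink: Path, hardlink: Path) -> None:
    """Create a relative symlink and a hard link that both point at `target`."""
    symlink.parent.mkdir(parents=True, exist_ok=True)
    hardlink.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(os.path.relpath(target, symlink.parent), symlink)
    os.link(target, hardlink)  # requires the same filesystem as target


def describe_link(path: Path) -> str:
    """Classify a path as 'symlink', 'hardlink' (regular file, nlink > 1), or 'regular'."""
    st = os.lstat(path)  # lstat does not follow symlinks
    if stat.S_ISLNK(st.st_mode):
        return "symlink"
    if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
        return "hardlink"
    return "regular"
```

After migration, `os.path.samefile(target, hardlink)` indicates whether the hard-link relationship was kept or the link was silently turned into an independent copy.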

Directories

  • empty-dirs/empty_a/
  • empty-dirs/empty_b/
  • empty-dirs/.hidden_empty_dir/
  • readonly-dir/no_write_subdir/
  • deep/tree/level1/level2/level3/
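
One ordering detail matters for the read-only directories: they have to be populated before write permission is removed. The sketch below creates empty directories and then locks a directory. The mode `0o555` is an assumption, since the manifest does not give directory modes, and the helper names are hypothetical.

```python
import os
from pathlib import Path
from typing import Iterable


def make_empty_dirs(root: Path, names: Iterable[str]) -> None:
    """Create each named directory (and any missing parents) under root."""
    for name in names:
        (root / name).mkdir(parents=True, exist_ok=True)


def lock_directory(path: Path, mode: int = 0o555) -> None:
    """Drop write permission on a directory; call this only after populating it."""
    os.chmod(path, mode)
```

Cleanup needs the reverse step: restore write permission (for example `os.chmod(path, 0o755)`) before deleting a locked directory's contents.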

Metadata Cases

Create these only if the source filesystem supports extended attributes and ACLs and the test environment permits setting them.

  • metadata/xattr_text_1mb_644.txt
  • metadata/xattr_random_3mb_600.bin
  • metadata/acl_text_1mb_644.txt
  • metadata/acl_script_1mb_755.sh
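
Whether extended attributes are available can be probed before creating these files. The sketch below uses `os.setxattr`, which Python provides on Linux only. The attribute name `user.migration_test` is a made-up example. ACLs are not covered here because the Python standard library has no ACL interface; an external tool such as `setfacl` would be needed for those.

```python
import os


def try_set_xattr(path: str, name: str = "user.migration_test", value: bytes = b"1") -> bool:
    """Try to set a user xattr; return True only if it can be set and read back."""
    if not hasattr(os, "setxattr"):
        return False  # platform without os.setxattr (for example macOS or Windows)
    try:
        os.setxattr(path, name, value)
        return os.getxattr(path, name) == value
    except OSError:
        # Filesystem or mount options do not allow user xattrs here.
        return False
```

If this returns `False` on the source filesystem, the xattr entries can be skipped rather than created in a degraded form.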

Approximate Storage

Estimated real disk usage for this manifest:

  • core allocated files: about 95 MiB to 125 MiB
  • with filesystem overhead and modest headroom: plan for about 150 MiB
  • comfortable reserve for later additions: 250 MiB

Important notes:

  • sparse files may report a logical size of 1 MiB to 5 MiB while using much less physical disk space
  • symlinks, hard links, directories, ACLs, xattrs, and empty files add little compared with regular allocated files
  • if you later expand this set with more size permutations or more metadata variants, storage will grow mostly with the fully allocated non-sparse files
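
To see the gap between logical and physical size on a real file, `os.stat` reports both. This sketch assumes a POSIX system, where `st_blocks` counts 512-byte units; the function name is illustrative.

```python
import os
from typing import Tuple


def logical_and_allocated(path: str) -> Tuple[int, int]:
    """Return (logical size in bytes, allocated bytes on disk) for a file."""
    st = os.stat(path)
    # st_blocks is expressed in 512-byte units on POSIX systems,
    # independent of the filesystem's own block size.
    return st.st_size, st.st_blocks * 512
```

For a sparse file the second number is typically far smaller than the first. If a migration tool expands holes, the two values converge on the destination.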

Usage Recommendation

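Use this manifest as the canonical definition of the source dataset. Generate the files once, keep the original unchanged, and transfer a copy to the source test machine using metadata-preserving tooling such as rsync -aH, cp -a, or a tar archive workflow. Note that rsync -a does not carry ACLs or extended attributes on its own; add -A and -X if the Metadata Cases are included, and -S to keep sparse files sparse. Verify the copied data before starting migration testing.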
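
One way to check a transferred copy against the original is to snapshot both trees and compare them. The following is a minimal sketch, not the repository's own verification step: it records type, mode, size, whole-second mtime, and SHA-256 for files, the link target for symlinks, and the mode for directories. It does not detect hard-link relationships, xattrs, or ACLs, and the names `snapshot` and `diff_snapshots` are hypothetical.

```python
import hashlib
import os
import stat
from pathlib import Path
from typing import Dict, List


def snapshot(root: Path) -> Dict[str, tuple]:
    """Map each relative path under root to a tuple describing that entry."""
    entries: Dict[str, tuple] = {}
    for dirpath, dirnames, filenames in os.walk(root):  # does not follow symlinks
        for name in dirnames + filenames:
            full = Path(dirpath) / name
            st = os.lstat(full)
            rel = full.relative_to(root).as_posix()
            if stat.S_ISLNK(st.st_mode):
                entries[rel] = ("symlink", os.readlink(full))
            elif stat.S_ISDIR(st.st_mode):
                entries[rel] = ("dir", stat.S_IMODE(st.st_mode))
            else:
                digest = hashlib.sha256(full.read_bytes()).hexdigest()
                entries[rel] = ("file", stat.S_IMODE(st.st_mode), st.st_size,
                                int(st.st_mtime), digest)
    return entries


def diff_snapshots(a: Dict[str, tuple], b: Dict[str, tuple]) -> List[str]:
    """Return sorted relative paths that are missing on one side or differ."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))
```

An empty result from `diff_snapshots(snapshot(original), snapshot(copy))` means the properties tracked here match; any listed path deserves a closer look before migration testing starts.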