cds-ai/cdssync/migration-test-manifest.md
anthony.wen 7c27535e2a Add update-only mode for test dataset generator
Add support for running content updates against an existing migration
test dataset without recreating the filesystem structure.

Also make ACL/xattr updates non-fatal on filesystems that do not
support those operations.
2026-04-21 13:21:22 -04:00

4.9 KiB

Migration Test Dataset Manifest

This manifest defines a compact, high-value filesystem test set for validating file migration behavior. It is intended to cover common file-content, naming, metadata, and directory edge cases without generating an unnecessarily large corpus.

The generator script can also run in a continuous update mode after the initial creation, in which mutable content files are rewritten with random data at a fixed interval. The interval argument and the --update-only flag control this behavior:

  • omit the interval argument to create the dataset once and exit
  • use 0 for continuous rewrites with no sleep between passes
  • use any integer greater than 0 to rewrite mutable files every N seconds
  • use --update-only to run updates against an already-existing dataset without recreating the special-case filesystem objects first
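
The argument semantics listed above can be sketched as follows. This is only an illustration of the described behavior, not the generator's actual command-line handling: the function names parse_args and plan, the positional form of the interval argument, and the help texts are assumptions. The manifest also does not say what --update-only does when no interval is given, so the sketch leaves that case as a one-shot run.

```python
import argparse


def parse_args(argv):
    # Hypothetical CLI that mirrors the interval rules in the list above.
    parser = argparse.ArgumentParser(
        description="migration test dataset generator (sketch)")
    parser.add_argument(
        "interval", nargs="?", type=int, default=None,
        help="seconds between update passes; omit to create once and exit")
    parser.add_argument(
        "--update-only", action="store_true",
        help="run updates against an existing dataset without recreating it")
    args = parser.parse_args(argv)
    if args.interval is not None and args.interval < 0:
        parser.error("interval must be 0 or a positive integer")
    return args


def plan(args):
    """Return (create_first, sleep_seconds).

    sleep_seconds is None for a one-shot run, 0 for back-to-back passes,
    and N > 0 for a pass every N seconds.
    """
    create_first = not args.update_only
    return create_first, args.interval
```

For example, plan(parse_args(["--update-only", "30"])) describes a run that skips creation and rewrites mutable files every 30 seconds.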

Important implementation details for update mode:

  • the update loop rewrites content-bearing regular files that are intended to simulate active data churn
  • it does not rewrite script files, sparse files, symlinks, hard links, or empty files
  • this preserves the special-case filesystem structure while still generating ongoing content changes
  • if ACL/xattr assignment is unsupported on the target filesystem, the script logs that condition and continues

The dataset root is organized into the following directories:

  • regular/
  • hidden/
  • spaces in name/
  • deep/tree/level1/level2/level3/
  • readonly-dir/
  • links/
  • metadata/
  • empty-dirs/
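
The update-mode rule described earlier (rewrite content-bearing regular files, skip scripts, sparse files, symlinks, hard links, and empty files) could be expressed as a filter like the one below. It is a sketch under stated assumptions: the names is_mutable and mutable_files are hypothetical, scripts are recognized by a ".sh" suffix and sparse files by "sparse" in the file name, and hard links are detected through the link count, which also excludes the file that the hard link points to. The real script may identify these cases differently.

```python
import os
import stat


def is_mutable(path):
    """Return True only for files the update loop would be allowed to rewrite."""
    st = os.lstat(path)                    # lstat: do not follow symlinks
    if not stat.S_ISREG(st.st_mode):       # skips symlinks and directories
        return False
    if st.st_nlink > 1:                    # skips every name of a hard-linked file
        return False
    if st.st_size == 0:                    # skips empty files
        return False
    name = os.path.basename(path).lower()
    if name.endswith(".sh"):               # assumed marker for script files
        return False
    if "sparse" in name:                   # assumed marker for sparse files
        return False
    return True


def mutable_files(root):
    # Walk the dataset root and yield each path that passes the filter.
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            full = os.path.join(dirpath, fname)
            if is_mutable(full):
                yield full
```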

Test Objects

Regular Files

  • regular/text_1mb_644.txt
  • regular/text_3mb_600.txt
  • regular/text_5mb_755.txt
  • regular/random_1mb_600.bin
  • regular/random_3mb_644.bin
  • regular/random_5mb_755.bin
  • regular/compressible_1mb_644.log
  • regular/compressible_3mb_600.log
  • regular/compressible_5mb_755.log
  • regular/script_1mb_755.sh
  • regular/script_3mb_700.sh
  • regular/script_5mb_755.sh
  • regular/sparse_1mb_600.img
  • regular/sparse_3mb_600.img
  • regular/sparse_5mb_600.img
  • regular/empty_000_644.txt
  • regular/empty_001_600.txt
  • regular/empty_002_755.txt
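
The file names in this manifest appear to encode a size in MiB and an octal permission mode (for example text_1mb_644.txt). Assuming that convention holds, a checker could recover the expected size and mode from a path as sketched below. The function name parse_name and the regular expression are illustrative, and the reading of the naming scheme is an inference from the listed names rather than something the manifest states.

```python
import re

# Assumed convention: an optional "<N>mb" part, a separator ("_" or " "),
# a three-digit octal mode, and an optional file extension at the end.
_NAME_RE = re.compile(r"(?:(\d+)mb[_ ])?([0-7]{3})(?:\.\w+)?$")


def parse_name(path):
    """Return (size_bytes, mode) for a manifest path, or None if no mode suffix."""
    match = _NAME_RE.search(path)
    if match is None:
        return None
    size_mib = int(match.group(1)) if match.group(1) else 0
    return size_mib * 1024 * 1024, int(match.group(2), 8)
```

With this sketch, names without a size component, such as empty_000_644.txt, yield a size of 0, and names without a mode suffix, such as symlink_to_hidden_file, yield None.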

Hidden Files

  • hidden/.hidden_text_1mb_644.txt
  • hidden/.hidden_random_3mb_600.bin
  • hidden/.hidden_script_1mb_755.sh
  • hidden/.hidden_empty_644
  • hidden/.hidden_sparse_5mb_600.img

Files With Spaces

  • spaces in name/file with spaces text 1mb 644.txt
  • spaces in name/file with spaces random 3mb 600.bin
  • spaces in name/file with spaces script 1mb 755.sh
  • spaces in name/file with spaces empty 644
  • spaces in name/file with spaces sparse 5mb 600.img

Long-Name Files

  • regular/longname_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa_text_1mb_644.txt
  • regular/longname_bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb_random_3mb_600.bin
  • regular/longname_cccccccccccccccccccccccccccccccc_compressible_5mb_755.log

Deep Path Files

  • deep/tree/level1/level2/level3/deep_text_1mb_644.txt
  • deep/tree/level1/level2/level3/deep_random_3mb_600.bin
  • deep/tree/level1/level2/level3/deep_script_1mb_755.sh
  • deep/tree/level1/level2/level3/deep_sparse_5mb_600.img

Duplicate-Content Cases

  • regular/dup_source_text_3mb_644.txt
  • regular/dup_copy_a_text_3mb_600.txt
  • deep/tree/level1/level2/dup_copy_b_text_3mb_755.txt
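
One way to confirm that the three duplicate-content files really hold identical bytes, before and after a migration, is to compare content hashes. The sketch below assumes the copies are meant to be byte-identical, which the names suggest but the manifest does not state outright; sha256_of and same_content are hypothetical helper names.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    # Hash the file in chunks so multi-megabyte files are not loaded at once.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(paths):
    """Return True if every file in paths has the same SHA-256 digest."""
    return len({sha256_of(p) for p in paths}) == 1
```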

Timestamp Variants

  • regular/old_text_1mb_644.txt
  • regular/recent_text_1mb_644.txt
  • regular/futureish_text_1mb_644.txt
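
The timestamp variants could be produced by setting access and modification times explicitly. The sketch below uses os.utime for that; the specific offsets (ten years back, now, one day ahead) and the name set_variant_mtime are assumptions for illustration, since the manifest does not say how old or how far in the future the timestamps should be.

```python
import os
import time

DAY = 24 * 60 * 60

# Hypothetical offsets in seconds relative to "now"; not taken from the manifest.
OFFSETS = {
    "old": -10 * 365 * DAY,
    "recent": 0,
    "futureish": 1 * DAY,
}


def set_variant_mtime(path, variant, now=None):
    """Set atime and mtime of path for the given variant and return the timestamp."""
    now = time.time() if now is None else now
    stamp = now + OFFSETS[variant]
    os.utime(path, (stamp, stamp))
    return stamp
```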

Read-Only Or Awkward Placement Cases

  • readonly-dir/locked_text_1mb_444.txt
  • readonly-dir/locked_random_3mb_400.bin
  • readonly-dir/locked_script_1mb_500.sh
  • links/symlink_to_text_1mb_644.txt
  • links/symlink_to_deep_random_3mb_600.bin
  • links/symlink_to_hidden_file
  • links/hardlink_to_random_3mb_644.bin
  • links/hardlink_to_compressible_5mb_755.log
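
After a migration, the link cases above are worth checking explicitly, because a copy tool can silently turn a symlink into a regular file or split a hard link into two independent files. A minimal check might look like the sketch below. The name link_report is hypothetical, and the pairing of each link with its target (for example links/hardlink_to_random_3mb_644.bin with regular/random_3mb_644.bin) is inferred from the file names.

```python
import os


def link_report(target, symlink_path, hardlink_path):
    """Report whether a symlink and a hard link to target are still intact."""
    is_symlink = os.path.islink(symlink_path)
    return {
        "symlink_preserved": is_symlink,
        # readlink returns the stored target string without resolving it
        "symlink_target": os.readlink(symlink_path) if is_symlink else None,
        # samefile follows symlinks, so rule out a symlink posing as a hard link
        "hardlink_shares_inode": (
            not os.path.islink(hardlink_path)
            and os.path.samefile(target, hardlink_path)
        ),
    }
```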

Directories

  • empty-dirs/empty_a/
  • empty-dirs/empty_b/
  • empty-dirs/.hidden_empty_dir/
  • readonly-dir/no_write_subdir/
  • deep/tree/level1/level2/level3/

Metadata Cases

Create these only if the source filesystem supports extended attributes and ACLs and the test environment permits setting them.

  • metadata/xattr_text_1mb_644.txt
  • metadata/xattr_random_3mb_600.bin
  • metadata/acl_text_1mb_644.txt
  • metadata/acl_script_1mb_755.sh
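
The non-fatal handling of unsupported metadata operations can be illustrated for extended attributes as below. This is a sketch, not the generator's code: try_set_xattr and the logger name are hypothetical, and os.setxattr is available only on Linux, so the sketch checks for it before calling. ACLs are left out because the Python standard library has no ACL API.

```python
import logging
import os

log = logging.getLogger("dataset")


def try_set_xattr(path, name, value):
    """Set an extended attribute; log and return False instead of raising."""
    setter = getattr(os, "setxattr", None)   # present on Linux only
    if setter is None:
        log.warning("xattrs not available on this platform; skipping %s", path)
        return False
    try:
        setter(path, name, value)
        return True
    except OSError as exc:
        # Typical when the filesystem or mount options do not support xattrs.
        log.warning("could not set xattr %r on %s: %s", name, path, exc)
        return False
```

A call such as try_set_xattr("metadata/xattr_text_1mb_644.txt", "user.migration_test", b"1") would then either succeed or log the condition and let the run continue; the attribute name here is made up for the example.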

Approximate Storage

Estimated real disk usage for this manifest:

  • core allocated files: about 95 MiB to 125 MiB
  • with filesystem overhead and modest headroom: plan for about 150 MiB
  • comfortable reserve for later additions: 250 MiB

Important notes:

  • sparse files may report a logical size of 1 MiB to 5 MiB while using much less physical disk space
  • symlinks, hard links, directories, ACLs, xattrs, and empty files add little compared with regular allocated files
  • if you later expand this set with more size permutations or more metadata variants, storage will grow mostly with the fully allocated non-sparse files
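
The difference between logical and physical size for sparse files can be observed directly. The sketch below creates a sparse file by extending it without writing data and then reports both sizes; create_sparse and sizes are hypothetical names, st_blocks is a Unix-only field counted in 512-byte units, and the amount actually allocated depends on the filesystem.

```python
import os


def create_sparse(path, logical_size):
    # Extending with truncate sets the length without writing data blocks.
    with open(path, "wb") as f:
        f.truncate(logical_size)


def sizes(path):
    """Return (logical_bytes, allocated_bytes) for path."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512
```

On a filesystem with sparse-file support, sizes() for a 5 MiB sparse file typically reports a logical size of 5242880 bytes and a much smaller allocated size.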

Usage Recommendation

Use this directory as the canonical definition of the source dataset. Generate the files once, keep the original unchanged, and transfer a copy to the source test machine using metadata-preserving tooling such as rsync -aH (add -A and -X if ACLs and xattrs must carry over), cp -a, or a tar archive workflow.
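
After transferring a copy, it can be useful to confirm that basic metadata survived. The sketch below walks the original tree and reports the first mismatch found for each entry (missing, type, mode, size, or mtime). It is an assumption-laden illustration: compare_trees is a hypothetical helper, it compares modification times at one-second resolution, and it does not check ownership, ACLs, xattrs, or hard-link relationships.

```python
import os
import stat


def compare_trees(src_root, dst_root):
    """Yield (relative_path, field) for each metadata mismatch between two trees."""
    for dirpath, dirs, files in os.walk(src_root):
        for name in dirs + files:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            if not os.path.lexists(dst):          # lexists: true for dangling symlinks too
                yield rel, "missing"
                continue
            s, d = os.lstat(src), os.lstat(dst)
            if stat.S_IFMT(s.st_mode) != stat.S_IFMT(d.st_mode):
                yield rel, "type"
            elif stat.S_IMODE(s.st_mode) != stat.S_IMODE(d.st_mode):
                yield rel, "mode"
            elif stat.S_ISREG(s.st_mode) and s.st_size != d.st_size:
                yield rel, "size"
            elif not stat.S_ISLNK(s.st_mode) and int(s.st_mtime) != int(d.st_mtime):
                yield rel, "mtime"
```

An empty result from list(compare_trees(original, copy)) means none of the checked fields differ.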