cds-ai/cdssync/migration-test-manifest.md
anthony.wen bb1cb37dc2 Add cdssync migration test dataset tooling
Add the cdssync migration test dataset manifest, generator script,
workspace instructions, and gitignore.

This sets the default workflow to:
- generate the dataset locally
- copy it to the test machine with metadata preserved
- verify the copied data before migration testing
2026-04-20 11:49:41 -04:00


# Migration Test Dataset Manifest
This manifest defines a compact, high-value filesystem test set for validating file migration behavior. It is intended to cover common file-content, naming, metadata, and directory edge cases without generating an unnecessarily large corpus.
## Recommended Root Layout
- `regular/`
- `hidden/`
- `spaces in name/`
- `deep/tree/level1/level2/level3/`
- `readonly-dir/`
- `links/`
- `metadata/`
- `empty-dirs/`
## Test Objects
### Regular Files
- `regular/text_1mb_644.txt`
- `regular/text_3mb_600.txt`
- `regular/text_5mb_755.txt`
- `regular/random_1mb_600.bin`
- `regular/random_3mb_644.bin`
- `regular/random_5mb_755.bin`
- `regular/compressible_1mb_644.log`
- `regular/compressible_3mb_600.log`
- `regular/compressible_5mb_755.log`
- `regular/script_1mb_755.sh`
- `regular/script_3mb_700.sh`
- `regular/script_5mb_755.sh`
- `regular/sparse_1mb_600.img`
- `regular/sparse_3mb_600.img`
- `regular/sparse_5mb_600.img`
- `regular/empty_000_644.txt`
- `regular/empty_001_600.txt`
- `regular/empty_002_755.txt`
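
The names above appear to encode content kind, size, and permission bits as `<kind>_<N>mb_<mode>.<ext>`. That pattern is inferred from the list, not stated by the manifest, so the following is only a minimal sketch of how a generator could derive file contents from a name. `parse_name`, `make_file`, and the filler text are hypothetical; the `empty_*` names do not follow the pattern and are deliberately left unparsed.

```python
import os
import re
from pathlib import Path

MIB = 1024 * 1024
# Assumed pattern, inferred from the manifest's names: <kind>_<N>mb_<octal mode>.<ext>
NAME_RE = re.compile(r"^(?P<kind>[a-z]+)_(?P<size>\d+)mb_(?P<mode>[0-7]{3})\.\w+$")


def parse_name(name):
    """Return (kind, size_in_bytes, mode), or None if the name does not match."""
    m = NAME_RE.match(name)
    if m is None:
        return None
    return m["kind"], int(m["size"]) * MIB, int(m["mode"], 8)


def _payload(kind, size):
    """Build exactly `size` bytes of content for a non-sparse file."""
    if kind == "random":
        return os.urandom(size)  # incompressible data
    if kind == "script":
        head, line = b"#!/bin/sh\n", b"# padding\n"
    else:  # text and compressible: a repeated line (hypothetical filler)
        head, line = b"", b"migration test line\n"
    return (head + line * (size // len(line) + 1))[:size]


def make_file(path):
    """Create one test file whose kind, size, and mode come from its name."""
    path = Path(path)
    parsed = parse_name(path.name)
    if parsed is None:
        raise ValueError(f"name does not follow the assumed pattern: {path.name}")
    kind, size, mode = parsed
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        if kind == "sparse":
            f.truncate(size)  # sets the logical size without writing data
        else:
            f.write(_payload(kind, size))
    os.chmod(path, mode)  # set explicitly so the umask does not interfere
    return path
```

Setting the mode with `os.chmod` after writing keeps the result independent of the caller's umask.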
### Hidden Files
- `hidden/.hidden_text_1mb_644.txt`
- `hidden/.hidden_random_3mb_600.bin`
- `hidden/.hidden_script_1mb_755.sh`
- `hidden/.hidden_empty_644`
- `hidden/.hidden_sparse_5mb_600.img`
### Files With Spaces
- `spaces in name/file with spaces text 1mb 644.txt`
- `spaces in name/file with spaces random 3mb 600.bin`
- `spaces in name/file with spaces script 1mb 755.sh`
- `spaces in name/file with spaces empty 644`
- `spaces in name/file with spaces sparse 5mb 600.img`
### Long-Name Files
- `regular/longname_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa_text_1mb_644.txt`
- `regular/longname_bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb_random_3mb_600.bin`
- `regular/longname_cccccccccccccccccccccccccccccccc_compressible_5mb_755.log`
### Deep Path Files
- `deep/tree/level1/level2/level3/deep_text_1mb_644.txt`
- `deep/tree/level1/level2/level3/deep_random_3mb_600.bin`
- `deep/tree/level1/level2/level3/deep_script_1mb_755.sh`
- `deep/tree/level1/level2/level3/deep_sparse_5mb_600.img`
### Duplicate-Content Cases
- `regular/dup_source_text_3mb_644.txt`
- `regular/dup_copy_a_text_3mb_600.txt`
- `deep/tree/level1/level2/dup_copy_b_text_3mb_755.txt`
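
One way to confirm that the duplicate-content files still match after a copy or migration is to group files by content hash. This is a sketch only; `sha256_of` and `duplicate_groups` are hypothetical helper names, not part of the tooling described here.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not read into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def duplicate_groups(root):
    """Return sorted groups of relative paths whose contents are identical."""
    by_digest = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # symlinks are not independent content
            by_digest[sha256_of(full)].append(os.path.relpath(full, root))
    return sorted(sorted(group) for group in by_digest.values() if len(group) > 1)
```

Hard links to the same inode also land in one group, so the link cases and any same-size files with identical filler content will show up alongside the `dup_*` files.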
### Timestamp Variants
- `regular/old_text_1mb_644.txt`
- `regular/recent_text_1mb_644.txt`
- `regular/futureish_text_1mb_644.txt`
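
The manifest does not say how old, recent, or far ahead these timestamps should be. The sketch below assumes illustrative offsets (about 20 years back, one hour back, one week ahead) and applies them with `os.utime`; `OFFSETS` and `apply_mtime` are hypothetical names.

```python
import os
import time

DAY = 86400
# Assumed offsets in seconds relative to "now"; the manifest does not define them.
OFFSETS = {
    "old": -20 * 365 * DAY,
    "recent": -3600,
    "futureish": 7 * DAY,
}


def apply_mtime(path, variant, now=None):
    """Set atime and mtime of `path` for the given variant; return the timestamp used."""
    now = time.time() if now is None else now
    target = int(now + OFFSETS[variant])
    os.utime(path, (target, target))
    return target
```

Using whole seconds keeps the values comparable on filesystems that do not store sub-second timestamps.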
### Read-Only Or Awkward Placement Cases
- `readonly-dir/locked_text_1mb_444.txt`
- `readonly-dir/locked_random_3mb_400.bin`
- `readonly-dir/locked_script_1mb_500.sh`
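
Files inside a read-only directory have to exist before the directory loses its write bit. The sketch below shows that ordering; the directory mode `0o555` is an assumption (the manifest does not give one), and `lock_tree` and `unlock_tree` are hypothetical helpers.

```python
import os


def lock_tree(dir_path, file_modes, dir_mode=0o555):
    """Apply per-file modes, then drop write permission on the directory.

    `file_modes` maps existing file names in `dir_path` to octal modes.
    The files must already exist: once the directory is read-only,
    new entries can no longer be created in it by a non-root user.
    """
    for name, mode in file_modes.items():
        os.chmod(os.path.join(dir_path, name), mode)
    os.chmod(dir_path, dir_mode)


def unlock_tree(dir_path, dir_mode=0o755):
    """Restore write permission so the directory can be cleaned up or regenerated."""
    os.chmod(dir_path, dir_mode)
```

Calling `unlock_tree` before deleting or regenerating the dataset avoids permission errors during cleanup.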
### Links
- `links/symlink_to_text_1mb_644.txt`
- `links/symlink_to_deep_random_3mb_600.bin`
- `links/symlink_to_hidden_file`
- `links/hardlink_to_random_3mb_644.bin`
- `links/hardlink_to_compressible_5mb_755.log`
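
A sketch of how the link cases could be created and checked, assuming the symlinks should be relative so they still resolve after the tree is copied to a different root (the manifest does not say whether they are relative or absolute). The helper names are hypothetical.

```python
import os


def make_relative_symlink(target, link_path):
    """Create a symlink at `link_path` pointing to `target` via a relative path."""
    rel = os.path.relpath(target, start=os.path.dirname(link_path))
    os.symlink(rel, link_path)
    return rel


def make_hardlink(target, link_path):
    """Create a hard link; both paths must be on the same filesystem."""
    os.link(target, link_path)


def same_inode(a, b):
    """True if two paths refer to the same inode (without following symlinks)."""
    sa, sb = os.lstat(a), os.lstat(b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)
```

After a copy, `same_inode` on a hard-link pair shows whether the tool kept the link or silently duplicated the data.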
### Directories
- `empty-dirs/empty_a/`
- `empty-dirs/empty_b/`
- `empty-dirs/.hidden_empty_dir/`
- `readonly-dir/no_write_subdir/`
- `deep/tree/level1/level2/level3/`
### Metadata Cases
These should only be created if the source filesystem supports them and the test environment allows them.
- `metadata/xattr_text_1mb_644.txt`
- `metadata/xattr_random_3mb_600.bin`
- `metadata/acl_text_1mb_644.txt`
- `metadata/acl_script_1mb_755.sh`
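
Because these cases depend on filesystem support, a generator can probe before creating them. The sketch below covers extended attributes only and relies on `os.setxattr`, which Python provides on Linux only; the attribute name `user.migration_test` is a hypothetical example. ACLs are not covered here, since they are typically set through external tools.

```python
import os


def try_set_xattr(path, name="user.migration_test", value=b"1"):
    """Try to set and read back an extended attribute.

    Returns True on success, False if the platform or filesystem
    does not support user xattrs or the call is not permitted.
    """
    if not hasattr(os, "setxattr"):
        return False  # not Linux: Python does not expose xattr calls here
    try:
        os.setxattr(path, name, value)
        return os.getxattr(path, name) == value
    except OSError:
        return False
```

A `False` result is a signal to skip the `metadata/xattr_*` files rather than to fail the whole generation run.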
## Approximate Storage
Estimated real disk usage for this manifest:
- core allocated files: about `95 MiB` to `125 MiB`
- with filesystem overhead and modest headroom: plan for about `150 MiB`
- comfortable reserve for later additions: `250 MiB`

Important notes:
- sparse files may report a logical size of `1 MiB` to `5 MiB` while using much less physical disk space
- symlinks, hard links, directories, ACLs, xattrs, and empty files add little compared with regular allocated files
- if you later expand this set with more size permutations or more metadata variants, storage will grow mostly with the fully allocated non-sparse files
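
The gap between logical and physical size for sparse files can be observed through `stat`. This sketch assumes a POSIX system, where `st_blocks` counts 512-byte units; `logical_and_allocated` is a hypothetical helper name.

```python
import os


def logical_and_allocated(path):
    """Return (logical size, allocated bytes) for a file on a POSIX system."""
    st = os.stat(path)
    # st_blocks is reported in 512-byte units regardless of the filesystem block size.
    return st.st_size, st.st_blocks * 512
```

For a file created with `truncate`, the first value is the full logical size while the second is usually far smaller. If a copy reports roughly equal values for a `sparse_*` file, the transfer tool likely expanded the holes.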
## Usage Recommendation
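Use this manifest as the canonical definition of the source dataset. Generate the files once, keep the original unchanged, and transfer a copy to the source test machine using metadata-preserving tooling such as `rsync -aH`, `cp -a`, or a tar archive workflow. Note that `rsync -aH` preserves permissions, timestamps, symlinks, and hard links but not ACLs or extended attributes; add `-A` and `-X` when the metadata cases are included, and `-S` to keep sparse files sparse. Verify the copied data before starting migration testing.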
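
A minimal sketch of a verification step, comparing the original tree with its copy by entry type, permission bits, size, modification time (whole seconds), symlink target, and content hash. `snapshot` and `diff_trees` are hypothetical helpers; this does not check hard-link relationships, xattrs, ACLs, or sparseness.

```python
import hashlib
import os
import stat


def snapshot(root):
    """Map each relative path under `root` to a tuple describing the entry."""
    snap = {}
    for dirpath, dirs, files in os.walk(root):
        for name in dirs + files:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)  # lstat: describe symlinks themselves
            rel = os.path.relpath(full, root)
            if stat.S_ISLNK(st.st_mode):
                snap[rel] = ("symlink", os.readlink(full))
            elif stat.S_ISDIR(st.st_mode):
                snap[rel] = ("dir", stat.S_IMODE(st.st_mode))
            else:
                with open(full, "rb") as f:
                    digest = hashlib.sha256(f.read()).hexdigest()
                snap[rel] = ("file", stat.S_IMODE(st.st_mode), st.st_size,
                             int(st.st_mtime), digest)
    return snap


def diff_trees(src, dst):
    """Return the sorted relative paths that are missing or differ between trees."""
    a, b = snapshot(src), snapshot(dst)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

An empty list from `diff_trees(original, copy)` means the attributes covered by this sketch match. Files are read whole for hashing, which is acceptable at this dataset's sizes.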