cds-ai/cdssync/migration-test-manifest.md
anthony.wen 548beaa3ec Add bulk dataset generation options to test data script
Add bulk data generation controls for folder count, files per folder,
file size range, and bulk dataset size limits.

Also update the cdssync docs to describe the new options and how
update mode applies to generated bulk files.
2026-04-21 13:31:17 -04:00

# Migration Test Dataset Manifest
This manifest defines a compact, high-value filesystem test set for validating file migration behavior. It is intended to cover common file-content, naming, metadata, and directory edge cases without generating an unnecessarily large corpus.
The generator script can also run in continuous update mode after initial creation. In that mode, mutable content files are rewritten with random data on a fixed interval:
- omit the interval argument to create the dataset once and exit
- use `0` for continuous rewrites with no sleep between passes
- use any integer greater than `0` to rewrite mutable files every `N` seconds
- use `--update-only` to run updates against an already-existing dataset without recreating the special-case filesystem objects first
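
The interval rules above can be sketched as a small loop. This is only an illustration of the described behavior, not the generator script's actual code: `run_updates`, `rewrite_pass`, and `max_passes` are hypothetical names, and `max_passes` exists only so the sketch can terminate.

```python
import time
from typing import Callable, Optional


def run_updates(
    interval: Optional[int],
    rewrite_pass: Callable[[], None],
    max_passes: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run rewrite passes according to the interval rules; return the pass count.

    interval=None -> no update loop (create the dataset once and exit)
    interval=0    -> back-to-back passes with no sleep
    interval=N>0  -> one pass, then sleep N seconds, repeated
    """
    if interval is None:
        return 0
    if interval < 0:
        raise ValueError("interval must be 0 or a positive integer")
    passes = 0
    while max_passes is None or passes < max_passes:
        rewrite_pass()
        passes += 1
        if interval > 0:
            sleep(interval)
    return passes
```

Passing `sleep` as a parameter keeps the loop testable without real delays.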
The generator script can also create additional bulk test data under `bulk/`:
- `--folder-count N` creates `N` numbered bulk folders
- `--files-per-folder N` creates `N` bulk files in each bulk folder
- `--min-file-size-mib N` and `--max-file-size-mib N` control the random bulk file size range
- `--max-dataset-size-mib N` caps the total size of generated bulk files only and stops creation when the cap is reached
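
As a rough sketch of how these options might interact, the function below plans bulk files until the size cap is hit. The function name, the `bulk/folder_NNNN/file_NNNN.bin` naming scheme, and the choice to skip the file that would exceed the cap are all assumptions made for illustration; the real script may differ on each point.

```python
import random
from typing import List, Optional, Tuple

MIB = 1024 * 1024


def plan_bulk_files(
    folder_count: int,
    files_per_folder: int,
    min_size_mib: int,
    max_size_mib: int,
    max_dataset_size_mib: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, int]]:
    """Return (relative_path, size_in_bytes) pairs for the bulk files to create."""
    if min_size_mib > max_size_mib:
        raise ValueError("minimum file size exceeds maximum file size")
    rng = rng or random.Random()
    cap = None if max_dataset_size_mib is None else max_dataset_size_mib * MIB
    plan: List[Tuple[str, int]] = []
    total = 0
    for folder in range(1, folder_count + 1):
        for index in range(1, files_per_folder + 1):
            size = rng.randint(min_size_mib, max_size_mib) * MIB
            # Assumption: a file that would push the total past the cap is
            # not created, and planning stops at that point.
            if cap is not None and total + size > cap:
                return plan
            # Hypothetical naming scheme for bulk folders and files.
            plan.append((f"bulk/folder_{folder:04d}/file_{index:04d}.bin", size))
            total += size
    return plan
```

For example, `plan_bulk_files(10, 10, 2, 2, max_dataset_size_mib=5)` plans two 2 MiB files and then stops, because a third file would exceed the 5 MiB cap.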
Important implementation details for update mode:
- the update loop rewrites content-bearing regular files that are intended to simulate active data churn
- if bulk files exist under `bulk/`, the update loop rewrites those bulk files too
- it does not rewrite script files, sparse files, symlinks, hard links, or empty files
- this preserves the special-case filesystem structure while still generating ongoing content changes
- if ACL/xattr assignment is unsupported on the target filesystem, the script logs that condition and continues
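
One way to express the selection rules above is a predicate that decides whether a path should be rewritten. This is a minimal sketch, not the script's actual logic: `is_update_candidate` is a hypothetical name, scripts and sparse files are recognized only by the manifest's naming conventions, and it assumes that every name of a hard-linked inode is skipped, which the manifest does not spell out.

```python
import os
import stat


def is_update_candidate(path: str) -> bool:
    """Return True if the update loop in this sketch would rewrite `path`."""
    info = os.lstat(path)  # lstat so symlinks are not followed
    if not stat.S_ISREG(info.st_mode):
        return False  # symlinks, directories, and other non-regular objects
    if info.st_size == 0:
        return False  # empty files stay empty
    if info.st_nlink > 1:
        return False  # assumption: every name of a hard-linked inode is skipped
    name = os.path.basename(path)
    if name.endswith(".sh"):
        return False  # script files, identified here by extension only
    if "sparse" in name:
        return False  # sparse files, identified here by the manifest's naming
    return True
```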
## Recommended Root Layout
- `regular/`
- `hidden/`
- `spaces in name/`
- `deep/tree/level1/level2/level3/`
- `readonly-dir/`
- `links/`
- `metadata/`
- `empty-dirs/`
## Test Objects
### Regular Files
- `regular/text_1mb_644.txt`
- `regular/text_3mb_600.txt`
- `regular/text_5mb_755.txt`
- `regular/random_1mb_600.bin`
- `regular/random_3mb_644.bin`
- `regular/random_5mb_755.bin`
- `regular/compressible_1mb_644.log`
- `regular/compressible_3mb_600.log`
- `regular/compressible_5mb_755.log`
- `regular/script_1mb_755.sh`
- `regular/script_3mb_700.sh`
- `regular/script_5mb_755.sh`
- `regular/sparse_1mb_600.img`
- `regular/sparse_3mb_600.img`
- `regular/sparse_5mb_600.img`
- `regular/empty_000_644.txt`
- `regular/empty_001_600.txt`
- `regular/empty_002_755.txt`
### Hidden Files
- `hidden/.hidden_text_1mb_644.txt`
- `hidden/.hidden_random_3mb_600.bin`
- `hidden/.hidden_script_1mb_755.sh`
- `hidden/.hidden_empty_644`
- `hidden/.hidden_sparse_5mb_600.img`
### Files With Spaces
- `spaces in name/file with spaces text 1mb 644.txt`
- `spaces in name/file with spaces random 3mb 600.bin`
- `spaces in name/file with spaces script 1mb 755.sh`
- `spaces in name/file with spaces empty 644`
- `spaces in name/file with spaces sparse 5mb 600.img`
### Long-Name Files
- `regular/longname_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa_text_1mb_644.txt`
- `regular/longname_bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb_random_3mb_600.bin`
- `regular/longname_cccccccccccccccccccccccccccccccc_compressible_5mb_755.log`
### Deep Path Files
- `deep/tree/level1/level2/level3/deep_text_1mb_644.txt`
- `deep/tree/level1/level2/level3/deep_random_3mb_600.bin`
- `deep/tree/level1/level2/level3/deep_script_1mb_755.sh`
- `deep/tree/level1/level2/level3/deep_sparse_5mb_600.img`
### Duplicate-Content Cases
- `regular/dup_source_text_3mb_644.txt`
- `regular/dup_copy_a_text_3mb_600.txt`
- `deep/tree/level1/level2/dup_copy_b_text_3mb_755.txt`
### Timestamp Variants
- `regular/old_text_1mb_644.txt`
- `regular/recent_text_1mb_644.txt`
- `regular/futureish_text_1mb_644.txt`
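
The manifest does not say how old or how far ahead these timestamps should be. The sketch below shows one way to apply them with `os.utime`; the offsets in `TIMESTAMP_OFFSETS` and the helper name `apply_timestamp` are hypothetical placeholders.

```python
import os
import time
from typing import Optional

DAY = 24 * 60 * 60

# Hypothetical offsets in seconds relative to "now"; pick values that match
# the migration scenarios you want to exercise.
TIMESTAMP_OFFSETS = {
    "old_text_1mb_644.txt": -10 * 365 * DAY,
    "recent_text_1mb_644.txt": -60,
    "futureish_text_1mb_644.txt": 7 * DAY,
}


def apply_timestamp(path: str, offset_seconds: int, now: Optional[float] = None) -> float:
    """Set atime and mtime of `path` to now + offset and return that value."""
    base = time.time() if now is None else now
    stamp = base + offset_seconds
    os.utime(path, (stamp, stamp))
    return stamp
```

Timestamps should be applied after the file content is written, since writing updates the modification time again.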
### Read-Only Or Awkward Placement Cases
- `readonly-dir/locked_text_1mb_444.txt`
- `readonly-dir/locked_random_3mb_400.bin`
- `readonly-dir/locked_script_1mb_500.sh`
### Links
- `links/symlink_to_text_1mb_644.txt`
- `links/symlink_to_deep_random_3mb_600.bin`
- `links/symlink_to_hidden_file`
- `links/hardlink_to_random_3mb_644.bin`
- `links/hardlink_to_compressible_5mb_755.log`
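
For illustration, the sketch below creates one symlink and one hard link from the list above. The link targets are inferred from the file names, and the use of a relative symlink target is an assumption; `make_links` is a hypothetical helper, not part of the generator script.

```python
import os


def make_links(root: str) -> None:
    """Create one symlink and one hard link from the manifest under `root`."""
    links_dir = os.path.join(root, "links")
    os.makedirs(links_dir, exist_ok=True)

    # Relative target so the symlink stays valid when the whole tree is moved.
    # (Assumption: the manifest does not say whether targets are relative.)
    os.symlink(
        os.path.join("..", "regular", "text_1mb_644.txt"),
        os.path.join(links_dir, "symlink_to_text_1mb_644.txt"),
    )

    # A hard link is a second name for the same inode, so it must be created
    # on the same filesystem as its source file.
    os.link(
        os.path.join(root, "regular", "random_3mb_644.bin"),
        os.path.join(links_dir, "hardlink_to_random_3mb_644.bin"),
    )
```

After this runs, `regular/random_3mb_644.bin` has a link count of 2. A migration tool that does not track hard links will copy it as two independent files, which is the behavior this test case is meant to expose.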
### Directories
- `empty-dirs/empty_a/`
- `empty-dirs/empty_b/`
- `empty-dirs/.hidden_empty_dir/`
- `readonly-dir/no_write_subdir/`
- `deep/tree/level1/level2/level3/`
### Metadata Cases
These should only be created if the source filesystem supports them and the test environment allows them.
- `metadata/xattr_text_1mb_644.txt`
- `metadata/xattr_random_3mb_600.bin`
- `metadata/acl_text_1mb_644.txt`
- `metadata/acl_script_1mb_755.sh`
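
The "log and continue" behavior for unsupported metadata can be sketched as follows for extended attributes. This covers xattrs only, since the Python standard library has no ACL API. `try_set_xattr` and the logger name are hypothetical, and `os.setxattr` is available only on Linux.

```python
import logging
import os

log = logging.getLogger("testdata")


def try_set_xattr(path: str, name: str, value: bytes) -> bool:
    """Set an extended attribute; log and return False if that is not possible."""
    setxattr = getattr(os, "setxattr", None)  # present on Linux only
    if setxattr is None:
        log.warning("xattrs not available on this platform; skipping %s", path)
        return False
    try:
        setxattr(path, name, value)
    except OSError as exc:
        # For example ENOTSUP when the filesystem lacks user xattr support.
        log.warning("could not set %s on %s: %s", name, path, exc)
        return False
    return True
```

A call such as `try_set_xattr("metadata/xattr_text_1mb_644.txt", "user.migration_test", b"1")` uses an attribute name in the `user.` namespace, which is the namespace unprivileged processes can normally write to on Linux.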
## Approximate Storage
Estimated real disk usage for this manifest:
- core allocated files: about `95 MiB` to `125 MiB`
- with filesystem overhead and modest headroom: plan for about `150 MiB`
- comfortable reserve for later additions: `250 MiB`
Important notes:
- sparse files may report a logical size of `1 MiB` to `5 MiB` while using much less physical disk space
- symlinks, hard links, directories, ACLs, xattrs, and empty files add little compared with regular allocated files
- if you later expand this set with more size permutations or more metadata variants, storage will grow mostly with the fully allocated non-sparse files
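
To see the difference between logical and physical size for the sparse files, a check along these lines can help. It is a sketch with hypothetical helper names, and it assumes a Linux-style `st_blocks` that counts 512-byte units.

```python
import os
from typing import Tuple


def create_sparse_file(path: str, logical_size: int) -> None:
    """Create a file of `logical_size` bytes without writing any data."""
    with open(path, "wb") as handle:
        handle.truncate(logical_size)


def logical_and_physical_size(path: str) -> Tuple[int, int]:
    """Return (logical bytes, allocated bytes) for `path`."""
    info = os.stat(path)
    # Assumption: st_blocks is in 512-byte units, as on Linux.
    return info.st_size, info.st_blocks * 512
```

On a filesystem that supports sparse files, the second value stays far below the first for a freshly truncated file. Comparing both values before and after a migration shows whether the tool preserved sparseness or expanded the holes into allocated blocks.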
## Usage Recommendation
Use this manifest as the canonical definition of the source dataset. Generate the files once, keep the original copy unchanged, and transfer a copy to the source test machine using metadata-preserving tooling such as `rsync -aH`, `cp -a`, or a tar archive workflow.