cds-ai/cdssync/migration-test-manifest.md
anthony.wen 4275956259 Add interval-based update mode for test dataset generation
Add optional interval-based random content updates to the cdssync
migration test dataset generator and document the new behavior.

This allows the dataset to be created once and then updated either
continuously or every N seconds while preserving the intended
special-case file structure.
2026-04-21 11:12:37 -04:00

# Migration Test Dataset Manifest
This manifest defines a compact, high-value filesystem test set for validating file migration behavior. It is intended to cover common file-content, naming, metadata, and directory edge cases without generating an unnecessarily large corpus.

The generator script can also keep running in update mode after the initial creation. In that mode, mutable content files are rewritten with random data, either continuously or on a fixed interval. The optional interval argument selects the behavior:
- omit the interval argument to create the dataset once and exit
- use `0` for continuous rewrites with no sleep between passes
- use any integer greater than `0` to rewrite mutable files every `N` seconds
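
The three interval cases above can be sketched as follows. This is a minimal Python illustration, not the generator's actual code: `parse_interval`, `run_updates`, and the `max_passes` and `sleep` parameters are hypothetical names introduced here, and the real script's argument handling may differ.

```python
import time
from typing import Callable, Optional


def parse_interval(arg: Optional[str]) -> Optional[int]:
    """Map the optional interval argument to a mode.

    None -> create once and exit
    0    -> rewrite continuously with no sleep
    N>0  -> rewrite every N seconds
    """
    if arg is None:
        return None
    seconds = int(arg)  # raises ValueError for non-integer input
    if seconds < 0:
        raise ValueError("interval must be 0 or a positive integer")
    return seconds


def run_updates(
    interval: Optional[int],
    rewrite_pass: Callable[[], None],
    max_passes: Optional[int] = None,  # hypothetical hook so the loop can be bounded
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run rewrite passes according to the interval and return the pass count."""
    if interval is None:
        return 0  # create-once mode: no update loop at all
    passes = 0
    while max_passes is None or passes < max_passes:
        rewrite_pass()
        passes += 1
        if interval > 0:
            sleep(interval)  # skipped entirely when interval is 0
    return passes
```

With `max_passes=None` the loop runs until interrupted, which matches the continuous behavior described above.
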

Important implementation detail for update mode:
- the update loop rewrites content-bearing regular files that are intended to simulate active data churn
- it does not rewrite script files, sparse files, symlinks, hard links, or empty files
- this preserves the special-case filesystem structure while still generating ongoing content changes
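
The selection rule above could look roughly like the following. This is a sketch under stated assumptions: `is_mutable` and `rewrite_in_place` are hypothetical helpers, script and sparse files are recognized by name only (the `.sh` suffix and the word `sparse`), and every name of a hard-linked inode is skipped. The generator may use different checks.

```python
import os
import stat
from pathlib import Path


def is_mutable(path: Path) -> bool:
    """Return True if update mode may rewrite this file (assumed rules)."""
    st = os.lstat(path)  # lstat so symlinks are not followed
    if not stat.S_ISREG(st.st_mode):
        return False  # symlinks, directories, and other non-regular entries
    if st.st_size == 0:
        return False  # empty files stay empty
    if st.st_nlink > 1:
        return False  # any name of a hard-linked inode
    name = path.name.lower()
    if name.endswith(".sh") or "sparse" in name:
        return False  # script and sparse files, identified by name here
    return True


def rewrite_in_place(path: Path) -> None:
    """Overwrite a file with random bytes of the same length."""
    size = os.path.getsize(path)
    # "r+b" writes into the existing inode, so mode and link count are kept.
    with open(path, "r+b") as handle:
        handle.write(os.urandom(size))
```

Files without write permission, such as those under `readonly-dir/`, would need separate handling, because opening them for writing fails for unprivileged users.
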
## Recommended Root Layout
- `regular/`
- `hidden/`
- `spaces in name/`
- `deep/tree/level1/level2/level3/`
- `readonly-dir/`
- `links/`
- `metadata/`
- `empty-dirs/`
## Test Objects
### Regular Files
- `regular/text_1mb_644.txt`
- `regular/text_3mb_600.txt`
- `regular/text_5mb_755.txt`
- `regular/random_1mb_600.bin`
- `regular/random_3mb_644.bin`
- `regular/random_5mb_755.bin`
- `regular/compressible_1mb_644.log`
- `regular/compressible_3mb_600.log`
- `regular/compressible_5mb_755.log`
- `regular/script_1mb_755.sh`
- `regular/script_3mb_700.sh`
- `regular/script_5mb_755.sh`
- `regular/sparse_1mb_600.img`
- `regular/sparse_3mb_600.img`
- `regular/sparse_5mb_600.img`
- `regular/empty_000_644.txt`
- `regular/empty_001_600.txt`
- `regular/empty_002_755.txt`
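
The file names in this manifest appear to encode a size and an octal permission mode (for example `text_1mb_644.txt`). Assuming that convention holds, and assuming `mb` means MiB, a name could be decoded like this. `parse_name` is a hypothetical helper, not part of the generator.

```python
import re
from typing import Optional, Tuple

# Optional "<N>mb" size, then a three-digit octal mode, then an optional extension.
# The separator may be "_" or a space to cover the "spaces in name/" entries.
_NAME_PATTERN = re.compile(r"(?:(?P<size>\d+)mb[_ ])?(?P<mode>[0-7]{3})(?:\.\w+)?$")

MIB = 1024 * 1024


def parse_name(name: str) -> Optional[Tuple[int, int]]:
    """Return (size_in_bytes, mode) decoded from a file name, or None."""
    match = _NAME_PATTERN.search(name)
    if match is None:
        return None  # e.g. link names that carry no mode suffix
    size = int(match.group("size")) * MIB if match.group("size") else 0
    return size, int(match.group("mode"), 8)
```

For sparse files the decoded size is the logical size, not the allocated size.
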
### Hidden Files
- `hidden/.hidden_text_1mb_644.txt`
- `hidden/.hidden_random_3mb_600.bin`
- `hidden/.hidden_script_1mb_755.sh`
- `hidden/.hidden_empty_644`
- `hidden/.hidden_sparse_5mb_600.img`
### Files With Spaces
- `spaces in name/file with spaces text 1mb 644.txt`
- `spaces in name/file with spaces random 3mb 600.bin`
- `spaces in name/file with spaces script 1mb 755.sh`
- `spaces in name/file with spaces empty 644`
- `spaces in name/file with spaces sparse 5mb 600.img`
### Long-Name Files
- `regular/longname_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa_text_1mb_644.txt`
- `regular/longname_bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb_random_3mb_600.bin`
- `regular/longname_cccccccccccccccccccccccccccccccc_compressible_5mb_755.log`
### Deep Path Files
- `deep/tree/level1/level2/level3/deep_text_1mb_644.txt`
- `deep/tree/level1/level2/level3/deep_random_3mb_600.bin`
- `deep/tree/level1/level2/level3/deep_script_1mb_755.sh`
- `deep/tree/level1/level2/level3/deep_sparse_5mb_600.img`
### Duplicate-Content Cases
- `regular/dup_source_text_3mb_644.txt`
- `regular/dup_copy_a_text_3mb_600.txt`
- `deep/tree/level1/level2/dup_copy_b_text_3mb_755.txt`
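
One way to confirm that the duplicate-content files still hold identical bytes after a migration is to compare content hashes. The sketch below uses SHA-256 from the Python standard library; `content_digest` and `all_identical` are hypothetical helper names.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def content_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def all_identical(paths: Iterable[Path]) -> bool:
    """Return True if every listed file has the same content digest."""
    return len({content_digest(p) for p in paths}) == 1
```

The check ignores names, permissions, and locations, which is the point of this case: the three files differ in mode and directory but share content.
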
### Timestamp Variants
- `regular/old_text_1mb_644.txt`
- `regular/recent_text_1mb_644.txt`
- `regular/futureish_text_1mb_644.txt`
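
The timestamp variants can be produced with `os.utime`. The manifest does not say which timestamps the generator assigns, so the offsets below (about ten years back, one hour back, one day ahead) are assumptions chosen only for illustration, and `apply_timestamps` is a hypothetical helper.

```python
import os
import time
from pathlib import Path
from typing import Dict, Optional

DAY = 24 * 60 * 60

# Hypothetical offsets in seconds relative to "now"; the real values are not
# specified in this manifest.
OFFSETS = {
    "old_text_1mb_644.txt": -3650 * DAY,
    "recent_text_1mb_644.txt": -60 * 60,
    "futureish_text_1mb_644.txt": DAY,
}


def apply_timestamps(directory: Path, now: Optional[float] = None) -> Dict[str, float]:
    """Set atime and mtime on the three timestamp-variant files."""
    base = time.time() if now is None else now
    applied = {}
    for name, offset in OFFSETS.items():
        stamp = base + offset
        os.utime(directory / name, (stamp, stamp))  # (atime, mtime)
        applied[name] = stamp
    return applied
```
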
### Read-Only Or Awkward Placement Cases
- `readonly-dir/locked_text_1mb_444.txt`
- `readonly-dir/locked_random_3mb_400.bin`
- `readonly-dir/locked_script_1mb_500.sh`
### Links
- `links/symlink_to_text_1mb_644.txt`
- `links/symlink_to_deep_random_3mb_600.bin`
- `links/symlink_to_hidden_file`
- `links/hardlink_to_random_3mb_644.bin`
- `links/hardlink_to_compressible_5mb_755.log`
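
A minimal sketch of creating one symlink and one hard link from this list, plus a check that two names share an inode. It assumes relative symlink targets and that the hard link points at `regular/random_3mb_644.bin`; the manifest does not state either detail, and `make_links` and `is_hardlink_pair` are hypothetical names.

```python
import os
from pathlib import Path


def make_links(root: Path) -> None:
    """Create one symlink and one hard link under root/links (assumed targets)."""
    links = root / "links"
    links.mkdir(exist_ok=True)
    # Relative target, resolved from inside links/ (an assumption).
    os.symlink("../regular/text_1mb_644.txt", links / "symlink_to_text_1mb_644.txt")
    # A hard link is a second name for the same inode, so it must stay on
    # the same filesystem as its source.
    os.link(root / "regular" / "random_3mb_644.bin", links / "hardlink_to_random_3mb_644.bin")


def is_hardlink_pair(a: Path, b: Path) -> bool:
    """Return True if both paths refer to the same inode on the same device."""
    sa, sb = os.stat(a), os.stat(b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)
```

After migration, `is_hardlink_pair` on the destination shows whether the tool preserved the hard link or silently copied the data twice.
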
### Directories
- `empty-dirs/empty_a/`
- `empty-dirs/empty_b/`
- `empty-dirs/.hidden_empty_dir/`
- `readonly-dir/no_write_subdir/`
- `deep/tree/level1/level2/level3/`
### Metadata Cases
These should only be created if the source filesystem supports them and the test environment allows them.
- `metadata/xattr_text_1mb_644.txt`
- `metadata/xattr_random_3mb_600.bin`
- `metadata/acl_text_1mb_644.txt`
- `metadata/acl_script_1mb_755.sh`
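
Whether the source filesystem supports extended attributes can be probed before creating these files. The sketch below relies on `os.setxattr` and `os.getxattr`, which Python provides on Linux only; the attribute name `user.migration_test` and the helper `supports_user_xattr` are hypothetical. ACL support is not covered here because the standard library has no ACL interface.

```python
import os
import tempfile
from pathlib import Path


def supports_user_xattr(directory: Path) -> bool:
    """Return True if a user.* extended attribute can be set in directory."""
    if not hasattr(os, "setxattr"):
        return False  # platform without the Linux xattr functions
    fd, probe = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        os.setxattr(probe, "user.migration_test", b"1")
        return os.getxattr(probe, "user.migration_test") == b"1"
    except OSError:
        return False  # filesystem or mount options reject user xattrs
    finally:
        os.unlink(probe)  # never leave the probe file in the dataset
```
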
## Approximate Storage
Estimated real disk usage for this manifest:
- core allocated files: about `95 MiB` to `125 MiB`
- with filesystem overhead and modest headroom: plan for about `150 MiB`
- comfortable reserve for later additions: `250 MiB`

Important notes:
- sparse files may report a logical size of `1 MiB` to `5 MiB` while using much less physical disk space
- symlinks, hard links, directories, ACLs, xattrs, and empty files add little compared with regular allocated files
- if you later expand this set with more size permutations or more metadata variants, storage will grow mostly with the fully allocated non-sparse files
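
The gap between logical and physical size for sparse files can be observed directly. This sketch assumes a POSIX system where `st_blocks` counts 512-byte units; `make_sparse` and `disk_usage` are hypothetical helpers, and the allocated figure depends on the filesystem.

```python
import os
from pathlib import Path
from typing import Tuple


def make_sparse(path: Path, size: int) -> None:
    """Create a file of the given logical size without writing any data."""
    with open(path, "wb") as handle:
        handle.truncate(size)  # extends the file; the range is a hole where supported


def disk_usage(path: Path) -> Tuple[int, int]:
    """Return (logical_bytes, allocated_bytes) for a file."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512  # st_blocks is in 512-byte units
```

On filesystems that support holes, the allocated figure for such a file is typically far below its logical size, which is why sparse files contribute little to the storage estimate.
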
## Usage Recommendation
Use this manifest as the canonical definition of the source dataset. Generate the files once, keep the original unchanged, and transfer a copy to the source test machine using metadata-preserving tooling such as `rsync -aH`, `cp -a`, or a tar archive workflow. Note that `rsync -a` does not carry ACLs or extended attributes on its own; add `-A` and `-X` if the metadata cases are included.