Add bulk dataset generation options to test data script

Add bulk data generation controls for folder count, files per folder, file size range, and a cap on total bulk dataset size. Also update the cdssync docs to describe the new options and how update mode applies to generated bulk files.
2026-04-21 13:31:17 -04:00
parent 7c27535e2a
commit 548beaa3ec
3 changed files with 169 additions and 5 deletions
--- a/cdssync/migration-test-manifest.md
+++ b/cdssync/migration-test-manifest.md
@@ -9,9 +9,17 @@ The generator script can also run in continuous update mode after initial creati
 - use any integer greater than `0` to rewrite mutable files every `N` seconds
 - use `--update-only` to run updates against an already-existing dataset without recreating the special-case filesystem objects first

+The generator script can also create additional bulk test data under `bulk/`:
+
+- `--folder-count N` creates `N` numbered bulk folders
+- `--files-per-folder N` creates `N` bulk files in each bulk folder
+- `--min-file-size-mib N` and `--max-file-size-mib N` control the random bulk file size range
+- `--max-dataset-size-mib N` caps the total size of generated bulk files only and stops creation when the cap is reached
+
 Important implementation detail for update mode:

 - the update loop rewrites content-bearing regular files that are intended to simulate active data churn
+- if bulk files exist under `bulk/`, the update loop rewrites those bulk files too
 - it does not rewrite script files, sparse files, symlinks, hard links, or empty files
 - this preserves the special-case filesystem structure while still generating ongoing content changes
 - if ACL/xattr assignment is unsupported on the target filesystem, the script logs that condition and continues