base: main
Upgrade dataset port script to parallel and distributed processing and add bench_logs to gitignore #2036
Conversation
Update: bug fix + additional benchmarks & validation

TL;DR: My hypothesis is now CONFIRMED: the orchestrator outperforms the baseline as dataset size scales.

Bug fix
Benchmark setup
Results

Baselines
I don't have 16GB RAM let alone half that, although 823x338 looks like a reasonable diff... should I bother reading?
@adlai I think so, you could just try it out with a smaller DS.
Signed-off-by: eDeveloperOZ <[email protected]>
```python
# --------------------------------------------------------------------------------
# Legacy helpers (unaltered behavior; reused by all modes)
# --------------------------------------------------------------------------------
```
I believe it is good Python style to include at least one blank line before lines of reduced indentation.
⚡️ Distributed + parallel v2.1→v3 converter with manifest orchestration (`--orchestrate`) and safe benchmarking (`--no-push`)

Addresses: [lerobot#1998](#1998)
Label: (⚡️ Performance)
What this does
This PR upgrades `src/lerobot/datasets/v30/convert_dataset_v21_to_v30.py` to support parallel and distributed conversion of large LeRobot datasets from v2.1 to v3.0, while preserving the existing single-process behavior.

Highlights
- Manifest-based orchestration (`--orchestrate`)
- Deterministic v3 packing (unchanged layout):
  - `data/chunk-XXX/file-YYY.parquet`
  - `videos/<camera>/chunk-XXX/file-YYY.mp4`
  - `meta/episodes` rebuilt with accurate `(chunk_index, file_index)` and per-camera from/to timestamps.

New CLI flags
- `--orchestrate` – enable manifest-based distributed flow.
- `--episodes-per-batch <int>` – batch size for planning.
- `--num-workers <int>` – parallel workers on this machine.
- `--work-dir <path>` – optional external work directory (keeps `_work` out of the cache tree).
- `--no-push` – skip Hub mutations (no delete/commit/tag/push), ideal for local benchmarking/CI.

Backward compatible by default
- Running without `--orchestrate` keeps the current single-process flow (including Hub updates).
- Size thresholds (`--data-file-size-in-mb`, `--video-file-size-in-mb`) and file layout remain the same.

Robust orchestration
- Work is staged under `_work/`; the final swap at the dataset root is atomic.
- `--no-push` prevents permissions issues during local tests.

How it was tested
- Environment: macOS (Apple Silicon M1), 16 GB RAM, Python via `uv` venv.
- Dataset: `lerobot/svla_so101_pickplace` (revision `v2.1`).
- Auth: authenticated; used `--no-push` for benchmarks.

Commands
Baseline (sequential):
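A minimal sketch of the baseline run; the module-style invocation and the `--repo-id` argument are assumptions, while `--no-push` is the flag this PR adds:

```bash
# Sequential baseline; --no-push (added by this PR) skips all Hub mutations.
time python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
    --repo-id=lerobot/svla_so101_pickplace \
    --no-push
```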
Single-machine parallel (no manifest; intra-process fan-out):
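A sketch under the same assumptions; `--max-workers 2` matches the value in the results below:

```bash
# Single-process run fanning work out to a pool of 2 workers.
time python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
    --repo-id=lerobot/svla_so101_pickplace \
    --max-workers 2 \
    --no-push
```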
Orchestrated (distributed-style with manifest):
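A sketch under the same assumptions; the orchestration flags are the ones documented above:

```bash
# Manifest-based orchestration: plan batches of 10 episodes, 4 local workers.
time python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
    --repo-id=lerobot/svla_so101_pickplace \
    --orchestrate \
    --episodes-per-batch 10 \
    --num-workers 4 \
    --no-push
```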
Results (this small dataset)
[Results table: `real` wall-clock times for the baseline (sequential), parallel (`--max-workers 2`), and orchestrated (`--orchestrate`, 10 ep/batch, 4 workers) runs.]

Why the orchestrator looks slower here: with only ~50 episodes and default thresholds, you end up with one v3 data file and one MP4 per camera. The orchestrator introduces process startup, manifest I/O, and a final pack pass: overhead that is amortized on large datasets where many final `file-00X` outputs are produced and workers keep the writer busy.

Correctness checks performed:
- v3 layout present under cache:
  - `meta/info.json` contains `"codebase_version": "v3.0"`.
  - `meta/episodes/chunk-000/file-000.parquet` has correct mappings for `data/*` and per-camera `videos/*` columns.
- For this DS: `data/chunk-000/file-000.parquet`, and per camera `videos/.../chunk-000/file-000.mp4`, as expected.

How to checkout & try? (for the reviewer)
Install & auth
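One plausible setup, assuming an editable install and the GitHub CLI for fetching the branch (any equivalent checkout works):

```bash
git clone https://github.com/huggingface/lerobot.git
cd lerobot
gh pr checkout 2036            # fetch this PR's branch
pip install -e .               # editable install into your venv
huggingface-cli login          # only needed if you run without --no-push
```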
Run sequential (baseline)
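Same reconstruction as the baseline command above (`--repo-id` and the module invocation are assumptions):

```bash
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
    --repo-id=lerobot/svla_so101_pickplace --no-push
```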
Run single-machine parallel
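Again a sketch under the same assumptions:

```bash
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
    --repo-id=lerobot/svla_so101_pickplace --max-workers 2 --no-push
```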
Run orchestrated (distributed-style)
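Same assumptions, with the orchestration flags from this PR:

```bash
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
    --repo-id=lerobot/svla_so101_pickplace \
    --orchestrate --episodes-per-batch 10 --num-workers 4 --no-push
```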
Outputs sanity
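A few quick checks mirroring the correctness list above; the cache location is an assumption (LeRobot datasets typically land under `~/.cache/huggingface/lerobot`):

```bash
DS="$HOME/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace"  # assumed cache path
python -c "import json; print(json.load(open('$DS/meta/info.json'))['codebase_version'])"  # expect: v3.0
ls "$DS"/data/chunk-000/        # expect: file-000.parquet
ls "$DS"/videos/*/chunk-000/    # expect: file-000.mp4 for each camera
```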
Multi-host usage (optional)
Point multiple machines at a shared `--work-dir` (e.g., NFS / object-storage mount). Each process leases batches from the manifest and works independently, enabling safe conversion of TB-scale datasets. A sketch follows below.
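A hypothetical multi-host sketch: the dataset name, mount point, and worker count are illustrative; the flags are the ones this PR documents.

```bash
# Run the same command on each machine; all hosts point at one shared work dir.
# The manifest under --work-dir coordinates which host leases which batch.
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
    --repo-id=lerobot/my_tb_scale_dataset \
    --orchestrate \
    --episodes-per-batch 10 \
    --num-workers 8 \
    --work-dir /mnt/shared/lerobot_convert_work \
    --no-push
```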
Design notes & decisions

- Manifest batches move through pending/leased/done states.
- `--no-push` mode supports read-only benchmarking and avoids Hub permission errors; default mode keeps the current push behavior.

Benchmarks snapshot (MBP M1, 16 GB RAM)
[Benchmarks table: `real` time for the sequential `convert_dataset_v21_to_v30 --no-push` run, the parallel `--max-workers 2 --no-push` run, and the orchestrated `--orchestrate --episodes-per-batch 10 --num-workers 4 --no-push` run.]

Tiny housekeeping
- Added `bench_logs/` to `.gitignore` (local benchmarking artifacts).

Thanks for reviewing! Happy to adjust defaults, add tests, or extend CI coverage for the orchestrator path if that helps.