
@eDeveloperOZ
Contributor

⚡️ Distributed + parallel v2.1→v3 converter with manifest orchestration (--orchestrate) and safe benchmarking (--no-push)

Addresses: [lerobot#1998](#1998)
Label: (⚡️ Performance)


What this does

This PR upgrades src/lerobot/datasets/v30/convert_dataset_v21_to_v30.py to support parallel and distributed conversion of large LeRobot datasets from v2.1 to v3.0, while preserving the existing single-process behavior.

Highlights

  • Manifest-based orchestration (--orchestrate)

    • Plans the dataset into batches (user-tunable episodes per batch) and writes a work manifest to disk.
    • A pool of workers leases batches and converts them in parallel, writing temp shards.
    • The main process packs shards into the final v3 layout with the exact same file-size policies as the original script.
    • Workers read and write independently and concurrently; multiple batches are processed at the same time without conflicting outputs.
  • Deterministic v3 packing (unchanged layout)

    • Data: per-episode parquet files are packed into data/chunk-XXX/file-YYY.parquet.
    • Videos: per-camera MP4s are concatenated into videos/<camera>/chunk-XXX/file-YYY.mp4.
    • Meta: meta/episodes rebuilt with accurate (chunk_index, file_index) and per-camera from/to timestamps.
  • New CLI flags

    • --orchestrate – enable manifest-based distributed flow.
    • --episodes-per-batch <int> – batch size for planning.
    • --num-workers <int> – parallel workers on this machine.
    • --work-dir <path> – optional external work directory (keeps _work out of the cache tree).
    • --no-push – skip Hub mutations (no delete/commit/tag/push), ideal for local benchmarking/CI.
  • Backward compatible by default

    • Running without --orchestrate keeps the current single-process flow (including Hub updates).
    • Thresholds (--data-file-size-in-mb, --video-file-size-in-mb) and file layout remain the same.
  • Robust orchestration

    • Manifest records pending/leased/done per batch for safe multi-worker execution and resumability (a minimal lease sketch follows this list).
    • Temp output is isolated in _work/; the final swap at the dataset root is atomic.
    • --no-push prevents permission issues during local tests.
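
To make the manifest/lease mechanics concrete, here is a minimal sketch of how a worker could claim a batch (illustrative only: the field and function names are hypothetical rather than the PR's actual API, and locking is shown with POSIX flock):

import fcntl, json, time
from pathlib import Path

def lease_next_batch(manifest_path: Path, worker_id: str):
    """Atomically claim the first pending batch; return None when nothing is left."""
    with open(manifest_path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)           # serialize manifest access (POSIX-only)
        manifest = json.load(f)
        for batch in manifest["batches"]:
            if batch["status"] == "pending":
                batch.update(status="leased", worker=worker_id, leased_at=time.time())
                f.seek(0)
                f.truncate()
                json.dump(manifest, f)
                return batch                    # e.g. {"id": 3, "episodes": [...], "status": "leased"}
        return None                             # everything is leased or done

A worker marks its batch done once the temp shard is written, which is what makes an interrupted run resumable.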

How it was tested

Environment: macOS (Apple Silicon M1), 16 GB RAM, Python via uv venv.
Dataset: lerobot/svla_so101_pickplace (revision v2.1).
Auth: authenticated; used --no-push for benchmarks.

Commands

Baseline (sequential):

/usr/bin/time -l python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id lerobot/svla_so101_pickplace \
  --no-push \
  2> bench_logs/baseline_time.log | tee bench_logs/baseline.log

Single-machine parallel (no manifest; intra-process fan-out):

/usr/bin/time -l python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id lerobot/svla_so101_pickplace \
  --max-workers 2 \
  --no-push \
  2> bench_logs/parallel_time.log | tee bench_logs/parallel.log

Orchestrated (distributed-style with manifest):

/usr/bin/time -l python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id lerobot/svla_so101_pickplace \
  --orchestrate --episodes-per-batch 10 \
  --num-workers 4 \
  --no-push \
  2> bench_logs/orch_time.log | tee bench_logs/orch.log

We also used psrecord/hyperfine, but time -l covers the core metrics for this PR.
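
For reference, the kinds of invocations we mean (illustrative; the flags shown are the tools' standard options, and any of the converter commands above can be substituted). If repeated runs clash with an already-converted local cache, clear it between runs (hyperfine's --prepare hook can do that):

psrecord "python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id lerobot/svla_so101_pickplace --no-push" \
  --include-children --log bench_logs/psrecord.txt --plot bench_logs/psrecord.png

hyperfine --warmup 1 --runs 3 \
  "python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id lerobot/svla_so101_pickplace --no-push"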

Results (this small dataset)

| Mode | real | Max RSS | Notes |
| --- | --- | --- | --- |
| Baseline (sequential) | 7.70 s | 449 MB | Reference |
| Single-machine parallel (--max-workers 2) | 4.51 s | 431 MB | ~1.7× faster |
| Orchestrated (--orchestrate, 10 ep/batch, 4 workers) | 9.84 s | 415 MB | Small DS → orchestration overhead dominates |

Why the orchestrator looks slower here: with only ~50 episodes and default thresholds, you end up with one v3 data file and one MP4 per camera. The orchestrator introduces process startup, manifest I/O, and a final pack pass—overhead that is amortized on large datasets where many final file-00X outputs are produced and workers keep the writer busy.

Correctness checks performed:

  • v3 layout present under cache:

    ~/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace/{data,meta,videos}
    
  • meta/info.json contains "codebase_version": "v3.0".

  • meta/episodes/chunk-000/file-000.parquet has correct mappings for data/* and per-camera videos/* columns.

  • For this DS: data/chunk-000/file-000.parquet plus, per camera, videos/.../chunk-000/file-000.mp4, as expected.


How to check out & try it (for the reviewer)

Install & auth

pip install psutil psrecord             # optional profiling helpers
# hyperfine is a standalone CLI (install via e.g. brew install hyperfine) if you want it
huggingface-cli login                   # only needed if you intend to push; otherwise use --no-push

Run sequential (baseline)

/usr/bin/time -l python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id lerobot/svla_so101_pickplace --no-push

Run single-machine parallel

/usr/bin/time -l python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id lerobot/svla_so101_pickplace --max-workers 2 --no-push

Run orchestrated (distributed-style)

/usr/bin/time -l python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id lerobot/svla_so101_pickplace \
  --orchestrate --episodes-per-batch 10 --num-workers 2 --no-push

Tip: add --work-dir /tmp/lerobot_work/svla_so101_pickplace to keep the _work/ manifest and temp shards outside the dataset cache.
For large datasets, tune --episodes-per-batch and consider lowering --data-file-size-in-mb / --video-file-size-in-mb to force multiple output files and expose more parallelism.
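
For example, an invocation along these lines (the numbers are illustrative starting points, not tuned recommendations, and <org/large-dataset> is a placeholder repo id):

python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id <org/large-dataset> \
  --orchestrate --episodes-per-batch 50 --num-workers 4 \
  --data-file-size-in-mb 50 --video-file-size-in-mb 200 \
  --work-dir /tmp/lerobot_work/large-dataset \
  --no-push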

Outputs sanity

# Final v3 files
ls -R ~/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace/{data,meta,videos} | head -n 50

# Info + thresholds
cat ~/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace/meta/info.json
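
Or, as a small Python sanity check of the same layout and version (assumes the default HF cache path used above):

import json
from pathlib import Path

root = Path.home() / ".cache/huggingface/lerobot/lerobot/svla_so101_pickplace"

# the conversion must bump codebase_version to v3.0
info = json.loads((root / "meta/info.json").read_text())
assert info["codebase_version"] == "v3.0", info["codebase_version"]

# packed v3 outputs: data/chunk-XXX/file-YYY.parquet and per-camera MP4s
print(sorted(str(p.relative_to(root)) for p in root.glob("data/chunk-*/file-*.parquet")))
print(sorted(str(p.relative_to(root)) for p in root.glob("videos/*/chunk-*/file-*.mp4")))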

Multi-host usage (optional)
Point multiple machines at a shared --work-dir (e.g., NFS / object-storage mount). Each process leases batches from the manifest and works independently, enabling safe conversion of TB-scale datasets.
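
A minimal sketch of what that could look like (illustrative; assumes /mnt/shared is mounted on every host and this branch is installed everywhere):

# run on each host, all pointing at the same shared work directory
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id <org/dataset> \
  --orchestrate --episodes-per-batch 50 --num-workers 4 \
  --work-dir /mnt/shared/lerobot_work/dataset \
  --no-push

Each host's workers lease batches from the shared manifest; as described above, the final pack into the v3 layout remains the job of the orchestrating main process.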


Design notes & decisions

  • Manifest rationale: deterministic planning, safe lease-based scheduling for many workers, and resumability via pending/leased/done.
  • Concurrency model: workers prepare shards in parallel while the main process writes final outputs; this keeps workers isolated and the writer authoritative for the v3 layout (see the sketch after this list).
  • Compatibility: strictly preserves the v3 naming/layout and default thresholds (data 100 MB, video 500 MB).
  • Safety: --no-push mode supports read-only benchmarking and avoids Hub permission errors; default mode keeps the current push behavior.
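
As a rough sketch of that concurrency model (convert_batch and pack_shard are hypothetical stand-ins, not the script's real functions):

from concurrent.futures import ProcessPoolExecutor, as_completed

def run_orchestrated(batches, num_workers, convert_batch, pack_shard):
    # workers turn batches into temp shards in parallel; the main process
    # stays the single writer of the final v3 layout
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        futures = {pool.submit(convert_batch, batch): batch for batch in batches}
        for future in as_completed(futures):
            shard_dir = future.result()   # temp shard produced under _work/
            pack_shard(shard_dir)         # packed into final data/videos/meta files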

Benchmarks snapshot (MBP M1, 16 GB RAM)

| Mode | Command (abridged) | real | Notes |
| --- | --- | --- | --- |
| Baseline | convert_dataset_v21_to_v30 --no-push | 7.70 s | Sequential |
| Parallel | --max-workers 2 --no-push | 4.51 s | ~1.7× speedup |
| Orchestrated | --orchestrate --episodes-per-batch 10 --num-workers 4 --no-push | 9.84 s | Small DS ⇒ overhead dominates; shines on large DS |

Expect larger speedups on big datasets where many final file-00X outputs are produced; orchestration overlaps read/pack work and amortizes per-batch overhead.


Tiny housekeeping

  • Add bench_logs/ to .gitignore (local benchmarking artifacts).

Thanks for reviewing! Happy to adjust defaults, add tests, or extend CI coverage for the orchestrator path if that helps.

@eDeveloperOZ
Contributor Author

Update: bug fix + additional benchmarks & validation

TL;DR: My hypothesis is now CONFIRMED: the orchestrator outperforms the baseline as dataset size scales.


Bug fix

  • Fixed a regression in the new orchestrator path where convert_tasks wasn’t referenced correctly (raised a NameError), which caused the writer to fail before producing the final output.

    The orchestrator now calls the same convert_tasks logic as the sequential path and completes successfully.

  • Hardened local benchmarking mode: --no-push reliably skips all Hub mutations while performing the full local conversion.


Benchmark setup

  • Dataset: unitreerobotics/G1_Brainco_GraspOreo_Dataset (201 episodes, ~9 GB)

  • Machine: MacBook Pro (Apple M1, 16 GB RAM, local SSD)

  • Mode: local, --no-push

  • Metric: wall-clock “real” time from /usr/bin/time -l


Results


Baselines

  • Baseline A (sequential, cold-ish): 31.87 s

  • Orchestrator grid (episodes-per-batch × workers):

| Mode/Config | Episodes/Batch | Workers | Real | Speedup |
| --- | --- | --- | --- | --- |
| Baseline (reference) | – | – | 31.87 s | 1.00× |
| Orchestrator | 25 | 2 | 30.86 s | 1.03× |
| Orchestrator | 25 | 4 | 31.60 s | 1.01× |
| Orchestrator | 25 | 6 | 106.68 s | 0.30× |
| Orchestrator | 50 | 2 | 28.10 s | 1.13× |
| Orchestrator | 50 | 4 | 28.82 s | 1.11× |
| Orchestrator | 50 | 6 | 30.07 s | 1.06× |
| Orchestrator | 75 | 2 | 26.38 s | 1.21× |
| Orchestrator | 75 | 4 | 28.28 s | 1.13× |
| Orchestrator | 75 | 6 | 29.80 s | 1.07× |
| Orchestrator | 100 | 2 | 27.10 s | 1.18× |
| Orchestrator | 100 | 4 | 28.35 s | 1.12× |
| Orchestrator | 100 | 6 | 29.69 s | 1.07× |
| Orchestrator (second run) | 75 | 2 | 28.03 s | 1.14× |

Highlights

  • Best observed: 26.38 s (episodes-per-batch=75, workers=2) → ~1.21× vs the 31.87 s baseline (~1.11× vs 29.15 s).

  • Too many local workers (e.g., 6) degraded performance on this single-node SSD setup (contention & overhead).

  • Larger output file-size settings (512/2048 MB) improved over the baseline but weren't the single-node best here.


Validation (correctness)

  • We ran a baseline sequential conversion and the orchestrator conversion, both with --no-push.

  • We asserted that the final directories match (ignoring the orchestrator's transient _work/ and OS metadata). Additionally, we normalized and compared meta/stats.json; there were no meaningful differences after normalization. A sketch of this kind of check is shown below.
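
A minimal sketch of the kind of comparison we ran (illustrative; baseline_root and orchestrated_root are hypothetical paths to the two conversion outputs):

import json
from pathlib import Path

baseline_root = Path("/path/to/baseline_output")          # hypothetical locations
orchestrated_root = Path("/path/to/orchestrated_output")

def tree_listing(root: Path, ignore=("_work", ".DS_Store")):
    """Relative paths and sizes of the final files, skipping transient/OS entries."""
    return sorted(
        (str(p.relative_to(root)), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file() and not any(part in ignore for part in p.parts)
    )

def normalized_stats(path: Path):
    """Round floats so tiny numerical noise is not reported as a difference."""
    def _round(x):
        if isinstance(x, float):
            return round(x, 6)
        if isinstance(x, dict):
            return {k: _round(v) for k, v in x.items()}
        if isinstance(x, list):
            return [_round(v) for v in x]
        return x
    return json.dumps(_round(json.loads(path.read_text())), sort_keys=True)

assert tree_listing(baseline_root) == tree_listing(orchestrated_root)
assert normalized_stats(baseline_root / "meta/stats.json") == normalized_stats(orchestrated_root / "meta/stats.json")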


Conclusion

  • Our original small-dataset (50 episodes) test showed the sequential path slightly faster, so we treated “orchestrator wins” as a hypothesis.

  • On a larger dataset (201 episodes), that hypothesis is now confirmed: the orchestrator outperforms the baseline (up to ~1.21× here).

  • On single-node, fast-SSD workloads of this size, both paths are largely I/O-bound, so gains are modest locally. We expect larger speedups on bigger datasets and/or multi-node or network storage, where overlapping read/compute/write and true distribution remove sequential choke points.


@adlai

adlai commented Oct 3, 2025

  • Machine: MacBook Pro (Apple M1, 16GB RAM, local SSD)

I don't have 16GB RAM let alone half that, although 823x338 looks like a reasonable diff... should I bother reading?

@eDeveloperOZ
Contributor Author

@adlai I think so, you could just try it out with a smaller DS.

@imstevenpmwork added the enhancement, dataset, and performance labels on Oct 17, 2025.

Inline review comment on the converter script, at this block:

# --------------------------------------------------------------------------------
# Legacy helpers (unaltered behavior; reused by all modes)
# --------------------------------------------------------------------------------

I believe it is good Python style to include at least one blank line before lines of reduced indentation.
