
@eDeveloperOZ
Contributor

⚡️ Distributed + parallel v2.1→v3 converter with manifest orchestration (--orchestrate) and safe benchmarking (--no-push)

Addresses: [lerobot#1998](#1998)
Label: (⚡️ Performance)


What this does

This PR upgrades src/lerobot/datasets/v30/convert_dataset_v21_to_v30.py to support parallel and distributed conversion of large LeRobot datasets from v2.1 to v3.0, while preserving the existing single-process behavior.

Highlights

  • Manifest-based orchestration (--orchestrate)

    • Plans the dataset into batches (user-tunable episodes per batch) and writes a work manifest to disk.
    • A pool of workers leases batches and converts them in parallel, writing temp shards.
    • The main process packs shards into the final v3 layout with the exact same file-size policies as the original script.
    • Workers read and write independently and concurrently; multiple batches are processed at the same time without conflicting outputs.
  • Deterministic v3 packing (unchanged layout)

    • Data: per-episode parquet files are packed into data/chunk-XXX/file-YYY.parquet.
    • Videos: per-camera MP4s are concatenated into videos/<camera>/chunk-XXX/file-YYY.mp4.
    • Meta: meta/episodes rebuilt with accurate (chunk_index, file_index) and per-camera from/to timestamps.
  • New CLI flags

    • --orchestrate – enable manifest-based distributed flow.
    • --episodes-per-batch <int> – batch size for planning.
    • --num-workers <int> – parallel workers on this machine.
    • --work-dir <path> – optional external work directory (keeps _work out of the cache tree).
    • --no-push – skip Hub mutations (no delete/commit/tag/push), ideal for local benchmarking/CI.
  • Backward compatible by default

    • Running without --orchestrate keeps the current single-process flow (including Hub updates).
    • Thresholds (--data-file-size-in-mb, --video-file-size-in-mb) and file layout remain the same.
  • Robust orchestration

    • Manifest records pending/leased/done per batch for safe multi-worker execution and resumability (a minimal lease sketch follows this list).
    • Temp output is isolated in _work/; the final swap at the dataset root is atomic.
    • --no-push prevents permission issues during local tests.
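
To make the manifest/lease mechanics concrete, here is a minimal sketch of how a worker could claim a batch (illustrative only: the field and function names are hypothetical rather than the PR's actual API, and locking is shown with POSIX flock):

import fcntl, json, time
from pathlib import Path

def lease_next_batch(manifest_path: Path, worker_id: str):
    """Atomically claim the first pending batch; return None when nothing is left."""
    with open(manifest_path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)           # serialize manifest access (POSIX-only)
        manifest = json.load(f)
        for batch in manifest["batches"]:
            if batch["status"] == "pending":
                batch.update(status="leased", worker=worker_id, leased_at=time.time())
                f.seek(0)
                f.truncate()
                json.dump(manifest, f)
                return batch                    # e.g. {"id": 3, "episodes": [...], "status": "leased"}
        return None                             # everything is leased or done

A worker marks its batch done once the temp shard is written, which is what makes an interrupted run resumable.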

How it was tested

Environment: macOS (Apple Silicon M1), 16 GB RAM, Python via uv venv.
Dataset: lerobot/svla_so101_pickplace (revision v2.1).
Auth: authenticated; used --no-push for benchmarks.

Commands

Baseline (sequential):

/usr/bin/time -l python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id lerobot/svla_so101_pickplace \
  --no-push \
  2> bench_logs/baseline_time.log | tee bench_logs/baseline.log

Single-machine parallel (no manifest; intra-process fan-out):

/usr/bin/time -l python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id lerobot/svla_so101_pickplace \
  --max-workers 2 \
  --no-push \
  2> bench_logs/parallel_time.log | tee bench_logs/parallel.log

Orchestrated (distributed-style with manifest):

/usr/bin/time -l python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id lerobot/svla_so101_pickplace \
  --orchestrate --episodes-per-batch 10 \
  --num-workers 4 \
  --no-push \
  2> bench_logs/orch_time.log | tee bench_logs/orch.log

We also used psrecord/hyperfine, but time -l covers the core metrics for this PR.
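
For reference, the kinds of invocations we mean (illustrative; the flags shown are the tools' standard options, and any of the converter commands above can be substituted). If repeated runs clash with an already-converted local cache, clear it between runs (hyperfine's --prepare hook can do that):

psrecord "python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id lerobot/svla_so101_pickplace --no-push" \
  --include-children --log bench_logs/psrecord.txt --plot bench_logs/psrecord.png

hyperfine --warmup 1 --runs 3 \
  "python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id lerobot/svla_so101_pickplace --no-push"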

Results (this small dataset)

| Mode | real | Max RSS | Notes |
| --- | --- | --- | --- |
| Baseline (sequential) | 7.70 s | 449 MB | Reference |
| Single-machine parallel (--max-workers 2) | 4.51 s | 431 MB | ~1.7× faster |
| Orchestrated (--orchestrate, 10 ep/batch, 4 workers) | 9.84 s | 415 MB | Small DS → orchestration overhead dominates |

Why the orchestrator looks slower here: with only ~50 episodes and default thresholds, you end up with one v3 data file and one MP4 per camera. The orchestrator introduces process startup, manifest I/O, and a final pack pass—overhead that is amortized on large datasets where many final file-00X outputs are produced and workers keep the writer busy.

Correctness checks performed:

  • v3 layout present under cache:

    ~/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace/{data,meta,videos}
    
  • meta/info.json contains "codebase_version": "v3.0".

  • meta/episodes/chunk-000/file-000.parquet has correct mappings for data/* and per-camera videos/* columns.

  • For this DS: data/chunk-000/file-000.parquet plus, per camera, videos/.../chunk-000/file-000.mp4, as expected.


How to check out & try it (for the reviewer)

Install & auth

pip install psutil psrecord             # optional profiling helpers
# hyperfine is a standalone CLI (install via e.g. brew install hyperfine) if you want it
huggingface-cli login                   # only needed if you intend to push; otherwise use --no-push

Run sequential (baseline)

/usr/bin/time -l python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id lerobot/svla_so101_pickplace --no-push

Run single-machine parallel

/usr/bin/time -l python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id lerobot/svla_so101_pickplace --max-workers 2 --no-push

Run orchestrated (distributed-style)

/usr/bin/time -l python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id lerobot/svla_so101_pickplace \
  --orchestrate --episodes-per-batch 10 --num-workers 2 --no-push

Tip: add --work-dir /tmp/lerobot_work/svla_so101_pickplace to keep the _work/ manifest and temp shards outside the dataset cache.
For large datasets, tune --episodes-per-batch and consider lowering --data-file-size-in-mb / --video-file-size-in-mb to force multiple output files and expose more parallelism.
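
For example, an invocation along these lines (the numbers are illustrative starting points, not tuned recommendations, and <org/large-dataset> is a placeholder repo id):

python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id <org/large-dataset> \
  --orchestrate --episodes-per-batch 50 --num-workers 4 \
  --data-file-size-in-mb 50 --video-file-size-in-mb 200 \
  --work-dir /tmp/lerobot_work/large-dataset \
  --no-push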

Outputs sanity

# Final v3 files
ls -R ~/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace/{data,meta,videos} | head -n 50

# Info + thresholds
cat ~/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace/meta/info.json
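
Or, as a small Python sanity check of the same layout and version (assumes the default HF cache path used above):

import json
from pathlib import Path

root = Path.home() / ".cache/huggingface/lerobot/lerobot/svla_so101_pickplace"

# the conversion must bump codebase_version to v3.0
info = json.loads((root / "meta/info.json").read_text())
assert info["codebase_version"] == "v3.0", info["codebase_version"]

# packed v3 outputs: data/chunk-XXX/file-YYY.parquet and per-camera MP4s
print(sorted(str(p.relative_to(root)) for p in root.glob("data/chunk-*/file-*.parquet")))
print(sorted(str(p.relative_to(root)) for p in root.glob("videos/*/chunk-*/file-*.mp4")))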

Multi-host usage (optional)
Point multiple machines at a shared --work-dir (e.g., NFS / object-storage mount). Each process leases batches from the manifest and works independently, enabling safe conversion of TB-scale datasets.
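
A minimal sketch of what that could look like (illustrative; assumes /mnt/shared is mounted on every host and this branch is installed everywhere):

# run on each host, all pointing at the same shared work directory
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id <org/dataset> \
  --orchestrate --episodes-per-batch 50 --num-workers 4 \
  --work-dir /mnt/shared/lerobot_work/dataset \
  --no-push

Each host's workers lease batches from the shared manifest; as described above, the final pack into the v3 layout remains the job of the orchestrating main process.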


Design notes & decisions

  • Manifest rationale: deterministic planning, safe lease-based scheduling for many workers, and resumability via pending/leased/done.
  • Concurrency model: workers prepare shards in parallel while the main process writes final outputs; this keeps workers isolated and the writer authoritative for the v3 layout (see the sketch after this list).
  • Compatibility: strictly preserves the v3 naming/layout and default thresholds (data 100 MB, video 500 MB).
  • Safety: --no-push mode supports read-only benchmarking and avoids Hub permission errors; default mode keeps the current push behavior.
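
As a rough sketch of that concurrency model (convert_batch and pack_shard are hypothetical stand-ins, not the script's real functions):

from concurrent.futures import ProcessPoolExecutor, as_completed

def run_orchestrated(batches, num_workers, convert_batch, pack_shard):
    # workers turn batches into temp shards in parallel; the main process
    # stays the single writer of the final v3 layout
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        futures = {pool.submit(convert_batch, batch): batch for batch in batches}
        for future in as_completed(futures):
            shard_dir = future.result()   # temp shard produced under _work/
            pack_shard(shard_dir)         # packed into final data/videos/meta files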

Benchmarks snapshot (MBP M1, 16 GB RAM)

| Mode | Command (abridged) | real | Notes |
| --- | --- | --- | --- |
| Baseline | convert_dataset_v21_to_v30 --no-push | 7.70 s | Sequential |
| Parallel | --max-workers 2 --no-push | 4.51 s | ~1.7× speedup |
| Orchestrated | --orchestrate --episodes-per-batch 10 --num-workers 4 --no-push | 9.84 s | Small DS ⇒ overhead dominates; shines on large DS |

Expect larger speedups on big datasets where many final file-00X outputs are produced; orchestration overlaps read/pack work and amortizes per-batch overhead.


Tiny housekeeping

  • Add bench_logs/ to .gitignore (local benchmarking artifacts).

Thanks for reviewing! Happy to adjust defaults, add tests, or extend CI coverage for the orchestrator path if that helps.

@eDeveloperOZ
Contributor Author

Update: bug fix + additional benchmarks & validation

TL;DR: My hypothesis is now CONFIRMED: the orchestrator outperforms the baseline as dataset size scales.


Bug fix

  • Fixed a regression in the new orchestrator path where convert_tasks wasn’t referenced correctly (raised a NameError), which caused the writer to fail before producing the final output.

    The orchestrator now calls the same convert_tasks logic as the sequential path and completes successfully.

  • Hardened local benchmarking mode: --no-push reliably skips all Hub mutations while performing the full local conversion.


Benchmark setup

  • Dataset: unitreerobotics/G1_Brainco_GraspOreo_Dataset (201 episodes, ~9 GB)

  • Machine: MacBook Pro (Apple M1, 16 GB RAM, local SSD)

  • Mode: local, --no-push

  • Metric: wall-clock “real” time from /usr/bin/time -l


Results


Baselines

  • Baseline A (sequential, cold-ish): 31.87 s

  • Orchestrator grid (episodes-per-batch × workers):

| Mode/Config | Episodes/Batch | Workers | Real | Speedup |
| --- | --- | --- | --- | --- |
| Baseline (reference) | – | – | 31.87 s | 1.00× |
| Orchestrator | 25 | 2 | 30.86 s | 1.03× |
| Orchestrator | 25 | 4 | 31.60 s | 1.01× |
| Orchestrator | 25 | 6 | 106.68 s | 0.30× |
| Orchestrator | 50 | 2 | 28.10 s | 1.13× |
| Orchestrator | 50 | 4 | 28.82 s | 1.11× |
| Orchestrator | 50 | 6 | 30.07 s | 1.06× |
| Orchestrator | 75 | 2 | 26.38 s | 1.21× |
| Orchestrator | 75 | 4 | 28.28 s | 1.13× |
| Orchestrator | 75 | 6 | 29.80 s | 1.07× |
| Orchestrator | 100 | 2 | 27.10 s | 1.18× |
| Orchestrator | 100 | 4 | 28.35 s | 1.12× |
| Orchestrator | 100 | 6 | 29.69 s | 1.07× |
| Orchestrator (second run) | 75 | 2 | 28.03 s | 1.14× |

Highlights

  • Best observed: 26.38 s (episodes-per-batch=75, workers=2) → ~1.21× vs the 31.87 s baseline (~1.11× vs 29.15 s).

  • Too many local workers (e.g., 6) degraded performance on this single-node SSD setup (contention & overhead).

  • Larger output file-size settings (512/2048 MB) improved over the baseline but weren't the single-node best here.


Validation (correctness)

  • We ran a baseline sequential conversion and the orchestrator conversion, both with --no-push.

  • We asserted that the final directories match (ignoring the orchestrator's transient _work/ and OS metadata). Additionally, we normalized and compared meta/stats.json; there were no meaningful differences after normalization. A sketch of this kind of check is shown below.
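
A minimal sketch of the kind of comparison we ran (illustrative; baseline_root and orchestrated_root are hypothetical paths to the two conversion outputs):

import json
from pathlib import Path

baseline_root = Path("/path/to/baseline_output")          # hypothetical locations
orchestrated_root = Path("/path/to/orchestrated_output")

def tree_listing(root: Path, ignore=("_work", ".DS_Store")):
    """Relative paths and sizes of the final files, skipping transient/OS entries."""
    return sorted(
        (str(p.relative_to(root)), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file() and not any(part in ignore for part in p.parts)
    )

def normalized_stats(path: Path):
    """Round floats so tiny numerical noise is not reported as a difference."""
    def _round(x):
        if isinstance(x, float):
            return round(x, 6)
        if isinstance(x, dict):
            return {k: _round(v) for k, v in x.items()}
        if isinstance(x, list):
            return [_round(v) for v in x]
        return x
    return json.dumps(_round(json.loads(path.read_text())), sort_keys=True)

assert tree_listing(baseline_root) == tree_listing(orchestrated_root)
assert normalized_stats(baseline_root / "meta/stats.json") == normalized_stats(orchestrated_root / "meta/stats.json")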


Conclusion

  • Our original small-dataset (50 episodes) test showed the sequential path slightly faster, so we treated “orchestrator wins” as a hypothesis.

  • On a larger dataset (201 episodes), that hypothesis is now confirmed: the orchestrator outperforms the baseline (up to ~1.21× here).

  • On single-node, fast-SSD workloads of this size, both paths are largely I/O-bound, so gains are modest locally. We expect larger speedups on bigger datasets and/or multi-node or network storage, where overlapping read/compute/write and true distribution remove sequential choke points.


@adlai

adlai commented Oct 3, 2025

  • Machine: MacBook Pro (Apple M1, 16GB RAM, local SSD)

I don't have 16GB RAM let alone half that, although 823x338 looks like a reasonable diff... should I bother reading?

@eDeveloperOZ
Contributor Author

@adlai I think so, you could just try it out with a smaller DS.

@imstevenpmwork added the enhancement, dataset, and performance labels on Oct 17, 2025.

Inline review comment on the converter script, at this block:

# --------------------------------------------------------------------------------
# Legacy helpers (unaltered behavior; reused by all modes)
# --------------------------------------------------------------------------------

I believe it is good Python style to include at least one blank line before lines of reduced indentation.
