Description
We recently introduced a new dataset format, LeRobotDataset-v3. The format is built for scale and supports a new feature we're quite excited about: streaming, which allows users to process data on the fly without storing it on disk (prohibitive for large-scale datasets, ~TB of data).
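For context, streaming access looks roughly like this. This is a minimal sketch: the entry point is assumed here to be `StreamingLeRobotDataset` and the import path may differ in the actual API.

```python
# Hypothetical sketch of streaming access to a v3.0 dataset; the class name and
# import path are assumptions based on the v3 release, not confirmed API.
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

dataset = StreamingLeRobotDataset("lerobot/svla_so101_pickplace")
for frame in dataset:
    ...  # frames are decoded on the fly; nothing is materialized on disk
```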
We have also released a porting script, which we have used to port many datasets from the old v2.1 format to the more modern v3.0. However, the conversion script is not built for large-scale datasets: it performs the conversion sequentially.
We need to modify it so that it runs in a distributed way: spawn multiple workers that each convert and aggregate a portion of the data first, then pool all the aggregated shards into the final dataset.
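A minimal map-reduce sketch of that structure, using only the standard library, could look like the following. `convert_episode_range` and `merge_shards` are hypothetical placeholders for wrapping the existing sequential conversion logic and for the final pooling step; the episode count and worker count are example values.

```python
# Sketch of the proposed distributed conversion: workers each convert a
# contiguous episode range (map), then the shards are pooled (reduce).
from concurrent.futures import ProcessPoolExecutor


def convert_episode_range(repo_id: str, start: int, stop: int, out_dir: str) -> str:
    """Convert episodes [start, stop) from v2.1 to a v3.0 shard on disk.

    Hypothetical placeholder: in practice this would wrap the existing
    sequential conversion script, restricted to a slice of the episodes.
    """
    shard_path = f"{out_dir}/shard_{start:06d}_{stop:06d}"
    ...  # run the per-episode conversion here, writing into shard_path
    return shard_path


def merge_shards(shard_paths: list[str], out_dir: str) -> None:
    """Pool the per-worker v3.0 shards into one dataset (hypothetical placeholder)."""
    ...  # re-index episodes/frames and concatenate metadata across shards


def distributed_convert(repo_id: str, num_episodes: int, num_workers: int, out_dir: str) -> None:
    # Map step: assign one contiguous episode range to each worker.
    chunk = -(-num_episodes // num_workers)  # ceiling division
    ranges = [(s, min(s + chunk, num_episodes)) for s in range(0, num_episodes, chunk)]
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(convert_episode_range, repo_id, s, e, out_dir) for s, e in ranges]
        shards = [f.result() for f in futures]
    # Reduce step: final pooling of all the aggregated shards.
    merge_shards(shards, out_dir)


if __name__ == "__main__":
    distributed_convert("lerobot/svla_so101_pickplace", num_episodes=50, num_workers=4, out_dir="./out")
```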
A good starting point would be taking any dataset currently on the Hub in v3.0, like lerobot/svla_so101_pickplace, accessing it in v2.1 (just pass the revision="v2.1" argument when instantiating it with LeRobotDataset), and playing around with a distributed conversion script at a small scale. The result could then be tested (possibly asserting frame by frame) against the ground-truth v3.0 dataset, which makes validation easier.
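The frame-by-frame check against the ground truth could look something like this. It's a sketch: the converted repo id is a placeholder for the output of the distributed script, the import path may vary across lerobot versions, and the equality check assumes tensor-valued features.

```python
# Validation sketch: compare the converted dataset frame by frame against the
# ground-truth v3.0 release on the Hub.
import torch

from lerobot.datasets.lerobot_dataset import LeRobotDataset

# The v2.1 input for the conversion is loaded as described above:
# LeRobotDataset("lerobot/svla_so101_pickplace", revision="v2.1")

ground_truth = LeRobotDataset("lerobot/svla_so101_pickplace")  # v3.0 ground truth
converted = LeRobotDataset("my_user/svla_so101_pickplace_converted")  # placeholder output

assert len(ground_truth) == len(converted)
for i in range(len(ground_truth)):
    gt, conv = ground_truth[i], converted[i]
    for key, value in gt.items():
        if isinstance(value, torch.Tensor):
            assert torch.allclose(value, conv[key]), f"mismatch at frame {i}, key '{key}'"
```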
This would be very impactful because we currently support many large-scale datasets that would otherwise be computationally prohibitive to port! Feel free to ping @fracapuano here or on x.com/_fracapuano for any help on this :))