Distributed v2.1 -> v3.0 conversion #1998

@fracapuano

We recently introduced a new dataset format, LeRobotDataset-v3. The format is built for scale and supports a new feature we're quite excited about: streaming, which lets users process data on the fly without first storing it on disk (prohibitive for large-scale datasets, on the order of terabytes of data).

We have also released a porting script, which we have used to port many datasets from the old 2.1 format to the more modern 3.0. However, the conversion script is not built for large-scale datasets: it performs the conversion sequentially.

We need to modify it so that it runs in a distributed way: multiple workers each convert and aggregate a subset of the data, and a final step pools the partial datasets into one.
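To make the shape of this concrete, here is a minimal sketch of the shard, convert, and pool flow. It is not the existing porting script: `shard_episodes`, `convert_shard`, `pool_shards`, and `convert_distributed` are hypothetical names, and `convert_shard` is a placeholder that only returns a summary where a real worker would write a partial v3.0 dataset.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def shard_episodes(episode_indices, num_workers):
    """Split episode indices into contiguous, near-equal shards (one per worker)."""
    base, extra = divmod(len(episode_indices), num_workers)
    shards, start = [], 0
    for worker in range(num_workers):
        size = base + (1 if worker < extra else 0)
        if size:  # skip empty shards when there are more workers than episodes
            shards.append(episode_indices[start:start + size])
        start += size
    return shards


def convert_shard(shard):
    """Hypothetical placeholder for the per-worker v2.1 -> v3.0 conversion.

    A real worker would convert its episodes and write a partial dataset;
    here we only return a summary so the control flow can be exercised.
    """
    return {"episodes": list(shard), "num_episodes": len(shard)}


def pool_shards(partials):
    """Merge partial results in shard order so episode order is preserved."""
    episodes = [ep for part in partials for ep in part["episodes"]]
    return {"episodes": episodes, "num_episodes": len(episodes)}


def convert_distributed(episode_indices, num_workers, executor_cls=ProcessPoolExecutor):
    """Run convert_shard on each shard in parallel, then pool the results."""
    shards = shard_episodes(episode_indices, num_workers)
    with executor_cls(max_workers=num_workers) as pool:
        # Executor.map yields results in input order, regardless of finish order.
        partials = list(pool.map(convert_shard, shards))
    return pool_shards(partials)
```

Contiguous shards keep the final pooling step a simple in-order concatenation. For quick local experiments, `ThreadPoolExecutor` can be passed as `executor_cls`; on a cluster, the same three functions could be driven by a job scheduler instead of an executor.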
A good starting point would be to take any dataset currently on the hub in v3.0, such as lerobot/svla_so101_pickplace, load it in v2.1 (just pass the revision="v2.1" argument when instantiating it with LeRobotDataset), and start playing around with a distributed conversion script on a small scale. The result can then be tested (possibly by asserting equality frame by frame) against the ground-truth v3.0 dataset, which makes testing easier.
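The frame-by-frame check could look roughly like the sketch below. It assumes each frame is a mapping from feature name to value, and `first_mismatch` is a hypothetical helper, not part of LeRobot. For tensor-valued features, the default `==` would need to be replaced by a suitable `equal` callable (for example one built on `torch.equal`).

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking the shorter of the two datasets


def first_mismatch(converted, reference, equal=lambda a, b: a == b):
    """Return (frame_index, key) of the first difference, or None if identical.

    `converted` and `reference` are iterables of frames, each frame a mapping
    from feature name to value. `equal` decides whether two values match.
    """
    pairs = zip_longest(converted, reference, fillvalue=_MISSING)
    for index, (got, want) in enumerate(pairs):
        if got is _MISSING or want is _MISSING:
            return (index, "<length>")  # one dataset ran out of frames first
        if got.keys() != want.keys():
            return (index, "<keys>")  # feature sets differ at this frame
        for key in want:
            if not equal(got[key], want[key]):
                return (index, key)
    return None
```

Returning the location of the first difference, rather than a bare boolean, makes a failing conversion easier to debug: `assert first_mismatch(converted, reference) is None` reports the offending frame and feature in the assertion context.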

This would be very impactful because we currently support many large-scale datasets that would otherwise be computationally prohibitive to port! Feel free to ping @fracapuano here or on x.com/_fracapuano for any help on this :))

Metadata

    Labels

    dataset: Issues regarding data inputs, processing, or datasets
    good first issue
    performance: Issues aimed at improving speed or resource usage
