Description
We recently introduced a new dataset format, LeRobotDataset-v3. The format is built for scale and supports a new feature we're quite excited about: streaming, which allows users to process data on the fly without storing it on disk (prohibitive for large-scale datasets, ~TB of data).
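For context, streaming access looks roughly like this. This is a minimal sketch: the entry point is assumed here to be `StreamingLeRobotDataset` and the import path may differ in the actual API.

```python
# Hypothetical sketch of streaming access to a v3.0 dataset; the class name and
# import path are assumptions based on the v3 release, not confirmed API.
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

dataset = StreamingLeRobotDataset("lerobot/svla_so101_pickplace")
for frame in dataset:
    ...  # frames are decoded on the fly; nothing is materialized on disk
```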
We have also released a porting script, which we have used to port many datasets from the old v2.1 format to the more modern v3.0. However, the conversion script is not built for large-scale datasets: it performs the conversion sequentially.
We need to modify it so that it runs in a distributed way: spawn multiple workers that each convert and aggregate a portion of the data first, then pool all the aggregated shards into the final dataset.
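A minimal map-reduce sketch of that structure, using only the standard library, could look like the following. `convert_episode_range` and `merge_shards` are hypothetical placeholders for wrapping the existing sequential conversion logic and for the final pooling step; the episode count and worker count are example values.

```python
# Sketch of the proposed distributed conversion: workers each convert a
# contiguous episode range (map), then the shards are pooled (reduce).
from concurrent.futures import ProcessPoolExecutor


def convert_episode_range(repo_id: str, start: int, stop: int, out_dir: str) -> str:
    """Convert episodes [start, stop) from v2.1 to a v3.0 shard on disk.

    Hypothetical placeholder: in practice this would wrap the existing
    sequential conversion script, restricted to a slice of the episodes.
    """
    shard_path = f"{out_dir}/shard_{start:06d}_{stop:06d}"
    ...  # run the per-episode conversion here, writing into shard_path
    return shard_path


def merge_shards(shard_paths: list[str], out_dir: str) -> None:
    """Pool the per-worker v3.0 shards into one dataset (hypothetical placeholder)."""
    ...  # re-index episodes/frames and concatenate metadata across shards


def distributed_convert(repo_id: str, num_episodes: int, num_workers: int, out_dir: str) -> None:
    # Map step: assign one contiguous episode range to each worker.
    chunk = -(-num_episodes // num_workers)  # ceiling division
    ranges = [(s, min(s + chunk, num_episodes)) for s in range(0, num_episodes, chunk)]
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(convert_episode_range, repo_id, s, e, out_dir) for s, e in ranges]
        shards = [f.result() for f in futures]
    # Reduce step: final pooling of all the aggregated shards.
    merge_shards(shards, out_dir)


if __name__ == "__main__":
    distributed_convert("lerobot/svla_so101_pickplace", num_episodes=50, num_workers=4, out_dir="./out")
```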
A good starting point would be taking any dataset currently on the Hub in v3.0, like lerobot/svla_so101_pickplace, accessing it in v2.1 (just pass the revision="v2.1" argument when instantiating it with LeRobotDataset), and playing around with a distributed conversion script at a small scale. The result could then be tested (possibly asserting frame by frame) against the ground-truth v3.0 dataset, which makes validation easier.
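The frame-by-frame check against the ground truth could look something like this. It's a sketch: the converted repo id is a placeholder for the output of the distributed script, the import path may vary across lerobot versions, and the equality check assumes tensor-valued features.

```python
# Validation sketch: compare the converted dataset frame by frame against the
# ground-truth v3.0 release on the Hub.
import torch

from lerobot.datasets.lerobot_dataset import LeRobotDataset

# The v2.1 input for the conversion is loaded as described above:
# LeRobotDataset("lerobot/svla_so101_pickplace", revision="v2.1")

ground_truth = LeRobotDataset("lerobot/svla_so101_pickplace")  # v3.0 ground truth
converted = LeRobotDataset("my_user/svla_so101_pickplace_converted")  # placeholder output

assert len(ground_truth) == len(converted)
for i in range(len(ground_truth)):
    gt, conv = ground_truth[i], converted[i]
    for key, value in gt.items():
        if isinstance(value, torch.Tensor):
            assert torch.allclose(value, conv[key]), f"mismatch at frame {i}, key '{key}'"
```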
This would be very impactful because we currently support many large-scale datasets that would otherwise be computationally prohibitive to port! Feel free to ping @fracapuano here or on x.com/_fracapuano for any help on this :))