feat: Integrate Ultralytics Support with LitData #651

Open · wants to merge 44 commits into main
Conversation

@deependujha (Collaborator) commented Jul 5, 2025

Before submitting
  • Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes # (issue).

This PR introduces support for training Ultralytics models (e.g., YOLO) directly from LitData’s optimized datasets. With this integration, large datasets (e.g., 500GB+) no longer need to be fully downloaded or stored locally. Instead, you can stream data from cloud storage (like S3) efficiently with minimal local disk usage.


🔧 How It Works

Step 1: Optimize Your Dataset (One-time)

Before training, run the optimization step on a machine that can access the full dataset. This converts the dataset into a cloud-friendly format and uploads it to your preferred storage (e.g., S3, GCS, or local disk).

from litdata.integrations.ultralytics import optimize_ultralytics_dataset 

if __name__ == "__main__":
    optimize_ultralytics_dataset(
        "coco128.yaml",                          # Your original Ultralytics-style YAML
        "s3://some-bucket/optimized-data",       # Where to store optimized chunks
        num_workers=4,                           # Number of parallel workers on this machine
        chunk_bytes="64MB"                       # Maximum size of each chunk
    )

This step creates an optimized dataset and generates a litdata_coco128.yaml file to be used during training.
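For intuition, here is how a human-readable `chunk_bytes` value such as `"64MB"` can map to a byte budget per chunk. `parse_chunk_bytes` below is a hypothetical helper for illustration, not LitData's actual implementation:

```python
# Hypothetical sketch: turn a size string like "64MB" into a byte count.
# This is NOT LitData's real parser, just an illustration of the idea.
_UNITS = {"KB": 2**10, "MB": 2**20, "GB": 2**30}

def parse_chunk_bytes(value: str) -> int:
    for suffix, factor in _UNITS.items():
        if value.upper().endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)  # plain integer string, already in bytes

print(parse_chunk_bytes("64MB"))  # 67108864 (64 * 2**20)
```

Each optimized chunk is then capped at roughly that many bytes, which is what keeps local cache usage small while streaming.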


Step 2: Patch Ultralytics for Streaming

Just once before training, call:

from litdata.integrations.ultralytics import patch_ultralytics

patch_ultralytics()

This monkey-patches Ultralytics internals to use LitData’s streaming + caching system under the hood.
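The general mechanism is ordinary Python monkey-patching: attributes on the imported module are replaced before the training code looks them up. A minimal, self-contained sketch of the pattern follows; the module and class names here are stand-ins, not LitData's actual patch targets:

```python
import types

# Stand-in for an "ultralytics-like" module whose dataset class we want to swap.
fake_ultralytics = types.ModuleType("fake_ultralytics")

class OriginalDataset:
    def load(self):
        return "reads images from local disk"

class StreamingShim:
    def load(self):
        return "streams chunks from cloud storage"

fake_ultralytics.Dataset = OriginalDataset

def patch(module):
    # Swap the class object on the module; any later `module.Dataset()` call
    # now constructs the streaming version instead of the original.
    module.Dataset = StreamingShim

patch(fake_ultralytics)
print(fake_ultralytics.Dataset().load())  # streams chunks from cloud storage
```

This is why `patch_ultralytics()` must run before the training code resolves those names, and only needs to run once per process.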


Step 3: Train as Usual

from ultralytics import YOLO

if __name__ == "__main__":
    model = YOLO("yolo11n.pt")  # Or any other YOLO model
    model.train(data="litdata_coco128.yaml", epochs=100, imgsz=640)

Complete code for training ⚡️

from litdata.integrations.ultralytics import patch_ultralytics

patch_ultralytics()

from ultralytics import YOLO

if __name__ == "__main__":
    # Load a pretrained YOLO11n model
    model = YOLO("yolo11n.pt")

    # Train the model on COCO128
    results = model.train(data="litdata_coco128.yaml", epochs=100, imgsz=640)

That’s it: you're now training directly on a cloud-optimized, streamable dataset.


📊 Performance Benchmarks

Original Ultralytics training
from ultralytics import YOLO

# Load a model
model = YOLO("yolo11n.pt")  # load a pretrained model (recommended for training)

# Train the model
results = model.train(data="coco.yaml", epochs=3, imgsz=640)
(screenshot: original-ultralytics-lightning-studio)
LitData patched Ultralytics training
  • Optimize code
from litdata.integrations.ultralytics import optimize_ultralytics_dataset

if __name__ == "__main__":
    optimize_ultralytics_dataset("coco.yaml", "fast_data", num_workers=8, chunk_bytes="64MB")
  • Training code
from litdata.integrations.ultralytics import patch_ultralytics

patch_ultralytics()

from ultralytics import YOLO

if __name__ == "__main__":
    # Load a pretrained YOLO11n model
    model = YOLO("yolo11n.pt")

    # Train the model on COCO
    results = model.train(data="litdata_coco.yaml", epochs=3, imgsz=640)
(screenshot: litdata-patch-lightning-studio)
| Metric | Patched Ultralytics (LitData) | Original Ultralytics |
| --- | --- | --- |
| Training epochs completed | ~15 minutes | 14 to 19 minutes |
| Training speed (it/s) | ~7.97–8.17 | ~6.3–8.7 |
| Val mAP50 (final eval) | 54% | 53.2% |
| GPU memory | 4.32 GB | 4.1 GB |
| mAP50-95 (final eval) | 0.376 | 0.377 |

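Taking the midpoints of the it/s ranges above gives a rough sense of the raw throughput difference. This is coarse arithmetic on the reported numbers, not a rigorous benchmark:

```python
# Midpoint-of-range comparison of the iteration rates reported above.
patched_mid = (7.97 + 8.17) / 2    # 8.07 it/s
original_mid = (6.3 + 8.7) / 2     # 7.5 it/s

speedup = patched_mid / original_mid - 1
print(f"{speedup:.1%}")  # ~7.6% faster at the midpoints
```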
Verdict: While the performance gains aren't as significant as we initially anticipated, this integration unlocks all the streaming benefits of LitData: smoother data handling, better scalability, and a cleaner architecture. We're now working on a custom dataloader optimized for LitData; if it shows substantial improvements, those will be reflected in future benchmark results.


✅ Benefits

  • No need to store full dataset locally.
  • Easily train on datasets of hundreds of GBs even on a small machine.
  • Compatible with Ultralytics training loop — minimal code changes.
  • Supports local paths, S3, GCS, HTTP(S), and more.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃


codecov bot commented Jul 5, 2025

Codecov Report

Attention: Patch coverage is 50.84746% with 145 lines in your changes missing coverage. Please review.

Project coverage is 82%. Comparing base (fa2020e) to head (15d445e).

Additional details and impacted files
@@         Coverage Diff          @@
##           main   #651    +/-   ##
====================================
- Coverage    83%    82%    -1%     
====================================
  Files        49     52     +3     
  Lines      6812   7103   +291     
====================================
+ Hits       5686   5834   +148     
- Misses     1126   1269   +143     

@tchaton tchaton changed the title [WIP]: Integrate Ultralytics Support [WIP]: Integrate Ultralytics Support in LitData Jul 10, 2025
@tchaton tchaton changed the title [WIP]: Integrate Ultralytics Support in LitData [WIP]: Integrate Ultralytics Support with LitData Jul 10, 2025
@deependujha deependujha marked this pull request as ready for review July 10, 2025 18:12
@deependujha deependujha changed the title [WIP]: Integrate Ultralytics Support with LitData feat: Integrate Ultralytics Support with LitData Jul 10, 2025
@deependujha deependujha requested a review from Copilot July 11, 2025 04:45
@Copilot (Contributor) left a comment

Pull Request Overview

This PR adds support for streaming Ultralytics (YOLO) datasets through LitData, enabling training directly on cloud-optimized datasets without full local downloads.

  • Introduced optimize_ultralytics_dataset and patch_ultralytics integrations for dataset optimization and monkey-patching Ultralytics internals.
  • Enhanced StreamingDataset to accept multiple transform functions with shared keyword arguments.
  • Added a verbose flag to processing utilities and extensive new tests for streaming and Ultralytics workflows.
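The "multiple transform functions with shared keyword arguments" behavior described above can be pictured as applying a list of callables in order, each receiving the same kwargs. This is an illustrative sketch of the idea, not LitData's actual `StreamingDataset` code:

```python
# Illustrative sketch: chain a list of transforms over one sample, passing the
# same shared keyword arguments to every transform. Names are hypothetical.
def apply_transforms(sample, transforms, **transform_kwargs):
    for fn in transforms:
        sample = fn(sample, **transform_kwargs)
    return sample

def scale(x, *, factor):
    return x * factor

def shift(x, *, factor):
    return x + factor

print(apply_transforms(2, [scale, shift], factor=3))  # (2 * 3) + 3 = 9
```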

Reviewed Changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/streaming/test_dataset.py | New parameterized test for multiple transforms |
| tests/processing/test_functions.py | Added verbose parameter to test_optimize_append_overwrite |
| tests/integrations/ultralytics_support/test_patch.py | Added tests for label parsing and detection transform |
| tests/integrations/ultralytics_support/test_optimize.py | Added tests for dataset optimization utilities |
| tests/conftest.py | Added mock_ultralytics fixture |
| src/litdata/streaming/dataset.py | Support for a list of transform functions and transform_kwargs |
| src/litdata/processing/functions.py | Added verbose parameter to optimize |
| src/litdata/processing/data_processor.py | Added verbose flag to DataProcessor and controlled prints |
| src/litdata/integrations/ultralytics/patch.py | New patch implementation for the Ultralytics integration |
| src/litdata/integrations/ultralytics/optimize.py | New optimization logic for Ultralytics datasets |
| src/litdata/integrations/ultralytics/__init__.py | Exposed optimize_ultralytics_dataset and patch_ultralytics |
| src/litdata/constants.py | Added _ULTRALYTICS_AVAILABLE constant |
| requirements/test.txt | Added ultralytics >=8.3.16 for tests |
| README.md | Documented the Ultralytics streaming integration |
Comments suppressed due to low confidence (1)

src/litdata/integrations/ultralytics/optimize.py:132

  • [nitpick] The variable name dir shadows the built-in dir function and can be confusing. Consider renaming it to something like output_entry or mode_dir.
    for mode, dir in mode_to_dir.items():
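To see why the shadowing is confusing: after such a loop, `dir` in that scope refers to the last dictionary value rather than the builtin. A small sketch (the `mode_to_dir` contents here are made up for illustration):

```python
# Illustrative dict; the real mapping lives in optimize.py.
mode_to_dir = {"train": "train_images", "val": "val_images"}

# Shadowing version: `dir` stops meaning the builtin inside this scope.
for mode, dir in mode_to_dir.items():
    pass

print(dir)  # val_images -- the last dict value, not <built-in function dir>

# Suggested rename keeps the builtin usable and reads more clearly.
for mode, mode_dir in mode_to_dir.items():
    print(mode, mode_dir)
```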

@bhimrazy (Collaborator) left a comment

Added a few comments and queries. Looking great so far!

Sorry, I couldn’t check all the file changes; I’ll do a full review later.

@lantiga (Collaborator) left a comment

Great work, added a couple of comments

from litdata.streaming.dataset import StreamingDataset


def patch_ultralytics() -> None:

There's a bit of a licensing pickle here, since Ultralytics has a custom license and we are taking code from it here in order to patch it. I don't think this is a material problem per se, since we are enhancing the use of the Ultralytics library and not copying functionality. I would still point it out in a comment in the file, and try to minimize the amount of Ultralytics code we copy here.

In addition, I would add a few regression tests that will break if the API in Ultralytics changes, because we are making quite a few assumptions on the way Ultralytics is laid out today.
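One hedged way to implement such regression tests is an API-surface check that fails loudly the moment an expected attribute disappears upstream. The sketch below demonstrates the idea on the stdlib `json` module; the real tests would point at whichever Ultralytics symbols the patch replaces (those names are not listed in this thread):

```python
# Sketch of an API-surface regression check. `missing_attrs` is a hypothetical
# helper; demonstrated on stdlib `json` so the example is self-contained.
import json

def missing_attrs(module, expected: set) -> set:
    """Return the expected attribute names the module no longer provides."""
    return {name for name in expected if not hasattr(module, name)}

# A test like `assert not missing_attrs(ultralytics_module, {...})` would
# break as soon as an upstream rename invalidates the patch's assumptions.
print(missing_attrs(json, {"dumps", "loads"}))  # set()
```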

4 participants