feat: Integrate Ultralytics Support with LitData #651

Open · wants to merge 44 commits into main
Conversation

@deependujha (Collaborator) commented Jul 5, 2025

Before submitting
  • Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes # (issue).

This PR introduces support for training Ultralytics models (e.g., YOLO) directly from LitData’s optimized datasets. With this integration, large datasets (e.g., 500GB+) no longer need to be fully downloaded or stored locally. Instead, you can stream data from cloud storage (like S3) efficiently with minimal local disk usage.


🔧 How It Works

Step 1: Optimize Your Dataset (One-time)

Before training, run the optimization step on a machine that can access the full dataset. This converts the dataset into a cloud-friendly format and uploads it to your preferred storage (e.g., S3, GCS, or local disk).

from litdata.integrations.ultralytics import optimize_ultralytics_dataset 

if __name__ == "__main__":
    optimize_ultralytics_dataset(
        "coco128.yaml",                          # Your original Ultralytics-style YAML
        "s3://some-bucket/optimized-data",       # Where to store optimized chunks
        num_workers=4,                           # Number of parallel workers on this machine
        chunk_bytes="64MB"                       # Maximum size of each chunk
    )

This step creates an optimized dataset and generates a litdata_coco128.yaml file to be used during training.
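For intuition, here is how a human-readable `chunk_bytes` value such as `"64MB"` can map to a byte budget per chunk. `parse_chunk_bytes` below is a hypothetical helper for illustration, not LitData's actual implementation:

```python
# Hypothetical sketch: turn a size string like "64MB" into a byte count.
# This is NOT LitData's real parser, just an illustration of the idea.
_UNITS = {"KB": 2**10, "MB": 2**20, "GB": 2**30}

def parse_chunk_bytes(value: str) -> int:
    for suffix, factor in _UNITS.items():
        if value.upper().endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)  # plain integer string, already in bytes

print(parse_chunk_bytes("64MB"))  # 67108864 (64 * 2**20)
```

Each optimized chunk is then capped at roughly that many bytes, which is what keeps local cache usage small while streaming.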


Step 2: Patch Ultralytics for Streaming

Just once before training, call:

from litdata.integrations.ultralytics import patch_ultralytics

patch_ultralytics()

This monkey-patches Ultralytics internals to use LitData’s streaming + caching system under the hood.
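The general mechanism is ordinary Python monkey-patching: attributes on the imported module are replaced before the training code looks them up. A minimal, self-contained sketch of the pattern follows; the module and class names here are stand-ins, not LitData's actual patch targets:

```python
import types

# Stand-in for an "ultralytics-like" module whose dataset class we want to swap.
fake_ultralytics = types.ModuleType("fake_ultralytics")

class OriginalDataset:
    def load(self):
        return "reads images from local disk"

class StreamingShim:
    def load(self):
        return "streams chunks from cloud storage"

fake_ultralytics.Dataset = OriginalDataset

def patch(module):
    # Swap the class object on the module; any later `module.Dataset()` call
    # now constructs the streaming version instead of the original.
    module.Dataset = StreamingShim

patch(fake_ultralytics)
print(fake_ultralytics.Dataset().load())  # streams chunks from cloud storage
```

This is why `patch_ultralytics()` must run before the training code resolves those names, and only needs to run once per process.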


Step 3: Train as Usual

from ultralytics import YOLO

if __name__ == "__main__":
    model = YOLO("yolo11n.pt")  # Or any other YOLO model
    model.train(data="litdata_coco128.yaml", epochs=100, imgsz=640)

Complete code for training ⚡️

from litdata.integrations.ultralytics import patch_ultralytics

patch_ultralytics()

from ultralytics import YOLO

if __name__ == "__main__":
    # Load a pretrained YOLO11n model
    model = YOLO("yolo11n.pt")

    # Train the model on COCO128
    results = model.train(data="litdata_coco128.yaml", epochs=100, imgsz=640)

That’s it: you're now training directly on a cloud-optimized, streamable dataset.


📊 Performance Benchmarks

Original Ultralytics training
from ultralytics import YOLO

# Load a model
model = YOLO("yolo11n.pt")  # load a pretrained model (recommended for training)

# Train the model
results = model.train(data="coco.yaml", epochs=3, imgsz=640)
(screenshot: original-ultralytics-lightning-studio)
LitData patched Ultralytics training
  • Optimize code
from litdata.integrations.ultralytics import optimize_ultralytics_dataset

if __name__ == "__main__":
    optimize_ultralytics_dataset("coco.yaml", "fast_data", num_workers=8, chunk_bytes="64MB")
  • Training code
from litdata.integrations.ultralytics import patch_ultralytics

patch_ultralytics()

from ultralytics import YOLO

if __name__ == "__main__":
    # Load a pretrained YOLO11n model
    model = YOLO("yolo11n.pt")

    # Train the model on COCO
    results = model.train(data="litdata_coco.yaml", epochs=3, imgsz=640)
(screenshot: litdata-patch-lightning-studio)
| Metric | Patched Ultralytics (LitData) | Original Ultralytics |
| --- | --- | --- |
| Training epochs completed | ~15 minutes | 14 to 19 minutes |
| Training speed (it/s) | ~7.97–8.17 | ~6.3–8.7 |
| Val mAP50 (final eval) | 54% | 53.2% |
| GPU memory | 4.32 GB | 4.1 GB |
| mAP50-95 (final eval) | 0.376 | 0.377 |

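Taking the midpoints of the it/s ranges above gives a rough sense of the raw throughput difference. This is coarse arithmetic on the reported numbers, not a rigorous benchmark:

```python
# Midpoint-of-range comparison of the iteration rates reported above.
patched_mid = (7.97 + 8.17) / 2    # 8.07 it/s
original_mid = (6.3 + 8.7) / 2     # 7.5 it/s

speedup = patched_mid / original_mid - 1
print(f"{speedup:.1%}")  # ~7.6% faster at the midpoints
```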
Verdict: While the performance gains aren't as significant as we initially anticipated, this integration unlocks all the streaming benefits of LitData: smoother data handling, better scalability, and a cleaner architecture. We're now working on a custom dataloader optimized for LitData; if it shows substantial improvements, those will be reflected in future benchmark results.


✅ Benefits

  • No need to store full dataset locally.
  • Easily train on datasets of hundreds of GBs even on a small machine.
  • Compatible with Ultralytics training loop — minimal code changes.
  • Supports local paths, S3, GCS, HTTP(S), and more.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃


codecov bot commented Jul 5, 2025

Codecov Report

Attention: Patch coverage is 50.84746% with 145 lines in your changes missing coverage. Please review.

Project coverage is 82%. Comparing base (fa2020e) to head (15d445e).

Additional details and impacted files
@@         Coverage Diff          @@
##           main   #651    +/-   ##
====================================
- Coverage    83%    82%    -1%     
====================================
  Files        49     52     +3     
  Lines      6812   7103   +291     
====================================
+ Hits       5686   5834   +148     
- Misses     1126   1269   +143     

@tchaton tchaton changed the title [WIP]: Integrate Ultralytics Support [WIP]: Integrate Ultralytics Support in LitData Jul 10, 2025
@tchaton tchaton changed the title [WIP]: Integrate Ultralytics Support in LitData [WIP]: Integrate Ultralytics Support with LitData Jul 10, 2025
@deependujha deependujha marked this pull request as ready for review July 10, 2025 18:12
@deependujha deependujha changed the title [WIP]: Integrate Ultralytics Support with LitData feat: Integrate Ultralytics Support with LitData Jul 10, 2025
@deependujha deependujha requested a review from Copilot July 11, 2025 04:45
@Copilot (Contributor) left a comment

Pull Request Overview

This PR adds support for streaming Ultralytics (YOLO) datasets through LitData, enabling training directly on cloud-optimized datasets without full local downloads.

  • Introduced optimize_ultralytics_dataset and patch_ultralytics integrations for dataset optimization and monkey-patching Ultralytics internals.
  • Enhanced StreamingDataset to accept multiple transform functions with shared keyword arguments.
  • Added a verbose flag to processing utilities and extensive new tests for streaming and Ultralytics workflows.
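The "multiple transform functions with shared keyword arguments" behavior described above can be pictured as applying a list of callables in order, each receiving the same kwargs. This is an illustrative sketch of the idea, not LitData's actual `StreamingDataset` code:

```python
# Illustrative sketch: chain a list of transforms over one sample, passing the
# same shared keyword arguments to every transform. Names are hypothetical.
def apply_transforms(sample, transforms, **transform_kwargs):
    for fn in transforms:
        sample = fn(sample, **transform_kwargs)
    return sample

def scale(x, *, factor):
    return x * factor

def shift(x, *, factor):
    return x + factor

print(apply_transforms(2, [scale, shift], factor=3))  # (2 * 3) + 3 = 9
```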

Reviewed Changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/streaming/test_dataset.py | New parameterized test for multiple transforms |
| tests/processing/test_functions.py | Added verbose parameter to test_optimize_append_overwrite |
| tests/integrations/ultralytics_support/test_patch.py | Added tests for label parsing and detection transform |
| tests/integrations/ultralytics_support/test_optimize.py | Added tests for dataset optimization utilities |
| tests/conftest.py | Added mock_ultralytics fixture |
| src/litdata/streaming/dataset.py | Support for a list of transform functions and transform_kwargs |
| src/litdata/processing/functions.py | Added verbose parameter to optimize |
| src/litdata/processing/data_processor.py | Added verbose flag to DataProcessor and controlled prints |
| src/litdata/integrations/ultralytics/patch.py | New patch implementation for the Ultralytics integration |
| src/litdata/integrations/ultralytics/optimize.py | New optimization logic for Ultralytics datasets |
| src/litdata/integrations/ultralytics/__init__.py | Exposed optimize_ultralytics_dataset and patch_ultralytics |
| src/litdata/constants.py | Added _ULTRALYTICS_AVAILABLE constant |
| requirements/test.txt | Added ultralytics >=8.3.16 for tests |
| README.md | Documented the Ultralytics streaming integration |
Comments suppressed due to low confidence (1)

src/litdata/integrations/ultralytics/optimize.py:132

  • [nitpick] The variable name dir shadows the built-in dir function and can be confusing. Consider renaming it to something like output_entry or mode_dir.
    for mode, dir in mode_to_dir.items():
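To see why the shadowing is confusing: after such a loop, `dir` in that scope refers to the last dictionary value rather than the builtin. A small sketch (the `mode_to_dir` contents here are made up for illustration):

```python
# Illustrative dict; the real mapping lives in optimize.py.
mode_to_dir = {"train": "train_images", "val": "val_images"}

# Shadowing version: `dir` stops meaning the builtin inside this scope.
for mode, dir in mode_to_dir.items():
    pass

print(dir)  # val_images -- the last dict value, not <built-in function dir>

# Suggested rename keeps the builtin usable and reads more clearly.
for mode, mode_dir in mode_to_dir.items():
    print(mode, mode_dir)
```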

@bhimrazy (Collaborator) left a comment

Added a few comments and queries. Looking great so far!

Sorry, I couldn’t check all the file changes; I’ll do a full review later.

@lantiga (Collaborator) left a comment

Great work, added a couple of comments

from litdata.streaming.dataset import StreamingDataset


def patch_ultralytics() -> None:

There's a bit of a licensing pickle here, since Ultralytics has a custom license and we are taking code from it here in order to patch it. I don't think this is a material problem per se, since we are enhancing the use of the Ultralytics library and not copying functionality. I would still point it out in a comment in the file, and try to minimize the amount of Ultralytics code we copy here.

In addition, I would add a few regression tests that will break if the API in Ultralytics changes, because we are making quite a few assumptions on the way Ultralytics is laid out today.
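One hedged way to implement such regression tests is an API-surface check that fails loudly the moment an expected attribute disappears upstream. The sketch below demonstrates the idea on the stdlib `json` module; the real tests would point at whichever Ultralytics symbols the patch replaces (those names are not listed in this thread):

```python
# Sketch of an API-surface regression check. `missing_attrs` is a hypothetical
# helper; demonstrated on stdlib `json` so the example is self-contained.
import json

def missing_attrs(module, expected: set) -> set:
    """Return the expected attribute names the module no longer provides."""
    return {name for name in expected if not hasattr(module, name)}

# A test like `assert not missing_attrs(ultralytics_module, {...})` would
# break as soon as an upstream rename invalidates the patch's assumptions.
print(missing_attrs(json, {"dumps", "loads"}))  # set()
```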

4 participants