feat: Integrate Ultralytics Support with LitData #651
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@           Coverage Diff           @@
##            main    #651    +/-   ##
======================================
- Coverage     83%     82%     -1%
======================================
  Files         49      52      +3
  Lines       6812    7103    +291
======================================
+ Hits        5686    5834    +148
- Misses      1126    1269    +143
Pull Request Overview
This PR adds support for streaming Ultralytics (YOLO) datasets through LitData, enabling training directly on cloud-optimized datasets without full local downloads.
- Introduced `optimize_ultralytics_dataset` and `patch_ultralytics` integrations for dataset optimization and monkey-patching Ultralytics internals.
- Enhanced `StreamingDataset` to accept multiple transform functions with shared keyword arguments.
- Added a `verbose` flag to processing utilities and extensive new tests for streaming and Ultralytics workflows.
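As a rough illustration of the second point, the sketch below applies a list of transform functions in order, passing the same keyword arguments to each. It is not LitData's implementation: `apply_transforms`, `scale`, and `shift` are hypothetical names, and the real `StreamingDataset` may forward `transform_kwargs` differently.

```python
from typing import Any, Callable, Dict, Optional, Sequence

Transform = Callable[..., Any]


def apply_transforms(
    sample: Any,
    transforms: Sequence[Transform],
    transform_kwargs: Optional[Dict[str, Any]] = None,
) -> Any:
    """Run each transform in order, giving all of them the same keyword arguments."""
    kwargs = transform_kwargs or {}
    for transform in transforms:
        sample = transform(sample, **kwargs)
    return sample


# Hypothetical transforms: each one picks the kwargs it needs and ignores the rest.
def scale(sample, *, factor=1.0, **_):
    return [x * factor for x in sample]


def shift(sample, *, offset=0.0, **_):
    return [x + offset for x in sample]


result = apply_transforms([1.0, 2.0], [scale, shift], {"factor": 2.0, "offset": 1.0})
```

Because every transform receives the full set of shared keyword arguments, each one has to tolerate keys it does not use (here via `**_`).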
Reviewed Changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.
File | Description |
---|---|
tests/streaming/test_dataset.py | New parameterized test for multiple transforms |
tests/processing/test_functions.py | Added verbose parameter to test_optimize_append_overwrite |
tests/integrations/ultralytics_support/test_patch.py | Added tests for label parsing and detection transform |
tests/integrations/ultralytics_support/test_optimize.py | Added tests for dataset optimization utilities |
tests/conftest.py | Added mock_ultralytics fixture |
src/litdata/streaming/dataset.py | Support for list of transform functions and transform_kwargs |
src/litdata/processing/functions.py | Added verbose parameter to optimize |
src/litdata/processing/data_processor.py | Added verbose flag to DataProcessor and controlled prints |
src/litdata/integrations/ultralytics/patch.py | New patch implementation for Ultralytics integration |
src/litdata/integrations/ultralytics/optimize.py | New optimization logic for Ultralytics datasets |
src/litdata/integrations/ultralytics/__init__.py | Exposed optimize_ultralytics_dataset and patch_ultralytics |
src/litdata/constants.py | Added _ULTRALYTICS_AVAILABLE constant |
requirements/test.txt | Added ultralytics >=8.3.16 for tests |
README.md | Documented Ultralytics streaming integration |
Comments suppressed due to low confidence (1)
src/litdata/integrations/ultralytics/optimize.py:132
- [nitpick] The variable name `dir` shadows the built-in `dir` function and can be confusing. Consider renaming it to something like `output_entry` or `mode_dir`.
for mode, dir in mode_to_dir.items():
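A minimal sketch of the suggested rename follows. The contents of `mode_to_dir` are made up for illustration and are not taken from `optimize.py`.

```python
# Hypothetical mapping; the real values come from the dataset YAML.
mode_to_dir = {"train": "images/train2017", "val": "images/val2017"}

collected = []
for mode, mode_dir in mode_to_dir.items():
    # `mode_dir` avoids rebinding the built-in `dir` inside this scope.
    collected.append((mode, mode_dir))

# The built-in is still usable after the loop.
builtin_dir_is_intact = callable(dir)
```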
Added a few comments and queries. Looking great so far!
Sorry, I couldn't check all the file changes; I will review them fully later.
Great work, added a couple of comments
from litdata.streaming.dataset import StreamingDataset


def patch_ultralytics() -> None:
There's a bit of a licensing pickle here, since Ultralytics has a custom license and we are taking code from it here in order to patch it. I don't think this is a material problem per se, since we are enhancing the use of the Ultralytics library and not copying functionality. I would still point it out in a comment in the file, and try to minimize the amount of Ultralytics code we copy here.
In addition, I would add a few regression tests that will break if the API in Ultralytics changes, because we are making quite a few assumptions on the way Ultralytics is laid out today.
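One way such a regression guard could look is sketched below: a helper that reports which expected attributes are missing from a set of modules. The helper name `find_missing_attributes` is hypothetical, and the demonstration uses the standard-library `json` module because this thread does not list the Ultralytics names that the patch depends on.

```python
import importlib
from typing import Dict, List, Sequence


def find_missing_attributes(expected: Dict[str, Sequence[str]]) -> Dict[str, List[str]]:
    """Return, per module, the expected attribute names that are not present."""
    missing: Dict[str, List[str]] = {}
    for module_name, attr_names in expected.items():
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # The whole module moved or is not installed.
            missing[module_name] = ["<module not importable>"]
            continue
        absent = [name for name in attr_names if not hasattr(module, name)]
        if absent:
            missing[module_name] = absent
    return missing


# Demonstration with a standard-library module and one deliberately wrong name.
demo = find_missing_attributes({"json": ["loads", "dumps", "no_such_function"]})
```

In a real test, the mapping would list the Ultralytics modules and attributes that `patch.py` touches, the test would assert that the result is empty, and `pytest.importorskip("ultralytics")` could skip it when the package is not installed.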
Before submitting
What does this PR do?
Fixes # (issue).
This PR introduces support for training Ultralytics models (e.g., YOLO) directly from LitData’s optimized datasets. With this integration, large datasets (e.g., 500GB+) no longer need to be fully downloaded or stored locally. Instead, you can stream data from cloud storage (like S3) efficiently with minimal local disk usage.
🔧 How It Works
Step 1: Optimize Your Dataset (One-time)
Before training, run the optimization step on a machine that can access the full dataset. This converts the dataset into a cloud-friendly format and uploads it to your preferred storage (e.g., S3, GCS, or local disk).
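The optimization call might look roughly like the sketch below. The function name and import path come from this PR's file summary, but the argument order, the `num_workers` and `chunk_bytes` keywords, and the bucket URL are assumptions here and should be checked against the README. Running it requires `litdata` and `ultralytics` to be installed.

```python
def optimize_coco128(output_dir: str) -> None:
    """Optimize the coco128 dataset into `output_dir` (sketch, assumed signature)."""
    # Imported lazily so this module can be loaded without litdata installed.
    from litdata.integrations.ultralytics import optimize_ultralytics_dataset

    optimize_ultralytics_dataset(
        "coco128.yaml",      # assumed: the Ultralytics dataset YAML to optimize
        output_dir,          # assumed: destination, e.g. an S3 URL or a local path
        num_workers=4,       # assumed keyword argument
        chunk_bytes="64MB",  # assumed keyword argument
    )


# Example invocation with a placeholder bucket name:
# optimize_coco128("s3://some-bucket/optimized-data")
```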
This step creates an optimized dataset and generates a `litdata_coco128.yaml` file to be used during training.
Step 2: Patch Ultralytics for Streaming
Call `patch_ultralytics()` once before training.
This monkey-patches Ultralytics internals to use LitData’s streaming + caching system under the hood.
Step 3: Train as Usual
Complete code for training ⚡️
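A minimal sketch of the full training flow, assuming `litdata` and `ultralytics` are installed and that `litdata_coco128.yaml` was produced by the optimization step. The wrapper function `train_on_streamed_dataset` and the `yolo11n.pt` checkpoint are illustrative choices, not taken from this PR.

```python
def train_on_streamed_dataset(data_yaml: str = "litdata_coco128.yaml", epochs: int = 1) -> None:
    """Patch Ultralytics, then train a YOLO model on the optimized dataset (sketch)."""
    # Imported lazily so this module can be loaded without the packages installed.
    from ultralytics import YOLO

    from litdata.integrations.ultralytics import patch_ultralytics

    # Apply the monkey-patch once, before any training starts.
    patch_ultralytics()

    model = YOLO("yolo11n.pt")  # illustrative checkpoint choice
    model.train(data=data_yaml, epochs=epochs)


# Example invocation:
# train_on_streamed_dataset()
```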
That’s it, you're now training directly on a cloud-optimized, streamable dataset.
📊 Performance Benchmarks
Original Ultralytics training
LitData patched Ultralytics training
✅ Benefits
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃