[wip] feat: Add StreamingRawDataset for cloud storage streaming (early stage) #652
Pull Request Overview
Adds a new StreamingRawDataset for streaming raw files directly from cloud storage (S3/GCS) with local caching, multithreaded indexing, and preloading.
- Introduces CacheManager for directory-structured caching and file downloading
- Builds or loads a file index in parallel and saves it to cache
- Implements adaptive preloading and cache-hit statistics for performance monitoring
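The "build or load a file index in parallel" step can be illustrated with a minimal sketch. It is not the PR's implementation: the names `list_class_dir` and `build_or_load_index`, the `index.json` file name, and the use of a local directory tree in place of an S3/GCS listing are all assumptions made for illustration.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def list_class_dir(directory: str) -> list[dict]:
    """List files in one class directory (stand-in for a cloud listing call)."""
    label = Path(directory).name
    return [
        {"path": str(p), "label": label}
        for p in sorted(Path(directory).iterdir())
        if p.is_file()
    ]


def build_or_load_index(root: str, cache_dir: str, max_workers: int = 8) -> list[dict]:
    """Return the file index, reusing a cached copy when one exists."""
    index_path = Path(cache_dir) / "index.json"  # hypothetical cache file name
    if index_path.exists():
        return json.loads(index_path.read_text())

    # One listing task per class directory, run on a thread pool because
    # listing is I/O-bound.
    subdirs = sorted(str(p) for p in Path(root).iterdir() if p.is_dir())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        listings = list(pool.map(list_class_dir, subdirs))

    index = [entry for listing in listings for entry in listing]
    index_path.parent.mkdir(parents=True, exist_ok=True)
    index_path.write_text(json.dumps(index))
    return index
```

In this sketch a cached index is returned as-is, so files added after the first build are not seen until the cache file is removed; whether the PR invalidates its index differently is not stated here.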
Comments suppressed due to low confidence (2)
src/litdata/streaming/raw_dataset.py:323
- The fallback return dict uses class_name while the success path uses label for the class key. This inconsistency can confuse consumers; unify on a single key (e.g. always label).
return {"path": file_path, "class_name": class_name, "index": index}
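One way to address this comment is to build the sample dict in a single place so both paths share the same keys. The sketch below is an assumption about how that could look, not the code in raw_dataset.py: the helper `make_sample`, its `loader` parameter, and the `data` field are hypothetical.

```python
from typing import Any, Callable, Optional


def make_sample(
    file_path: str,
    class_name: str,
    index: int,
    loader: Optional[Callable[[str], Any]] = None,
) -> dict:
    """Build a sample dict whose keys are identical on success and fallback."""
    # The class is always stored under "label", never "class_name".
    sample = {"path": file_path, "label": class_name, "index": index, "data": None}
    if loader is not None:
        try:
            sample["data"] = loader(file_path)
        except OSError:
            # Fallback path: keep the same keys, leave "data" as None.
            pass
    return sample
```

Consumers can then read `sample["label"]` without checking which path produced the sample.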
src/litdata/streaming/raw_dataset.py:94
- There are no tests accompanying this new streaming dataset. Please add unit tests for index building (fresh and cached), caching behavior, __getitem__, and fallback loading to ensure correct and robust behavior.
class StreamingRawDataset(IterableDataset):
Codecov Report
❌ Your patch check has failed because the patch coverage (0%) is below the target coverage (50%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
##           main   #652    +/- ##
====================================
- Coverage    83%    81%    -3%
====================================
  Files        49     50     +1
  Lines      6785   7007   +222
====================================
+ Hits       5662   5665     +3
- Misses     1123   1342   +219
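The percentages in the coverage diff follow from the Hits and Lines rows. As a quick check, this small sketch recomputes them; the function name `coverage_pct` is illustrative and not part of Codecov.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Project coverage as a percentage of covered lines."""
    return 100.0 * hits / lines


# Figures taken from the coverage diff: main vs. this PR.
base_hits, base_misses, base_lines = 5662, 1123, 6785
head_hits, head_misses, head_lines = 5665, 1342, 7007

base = coverage_pct(base_hits, base_lines)
head = coverage_pct(head_hits, head_lines)
delta = head - base

print(f"main: {base:.2f}%  PR: {head:.2f}%  change: {delta:+.2f} points")
```

Rounded to whole percentages, these values match the 83%, 81%, and -3% reported in the diff. Nearly all of the 222 added lines count as misses, which is why project coverage drops.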
Before submitting
What does this PR do?
Overview
Adds a new StreamingRawDataset class that enables efficient streaming of raw files from cloud storage (S3/GCS) without requiring data optimization.
Current State
⚠️ Early Stage & Testing Phase - May change significantly based on feedback and testing.
Usage Example
Benchmarks
Initial testing was done on Caltech-101 (4-CPU machine); final benchmarking will be done on ImageNet.
Caltech-101
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃