feat: Integrate Ultralytics Support with LitData #651
Open: deependujha wants to merge 44 commits into Lightning-AI:main from deependujha:feat/integrate-ultralytics-support
Commits (44)
18327c1  add verbose option in optimize_fn (deependujha)
202b88c  optimize yolo dataset (deependujha)
fd01f61  update (deependujha)
6f9d631  update (deependujha)
1560225  update (deependujha)
2ab3184  patching works. verified for check_det_dataset of ultralytics (deependujha)
678f048  ready to patch ultralytics now (deependujha)
da75d42  getting closer (deependujha)
8667e52  update (deependujha)
b2f5677  yolo model train end to end (deependujha)
39c7bf3  update (deependujha)
41175c2  Merge branch 'main' into feat/integrate-ultralytics-support (deependujha)
3159a07  fix mypy errors (deependujha)
08b0683  update (deependujha)
ee8d179  update (deependujha)
90d828e  despacito (deependujha)
5ce76d6  update (deependujha)
cbe7ef4  write tests (deependujha)
c5e177d  update (deependujha)
9c8842a  update (deependujha)
74f88f3  update (deependujha)
2c00abf  update (deependujha)
5d8a704  update (deependujha)
867d8d0  update (deependujha)
74aa0ae  test-cov (deependujha)
5f85ccf  update (deependujha)
3b2f398  add readme (deependujha)
8d06209  Update README.md (deependujha)
e1eefc5  Update README.md (deependujha)
b000474  remove redundant comment (deependujha)
0f2a4c4  test-cov (deependujha)
8bb10b4  update readme (deependujha)
6113878  update (deependujha)
36c46f1  Update src/litdata/streaming/dataset.py (deependujha)
b0dc2cd  Merge branch 'main' into feat/integrate-ultralytics-support (deependujha)
be4aab7  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
5d315a0  update (deependujha)
b16d4aa  update pr (deependujha)
9ca3dbd  update (deependujha)
d9a319f  update (deependujha)
bc760b2  update (deependujha)
928d6a3  update (deependujha)
eeb25fd  Refactor image optimization function to accept customizable image qua… (deependujha)
15d445e  Merge branch 'main' into feat/integrate-ultralytics-support (deependujha)
New file (+12 lines), containing only the Apache-2.0 license header:

```python
# Copyright The Lightning AI team.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```
New file (+16 lines), exposing the integration's public API:

```python
# Apache-2.0 license header as above (omitted here)
from litdata.integrations.ultralytics.optimize import optimize_ultralytics_dataset
from litdata.integrations.ultralytics.patch import patch_ultralytics

__all__ = ["optimize_ultralytics_dataset", "patch_ultralytics"]
```
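Taken together, the two exports suggest a two-step workflow: optimize the dataset once, then patch Ultralytics so training reads from the optimized copy. A minimal sketch of how they would be wired up, assuming a hypothetical local `coco8.yaml` and that `patch_ultralytics()` takes no arguments (its definition is not part of this diff):

```python
from litdata.integrations.ultralytics import optimize_ultralytics_dataset, patch_ultralytics

# One-time preprocessing: re-encode the images and write the samples into chunks.
optimize_ultralytics_dataset(
    yaml_path="coco8.yaml",        # hypothetical dataset YAML
    output_dir="optimized_coco8",  # where the per-split chunks are written
    chunk_bytes="64MB",
    num_workers=4,
)

# Patch Ultralytics so its data pipeline streams from the LitData-optimized dataset.
patch_ultralytics()

# Training would then point at the rewritten YAML (litdata_coco8.yaml) produced above.
```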
New file (+178 lines): the dataset optimization module.

```python
# Apache-2.0 license header as above (omitted here)

import os
from functools import partial
from pathlib import Path
from typing import Optional, Union

import yaml

from litdata.constants import _PIL_AVAILABLE, _ULTRALYTICS_AVAILABLE
from litdata.processing.functions import optimize
from litdata.streaming.resolver import Dir, _resolve_dir


def _ultralytics_optimize_fn(img_path: str, img_quality: int) -> Optional[dict]:
    """Per-sample optimize fn for Ultralytics: reads an image and its label, re-encoding the image to reduce size."""
    if not img_path.endswith((".jpg", ".jpeg", ".png")):
        raise ValueError(f"Unsupported image format: {img_path}. Supported formats are .jpg, .jpeg, and .png.")

    import cv2

    img_ext = os.path.splitext(img_path)[-1].lower()

    # Read the image with OpenCV
    img = cv2.imread(img_path, cv2.IMREAD_COLOR)
    if img is None:
        raise ValueError(f"Failed to read image: {img_path}")

    # Re-encode as JPEG at the requested quality to shrink the sample
    if img_ext in (".jpg", ".jpeg", ".png"):
        encode_param = [int(cv2.IMWRITE_JPEG_QUALITY), img_quality]
        success, encoded = cv2.imencode(".jpg", img, encode_param)
        if not success:
            raise ValueError(f"JPEG encoding failed for: {img_path}")

        # Decode back to a numpy array (OpenCV's default in-memory format)
        img = cv2.imdecode(encoded, cv2.IMREAD_COLOR)

    # Load the label from the mirrored labels/ tree
    label = ""
    label_path = img_path.replace("images", "labels").replace(img_ext, ".txt")
    if os.path.isfile(label_path):
        with open(label_path) as f:
            label = f.read().strip()
    else:
        return None  # skip samples that have no label file

    return {
        "img": img,
        "label": label,
    }
```
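The label lookup relies on the standard YOLO layout, where the `labels` tree mirrors the `images` tree. A small illustration of the path rewrite, using hypothetical paths:

```python
img_path = "datasets/coco8/images/train/000000000009.jpg"  # hypothetical sample
label_path = img_path.replace("images", "labels").replace(".jpg", ".txt")
assert label_path == "datasets/coco8/labels/train/000000000009.txt"
```

Note that `str.replace` substitutes every occurrence of `images`, so a dataset rooted at a path that itself contains the substring `images` would be rewritten in more places than intended.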
The driver validates inputs, optimizes each split, and rewrites the dataset YAML:

```python
def optimize_ultralytics_dataset(
    yaml_path: str,
    output_dir: str,
    chunk_size: Optional[int] = None,
    chunk_bytes: Optional[Union[int, str]] = None,
    num_workers: int = 1,
    img_quality: int = 90,
    verbose: bool = False,
) -> None:
    """Optimize an Ultralytics dataset by re-encoding images and writing the samples into chunks.

    Args:
        yaml_path: Path to the dataset YAML file.
        output_dir: Directory where the optimized dataset will be saved.
        chunk_size: Number of samples per chunk. Exactly one of chunk_size or chunk_bytes must be provided.
        chunk_bytes: Maximum size of each chunk in bytes. Exactly one of chunk_size or chunk_bytes must be provided.
        num_workers: Number of worker processes to use for optimization. Defaults to 1.
        img_quality: JPEG quality (0-100) used when re-encoding images. Defaults to 90.
        verbose: Whether to print progress messages. Defaults to False.
    """
    if not _ULTRALYTICS_AVAILABLE:
        raise ImportError(
            "Ultralytics is not installed. Please install it with `pip install ultralytics` to use this function."
        )
    if not _PIL_AVAILABLE:
        raise ImportError("PIL is not installed. Please install it with `pip install pillow` to use this function.")

    # check that the YAML file exists and is a file
    if not os.path.isfile(yaml_path):
        raise FileNotFoundError(f"YAML file not found: {yaml_path}")

    if chunk_bytes is None and chunk_size is None:
        raise ValueError("Either chunk_bytes or chunk_size must be specified.")

    if chunk_bytes is not None and chunk_size is not None:
        raise ValueError("Only one of chunk_bytes or chunk_size should be specified, not both.")

    from ultralytics.data.utils import check_det_dataset

    # parse the YAML file and make sure the data exists, downloading it if needed
    dataset_config = check_det_dataset(yaml_path)

    output_dir = _resolve_dir(output_dir)

    mode_to_dir = {}

    for mode in ("train", "val", "test"):
        if dataset_config[mode] is None:
            continue

        if not os.path.exists(dataset_config[mode]):
            raise FileNotFoundError(f"Dataset directory not found for {mode}: {dataset_config[mode]}")
        mode_output_dir = get_output_dir(output_dir, mode)
        inputs = list_all_files(dataset_config[mode])

        optimize(
            fn=partial(_ultralytics_optimize_fn, img_quality=img_quality),
            inputs=inputs,
            output_dir=mode_output_dir.url or mode_output_dir.path or "optimized_data",
            chunk_bytes=chunk_bytes,
            chunk_size=chunk_size,
            num_workers=num_workers,
            mode="overwrite",
            verbose=verbose,
        )

        mode_to_dir[mode] = mode_output_dir
        print(f"Optimized {mode} dataset and saved to {mode_output_dir} ✅")

    # update the YAML file with the new paths
    for mode, dir in mode_to_dir.items():
        if mode in dataset_config:
            dataset_config[mode] = dir.url if dir.url else dir.path
        else:
            raise ValueError(f"Mode '{mode}' not found in dataset configuration.")

    # convert Path values to strings so the config serializes cleanly
    for key, value in dataset_config.items():
        if isinstance(value, Path):
            dataset_config[key] = str(value)

    # save the updated YAML file next to the original, with a "litdata_" prefix
    output_yaml = Path(yaml_path).with_name("litdata_" + Path(yaml_path).name)
    with open(output_yaml, "w") as f:
        yaml.dump(dataset_config, f)
```
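The rewritten YAML keeps the original configuration but points each split at its optimized location. A hypothetical before/after for a local run (the exact keys come from whatever `check_det_dataset` returns):

```yaml
# coco8.yaml (input, hypothetical)
train: datasets/coco8/images/train
val: datasets/coco8/images/val

# litdata_coco8.yaml (written next to the input)
train: /data/optimized_coco8/train  # or an s3:// URL when output_dir is remote
val: /data/optimized_coco8/val
```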
The per-split output location helper:

```python
def get_output_dir(output_dir: Dir, mode: str) -> Dir:
    if not isinstance(output_dir, Dir):
        raise TypeError(f"Expected output_dir to be of type Dir, got {type(output_dir)} instead.")
    url, path = output_dir.url, output_dir.path
    if url is not None:
        url = url.rstrip("/") + f"/{mode}"
    if path is not None:
        path = os.path.join(path, f"{mode}")

    return Dir(url=url, path=path)
```
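`get_output_dir` fans a single root `Dir` out into per-split locations by appending the split name to whichever of `url` and `path` is set. A small illustration with a hypothetical remote root:

```python
root = Dir(url="s3://my-bucket/yolo-optimized", path=None)  # hypothetical S3 root
train_dir = get_output_dir(root, "train")
# train_dir.url == "s3://my-bucket/yolo-optimized/train"
```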
And the input lister:

```python
def list_all_files(_path: str) -> list[str]:
    """Return every file referenced by `_path`, which is either a directory or a .txt manifest."""
    path = Path(_path)

    if path.is_dir():
        # Recursively list all files under the directory
        return [str(p) for p in path.rglob("*") if p.is_file()]

    if path.is_file() and path.suffix == ".txt":
        # Read lines and return cleaned-up absolute paths
        base_dir = path.parent  # resolve relative entries against the manifest's directory
        with open(path) as f:
            return [str((base_dir / line.strip()).resolve()) for line in f if line.strip()]

    raise ValueError(f"Unsupported path: {path}")
```
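`list_all_files` covers the two ways an Ultralytics split can be specified: a directory that is walked recursively, or a `.txt` manifest whose entries are resolved relative to the manifest's own directory. A hypothetical manifest exercising the second branch:

```python
# datasets/coco8/train.txt (hypothetical) contains relative entries such as:
#   ./images/train/0001.jpg
#   ./images/train/0002.jpg
files = list_all_files("datasets/coco8/train.txt")
# Each entry resolves against datasets/coco8/, yielding absolute image paths.
```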