[RHAIENG]-1146 Initial Repository Setup and Baseline Testing (only scripts) #30
base: main
Conversation
Signed-off-by: roburishabh <[email protected]>
Walkthrough

Adds a new subset_selection package with CLI, encoder registry and Arctic encoder, utilities, a configuration-driven core pipeline for embedding generation and facility-location subset selection, README, requirements, and gitignore; exposes the package-level API via __init__.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant CLI as CLI (cli.py)
    participant Orchestrator as subset_datasets
    participant Processor as DataProcessor
    participant Registry as Encoder Registry
    participant Encoder as ArcticEmbedEncoder
    participant Utils as Utils
    participant FS as Filesystem
    User->>CLI: run with inputs & options
    CLI->>Orchestrator: subset_datasets(input_files, subset_sizes, **kwargs)
    Orchestrator->>Processor: instantiate & configure
    Processor->>FS: load dataset(s)
    Processor->>Registry: get_encoder_class("arctic")
    Registry-->>Processor: ArcticEmbedEncoder
    Processor->>Encoder: init(model, device, fp16, testing)
    loop per shard
        Processor->>Encoder: encode(batch)
        Encoder-->>Processor: embeddings
        Processor->>FS: save shard embeddings
    end
    Processor->>FS: merge shard embeddings
    loop per fold / subset size
        Processor->>Utils: compute_pairwise_dense / selection
        Processor->>FS: save subset indices & metadata
    end
    Processor-->>Orchestrator: results
    Orchestrator-->>CLI: completion
    CLI-->>User: exit code / message
```

```mermaid
sequenceDiagram
    autonumber
    participant Caller
    participant Registry as Encoder Registry
    participant Arctic as ArcticEmbedEncoder
    Caller->>Registry: get_encoder_class(type)
    alt supported
        Registry-->>Caller: ArcticEmbedEncoder
        Caller->>Arctic: __init__(...)
        Arctic-->>Caller: encoder instance
    else unsupported
        Registry-->>Caller: ValueError
    end
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- scripts/subset_selection/.gitignore (1 hunks)
- scripts/subset_selection/README.md (1 hunks)
- scripts/subset_selection/__init__.py (1 hunks)
- scripts/subset_selection/cli.py (1 hunks)
- scripts/subset_selection/encoders/__init__.py (1 hunks)
- scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
- scripts/subset_selection/requirements.txt (1 hunks)
- scripts/subset_selection/subset_selection.py (1 hunks)
- scripts/subset_selection/utils/__init__.py (1 hunks)
- scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)

scripts/subset_selection/utils/__init__.py (1)
- scripts/subset_selection/utils/subset_selection_utils.py (3): `compute_pairwise_dense` (86-147), `get_default_num_gpus` (66-83), `retry_on_exception` (20-63)

scripts/subset_selection/cli.py (1)
- scripts/subset_selection/subset_selection.py (1): `subset_datasets` (864-923)

scripts/subset_selection/__init__.py (2)
- scripts/subset_selection/subset_selection.py (8): `BasicConfig` (39-78), `DataProcessor` (168-576), `EncoderConfig` (82-93), `ProcessingConfig` (127-165), `SystemConfig` (112-123), `TemplateConfig` (97-108), `get_supported_encoders` (853-861), `subset_datasets` (864-923)
- scripts/subset_selection/encoders/arctic_encoder.py (1): `EncoderConfig` (47-55)

scripts/subset_selection/encoders/__init__.py (1)
- scripts/subset_selection/encoders/arctic_encoder.py (1): `ArcticEmbedEncoder` (58-204)

scripts/subset_selection/encoders/arctic_encoder.py (1)
- scripts/subset_selection/subset_selection.py (1): `EncoderConfig` (82-93)

scripts/subset_selection/subset_selection.py (3)
- scripts/subset_selection/encoders/__init__.py (1): `get_encoder_class` (12-23)
- scripts/subset_selection/utils/subset_selection_utils.py (3): `compute_pairwise_dense` (86-147), `get_default_num_gpus` (66-83), `retry_on_exception` (20-63)
- scripts/subset_selection/encoders/arctic_encoder.py (2): `EncoderConfig` (47-55), `encode` (170-204)
🪛 markdownlint-cli2 (0.18.1)

scripts/subset_selection/README.md
- 64-64: Fenced code blocks should have a language specified (MD040, fenced-code-language)
- 168-168: Fenced code blocks should have a language specified (MD040, fenced-code-language)
🪛 Pylint (4.0.0)

scripts/subset_selection/utils/subset_selection_utils.py
- [refactor] 86-86: Too many positional arguments (7/5) (R0917)

scripts/subset_selection/encoders/arctic_encoder.py
- [refactor] 59-59: Too many positional arguments (6/5) (R0917)
- [refactor] 58-58: Too few public methods (1/2) (R0903)

scripts/subset_selection/subset_selection.py
- [refactor] 738-738: Too many branches (14/12) (R0912)
🪛 Ruff (0.14.0)

scripts/subset_selection/cli.py
- 1-1: Shebang is present but file is not executable (EXE001)
- 116-116: f-string without any placeholders; remove extraneous `f` prefix (F541)
- 147-147: Consider moving this statement to an `else` block (TRY300)
- 148-148: Do not catch blind exception: `Exception` (BLE001)

scripts/subset_selection/utils/subset_selection_utils.py
- 34-34: Use `logging.exception` instead of `logging.error` (TRY400)
- 34-34: Use explicit conversion flag (RUF010)
- 38-40: Use `logging.exception` instead of `logging.error` (TRY400)
- 39-39: Use explicit conversion flag (RUF010)
- 44-44: Use `logging.exception` instead of `logging.error` (TRY400)
- 44-44: Use explicit conversion flag (RUF010)
- 48-48: Use `logging.exception` instead of `logging.error` (TRY400)
- 48-48: Use explicit conversion flag (RUF010)
- 52-52: Use `logging.exception` instead of `logging.error` (TRY400)
- 52-52: Use explicit conversion flag (RUF010)
- 80-82: Avoid specifying long messages outside the exception class (TRY003)
- 128-128: Avoid specifying long messages outside the exception class (TRY003)

scripts/subset_selection/encoders/__init__.py
- 17-20: Abstract `raise` to an inner function (TRY301)
- 17-20: Avoid specifying long messages outside the exception class (TRY003)
- 23-23: Avoid specifying long messages outside the exception class (TRY003)
- 23-23: Use explicit conversion flag (RUF010)

scripts/subset_selection/encoders/arctic_encoder.py
- 69-71: Avoid specifying long messages outside the exception class (TRY003)
- 117-120: Avoid specifying long messages outside the exception class (TRY003)
- 148-151: Avoid specifying long messages outside the exception class (TRY003)
- 161-164: Avoid specifying long messages outside the exception class (TRY003)

scripts/subset_selection/subset_selection.py
- 59-59: Avoid specifying long messages outside the exception class (TRY003)
- 155-155: Prefer `TypeError` exception for invalid type (TRY004)
- 155-155: Avoid specifying long messages outside the exception class (TRY003)
- 159-159: Prefer `TypeError` exception for invalid type (TRY004)
- 159-159: Avoid specifying long messages outside the exception class (TRY003)
- 161-163: Avoid specifying long messages outside the exception class (TRY003)
- 165-165: Avoid specifying long messages outside the exception class (TRY003)
- 181-181: By default, jinja2 sets `autoescape` to `False`; consider `autoescape=True` or the `select_autoescape` function to mitigate XSS vulnerabilities (S701)
- 206-206: Avoid specifying long messages outside the exception class (TRY003)
- 235-237: Avoid specifying long messages outside the exception class (TRY003)
- 256-258: Avoid specifying long messages outside the exception class (TRY003)
- 345-345: Avoid specifying long messages outside the exception class (TRY003)
- 412-412: Loop control variable `fold_idx` not used within loop body; rename to `_fold_idx` (B007)
- 424-427: `zip()` without an explicit `strict=` parameter (B905)
- 494-494: Use `logging.exception` instead of `logging.error` (TRY400)
- 494-494: Use explicit conversion flag (RUF010)
- 558-558: Use `logging.exception` instead of `logging.error` (TRY400)
- 558-558: Use explicit conversion flag (RUF010)
- 614-614: By default, jinja2 sets `autoescape` to `False`; consider `autoescape=True` or the `select_autoescape` function to mitigate XSS vulnerabilities (S701)
- 640-640: Abstract `raise` to an inner function (TRY301)
- 640-640: Avoid specifying long messages outside the exception class (TRY003)
- 683-683: Consider moving this statement to an `else` block (TRY300)
- 688-688: Use `logging.exception` instead of `logging.error` (TRY400)
- 688-688: Use explicit conversion flag (RUF010)
- 762-762: Abstract `raise` to an inner function (TRY301)
- 762-762: Avoid specifying long messages outside the exception class (TRY003)
- 831-833: Use `logging.exception` instead of `logging.error` (TRY400)
- 832-832: Use explicit conversion flag (RUF010)
- 847-847: Consider moving this statement to an `else` block (TRY300)
- 849-849: Use `logging.exception` instead of `logging.error` (TRY400)
- 849-849: Use explicit conversion flag (RUF010)
- 916-916: Use `logging.exception` instead of `logging.error` (TRY400)
- 916-916: Use explicit conversion flag (RUF010)
```python
if isinstance(size_spec, float):
    # if not in range of 0 to 1, raise error
    if size_spec <= 0 or size_spec > 1:
        raise ValueError(
            "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
        )
    # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
    return max(1, int((size_spec) * total_samples))
# Treat as absolute number
return min(size_spec, total_samples)
```
Fix subset percentage normalization

`ProcessingConfig.__post_init__` explicitly allows float `subset_sizes` up to 100 to represent percentages, but `calculate_subset_size` then rejects anything greater than 1. A config such as `subset_sizes=[10.0]` sails through validation yet explodes here, so the CLI fails on a perfectly valid 10% request. Normalize values >1 to fractions (and keep the validation aligned) so we honor the documented percentage semantics and avoid the runtime error.
```diff
 if isinstance(size_spec, float):
-    # if not in range of 0 to 1, raise error
-    if size_spec <= 0 or size_spec > 1:
-        raise ValueError(
-            "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-        )
-    # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
-    return max(1, int((size_spec) * total_samples))
+    percentage = size_spec / 100 if size_spec > 1 else size_spec
+    if percentage <= 0 or percentage > 1:
+        raise ValueError(
+            "Percentage values must be between 0 (exclusive) and 100 (inclusive)"
+        )
+    return max(1, int(percentage * total_samples))
```
Make the same normalization when computing per-fold budgets so both code paths agree.
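As a sanity check, the proposed normalization can be sketched as a standalone function. The name `calculate_subset_size` and its signature are taken from the diff above; this is an illustration of the intended semantics, not the final implementation:

```python
def calculate_subset_size(size_spec, total_samples):
    """Sketch of the suggested normalization (names assumed from the review diff)."""
    if isinstance(size_spec, float):
        # Treat floats > 1 as percentages out of 100 (10.0 -> 0.10).
        percentage = size_spec / 100 if size_spec > 1 else size_spec
        if percentage <= 0 or percentage > 1:
            raise ValueError(
                "Percentage values must be between 0 (exclusive) and 100 (inclusive)"
            )
        return max(1, int(percentage * total_samples))
    # Integers are absolute counts, capped at the dataset size.
    return min(size_spec, total_samples)


print(calculate_subset_size(10.0, 200))  # 10% of 200 -> 20
print(calculate_subset_size(0.5, 200))   # 50% of 200 -> 100
print(calculate_subset_size(500, 200))   # absolute request capped -> 200
```

With this shape, `subset_sizes=[10.0]` and `subset_sizes=[0.1]` both resolve to a 10% budget, while out-of-range floats such as `150.0` still fail validation.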
🧰 Tools
🪛 Ruff (0.14.0): 256-258: Avoid specifying long messages outside the exception class (TRY003)
🤖 Prompt for AI Agents
In scripts/subset_selection/subset_selection.py around lines 253 to 263, float
size_spec values greater than 1 are currently rejected even though config allows
percentages expressed as >1 (e.g., 10.0 meaning 10%); adjust
calculate_subset_size to first normalize float percentage values >1 by dividing
by 100 (so 10.0 -> 0.10), then validate the resulting fraction is >0 and <=1 and
compute the sample count; also apply the same normalization logic in the
per-fold budget computation path so both code paths treat percentage semantics
consistently.
```python
for fold_idx, fold_indices in gpu_folds_info:
    try:
        logger.info(f"Processing fold {fold_idx + 1} on GPU {gpu_id}")

        fold_embeddings = embeddings[fold_indices].to(device)

        logger.info(f"Computing similarity matrix for fold {fold_idx + 1}")
        max_sim_mat = compute_pairwise_dense(
            fold_embeddings,
            batch_size=50000,
            metric="cosine",
            device=device,
            scaling="additive",
        )
        similarity_matrix = max_sim_mat.cpu().numpy()

        subsets = {}
        ds_func = FacilityLocationFunction(
            n=similarity_matrix.shape[0],
            sijs=similarity_matrix,
            mode="dense",
            separate_rep=False,
        )

        for size_spec in subset_sizes:
            if isinstance(size_spec, float):
                # Percentage-based selection
                budget = max(
                    1, math.ceil(size_spec * similarity_matrix.shape[0])
                )
            else:
                # Absolute number-based selection
                budget = max(
                    1,
                    math.ceil(
                        size_spec * (similarity_matrix.shape[0] / total_samples)
                    ),
                )

            logger.info(
                f"Selecting subset of size {budget} for fold {fold_idx + 1}"
            )

            subset_result = ds_func.maximize(
                budget=budget,
                optimizer="LazierThanLazyGreedy",
                epsilon=epsilon,
                stopIfZeroGain=False,
                stopIfNegativeGain=False,
                verbose=False,
            )

            subset_indices = [fold_indices[x[0]] for x in subset_result]
            subset_gains = [x[1] for x in subset_result]
            subsets[size_spec] = {
                "indices": subset_indices,
                "gains": subset_gains,
            }
```
Skip empty folds and cap per-fold budgets

When `len(embeddings) < num_folds` we manufacture empty folds. Likewise, requesting an absolute subset larger than the dataset (e.g. `subset_sizes=[500]` on a 200-row corpus) makes the fold-level budget exceed the fold cardinality. In both cases `FacilityLocationFunction.maximize` is asked to pick at least one item from a fold that has zero (or fewer than `budget`) candidates, and we crash. Guard the empty folds and clamp budgets to the fold size so the optimizer only runs on feasible sets.
```diff
-        for fold_idx, fold_indices in gpu_folds_info:
+        for fold_idx, fold_indices in gpu_folds_info:
+            fold_size = len(fold_indices)
+            if fold_size == 0:
+                logger.info(f"Skipping empty fold {fold_idx + 1} on GPU {gpu_id}")
+                continue
             try:
@@
-                for size_spec in subset_sizes:
+                for size_spec in subset_sizes:
                     if isinstance(size_spec, float):
-                        # Percentage-based selection
-                        budget = max(
-                            1, math.ceil(size_spec * similarity_matrix.shape[0])
-                        )
+                        percentage = size_spec / 100 if size_spec > 1 else size_spec
+                        budget = max(1, math.ceil(percentage * fold_size))
                     else:
                         # Absolute number-based selection
-                        budget = max(
-                            1,
-                            math.ceil(
-                                size_spec * (similarity_matrix.shape[0] / total_samples)
-                            ),
-                        )
+                        budget = max(
+                            1,
+                            math.ceil(size_spec * (fold_size / total_samples)),
+                        )
+                    budget = min(budget, fold_size)
+                    if budget == 0:
+                        continue
```
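The budget logic proposed above can be isolated into a small helper to reason about the edge cases. The function name `fold_budget` is hypothetical; the arithmetic mirrors the suggested diff:

```python
import math


def fold_budget(size_spec, fold_size, total_samples):
    """Hypothetical helper: per-fold budget with the suggested clamping."""
    if fold_size == 0:
        return 0  # caller should skip empty folds entirely
    if isinstance(size_spec, float):
        # Floats > 1 read as percentages out of 100, matching the other code path.
        percentage = size_spec / 100 if size_spec > 1 else size_spec
        budget = max(1, math.ceil(percentage * fold_size))
    else:
        # Absolute counts are spread across folds proportionally.
        budget = max(1, math.ceil(size_spec * (fold_size / total_samples)))
    # Never ask the optimizer for more items than the fold holds.
    return min(budget, fold_size)


print(fold_budget(0.5, 40, 200))    # 50% of a 40-item fold
print(fold_budget(500, 40, 200))    # absolute request larger than the dataset
print(fold_budget(0.5, 0, 200))     # empty fold
```

An absolute request of 500 on a 200-row corpus would otherwise yield a fold budget of 100 for a 40-item fold; the final `min` clamps it to 40, keeping `maximize` feasible.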
```python
distance = torch.cdist(a, b)
squared_distance = distance**2
avg_dist = torch.mean(squared_distance)
torch.div(squared_distance, kw * avg_dist, out=squared_distance)
torch.exp(-squared_distance, out=squared_distance)
return squared_distance
```
Clamp the RBF scaling factor to avoid NaNs

When all embeddings in a batch are identical (or nearly so), `squared_distance` becomes 0 everywhere, making `avg_dist` zero. Dividing by `kw * avg_dist` (line 126) therefore performs a 0/0, yielding NaNs that propagate through the similarity matrix and break downstream selection. Clamp the scale away from zero before dividing.
```diff
 if metric == "rbf":
     distance = torch.cdist(a, b)
     squared_distance = distance**2
-    avg_dist = torch.mean(squared_distance)
-    torch.div(squared_distance, kw * avg_dist, out=squared_distance)
+    avg_dist = torch.mean(squared_distance)
+    eps = torch.finfo(squared_distance.dtype).eps
+    scale = kw * torch.clamp(avg_dist, min=eps)
+    torch.div(squared_distance, scale, out=squared_distance)
     torch.exp(-squared_distance, out=squared_distance)
     return squared_distance
```
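The failure mode and the fix can be demonstrated without torch using plain floats (`rbf_scale` is a hypothetical stand-in for the `kw * avg_dist` scale; `sys.float_info.epsilon` plays the role of `torch.finfo(dtype).eps`):

```python
import math
import sys


def rbf_scale(avg_dist, kw=0.1):
    """Clamp the mean squared distance away from zero before dividing (sketch)."""
    eps = sys.float_info.epsilon  # analogous to torch.finfo(dtype).eps
    return kw * max(avg_dist, eps)


# With identical embeddings, avg_dist == 0.0, so the unclamped scale is 0.
unclamped = 0.1 * 0.0
# torch's elementwise 0/0 yields NaN (plain Python would raise), so emulate it here.
naive = 0.0 / unclamped if unclamped else float("nan")
# The clamped scale is tiny but positive, so 0 / scale stays 0.
clamped = 0.0 / rbf_scale(0.0)
print(math.isnan(naive), clamped, math.exp(-clamped))
```

After clamping, `exp(-0.0)` gives a similarity of exactly 1.0 for identical points, which is the value the RBF kernel should produce there anyway.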
🤖 Prompt for AI Agents
In scripts/subset_selection/utils/subset_selection_utils.py around lines 122 to
127, the RBF scaling divides by kw * avg_dist which can be zero for identical
embeddings causing 0/0 NaNs; clamp avg_dist away from zero before dividing (e.g.
compute eps = torch.finfo(squared_distance.dtype).eps * squared_distance.numel()
or a small constant like 1e-12 on the same device/dtype, then use
avg_dist_clamped = torch.clamp_min(avg_dist, eps) or max(avg_dist, eps) and
divide by kw * avg_dist_clamped) so the division cannot produce NaN and retains
device/dtype consistency.
…s.txt Signed-off-by: EC2 Default User <[email protected]>
Actionable comments posted: 4
🧹 Nitpick comments (4)
scripts/subset_selection/README.md (3)

103-103: Add language specifier to code block. The code block should specify a language for proper syntax highlighting and to comply with Markdown best practices. Apply this diff:

````diff
-```
+```text
 Required:
   --input <file> [<file> ...]    Input file(s) to process (JSONL, JSON, CSV, Parquet)
   --subset-sizes <sizes>         Comma-separated sizes (e.g., "0.1,0.5" or "1000,5000")
````

Based on static analysis hints.

207-207: Add language specifier to code block. The code block should specify a language for proper syntax highlighting. Apply this diff:

````diff
-```
+```text
 scripts/
 ├── __init__.py          # Top-level package initialization
 └── subset_selection/
````

Based on static analysis hints.

270-270: Add language specifier to code block. The code block should specify a language for proper syntax highlighting. Apply this diff:

````diff
-```
+```text
 RuntimeError: Cannot re-initialize CUDA in forked subprocess
````

Based on static analysis hints.

scripts/subset_selection/cli.py (1)

124-124: Remove unnecessary f-string prefix. The string has no placeholders, so the `f` prefix is unnecessary. Apply this diff:

```diff
-    print(f"Starting subset selection...")
+    print("Starting subset selection...")
```

Based on static analysis hints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- scripts/subset_selection/README.md (1 hunks)
- scripts/subset_selection/cli.py (1 hunks)
- scripts/subset_selection/requirements.txt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)

scripts/subset_selection/cli.py (1)
- scripts/subset_selection/subset_selection.py (1): `subset_datasets` (864-923)
🪛 markdownlint-cli2 (0.18.1)

scripts/subset_selection/README.md
- 103-103: Fenced code blocks should have a language specified (MD040, fenced-code-language)
- 207-207: Fenced code blocks should have a language specified (MD040, fenced-code-language)
- 270-270: Fenced code blocks should have a language specified (MD040, fenced-code-language)
🪛 OSV Scanner (2.2.3)

scripts/subset_selection/requirements.txt
- [CRITICAL] 1-1: torch 2.5.1+cu121: undefined (PYSEC-2025-41)
- [CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch susceptible to local Denial of Service
- [CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch `torch.load` with `weights_only=True` leads to remote code execution
- [CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch Improper Resource Shutdown or Release vulnerability
🪛 Ruff (0.14.0)

scripts/subset_selection/cli.py
- 1-1: Shebang is present but file is not executable (EXE001)
- 124-124: f-string without any placeholders; remove extraneous `f` prefix (F541)
- 155-155: Consider moving this statement to an `else` block (TRY300)
- 156-156: Do not catch blind exception: `Exception` (BLE001)
🔇 Additional comments (10)

scripts/subset_selection/README.md (7)
- 1-58: LGTM! Clear installation and model setup documentation. The overview, installation instructions, and model setup options are well-documented and provide clear guidance for users.
- 59-100: LGTM! Comprehensive CLI usage examples. The examples cover a wide range of scenarios and provide clear guidance for different use cases.
- 122-161: LGTM! Clear Python API documentation. The Python API examples are well-structured and demonstrate both basic and advanced usage patterns.
- 162-204: LGTM! Thorough configuration documentation. The configuration section provides detailed parameter descriptions with helpful recommendations based on dataset size.
- 224-243: LGTM! Clear documentation of encoders and outputs. The supported encoders and output files sections are concise and informative.
- 280-312: LGTM! Comprehensive troubleshooting guidance. The troubleshooting section provides clear solutions for common issues users may encounter.
- 313-328: LGTM! Helpful notes for users. The notes section provides important context about dataset size recommendations, GPU requirements, and performance considerations.

scripts/subset_selection/cli.py (3)
- 13-101: LGTM! Well-structured argument parsing. The argument parsing is comprehensive with clear help text and appropriate defaults for all parameters.
- 103-122: LGTM! Proper multiprocessing setup and argument parsing. The multiprocessing setup for CUDA compatibility is appropriate, and the subset size parsing correctly handles both percentage and absolute count formats.
- 147-162: LGTM! Appropriate exception handling for CLI entry point. The broad exception handling is acceptable for a CLI entry point, as it ensures user-friendly error messages are displayed regardless of the error type. The function properly returns exit codes for success (0) and failure (1).
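The subset-size parsing praised in the 103-122 comment (accepting `"0.1,0.5"` as fractions and `"1000,5000"` as absolute counts) can be sketched roughly as follows. The function name `parse_subset_sizes` and the "float if the token contains a dot" rule are assumptions about the CLI's behavior, not a copy of it:

```python
def parse_subset_sizes(spec):
    """Hypothetical sketch: tokens with a decimal point become floats
    (fractions/percentages); bare integers become absolute counts."""
    sizes = []
    for token in spec.split(","):
        token = token.strip()
        sizes.append(float(token) if "." in token else int(token))
    return sizes


print(parse_subset_sizes("0.1,0.5"))    # mixed-fraction spec
print(parse_subset_sizes("1000,5000"))  # absolute-count spec
```

Keeping the float/int distinction here matters because `calculate_subset_size` downstream dispatches on `isinstance(size_spec, float)` to decide between percentage and absolute semantics.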
```diff
@@ -0,0 +1,162 @@
+#!/usr/bin/env python3
```
🧩 Analysis chain
Verify file permissions for executable script.
The shebang is present, but the file might not have executable permissions. If this file is intended to be run directly, set the executable bit.
Run the following script to check and optionally fix file permissions:
Based on static analysis hints.
🏁 Script executed:

```bash
#!/bin/bash
# Check if the file is executable
if [ -x scripts/subset_selection/cli.py ]; then
  echo "File is executable"
else
  echo "File is NOT executable. To make it executable, run:"
  echo "chmod +x scripts/subset_selection/cli.py"
fi
```
Length of output: 92
Make `scripts/subset_selection/cli.py` executable. The file lacks the executable bit; run `chmod +x scripts/subset_selection/cli.py` to allow direct execution via its shebang.
🧰 Tools
🪛 Ruff (0.14.0): 1-1: Shebang is present but file is not executable (EXE001)
🤖 Prompt for AI Agents
In scripts/subset_selection/cli.py around line 1, the script has a shebang but
is missing the executable permission; make the file executable by running chmod
+x scripts/subset_selection/cli.py (or set the executable bit in your VCS/CI),
and verify the shebang (#!/usr/bin/env python3) is correct so the script can be
run directly.
## Quick Start Example

Using your data file:

```bash
# Navigate to project root
cd /Users/roburishabh/Github/odh-data-processing

# Run subset selection - Select 10% and 50% subsets
python -m scripts.subset_selection.cli \
    --input scripts/subset_selection/data/combined_cut_50x.jsonl \
    --subset-sizes "0.1,0.5" \
    --output-dir scripts/subset_selection/data/output \
    --epsilon 0.1 \
    --num-folds 10

# Check results
ls scripts/subset_selection/data/output/
```
Use a relative path instead of an absolute developer-specific path.

Line 251 contains an absolute path specific to a developer's machine, which could confuse users. Apply this diff:

```diff
 # Navigate to project root
-cd /Users/roburishabh/Github/odh-data-processing
+cd odh-data-processing

 # Run subset selection - Select 10% and 50% subsets
 python -m scripts.subset_selection.cli \
```
🤖 Prompt for AI Agents
In scripts/subset_selection/README.md around lines 244 to 264, the Quick Start
uses an absolute, developer-specific path; change the example to use a relative
path (e.g., "cd odh-data-processing" or simply "cd <project-root>") so the
instructions work for other users and do not expose a local filesystem location;
update the README example accordingly and ensure the rest of the commands remain
unchanged.
```text
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.9.86
nvidia-nvtx-cu12==12.1.105
packaging==25.0
pandas==2.3.3
pillow==11.3.0
```
GPU-only wheels make local CPU setup impossible

Pinning the `+cu121` torch stack plus the `nvidia-*` CUDA wheels forces Linux + CUDA 12.1; pip will fail outright on CPU-only laptops, contradicting the stated "tested locally without GPU". Provide a CPU-friendly requirements set (e.g., `torch==...` without `+cu121`, gate CUDA deps behind an extra, or split into cpu/gpu files) so contributors can install and run the scripts on non-GPU environments.
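One possible shape for the split-file option is sketched below. The file names and the extra-index URL are assumptions; the pins shown for the GPU file mirror the ones already in this PR, and the CPU file deliberately uses the plain PyPI wheel with no local version tag:

```text
# requirements.txt — CPU-friendly default (plain PyPI wheels)
torch==2.5.1
transformers==4.57.0
tqdm==4.67.1

# requirements-gpu.txt — opt-in CUDA 12.1 stack
--extra-index-url https://download.pytorch.org/whl/cu121
torch==2.5.1+cu121
nvidia-cudnn-cu12==9.1.0.70
nvidia-nccl-cu12==2.21.5
```

Contributors on laptops would run `pip install -r requirements.txt`, while GPU machines add `-r requirements-gpu.txt`; the CUDA-specific pins then never block a CPU-only install.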
🤖 Prompt for AI Agents
In scripts/subset_selection/requirements.txt around lines 38 to 52, the current
file pins CUDA-specific nvidia wheel packages and a +cu121 torch build which
prevents installation on CPU-only machines; update the requirements so
contributors without GPUs can install: remove or make optional the nvidia-*
CUDA-specific packages and avoid a +cu121 torch pin (use a generic torch version
or leave torch unpinned), and provide one of the following fixes — either split
into two files (requirements.txt for CPU-friendly deps and requirements-gpu.txt
for CUDA deps), or gate CUDA packages behind an extras marker (e.g., [gpu]) and
document installing with pip install -r requirements.txt (and optionally pip
install -r requirements-gpu.txt) or pip install .[gpu]; ensure the default
requirements.txt installs and runs on CPU-only systems.
```text
torch==2.5.1+cu121
torchaudio==2.5.1+cu121
torchvision==0.20.1+cu121
tqdm==4.67.1
transformers==4.57.0
triton==3.1.0
typing_extensions==4.15.0
```
Blocker: torch 2.5.1+cu121 is under multiple critical CVEs
OSV flags this exact build for GHSA-3749-ghw9-m3mg, GHSA-53q9-r3pm-6pq6, GHSA-887c-mr87-cxwp (see static analysis hints). Until the project bumps to a patched PyTorch release (and aligns torchaudio/torchvision accordingly), we ship with RCE/DoS exposure. Please update to a non-vulnerable version or document a mitigation before merging.
🤖 Prompt for AI Agents
In scripts/subset_selection/requirements.txt around lines 70 to 76, the pinned
torch==2.5.1+cu121 (and matching torchaudio/torchvision) is flagged by multiple
CVEs; update torch to a patched, non-vulnerable release (choose the minimum safe
version per OSV/GHSA advisories), and align torchaudio and torchvision to the
matching compatible builds (CUDA tags if needed), update the lock/build/test
config accordingly, run full CI and smoke tests to verify compatibility, and if
you cannot upgrade now add a clear mitigation note in the requirements or repo
security doc explaining the temporary risk and any runtime/workaround steps
until upgraded.
Description

https://issues.redhat.com/browse/RHAIENG-1146

This PR contains subset-selection scripts (`.py` files only).

How Has This Been Tested?
Merge criteria:
Summary by CodeRabbit
New Features
Documentation
Chores