
Conversation

@RobuRishabh (Contributor) commented Oct 14, 2025

Description

https://issues.redhat.com/browse/RHAIENG-1146
This PR contains only the subset selection scripts (.py files).

How Has This Been Tested?

  • Tested in a local environment on CPU only (no GPU).
  • Tested on AWS with a GPU.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

  • New Features

    • Added a subset selection tool with CLI and Python API, multi-file inputs, percentage/absolute subset sizes, testing mode, and progress output.
    • Embedding-based selection with facility-location optimization, multi-GPU support, and an Arctic encoder option.
    • Produces embeddings, metadata, and subsets in common formats.
  • Documentation

    • Added a comprehensive README with installation, quick start, CLI examples, API usage, configuration, and troubleshooting.
  • Chores

    • Added project-specific ignore rules and a requirements file for the workflow.

Signed-off-by: roburishabh <[email protected]>

coderabbitai bot commented Oct 14, 2025

Walkthrough

Adds a new subset_selection package: a CLI, an encoder registry with an Arctic encoder, utilities, and a configuration-driven core pipeline for embedding generation and facility-location subset selection, plus a README, requirements, and .gitignore. The package-level API is exposed via __init__.py.
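For orientation, the exposed entry point can be driven directly from Python. The following is a minimal sketch only; every keyword beyond `input_files` and `subset_sizes` (both confirmed by the sequence diagram below) is an assumption inferred from the CLI flags quoted later in this review:

```python
from scripts.subset_selection import subset_datasets

# Hypothetical invocation: select a 10% subset and a 1000-sample subset.
# output_dir mirrors the CLI's --output-dir flag and is assumed, not verified.
subset_datasets(
    input_files=["data/train.jsonl"],
    subset_sizes=[0.1, 1000],
    output_dir="data/output",
)
```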

Changes

| Cohort / File(s) | Summary of Changes |
| --- | --- |
| **Docs & Project Files**<br>`scripts/subset_selection/README.md`, `scripts/subset_selection/.gitignore`, `scripts/subset_selection/requirements.txt` | Added README describing workflow, usage, config, and troubleshooting; added .gitignore for local/data/dev artifacts; added pinned requirements for ML, data, and tooling dependencies. |
| **Package API Surface**<br>`scripts/subset_selection/__init__.py` | Re-exports public symbols (BasicConfig, DataProcessor, EncoderConfig, ProcessingConfig, SystemConfig, TemplateConfig, get_supported_encoders, subset_datasets) and defines `__all__`. |
| **CLI**<br>`scripts/subset_selection/cli.py` | New argparse-based CLI with parse_args() and main() that parses inputs/options, configures multiprocessing for CUDA, and invokes subset_datasets. |
| **Encoder Registry**<br>`scripts/subset_selection/encoders/__init__.py` | Introduces the ENCODER_REGISTRY mapping and get_encoder_class(encoder_type) with error handling; registers "arctic". |
| **Arctic Encoder**<br>`scripts/subset_selection/encoders/arctic_encoder.py` | New ArcticEmbedEncoder, EncoderConfig dataclass, and ModelConfig TypedDict; model loading (testing/local), device/FP16 handling, batched tokenization/encoding, normalization, and cleanup utilities. |
| **Core Pipeline**<br>`scripts/subset_selection/subset_selection.py` | Adds config dataclasses (BasicConfig, EncoderConfig, TemplateConfig, SystemConfig, ProcessingConfig), DataProcessor, shard-based embedding generation and saving, fold-based facility-location subset selection, templating, merging, and the subset_datasets orchestrator. |
| **Utilities**<br>`scripts/subset_selection/utils/__init__.py`, `scripts/subset_selection/utils/subset_selection_utils.py` | New utilities: retry_on_exception decorator, get_default_num_gpus, compute_pairwise_dense (cosine/dot/euclidean/rbf, batching, scaling); the utils package re-exports key helpers (see the sketch after this table). |
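The utilities cohort can be exercised in isolation. A minimal sketch based on the call sites quoted later in this review; the keyword names are taken from those snippets, and the exact signature is an assumption rather than verified against the PR:

```python
import torch

from scripts.subset_selection.utils import compute_pairwise_dense

# Pairwise cosine similarity with additive scaling over random embeddings,
# mirroring how the fold-selection code quoted below calls this helper.
embeddings = torch.randn(100, 768)
similarity = compute_pairwise_dense(
    embeddings,
    batch_size=50_000,
    metric="cosine",
    device="cpu",
    scaling="additive",
)
print(similarity.shape)  # expected: torch.Size([100, 100])
```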

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor User
  participant CLI as CLI (cli.py)
  participant Orchestrator as subset_datasets
  participant Processor as DataProcessor
  participant Registry as Encoder Registry
  participant Encoder as ArcticEmbedEncoder
  participant Utils as Utils
  participant FS as Filesystem

  User->>CLI: run with inputs & options
  CLI->>Orchestrator: subset_datasets(input_files, subset_sizes, **kwargs)
  Orchestrator->>Processor: instantiate & configure
  Processor->>FS: load dataset(s)
  Processor->>Registry: get_encoder_class("arctic")
  Registry-->>Processor: ArcticEmbedEncoder
  Processor->>Encoder: init(model, device, fp16, testing)
  loop per shard
    Processor->>Encoder: encode(batch)
    Encoder-->>Processor: embeddings
    Processor->>FS: save shard embeddings
  end
  Processor->>FS: merge shard embeddings
  loop per fold / subset size
    Processor->>Utils: compute_pairwise_dense / selection
    Processor->>FS: save subset indices & metadata
  end
  Processor-->>Orchestrator: results
  Orchestrator-->>CLI: completion
  CLI-->>User: exit code / message
```
```mermaid
sequenceDiagram
  autonumber
  participant Caller
  participant Registry as Encoder Registry
  participant Arctic as ArcticEmbedEncoder

  Caller->>Registry: get_encoder_class(type)
  alt supported
    Registry-->>Caller: ArcticEmbedEncoder
    Caller->>Arctic: __init__(...)
    Arctic-->>Caller: encoder instance
  else unsupported
    Registry-->>Caller: ValueError
  end
```

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

I thump through files and GPU glow,
I cup embeddings row by row.
Fold by fold I pick with care,
Arctic winds comb data hair.
A rabbit's hop, a tidy chest—subset found, now time to rest. 🐰✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title Check | ⚠️ Warning | The current title "[RHAIENG]-1146 Initial Repository Setup and Baseline Testing (only scripts)" includes a ticket identifier and a vague reference to baseline testing but fails to clearly describe the primary change of adding a new subset selection scripts package with CLI, encoders, and utilities. It does not convey the core functionality introduced by the PR. | Please rename the pull request to a concise, descriptive title that highlights the main change, for example "Add subset_selection scripts package with CLI, encoders, and utilities." |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%. |

@RobuRishabh RobuRishabh changed the title R1146 only scripts for subset selection [RHAIENG]-1146 Initial Repository Setup and Baseline Testing (only scripts) Oct 14, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14cd429 and 0300fc6.

📒 Files selected for processing (10)
  • scripts/subset_selection/.gitignore (1 hunks)
  • scripts/subset_selection/README.md (1 hunks)
  • scripts/subset_selection/__init__.py (1 hunks)
  • scripts/subset_selection/cli.py (1 hunks)
  • scripts/subset_selection/encoders/__init__.py (1 hunks)
  • scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
  • scripts/subset_selection/requirements.txt (1 hunks)
  • scripts/subset_selection/subset_selection.py (1 hunks)
  • scripts/subset_selection/utils/__init__.py (1 hunks)
  • scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
scripts/subset_selection/utils/__init__.py (1)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (86-147)
  • get_default_num_gpus (66-83)
  • retry_on_exception (20-63)
scripts/subset_selection/cli.py (1)
scripts/subset_selection/subset_selection.py (1)
  • subset_datasets (864-923)
scripts/subset_selection/__init__.py (2)
scripts/subset_selection/subset_selection.py (8)
  • BasicConfig (39-78)
  • DataProcessor (168-576)
  • EncoderConfig (82-93)
  • ProcessingConfig (127-165)
  • SystemConfig (112-123)
  • TemplateConfig (97-108)
  • get_supported_encoders (853-861)
  • subset_datasets (864-923)
scripts/subset_selection/encoders/arctic_encoder.py (1)
  • EncoderConfig (47-55)
scripts/subset_selection/encoders/__init__.py (1)
scripts/subset_selection/encoders/arctic_encoder.py (1)
  • ArcticEmbedEncoder (58-204)
scripts/subset_selection/encoders/arctic_encoder.py (1)
scripts/subset_selection/subset_selection.py (1)
  • EncoderConfig (82-93)
scripts/subset_selection/subset_selection.py (3)
scripts/subset_selection/encoders/__init__.py (1)
  • get_encoder_class (12-23)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (86-147)
  • get_default_num_gpus (66-83)
  • retry_on_exception (20-63)
scripts/subset_selection/encoders/arctic_encoder.py (2)
  • EncoderConfig (47-55)
  • encode (170-204)
🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md

64-64: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


168-168: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Pylint (4.0.0)
scripts/subset_selection/utils/subset_selection_utils.py

[refactor] 86-86: Too many positional arguments (7/5)

(R0917)

scripts/subset_selection/encoders/arctic_encoder.py

[refactor] 59-59: Too many positional arguments (6/5)

(R0917)


[refactor] 58-58: Too few public methods (1/2)

(R0903)

scripts/subset_selection/subset_selection.py

[refactor] 738-738: Too many branches (14/12)

(R0912)

🪛 Ruff (0.14.0)
scripts/subset_selection/cli.py

1-1: Shebang is present but file is not executable

(EXE001)


116-116: f-string without any placeholders

Remove extraneous f prefix

(F541)


147-147: Consider moving this statement to an else block

(TRY300)


148-148: Do not catch blind exception: Exception

(BLE001)

scripts/subset_selection/utils/subset_selection_utils.py

34-34: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


34-34: Use explicit conversion flag

Replace with conversion flag

(RUF010)


38-40: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


39-39: Use explicit conversion flag

Replace with conversion flag

(RUF010)


44-44: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


44-44: Use explicit conversion flag

Replace with conversion flag

(RUF010)


48-48: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


48-48: Use explicit conversion flag

Replace with conversion flag

(RUF010)


52-52: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


52-52: Use explicit conversion flag

Replace with conversion flag

(RUF010)


80-82: Avoid specifying long messages outside the exception class

(TRY003)


128-128: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/encoders/__init__.py

17-20: Abstract raise to an inner function

(TRY301)


17-20: Avoid specifying long messages outside the exception class

(TRY003)


23-23: Avoid specifying long messages outside the exception class

(TRY003)


23-23: Use explicit conversion flag

Replace with conversion flag

(RUF010)

scripts/subset_selection/encoders/arctic_encoder.py

69-71: Avoid specifying long messages outside the exception class

(TRY003)


117-120: Avoid specifying long messages outside the exception class

(TRY003)


148-151: Avoid specifying long messages outside the exception class

(TRY003)


161-164: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/subset_selection.py

59-59: Avoid specifying long messages outside the exception class

(TRY003)


155-155: Prefer TypeError exception for invalid type

(TRY004)


155-155: Avoid specifying long messages outside the exception class

(TRY003)


159-159: Prefer TypeError exception for invalid type

(TRY004)


159-159: Avoid specifying long messages outside the exception class

(TRY003)


161-163: Avoid specifying long messages outside the exception class

(TRY003)


165-165: Avoid specifying long messages outside the exception class

(TRY003)


181-181: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


206-206: Avoid specifying long messages outside the exception class

(TRY003)


235-237: Avoid specifying long messages outside the exception class

(TRY003)


256-258: Avoid specifying long messages outside the exception class

(TRY003)


345-345: Avoid specifying long messages outside the exception class

(TRY003)


412-412: Loop control variable fold_idx not used within loop body

Rename unused fold_idx to _fold_idx

(B007)


424-427: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


494-494: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


494-494: Use explicit conversion flag

Replace with conversion flag

(RUF010)


558-558: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


558-558: Use explicit conversion flag

Replace with conversion flag

(RUF010)


614-614: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


640-640: Abstract raise to an inner function

(TRY301)


640-640: Avoid specifying long messages outside the exception class

(TRY003)


683-683: Consider moving this statement to an else block

(TRY300)


688-688: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


688-688: Use explicit conversion flag

Replace with conversion flag

(RUF010)


762-762: Abstract raise to an inner function

(TRY301)


762-762: Avoid specifying long messages outside the exception class

(TRY003)


831-833: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


832-832: Use explicit conversion flag

Replace with conversion flag

(RUF010)


847-847: Consider moving this statement to an else block

(TRY300)


849-849: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


849-849: Use explicit conversion flag

Replace with conversion flag

(RUF010)


916-916: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


916-916: Use explicit conversion flag

Replace with conversion flag

(RUF010)

Comment on lines +253 to +263
```python
        if isinstance(size_spec, float):
            # if not in range of 0 to 1, raise error
            if size_spec <= 0 or size_spec > 1:
                raise ValueError(
                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
                )
            # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
            return max(1, int((size_spec) * total_samples))
        # Treat as absolute number
        return min(size_spec, total_samples)
```


⚠️ Potential issue | 🔴 Critical

Fix subset percentage normalization

ProcessingConfig.__post_init__ explicitly allows float subset_sizes up to 100 to represent percentages, but calculate_subset_size then rejects anything greater than 1. A config such as subset_sizes=[10.0] sails through validation yet explodes here, so the CLI fails on a perfectly valid 10% request. Normalize values >1 to fractions (and keep the validation aligned) so we honor the documented percentage semantics and avoid the runtime error.

```diff
         if isinstance(size_spec, float):
-            # if not in range of 0 to 1, raise error
-            if size_spec <= 0 or size_spec > 1:
-                raise ValueError(
-                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-                )
-            # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
-            return max(1, int((size_spec) * total_samples))
+            percentage = size_spec / 100 if size_spec > 1 else size_spec
+            if percentage <= 0 or percentage > 1:
+                raise ValueError(
+                    "Percentage values must be between 0 (exclusive) and 100 (inclusive)"
+                )
+            return max(1, int(percentage * total_samples))
```

Make the same normalization when computing per-fold budgets so both code paths agree.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```python
        if isinstance(size_spec, float):
            percentage = size_spec / 100 if size_spec > 1 else size_spec
            if percentage <= 0 or percentage > 1:
                raise ValueError(
                    "Percentage values must be between 0 (exclusive) and 100 (inclusive)"
                )
            return max(1, int(percentage * total_samples))
        # Treat as absolute number
        return min(size_spec, total_samples)
```
🧰 Tools
🪛 Ruff (0.14.0)

256-258: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In scripts/subset_selection/subset_selection.py around lines 253 to 263, float
size_spec values greater than 1 are currently rejected even though config allows
percentages expressed as >1 (e.g., 10.0 meaning 10%); adjust
calculate_subset_size to first normalize float percentage values >1 by dividing
by 100 (so 10.0 -> 0.10), then validate the resulting fraction is >0 and <=1 and
compute the sample count; also apply the same normalization logic in the
per-fold budget computation path so both code paths treat percentage semantics
consistently.
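To make the proposed semantics concrete, here is a standalone, illustrative version of the normalized helper (the helper name follows the prompt above; the assertions are worked examples, not tests from the PR):

```python
def calculate_subset_size(size_spec, total_samples):
    """Illustrative sketch of the suggested normalization, not the PR's code."""
    if isinstance(size_spec, float):
        # Normalize percentages given as >1 (e.g. 10.0 -> 0.10).
        percentage = size_spec / 100 if size_spec > 1 else size_spec
        if percentage <= 0 or percentage > 1:
            raise ValueError(
                "Percentage values must be between 0 (exclusive) and 100 (inclusive)"
            )
        return max(1, int(percentage * total_samples))
    # Absolute counts are capped at the dataset size.
    return min(size_spec, total_samples)

assert calculate_subset_size(10.0, 200) == 20   # 10% written as 10.0
assert calculate_subset_size(0.1, 200) == 20    # 10% written as 0.1
assert calculate_subset_size(500, 200) == 200   # oversized absolute request
```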

Comment on lines +769 to +827
```python
        for fold_idx, fold_indices in gpu_folds_info:
            try:
                logger.info(f"Processing fold {fold_idx + 1} on GPU {gpu_id}")

                fold_embeddings = embeddings[fold_indices].to(device)

                logger.info(f"Computing similarity matrix for fold {fold_idx + 1}")
                max_sim_mat = compute_pairwise_dense(
                    fold_embeddings,
                    batch_size=50000,
                    metric="cosine",
                    device=device,
                    scaling="additive",
                )
                similarity_matrix = max_sim_mat.cpu().numpy()

                subsets = {}
                ds_func = FacilityLocationFunction(
                    n=similarity_matrix.shape[0],
                    sijs=similarity_matrix,
                    mode="dense",
                    separate_rep=False,
                )

                for size_spec in subset_sizes:
                    if isinstance(size_spec, float):
                        # Percentage-based selection
                        budget = max(
                            1, math.ceil(size_spec * similarity_matrix.shape[0])
                        )
                    else:
                        # Absolute number-based selection
                        budget = max(
                            1,
                            math.ceil(
                                size_spec * (similarity_matrix.shape[0] / total_samples)
                            ),
                        )

                    logger.info(
                        f"Selecting subset of size {budget} for fold {fold_idx + 1}"
                    )

                    subset_result = ds_func.maximize(
                        budget=budget,
                        optimizer="LazierThanLazyGreedy",
                        epsilon=epsilon,
                        stopIfZeroGain=False,
                        stopIfNegativeGain=False,
                        verbose=False,
                    )

                    subset_indices = [fold_indices[x[0]] for x in subset_result]
                    subset_gains = [x[1] for x in subset_result]
                    subsets[size_spec] = {
                        "indices": subset_indices,
                        "gains": subset_gains,
                    }
```


⚠️ Potential issue | 🔴 Critical

Skip empty folds and cap per-fold budgets

When len(embeddings) < num_folds we manufacture empty folds. Likewise, requesting an absolute subset larger than the dataset (e.g. subset_sizes=[500] on a 200-row corpus) makes the fold-level budget exceed the fold cardinality. In both cases FacilityLocationFunction.maximize is asked to pick at least one item from a fold that has zero (or fewer than budget) candidates, and we crash. Guard the empty folds and clamp budgets to the fold size so the optimizer only runs on feasible sets.

```diff
-        for fold_idx, fold_indices in gpu_folds_info:
+        for fold_idx, fold_indices in gpu_folds_info:
+            fold_size = len(fold_indices)
+            if fold_size == 0:
+                logger.info(f"Skipping empty fold {fold_idx + 1} on GPU {gpu_id}")
+                continue
             try:
@@
-                for size_spec in subset_sizes:
+                for size_spec in subset_sizes:
                     if isinstance(size_spec, float):
-                        # Percentage-based selection
-                        budget = max(
-                            1, math.ceil(size_spec * similarity_matrix.shape[0])
-                        )
+                        percentage = size_spec / 100 if size_spec > 1 else size_spec
+                        budget = max(1, math.ceil(percentage * fold_size))
                     else:
                         # Absolute number-based selection
-                        budget = max(
-                            1,
-                            math.ceil(
-                                size_spec * (similarity_matrix.shape[0] / total_samples)
-                            ),
-                        )
+                        budget = max(
+                            1,
+                            math.ceil(size_spec * (fold_size / total_samples)),
+                        )
+                    budget = min(budget, fold_size)
+                    if budget == 0:
+                        continue
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```python
        for fold_idx, fold_indices in gpu_folds_info:
            fold_size = len(fold_indices)
            if fold_size == 0:
                logger.info(f"Skipping empty fold {fold_idx + 1} on GPU {gpu_id}")
                continue
            try:
                logger.info(f"Processing fold {fold_idx + 1} on GPU {gpu_id}")
                fold_embeddings = embeddings[fold_indices].to(device)
                logger.info(f"Computing similarity matrix for fold {fold_idx + 1}")
                max_sim_mat = compute_pairwise_dense(
                    fold_embeddings,
                    batch_size=50000,
                    metric="cosine",
                    device=device,
                    scaling="additive",
                )
                similarity_matrix = max_sim_mat.cpu().numpy()
                subsets = {}
                ds_func = FacilityLocationFunction(
                    n=similarity_matrix.shape[0],
                    sijs=similarity_matrix,
                    mode="dense",
                    separate_rep=False,
                )
                for size_spec in subset_sizes:
                    if isinstance(size_spec, float):
                        percentage = size_spec / 100 if size_spec > 1 else size_spec
                        budget = max(1, math.ceil(percentage * fold_size))
                    else:
                        # Absolute number-based selection
                        budget = max(
                            1,
                            math.ceil(size_spec * (fold_size / total_samples)),
                        )
                    budget = min(budget, fold_size)
                    if budget == 0:
                        continue
                    logger.info(
                        f"Selecting subset of size {budget} for fold {fold_idx + 1}"
                    )
                    subset_result = ds_func.maximize(
                        budget=budget,
                        optimizer="LazierThanLazyGreedy",
                        epsilon=epsilon,
                        stopIfZeroGain=False,
                        stopIfNegativeGain=False,
                        verbose=False,
                    )
                    subset_indices = [fold_indices[x[0]] for x in subset_result]
                    subset_gains = [x[1] for x in subset_result]
                    subsets[size_spec] = {
                        "indices": subset_indices,
                        "gains": subset_gains,
                    }
```
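The budget rule in the suggestion can also be read in isolation; this sketch (hypothetical helper name, example values) shows why the clamp keeps the optimizer feasible:

```python
import math

def fold_budget(size_spec, fold_size, total_samples):
    """Illustrative per-fold budget with the suggested normalization and clamp."""
    if isinstance(size_spec, float):
        percentage = size_spec / 100 if size_spec > 1 else size_spec
        budget = max(1, math.ceil(percentage * fold_size))
    else:
        # Distribute an absolute request proportionally across folds.
        budget = max(1, math.ceil(size_spec * (fold_size / total_samples)))
    # Never ask the optimizer for more items than the fold contains.
    return min(budget, fold_size)

assert fold_budget(0.5, 20, 200) == 10    # 50% of a 20-item fold
assert fold_budget(500, 20, 200) == 20    # oversized request capped at fold size
```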

Comment on lines +122 to +127
```python
            distance = torch.cdist(a, b)
            squared_distance = distance**2
            avg_dist = torch.mean(squared_distance)
            torch.div(squared_distance, kw * avg_dist, out=squared_distance)
            torch.exp(-squared_distance, out=squared_distance)
            return squared_distance
```

⚠️ Potential issue | 🟠 Major

Clamp the RBF scaling factor to avoid NaNs

When all embeddings in a batch are identical (or nearly so), squared_distance becomes 0 everywhere, making avg_dist zero. Dividing by kw * avg_dist (Line 126) therefore performs a 0/0, yielding NaNs that propagate through the similarity matrix and break downstream selection. Clamp the scale away from zero before dividing.

```diff
         if metric == "rbf":
             distance = torch.cdist(a, b)
             squared_distance = distance**2
-            avg_dist = torch.mean(squared_distance)
-            torch.div(squared_distance, kw * avg_dist, out=squared_distance)
+            avg_dist = torch.mean(squared_distance)
+            eps = torch.finfo(squared_distance.dtype).eps
+            scale = kw * torch.clamp(avg_dist, min=eps)
+            torch.div(squared_distance, scale, out=squared_distance)
             torch.exp(-squared_distance, out=squared_distance)
             return squared_distance
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```python
        if metric == "rbf":
            distance = torch.cdist(a, b)
            squared_distance = distance**2
            avg_dist = torch.mean(squared_distance)
            eps = torch.finfo(squared_distance.dtype).eps
            scale = kw * torch.clamp(avg_dist, min=eps)
            torch.div(squared_distance, scale, out=squared_distance)
            torch.exp(-squared_distance, out=squared_distance)
            return squared_distance
```
🤖 Prompt for AI Agents
In scripts/subset_selection/utils/subset_selection_utils.py around lines 122 to
127, the RBF scaling divides by kw * avg_dist which can be zero for identical
embeddings causing 0/0 NaNs; clamp avg_dist away from zero before dividing (e.g.
compute eps = torch.finfo(squared_distance.dtype).eps * squared_distance.numel()
or a small constant like 1e-12 on the same device/dtype, then use
avg_dist_clamped = torch.clamp_min(avg_dist, eps) or max(avg_dist, eps) and
divide by kw * avg_dist_clamped) so the division cannot produce NaN and retains
device/dtype consistency.
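A quick repro confirms the failure mode and the fix; this snippet is illustrative and independent of the PR's code:

```python
import torch

# Identical rows -> all-zero squared distances -> mean is 0 -> 0/0 produces NaN.
a = torch.ones(4, 8)
squared_distance = torch.cdist(a, a) ** 2
avg_dist = torch.mean(squared_distance)          # tensor(0.)
print(torch.exp(-squared_distance / avg_dist))   # all NaN

# With the clamp, the kernel degrades gracefully to all-ones similarities.
eps = torch.finfo(squared_distance.dtype).eps
scale = torch.clamp(avg_dist, min=eps)
print(torch.exp(-squared_distance / scale))      # all 1.0
```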


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (4)
scripts/subset_selection/README.md (3)

103-103: Add language specifier to code block.

The code block should specify a language for proper syntax highlighting and to comply with Markdown best practices.

Apply this diff:

````diff
-```
+```text
 Required:
   --input <file> [<file> ...]    Input file(s) to process (JSONL, JSON, CSV, Parquet)
   --subset-sizes <sizes>         Comma-separated sizes (e.g., "0.1,0.5" or "1000,5000")
````

Based on static analysis hints.


207-207: Add language specifier to code block.

The code block should specify a language for proper syntax highlighting.

Apply this diff:

````diff
-```
+```text
 scripts/
 ├── __init__.py              # Top-level package initialization
 └── subset_selection/
````

Based on static analysis hints.


270-270: Add language specifier to code block.

The code block should specify a language for proper syntax highlighting.

Apply this diff:

````diff
-```
+```text
 RuntimeError: Cannot re-initialize CUDA in forked subprocess
````

Based on static analysis hints.

scripts/subset_selection/cli.py (1)

`124-124`: **Remove unnecessary f-string prefix.**

The string has no placeholders, so the `f` prefix is unnecessary.

Apply this diff:

```diff
-    print(f"Starting subset selection...")
+    print("Starting subset selection...")

```

Based on static analysis hints.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0300fc6 and eac077a.

📒 Files selected for processing (3)
  • scripts/subset_selection/README.md (1 hunks)
  • scripts/subset_selection/cli.py (1 hunks)
  • scripts/subset_selection/requirements.txt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/subset_selection/cli.py (1)
scripts/subset_selection/subset_selection.py (1)
  • subset_datasets (864-923)
🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md

103-103: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


207-207: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


270-270: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 OSV Scanner (2.2.3)
scripts/subset_selection/requirements.txt

[CRITICAL] 1-1: torch 2.5.1+cu121: undefined

(PYSEC-2025-41)


[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch susceptible to local Denial of Service

(GHSA-3749-ghw9-m3mg)


[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch: torch.load with weights_only=True leads to remote code execution

(GHSA-53q9-r3pm-6pq6)


[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch Improper Resource Shutdown or Release vulnerability

(GHSA-887c-mr87-cxwp)

🪛 Ruff (0.14.0)
scripts/subset_selection/cli.py

1-1: Shebang is present but file is not executable

(EXE001)


124-124: f-string without any placeholders

Remove extraneous f prefix

(F541)


155-155: Consider moving this statement to an else block

(TRY300)


156-156: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (10)
scripts/subset_selection/README.md (7)

1-58: LGTM! Clear installation and model setup documentation.

The overview, installation instructions, and model setup options are well-documented and provide clear guidance for users.


59-100: LGTM! Comprehensive CLI usage examples.

The CLI usage examples cover a wide range of scenarios and provide clear guidance for different use cases.


122-161: LGTM! Clear Python API documentation.

The Python API examples are well-structured and demonstrate both basic and advanced usage patterns.


162-204: LGTM! Thorough configuration documentation.

The configuration section provides detailed parameter descriptions with helpful recommendations based on dataset size.


224-243: LGTM! Clear documentation of encoders and outputs.

The supported encoders and output files sections are concise and informative.


280-312: LGTM! Comprehensive troubleshooting guidance.

The troubleshooting section provides clear solutions for common issues users may encounter.


313-328: LGTM! Helpful notes for users.

The notes section provides important context about dataset size recommendations, GPU requirements, and performance considerations.

scripts/subset_selection/cli.py (3)

13-101: LGTM! Well-structured argument parsing.

The argument parsing is comprehensive with clear help text and appropriate defaults for all parameters.


103-122: LGTM! Proper multiprocessing setup and argument parsing.

The multiprocessing setup for CUDA compatibility is appropriate, and the subset size parsing correctly handles both percentage and absolute count formats.
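For readers skimming, the parsing rule being praised amounts to roughly the following sketch (function name and rounding details are assumptions, not the PR's verbatim code):

```python
import multiprocessing

def parse_subset_sizes(spec: str) -> list[float | int]:
    # Values in (0, 1) are fractions (0.1 -> 10%); values >= 1 are absolute counts.
    sizes: list[float | int] = []
    for token in spec.split(","):
        value = float(token.strip())
        sizes.append(value if value < 1 else int(value))
    return sizes

if __name__ == "__main__":
    # CUDA-safe start method for worker processes, as the comment above notes.
    multiprocessing.set_start_method("spawn", force=True)
    print(parse_subset_sizes("0.1,0.5"))    # [0.1, 0.5]
    print(parse_subset_sizes("1000,5000"))  # [1000, 5000]
```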


147-162: LGTM! Appropriate exception handling for CLI entry point.

The broad exception handling is acceptable for a CLI entry point, as it ensures user-friendly error messages are displayed regardless of the error type. The function properly returns exit codes for success (0) and failure (1).

```diff
@@ -0,0 +1,162 @@
 #!/usr/bin/env python3
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify file permissions for executable script.

The shebang is present, but the file might not have executable permissions. If this file is intended to be run directly, set the executable bit.

Run the following script to check and optionally fix file permissions:

Based on static analysis hints.


🏁 Script executed:

```bash
#!/bin/bash
# Check if the file is executable
if [ -x scripts/subset_selection/cli.py ]; then
  echo "File is executable"
else
  echo "File is NOT executable. To make it executable, run:"
  echo "chmod +x scripts/subset_selection/cli.py"
fi
```

Length of output: 92


Make scripts/subset_selection/cli.py executable.

The file lacks the executable bit; run

```bash
chmod +x scripts/subset_selection/cli.py
```

to allow direct execution via its shebang.

🧰 Tools
🪛 Ruff (0.14.0)

1-1: Shebang is present but file is not executable

(EXE001)

🤖 Prompt for AI Agents
In scripts/subset_selection/cli.py around line 1, the script has a shebang but
is missing the executable permission; make the file executable by running chmod
+x scripts/subset_selection/cli.py (or set the executable bit in your VCS/CI),
and verify the shebang (#!/usr/bin/env python3) is correct so the script can be
run directly.

Comment on lines +244 to +264

````markdown
## Quick Start Example

Using your data file:

```bash
# Navigate to project root
cd /Users/roburishabh/Github/odh-data-processing

# Run subset selection - Select 10% and 50% subsets
python -m scripts.subset_selection.cli \
--input scripts/subset_selection/data/combined_cut_50x.jsonl \
--subset-sizes "0.1,0.5" \
--output-dir scripts/subset_selection/data/output \
--epsilon 0.1 \
--num-folds 10

# Check results
ls scripts/subset_selection/data/output/
```
````


⚠️ Potential issue | 🟡 Minor

Use relative path instead of absolute developer-specific path.

Line 251 contains an absolute path specific to a developer's machine, which could confuse users.

Apply this diff:

````diff
 ```bash
 # Navigate to project root
-cd /Users/roburishabh/Github/odh-data-processing
+cd odh-data-processing

 # Run subset selection - Select 10% and 50% subsets
 python -m scripts.subset_selection.cli \
````
🤖 Prompt for AI Agents
In scripts/subset_selection/README.md around lines 244 to 264, the Quick Start
uses an absolute, developer-specific path; change the example to use a relative
path (e.g., "cd odh-data-processing" or simply "cd <project-root>") so the
instructions work for other users and do not expose a local filesystem location;
update the README example accordingly and ensure the rest of the commands remain
unchanged.

Comment on lines +38 to +52
```text
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.9.86
nvidia-nvtx-cu12==12.1.105
packaging==25.0
pandas==2.3.3
pillow==11.3.0
```

⚠️ Potential issue | 🟠 Major

GPU-only wheels make local CPU setup impossible

Pinning the +cu121 torch stack plus the nvidia-* CUDA wheels forces Linux + CUDA 12.1; pip will fail outright on CPU-only laptops, contradicting the stated “tested locally without GPU”. Provide a CPU-friendly requirements set (e.g., torch==... w/o +cu121, gate CUDA deps behind an extra, or split into cpu/gpu files) so contributors can install and run the scripts on non-GPU environments.

🤖 Prompt for AI Agents
In scripts/subset_selection/requirements.txt around lines 38 to 52, the current
file pins CUDA-specific nvidia wheel packages and a +cu121 torch build which
prevents installation on CPU-only machines; update the requirements so
contributors without GPUs can install: remove or make optional the nvidia-*
CUDA-specific packages and avoid a +cu121 torch pin (use a generic torch version
or leave torch unpinned), and provide one of the following fixes — either split
into two files (requirements.txt for CPU-friendly deps and requirements-gpu.txt
for CUDA deps), or gate CUDA packages behind an extras marker (e.g., [gpu]) and
document installing with pip install -r requirements.txt (and optionally pip
install -r requirements-gpu.txt) or pip install .[gpu]; ensure the default
requirements.txt installs and runs on CPU-only systems.
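One way to realize the split, sketched under the assumption that a separate GPU requirements file is introduced (the file names and the use of the CPU wheel index are illustrative, not part of this PR):

```bash
# CPU-only contributors: install torch from the official CPU wheel index,
# then the remaining (non-nvidia) dependencies.
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt

# GPU machines: layer the CUDA-specific pins on top.
pip install -r requirements-gpu.txt   # torch==2.5.1+cu121 and the nvidia-* wheels
```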

Comment on lines +70 to +76
```text
torch==2.5.1+cu121
torchaudio==2.5.1+cu121
torchvision==0.20.1+cu121
tqdm==4.67.1
transformers==4.57.0
triton==3.1.0
typing_extensions==4.15.0
```

⚠️ Potential issue | 🔴 Critical

Blocker: torch 2.5.1+cu121 is under multiple critical CVEs

OSV flags this exact build for GHSA-3749-ghw9-m3mg, GHSA-53q9-r3pm-6pq6, GHSA-887c-mr87-cxwp (see static analysis hints). Until the project bumps to a patched PyTorch release (and aligns torchaudio/torchvision accordingly), we ship with RCE/DoS exposure. Please update to a non-vulnerable version or document a mitigation before merging.

🤖 Prompt for AI Agents
In scripts/subset_selection/requirements.txt around lines 70 to 76, the pinned
torch==2.5.1+cu121 (and matching torchaudio/torchvision) is flagged by multiple
CVEs; update torch to a patched, non-vulnerable release (choose the minimum safe
version per OSV/GHSA advisories), and align torchaudio and torchvision to the
matching compatible builds (CUDA tags if needed), update the lock/build/test
config accordingly, run full CI and smoke tests to verify compatibility, and if
you cannot upgrade now add a clear mitigation note in the requirements or repo
security doc explaining the temporary risk and any runtime/workaround steps
until upgraded.
