
Conversation

@RobuRishabh (Contributor) commented Oct 14, 2025

Description

https://issues.redhat.com/browse/RHAIENG-1146
This PR contains only the subset selection scripts (.py files).

How Has This Been Tested?

  • Tested in a local environment on CPU only (no GPU).
  • Tested on AWS with a GPU.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

  • New Features

    • Added a subset selection tool with CLI and Python API, multi-file inputs, percentage/absolute subset sizes, testing mode, and progress output.
    • Embedding-based selection with facility-location optimization, multi-GPU support, and an Arctic encoder option.
    • Produces embeddings, metadata, and subsets in common formats.
  • Documentation

    • Added a comprehensive README with installation, quick start, CLI examples, API usage, configuration, and troubleshooting.
  • Chores

    • Added project-specific ignore rules and a requirements file for the workflow.

Signed-off-by: roburishabh <[email protected]>

coderabbitai bot commented Oct 14, 2025

Walkthrough

Adds a new subset_selection package: a CLI, an encoder registry with an Arctic encoder, utilities, and a configuration-driven core pipeline for embedding generation and facility-location subset selection, plus a README, requirements, and .gitignore. The package-level API is exposed via __init__.py.
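For orientation, the exposed entry point can be driven directly from Python. The following is a minimal sketch only; every keyword beyond `input_files` and `subset_sizes` (both confirmed by the sequence diagram below) is an assumption inferred from the CLI flags quoted later in this review:

```python
from scripts.subset_selection import subset_datasets

# Hypothetical invocation: select a 10% subset and a 1000-sample subset.
# output_dir mirrors the CLI's --output-dir flag and is assumed, not verified.
subset_datasets(
    input_files=["data/train.jsonl"],
    subset_sizes=[0.1, 1000],
    output_dir="data/output",
)
```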

Changes

| Cohort / File(s) | Summary of Changes |
| --- | --- |
| **Docs & Project Files**<br>`scripts/subset_selection/README.md`, `scripts/subset_selection/.gitignore`, `scripts/subset_selection/requirements.txt` | Added README describing workflow, usage, config, and troubleshooting; added .gitignore for local/data/dev artifacts; added pinned requirements for ML, data, and tooling dependencies. |
| **Package API Surface**<br>`scripts/subset_selection/__init__.py` | Re-exports public symbols (BasicConfig, DataProcessor, EncoderConfig, ProcessingConfig, SystemConfig, TemplateConfig, get_supported_encoders, subset_datasets) and defines `__all__`. |
| **CLI**<br>`scripts/subset_selection/cli.py` | New argparse-based CLI with parse_args() and main() that parses inputs/options, configures multiprocessing for CUDA, and invokes subset_datasets. |
| **Encoder Registry**<br>`scripts/subset_selection/encoders/__init__.py` | Introduces the ENCODER_REGISTRY mapping and get_encoder_class(encoder_type) with error handling; registers "arctic". |
| **Arctic Encoder**<br>`scripts/subset_selection/encoders/arctic_encoder.py` | New ArcticEmbedEncoder, EncoderConfig dataclass, and ModelConfig TypedDict; model loading (testing/local), device/FP16 handling, batched tokenization/encoding, normalization, and cleanup utilities. |
| **Core Pipeline**<br>`scripts/subset_selection/subset_selection.py` | Adds config dataclasses (BasicConfig, EncoderConfig, TemplateConfig, SystemConfig, ProcessingConfig), DataProcessor, shard-based embedding generation and saving, fold-based facility-location subset selection, templating, merging, and the subset_datasets orchestrator. |
| **Utilities**<br>`scripts/subset_selection/utils/__init__.py`, `scripts/subset_selection/utils/subset_selection_utils.py` | New utilities: retry_on_exception decorator, get_default_num_gpus, compute_pairwise_dense (cosine/dot/euclidean/rbf, batching, scaling); the utils package re-exports key helpers (see the sketch after this table). |
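The utilities cohort can be exercised in isolation. A minimal sketch based on the call sites quoted later in this review; the keyword names are taken from those snippets, and the exact signature is an assumption rather than verified against the PR:

```python
import torch

from scripts.subset_selection.utils import compute_pairwise_dense

# Pairwise cosine similarity with additive scaling over random embeddings,
# mirroring how the fold-selection code quoted below calls this helper.
embeddings = torch.randn(100, 768)
similarity = compute_pairwise_dense(
    embeddings,
    batch_size=50_000,
    metric="cosine",
    device="cpu",
    scaling="additive",
)
print(similarity.shape)  # expected: torch.Size([100, 100])
```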

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor User
  participant CLI as CLI (cli.py)
  participant Orchestrator as subset_datasets
  participant Processor as DataProcessor
  participant Registry as Encoder Registry
  participant Encoder as ArcticEmbedEncoder
  participant Utils as Utils
  participant FS as Filesystem

  User->>CLI: run with inputs & options
  CLI->>Orchestrator: subset_datasets(input_files, subset_sizes, **kwargs)
  Orchestrator->>Processor: instantiate & configure
  Processor->>FS: load dataset(s)
  Processor->>Registry: get_encoder_class("arctic")
  Registry-->>Processor: ArcticEmbedEncoder
  Processor->>Encoder: init(model, device, fp16, testing)
  loop per shard
    Processor->>Encoder: encode(batch)
    Encoder-->>Processor: embeddings
    Processor->>FS: save shard embeddings
  end
  Processor->>FS: merge shard embeddings
  loop per fold / subset size
    Processor->>Utils: compute_pairwise_dense / selection
    Processor->>FS: save subset indices & metadata
  end
  Processor-->>Orchestrator: results
  Orchestrator-->>CLI: completion
  CLI-->>User: exit code / message
```
```mermaid
sequenceDiagram
  autonumber
  participant Caller
  participant Registry as Encoder Registry
  participant Arctic as ArcticEmbedEncoder

  Caller->>Registry: get_encoder_class(type)
  alt supported
    Registry-->>Caller: ArcticEmbedEncoder
    Caller->>Arctic: __init__(...)
    Arctic-->>Caller: encoder instance
  else unsupported
    Registry-->>Caller: ValueError
  end
```

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

I thump through files and GPU glow,
I cup embeddings row by row.
Fold by fold I pick with care,
Arctic winds comb data hair.
A rabbit's hop, a tidy chest—subset found, now time to rest. 🐰✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title Check | ⚠️ Warning | The current title "[RHAIENG]-1146 Initial Repository Setup and Baseline Testing (only scripts)" includes a ticket identifier and a vague reference to baseline testing but fails to clearly describe the primary change of adding a new subset selection scripts package with CLI, encoders, and utilities. It does not convey the core functionality introduced by the PR. | Please rename the pull request to a concise, descriptive title that highlights the main change, for example "Add subset_selection scripts package with CLI, encoders, and utilities." |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%. |

@RobuRishabh RobuRishabh changed the title R1146 only scripts for subset selection [RHAIENG]-1146 Initial Repository Setup and Baseline Testing (only scripts) Oct 14, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14cd429 and 0300fc6.

📒 Files selected for processing (10)
  • scripts/subset_selection/.gitignore (1 hunks)
  • scripts/subset_selection/README.md (1 hunks)
  • scripts/subset_selection/__init__.py (1 hunks)
  • scripts/subset_selection/cli.py (1 hunks)
  • scripts/subset_selection/encoders/__init__.py (1 hunks)
  • scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
  • scripts/subset_selection/requirements.txt (1 hunks)
  • scripts/subset_selection/subset_selection.py (1 hunks)
  • scripts/subset_selection/utils/__init__.py (1 hunks)
  • scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
scripts/subset_selection/utils/__init__.py (1)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (86-147)
  • get_default_num_gpus (66-83)
  • retry_on_exception (20-63)
scripts/subset_selection/cli.py (1)
scripts/subset_selection/subset_selection.py (1)
  • subset_datasets (864-923)
scripts/subset_selection/__init__.py (2)
scripts/subset_selection/subset_selection.py (8)
  • BasicConfig (39-78)
  • DataProcessor (168-576)
  • EncoderConfig (82-93)
  • ProcessingConfig (127-165)
  • SystemConfig (112-123)
  • TemplateConfig (97-108)
  • get_supported_encoders (853-861)
  • subset_datasets (864-923)
scripts/subset_selection/encoders/arctic_encoder.py (1)
  • EncoderConfig (47-55)
scripts/subset_selection/encoders/__init__.py (1)
scripts/subset_selection/encoders/arctic_encoder.py (1)
  • ArcticEmbedEncoder (58-204)
scripts/subset_selection/encoders/arctic_encoder.py (1)
scripts/subset_selection/subset_selection.py (1)
  • EncoderConfig (82-93)
scripts/subset_selection/subset_selection.py (3)
scripts/subset_selection/encoders/__init__.py (1)
  • get_encoder_class (12-23)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (86-147)
  • get_default_num_gpus (66-83)
  • retry_on_exception (20-63)
scripts/subset_selection/encoders/arctic_encoder.py (2)
  • EncoderConfig (47-55)
  • encode (170-204)
🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md

64-64: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


168-168: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Pylint (4.0.0)
scripts/subset_selection/utils/subset_selection_utils.py

[refactor] 86-86: Too many positional arguments (7/5)

(R0917)

scripts/subset_selection/encoders/arctic_encoder.py

[refactor] 59-59: Too many positional arguments (6/5)

(R0917)


[refactor] 58-58: Too few public methods (1/2)

(R0903)

scripts/subset_selection/subset_selection.py

[refactor] 738-738: Too many branches (14/12)

(R0912)

🪛 Ruff (0.14.0)
scripts/subset_selection/cli.py

1-1: Shebang is present but file is not executable

(EXE001)


116-116: f-string without any placeholders

Remove extraneous f prefix

(F541)


147-147: Consider moving this statement to an else block

(TRY300)


148-148: Do not catch blind exception: Exception

(BLE001)

scripts/subset_selection/utils/subset_selection_utils.py

34-34: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


34-34: Use explicit conversion flag

Replace with conversion flag

(RUF010)


38-40: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


39-39: Use explicit conversion flag

Replace with conversion flag

(RUF010)


44-44: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


44-44: Use explicit conversion flag

Replace with conversion flag

(RUF010)


48-48: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


48-48: Use explicit conversion flag

Replace with conversion flag

(RUF010)


52-52: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


52-52: Use explicit conversion flag

Replace with conversion flag

(RUF010)


80-82: Avoid specifying long messages outside the exception class

(TRY003)


128-128: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/encoders/__init__.py

17-20: Abstract raise to an inner function

(TRY301)


17-20: Avoid specifying long messages outside the exception class

(TRY003)


23-23: Avoid specifying long messages outside the exception class

(TRY003)


23-23: Use explicit conversion flag

Replace with conversion flag

(RUF010)

scripts/subset_selection/encoders/arctic_encoder.py

69-71: Avoid specifying long messages outside the exception class

(TRY003)


117-120: Avoid specifying long messages outside the exception class

(TRY003)


148-151: Avoid specifying long messages outside the exception class

(TRY003)


161-164: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/subset_selection.py

59-59: Avoid specifying long messages outside the exception class

(TRY003)


155-155: Prefer TypeError exception for invalid type

(TRY004)


155-155: Avoid specifying long messages outside the exception class

(TRY003)


159-159: Prefer TypeError exception for invalid type

(TRY004)


159-159: Avoid specifying long messages outside the exception class

(TRY003)


161-163: Avoid specifying long messages outside the exception class

(TRY003)


165-165: Avoid specifying long messages outside the exception class

(TRY003)


181-181: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


206-206: Avoid specifying long messages outside the exception class

(TRY003)


235-237: Avoid specifying long messages outside the exception class

(TRY003)


256-258: Avoid specifying long messages outside the exception class

(TRY003)


345-345: Avoid specifying long messages outside the exception class

(TRY003)


412-412: Loop control variable fold_idx not used within loop body

Rename unused fold_idx to _fold_idx

(B007)


424-427: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


494-494: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


494-494: Use explicit conversion flag

Replace with conversion flag

(RUF010)


558-558: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


558-558: Use explicit conversion flag

Replace with conversion flag

(RUF010)


614-614: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


640-640: Abstract raise to an inner function

(TRY301)


640-640: Avoid specifying long messages outside the exception class

(TRY003)


683-683: Consider moving this statement to an else block

(TRY300)


688-688: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


688-688: Use explicit conversion flag

Replace with conversion flag

(RUF010)


762-762: Abstract raise to an inner function

(TRY301)


762-762: Avoid specifying long messages outside the exception class

(TRY003)


831-833: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


832-832: Use explicit conversion flag

Replace with conversion flag

(RUF010)


847-847: Consider moving this statement to an else block

(TRY300)


849-849: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


849-849: Use explicit conversion flag

Replace with conversion flag

(RUF010)


916-916: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


916-916: Use explicit conversion flag

Replace with conversion flag

(RUF010)

Comment on lines +253 to +263
```python
        if isinstance(size_spec, float):
            # if not in range of 0 to 1, raise error
            if size_spec <= 0 or size_spec > 1:
                raise ValueError(
                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
                )
            # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
            return max(1, int((size_spec) * total_samples))
        # Treat as absolute number
        return min(size_spec, total_samples)
```


⚠️ Potential issue | 🔴 Critical

Fix subset percentage normalization

ProcessingConfig.__post_init__ explicitly allows float subset_sizes up to 100 to represent percentages, but calculate_subset_size then rejects anything greater than 1. A config such as subset_sizes=[10.0] sails through validation yet explodes here, so the CLI fails on a perfectly valid 10% request. Normalize values >1 to fractions (and keep the validation aligned) so we honor the documented percentage semantics and avoid the runtime error.

```diff
         if isinstance(size_spec, float):
-            # if not in range of 0 to 1, raise error
-            if size_spec <= 0 or size_spec > 1:
-                raise ValueError(
-                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-                )
-            # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
-            return max(1, int((size_spec) * total_samples))
+            percentage = size_spec / 100 if size_spec > 1 else size_spec
+            if percentage <= 0 or percentage > 1:
+                raise ValueError(
+                    "Percentage values must be between 0 (exclusive) and 100 (inclusive)"
+                )
+            return max(1, int(percentage * total_samples))
```

Make the same normalization when computing per-fold budgets so both code paths agree.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```python
        if isinstance(size_spec, float):
            percentage = size_spec / 100 if size_spec > 1 else size_spec
            if percentage <= 0 or percentage > 1:
                raise ValueError(
                    "Percentage values must be between 0 (exclusive) and 100 (inclusive)"
                )
            return max(1, int(percentage * total_samples))
        # Treat as absolute number
        return min(size_spec, total_samples)
```
🧰 Tools
🪛 Ruff (0.14.0)

256-258: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In scripts/subset_selection/subset_selection.py around lines 253 to 263, float
size_spec values greater than 1 are currently rejected even though config allows
percentages expressed as >1 (e.g., 10.0 meaning 10%); adjust
calculate_subset_size to first normalize float percentage values >1 by dividing
by 100 (so 10.0 -> 0.10), then validate the resulting fraction is >0 and <=1 and
compute the sample count; also apply the same normalization logic in the
per-fold budget computation path so both code paths treat percentage semantics
consistently.
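To make the proposed semantics concrete, here is a standalone, illustrative version of the normalized helper (the helper name follows the prompt above; the assertions are worked examples, not tests from the PR):

```python
def calculate_subset_size(size_spec, total_samples):
    """Illustrative sketch of the suggested normalization, not the PR's code."""
    if isinstance(size_spec, float):
        # Normalize percentages given as >1 (e.g. 10.0 -> 0.10).
        percentage = size_spec / 100 if size_spec > 1 else size_spec
        if percentage <= 0 or percentage > 1:
            raise ValueError(
                "Percentage values must be between 0 (exclusive) and 100 (inclusive)"
            )
        return max(1, int(percentage * total_samples))
    # Absolute counts are capped at the dataset size.
    return min(size_spec, total_samples)

assert calculate_subset_size(10.0, 200) == 20   # 10% written as 10.0
assert calculate_subset_size(0.1, 200) == 20    # 10% written as 0.1
assert calculate_subset_size(500, 200) == 200   # oversized absolute request
```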

Comment on lines +769 to +827
```python
        for fold_idx, fold_indices in gpu_folds_info:
            try:
                logger.info(f"Processing fold {fold_idx + 1} on GPU {gpu_id}")

                fold_embeddings = embeddings[fold_indices].to(device)

                logger.info(f"Computing similarity matrix for fold {fold_idx + 1}")
                max_sim_mat = compute_pairwise_dense(
                    fold_embeddings,
                    batch_size=50000,
                    metric="cosine",
                    device=device,
                    scaling="additive",
                )
                similarity_matrix = max_sim_mat.cpu().numpy()

                subsets = {}
                ds_func = FacilityLocationFunction(
                    n=similarity_matrix.shape[0],
                    sijs=similarity_matrix,
                    mode="dense",
                    separate_rep=False,
                )

                for size_spec in subset_sizes:
                    if isinstance(size_spec, float):
                        # Percentage-based selection
                        budget = max(
                            1, math.ceil(size_spec * similarity_matrix.shape[0])
                        )
                    else:
                        # Absolute number-based selection
                        budget = max(
                            1,
                            math.ceil(
                                size_spec * (similarity_matrix.shape[0] / total_samples)
                            ),
                        )

                    logger.info(
                        f"Selecting subset of size {budget} for fold {fold_idx + 1}"
                    )

                    subset_result = ds_func.maximize(
                        budget=budget,
                        optimizer="LazierThanLazyGreedy",
                        epsilon=epsilon,
                        stopIfZeroGain=False,
                        stopIfNegativeGain=False,
                        verbose=False,
                    )

                    subset_indices = [fold_indices[x[0]] for x in subset_result]
                    subset_gains = [x[1] for x in subset_result]
                    subsets[size_spec] = {
                        "indices": subset_indices,
                        "gains": subset_gains,
                    }
```


⚠️ Potential issue | 🔴 Critical

Skip empty folds and cap per-fold budgets

When len(embeddings) < num_folds we manufacture empty folds. Likewise, requesting an absolute subset larger than the dataset (e.g. subset_sizes=[500] on a 200-row corpus) makes the fold-level budget exceed the fold cardinality. In both cases FacilityLocationFunction.maximize is asked to pick at least one item from a fold that has zero (or fewer than budget) candidates, and we crash. Guard the empty folds and clamp budgets to the fold size so the optimizer only runs on feasible sets.

```diff
-        for fold_idx, fold_indices in gpu_folds_info:
+        for fold_idx, fold_indices in gpu_folds_info:
+            fold_size = len(fold_indices)
+            if fold_size == 0:
+                logger.info(f"Skipping empty fold {fold_idx + 1} on GPU {gpu_id}")
+                continue
             try:
@@
-                for size_spec in subset_sizes:
+                for size_spec in subset_sizes:
                     if isinstance(size_spec, float):
-                        # Percentage-based selection
-                        budget = max(
-                            1, math.ceil(size_spec * similarity_matrix.shape[0])
-                        )
+                        percentage = size_spec / 100 if size_spec > 1 else size_spec
+                        budget = max(1, math.ceil(percentage * fold_size))
                     else:
                         # Absolute number-based selection
-                        budget = max(
-                            1,
-                            math.ceil(
-                                size_spec * (similarity_matrix.shape[0] / total_samples)
-                            ),
-                        )
+                        budget = max(
+                            1,
+                            math.ceil(size_spec * (fold_size / total_samples)),
+                        )
+                    budget = min(budget, fold_size)
+                    if budget == 0:
+                        continue
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```python
        for fold_idx, fold_indices in gpu_folds_info:
            fold_size = len(fold_indices)
            if fold_size == 0:
                logger.info(f"Skipping empty fold {fold_idx + 1} on GPU {gpu_id}")
                continue
            try:
                logger.info(f"Processing fold {fold_idx + 1} on GPU {gpu_id}")
                fold_embeddings = embeddings[fold_indices].to(device)
                logger.info(f"Computing similarity matrix for fold {fold_idx + 1}")
                max_sim_mat = compute_pairwise_dense(
                    fold_embeddings,
                    batch_size=50000,
                    metric="cosine",
                    device=device,
                    scaling="additive",
                )
                similarity_matrix = max_sim_mat.cpu().numpy()
                subsets = {}
                ds_func = FacilityLocationFunction(
                    n=similarity_matrix.shape[0],
                    sijs=similarity_matrix,
                    mode="dense",
                    separate_rep=False,
                )
                for size_spec in subset_sizes:
                    if isinstance(size_spec, float):
                        percentage = size_spec / 100 if size_spec > 1 else size_spec
                        budget = max(1, math.ceil(percentage * fold_size))
                    else:
                        # Absolute number-based selection
                        budget = max(
                            1,
                            math.ceil(size_spec * (fold_size / total_samples)),
                        )
                    budget = min(budget, fold_size)
                    if budget == 0:
                        continue
                    logger.info(
                        f"Selecting subset of size {budget} for fold {fold_idx + 1}"
                    )
                    subset_result = ds_func.maximize(
                        budget=budget,
                        optimizer="LazierThanLazyGreedy",
                        epsilon=epsilon,
                        stopIfZeroGain=False,
                        stopIfNegativeGain=False,
                        verbose=False,
                    )
                    subset_indices = [fold_indices[x[0]] for x in subset_result]
                    subset_gains = [x[1] for x in subset_result]
                    subsets[size_spec] = {
                        "indices": subset_indices,
                        "gains": subset_gains,
                    }
```
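The budget rule in the suggestion can also be read in isolation; this sketch (hypothetical helper name, example values) shows why the clamp keeps the optimizer feasible:

```python
import math

def fold_budget(size_spec, fold_size, total_samples):
    """Illustrative per-fold budget with the suggested normalization and clamp."""
    if isinstance(size_spec, float):
        percentage = size_spec / 100 if size_spec > 1 else size_spec
        budget = max(1, math.ceil(percentage * fold_size))
    else:
        # Distribute an absolute request proportionally across folds.
        budget = max(1, math.ceil(size_spec * (fold_size / total_samples)))
    # Never ask the optimizer for more items than the fold contains.
    return min(budget, fold_size)

assert fold_budget(0.5, 20, 200) == 10    # 50% of a 20-item fold
assert fold_budget(500, 20, 200) == 20    # oversized request capped at fold size
```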

Comment on lines +122 to +127
```python
            distance = torch.cdist(a, b)
            squared_distance = distance**2
            avg_dist = torch.mean(squared_distance)
            torch.div(squared_distance, kw * avg_dist, out=squared_distance)
            torch.exp(-squared_distance, out=squared_distance)
            return squared_distance
```

⚠️ Potential issue | 🟠 Major

Clamp the RBF scaling factor to avoid NaNs

When all embeddings in a batch are identical (or nearly so), squared_distance becomes 0 everywhere, making avg_dist zero. Dividing by kw * avg_dist (Line 126) therefore performs a 0/0, yielding NaNs that propagate through the similarity matrix and break downstream selection. Clamp the scale away from zero before dividing.

```diff
         if metric == "rbf":
             distance = torch.cdist(a, b)
             squared_distance = distance**2
-            avg_dist = torch.mean(squared_distance)
-            torch.div(squared_distance, kw * avg_dist, out=squared_distance)
+            avg_dist = torch.mean(squared_distance)
+            eps = torch.finfo(squared_distance.dtype).eps
+            scale = kw * torch.clamp(avg_dist, min=eps)
+            torch.div(squared_distance, scale, out=squared_distance)
             torch.exp(-squared_distance, out=squared_distance)
             return squared_distance
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```python
        if metric == "rbf":
            distance = torch.cdist(a, b)
            squared_distance = distance**2
            avg_dist = torch.mean(squared_distance)
            eps = torch.finfo(squared_distance.dtype).eps
            scale = kw * torch.clamp(avg_dist, min=eps)
            torch.div(squared_distance, scale, out=squared_distance)
            torch.exp(-squared_distance, out=squared_distance)
            return squared_distance
```
🤖 Prompt for AI Agents
In scripts/subset_selection/utils/subset_selection_utils.py around lines 122 to
127, the RBF scaling divides by kw * avg_dist which can be zero for identical
embeddings causing 0/0 NaNs; clamp avg_dist away from zero before dividing (e.g.
compute eps = torch.finfo(squared_distance.dtype).eps * squared_distance.numel()
or a small constant like 1e-12 on the same device/dtype, then use
avg_dist_clamped = torch.clamp_min(avg_dist, eps) or max(avg_dist, eps) and
divide by kw * avg_dist_clamped) so the division cannot produce NaN and retains
device/dtype consistency.
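A quick repro confirms the failure mode and the fix; this snippet is illustrative and independent of the PR's code:

```python
import torch

# Identical rows -> all-zero squared distances -> mean is 0 -> 0/0 produces NaN.
a = torch.ones(4, 8)
squared_distance = torch.cdist(a, a) ** 2
avg_dist = torch.mean(squared_distance)          # tensor(0.)
print(torch.exp(-squared_distance / avg_dist))   # all NaN

# With the clamp, the kernel degrades gracefully to all-ones similarities.
eps = torch.finfo(squared_distance.dtype).eps
scale = torch.clamp(avg_dist, min=eps)
print(torch.exp(-squared_distance / scale))      # all 1.0
```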


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (4)
scripts/subset_selection/README.md (3)

103-103: Add language specifier to code block.

The code block should specify a language for proper syntax highlighting and to comply with Markdown best practices.

Apply this diff:

````diff
-```
+```text
 Required:
   --input <file> [<file> ...]    Input file(s) to process (JSONL, JSON, CSV, Parquet)
   --subset-sizes <sizes>         Comma-separated sizes (e.g., "0.1,0.5" or "1000,5000")
````

Based on static analysis hints.


207-207: Add language specifier to code block.

The code block should specify a language for proper syntax highlighting.

Apply this diff:

````diff
-```
+```text
 scripts/
 ├── __init__.py              # Top-level package initialization
 └── subset_selection/
````

Based on static analysis hints.


270-270: Add language specifier to code block.

The code block should specify a language for proper syntax highlighting.

Apply this diff:

````diff
-```
+```text
 RuntimeError: Cannot re-initialize CUDA in forked subprocess
````

Based on static analysis hints.

scripts/subset_selection/cli.py (1)

`124-124`: **Remove unnecessary f-string prefix.**

The string has no placeholders, so the `f` prefix is unnecessary.

Apply this diff:

```diff
-    print(f"Starting subset selection...")
+    print("Starting subset selection...")

```

Based on static analysis hints.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0300fc6 and eac077a.

📒 Files selected for processing (3)
  • scripts/subset_selection/README.md (1 hunks)
  • scripts/subset_selection/cli.py (1 hunks)
  • scripts/subset_selection/requirements.txt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/subset_selection/cli.py (1)
scripts/subset_selection/subset_selection.py (1)
  • subset_datasets (864-923)
🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md

103-103: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


207-207: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


270-270: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 OSV Scanner (2.2.3)
scripts/subset_selection/requirements.txt

[CRITICAL] 1-1: torch 2.5.1+cu121: undefined

(PYSEC-2025-41)


[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch susceptible to local Denial of Service

(GHSA-3749-ghw9-m3mg)


[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch: torch.load with weights_only=True leads to remote code execution

(GHSA-53q9-r3pm-6pq6)


[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch Improper Resource Shutdown or Release vulnerability

(GHSA-887c-mr87-cxwp)

🪛 Ruff (0.14.0)
scripts/subset_selection/cli.py

1-1: Shebang is present but file is not executable

(EXE001)


124-124: f-string without any placeholders

Remove extraneous f prefix

(F541)


155-155: Consider moving this statement to an else block

(TRY300)


156-156: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (10)
scripts/subset_selection/README.md (7)

1-58: LGTM! Clear installation and model setup documentation.

The overview, installation instructions, and model setup options are well-documented and provide clear guidance for users.


59-100: LGTM! Comprehensive CLI usage examples.

The CLI usage examples cover a wide range of scenarios and provide clear guidance for different use cases.


122-161: LGTM! Clear Python API documentation.

The Python API examples are well-structured and demonstrate both basic and advanced usage patterns.


162-204: LGTM! Thorough configuration documentation.

The configuration section provides detailed parameter descriptions with helpful recommendations based on dataset size.


224-243: LGTM! Clear documentation of encoders and outputs.

The supported encoders and output files sections are concise and informative.


280-312: LGTM! Comprehensive troubleshooting guidance.

The troubleshooting section provides clear solutions for common issues users may encounter.


313-328: LGTM! Helpful notes for users.

The notes section provides important context about dataset size recommendations, GPU requirements, and performance considerations.

scripts/subset_selection/cli.py (3)

13-101: LGTM! Well-structured argument parsing.

The argument parsing is comprehensive with clear help text and appropriate defaults for all parameters.


103-122: LGTM! Proper multiprocessing setup and argument parsing.

The multiprocessing setup for CUDA compatibility is appropriate, and the subset size parsing correctly handles both percentage and absolute count formats.
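For readers skimming, the parsing rule being praised amounts to roughly the following sketch (function name and rounding details are assumptions, not the PR's verbatim code):

```python
import multiprocessing

def parse_subset_sizes(spec: str) -> list[float | int]:
    # Values in (0, 1) are fractions (0.1 -> 10%); values >= 1 are absolute counts.
    sizes: list[float | int] = []
    for token in spec.split(","):
        value = float(token.strip())
        sizes.append(value if value < 1 else int(value))
    return sizes

if __name__ == "__main__":
    # CUDA-safe start method for worker processes, as the comment above notes.
    multiprocessing.set_start_method("spawn", force=True)
    print(parse_subset_sizes("0.1,0.5"))    # [0.1, 0.5]
    print(parse_subset_sizes("1000,5000"))  # [1000, 5000]
```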


147-162: LGTM! Appropriate exception handling for CLI entry point.

The broad exception handling is acceptable for a CLI entry point, as it ensures user-friendly error messages are displayed regardless of the error type. The function properly returns exit codes for success (0) and failure (1).

```diff
@@ -0,0 +1,162 @@
 #!/usr/bin/env python3
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify file permissions for executable script.

The shebang is present, but the file might not have executable permissions. If this file is intended to be run directly, set the executable bit.

Run the following script to check and optionally fix file permissions:

Based on static analysis hints.


🏁 Script executed:

```bash
#!/bin/bash
# Check if the file is executable
if [ -x scripts/subset_selection/cli.py ]; then
  echo "File is executable"
else
  echo "File is NOT executable. To make it executable, run:"
  echo "chmod +x scripts/subset_selection/cli.py"
fi
```

Length of output: 92


Make scripts/subset_selection/cli.py executable.

The file lacks the executable bit; run

```bash
chmod +x scripts/subset_selection/cli.py
```

to allow direct execution via its shebang.

🧰 Tools
🪛 Ruff (0.14.0)

1-1: Shebang is present but file is not executable

(EXE001)

🤖 Prompt for AI Agents
In scripts/subset_selection/cli.py around line 1, the script has a shebang but
is missing the executable permission; make the file executable by running chmod
+x scripts/subset_selection/cli.py (or set the executable bit in your VCS/CI),
and verify the shebang (#!/usr/bin/env python3) is correct so the script can be
run directly.

Comment on lines +244 to +264

````markdown
## Quick Start Example

Using your data file:

```bash
# Navigate to project root
cd /Users/roburishabh/Github/odh-data-processing

# Run subset selection - Select 10% and 50% subsets
python -m scripts.subset_selection.cli \
--input scripts/subset_selection/data/combined_cut_50x.jsonl \
--subset-sizes "0.1,0.5" \
--output-dir scripts/subset_selection/data/output \
--epsilon 0.1 \
--num-folds 10

# Check results
ls scripts/subset_selection/data/output/
```
````


⚠️ Potential issue | 🟡 Minor

Use relative path instead of absolute developer-specific path.

Line 251 contains an absolute path specific to a developer's machine, which could confuse users.

Apply this diff:

````diff
 ```bash
 # Navigate to project root
-cd /Users/roburishabh/Github/odh-data-processing
+cd odh-data-processing

 # Run subset selection - Select 10% and 50% subsets
 python -m scripts.subset_selection.cli \
````
🤖 Prompt for AI Agents
In scripts/subset_selection/README.md around lines 244 to 264, the Quick Start
uses an absolute, developer-specific path; change the example to use a relative
path (e.g., "cd odh-data-processing" or simply "cd <project-root>") so the
instructions work for other users and do not expose a local filesystem location;
update the README example accordingly and ensure the rest of the commands remain
unchanged.

Comment on lines +38 to +52
```text
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.9.86
nvidia-nvtx-cu12==12.1.105
packaging==25.0
pandas==2.3.3
pillow==11.3.0
```

⚠️ Potential issue | 🟠 Major

GPU-only wheels make local CPU setup impossible

Pinning the +cu121 torch stack plus the nvidia-* CUDA wheels forces Linux + CUDA 12.1; pip will fail outright on CPU-only laptops, contradicting the stated “tested locally without GPU”. Provide a CPU-friendly requirements set (e.g., torch==... w/o +cu121, gate CUDA deps behind an extra, or split into cpu/gpu files) so contributors can install and run the scripts on non-GPU environments.

🤖 Prompt for AI Agents
In scripts/subset_selection/requirements.txt around lines 38 to 52, the current
file pins CUDA-specific nvidia wheel packages and a +cu121 torch build which
prevents installation on CPU-only machines; update the requirements so
contributors without GPUs can install: remove or make optional the nvidia-*
CUDA-specific packages and avoid a +cu121 torch pin (use a generic torch version
or leave torch unpinned), and provide one of the following fixes — either split
into two files (requirements.txt for CPU-friendly deps and requirements-gpu.txt
for CUDA deps), or gate CUDA packages behind an extras marker (e.g., [gpu]) and
document installing with pip install -r requirements.txt (and optionally pip
install -r requirements-gpu.txt) or pip install .[gpu]; ensure the default
requirements.txt installs and runs on CPU-only systems.
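One way to realize the split, sketched under the assumption that a separate GPU requirements file is introduced (the file names and the use of the CPU wheel index are illustrative, not part of this PR):

```bash
# CPU-only contributors: install torch from the official CPU wheel index,
# then the remaining (non-nvidia) dependencies.
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt

# GPU machines: layer the CUDA-specific pins on top.
pip install -r requirements-gpu.txt   # torch==2.5.1+cu121 and the nvidia-* wheels
```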

Comment on lines +70 to +76
```text
torch==2.5.1+cu121
torchaudio==2.5.1+cu121
torchvision==0.20.1+cu121
tqdm==4.67.1
transformers==4.57.0
triton==3.1.0
typing_extensions==4.15.0
```

⚠️ Potential issue | 🔴 Critical

Blocker: torch 2.5.1+cu121 is under multiple critical CVEs

OSV flags this exact build for GHSA-3749-ghw9-m3mg, GHSA-53q9-r3pm-6pq6, GHSA-887c-mr87-cxwp (see static analysis hints). Until the project bumps to a patched PyTorch release (and aligns torchaudio/torchvision accordingly), we ship with RCE/DoS exposure. Please update to a non-vulnerable version or document a mitigation before merging.

🤖 Prompt for AI Agents
In scripts/subset_selection/requirements.txt around lines 70 to 76, the pinned
torch==2.5.1+cu121 (and matching torchaudio/torchvision) is flagged by multiple
CVEs; update torch to a patched, non-vulnerable release (choose the minimum safe
version per OSV/GHSA advisories), and align torchaudio and torchvision to the
matching compatible builds (CUDA tags if needed), update the lock/build/test
config accordingly, run full CI and smoke tests to verify compatibility, and if
you cannot upgrade now add a clear mitigation note in the requirements or repo
security doc explaining the temporary risk and any runtime/workaround steps
until upgraded.
