# [RHAIENG]-1146 Initial Repository Setup and Baseline Testing (only scripts) #30
**Open** · RobuRishabh wants to merge 1 commit into `opendatahub-io:main` from `RobuRishabh:R1146-Only-scripts-for-subset-selection`.
**`.gitignore`** (new file)

```gitignore
data/
# Python
__pycache__/
*.py[cod]
*.pyo
*.pyd
*.egg-info/
*.egg
dist/
build/
pip-wheel-metadata/

# Environments
.env
*.env
.venv/
venv/
env/

# Tooling caches
.mypy_cache/
.pytype/
.ruff_cache/
.pytest_cache/
.tox/
.nox/
.coverage
.coverage.*
coverage.xml

# Editors/IDEs
.vscode/
.idea/
.history/
.DS_Store
Thumbs.db

# Jupyter
.ipynb_checkpoints/

# Cursor
.cursor/

# Local runs / artifacts
local_outputs/

# Logs
*.log
logs/
venv/
```
**`scripts/subset_selection/README.md`** (new file)
# Subset Selection Scripts

The subset selection scripts identify representative samples from large datasets by selecting diverse subsets with facility location maximization over embedding-based similarity (a toy sketch of this objective follows the list below). This is particularly useful for:
- Reducing dataset size while maintaining diversity
- Selecting training data that covers the full distribution
- Creating validation/test sets that represent the full dataset
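
To build intuition for that objective, here is a minimal, self-contained sketch of greedy facility location selection over a cosine-similarity matrix. It is a toy illustration only: the package's actual optimizer (`LazierThanLazyGreedy`, tuned via `epsilon`; see Configuration below) is stochastic and built to scale, and this sketch is not its code.

```python
import numpy as np

def greedy_facility_location(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedily pick k points maximizing sum_i max_{j in S} sim(i, j).

    Illustrative only: O(n^2) memory and no lazy/stochastic evaluation,
    unlike the LazierThanLazyGreedy optimizer used by the package.
    """
    # Cosine similarity between all pairs of (row-normalized) embeddings
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T

    selected: list[int] = []
    best_cover = np.zeros(len(sim))  # max similarity to the chosen set so far
    for _ in range(k):
        # Marginal gain of adding each candidate point to the selected set
        gains = np.maximum(sim, best_cover).sum(axis=1) - best_cover.sum()
        gains[selected] = -np.inf  # never re-pick an already selected point
        choice = int(np.argmax(gains))
        selected.append(choice)
        best_cover = np.maximum(best_cover, sim[choice])
    return selected

# Example: pick 5 representatives from 100 random 16-d embeddings
rng = np.random.default_rng(42)
print(greedy_facility_location(rng.normal(size=(100, 16)), k=5))
```

Facility location rewards picking points that are collectively close to everything else, which is why the selected subset stays diverse while remaining representative.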

## Requirements

- **Python 3.12** (required for compatibility with the rest of the codebase)
- CUDA 12.1+ for GPU support (recommended)

## Installation

Install all dependencies, including PyTorch with CUDA support:

```bash
pip install -r scripts/subset_selection/requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121
```

**Note:** The CLI automatically configures multiprocessing to use the 'spawn' method for CUDA compatibility, enabling efficient multi-GPU parallel processing.

## Model Setup

The default encoder (`Snowflake/snowflake-arctic-embed-l-v2.0`) is downloaded automatically from HuggingFace on first run if it is not found locally, and is cached for subsequent runs.

### Automatic Download (Default Behavior)

On first run, the model is downloaded automatically from HuggingFace:

```bash
# Run from the scripts/subset_selection/ directory
cd odh-data-processing/scripts/subset_selection
source .venv/bin/activate  # or venv/bin/activate
python -m subset_selection \
    --input dataset.jsonl \
    --subset-sizes "10%" \
    --output-dir output/
```

The model is cached at `~/.cache/huggingface/` for future use.
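
If you want to fetch the model ahead of time (for example, on a node whose runtime network access is restricted), the same cache can be pre-populated with the standard `huggingface_hub` API. This optional step is an assumption about your workflow, not something the CLI requires:

```python
# Optional: pre-download the default encoder into the standard
# HuggingFace cache (~/.cache/huggingface/) before running the CLI.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Snowflake/snowflake-arctic-embed-l-v2.0")
```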

### Using a Local Model (Optional)

If you have the model cached locally or want to use a custom model path:

```bash
# Run from the scripts/subset_selection/ directory
cd /path/to/odh-data-processing/scripts/subset_selection
source .venv/bin/activate  # or venv/bin/activate
python -m subset_selection \
    --input dataset.jsonl \
    --subset-sizes "10%" \
    --encoder-model-path /path/to/your/local/model \
    --output-dir output/
```

### Command Line Interface (Recommended)

The easiest way to use subset selection is through the CLI (run from the `scripts/subset_selection/` directory):

```bash
# Basic usage - select 10% and 50% subsets
python -m subset_selection \
    --input path/to/dataset.jsonl \
    --subset-sizes "10%,50%" \
    --output-dir output/

# Absolute counts - select exactly 1000 and 5000 samples
python -m subset_selection \
    --input path/to/dataset.jsonl \
    --subset-sizes "1000,5000" \
    --output-dir output/

# Small dataset (< 100k samples) - adjust epsilon and num_folds
python -m subset_selection \
    --input path/to/small_dataset.jsonl \
    --subset-sizes "50%" \
    --epsilon 0.1 \
    --num-folds 10 \
    --output-dir output/

# Multiple files combined
python -m subset_selection \
    --input file1.jsonl file2.jsonl file3.jsonl \
    --subset-sizes "25%,50%" \
    --combine-files \
    --output-dir output/

# Using a custom local model path
python -m subset_selection \
    --input dataset.jsonl \
    --subset-sizes "10%" \
    --encoder-model-path /path/to/local/model \
    --output-dir output/
```

#### CLI Options

```
Required:
  --input <file> [<file> ...]   Input file(s) to process (JSONL, JSON, CSV, Parquet)
  --subset-sizes <sizes>        Comma-separated sizes (e.g., "10%,50%" or "1000,5000")

Optional:
  --output-dir <dir>            Output directory (default: output)
  --batch-size <int>            Batch size for processing (default: 100000)
  --num-folds <int>             Number of folds/partitions (default: 50)
  --epsilon <float>             Optimization parameter (default: 160.0)
  --num-gpus <int>              Number of GPUs to use (default: auto-detect)
  --combine-files               Combine multiple input files before processing
  --encoder-type <str>          Encoder type (default: arctic)
  --encoder-model <str>         Model name (default: Snowflake/snowflake-arctic-embed-l-v2.0)
  --encoder-model-path <path>   Local path to encoder model (optional, auto-downloads if not provided)
  --template-name <str>         Template name (default: conversation)
  --seed <int>                  Random seed (default: 42)
```

#### Subset Size Formats

The `--subset-sizes` parameter accepts three formats:

1. **Percentage notation (recommended)**: use `"%"` for clarity
   - `"10%"` = 10% of the dataset
   - `"50%"` = 50% of the dataset
   - Example: `--subset-sizes "10%,50%,90%"`

2. **Absolute counts**: specify an exact number of samples
   - `"1000"` = exactly 1000 samples
   - `"5000"` = exactly 5000 samples
   - Example: `--subset-sizes "1000,5000"`

3. **Decimal notation (backward compatibility)**: float values between 0 and 1
   - `"0.1"` = 10% of the dataset
   - `"0.5"` = 50% of the dataset
   - Example: `--subset-sizes "0.1,0.5"`
   - **Note**: This format is supported for backward compatibility; percentage notation is recommended for clarity.

**Mixing formats**: You cannot mix formats in the same command. Use all percentages, all counts, or all decimals (see the parsing sketch below).
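
As a rough illustration of these semantics (this is not the package's actual parser, and the helper name is hypothetical), resolving one size token against a dataset of `n` samples could work like this:

```python
def resolve_subset_size(token: str, n: int) -> int:
    """Hypothetical resolver for one --subset-sizes token; illustrative only."""
    token = token.strip()
    if token.endswith("%"):    # percentage notation, e.g. "10%"
        return round(n * float(token[:-1]) / 100)
    value = float(token)
    if 0 < value < 1:          # decimal notation, e.g. "0.1"
        return round(n * value)
    return int(value)          # absolute count, e.g. "1000"

# On a 20,000-sample dataset:
assert resolve_subset_size("10%", 20_000) == 2_000
assert resolve_subset_size("0.5", 20_000) == 10_000
assert resolve_subset_size("1000", 20_000) == 1_000
```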

### Python API

You can also use subset selection directly from Python:

```python
from subset_selection import subset_datasets

# Select subsets from your dataset (sizes given as decimal fractions)
subset_datasets(
    input_files=["path/to/your/dataset.jsonl"],
    subset_sizes=[0.1, 0.5],  # 10% and 50% of the dataset
)

# Or using absolute counts
subset_datasets(
    input_files=["path/to/your/dataset.jsonl"],
    subset_sizes=[1000, 5000],  # exactly 1000 and 5000 samples
)
```

### Advanced Python Configuration

```python
from subset_selection import (  # config classes are documented below
    subset_datasets,
    BasicConfig,
    EncoderConfig,
    TemplateConfig,
    SystemConfig,
)

# Configure subset selection
subset_datasets(
    input_files=["dataset1.jsonl", "dataset2.jsonl"],
    subset_sizes=[1000, 5000],  # select 1000 and 5000 samples
    output_dir="output",
    batch_size=100000,
    num_folds=50,
    combine_files=False,
    epsilon=160.0,
    encoder_type="arctic",
    encoder_model="Snowflake/snowflake-arctic-embed-l-v2.0",
    encoder_model_path=None,  # optional: path to a local model
    template_name="conversation",
)
```

## Configuration

### BasicConfig Parameters

- **`output_dir`**: Directory for output files (default: `"output"`)
- **`batch_size`**: Batch size for processing (default: `100000`)
- **`num_folds`**: Number of folds/partitions for subset selection (default: `50`)
  - The dataset is divided into folds for parallel processing across GPUs
  - **Recommendations based on dataset size:**
    - < 1,000 samples: use `5-10` folds
    - 1,000-10,000 samples: use `10-20` folds
    - 10,000-100,000 samples: use `20-50` folds
    - \> 100,000 samples: use `50-100` folds (default: 50)
  - More folds = better parallelization and lower memory use per fold, but more scheduling overhead
  - Use fewer folds for small datasets to ensure each fold has enough samples
- **`combine_files`**: Whether to combine multiple input files (default: `False`)
- **`epsilon`**: Epsilon parameter for the LazierThanLazyGreedy optimizer (default: `160.0`)
  - Controls the trade-off between optimization quality and speed
  - **Recommendations based on dataset size** (see the example after this list):
    - < 1,000 samples: use `0.01-0.1`
    - 1,000-10,000 samples: use `0.1-1.0`
    - 10,000-100,000 samples: use `1.0-10.0`
    - \> 100,000 samples: use `160.0` (default)
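
For example, a plausible call for a ~5,000-sample dataset following the fold and epsilon recommendations above (the specific values here are illustrative judgment calls, not values prescribed by the package):

```python
from subset_selection import subset_datasets

# ~5,000 samples: moderate fold count and small epsilon, per the tables above
subset_datasets(
    input_files=["small_dataset.jsonl"],
    subset_sizes=[0.5],  # 50% of the dataset
    num_folds=15,        # 10-20 folds recommended for 1,000-10,000 samples
    epsilon=0.5,         # 0.1-1.0 recommended for 1,000-10,000 samples
)
```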

### EncoderConfig Parameters

- `encoder_type`: Type of encoder to use (default: `"arctic"`)
- `encoder_model`: Model name for the encoder (default: `"Snowflake/snowflake-arctic-embed-l-v2.0"`)
- `encoder_model_path`: Local path to the encoder model (optional; auto-downloads from HuggingFace if not provided)
- `instruction`: Custom instruction for embedding generation

### TemplateConfig Parameters

- `template_name`: Name of the template to use (default: `"conversation"`)
- `templates`: Custom templates for text formatting

### SystemConfig Parameters

- `num_gpus`: Number of GPUs to use (auto-detected by default)
- `seed`: Random seed for reproducibility (default: `42`)
- `max_retries`: Maximum number of retries on failure (default: `3`)
- `retry_delay`: Delay between retries in seconds (default: `30`)

## Package Structure

```
scripts/
└── subset_selection/
    ├── __main__.py                   # Entry point for module execution
    ├── subset_selection.py           # Main subset selection logic, CLI, and encoder registry
    ├── requirements.txt              # Package dependencies
    ├── README.md                     # This file
    ├── encoders/
    │   └── arctic_encoder.py         # Arctic embedding encoder
    └── utils/
        └── subset_selection_utils.py # Utility functions
```

## Output Files

The script generates several kinds of output files (a hypothetical layout follows below):

1. **Embeddings**: stored in HDF5 format in `{output_dir}/{dataset_name}/embeddings/`
2. **Metadata**: NPZ files containing the indices and gains for each subset
3. **Subset files**: dataset subsets in the original file format (JSON, CSV, Parquet)
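
As a purely hypothetical sketch for an input named `dataset.jsonl` with `--subset-sizes "10%"` (only the `embeddings/` path is stated above; the other file names are placeholders, not taken from this PR):

```
output/
└── dataset/
    ├── embeddings/            # HDF5 embedding files
    ├── <subset-metadata>.npz  # indices and gains for the 10% subset
    └── <10%-subset>.jsonl     # the selected samples, in the input format
```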

## Troubleshooting

### Memory Issues

If you run out of GPU memory (see the example invocation below):
- Increase `--num-folds` so that each fold is smaller (as noted under Configuration, more folds = less memory per fold)
- Reduce `--num-gpus` to use fewer GPUs
- For small datasets (<10k samples), still prefer fewer folds (5-10) so each fold has enough samples
- The default batch size is optimized for A100 GPUs; reduce `--batch-size` if needed
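
For instance, a hypothetical lower-memory invocation might combine these flags (the values are illustrative, not tuned recommendations):

```bash
python -m subset_selection \
    --input path/to/dataset.jsonl \
    --subset-sizes "10%" \
    --num-folds 100 \
    --batch-size 50000 \
    --num-gpus 1 \
    --output-dir output/
```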

### GPU Not Detected

Verify that CUDA is properly installed and accessible:

```bash
# Check GPU availability
nvidia-smi

# Check PyTorch CUDA
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU count: {torch.cuda.device_count()}')"
```

## Notes

- **Dataset size**: Subset selection is optimized for datasets with more than 100k samples
  - For smaller datasets, adjust `--epsilon` and `--num-folds` accordingly
- **GPU requirement**: GPU acceleration is strongly recommended for production use
  - The code automatically uses all available GPUs with parallel processing
  - A CPU fallback is available but significantly slower (warnings will be displayed)
- **Multiple GPUs**: All available GPUs are detected and utilized automatically
  - The 'spawn' multiprocessing method is used for CUDA compatibility
  - Override with the `--num-gpus` flag if needed
- **Memory**: Each fold is processed independently, so more folds = less memory per fold
- **Model caching**: Models are downloaded automatically on first run and cached locally
  - Default cache location: `~/.cache/huggingface/`
  - Use `--encoder-model-path` to specify a custom model location
- **Performance**:
  - Larger epsilon values = faster but potentially lower quality
  - More folds = better GPU utilization but more overhead
  - Multi-GPU processing scales roughly linearly with the number of GPUs

## Credits and Acknowledgements

This subset selection implementation is derived from the **DataCurate4LLMs** project.

### Original Author

**Krishnateja Killamsetty**
📫 [email protected]

### Original Repository

The original codebase is available at [https://github.com/krishnatejakk/DataCurate4LLMs](https://github.com/krishnatejakk/DataCurate4LLMs).
**`scripts/subset_selection/__main__.py`** (new file)
```python
#!/usr/bin/env python3
"""
Entry point for subset selection when run as a module.
"""

import sys

from subset_selection import main

if __name__ == "__main__":
    sys.exit(main())
```