Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Temp file to aid development, will be removed.
Pull request overview
This PR adds a comprehensive code annotation library to NeMo Curator, enabling language detection, quality metrics computation, and filtering for code datasets. The implementation uses Rust with PyO3 bindings for performance-critical operations.
Changes:
- Added Rust-based code annotation library with PyO3 bindings for language detection, basic statistics, software metrics, and tokenization
- Created document modifiers and filters for code processing pipelines
- Added tutorial scripts demonstrating distributed Ray pipelines for code curation
Reviewed changes
Copilot reviewed 40 out of 43 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pyproject.toml | Added code curation dependencies and sources |
| nemo_curator/code_annotation/ | New Rust library with Python wrappers for code annotation |
| nemo_curator/stages/code/ | Document modifiers and processing stages for code |
| nemo_curator/stages/text/filters/code.py | Added 8 code quality filters |
| tests/code_annotation/ | Unit tests for annotation functions |
| tests/stages/code/ | Tests for code processing stages |
| tutorials/code/ | Tutorial scripts for code curation workflows |
| examples/code_annotation/ | Example scripts for annotation and filtering |
| docker/Dockerfile | Added Rust toolchain and code annotation dependencies |
Comments suppressed due to low confidence (3)
nemo_curator/code_annotation/rust/src/annotations.rs:1
- Corrected spelling of 'Javascript' to 'JavaScript'.
// Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
nemo_curator/code_annotation/rust/src/annotations.rs:1
- Corrected spelling of 'Typescript' to 'TypeScript'.
// Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tutorials/code/README.md:1
- Corrected spelling of 'Annotaion' to 'Annotation' in the PR title.
    >>> result = modifier.modify_document(df)
    """
    def __init__(  # noqa: PLR0913
The CodeAnnotator.__init__ method has 7 parameters, which exceeds typical maintainability guidelines. Consider grouping related parameters into a configuration dataclass or dictionary to improve maintainability.
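One way to act on this suggestion is sketched below. It is only an illustration: `AnnotatorConfig`, `Annotator`, and every field name are hypothetical stand-ins, since the actual parameters of `CodeAnnotator.__init__` are not shown in this thread.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatorConfig:
    """Hypothetical option group; the field names are illustrative only."""

    detect_language: bool = True
    basic_stats: bool = True
    software_metrics: bool = True
    opencoder_metrics: bool = True
    tokenize: bool = False
    tokenizer_name: str = "github_o200k_base"
    text_field: str = "content"


class Annotator:
    """Stand-in modifier that takes one config object instead of seven arguments."""

    def __init__(self, config: AnnotatorConfig | None = None) -> None:
        # Fall back to the defaults when no config is supplied.
        self.config = config or AnnotatorConfig()

    def enabled_steps(self) -> list[str]:
        """Return the names of the annotation steps switched on in the config."""
        flags = {
            "detect_language": self.config.detect_language,
            "basic_stats": self.config.basic_stats,
            "software_metrics": self.config.software_metrics,
            "opencoder_metrics": self.config.opencoder_metrics,
            "tokenize": self.config.tokenize,
        }
        return [name for name, enabled in flags.items() if enabled]
```

With this shape, adding an option means adding a field with a default, so existing call sites keep working and the `PLR0913` suppression is no longer needed on the constructor.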
    except Exception:  # noqa: BLE001, S110
        # License detection can fail on malformed content
        pass
The broad exception handler silently catches all exceptions without logging. Consider logging the error or at least the exception type to aid in debugging when license detection fails.
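A minimal sketch of what logging the failure could look like follows. The `detect_license` function here is a hypothetical placeholder written only to make the example runnable, and the logger name is likewise an assumption; neither reflects the PR's actual code.

```python
from __future__ import annotations

import logging

logger = logging.getLogger("code_annotation.license")  # hypothetical logger name


def detect_license(content: str) -> str:
    """Hypothetical detector, present only so the sketch can run."""
    if "\x00" in content:
        raise ValueError("binary content")
    return "MIT" if "MIT License" in content else "unknown"


def safe_detect_license(content: str) -> str | None:
    """Return the detected license, or None if detection fails, logging the cause."""
    try:
        return detect_license(content)
    except Exception as exc:  # noqa: BLE001
        # Record the exception type and message instead of discarding them.
        logger.warning("License detection failed: %s: %s", type(exc).__name__, exc)
        return None
```

The broad `except` is kept so malformed input still cannot abort a pipeline run, but each failure now leaves a trace that can be counted or inspected later.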
    const LANGUAGE_JAVASCRIPT: &str = "Javascript";
    const LANGUAGE_TYPESCRIPT: &str = "Typescript";
Inconsistent capitalization: 'Javascript' and 'Typescript' should be 'JavaScript' and 'TypeScript' to match standard naming conventions.
Suggested change:
    - const LANGUAGE_JAVASCRIPT: &str = "Javascript";
    - const LANGUAGE_TYPESCRIPT: &str = "Typescript";
    + const LANGUAGE_JAVASCRIPT: &str = "JavaScript";
    + const LANGUAGE_TYPESCRIPT: &str = "TypeScript";
PR: Code Annotation (Add code annotation library with Rust-based quality signals)
Description
This PR introduces a new code annotation library for NeMo Curator that provides fast, Rust-based annotation functions for code data curation. The library enables language detection, basic statistics, software metrics, comment fraction analysis, and tokenization.
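To make "basic statistics" concrete, here is a pure-Python sketch of the kind of per-file signals such an annotator might produce. It is not the Rust implementation from this PR: the function name `basic_stats` and the dictionary keys are hypothetical, chosen only for illustration.

```python
from __future__ import annotations


def basic_stats(source: str) -> dict[str, float]:
    """Compute simple size and line-shape statistics for one source file."""
    lines = source.splitlines()
    line_lengths = [len(line) for line in lines]
    # Count alphabetic characters across the whole file, newlines included in the total.
    alpha_chars = sum(ch.isalpha() for ch in source)
    return {
        "num_bytes": len(source.encode("utf-8")),
        "num_lines": len(lines),
        "max_line_length": max(line_lengths, default=0),
        "avg_line_length": sum(line_lengths) / len(lines) if lines else 0.0,
        "alpha_fraction": alpha_chars / len(source) if source else 0.0,
    }
```

A filter can then keep or drop a document by comparing one of these values against a threshold, for example rejecting files whose longest line exceeds some limit.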
Features
- Rust Library (nemo_curator/code_annotation/): High-performance annotation functions using PyO3 bindings
  - detect_language: Programming language detection via hyperpolyglot
  - basic: Basic statistics (bytes, lines, patterns, XML detection)
  - software_metrics: Code complexity metrics via rust-code-analysis
  - opencoder_rs: Comment line/character fractions
  - tokenize: BPE tokenization (github_o200k_base, tiktoken_o200k_base)
- Document Modifiers (nemo_curator/stages/code/):
  - CodeLanguageDetector
  - CodeBasicStats
  - CodeSoftwareMetrics
  - CodeOpenCoderMetrics
  - CodeTokenizer
  - CodeAnnotator (all-in-one convenience modifier)
- Document Filters (nemo_curator/stages/text/filters/code.py):
  - CommentFractionFilter
  - MaxLineLengthFilter
  - AverageLineLengthFilter
  - AlphaPercentFilter
  - HexContentFilter
  - Base64ContentFilter
  - TokenCountFilter
  - CyclomaticComplexityFilter
Usage Example
Testing
- tests/code_annotation/test_annotate.py
Files Changed
New Files:
- nemo_curator/code_annotation/ - Rust library with PyO3 bindings
- nemo_curator/stages/code/__init__.py - Code stages module
- nemo_curator/stages/code/modifiers.py - Document modifiers
- tests/code_annotation/test_annotate.py - Unit tests
- examples/code_annotation/annotate_code.py - Annotation example
- examples/code_annotation/filter_code.py - Filtering example
- docs/code_annotation_plan.md - Documentation
Modified Files:
- nemo_curator/stages/text/filters/code.py - Added 8 new filters
Build Instructions
Dependencies
- pandas, pyarrow, maturin
- pyo3, hyperpolyglot, software-metrics, tiktoken-rs, bpe-openai
Checklist