[DRAFT] Code Annotaion WIP #1356

Draft
VibhuJawa wants to merge 14 commits into NVIDIA-NeMo:main from VibhuJawa:vjawa/add_code_annotation

VibhuJawa commented Jan 7, 2026

PR: Code Annotation (Add code annotation library with Rust-based quality signals)

Description

This PR introduces a new code annotation library for NeMo Curator that provides fast, Rust-based annotation functions for code data curation. The library enables language detection, basic statistics, software metrics, comment fraction analysis, and tokenization.

Features

  • Rust Library (nemo_curator/code_annotation/): High-performance annotation functions using PyO3 bindings

    • detect_language: Programming language detection via hyperpolyglot
    • basic: Basic statistics (bytes, lines, patterns, XML detection)
    • software_metrics: Code complexity metrics via rust-code-analysis
    • opencoder_rs: Comment line/character fractions
    • tokenize: BPE tokenization (github_o200k_base, tiktoken_o200k_base)
  • Document Modifiers (nemo_curator/stages/code/):

    • CodeLanguageDetector
    • CodeBasicStats
    • CodeSoftwareMetrics
    • CodeOpenCoderMetrics
    • CodeTokenizer
    • CodeAnnotator (all-in-one convenience modifier)
  • Document Filters (nemo_curator/stages/text/filters/code.py):

    • CommentFractionFilter
    • MaxLineLengthFilter
    • AverageLineLengthFilter
    • AlphaPercentFilter
    • HexContentFilter
    • Base64ContentFilter
    • TokenCountFilter
    • CyclomaticComplexityFilter
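
As a rough illustration of the line-based filters listed above, here is a minimal pure-Python sketch of the kind of signals that MaxLineLengthFilter, AverageLineLengthFilter, and AlphaPercentFilter presumably threshold on. The function names, the exact signal definitions, and the threshold values are all assumptions made for illustration; the PR computes its signals in Rust and may define them differently.

```python
def line_stats(text: str) -> dict[str, float]:
    """Compute simple per-document statistics (illustrative definitions only)."""
    lines = text.splitlines() or [""]
    lengths = [len(line) for line in lines]
    alpha_chars = sum(ch.isalpha() for ch in text)
    return {
        "max_line_length": max(lengths),
        "avg_line_length": sum(lengths) / len(lengths),
        # Fraction of characters that are alphabetic; 0.0 for empty input.
        "alpha_percent": alpha_chars / len(text) if text else 0.0,
    }


def keep_document(
    text: str,
    max_line: int = 1000,      # placeholder threshold, not the PR's default
    max_avg: float = 100.0,    # placeholder threshold, not the PR's default
    min_alpha: float = 0.25,   # placeholder threshold, not the PR's default
) -> bool:
    """Hypothetical keep/drop rule combining the three signals."""
    stats = line_stats(text)
    return (
        stats["max_line_length"] <= max_line
        and stats["avg_line_length"] <= max_avg
        and stats["alpha_percent"] >= min_alpha
    )
```

With these placeholder thresholds, a short source snippet such as `def hello(): pass` is kept, while a single 2000-character line of digits is dropped because it exceeds the maximum line length.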

Usage Example

import pandas as pd
from nemo_curator.stages.code import CodeAnnotator

df = pd.DataFrame({
    'content': ['def hello(): pass', 'fn main() {}'],
    'representative_filename': ['test.py', 'main.rs'],
})

annotator = CodeAnnotator(
    detect_language=True,
    basic_stats=True,
    software_metrics=True,
    opencoder_metrics=True,
    tokenize=True,
)
result = annotator.modify_document(df)

Testing

  • 21 unit tests added in tests/code_annotation/test_annotate.py
  • All tests passing
python -m pytest tests/code_annotation/test_annotate.py -v

Files Changed

New Files:

  • nemo_curator/code_annotation/ - Rust library with PyO3 bindings
  • nemo_curator/stages/code/__init__.py - Code stages module
  • nemo_curator/stages/code/modifiers.py - Document modifiers
  • tests/code_annotation/test_annotate.py - Unit tests
  • examples/code_annotation/annotate_code.py - Annotation example
  • examples/code_annotation/filter_code.py - Filtering example
  • docs/code_annotation_plan.md - Documentation

Modified Files:

  • nemo_curator/stages/text/filters/code.py - Added 8 new filters

Build Instructions

cd nemo_curator/code_annotation
maturin develop  # Development build
# OR
maturin build --release  # Release wheel

Dependencies

  • Python: pandas, pyarrow, maturin
  • Rust: pyo3, hyperpolyglot, software-metrics, tiktoken-rs, bpe-openai

Checklist

  • Code compiles without errors
  • Tests added and passing (21 tests)
  • Documentation added
  • Example scripts provided
  • Follows NeMo Curator coding standards

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

copy-pr-bot bot commented Jan 7, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

VibhuJawa and others added 8 commits January 7, 2026 16:01
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Contributor Author

Temp file to aid development, will be removed.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
VibhuJawa requested a review from Copilot January 31, 2026 00:28

Copilot AI left a comment

Pull request overview

This PR adds a comprehensive code annotation library to NeMo Curator, enabling language detection, quality metrics computation, and filtering for code datasets. The implementation uses Rust with PyO3 bindings for performance-critical operations.

Changes:

  • Added Rust-based code annotation library with PyO3 bindings for language detection, basic statistics, software metrics, and tokenization
  • Created document modifiers and filters for code processing pipelines
  • Added tutorial scripts demonstrating distributed Ray pipelines for code curation

Reviewed changes

Copilot reviewed 40 out of 43 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pyproject.toml | Added code curation dependencies and sources |
| nemo_curator/code_annotation/ | New Rust library with Python wrappers for code annotation |
| nemo_curator/stages/code/ | Document modifiers and processing stages for code |
| nemo_curator/stages/text/filters/code.py | Added 8 code quality filters |
| tests/code_annotation/ | Unit tests for annotation functions |
| tests/stages/code/ | Tests for code processing stages |
| tutorials/code/ | Tutorial scripts for code curation workflows |
| examples/code_annotation/ | Example scripts for annotation and filtering |
| docker/Dockerfile | Added Rust toolchain and code annotation dependencies |

Comments suppressed due to low confidence (3)

nemo_curator/code_annotation/rust/src/annotations.rs:1

  • Corrected spelling of 'Javascript' to 'JavaScript'.
// Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

nemo_curator/code_annotation/rust/src/annotations.rs:1

  • Corrected spelling of 'Typescript' to 'TypeScript'.
// Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

tutorials/code/README.md:1

  • Corrected spelling of 'Annotaion' to 'Annotation' in the PR title.

>>> result = modifier.modify_document(df)
"""

def __init__( # noqa: PLR0913

Copilot AI Jan 31, 2026

The CodeAnnotator.__init__ method has 7 parameters, which exceeds typical maintainability guidelines. Consider grouping related parameters into a configuration dataclass or dictionary to improve maintainability.
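
One way to read this suggestion is sketched below: the boolean feature flags are moved into a small frozen dataclass that is passed as a single argument. Only the five flag names come from the usage example in the PR description; `AnnotatorConfig`, `AnnotatorSketch`, the default values, and the `enabled` helper are hypothetical and are not part of the PR.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AnnotatorConfig:
    """Hypothetical container for the annotator's feature flags."""

    # Defaults are placeholders; the PR's actual defaults are not shown.
    detect_language: bool = True
    basic_stats: bool = True
    software_metrics: bool = False
    opencoder_metrics: bool = False
    tokenize: bool = False

    def enabled(self) -> list[str]:
        """Return the names of the flags that are switched on, in field order."""
        return [f.name for f in fields(self) if getattr(self, f.name)]


class AnnotatorSketch:
    """Stand-in for a modifier whose constructor takes one config object."""

    def __init__(self, config: AnnotatorConfig | None = None) -> None:
        self.config = config or AnnotatorConfig()
```

A caller would then write `AnnotatorSketch(AnnotatorConfig(tokenize=True))` instead of passing each flag separately, and new flags could be added without growing the constructor signature.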

Copilot uses AI. Check for mistakes.
Comment on lines +399 to +401
except Exception:  # noqa: BLE001, S110
    # License detection can fail on malformed content
    pass

Copilot AI Jan 31, 2026

The broad exception handler silently catches all exceptions without logging. Consider logging the error or at least the exception type to aid in debugging when license detection fails.
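
A minimal sketch of what the logged variant could look like is given below. The surrounding function is not shown in the review, so `detect_license_safe` and its `detector` argument are hypothetical stand-ins; only the standard-library `logging` calls are real API.

```python
from __future__ import annotations

import logging
from typing import Callable

logger = logging.getLogger(__name__)


def detect_license_safe(content: str, detector: Callable[[str], str]) -> str | None:
    """Run a license detector, logging failures instead of hiding them.

    `detector` is a hypothetical callable that returns a license name
    and may raise on malformed content.
    """
    try:
        return detector(content)
    except Exception as exc:  # noqa: BLE001
        # Record the exception type and message so failures stay debuggable.
        logger.debug("License detection failed (%s): %s", type(exc).__name__, exc)
        return None
```

Logging at debug level keeps normal runs quiet on large datasets while still leaving a trace when detection fails; a higher level may be more appropriate if failures are expected to be rare.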

Comment on lines +21 to +22
const LANGUAGE_JAVASCRIPT: &str = "Javascript";
const LANGUAGE_TYPESCRIPT: &str = "Typescript";

Copilot AI Jan 31, 2026

Inconsistent capitalization: 'Javascript' and 'Typescript' should be 'JavaScript' and 'TypeScript' to match standard naming conventions.

Suggested change
- const LANGUAGE_JAVASCRIPT: &str = "Javascript";
- const LANGUAGE_TYPESCRIPT: &str = "Typescript";
+ const LANGUAGE_JAVASCRIPT: &str = "JavaScript";
+ const LANGUAGE_TYPESCRIPT: &str = "TypeScript";
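
If these constants are compared against labels produced elsewhere (for example by the language detector), the capitalization has to match whatever that source emits, so the rename may not be purely cosmetic. The sketch below shows one hedged alternative on the Python side: normalizing labels through a case-insensitive lookup. The mapping and the `canonical_language` helper are hypothetical and are not part of the PR.

```python
# Hypothetical lookup from lowercase label to canonical spelling.
CANONICAL_LANGUAGE_NAMES = {
    "javascript": "JavaScript",
    "typescript": "TypeScript",
}


def canonical_language(name: str) -> str:
    """Return the canonical spelling of a language label.

    Labels that are not in the lookup table are returned unchanged.
    """
    return CANONICAL_LANGUAGE_NAMES.get(name.strip().lower(), name)
```

With this helper, both `"Javascript"` and `"JavaScript"` map to the same label, so downstream comparisons do not depend on which spelling a given component emits.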
