Skip to content

Comments

refactor: standardize ID field names across deduplication workflows#1390

Merged
sarahyurick merged 4 commits intoNVIDIA-NeMo:mainfrom
KunalSachdev2005:fixes_1192_standardize_id_naming_for_dedup
Jan 28, 2026
Merged

refactor: standardize ID field names across deduplication workflows#1390
sarahyurick merged 4 commits intoNVIDIA-NeMo:mainfrom
KunalSachdev2005:fixes_1192_standardize_id_naming_for_dedup

Conversation

@KunalSachdev2005
Copy link
Contributor

@KunalSachdev2005 KunalSachdev2005 commented Jan 17, 2026

Fixes #1192

  • Renamed ids_to_remove_duplicate_id_field to duplicate_id_field in TextDuplicatesRemovalWorkflow
  • Renamed input_id_field to id_field in TextDuplicatesRemovalWorkflow
  • Renamed doc_id_field to document_id_field in BucketsToEdgesStage
  • Kept document_id_field and duplicate_id_field unchanged in other stages
  • Updated internal usages to match new standardized field names

Rationale: unifying naming conventions for input IDs, removal IDs, and document grouping IDs

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Jan 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@KunalSachdev2005
Copy link
Contributor Author

Hey @sarahyurick , the code is updated and ready for review.

I wanted to mention that the core renaming changes were performed in only a handful of files. However, many files were modified when I executed the following command according to the contribution guidelines:

pre-commit run --all-files

These changes weren't necessarily changes in the code logic. They seemed like changes in the coding style (defining function headers on a single line instead of multiple lines and stuff like that).

Wanted to point this out because the greptile-bot flagged it: "Skipped: This PR changes more files than the configured file change limit: (122 files found, 100 file limit)."

Please let me know if you’d like me to make any changes.

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Jan 20, 2026
@sarahyurick
Copy link
Contributor

Hi @KunalSachdev2005 can you please revert any modifications from the pre-commit and only commit changes relevant to #1192 ?

@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Jan 20, 2026
@KunalSachdev2005 KunalSachdev2005 force-pushed the fixes_1192_standardize_id_naming_for_dedup branch from e8688bb to 4accf4b Compare January 21, 2026 01:26
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 21, 2026

Greptile Overview

Greptile Summary

This PR standardizes ID field naming conventions across deduplication workflows to improve API consistency and usability. The refactor addresses issue #1192 by renaming parameters to be more intuitive and consistent.

Key Changes:

  • Renamed input_id_field to id_field in TextDuplicatesRemovalWorkflow
  • Renamed ids_to_remove_duplicate_id_field to duplicate_id_field in TextDuplicatesRemovalWorkflow
  • Renamed ids_to_remove_read_kwargs to duplicate_id_read_kwargs in TextDuplicatesRemovalWorkflow
  • Renamed doc_id_field to document_id_field in BucketsToEdgesStage
  • Updated all internal usages, tests, documentation, tutorials, and benchmarks

Impact:

  • Breaking API change requiring users to update their code
  • All existing tests pass with updated parameter names
  • Comprehensive documentation updates ensure users can migrate easily
  • Maintains backward compatibility in functionality, only parameter names changed

Confidence Score: 5/5

  • This PR is safe to merge - it's a well-executed refactoring with comprehensive test coverage and documentation updates
  • The changes are purely cosmetic (parameter renaming) with no logic modifications, all tests have been updated and pass, documentation is thoroughly updated including tutorials and examples, and the naming convention improvements genuinely enhance API clarity
  • No files require special attention

Important Files Changed

Filename Overview
nemo_curator/stages/text/deduplication/removal_workflow.py Renamed input_id_field to id_field and ids_to_remove_duplicate_id_field to duplicate_id_field, plus ids_to_remove_read_kwargs to duplicate_id_read_kwargs for consistency
nemo_curator/stages/deduplication/fuzzy/buckets_to_edges.py Renamed doc_id_field to document_id_field for clarity and consistency with other deduplication stages
nemo_curator/stages/text/deduplication/semantic.py Updated to use new parameter names (id_field, duplicate_id_field, duplicate_id_read_kwargs) when instantiating TextDuplicatesRemovalWorkflow
tests/stages/text/deduplication/test_removal_workflow.py Updated all test cases to use new parameter names, maintaining comprehensive test coverage across all workflows
tests/stages/deduplication/fuzzy/test_buckets_to_edges_stage.py Updated all test cases to use document_id_field instead of doc_id_field

Sequence Diagram

sequenceDiagram
    participant User
    participant TextDuplicatesRemovalWorkflow
    participant BucketsToEdgesStage
    participant TextDuplicatesRemovalStage
    
    Note over User,TextDuplicatesRemovalStage: Standardized Parameter Names
    
    User->>TextDuplicatesRemovalWorkflow: Create workflow with:<br/>id_field="_curator_dedup_id"<br/>duplicate_id_field="id"<br/>duplicate_id_read_kwargs={...}
    
    Note over TextDuplicatesRemovalWorkflow: Previously:<br/>input_id_field<br/>ids_to_remove_duplicate_id_field<br/>ids_to_remove_read_kwargs
    
    TextDuplicatesRemovalWorkflow->>TextDuplicatesRemovalStage: Pass parameters:<br/>id_field<br/>duplicate_id_field<br/>read_kwargs
    
    TextDuplicatesRemovalStage-->>TextDuplicatesRemovalWorkflow: Remove duplicates using standardized field names
    
    Note over User,BucketsToEdgesStage: Fuzzy Deduplication Flow
    
    User->>BucketsToEdgesStage: Create stage with:<br/>document_id_field="_curator_dedup_id"
    
    Note over BucketsToEdgesStage: Previously: doc_id_field<br/>Now: document_id_field
    
    BucketsToEdgesStage-->>User: Generate edges with<br/>document_id_field_x and document_id_field_y columns
Loading

@KunalSachdev2005 KunalSachdev2005 force-pushed the fixes_1192_standardize_id_naming_for_dedup branch 4 times, most recently from 57ba3e2 to 20339c9 Compare January 21, 2026 01:45
Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
@KunalSachdev2005 KunalSachdev2005 force-pushed the fixes_1192_standardize_id_naming_for_dedup branch from 20339c9 to 7b60df3 Compare January 21, 2026 01:46
@KunalSachdev2005
Copy link
Contributor Author

Hi @sarahyurick, I’ve removed the pre-commit formatting changes. The new branch now only contains the ID renaming refactor for #1192. Thanks!

Apologies for the multiple commits. I was trying to fix an issue with commit signing.

@sarahyurick
Copy link
Contributor

/ok to test 7b60df3


Args:
doc_id_field: The field name containing the document ids for each bucket.
document_id_field: The field name containing the document ids for each bucket.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @KunalSachdev2005 ! Your reasoning makes sense to me. We can merge this soon :)

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@sarahyurick sarahyurick merged commit f5846be into NVIDIA-NeMo:main Jan 28, 2026
49 checks passed
@KunalSachdev2005 KunalSachdev2005 deleted the fixes_1192_standardize_id_naming_for_dedup branch January 29, 2026 03:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Standardize ID naming for deduplication

4 participants