refactor: standardize ID field names across deduplication workflows by KunalSachdev2005 · Pull Request #1390 · NVIDIA-NeMo/Curator

KunalSachdev2005 · 2026-01-17T06:19:32Z

Renamed ids_to_remove_duplicate_id_field to duplicate_id_field in TextDuplicatesRemovalWorkflow
Renamed input_id_field to id_field in TextDuplicatesRemovalWorkflow
Renamed doc_id_field to document_id_field in BucketsToEdgesStage
Kept document_id_field and duplicate_id_field unchanged in other stages
Updated internal usages to match new standardized field names

Rationale: unifying naming conventions for input IDs, removal IDs, and document grouping IDs

Description

Usage

# Add snippet demonstrating usage

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2026-01-17T06:19:37Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

KunalSachdev2005 · 2026-01-18T04:45:05Z

Hey @sarahyurick , the code is updated and ready for review.

I wanted to mention that the core renaming changes were performed in only a handful of files. However, many files were modified when I executed the following command according to the contribution guidelines:

pre-commit run --all-files

These changes weren't necessarily changes in the code logic. They seemed like changes in the coding style (defining function headers on a single line instead of multiple lines and stuff like that).

Wanted to point this out because the greptile-bot flagged it: "Skipped: This PR changes more files than the configured file change limit: (122 files found, 100 file limit)."

Please let me know if you’d like me to make any changes.

sarahyurick · 2026-01-20T17:02:22Z

Hi @KunalSachdev2005 can you please revert any modifications from the pre-commit and only commit changes relevant to #1192 ?

greptile-apps · 2026-01-21T01:28:11Z

Greptile Overview

Greptile Summary

This PR standardizes ID field naming conventions across deduplication workflows to improve API consistency and usability. The refactor addresses issue #1192 by renaming parameters to be more intuitive and consistent.

Key Changes:

Renamed input_id_field to id_field in TextDuplicatesRemovalWorkflow
Renamed ids_to_remove_duplicate_id_field to duplicate_id_field in TextDuplicatesRemovalWorkflow
Renamed ids_to_remove_read_kwargs to duplicate_id_read_kwargs in TextDuplicatesRemovalWorkflow
Renamed doc_id_field to document_id_field in BucketsToEdgesStage
Updated all internal usages, tests, documentation, tutorials, and benchmarks

Impact:

Breaking API change requiring users to update their code
All existing tests pass with updated parameter names
Comprehensive documentation updates ensure users can migrate easily
Maintains backward compatibility in functionality, only parameter names changed

Confidence Score: 5/5

This PR is safe to merge - it's a well-executed refactoring with comprehensive test coverage and documentation updates
The changes are purely cosmetic (parameter renaming) with no logic modifications, all tests have been updated and pass, documentation is thoroughly updated including tutorials and examples, and the naming convention improvements genuinely enhance API clarity
No files require special attention

Important Files Changed

Filename	Overview
nemo_curator/stages/text/deduplication/removal_workflow.py	Renamed `input_id_field` to `id_field` and `ids_to_remove_duplicate_id_field` to `duplicate_id_field`, plus `ids_to_remove_read_kwargs` to `duplicate_id_read_kwargs` for consistency
nemo_curator/stages/deduplication/fuzzy/buckets_to_edges.py	Renamed `doc_id_field` to `document_id_field` for clarity and consistency with other deduplication stages
nemo_curator/stages/text/deduplication/semantic.py	Updated to use new parameter names (`id_field`, `duplicate_id_field`, `duplicate_id_read_kwargs`) when instantiating TextDuplicatesRemovalWorkflow
tests/stages/text/deduplication/test_removal_workflow.py	Updated all test cases to use new parameter names, maintaining comprehensive test coverage across all workflows
tests/stages/deduplication/fuzzy/test_buckets_to_edges_stage.py	Updated all test cases to use `document_id_field` instead of `doc_id_field`

Sequence Diagram

sequenceDiagram
    participant User
    participant TextDuplicatesRemovalWorkflow
    participant BucketsToEdgesStage
    participant TextDuplicatesRemovalStage
    
    Note over User,TextDuplicatesRemovalStage: Standardized Parameter Names
    
    User->>TextDuplicatesRemovalWorkflow: Create workflow with:<br/>id_field="_curator_dedup_id"<br/>duplicate_id_field="id"<br/>duplicate_id_read_kwargs={...}
    
    Note over TextDuplicatesRemovalWorkflow: Previously:<br/>input_id_field<br/>ids_to_remove_duplicate_id_field<br/>ids_to_remove_read_kwargs
    
    TextDuplicatesRemovalWorkflow->>TextDuplicatesRemovalStage: Pass parameters:<br/>id_field<br/>duplicate_id_field<br/>read_kwargs
    
    TextDuplicatesRemovalStage-->>TextDuplicatesRemovalWorkflow: Remove duplicates using standardized field names
    
    Note over User,BucketsToEdgesStage: Fuzzy Deduplication Flow
    
    User->>BucketsToEdgesStage: Create stage with:<br/>document_id_field="_curator_dedup_id"
    
    Note over BucketsToEdgesStage: Previously: doc_id_field<br/>Now: document_id_field
    
    BucketsToEdgesStage-->>User: Generate edges with<br/>document_id_field_x and document_id_field_y columns

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>

KunalSachdev2005 · 2026-01-21T01:49:21Z

Hi @sarahyurick, I’ve removed the pre-commit formatting changes. The new branch now only contains the ID renaming refactor for #1192. Thanks!

Apologies for the multiple commits. I was trying to fix an issue with commit signing.

sarahyurick · 2026-01-21T17:36:35Z

/ok to test 7b60df3

sarahyurick · 2026-01-27T23:22:42Z

nemo_curator/stages/deduplication/fuzzy/buckets_to_edges.py


    Args:
-        doc_id_field: The field name containing the document ids for each bucket.
+        document_id_field: The field name containing the document ids for each bucket.


Thanks @KunalSachdev2005 ! Your reasoning makes sense to me. We can merge this soon :)

greptile-apps

_{No files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

github-actions bot added the community-request label Jan 17, 2026

chtruong814 added the needs-follow-up Issue needs follow-up label Jan 20, 2026

chtruong814 removed the needs-follow-up Issue needs follow-up label Jan 20, 2026

KunalSachdev2005 force-pushed the fixes_1192_standardize_id_naming_for_dedup branch from e8688bb to 4accf4b Compare January 21, 2026 01:26

KunalSachdev2005 force-pushed the fixes_1192_standardize_id_naming_for_dedup branch 4 times, most recently from 57ba3e2 to 20339c9 Compare January 21, 2026 01:45

standardize ID field names across deduplication workflows

7b60df3

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>

KunalSachdev2005 force-pushed the fixes_1192_standardize_id_naming_for_dedup branch from 20339c9 to 7b60df3 Compare January 21, 2026 01:46

copy-pr-bot bot temporarily deployed to test January 21, 2026 17:36 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci January 21, 2026 17:37 Failure

copy-pr-bot bot temporarily deployed to nemo-ci January 21, 2026 17:37 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 27, 2026 23:22 Inactive

sarahyurick approved these changes Jan 27, 2026

View reviewed changes

greptile-apps bot reviewed Jan 27, 2026

View reviewed changes

sarahyurick merged commit f5846be into NVIDIA-NeMo:main Jan 28, 2026
49 checks passed

KunalSachdev2005 deleted the fixes_1192_standardize_id_naming_for_dedup branch January 29, 2026 03:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

refactor: standardize ID field names across deduplication workflows#1390

refactor: standardize ID field names across deduplication workflows#1390
sarahyurick merged 4 commits intoNVIDIA-NeMo:mainfrom
KunalSachdev2005:fixes_1192_standardize_id_naming_for_dedup

KunalSachdev2005 commented Jan 17, 2026 •

edited

Loading

Uh oh!

copy-pr-bot bot commented Jan 17, 2026

Uh oh!

KunalSachdev2005 commented Jan 18, 2026

Uh oh!

sarahyurick commented Jan 20, 2026

Uh oh!

greptile-apps bot commented Jan 21, 2026 •

edited

Loading

Uh oh!

KunalSachdev2005 commented Jan 21, 2026

Uh oh!

sarahyurick commented Jan 21, 2026

Uh oh!

sarahyurick Jan 27, 2026

Uh oh!

greptile-apps bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Comments

Conversation

KunalSachdev2005 commented Jan 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Usage

Checklist

Uh oh!

copy-pr-bot bot commented Jan 17, 2026

Uh oh!

KunalSachdev2005 commented Jan 18, 2026

Uh oh!

sarahyurick commented Jan 20, 2026

Uh oh!

greptile-apps bot commented Jan 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Overview

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

KunalSachdev2005 commented Jan 21, 2026

Uh oh!

sarahyurick commented Jan 21, 2026

Uh oh!

sarahyurick Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

KunalSachdev2005 commented Jan 17, 2026 •

edited

Loading

greptile-apps bot commented Jan 21, 2026 •

edited

Loading