refactor: standardize ID field names across deduplication workflows #1390
Hey @sarahyurick, the code is updated and ready for review. I wanted to mention that the core renaming changes touch only a handful of files. However, many more files were modified when I ran the following command, as required by the contribution guidelines: pre-commit run --all-files
Those extra changes don't affect the code logic. They look like style changes only (for example, function headers defined on a single line instead of multiple lines). I wanted to point this out because the greptile-bot flagged it: "Skipped: This PR changes more files than the configured file change limit: (122 files found, 100 file limit)." Please let me know if you’d like me to make any changes.
Hi @KunalSachdev2005, can you please revert any modifications from the pre-commit run and only commit changes relevant to #1192?
e8688bb to 4accf4b
Greptile Overview
Greptile Summary
This PR standardizes ID field naming conventions across deduplication workflows to improve API consistency and usability. The refactor addresses issue #1192 by renaming parameters to be more intuitive and consistent.
Key Changes:
Impact:
Confidence Score: 5/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant TextDuplicatesRemovalWorkflow
participant BucketsToEdgesStage
participant TextDuplicatesRemovalStage
Note over User,TextDuplicatesRemovalStage: Standardized Parameter Names
User->>TextDuplicatesRemovalWorkflow: Create workflow with:<br/>id_field="_curator_dedup_id"<br/>duplicate_id_field="id"<br/>duplicate_id_read_kwargs={...}
Note over TextDuplicatesRemovalWorkflow: Previously:<br/>input_id_field<br/>ids_to_remove_duplicate_id_field<br/>ids_to_remove_read_kwargs
TextDuplicatesRemovalWorkflow->>TextDuplicatesRemovalStage: Pass parameters:<br/>id_field<br/>duplicate_id_field<br/>read_kwargs
TextDuplicatesRemovalStage-->>TextDuplicatesRemovalWorkflow: Remove duplicates using standardized field names
Note over User,BucketsToEdgesStage: Fuzzy Deduplication Flow
User->>BucketsToEdgesStage: Create stage with:<br/>document_id_field="_curator_dedup_id"
Note over BucketsToEdgesStage: Previously: doc_id_field<br/>Now: document_id_field
BucketsToEdgesStage-->>User: Generate edges with<br/>document_id_field_x and document_id_field_y columns
57ba3e2 to 20339c9
Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
20339c9 to 7b60df3
Hi @sarahyurick, I’ve removed the pre-commit formatting changes. The new branch now only contains the ID renaming refactor for #1192. Thanks! Apologies for the multiple commits. I was trying to fix an issue with commit signing.
/ok to test 7b60df3
  Args:
-     doc_id_field: The field name containing the document ids for each bucket.
+     document_id_field: The field name containing the document ids for each bucket.
Thanks @KunalSachdev2005! Your reasoning makes sense to me. We can merge this soon :)
Fixes #1192
- ids_to_remove_duplicate_id_field to duplicate_id_field in TextDuplicatesRemovalWorkflow
- input_id_field to id_field in TextDuplicatesRemovalWorkflow
- doc_id_field to document_id_field in BucketsToEdgesStage
- document_id_field and duplicate_id_field unchanged in other stages

Rationale: unifying naming conventions for input IDs, removal IDs, and document grouping IDs
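For callers who still pass the old keyword names, the renames can be written down as a plain lookup table. The sketch below is only an illustration and is not part of this PR: `migrate_kwargs` and `RENAMED_KWARGS` are hypothetical names, nothing is imported from the library, and the class names serve purely as dictionary keys. The `ids_to_remove_read_kwargs` entry is taken from the Greptile sequence diagram rather than from the rename list, so treat it as an assumption.

```python
import warnings

# Old -> new keyword names. Hypothetical table built from the rename list in
# this PR; the ids_to_remove_read_kwargs entry comes from the Greptile diagram.
RENAMED_KWARGS = {
    "TextDuplicatesRemovalWorkflow": {
        "input_id_field": "id_field",
        "ids_to_remove_duplicate_id_field": "duplicate_id_field",
        "ids_to_remove_read_kwargs": "duplicate_id_read_kwargs",
    },
    "BucketsToEdgesStage": {
        "doc_id_field": "document_id_field",
    },
}


def migrate_kwargs(class_name: str, kwargs: dict) -> dict:
    """Hypothetical helper: return a copy of kwargs with pre-rename keys replaced."""
    renames = RENAMED_KWARGS.get(class_name, {})
    migrated = {}
    for key, value in kwargs.items():
        new_key = renames.get(key, key)
        if new_key != key:
            warnings.warn(
                f"{class_name}: '{key}' was renamed to '{new_key}'",
                DeprecationWarning,
                stacklevel=2,
            )
        if new_key in migrated:
            # Both the old and the new spelling were supplied for one parameter.
            raise TypeError(f"{class_name}: '{new_key}' given more than once")
        migrated[new_key] = value
    return migrated


old_style = {
    "input_id_field": "_curator_dedup_id",
    "ids_to_remove_duplicate_id_field": "id",
}
new_style = migrate_kwargs("TextDuplicatesRemovalWorkflow", old_style)
print(new_style)  # {'id_field': '_curator_dedup_id', 'duplicate_id_field': 'id'}
```

The resulting dictionary could then be unpacked into the renamed constructor, assuming the remaining arguments are supplied as before. Keys that were not renamed pass through untouched, which matches the note that `document_id_field` and `duplicate_id_field` are unchanged in other stages.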
Description
Usage
# Add snippet demonstrating usage
Checklist