[benchmarking] Add Semantic Deduplication Identification #1410

Merged
praateekmahajan merged 11 commits into NVIDIA-NeMo:main from ayushdg:semdedup-identify-benchmark
Jan 26, 2026

Conversation


@ayushdg ayushdg commented Jan 21, 2026

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

copy-pr-bot bot commented Jan 21, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@ayushdg changed the title from "initial semdedup benchmark script" to "initial semdedup identification benchmark script" Jan 21, 2026
ayushdg and others added 4 commits January 21, 2026 15:36
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Praateek <praateekm@gmail.com>
[benchmarks] Update Semdedup benchmark + Add more metrics logging to KMeans + load_dataset_size accepts dataset ratio
ayushdg and others added 2 commits January 22, 2026 19:14
Co-authored-by: Praateek Mahajan <praateekmahajan@users.noreply.github.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
@ayushdg ayushdg marked this pull request as ready for review January 23, 2026 00:14

@greptile-apps greptile-apps bot left a comment


No files reviewed, no comments


@greptile-apps greptile-apps bot left a comment


1 file reviewed, 1 comment


Comment on lines +118 to +122
_kmeans_time_taken = kmeans_read_time + kmeans_write_time + kmeans_fit_predict_time

kmeans_read_percent_time = round((kmeans_read_time / _kmeans_time_taken) * 100, 2)
kmeans_write_percent_time = round((kmeans_write_time / _kmeans_time_taken) * 100, 2)
kmeans_fit_predict_percent_time = round((kmeans_fit_predict_time / _kmeans_time_taken) * 100, 2)

Division by zero if _kmeans_time_taken equals 0

If all three timings (kmeans_read_time, kmeans_write_time, kmeans_fit_predict_time) are 0, then _kmeans_time_taken is 0 and each of the three percentage calculations divides by zero.

Suggested change
- _kmeans_time_taken = kmeans_read_time + kmeans_write_time + kmeans_fit_predict_time
- kmeans_read_percent_time = round((kmeans_read_time / _kmeans_time_taken) * 100, 2)
- kmeans_write_percent_time = round((kmeans_write_time / _kmeans_time_taken) * 100, 2)
- kmeans_fit_predict_percent_time = round((kmeans_fit_predict_time / _kmeans_time_taken) * 100, 2)
+ _kmeans_time_taken = kmeans_read_time + kmeans_write_time + kmeans_fit_predict_time
+ if _kmeans_time_taken > 0:
+     kmeans_read_percent_time = round((kmeans_read_time / _kmeans_time_taken) * 100, 2)
+     kmeans_write_percent_time = round((kmeans_write_time / _kmeans_time_taken) * 100, 2)
+     kmeans_fit_predict_percent_time = round((kmeans_fit_predict_time / _kmeans_time_taken) * 100, 2)
+ else:
+     kmeans_read_percent_time = None
+     kmeans_write_percent_time = None
+     kmeans_fit_predict_percent_time = None
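
The guard in this review comment could also be factored into one small helper so the zero-total check is written once instead of three times. The following is only a sketch: the function name percent_breakdown and the dictionary-based interface are hypothetical and not part of this PR, and it assumes the timings are plain Python numbers.

```python
from __future__ import annotations


def percent_breakdown(timings: dict[str, float], ndigits: int = 2) -> dict[str, float | None]:
    """Express each timing as a percentage of the total.

    Hypothetical helper (not from the PR): returns None for every entry
    when the total is 0, mirroring the else-branch of the suggested change.
    """
    total = sum(timings.values())
    if total <= 0:
        # Nothing was timed, so a percentage is undefined rather than 0.
        return {f"{name}_percent_time": None for name in timings}
    return {
        f"{name}_percent_time": round((value / total) * 100, ndigits)
        for name, value in timings.items()
    }


# Illustrative values only; these are not measurements from the benchmark.
breakdown = percent_breakdown(
    {"kmeans_read": 1.0, "kmeans_write": 1.0, "kmeans_fit_predict": 2.0}
)
empty = percent_breakdown(
    {"kmeans_read": 0.0, "kmeans_write": 0.0, "kmeans_fit_predict": 0.0}
)
```

Returning None rather than 0 keeps "no time recorded" distinguishable from "this stage took a negligible share", which matches the intent of the suggested change.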
