Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UP Deduplication Design Documentation #17143

Merged
merged 21 commits into from
Feb 14, 2025
Merged

Conversation

JFisk42
Copy link
Collaborator

@JFisk42 JFisk42 commented Jan 24, 2025

This PR is for the UP Deduplication design. There are some outstanding research items which could affect minor pieces of the described design.

Changes

  • Added UP Deduplication Design Documentation

Linked Issues

Outstanding TODOs

Should be resolved before merge:

  • Database processing efficiency investigation.
  • Is sender name collision a concern?
    • A: No, sender names are unique to an organization.
  • Product Q: Are Azure Event logs needed for instances of duplicate detection?
    • A: From a platform perspective this is not needed
  • Engagement Q: Are Azure Event logs needed for instances of duplicate detection?
    • A: Engagement would like an Azure Event logged with parent_report_id, child_report_id, and sender org/name.
  • Update UP SRD with dedupe requirements/convert step changes as necessary.
    • A: This will be completed as part of the implementation.

@JFisk42 JFisk42 requested a review from a team as a code owner January 24, 2025 02:45
@JFisk42 JFisk42 added the platform Platform Team label Jan 24, 2025
@JFisk42 JFisk42 assigned JFisk42 and unassigned JFisk42 Jan 24, 2025
Copy link
Contributor

github-actions bot commented Jan 24, 2025

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

OpenSSF Scorecard

PackageVersionScoreDetails

Scanned Manifest Files

Copy link
Contributor

github-actions bot commented Jan 24, 2025

Test Results

1 279 tests  ±0   1 275 ✅ ±0   7m 35s ⏱️ +7s
  168 suites ±0       4 💤 ±0 
  168 files   ±0       0 ❌ ±0 

Results for commit 186698b. ± Comparison against base commit a99306f.

♻️ This comment has been updated with latest results.

Copy link
Contributor

github-actions bot commented Jan 24, 2025

Integration Test Results

 60 files   60 suites   42m 45s ⏱️
428 tests 418 ✅ 10 💤 0 ❌
431 runs  421 ✅ 10 💤 0 ❌

Results for commit 186698b.

♻️ This comment has been updated with latest results.

Copy link
Collaborator

@arnejduranovic arnejduranovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Please address TODOs before merging.

@JFisk42
Copy link
Collaborator Author

JFisk42 commented Feb 6, 2025

Key changes since last reviews:

  • New DB recommendations based on discovery that only 5% of item_lineage.item_hash values are necessary for deduplication. (See "Other Updates")
  • Added proposal for 2-step rollout. Talked with Jessica and having some time to gather item_hash data to analyze potential false positives would be a good safety check.
  • Added UP SRD update proposals
  • Documented existing sender setting for enabling duplicate detection
  • Addition of Azure Events for engagement team
  • Mock class design for containing key field business logic

Update on TODOs:

  • While investigating memory requirements, item_lineage queries continually timed out. An index was pushed for this field to be able to help this.
  • UP SRD will be updated once changes are approved

@JFisk42 JFisk42 merged commit 0764157 into main Feb 14, 2025
25 checks passed
@JFisk42 JFisk42 deleted the platform/josh/16547-dedupe-design branch February 14, 2025 17:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
platform Platform Team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deduplication in the UP MVP
4 participants