Improve contact parsing in the EmlImportService and added tests #260

Draft: wants to merge 17 commits into develop

vjrj (Contributor) commented on Jan 21, 2025

This pull request includes multiple changes to the Contact class and related services to improve contact management and data consistency. The main changes involve adding new fields to the Contact class, updating validation rules, and modifying the logic for extracting and updating contact information.

Changes to Contact class:

  • Added new fields: organizationName, positionName, and userId to the Contact class.
  • Updated validation rules to include the new fields with appropriate constraints.
  • Modified the buildName method to incorporate the new fields for constructing contact names.
  • Updated the hasContent method to check for the presence of the new fields.

Database schema changes:

  • Added new columns organization_name, position_name, and user_id to the contact table in the database schema.

Changes to EmlImportService:

  • Introduced a processedContacts list to ensure unique contacts are added.
  • Refactored the logic for adding and updating contacts to include the new fields and ensure data consistency.
  • grails-app/services/au/org/ala/collectory/EmlImportService.groovy: Introduced a new list processedContacts to track unique contacts and avoid duplicates. Updated the addOrUpdateContact method to handle additional fields and ensure contacts are unique based on email, name, organization, or position.
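
As a rough illustration of the de-duplication idea (written in Python rather than the service's Groovy, and not the actual addOrUpdateContact code), a contact can be reduced to an identity key and only kept the first time that key is seen. The precedence order (email, then name, then organization, then position), the normalization, and every name below are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Contact:
    # Hypothetical subset of fields; the real Contact class has more.
    first_name: Optional[str] = None
    last_name: Optional[str] = None
    email: Optional[str] = None
    organization_name: Optional[str] = None
    position_name: Optional[str] = None


def _norm(value: Optional[str]) -> Optional[str]:
    """Trim and lowercase a value; treat blank strings as missing."""
    if value is None or not value.strip():
        return None
    return value.strip().lower()


def identity_key(contact: Contact) -> Optional[Tuple[str, str]]:
    """Return the first available identifier (assumed precedence), or None."""
    email = _norm(contact.email)
    if email:
        return ("email", email)
    name_parts = [p for p in (_norm(contact.first_name), _norm(contact.last_name)) if p]
    if name_parts:
        return ("name", " ".join(name_parts))
    organization = _norm(contact.organization_name)
    if organization:
        return ("organization", organization)
    position = _norm(contact.position_name)
    if position:
        return ("position", position)
    return None  # nothing usable to identify this contact


def unique_contacts(parsed: List[Contact]) -> List[Contact]:
    """Keep the first contact seen for each identity key, preserving order."""
    processed_contacts: List[Contact] = []
    seen = set()
    for contact in parsed:
        key = identity_key(contact)
        if key is None or key in seen:
            continue  # skip empty contacts and duplicates
        seen.add(key)
        processed_contacts.append(contact)
    return processed_contacts
```

Keying on a single identifier keeps the sketch short; matching on any of the four fields at once would need a different structure.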

Addition of comprehensive unit tests:

Addition of test resources:

This fixes #259.

…his and the IPTService (fix for #259)
@vjrj vjrj marked this pull request as draft January 21, 2025 16:08
vjrj (Contributor, Author) commented on Feb 6, 2025

I've been working on improving how contacts are parsed from EML files and how they are associated with data resources (drs).

I also created a tool in the la-toolkit (which I've just released) that compares our contacts with GBIF's. Testing my changes against real data helped me add more tests and improve this PR.

image

Number of differences initially:

image

I reduced the number of differences, little by little:

image

until:

image

But I've now realized that we are not handling one part well: the step where we look up existing contacts, reuse them, and update them. Let me explain.

When an EML has a contact:

dr1

  • John Doe, position: A, email: john@example

we check whether that contact already exists in our contacts table.

We then continue with the other drs and try to enrich the contact when we have more data (email, etc.).

dr2

  • John Doe, position: B, phone: 0000

At the end of processing, there is a single contact in the db holding the latest data we have (assuming dr2 was processed after dr1):

  • John Doe, position: B, phone: 0000

and both dr1 and dr2 end up with:

dr1

  • John Doe, position: B, phone: 0000

dr2

  • John Doe, position: B, phone: 0000
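
The behaviour described above can be reproduced with a small sketch. It is written in Python for brevity and is not the EmlImportService code: the SharedContactStore class, matching by name only, and the "latest EML wins" update rule are all assumptions chosen to match the example.

```python
from typing import Dict, List


class SharedContactStore:
    """Hypothetical stand-in for one contact table shared by every dr."""

    def __init__(self) -> None:
        self.contacts: Dict[str, dict] = {}    # contact name -> shared record
        self.links: Dict[str, List[str]] = {}  # dr uid -> names of linked contacts

    def import_contact(self, dr: str, parsed: dict) -> None:
        # Find an existing contact by name, or create an empty one.
        record = self.contacts.setdefault(parsed["name"], {})
        # Assumption: the most recently processed EML overwrites the record.
        record.clear()
        record.update(parsed)
        linked = self.links.setdefault(dr, [])
        if parsed["name"] not in linked:
            linked.append(parsed["name"])

    def contacts_for(self, dr: str) -> List[dict]:
        return [self.contacts[name] for name in self.links.get(dr, [])]


store = SharedContactStore()
store.import_contact("dr1", {"name": "John Doe", "position": "A", "email": "john@example"})
store.import_contact("dr2", {"name": "John Doe", "position": "B", "phone": "0000"})

# Both drs point at the same record, so dr1 has lost its own position and email.
print(store.contacts_for("dr1"))
print(store.contacts_for("dr2"))
```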

How is this done at GBIF for the same drs (or in the IPT)?

dr1

  • John Doe, position: A, email: john@example

dr2

  • John Doe, position: B, phone: 0000

which correctly represents the EML contact data over time, without trying to find or link existing contacts added by other drs.

So I propose rewriting this contact find-and-update step so that it only considers the contacts of the dr being processed. Each dr's contacts would then represent exactly what its EML contains, and contacts with the same name (etc.) from other drs would not be reused.
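
A minimal sketch of what I mean, in Python rather than Groovy and with hypothetical names throughout (sync_dr_contacts, the dict-based store, and matching by name are assumptions, not existing code): the lookup is restricted to the contacts already owned by the dr being synced, and anything no longer in the EML is dropped.

```python
from typing import Dict, List


def sync_dr_contacts(store: Dict[str, List[dict]], dr: str, eml_contacts: List[dict]) -> None:
    """Make the contacts owned by `dr` mirror its EML, ignoring every other dr."""
    # Only this dr's existing contacts are candidates for reuse.
    current = {c["name"]: c for c in store.get(dr, [])}
    synced: List[dict] = []
    for parsed in eml_contacts:
        # Reuse this dr's own record if present (in a db this would keep the row),
        # otherwise create a new one.
        record = current.get(parsed["name"], {})
        record.clear()
        record.update(parsed)
        synced.append(record)
    # Contacts that are no longer in the EML are dropped for this dr.
    store[dr] = synced


store: Dict[str, List[dict]] = {}
sync_dr_contacts(store, "dr1", [{"name": "John Doe", "position": "A", "email": "john@example"}])
sync_dr_contacts(store, "dr2", [{"name": "John Doe", "position": "B", "phone": "0000"}])

# Each dr keeps exactly what its own EML said.
print(store["dr1"])
print(store["dr2"])
```

The trade-off is that the same person may appear as several rows, one per dr, which is the price of keeping each dr faithful to its EML.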

cc @adam-collins and @peggynewman.

Development

Successfully merging this pull request may close these issues.

Some contacts are still not well created/updated during an IPT sync