Improve contact parsing in the EmlImportService and added tests #260
base: develop
…his and the IPTService (fix for #259)
…present instead of the name)
…re or less data). The Contact should be like in the eml
I've been working on improving how contacts are parsed from EMLs and how they are associated with drs. I also created a comparison tool with GBIF in the la-toolkit (which I've just released) to test my changes against real data; it helped me add more tests and improve this PR.
Number of differences initially:
I reduced the number of differences little by little:
until:
But I've now reached a point where I realize we are not doing this well, specifically the part where we try to find existing contacts and then reuse and update them. Let me explain. When an EML has a contact:
we check whether that contact already exists in our contacts table. We then continue with the other drs and try to improve the contact whenever we have more data (email, etc.).
At the end of processing, there is a single contact in the db holding the latest data we have (if we processed dr2 after dr1):
and dr1 and dr2 have:
How is this done at GBIF for the same drs (or in the IPT)?
that correctly represent the EML contact data over time, without trying to find or link existing contacts added by other drs.
So I propose rewriting this contact find & update part so that it does the same thing, but only among the contacts of that dr. The contacts would then represent exactly what the EML has, without reusing contacts with the same name, etc. from other drs.
cc @adam-collins and @peggynewman.
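To make the difference concrete, here is a minimal sketch of the two strategies discussed above. It is not the collectory code: `EmlContact`, `ContactStoreSketch`, `importShared`, and `importPerDr` are hypothetical names, and matching on the name alone is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified contact as read from an EML (not the real domain class).
class EmlContact {
    final String name;
    final String email;

    EmlContact(String name, String email) {
        this.name = name;
        this.email = email;
    }
}

class ContactStoreSketch {
    // Shared lookup: one contact per name across all drs, updated in place.
    // Assumption: matching is by name only, and the last dr processed wins.
    static Map<String, EmlContact> importShared(Map<String, List<EmlContact>> emlByDr) {
        Map<String, EmlContact> shared = new LinkedHashMap<>();
        for (List<EmlContact> contacts : emlByDr.values()) {
            for (EmlContact c : contacts) {
                shared.put(c.name, c); // a later dr overwrites what an earlier dr stored
            }
        }
        return shared;
    }

    // Per-dr storage: each dr keeps exactly the contacts found in its own EML.
    static Map<String, List<EmlContact>> importPerDr(Map<String, List<EmlContact>> emlByDr) {
        Map<String, List<EmlContact>> perDr = new LinkedHashMap<>();
        for (Map.Entry<String, List<EmlContact>> e : emlByDr.entrySet()) {
            perDr.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        return perDr;
    }
}
```

With the shared lookup, a contact that appears without an email in dr1 and with an email in dr2 ends up as one record carrying dr2's data for both drs. With per-dr storage, dr1 keeps the contact exactly as its EML states it.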
This pull request includes multiple changes to the Contact class and related services to improve contact management and data consistency. The main changes involve adding new fields to the Contact class, updating validation rules, and modifying the logic for extracting and updating contact information.

Changes to Contact class:
- Added the fields organizationName, positionName, and userId to the Contact class.
- Updated the buildName method to incorporate the new fields for constructing contact names.
- Updated the hasContent method to check for the presence of the new fields.

Database schema changes:
- Added the columns organization_name, position_name, and user_id to the contact table in the database schema.

Changes to EmlImportService:
- Added a processedContacts list to ensure unique contacts are added.
- grails-app/services/au/org/ala/collectory/EmlImportService.groovy: Introduced a new list processedContacts to track unique contacts and avoid duplicates. Updated the addOrUpdateContact method to handle additional fields and ensure contacts are unique based on email, name, organization, or position.

Addition of comprehensive unit tests:
- src/test/groovy/au/org/ala/collectory/IptServiceSpec.groovy: Added a new test specification for the IptService class, including tests for merging updates with existing resources, creating new resources, and handling duplicate contacts.
- …EmlImportServiceSpec.

Addition of test resources:
- src/test/resources/base_eml.xml: Added a new XML file to serve as a base template for testing EML data extraction.

This fixes #259.
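As an illustration of the deduplication idea in the summary, the sketch below keeps a processedContacts list and skips contacts whose identity was already seen. It is only a sketch under assumptions: `ContactSketch` and `ContactDedupSketch` are hypothetical classes, the identity key is taken to be the first non-blank value among email, name, organization, and position (compared case-insensitively), and the fallback order in `buildName` is a guess rather than the PR's actual logic.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the Contact domain class.
class ContactSketch {
    final String name, email, organizationName, positionName;

    ContactSketch(String name, String email, String organizationName, String positionName) {
        this.name = name;
        this.email = email;
        this.organizationName = organizationName;
        this.positionName = positionName;
    }

    static boolean present(String s) {
        return s != null && !s.trim().isEmpty();
    }

    // True when at least one identifying field is filled in.
    boolean hasContent() {
        return present(name) || present(email) || present(organizationName) || present(positionName);
    }

    // Assumed fallback order: person name, then position, then organization.
    String buildName() {
        if (present(name)) return name.trim();
        if (present(positionName)) return positionName.trim();
        if (present(organizationName)) return organizationName.trim();
        return "";
    }
}

class ContactDedupSketch {
    // Assumed identity: the first non-blank of email, name, organization, position.
    static String key(ContactSketch c) {
        String[] labels = {"email", "name", "organization", "position"};
        String[] values = {c.email, c.name, c.organizationName, c.positionName};
        for (int i = 0; i < values.length; i++) {
            if (ContactSketch.present(values[i])) {
                return labels[i] + ":" + values[i].trim().toLowerCase(Locale.ROOT);
            }
        }
        return null;
    }

    // Returns the contacts in input order, dropping empty ones and repeats of a seen key.
    static List<ContactSketch> unique(List<ContactSketch> extracted) {
        List<ContactSketch> processedContacts = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (ContactSketch c : extracted) {
            String k = key(c);
            if (k != null && seen.add(k)) {
                processedContacts.add(c);
            }
        }
        return processedContacts;
    }
}
```

Prefixing the key with the field label keeps an email from colliding with an identical organization or position string. Whether a partial match on a single field should count as a duplicate is a design decision this sketch does not settle.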