This repository has been archived by the owner on Mar 5, 2019. It is now read-only.
Description
We need a way to express whether or not an entity has already been deduplicated. The need for this was realised at the 2017/02/07 meeting, when we discussed how to implement a workflow that would enable volunteers to help manually deduplicate incoming entities from data sources after import.
Comments, Questions and Considerations
We want an automatic deduplication heuristic which runs during import and tries to match each newly imported entity with an existing one. For example, the most naive implementation would simply look for an existing entity whose name is identical to the one being imported. If it finds one, it would assume both refer to the same entity and reuse the existing one rather than create a new unclean entity.
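To make the naive heuristic concrete, here is a minimal sketch in Python. The `Entity`, `EntityStore` and `match_or_create` names are hypothetical and exist only for illustration; the store is an in-memory stand-in for the real entity table, and the comparison is a plain exact string match.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Entity:
    id: int
    name: str


@dataclass
class EntityStore:
    """Hypothetical in-memory stand-in for the entity table."""

    by_name: Dict[str, Entity] = field(default_factory=dict)
    next_id: int = 1

    def find_by_exact_name(self, name: str) -> Optional[Entity]:
        # Exact, case-sensitive comparison: the most naive heuristic.
        return self.by_name.get(name)

    def create(self, name: str) -> Entity:
        entity = Entity(id=self.next_id, name=name)
        self.by_name[name] = entity
        self.next_id += 1
        return entity


def match_or_create(store: EntityStore, imported_name: str) -> Entity:
    """Reuse an existing entity with an identical name, else create one."""
    existing = store.find_by_exact_name(imported_name)
    if existing is not None:
        return existing
    return store.create(imported_name)
```

With exact matching, "Acme Ltd" and "Acme Limited" end up as two separate entities, which is the kind of case the manual deduplication workflow would have to resolve.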
We initially thought a boolean flag would suffice. After further discussion at the 2017/03/14 meeting, however, we concluded that a better approach is to add an extra table between the Organisation Entity table and the Meetings table. This 'Entity Names' table would store names in the raw form read from the CSV file. Each row would contain a field for the name and an optional foreign key to the entity table; whether or not that key is present indicates whether the entry is 'clean'.
Furthermore, instead of needing a name field itself, the entity table can link back to an entry in the new table, which is then considered the canonical name for that entity. If the canonical name has not yet been seen, a new entry can be created purely for that purpose.
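The two-table design described above could be sketched roughly as follows, using Python's built-in `sqlite3` module. The table and column names (`entity`, `entity_name`, `canonical_name_id`, `entity_id`), the `UNIQUE` constraint on names and the helper functions are all assumptions made for illustration, not the project's actual schema. The link from Meetings to entity names is left out.

```python
import sqlite3

# Hypothetical schema: names are stored raw, and a name is 'clean'
# exactly when its optional entity_id foreign key is set.
SCHEMA = """
CREATE TABLE entity (
    id INTEGER PRIMARY KEY,
    canonical_name_id INTEGER REFERENCES entity_name(id)
);
CREATE TABLE entity_name (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,                 -- assumption: one row per raw name
    entity_id INTEGER REFERENCES entity(id)    -- NULL until deduplicated
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(SCHEMA)


def is_clean(conn: sqlite3.Connection, name: str) -> bool:
    """A name is clean when it exists and is linked to an entity."""
    row = conn.execute(
        "SELECT entity_id FROM entity_name WHERE name = ?", (name,)
    ).fetchone()
    return row is not None and row[0] is not None


def create_entity_with_canonical_name(conn: sqlite3.Connection, name: str) -> int:
    """Create an entity whose canonical name is `name`.

    If the name has not been seen yet, a row is created purely for
    that purpose, then linked back to the new entity.
    """
    conn.execute("INSERT OR IGNORE INTO entity_name (name) VALUES (?)", (name,))
    name_id = conn.execute(
        "SELECT id FROM entity_name WHERE name = ?", (name,)
    ).fetchone()[0]
    cur = conn.execute(
        "INSERT INTO entity (canonical_name_id) VALUES (?)", (name_id,)
    )
    entity_id = cur.lastrowid
    conn.execute(
        "UPDATE entity_name SET entity_id = ? WHERE id = ?", (entity_id, name_id)
    )
    return entity_id
```

Compared with a boolean flag, the nullable foreign key records both whether a name has been deduplicated and which entity it resolved to, so several raw spellings can point at the same entity.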
Blocks
Acceptance Criteria
This story can be considered done when the following acceptance tests are satisfied:
Given a new data file to import
When the data file is imported
Then the importer tries to match each value in the data file against all entity names already in the database. For each value where no match can be found, it creates a new entity name that is not yet linked to an entity.
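As a rough illustration of this criterion, the sketch below imports a CSV file against an in-memory mapping of entity names. The `import_names` function, the `organisation` column name and the dictionary representation (name mapped to a linked entity id, or `None` while unlinked) are hypothetical; matching is assumed to be an exact string comparison.

```python
import csv
import io
from typing import Dict, List, Optional, TextIO

# name -> linked entity id, or None while the name is not yet linked
EntityNames = Dict[str, Optional[int]]


def import_names(
    data_file: TextIO,
    entity_names: EntityNames,
    column: str = "organisation",  # hypothetical column name
) -> List[str]:
    """Match each value against known entity names.

    Values with no match are added as new, unlinked entity names.
    Returns the names created by this import, in file order.
    """
    created: List[str] = []
    for row in csv.DictReader(data_file):
        value = row[column]
        if value in entity_names:
            continue  # matched an existing entity name
        entity_names[value] = None  # new name, not yet linked to an entity
        created.append(value)
    return created
```

A quick check of the behaviour, with made-up data:

```python
names = {"Acme Ltd": 1}
data = io.StringIO("organisation\nAcme Ltd\nGlobex\nGlobex\n")
created = import_names(data, names)
# "Globex" is created once and left unlinked; "Acme Ltd" is matched.
```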