This repository has been archived by the owner on Mar 5, 2019. It is now read-only.

db: express whether an entity has been 'cleaned' or not #150

Open
aspiers opened this issue Feb 7, 2017 · 3 comments

Comments

@aspiers
Member

aspiers commented Feb 7, 2017

Description

We need a way to express whether or not an entity has already been deduplicated. The need for this was realised at the 2017/02/07 meeting, when we discussed how to implement a workflow that would enable volunteers to help manually deduplicate incoming entities from data sources after import.

Comments, Questions and Considerations

We want an automatic deduplication heuristic which runs during import and tries to match each newly imported entity with an existing one. For example, the most naive implementation would simply look for an existing entity whose name is identical to the one being imported. If it finds one, it would assume both refer to the same entity and reuse the existing one rather than create a new unclean entity.
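As a rough illustration of the naive heuristic described above, here is a minimal in-memory sketch. The function name `match_by_exact_name`, the dict-based store and the sample names are all hypothetical and not taken from the project's code.

```python
from typing import Dict, List, Tuple


def match_by_exact_name(
    existing: Dict[str, int], incoming: List[str]
) -> Tuple[Dict[str, int], List[str]]:
    """Match incoming names against existing entities by identical name.

    `existing` maps an entity name to a (hypothetical) entity id.
    Returns the matches found and the names left over for manual review.
    """
    matched: Dict[str, int] = {}
    unclean: List[str] = []
    for name in incoming:
        if name in existing:
            # Identical name: assume it is the same entity and reuse it.
            matched[name] = existing[name]
        else:
            # No match: leave it for volunteers to deduplicate by hand.
            unclean.append(name)
    return matched, unclean


matched, unclean = match_by_exact_name(
    {"Acme Ltd": 1}, ["Acme Ltd", "ACME Limited"]
)
```

An exact comparison treats "ACME Limited" as a different name from "Acme Ltd", so cases like this would still end up in the manual deduplication workflow.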

In terms of implementation, we initially thought a boolean flag would suffice. However, after further discussion at the 2017/03/14 meeting we concluded that a better way would be to add an extra table between the Organisation Entity table and the Meetings table. This table would be for 'Entity Names' and would store names in the raw form read in from the CSV file. It would contain a field for the name as well as an optional foreign key link to the entity table; whether or not that key is present indicates whether the entry is 'clean'.
Furthermore, instead of needing a name field itself, the entity table can link back to an entry in the new table, which will then be considered the canonical name for that entity. If the canonical name has not yet been seen, a new entry can be created purely for that purpose.
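The table layout described above could be sketched roughly as follows, using Python's built-in `sqlite3` module purely for illustration. All table and column names (`entities`, `entity_names`, `meetings`, `canonical_name_id`, `entity_id`, `entity_name_id`) are assumptions, as is the choice of database; the real schema may well differ.

```python
import sqlite3

# Hypothetical schema: names and columns are illustrative assumptions.
SCHEMA = """
CREATE TABLE entities (
    id INTEGER PRIMARY KEY,
    -- Points at the entity_names row treated as the canonical name.
    canonical_name_id INTEGER REFERENCES entity_names(id)
);
CREATE TABLE entity_names (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,  -- raw form, as read from the CSV file
    -- NULL means this name is not yet linked to an entity ('unclean').
    entity_id INTEGER REFERENCES entities(id)
);
CREATE TABLE meetings (
    id INTEGER PRIMARY KEY,
    -- Meetings refer to the raw name rather than directly to the entity.
    entity_name_id INTEGER NOT NULL REFERENCES entity_names(id)
);
"""


def unclean_names(conn):
    """Return the raw names that have no linked entity yet."""
    rows = conn.execute(
        "SELECT name FROM entity_names WHERE entity_id IS NULL ORDER BY id"
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# One clean entity whose canonical name is 'Acme Ltd' ...
conn.execute("INSERT INTO entities (id) VALUES (1)")
conn.execute(
    "INSERT INTO entity_names (id, name, entity_id) VALUES (1, 'Acme Ltd', 1)"
)
conn.execute("UPDATE entities SET canonical_name_id = 1 WHERE id = 1")
# ... and one raw name that nobody has linked to an entity yet.
conn.execute("INSERT INTO entity_names (name) VALUES ('ACME Limited')")
```

With this layout no separate boolean flag is needed: `unclean_names(conn)` simply lists the rows whose `entity_id` is still `NULL`.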

Blocks

Acceptance Criteria

This story can be considered done when the following acceptance tests
are satisfied:

Given a new data file to import
When the data file is imported
Then the importer tries to match each value in the data file against all entity names already in the database. For each value where no match can be found, it creates a new entity name that's not yet linked to an entity.
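One possible way to exercise this criterion is sketched below. It assumes a single hypothetical `entity_names` table with a nullable `entity_id` column, a CSV column called `organisation`, and exact string matching; none of these names come from the actual importer.

```python
import csv
import io
import sqlite3


def import_csv(conn, csv_text, column):
    """Create an unlinked entity name for each CSV value with no exact match."""
    created = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row[column]
        hit = conn.execute(
            "SELECT 1 FROM entity_names WHERE name = ?", (value,)
        ).fetchone()
        if hit is None:
            # No existing name matches: store it without an entity link.
            conn.execute(
                "INSERT INTO entity_names (name, entity_id) VALUES (?, NULL)",
                (value,),
            )
            created.append(value)
    return created


# Given: a database with one known name, and a new data file to import.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity_names ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, entity_id INTEGER)"
)
conn.execute("INSERT INTO entity_names (name, entity_id) VALUES ('Acme Ltd', 1)")
data = "organisation\nAcme Ltd\nGlobex plc\nGlobex plc\n"

# When: the data file is imported.
created = import_csv(conn, data, "organisation")

# Then: only the unmatched value produced a new, unlinked entity name.
unlinked = conn.execute(
    "SELECT COUNT(*) FROM entity_names WHERE entity_id IS NULL"
).fetchone()[0]
```

In this sketch the second "Globex plc" row matches the name created for the first one, so repeated values within a single file do not produce duplicate entity names.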

@aspiers
Member Author

aspiers commented Feb 7, 2017

@JohnSmall @Greatlemer Does this look right to you?

@aspiers aspiers changed the title add unclean flag to entity model db: add unclean flag to entity model Feb 7, 2017
@Greatlemer Greatlemer assigned Greatlemer and unassigned JohnSmall Mar 20, 2017
@Greatlemer Greatlemer changed the title db: add unclean flag to entity model db: express whether an entity has been 'cleaned' or not Mar 20, 2017
@Greatlemer
Contributor

@aspiers, I've repurposed this case to fit with what we discussed last week, hope that's ok.

@aspiers
Member Author

aspiers commented Mar 20, 2017

@Greatlemer More than OK, it's what I would have suggested ;-)
