db: express whether an entity has been 'cleaned' or not #150

aspiers · 2017-02-07T22:39:23Z

Description

We need to add a way to express whether or not an entity has already been deduplicated. The need for this was realised at the 2017/02/07 meeting, when we discussed how to implement a workflow that would enable volunteers to help manually deduplicate incoming entities from data sources after import.

Comments, Questions and Considerations

We want an automatic deduplication heuristic which runs during import and tries to match each newly imported entity with an existing one. For example, the most naive implementation would simply look for an existing entity whose name is identical to the one being imported. If it finds one, it would assume they refer to the same entity and reuse the existing one rather than create a new unclean entity.
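As a rough illustration of the naive heuristic described above, here is a minimal Python sketch. The function names and the in-memory dict standing in for the entity table are assumptions made for the example, not part of the project's code.

```python
from typing import Dict, Optional


def find_existing_entity(name: str, entities_by_name: Dict[str, int]) -> Optional[int]:
    """Return the id of an existing entity with an identical name, else None."""
    return entities_by_name.get(name)


def import_entity(name: str, entities_by_name: Dict[str, int]) -> int:
    """Reuse an exact-name match if one exists; otherwise register a new entity.

    `entities_by_name` is a hypothetical stand-in for the entity table,
    mapping each entity name to its id.
    """
    existing_id = find_existing_entity(name, entities_by_name)
    if existing_id is not None:
        # Identical name: assume both refer to the same entity.
        return existing_id
    # No match: create a new (so far unclean) entity with the next free id.
    new_id = max(entities_by_name.values(), default=0) + 1
    entities_by_name[name] = new_id
    return new_id
```

Because the comparison is exact string equality, variants such as "Acme Ltd" and "ACME Limited" end up as separate entities, which is the case the manual deduplication workflow is meant to catch.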

In terms of implementing this, we had initially thought a boolean flag would suffice. However, after further discussion at the 2017/03/14 meeting, we concluded that a better way would be to add an extra table between the Organisation Entity table and the Meetings. This table would be for 'Entity Names' and would store names in the raw form read in from the CSV file. It would contain a field for the name as well as an optional foreign key link to the entity table. Whether or not that key is present will indicate if the entry is 'clean'.
Furthermore, instead of needing a name field itself, the entity table can link back to an entry in the new table, which will then be considered the canonical name for that entity. If the canonical name has not yet been seen, a new entry can be created purely for that purpose.
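To make the proposed table layout concrete, here is a small sketch using Python's built-in `sqlite3` module. The table and column names (`entity_names`, `entities`, `entity_id`, `canonical_name_id`) and the helper functions are assumptions chosen for illustration; the real schema may differ.

```python
import sqlite3

# Hypothetical schema: a raw name is 'clean' once entity_id is set.
SCHEMA = """
CREATE TABLE entities (
    id INTEGER PRIMARY KEY,
    canonical_name_id INTEGER REFERENCES entity_names(id)
);
CREATE TABLE entity_names (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    entity_id INTEGER REFERENCES entities(id)  -- NULL means not yet cleaned
);
"""


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_raw_name(conn: sqlite3.Connection, name: str) -> int:
    """Store a name exactly as read from the data file, unlinked."""
    return conn.execute(
        "INSERT INTO entity_names (name) VALUES (?)", (name,)
    ).lastrowid


def link_name(conn: sqlite3.Connection, name_id: int, entity_id: int) -> None:
    """Mark a raw name as clean by linking it to an entity."""
    conn.execute(
        "UPDATE entity_names SET entity_id = ? WHERE id = ?", (entity_id, name_id)
    )


def create_entity(conn: sqlite3.Connection, canonical_name_id: int) -> int:
    """Create an entity whose canonical name is an existing entity_names row."""
    entity_id = conn.execute(
        "INSERT INTO entities (canonical_name_id) VALUES (?)", (canonical_name_id,)
    ).lastrowid
    link_name(conn, canonical_name_id, entity_id)
    return entity_id


def unclean_names(conn: sqlite3.Connection) -> list:
    """Return raw names that are not yet linked to any entity."""
    rows = conn.execute(
        "SELECT name FROM entity_names WHERE entity_id IS NULL ORDER BY id"
    )
    return [row[0] for row in rows]
```

With this layout, no separate boolean flag is needed: the query in `unclean_names` derives 'clean' status purely from whether the foreign key is present.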

Blocks

ui: exclude data relating to unclean entities from UI #151

Acceptance Criteria

This story can be considered done when the following acceptance tests
are satisfied:

Given a new data file to import
When the data file is imported
Then the importer tries to match each value in the data file against all entity names already in the database. For each value where no match can be found, it creates a new entity name that's not yet linked to an entity.
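The acceptance test above could be exercised against something like the following sketch. It assumes that "match" means exact string equality and uses a plain dict in place of the entity names table; `import_values` is a hypothetical name, not an existing function in the project.

```python
from typing import Dict, Iterable, List, Optional


def import_values(
    values: Iterable[str], entity_names: Dict[str, Optional[int]]
) -> List[str]:
    """Import raw values, creating unlinked entity names where no match exists.

    `entity_names` maps each known raw name to its entity id, or to None
    when the name is not yet linked to an entity. Returns the names that
    were newly created, in the order they were first seen.
    """
    created = []
    for value in values:
        if value in entity_names:
            # Already known (linked or not): nothing new to create.
            continue
        entity_names[value] = None  # new name, not yet linked to an entity
        created.append(value)
    return created
```

For example, importing `["Acme Ltd", "Beta plc", "Beta plc"]` when only "Acme Ltd" is already known creates a single unlinked entry for "Beta plc" and leaves the existing link for "Acme Ltd" untouched.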

aspiers · 2017-02-07T22:39:44Z

@JohnSmall @Greatlemer Does this look right to you?

Greatlemer · 2017-03-20T11:58:50Z

@aspiers, I've repurposed this case to fit with what we discussed last week, hope that's ok.

aspiers · 2017-03-20T15:17:57Z

@Greatlemer More than OK, it's what I would have suggested ;-)

aspiers added Data Collection Data Storage and API labels Feb 7, 2017

aspiers assigned JohnSmall Feb 7, 2017

aspiers mentioned this issue Feb 7, 2017

ui: exclude data relating to unclean entities from UI #151

Open

aspiers changed the title ~~add unclean flag to entity model~~ db: add unclean flag to entity model Feb 7, 2017

aspiers mentioned this issue Feb 7, 2017

rails: add ui for manual deduplication workflow #153

Open

Greatlemer assigned Greatlemer and unassigned JohnSmall Mar 20, 2017

Greatlemer changed the title ~~db: add unclean flag to entity model~~ db: express whether an entity has been 'cleaned' or not Mar 20, 2017
