OpenRefine readme
Version 0.2 | 2023-09-01
Contributors: @amandawhitmire, @dimus
Reconciliation endpoint: https://verifier.globalnames.org/api/v1/reconcile
Video tutorial: http://opendata.globalnames.org/video/gnverifier-openrefine.mp4
CSV file from the video tutorial: paleo-names-for-demo-20231102.csv. You can use this file to follow along.
Project file from the video tutorial (use Import Project in OpenRefine): paleo-names-for-demo-20231102-csv.openrefine.tar.gz
- OpenRefine user manual
- OpenRefine documentation on reconciliation
- OpenRefine Reconciliation Service
- OpenRefine Services Registry
- GBIF documentation about loading data and basic OpenRefine functionality (faceting, editing, etc.): Use of OpenRefine
This documentation starts at the step where you have already created an OpenRefine project and have a column of scientific names (e.g., “Verbatim Name” below). If needed, see the documentation linked above to get started with installing OpenRefine and creating a project. Note that scientific names with authorship (for example, Icteranthidium ferrugineum (Fabricius, 1787) rather than Icteranthidium ferrugineum) usually reconcile more precisely against the GlobalNames endpoint.
To avoid overwriting your original data, create a new column with a copy of the existing names. Use the drop-down arrow at the column with species names to access the menu > Edit column > Add column based on this column …
Give the new column a name, keep GREL as the Language (the default), and leave ‘value’ in the Expression field (the default). A preview of the result will be shown below the Expression box, in the right column. Click ‘OK’.
Once the new column is created, you can start reconciling. Access the menu > Reconcile > Start reconciling …
Click on ‘Add standard service …’, paste in the URL to the Global Names Verifier reconciliation service (https://verifier.globalnames.org/api/v1/reconcile), and click ‘Add service’.
You can leave all the settings at their defaults for now (additional filtering features are described in the next section). ‘Auto-match candidates with high confidence’ should be selected, and you can choose to limit the number of possible matches. Click ‘Start reconciling …’.
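For readers curious about what OpenRefine sends to the service, here is a minimal Python sketch of a batch request body. It assumes the GlobalNames endpoint follows the standard Reconciliation Service API (a form field named `queries` holding a JSON object keyed by `q0`, `q1`, …); the helper name `build_reconcile_form` is hypothetical, and the sketch only builds the body without contacting the server.

```python
import json
from urllib.parse import parse_qs, urlencode

# Endpoint from this readme; the sketch never actually calls it.
RECONCILE_URL = "https://verifier.globalnames.org/api/v1/reconcile"


def build_reconcile_form(names, limit=3):
    """Build a URL-encoded form body for a batch of reconciliation queries.

    Assumption: the service accepts the standard `queries` form field.
    `limit` mirrors the "limit the number of possible matches" setting.
    """
    queries = {
        f"q{i}": {"query": name, "limit": limit}
        for i, name in enumerate(names)
    }
    return urlencode({"queries": json.dumps(queries)})


body = build_reconcile_form(["Icteranthidium ferrugineum (Fabricius, 1787)"])

# Decode the body again to show the JSON that would be posted to RECONCILE_URL.
decoded = json.loads(parse_qs(body)["queries"][0])
print(decoded)
```

OpenRefine does this for you, so the sketch is only useful for understanding or scripting the same step outside of OpenRefine.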
Optionally, you can use the higher_taxon and data_source_ids settings to restrict matches to names whose classification contains a desired higher taxon name, or to names from particular data sources. To make this work, add columns to your data in which every entry contains the desired setting.
Once the columns are created, add them as filters in the reconciliation settings window.
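As a rough illustration of how such filters travel with a query, the sketch below attaches them as `properties` entries in the style of the standard Reconciliation Service API (`pid` for the property name, `v` for its value). The property names come from this readme; the helper name `build_filtered_query` and the example values are hypothetical, and the exact value format the service expects for data_source_ids is an assumption here.

```python
import json


def build_filtered_query(name, higher_taxon=None, data_source_ids=None):
    """Build one reconciliation query with optional filter properties.

    Assumption: filters are passed as standard `properties` entries, with
    values taken as-is from the extra columns described above.
    """
    query = {"query": name}
    properties = []
    if higher_taxon:
        properties.append({"pid": "higher_taxon", "v": higher_taxon})
    if data_source_ids:
        properties.append({"pid": "data_source_ids", "v": data_source_ids})
    if properties:
        query["properties"] = properties
    return query


# "Apidae" and "1" are placeholder values chosen only for illustration.
example = build_filtered_query(
    "Icteranthidium ferrugineum", higher_taxon="Apidae", data_source_ids="1"
)
print(json.dumps({"q0": example}, indent=2))
```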
You will see the reconciliation tool make progress. Depending on how many records you have, this may take some time. In our example, reconciling 459 rows took well under one minute.
After the reconciliation process is complete, high-quality match candidates will have been matched automatically. In the example below, 439 of 459 rows were matched automatically. Reviewing all matches is recommended. You can undo a match by clicking on ‘Choose new match’ for any of the results.
Cases where there are manual match options (multiple names and/or lower-confidence scores) will show up in the Facet/Filter reconciliation judgment ‘none’. Click on ‘none’ to show only the scientific names that need to be matched manually.
If you hover your cursor over a potential match, you will see information about the verification source and the match. To select a match, click on the check mark. If you have the same scientific name in multiple rows of your dataset, clicking the double check will make the match for all rows with the same name.
There may be a few cases where your data was not reconciled, and these will be judged as ‘(unreconciled)’. Click ‘(unreconciled)’ to isolate your view to those rows. NOTE: If you leave these unreconciled values in the column, they will be exported as though they were reconciled values. You do not want this!
To remove the unreconciled values, access the menu > Edit cells > Common transforms > To null. This will convert the values to nulls, or blanks.
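If you post-process an export with a script instead, the same clean-up can be sketched in Python. The row layout below (a dict with `name` and `judgment` keys, where `judgment` is `None` for unreconciled cells) is purely hypothetical and is not OpenRefine's export format; adapt it to however your data records the reconciliation status.

```python
def blank_unreconciled(rows):
    """Return copies of rows with the name set to None when unreconciled.

    Hypothetical layout: each row is {"name": str, "judgment": str or None},
    where a judgment of None stands for '(unreconciled)'.
    """
    cleaned = []
    for row in rows:
        copy = dict(row)  # never modify the original data in place
        if copy.get("judgment") is None:
            copy["name"] = None
        cleaned.append(copy)
    return cleaned


# Illustrative rows only; the second entry is a made-up unreconciled value.
rows = [
    {"name": "Icteranthidium ferrugineum", "judgment": "matched"},
    {"name": "unparseable entry", "judgment": None},
]
print(blank_unreconciled(rows))
```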
You can now export the dataset as a spreadsheet in the format of your choice under the ‘Export’ menu at the top right of the OpenRefine window.
It helps to understand the Best Candidate's Score filter, which should appear on the left when reconciliation finishes.
Its graph shows the distribution of matching scores, with the best scores on the right and the worst on the left. Scores cluster into seven groups (canonical form here means a name without authorship):
- Canonical form of a name matched exactly, authorship matched well, data came from sources that are known to have some curation (these names normally match automatically).
- Canonical form of a name matched exactly, authorship matched well, but the source is of an unknown quality.
- Canonical form matched, but not authorship.
- Canonical form matched imperfectly (fuzzy matching).
- Canonical form only matched at the level of species (for infraspecies).
- Canonical form matched at the level of species imperfectly (fuzzy matching).
- Canonical form matched only at the level of genus.
Sliders on the graph let you isolate a particular group of results by score and then either reconcile them manually or apply a bulk decision to the whole group (discard it or set it as matched).
It is possible to get additional information for matched results. Add new columns from the column of resolved results:
Select additional data from the list of available properties.
Wikidata endpoint: https://wikidata.reconci.link/en/api
If you want to reconcile via the Wikidata endpoint, you have to use the canonical form of a name (without authorship).
When a scientific name contains authorship, you can convert it to the canonical form using the gnparser API.
gnparser understands the inner structure of scientific names and splits their components into separate fields.
First, create a new column using Add column by fetching URL...
Change the expression to the following line: "https://parser.globalnames.org/api/v1/" + escape(value, "url").
Also change Throttle delay from 500 to 5.
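For reference, a rough Python equivalent of that GREL expression is sketched below. The function name `gnparser_url` is hypothetical, and `urllib.parse.quote` may encode some characters differently from GREL's `escape(value, "url")` (for example, spaces as `%20` rather than `+`), so treat it as an approximation rather than an exact match.

```python
from urllib.parse import quote

# Base URL taken from the GREL expression in this readme.
GNPARSER_BASE = "https://parser.globalnames.org/api/v1/"


def gnparser_url(name):
    """Append the percent-encoded name to the gnparser base URL."""
    # safe="" also encodes "/" so the name stays a single path segment.
    return GNPARSER_BASE + quote(name, safe="")


url = gnparser_url("Bulimus canarius Philippi, in Pfeiffer, 1867")
print(url)
```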
At the end of the column creation, it will be filled with JSON-encoded data that looks like this:
[{"parsed":true,"quality":2,"qualityWarnings":[{"quality":2,"warning":"Ex authors are not required (ICZN only)"}],"verbatim":"Bulimus canarius Philippi, in Pfeiffer, 1867","normalized":"Bulimus canarius Philippi ex Pfeiffer 1867","canonical":{"stemmed":"Bulimus canar","simple":"Bulimus canarius","full":"Bulimus canarius"},"cardinality":2,"authorship":{"verbatim":"Philippi, in Pfeiffer, 1867","normalized":"Philippi ex Pfeiffer 1867","year":"1867","authors":["Philippi","Pfeiffer"]},"id":"eb69fa41-52ca-5d6b-9d16-d76d97a6dddc","parserVersion":"v1.7.4"}]
Now we need to extract the canonical form from the parsed JSON-encoded data. Use the "Edit cells > Transform" menu.
In the Expression field, enter parseJson(value)[0]["canonical"]["simple"]
The column will now contain the canonical form of each name and can be used for reconciliation via Wikidata.
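The same extraction can be sketched in Python, which may help when checking a few responses by hand. The sample below is an abridged copy of the gnparser output shown earlier (only the fields needed here are kept), and the function name `canonical_simple` is hypothetical.

```python
import json

# Abridged from the sample gnparser response shown in this readme.
SAMPLE = (
    '[{"parsed":true,'
    '"verbatim":"Bulimus canarius Philippi, in Pfeiffer, 1867",'
    '"canonical":{"stemmed":"Bulimus canar","simple":"Bulimus canarius",'
    '"full":"Bulimus canarius"}}]'
)


def canonical_simple(raw_json):
    """Mirror parseJson(value)[0]["canonical"]["simple"] from the GREL step."""
    return json.loads(raw_json)[0]["canonical"]["simple"]


print(canonical_simple(SAMPLE))
```

A name that gnparser could not parse may lack the `canonical` field, in which case this sketch would raise an error rather than return a value.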