OpenRefine readme

Debbie Paul edited this page Aug 5, 2024 · 35 revisions

Reconciling taxonomic names in OpenRefine via Global Names

Version 0.2 | 2023-09-01

Contributors: @amandawhitmire, @dimus

Reconciliation endpoint: https://verifier.globalnames.org/api/v1/reconcile

Video tutorial: http://opendata.globalnames.org/video/gnverifier-openrefine.mp4

CSV file from the video tutorial: paleo-names-for-demo-20231102.csv. You can use this file to follow along.

Project file from the video tutorial (use Import Project in OpenRefine): paleo-names-for-demo-20231102-csv.openrefine.tar.gz

Existing OpenRefine Documentation

Taxonomic Reconciliation Instructions

Initial steps

This documentation starts at the point where you have already created an OpenRefine project and have a column of scientific names (e.g., “Verbatim Name” below). If needed, see the documentation linked above to get started with installing OpenRefine and creating a project. Note that scientific names with authorship (for example, Icteranthidium ferrugineum (Fabricius, 1787) rather than Icteranthidium ferrugineum) usually reconcile more precisely against the Global Names endpoint.

OR_GNV_11 26 38

Creating a separate reconciliation column

You want to avoid writing over your original data, so you need to create a new column with a copy of the existing names. Use the drop-down arrow at the column with species names to access the menu > Edit column > Add column based on this column …

OR_GNV_11 26 50

Give the new column a name, keep GREL as the Language (the default), and leave ‘value’ in the Expression field (the default). A preview of the result will be shown below the Expression box, in the right column. Click ‘OK’.

OR_GNV_11 27 14

Preparing reconciliation

Once the new column is created, you can start reconciling. Access the menu > Reconcile > Start reconciling …

OR_GNV_11 27 47

Click on ‘Add standard service …’, paste in the URL to the Global Names Verifier reconciliation service (https://verifier.globalnames.org/api/v1/reconcile), and click ‘Add service’.
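For readers curious about what travels to this URL: services added this way follow the Reconciliation Service API, in which a batch of queries is posted as a form field named `queries` holding a JSON object. The Python sketch below only builds such a request body and does not contact the service. The helper name `build_reconcile_body` and the `q0`, `q1` keys are illustrative, and how the Global Names service treats every detail shown here is an assumption to check against its own documentation.

```python
import json
from urllib.parse import urlencode, parse_qs

# Endpoint taken from this page.
RECONCILE_URL = "https://verifier.globalnames.org/api/v1/reconcile"


def build_reconcile_body(names):
    """Build a form-encoded body with one reconciliation query per name.

    Hypothetical helper: the "q0", "q1", ... keys are arbitrary labels
    that let the caller pair each result with its query.
    """
    queries = {f"q{i}": {"query": name} for i, name in enumerate(names)}
    return urlencode({"queries": json.dumps(queries)})


body = build_reconcile_body([
    "Icteranthidium ferrugineum (Fabricius, 1787)",
    "Bulimus canarius Philippi, in Pfeiffer, 1867",
])

# Decode the body again to show what the service would receive.
decoded = json.loads(parse_qs(body)["queries"][0])
print(decoded["q0"]["query"])
```

A reconciliation client would send `body` to `RECONCILE_URL` in an HTTP POST with the content type `application/x-www-form-urlencoded`; the sketch stops short of the network call.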

OR_GNV_11 31 03

You can leave all the settings at their defaults for now (additional filtering features are described in the next section). ‘Auto-match candidates with high confidence’ should be selected, and you can choose to limit the number of possible matches. Click ‘Start reconciling …’.

OR_GNV_11 31 52

Filters to remove false positive matches

Optionally, you can use the higher_taxon and data_source_ids settings to restrict matches to names whose classification contains a given higher taxon name, or to names from particular data sources. For this to work, add columns to your data in which every row contains the desired setting.

image image image

Once the columns are created, add the filters in the reconciliation settings window.

image

Start reconciliation

OR_GNV_11 32 00

You will see the reconciliation tool make progress. Depending on how many records you have, this may take some time. In this example, reconciling 459 rows took well under one minute.

Processing reconciliation results

After reconciliation is complete, high-quality match candidates are matched automatically. In the example below, 439 of 459 rows were matched automatically. Reviewing all matches is recommended. You can undo a match by clicking ‘Choose new match’ for any of the results.

OR_GNV_11 32 32 crop

Rows that still need a manual decision (multiple candidate names and/or lower-confidence scores) appear under the judgment ‘none’ in the Facet/Filter reconciliation panel. Click on ‘none’ to show only the scientific names that need to be matched manually.

OR_GNV_11 33 01

If you hover your cursor over a potential match, you will see information about the verification source and the match. To select a match, click on the check mark. If the same scientific name appears in multiple rows of your dataset, clicking the double check mark applies the match to all rows with that name.

There may be a few cases where your data was not reconciled, and these will be judged as ‘(unreconciled)’. Click ‘(unreconciled)’ to isolate your view to those rows. NOTE: If you leave these unreconciled values in the column, they will be exported as though they were reconciled values. You do not want this!

OR_GNV_11 44 05

To remove the unreconciled values, access the menu > Edit cells > Common transforms > To null. This will convert the values to nulls, or blanks.

OR_GNV_11 44 23

You can now export the dataset as a spreadsheet in the format of your choice under the ‘Export’ menu at the top right of the OpenRefine window.

Best candidate's score filter

It helps to understand the Best candidate's score filter, which should appear on the left when reconciliation finishes. Its graph shows the distribution of match scores, with the best scores on the right and the worst on the left. The scores cluster into seven groups (canonical form here means a name without authorship):

  1. Canonical form of a name matched exactly, authorship matched well, data came from sources that are known to have some curation (these names normally match automatically).
  2. Canonical form of a name matched exactly, authorship matched well, but the source is of an unknown quality.
  3. Canonical form matched, but not authorship.
  4. Canonical form matched imperfectly (fuzzy matching).
  5. Canonical form only matched at the level of species (for infraspecies).
  6. Canonical form matched at the level of species imperfectly (fuzzy matching).
  7. Canonical form matched only at the level of genus.

The sliders on the graph let you isolate a particular group of results by score, and then either reconcile that group manually or apply a bulk decision to the whole group (discard it or set it as matched).

Score Graph ORefine

Add more data for reconciled names

It is possible to get additional information for matched results. Add new columns from the column of resolved results:

image

Select additional data from the list of available properties.

image

Processing names using gnparser for Wikidata reconciliation

Wikidata endpoint: https://wikidata.reconci.link/en/api

If you want to reconcile via the Wikidata endpoint, use the canonical form of a name (without authorship). Names that contain authorship can be converted to canonical form with the gnparser API. gnparser understands the inner structure of scientific names and splits their components into separate fields.

First create a new column using Add column by fetching URL...

image

Change expression to the following line: "https://parser.globalnames.org/api/v1/" + escape(value, "url"). Also change Throttle delay from 500 to 5.
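To see what URL this expression produces for a given name, the same construction can be sketched outside OpenRefine. The Python snippet below is an approximation: `parser_url` is a hypothetical helper, and Python's `quote` writes spaces as `%20`, whereas GREL's `escape(value, "url")` may encode them differently (for example as `+`).

```python
from urllib.parse import quote

# Base URL taken from the GREL expression on this page.
PARSER_BASE = "https://parser.globalnames.org/api/v1/"


def parser_url(name):
    """Percent-encode a scientific name and append it to the parser URL."""
    # safe="" also encodes "/" so the name stays a single path segment.
    return PARSER_BASE + quote(name, safe="")


url = parser_url("Bulimus canarius Philippi, in Pfeiffer, 1867")
print(url)
```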

image

At the end of the column creation, it will be filled with JSON-encoded data that looks like this:

[{"parsed":true,"quality":2,"qualityWarnings":[{"quality":2,"warning":"Ex authors are not required (ICZN only)"}],"verbatim":"Bulimus canarius Philippi, in Pfeiffer, 1867","normalized":"Bulimus canarius Philippi ex Pfeiffer 1867","canonical":{"stemmed":"Bulimus canar","simple":"Bulimus canarius","full":"Bulimus canarius"},"cardinality":2,"authorship":{"verbatim":"Philippi, in Pfeiffer, 1867","normalized":"Philippi ex Pfeiffer 1867","year":"1867","authors":["Philippi","Pfeiffer"]},"id":"eb69fa41-52ca-5d6b-9d16-d76d97a6dddc","parserVersion":"v1.7.4"}]

Next, extract the canonical form from the JSON-encoded data. Access the menu > Edit cells > Transform …

image

In the Expression field put parseJson(value)[0]["canonical"]["simple"]
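For comparison, the same extraction can be sketched in Python, which may help when checking results outside OpenRefine. The `cell` string is an abbreviated copy of the sample response above. The helper `simple_canonical` is hypothetical, and its guard for unparsed names assumes that such records carry `"parsed": false` and no usable `canonical` field.

```python
import json

# Abbreviated copy of the sample response shown on this page.
cell = (
    '[{"parsed":true,"verbatim":"Bulimus canarius Philippi, in Pfeiffer, 1867",'
    '"canonical":{"stemmed":"Bulimus canar","simple":"Bulimus canarius",'
    '"full":"Bulimus canarius"},"cardinality":2}]'
)


def simple_canonical(cell_value):
    """Return canonical.simple of the first parsed name, or None."""
    records = json.loads(cell_value)
    if not records or not records[0].get("parsed"):
        return None  # empty response or a name the parser could not parse
    return records[0]["canonical"]["simple"]


print(simple_canonical(cell))  # Bulimus canarius
```

The lookup `records[0]["canonical"]["simple"]` mirrors the GREL expression step by step: parse the JSON, take the first array element, then read `canonical` and `simple`.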

image

This new column now holds the canonical form of each name and can be used for reconciliation via Wikidata.

image