Created 28 August 2019
Updated 29 August 2019
This repository contains instructions and example files for using the reconcile-csv software to reconcile OpenRefine records against a local spreadsheet. This method can be used during data cleanup to match unique identifiers to records based on fuzzy matching and/or multiple criteria fields.
-
records.csv: contain sample record data
-
species.csv: contain sample species taxonomic data
-
Download reconcile-csv (http://okfnlabs.org/reconcile-csv/). It is recommended that reconcile-csv-0.1.2.jar be saved in the same directory as species.csv.
-
Start OpenRefine. Double-click on the diamond icon and it should automatically start in the default browser. If needed, navigate to
http://127.0.0.1:3333in a browser window. -
Create a new project, then choose
records.csv. Make sure to specify that columns are separated by comma, give the project a new Project name, then Create Project. -
Open a Command Line/Terminal window and navigate to the directory with reconcile-csv-0.1.2.jar and species.csv. Then run the following command:
java -Xmx2g -jar reconcile-csv-0.1.2.jar species.csv scientific_name taxon_id
(for very large CSV files, more memory may need to be allocated to the Java program. In that case, you can change the 2 in the above options to 4 or however many gigabytes of memory you would like to allocate.)
The following message should be printed:
Starting CSV Reconciliation service
Point refine to http://localhost:8000 as reconciliation service
<date> <time>:INFO:oejs.Server:jetty-7.x.y-SNAPSHOT
<date> <time>:INFO:oejs.AbstractConnector:Started [email protected]:8000
-
Go back to OpenRefine. Click on the dropdown arrow next to the species column header, go down to Reconcile, then click Start Reconciling...
-
At the bottom of the prompt, click Add Standard Service...
-
When prompted to enter the service's URL, enter
http://127.0.0.1:8000/reconcile, then click Add Service.
At this point, the Reconcile column "species" prompt should come up quickly. If not, and there is just a prolonged Working wheel, it means that there is a problem with the formatting of species.csv. reconcile-csv is very sensitive to CSV formatting, but unfortunately does not produce useful errors stating as much. Fix the formatting of species.csv; make sure every line has the expected number of values, and make sure there are no blank rows at the bottom of the file.
-
On the Reconcile column "species" prompt, the defaults should suffice. At the lower right, click the Start Reconciling button.
-
Once OpenRefine is done reconciling, each name in the species column will now be a blue link. Clicking on the link opens up a new tab containing the row that was matched in the CSV, and all the associated information to the matched species name.
- Several records that did not find perfect matches will need to be manually corrected. OpenRefine will show a list of likely matches, along with a decimal value showing how close each option matches the reconciled value. You can use the auto-generated species:judgment facet to isolate unmatched records if you would like.
- record_id 15: Bos taurus is not in species.csv. Click the single checkbox next to Create new item.
- record_id 24: Faciolela oxyrynca is misspelled. CLick the single checkbox next to Facciolella oxyrhyncha.
- record_id 36: Cofea farabica is misspelled. Click the single checkbox next to Coffea arabica.
-
Click on the dropdown arrow next to the species column header, go down to Edit column, then click Add column based on this column...
-
In the New column name box, enter
taxon_id. In the Expression box, entercell.recon.match.id. In the Preview tab, you should see the column populated with taxon ID numbers. Then click OK.
This is an example of adding kingdom data from the reconciliation CSV file to the dataset. This process can be repeated for as many columns as desired; just substitute the field appropriately.
-
Click on the dropdown arrow next to the taxon_id column header, go down to Edit column, then click Add column by fetching URLs.
-
In New Column Name, enter
html. In the Expression box, enter"http://127.0.0.1:8000/view/"+value. Then click OK. (It may take a while for this step to run.) -
Click on the dropdown arrow next to the html header, go down to Edit column, then click Add column based on this column...
-
In New column name, enter
kingdom. In the Expression box, entervalue.parseHtml().select("tr:contains(kingdom)")[0].select("td")[1].htmlText(). Then click OK. -
Steps 15 and 16 can be repeated for as many new columns as desired. For each, replace
kingdomin"tr:contains(kingdom)"with the CSV data field you would like to retrieve. -
Once all new data have been added, the html column can be removed.
Once all other cleanup operations have been completed, the new dataset can be exported with the Export dropdown menu at the top.
To stop the reconcile-csv server, type [Ctrl]+c in the Command Line/Terminal window. Alternatively, closing the Command Line/Terminal window should also stop the server.