
Foundations

Note
This module provides instruction to ensure all participants are at the same level before delving into the data mobilization topics. You will receive an introduction to the language, terminology and definitions for some of the basic concepts, functions and processes that you will be putting to use during the rest of the course. You will also receive an introduction to data quality and learn the importance of documentation. Lastly, you will be asked to install the OpenRefine software as part of this module.

Terminology

Definitions

Note
In this video (12:02), you will review terminology used in this course. If you are unable to watch the embedded video, you can download it locally. (MP4 - 38.5 MB)

Software

Note
In this video (05:58), you will review examples of the different types of applications and software available in the world of biodiversity mobilization informatics. If you are unable to watch the embedded video, you can download it locally. (MP4 - 18.9 MB)

Structures

Note
In this video (13:10), you will review the field and data types that hold data, the structures that help to organize and protect that data and what these mean for the integrity and security of your data. If you are unable to watch the embedded video, you can download it locally. (MP4 - 38.8 MB)

Data quality

Note
In this video (12:26), you will review the concept of data quality and how it applies to biodiversity data. If you are unable to watch the embedded video, you can download it locally. (MP4 - 44.5 MB)
Note
Below you will find a selected reading from Arthur Chapman’s guide “Principles of data quality”. Full document, references and translations can be found on GBIF.org.

Before a detailed discussion on data quality and its application to species-occurrence data can take place, there are a number of concepts that need to be defined and described. These include the term data quality itself, the terms accuracy and precision that are often misapplied, and what we mean by primary species data and species-occurrence data.

Species-occurrence data

Species-occurrence data is used here to include specimen label data attached to specimens or lots housed in museums and herbaria, observational data and environmental survey data. In general, the data are what we term “point-based”, although line (transect data from environmental surveys, collections along a river), polygon (observations from within a defined area such as a national park) and grid data (observations or survey records from a regular grid) are also included. In general we are talking about georeferenced data – i.e. records with geographic references that tie them to a particular place in space – whether with a georeferenced coordinate (e.g. latitude and longitude, UTM) or not (textual description of a locality, altitude, depth) – and time (date, time of day).

In general the data are also tied to a taxonomic name, but unidentified collections may also be included. The term has occasionally been used interchangeably with the term “primary species data”.

Primary species data

“Primary species data” is used to describe raw collection data and data without any spatial attributes. It includes taxonomic and nomenclatural data without spatial attributes, such as names, taxa and taxonomic concepts without associated geographic references.

Accuracy and Precision

Accuracy and precision are regularly confused and the differences are not generally understood.

Accuracy refers to the closeness of measured values, observations or estimates to the real or true value (or to a value that is accepted as being true – for example, the coordinates of a survey control point).

Precision (or Resolution) can be divided into two main types. Statistical precision is the closeness with which repeated observations conform to themselves. It says nothing about their relationship to the true value: observations may have high precision but low accuracy. Numerical precision is the number of significant digits with which an observation is recorded, and it has become far more conspicuous with the advent of computers. For example, a database may output a decimal latitude/longitude record to 10 decimal places – i.e. ca. 0.01 mm – when in reality the record has a resolution no greater than 10–100 m (3–4 decimal places). This often leads to a false impression of both the resolution and the accuracy.
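To make the decimal-places point concrete, here is a minimal sketch in Python that converts a number of decimal places of latitude into an approximate ground distance. The figure of 111,320 m per degree of latitude is a rough rule of thumb used only for illustration, not a geodetic calculation.

[source,python]
----
# Approximate ground distance implied by the last decimal place of a latitude
# value. One degree of latitude spans roughly 111,320 m; this is a rule of
# thumb for illustration, not a geodetic computation.
METRES_PER_DEGREE = 111_320

def implied_resolution_m(decimal_places: int) -> float:
    """Ground resolution (in metres) implied by a given number of decimal places."""
    return 10 ** -decimal_places * METRES_PER_DEGREE

for places in (3, 4, 10):
    print(f"{places} decimal places ≈ {implied_resolution_m(places):.5f} m")

# 3 decimal places  ≈ 111.32000 m
# 4 decimal places  ≈ 11.13200 m
# 10 decimal places ≈ 0.00001 m  (about 0.01 mm -- far finer than any real
#                                 collecting event can actually be located)
----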

These terms – accuracy and precision – can be applied to non-spatial data as well as to spatial data. For example, a collection may have an identification to subspecies level (i.e. high precision) but be the wrong taxon (i.e. low accuracy), or be identified only to Family level (high accuracy, but low precision).

Data quality

Data quality is multidimensional, and involves data management, modelling and analysis, quality control and assurance, storage and presentation. As independently stated by Chrisman (1991) and Strong et al. (1997), data quality is related to use and cannot be assessed independently of the user. In a database, the data have no actual quality or value (Dalcin 2004); they only have potential value that is realized only when someone uses the data to do something useful. Information quality relates to its ability to satisfy its customers and to meet customers’ needs (English 1999).

Redman (2001) suggested that for data to be fit for use they must be accessible, accurate, timely, complete, consistent with other sources, relevant, comprehensive, provide a proper level of detail, and be easy to read and easy to interpret.

One issue that a data custodian may need to consider is what may need to be done with the database to increase its usability to a wider audience (i.e. increase its potential use or relevance) and thus make it fit for a wider range of purposes. There will be a trade-off between the increased usability and the amount of effort required to add the extra functionality. This may require such things as atomizing data fields, adding georeferencing information, etc.
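As an illustration of atomizing a data field, the sketch below splits a single verbatim date string into separate day, month and year fields. The input format and the field names (verbatimDate, day, month, year) are assumptions made for this example, not a prescribed standard.

[source,python]
----
from datetime import datetime

def atomize_date(verbatim: str) -> dict:
    """Split a verbatim 'DD/MM/YYYY' date string into separate fields.

    The input format and the output field names are illustrative only;
    real datasets will need their own parsing rules.
    """
    parsed = datetime.strptime(verbatim, "%d/%m/%Y")
    return {
        "verbatimDate": verbatim,  # keep the original value alongside the parts
        "day": parsed.day,
        "month": parsed.month,
        "year": parsed.year,
    }

print(atomize_date("07/03/1998"))
# {'verbatimDate': '07/03/1998', 'day': 7, 'month': 3, 'year': 1998}
----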

Quality Assurance / Quality Control

The difference between quality control and quality assurance is not always clear. Taulbee (1996) makes the distinction between Quality Control and Quality Assurance and stresses that one cannot exist without the other if quality goals are to be met. She defines Quality Control as a judgement of quality based on internal standards, processes and procedures established to control and monitor quality; and Quality Assurance as a judgement of quality based on standards external to the process, i.e. the reviewing of activities and quality control processes to ensure that the final products meet predetermined standards of quality.

In a more business-oriented approach, Redman (2001) defines Quality Assurance as “those activities that are designed to produce defect-free information products to meet the most important needs of the most important customers, at the lowest possible cost”.

How these terms are to be applied in practice is not clear, and in most cases the terms seem to be largely used synonymously to describe the overall practice of data quality management.

Uncertainty

Uncertainty may be thought of as a “measure of the incompleteness of one’s knowledge or information about an unknown quantity whose true value could be established if a perfect measuring device were available” (Cullen and Frey 1999). Uncertainty is a property of the observer’s understanding of the data, and is more about the observer than the data per se. There is always uncertainty in data; the difficulty is in recording, understanding and visualizing that uncertainty so that others can also understand it. Uncertainty is a key term in understanding risk and risk assessment.

Error

Error encompasses both the imprecision of data and their inaccuracies. Many factors contribute to error. Error is generally seen as being either random or systematic. Random error refers to deviation from the true state in a random manner. Systematic error, or bias, arises from a uniform shift in values and is sometimes described as having ‘relative accuracy’ in the cartographic world (Chrisman 1991). In determining ‘fitness for use’, data with systematic error may be acceptable for some applications and unfit for others.

An example may be the use of a different geodetic datum, which, if used consistently throughout the analysis, may not cause any major problems. Problems will arise, though, where an analysis combines data from different sources with different biases – for example, data sources that use different geodetic datums, or where identifications have been carried out using an earlier version of a nomenclatural code.

“Because error is inescapable, it should be recognized as a fundamental dimension of data” (Chrisman 1991). Only when error is included in a representation of the data is it possible to answer questions about limitations in the data, and even limitations in current knowledge. Known errors in the three dimensions of space, attribute and time need to be measured, calculated, recorded and documented.

Validation and Cleaning

Validation is a process used to determine if data are inaccurate, incomplete, or unreasonable. The process may include format checks, completeness checks, reasonableness checks, limit checks, review of the data to identify outliers (geographic, statistical, temporal or environmental) or other errors, and assessment of data by subject area experts (e.g. taxonomic specialists). These processes usually result in flagging, documenting and subsequent checking of suspect records. Validation checks may also involve checking for compliance against applicable standards, rules, and conventions. A key stage in data validation and cleaning is to identify the root causes of the errors detected and to focus on preventing those errors from re-occurring (Redman 2001).
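A minimal sketch of such limit and completeness checks is shown below, assuming each record is a simple dictionary keyed by Darwin Core-style field names; a real validation workflow would also cover taxonomic, temporal and environmental outliers and expert review.

[source,python]
----
def validate_record(record: dict) -> list:
    """Return a list of flags for one occurrence record.

    The checks mirror the limit and completeness checks described in the
    text; the field names follow Darwin Core style but are assumptions here.
    """
    flags = []
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat is None or lon is None:
        flags.append("MISSING_COORDINATES")
    else:
        if not -90 <= lat <= 90:
            flags.append("LATITUDE_OUT_OF_RANGE")
        if not -180 <= lon <= 180:
            flags.append("LONGITUDE_OUT_OF_RANGE")
        if lat == 0 and lon == 0:
            flags.append("SUSPECT_ZERO_COORDINATES")
    if not record.get("scientificName"):
        flags.append("MISSING_SCIENTIFIC_NAME")
    return flags

suspect = {"scientificName": "Puma concolor",
           "decimalLatitude": 123.4, "decimalLongitude": -70.1}
print(validate_record(suspect))  # ['LATITUDE_OUT_OF_RANGE']
----

Flagged records are then documented and checked rather than silently discarded.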

Data cleaning refers to the process of “fixing” errors in the data that have been identified during the validation process. The term is synonymous with “data cleansing”, although some use data cleansing to encompass both data validation and data cleaning. It is important in the data cleaning process that data are not inadvertently lost, and that changes to existing information are carried out very carefully. It is often better to retain both the old (original) data and the new (corrected) data side by side in the database, so that if mistakes are made in the cleaning process the original information can be recovered.
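Following that advice, the sketch below applies a correction while keeping the original value in a separate column and documenting the reason for the change. The naming convention (verbatim_<field>, remarks) is just an example, not a requirement.

[source,python]
----
def apply_correction(record: dict, field: str, corrected, reason: str) -> dict:
    """Correct a field without discarding the original value.

    Returns a copy of the record: the original value is kept in a
    'verbatim_<field>' column and the reason for the change is documented.
    """
    cleaned = dict(record)
    cleaned[f"verbatim_{field}"] = record.get(field)  # retain the original value
    cleaned[field] = corrected
    cleaned.setdefault("remarks", []).append(f"{field} corrected: {reason}")
    return cleaned

record = {"country": "Colmbia"}
print(apply_correction(record, "country", "Colombia", "typo in country name"))
# {'country': 'Colombia', 'verbatim_country': 'Colmbia',
#  'remarks': ['country corrected: typo in country name']}
----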

Documentation

Note
In this video (09:47), we will provide an overview of the importance of documentation as it relates to data management and data publishing. You will learn about data mapping, data relationships and metadata. If you are unable to watch the embedded video, you can download it locally. (MP4 - 29.2 MB)

Digitization Workflows

Note
This video (07:20) on Digitization Workflows identifies five clusters (or stages) in the process of digitizing natural history collection objects using digital images, and these stages can be easily adapted to other biodiversity data sources. If you are unable to watch the embedded video, you can download it locally. (MP4 - 26.8 MB)
Tip
As the video highlights, digitization protocols vary from institution to institution, but it is essential that the chosen protocol is agreed, documented and respected.

We do not teach digitization per se during the workshop, as it can easily stand as a week-long course on its own; instead, we focus on a basic introduction to biodiversity data capture. However, we want to provide you with resources on digitization, as we know many are interested in this.

There are many ways to organize digitization efforts, so digitization can seem daunting to begin with. It is important to remember that in most cases someone else has already tried to digitize the same types of specimens and objects that you are planning to digitize. In this exercise we introduce you to some practical digitization workflow resources to help get you started. These will also form the basis for work we will do in the workshop on selecting, modifying and assessing workflows.

Some steps in the process may include:

  • Pre-digitization curation and staging: This includes the preparation of the data source for the digitization process, including the assignment of unique identifiers that make it possible to refer to the source without error and to keep all derived information together (see the identifier sketch after this list).

  • Image capture: This includes a fair amount of planning, not only on the image capture itself (e.g. definition of the work sequence, selection of adequate hardware), but also on how and where the images will be stored and handled.

  • Image processing: This includes quality control, file conversion, etc.

  • Electronic data capture: The core of the digitization process, which consists of capturing key information in a database. The video highlights that the most common method of entering the information is through a keyboard, but more and more institutions are turning to advanced data-entry technologies.

  • Georeferencing: Geographical information is very important for biodiversity analysis, so digitization projects should seek to extract the most accurate geographical information possible.
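As mentioned in the pre-digitization step above, assigning stable unique identifiers keeps every image and database record tied to its source object. The sketch below shows one simple way to mint such identifiers with UUIDs; the catalogue-number value and the field names are hypothetical examples.

[source,python]
----
import uuid

def assign_identifier(catalogue_number: str) -> dict:
    """Pair a catalogue number with a newly minted, globally unique identifier.

    The UUID can then be reused to name image files and database records
    derived from the same physical object.
    """
    return {
        "catalogueNumber": catalogue_number,  # hypothetical example value
        "occurrenceID": str(uuid.uuid4()),    # globally unique, stable identifier
    }

specimen = assign_identifier("HERB-000123")
print(specimen)
# e.g. {'catalogueNumber': 'HERB-000123',
#       'occurrenceID': '9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d'}
----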

Integrated Digitized Biocollections (iDigBio) is the coordination centre for the United States National Resource for Advancing Digitization of Biodiversity Collections (ADBC). They lead a nation-wide effort to make data and images for millions of biological specimens available in a standard electronic format for the research community, government agencies, students, educators, and the general public. They have produced several videos that discuss the digitization process.

There are other videos in the iDigBio series that you may be interested in if you wish to learn more about specific workflows for different specimen types.

Software tools

Note
Review software tools used in biodiversity informatics

During the course activities, we’ll demonstrate and work with many different software tools related to data digitization, data quality and transformation. You probably already use several of them in your daily work.

Community trainers, mentors and former course participants have compiled a list with information about biodiversity informatics software tools. It provides links to their main websites, key facts, and a summary of strong and weak points.

Download Software-database-EN.xlsx. (23 KB)

When analysing biodiversity software that you have not used before, you need to consider how you would adapt it for your purposes. Below is a list of criteria with which you can start your evaluation, inspired by the chapter “characteristics of a good database solution” of the GBIF manual “Initiating a Digitisation Project”:

  • Price: Often one of the most decisive factors. Beware of costs beyond the price of the software license, such as the hardware needed to run it, maintenance, upgrades, and the expertise required to operate it.

  • Functionality: You need to be clear about what you expect the software to achieve, and make sure it does so efficiently. Do not get distracted by additional functionality that can make the software unnecessarily complex.

  • Stability: Solutions that have been on the market for a long time and are supported by solid institutions or companies are more likely to be bug-free and/or to have good systems in place to solve any issues that arise. They are also more likely to be updated and ported to more modern operating systems.

  • Scalability: Some software performs very well when demoed out of the box, but its performance degrades over time, with larger amounts of data, or when several users access it simultaneously. Check the opinions of other users online.

  • Integration: Make sure that the software accepts and produces the data formats that you use and need. Data transformation is a time-consuming task.

  • Language support: It is essential that everyone using the software can understand its interface and the documentation needed to use it.

  • Documentation and technical support: Make sure to explore the existing documentation and support mechanisms. You can be sure that at some point you will need them.

  • Learning curve: Some software requires specific training to learn how to use it, while other tools are more intuitive and can be learned as you use them, supported by in-line help systems.

Install OpenRefine

Note
Install software required for activities later in the course

OpenRefine is a tool with a set of features for working with tabular data that improve the overall quality of a dataset. It is an application that runs on your own computer as a small web server; to use it, you point your web browser at that local server. Think of OpenRefine as a personal, private web application.

We will use OpenRefine during the data mobilization portion of the course, especially during the practical exercises. It will be necessary to install OpenRefine on your laptop. If you are a skilled computer user, you can follow these steps to install the software on your computer. If you are not confident, please ask for help. Refer to the OpenRefine download page for more details.

Caution
Administrative passwords may be required to install software.

Installation Requirements

  1. Linux users only: Java JRE installed.

  2. Google Chrome, Microsoft Edge or Mozilla Firefox installed. Internet Explorer is not supported.

Note
The latest stable release is OpenRefine 3.4.1, released on September 24, 2020. Detailed installation instructions are available at https://docs.openrefine.org/manual/installing/.

Installation on MS Windows

  1. Download the Windows kit with embedded Java. Choose to save the file rather than open it.

  2. Find the downloaded file, right-click it, and choose "Extract all…". After extracting, double-click openrefine.exe (or refine.bat if the former does not work).

  3. A command window will appear (don’t close it) and soon after a new web browser window will show the application.

Detailed instructions for MS Windows (click to expand)
  1. win1 save download

    Download the Windows kit with embedded Java. Choose to save the file rather than open it.

  2. win2 extract all

    Find the file you downloaded. Right click it, and choose "Extract All…"

  3. win3 extract location

    Click "Extract"

  4. win4 create shortcut

    Find the extracted files. Optionally, right click "openrefine" and choose "Send to → Desktop (create shortcut)" to create a shortcut on your desktop. Then double click "openrefine"

  5. win5 run

    A black console window opens, and a short time later the browser opens. OpenRefine is now ready to use.

Installation on Mac

  1. Download the Mac kit.

  2. Open the downloaded file and drag the icon into the Applications folder. You do not need to install Java separately.

  3. Double click on it and a new web browser window will show the application.

Detailed instructions for Mac (click to expand)
  1. mac01 download open

    Download the Mac kit, and choose to open it.

  2. mac02 unidentified developer

    A warning is shown. Click "OK".

  3. mac03 system preferences

    Open System Preferences.

  4. mac04 security and privacy

    Open Security & Privacy

  5. mac05 open anyway

    Choose "Open Anyway" at the bottom.

  6. mac06 yes really open

    Choose "Open"

  7. mac07 copy to applications

    Finally, the application archive is opened! Drag it to your Applications folder.

  8. mac08 who owns this computer anyway

    Double-click the OpenRefine icon. Another security warning appears!

  9. mac09 not the user

    Go back to "Security & Privacy" and click "Open Anyway" — again.

  10. mac10 use linux instead

    (To avoid these warnings, the OpenRefine developers would need to pay Apple.)

    Click "Open".

  11. mac11 run

    Finally! The application is running.

Installation on Linux

  1. Download the Linux kit.

  2. Extract it, then type ./refine in the extracted folder to start. This requires Java to be installed on your computer.

Detailed instructions for Linux (click to expand)

These instructions are for KDE (e.g. Kubuntu, SuSE), but the process is similar for Gnome (e.g. Ubuntu, Red Hat, CentOS).

  1. kde1 download open

    Download the Linux kit. Open it.

  2. kde2 extract

    Click "Extract" to unpack the downloaded application.

  3. kde3 extract all

    Choose a suitable place. I also selected "Open destination folder after extraction" and "Close Ark after extraction"

  4. kde4 open

    Right click "refine" and choose "Run in Konsole". This is needed so you can safely exit OpenRefine later, by closing the Konsole window.

  5. kde5 execute

    Confirm that you wish to execute the downloaded application.

  6. kde6 run

    OpenRefine is now running.

Foundations review

Note
Quiz yourself on the concepts learned in this section.
  1. For the given statement, input the correct term (database, database language, database program)

    * combines and presents functions and features for manipulating data, together in a unified interface +
      __database program__
    * structured and organized collection of data and/or information held on a computer +
      __database__
    * the way by which a human communicates with a computer +
      __database language__
  2. If you open a data file and see the following, what would you suspect is the issue?

    �tre, ou ne pas �tre, c�est l� la question.

    - [ ] Nothing
    - [ ] It is corrupt
    - [x] The wrong encoding was used to open the file
    - [ ] The sender used a weird font
  3. For the given software, input the type of software (data capture, data management, data cleaning, data publishing).

    * Integrated Publishing Toolkit (IPT) +
      __data publishing__
    * Specify +
      __data capture__ AND __data management__
    * iNaturalist +
      __data capture__
    * OpenRefine +
      __data cleaning__
  4. For the given example, input the correct data type (binary, boolean, float, integer, long integer, text, unstructured text)

    * 1236975 +
      __long integer__
    * 01101111 +
      __binary__
    * We walked 5 miles down the road west from the post office in the center of town. We then went 2 miles north on a dirt path to the river. Then we continued west along the river for another 5 miles. +
      __unstructured text__
    * 1024 +
      __integer__
    * 29.0 +
      __float__
    * Yes/No +
      __boolean__
    * 6 rabbits were observed +
      __text__
  5. Which of these terms describes a "field/column name"?

    - [x] Assigned
    - [ ] Descriptive
    - [x] Identifying
    - [ ] Readable
    - [x] Unique
    - [ ] User-interface
  6. Which of these terms describes a "field label"?

    - [ ] Assigned
    - [x] Descriptive
    - [ ] Identifying
    - [x] Readable
    - [ ] Unique
    - [x] User-interface
  7. For each statement, input the correct structure (row, column, table)

    * All data refers to a SINGLE concept. +
      __table__
    * An attribute has the SAME field/data type for every record. +
      __column__
    * Attributes of a record ALWAYS stay together. +
      __row__
  8. Who determines the fitness for use of your data?

    - [ ] The museum or department director
    - [x] The users of the data for research or education
    - [ ] The collector of the data in the field
    - [ ] The person entering the data into the database
  9. For the given statements, input the matching image (A, B, C, D) from the accuracy and precision figure.

    * High accuracy, low precision +
      __C__
    * Low accuracy, high precision +
      __B__
    * High accuracy, high precision +
      __D__
    * Low accuracy, low precision +
      __A__
  10. Identify the data relationships where Dataset B needs to be merged into Dataset A (0:1, 1:0, 1:1, 1:∞, ∞:1, ∞:∞). Not all the relationships are used.

    * Collector field exists in both dataset A and B +
      __1:1__
    * Country field only exists in dataset B +
      __0:1__
    * Name field exists in dataset A, but dataset B contains First Name and Last Name fields +
      __1:∞__
    * ID field exists in both dataset A and B +
      __1:1__
    * Elevation exists in dataset A, but not in dataset B +
      __1:0__
    * Date exists in dataset A, but Day, Month, and Year are separate fields in dataset B +
      __1:∞__
  11. Metadata is important because (select the TRUE statements):

    - [x] it allows users to determine if a dataset is fit for their use.
    - [ ] it allows you to share exact coordinates for each occurrence.
    - [x] it allows you to know under which legal terms the reuse of data is permitted.
    - [ ] it also applies to all supplemental and associated materials, including images, video, and other media.
    - [ ] it allows you to know about the institution's next exhibition/opening hours.