Data management

Note
In this module, you will review the main concepts, tools and best practices for data management, with a focus on data cleaning and standardization.

Principles of data management

Note
In this video (09:49), you will review a set of principles for improving data through data cleaning. If you are unable to watch the embedded video, you can download it locally. (MP4 - 16.6 MB)

Data management tools

Note
In this video (06:42), you will learn about a variety of tools that you can use to improve the quality of your data. If you are unable to watch the embedded video, you can download it locally. (MP4 - 10.3 MB)

Exercise 3a-c

Note
For these exercises, you will perform technical and consistency validation checks, improve data with different tools, and learn how to use OpenRefine.

Read USE CASE I (if you haven’t already).

Your institution is part of the “Global Poales Association (GPA)”. This association has secured funding to publish an up-to-date flora of the group and has asked your herbarium to participate by providing any high-quality records you may have on this order of plants. The order is well represented in your collection, so you think you could contribute substantially to this effort.

Exercise 3a

Validation checks

In this exercise we will focus on technical errors and perform a basic validation check to identify them. Refer to Validation checks for information on the types of errors.

  1. Download UC1-3ab-data-cleaning.csv. (207.5 KB)

  2. Import the CSV file into Excel using the Excel import wizard. See Excel-tips-EN.pdf (PDF, 7 MB) for import instructions for your operating system (Windows, Mac, Linux).

  3. Find and correct the errors manually.

  4. Use the previously downloaded exercise sheet to provide your answers.

Exercise 3b

Other data management tools

The GPA has given you a checklist of data quality elements to verify:

  • All plant names (full name) are correctly spelled

  • All plant names belong to the order

  • All records have coordinates

  • All coordinates fall inside the stated country and are in decimal format

  • All dates are in the proper column and in the format YYYY-MM-DD

The three categories of errors are:

  • Nomenclatural errors

  • Format errors

  • Geographic errors / outliers

  1. Refer to Helpful tools to complete the exercise. You are not limited to these tools; you may use any tools you like.

  2. Use the same file as in the previous exercise (3a).

  3. Make corrections ONLY for the Eriocaulaceae family (you may want to filter the data).

  4. Correct the errors you find using the tools of your choice, and document each change in the exercise sheet.

  5. Correct the entire file if you have time.

  6. Use the previously downloaded exercise sheet to provide your answers.
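One item on the checklist above asks for coordinates in decimal format. As a minimal sketch, assuming the source values are written as degrees, minutes and seconds followed by a hemisphere letter (for example 12°34'56"S), the conversion could look like the following. The function name dms_to_decimal and the accepted notation are illustrative assumptions, not part of the exercise files; other notations in the dataset would need their own handling.

```python
import re

# Assumed notation: degrees, minutes, seconds, then a hemisphere letter,
# separated by any non-digit characters (e.g. 12°34'56"S or 12 34 56 S).
DMS_PATTERN = re.compile(
    r"^\s*(\d{1,3})\D+(\d{1,2})\D+(\d{1,2}(?:\.\d+)?)\D*([NSEW])\s*$",
    re.IGNORECASE,
)


def dms_to_decimal(text):
    """Convert a degrees/minutes/seconds string to decimal degrees.

    Raises ValueError when the string does not match the assumed notation
    or when a component is out of range, so the record can be reviewed
    manually instead of being converted silently.
    """
    match = DMS_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized coordinate: {text!r}")

    degrees = int(match.group(1))
    minutes = int(match.group(2))
    seconds = float(match.group(3))
    hemisphere = match.group(4).upper()

    if minutes >= 60 or seconds >= 60:
        raise ValueError(f"minutes or seconds out of range: {text!r}")

    value = degrees + minutes / 60 + seconds / 3600

    # Latitudes are limited to 90 degrees, longitudes to 180.
    limit = 90 if hemisphere in "NS" else 180
    if value > limit:
        raise ValueError(f"degrees out of range: {text!r}")

    # South and West are negative in decimal notation.
    if hemisphere in "SW":
        value = -value
    return round(value, 6)
```

Raising an error for unrecognized values, rather than guessing, keeps a record of which rows still need manual attention, which is useful when documenting changes in the exercise sheet.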

Exercise 3c

Note
In this video (03:27), you will learn about OpenRefine. You can use OpenRefine to standardize and improve the quality of your data. If you are unable to watch the embedded video, you can download it locally. (MP4 - 3.8 MB)

OpenRefine

In this exercise we use OpenRefine to improve the quality of a dataset by using the default features, existing web services and regular expressions.

  1. Download UC1-3c-open-refine.csv. (207.5 KB)

  2. Download and complete the exercises in OpenRefine-Exercise3c-EN.pdf. (PDF, 1.1 MB) Also available in French and Spanish.

  3. Use the previously downloaded exercise sheet to provide your answers.

Exercise tips

Validation checks

Technical errors

Relatively simple checks on the integrity of the data, which can often be automated. Errors of this kind may indicate incorrect exports, faulty data mapping, field slippage (e.g. values shifted one column to the right) or data missing at the source.

  • Completeness: Whether all the data and metadata are available: are all fields present, and are all fields filled out?

  • Bounds: For example, are days given in the range 1-31 (depending on the month)?

  • Data type: For example, does the Date field contain a date or a number?

  • Data format: For example, are Dates provided as 01/01/2010 or 01/Jan/10?
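The four technical checks above lend themselves to automation. The following is a minimal sketch in Python, assuming each row has been read into a dictionary of strings (for example with csv.DictReader). The column names scientificName, eventDate and decimalLatitude are hypothetical and may not match the headers in the exercise file.

```python
import re
from datetime import date

# Hypothetical column names; adjust them to the headers of your own file.
REQUIRED_FIELDS = ("scientificName", "eventDate", "decimalLatitude")
ISO_DATE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")


def technical_issues(record):
    """Return a list of technical problems found in one record (a dict)."""
    issues = []

    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not (record.get(field) or "").strip():
            issues.append(f"{field}: missing")

    # Data format, then bounds: the date must look like YYYY-MM-DD and
    # must also exist in the calendar (no 30 February, no month 13).
    raw_date = (record.get("eventDate") or "").strip()
    if raw_date:
        match = ISO_DATE.match(raw_date)
        if match is None:
            issues.append("eventDate: not in YYYY-MM-DD format")
        else:
            year, month, day = map(int, match.groups())
            try:
                date(year, month, day)
            except ValueError:
                issues.append("eventDate: not a valid calendar date")

    # Data type, then bounds: latitude must be a number between -90 and 90.
    raw_lat = (record.get("decimalLatitude") or "").strip()
    if raw_lat:
        try:
            latitude = float(raw_lat)
        except ValueError:
            issues.append("decimalLatitude: not a number")
        else:
            if not -90 <= latitude <= 90:
                issues.append("decimalLatitude: outside -90..90")

    return issues
```

A record with no problems returns an empty list, so the function can be applied to every row and only the rows with a non-empty result need to be reviewed.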

Consistency errors

Application of real-world rules to the data. These errors may indicate incorrect data entry from older records, transcription errors or post-processing problems. Some checks are complex to implement and require reference data sets to check against, e.g. a list of known collectors and their collecting habits. The rules themselves can be gathered from data users and analysts.

  • Taxonomic: For example, if a record is identified to species level, does it have a binomial scientific name and entries in both the genus and species fields?

  • Currency: Are dates of collection, identification, update and digitization consistent?

  • Outliers: Detect outliers, but remember that not all outliers are necessarily errors. For example, compare against a known species range, or known environmental range (but remember that outliers may be misidentifications, rather than incorrect coordinates).

  • Geographic: Are the coordinates within the identified locality or region? For example, are there any terrestrial occurrences in the sea or marine occurrences on land?

  • Collecting patterns: Does the occurrence detail match the known collecting patterns of the organization or collector? Do any records appear to have been created after a collector has died (could this possibly be a different collector with a similar name)? For example, are any mammal records attributed to a bird watching group?

  • Accuracy and precision: For example, are any georeferenced records indicating very high precision or accuracy from a pre-GPS (or pre-accurate GPS) collecting period?

  • Collecting methods: Different survey methods (e.g. transects and area surveys) have particular characteristics. Are the records consistent with the method provided?
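Two of the consistency checks above, currency and geographic, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a complete validator: the dictionary keys (collected, identified, lat, lon) are hypothetical, and the geographic test uses a simple bounding box, which is only a coarse approximation of a country's outline and does not handle areas that cross the 180° meridian.

```python
from datetime import date


def consistency_issues(record, bbox):
    """Return a list of consistency problems found in one record.

    record: dict with 'collected' and 'identified' (datetime.date values)
            and 'lat' / 'lon' (floats); all key names are hypothetical.
    bbox:   (min_lat, min_lon, max_lat, max_lon), a rough box around the
            country stated in the record.
    """
    issues = []

    # Currency: a specimen cannot be identified before it was collected.
    if record["identified"] < record["collected"]:
        issues.append("identified before collected")

    # Geographic: coarse test against the bounding box of the stated country.
    min_lat, min_lon, max_lat, max_lon = bbox
    inside = (min_lat <= record["lat"] <= max_lat
              and min_lon <= record["lon"] <= max_lon)
    if not inside:
        issues.append("coordinates outside stated country box")

    return issues


# Made-up values for illustration only; not a real country outline.
EXAMPLE_BOX = (0.0, 0.0, 10.0, 10.0)
example = {
    "collected": date(2001, 5, 10),
    "identified": date(1999, 1, 1),
    "lat": 50.0,
    "lon": 5.0,
}
print(consistency_issues(example, EXAMPLE_BOX))
```

A record flagged by the bounding box is a candidate for review rather than a confirmed error: as noted above for outliers, the coordinates may be correct and the country or the identification may be wrong instead.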

Review

Note
Quiz yourself on the concepts learned in this section.

  1. Why is it best to clean your data?

    - [x] to make them as fit for use as possible
    - [x] to achieve your data quality goals
    - [ ] data should be cleaned by the users, not the providers
  2. How should you organize your data cleaning workflow?

    - [ ] work alone, you know your data best
    - [x] ask your colleagues for expertise
    - [x] work at an institutional level to harmonize data quality workflows
  3. Which is best?

    - [x] prevent errors from occurring
    - [x] correct errors as soon as you find them in your database or spreadsheet
    - [ ] not cleaning errors but documenting them as you go, so people who reuse your data know where they are
  4. Whose responsibility is data quality?

    - [ ] The person(s) who record data in the field
    - [ ] The data transcribers
    - [ ] The database manager
    - [x] Everyone involved in the management of data
    - [ ] The people who use your data
    - [ ] GBIF
  5. Which tools can be used to clean your data?

    - [x] Excel & other spreadsheet management tools
    - [x] OpenRefine
    - [x] Your database software
    - [x] Online tools such as Scientific Names Resolver or Google Maps