Appendix: Solutions

Note
This appendix contains the answers to all of the review quizzes, along with additional information. It also contains a suggested solution to USE CASE I.

Foundations review solutions

For the given statement, input the correct term (database, database language, database program)

  • combines and presents functions and features for manipulating data, together in a unified interface
    database program

  • structured and organized collection of data and/or information held on a computer
    database

  • the way by which a human communicates with a computer
    database language

If you open a data file and see the following, what would you suspect is the issue?

�tre, ou ne pas �tre, c�est l� la question.

  • The wrong encoding was used to open the file
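
As a rough illustration of how this kind of garbling arises, here is a minimal Python sketch, assuming the file was saved in a Latin-1-style encoding (Windows-1252) and then opened as UTF-8:

  # Minimal sketch: text saved with one encoding but opened with another.
  original = "Être, ou ne pas être, c’est là la question."

  # Assume the file was written using the Windows-1252 encoding...
  raw_bytes = original.encode("cp1252")

  # ...but is then opened as UTF-8. The accented characters are not valid
  # UTF-8 byte sequences, so each one is shown as the replacement character "�".
  garbled = raw_bytes.decode("utf-8", errors="replace")
  print(garbled)  # �tre, ou ne pas �tre, c�est l� la question.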

For the given software, input the type of software (data capture, data management, data cleaning, data publishing).

  • Integrated Publishing Toolkit (IPT)
    data publishing

  • Specify
    data capture AND data management

  • iNaturalist
    data capture

  • OpenRefine
    data cleaning

For the given example, input the correct data type (binary, boolean, float, integer, long integer, text, unstructured text). A brief illustration follows the answers.

  • 1236975
    long integer

  • 01101111
    binary

  • We walked 5 miles down the road west from the post office in the center of town. We then went 2 miles north on a dirt path to the river. Then we continued west along the river for another 5 miles.
    unstructured text

  • 1024
    integer

  • 29.0
    float

  • Yes/No
    boolean

  • 6 rabbits were observed
    text
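
As a rough illustration (a Python sketch; the variable names are invented and not part of the quiz), the example values above could be represented as follows:

  # Illustrative only: how the quiz values above map onto common data types.
  catalog_number = 1236975                     # long integer: a large whole number
  bit_pattern    = 0b01101111                  # binary: a value written in base 2
  specimen_count = 1024                        # integer: a (smaller) whole number
  elevation_m    = 29.0                        # float: a number with a decimal part
  is_present     = True                        # boolean: Yes/No, True/False
  field_note     = "6 rabbits were observed"   # text: a short free-form string
  locality_notes = ("We walked 5 miles down the road west from the post office "
                    "in the center of town...")  # unstructured text: narrative prose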

Which of these terms describes a "field/column name"?

  • Assigned

  • Identifying

  • Unique

Which of these terms describes a "field label"?

  • Descriptive

  • Readable

  • User-interface

For each statement, input the correct structure (row, column, table)

  • All data refers to a SINGLE concept.
    table

  • An attribute has the SAME field/data type for every record.
    column

  • Attributes of a record ALWAYS stay together.
    row

Who determines the fitness for use of your data?

  • The users of the data for research or education

For the given statements, input the matching image (A, B, C, D). [Image: accuracy and precision diagram]

  • High accuracy, low precision
    C

  • Low accuracy, high precision
    B

  • High accuracy, high precision
    D

  • Low accuracy, low precision
    A

Identify the data relationships where Dataset B needs to be merged into Dataset A (0:1, 1:0, 1:1, 1:∞, ∞:1, ∞:∞). Not all of the relationships are used. A short illustration of two of these relationships follows the answers.

  • Collector field exists in both dataset A and B
    1:1

  • Country field only exists in dataset B
    0:1

  • Name field exists in dataset A, but dataset B contains First Name and Last Name fields
    1:∞

  • ID field exists in both dataset A and B
    1:1

  • Elevation exists in dataset A, but not in dataset B
    1:0

  • Date exists in dataset A, but Day, Month, and Year are separate fields in dataset B
    1:∞
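
As a rough illustration of two of these relationships, here is a minimal sketch using Python and pandas (the field values are invented):

  import pandas as pd

  # Hypothetical miniature versions of the two datasets.
  dataset_a = pd.DataFrame({"ID": [1, 2],
                            "Name": ["Ada Lovelace", "Carl Linnaeus"]})
  dataset_b = pd.DataFrame({"ID": [1, 2],
                            "Country": ["United Kingdom", "Sweden"],
                            "First Name": ["Ada", "Carl"],
                            "Last Name": ["Lovelace", "Linnaeus"]})

  # 1:1 — ID exists in both datasets, so each row in A matches exactly one row in B.
  merged = dataset_a.merge(dataset_b, on="ID", how="left")

  # 1:∞ — the single Name field in A corresponds to two fields in B; one way to
  # reconcile them is to rebuild the combined value from B's parts.
  merged["Name (rebuilt)"] = merged["First Name"] + " " + merged["Last Name"]
  print((merged["Name"] == merged["Name (rebuilt)"]).all())  # True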

Metadata is important because (select the TRUE statements)

  • it allows users to determine if a dataset is fit for their use.

  • it allows you to know under which legal terms the reuse of data is permitted.

Planning review solutions

What is the order of the five PMBoK Process Groupings?

  • Initiating, Planning, Executing, Monitoring and Controlling, Closing

What are the types of deliverable?

  • Stated - YES

  • Implied - YES

  • Estimated - NO

  • Direct - YES

  • Indirect - YES

  • Guesses - NO

What is a bottleneck?

  • a blockage that delays development or progress - YES

  • a space where something or someone is missing - NO, THIS IS A GAP

  • a problem or situation that prevents somebody from doing something, or that makes something impossible - NO, THIS IS A BARRIER

Which are examples of mobilization tasks?

  • Affiliation - NO, This is a Resource Type

  • Publishing - YES

  • Imaging - YES

  • Georeferencing - YES

  • Increased Public Awareness - NO, This is an implied goal.

Data capture review solutions

What dataset type(s) would you choose for an ichthyology collection?

  • occurrence
    Most of the time, specimens from collection databases are shared as occurrence data. Each occurrence (specimen or group of specimens) has its own unique identifier (sometimes derived from its catalogue number in the source collection), and the Darwin Core fields used to share them within GBIF describe each specimen: scientific name, the date it was collected in the field, who collected and/or identified it, where, etc. Each collection can have more than one specimen of the same species, as long as each specimen is identified by a unique ID. (A sketch of such a record follows this answer.)

  • checklist
    It is also possible to create and share a taxonomical checklist derived from a collection database; in this case, it is recommended to share the checklist as a taxonomical dataset, with the occurrence (specimen) list associated with it by using the Occurrence core as an extension to the Taxon Core on the GBIF IPT.
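
As a rough illustration, a single occurrence record of the kind described above could look something like the following Python sketch (the Darwin Core term names are standard; the values are invented):

  # Invented example: one fish specimen expressed with a few common Darwin Core terms.
  occurrence = {
      "occurrenceID":     "urn:catalog:EXAMPLE:FISH:123456",  # unique identifier
      "catalogNumber":    "FISH:123456",
      "basisOfRecord":    "PreservedSpecimen",
      "scientificName":   "Salmo trutta Linnaeus, 1758",
      "eventDate":        "1998-07-14",                       # date collected in the field
      "recordedBy":       "J. Smith",                         # collector
      "identifiedBy":     "A. Jones",
      "decimalLatitude":  59.85,
      "decimalLongitude": 17.63,
  }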

What dataset type(s) would you choose for a list of invasive species?

  • occurrence
    Some data publishers share occurrence datasets coming from studies or programmes tracking specimens of specific invasive species; when the data focus on individuals rather than on the invasive species in general, they can be shared as occurrence data.

  • checklist
    Invasive species can be tracked and monitored at different scales (regional, national, thematic…); as these datasets focus more on the species and their distribution across a given geographical scope, they are mainly shared as taxonomical datasets within GBIF (see the GRIIS search results).

What dataset type(s) would you choose for the flora and fauna of an environmental impact study?

  • occurrence
    Data are recorded by naturalists in the field and can be shared as simple occurrence datasets.

  • sampling event
    They can also be shared as event datasets if standardized protocols (such as vegetation plots, transects, traps…) are used to collect the data.

What dataset type(s) would you choose for bird tracking data?

  • occurrence
    These data are shared as occurrence datasets: ideally, each bird is identified with its organismID, and each occurrence (GPS ping) has its own occurrenceID, which is useful for tracking the different GPS locations of the same bird over the course of the tracking programme or project. (See example)

What dataset type(s) would you choose for insect trap data?

  • occurrence
    Although such data can be shared as simple occurrence datasets, it is best if they’re shared as event datasets, where the location, identifier and contents of each trap can be better detailed.

  • sampling event
    Insect traps (as well as other traps such as pitfall traps, malaise traps…) are typically used in monitoring programmes to check the presence (or absence) of some species and/or assess their specific abundance. Using the “eventID” field to identify each trap allows the users to get all of the specimens collected within each trap. The same logic applies to other field protocols such as transects, plots, remote cameras, etc.: by using the Event Core instead of the Occurrence core, you’ll be able to share much more information about the context of the data collection, and allow users to better understand (and even replicate) your work.
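
To illustrate how the eventID field ties things together, here is a minimal sketch (Python, with invented values): each trap is one sampling event, and every occurrence recorded in that trap carries the trap's eventID.

  # Invented example: two traps (events) and the occurrences caught in each one.
  events = [
      {"eventID": "trap-01", "eventDate": "2022-06-01", "samplingProtocol": "pitfall trap"},
      {"eventID": "trap-02", "eventDate": "2022-06-01", "samplingProtocol": "pitfall trap"},
  ]
  occurrences = [
      {"occurrenceID": "occ-001", "eventID": "trap-01", "scientificName": "Carabus nemoralis", "individualCount": 3},
      {"occurrenceID": "occ-002", "eventID": "trap-01", "scientificName": "Pterostichus niger", "individualCount": 1},
      {"occurrenceID": "occ-003", "eventID": "trap-02", "scientificName": "Carabus nemoralis", "individualCount": 2},
  ]

  # All specimens collected in a given trap can be retrieved through its eventID.
  in_trap_01 = [o for o in occurrences if o["eventID"] == "trap-01"]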

What dataset type(s) would you choose for national park management data?

  • occurrence
    to record individuals of the species present in the park

  • checklist
    It is important to know how many species are present in the park/reserve perimeter and their conservation status.

  • sampling event
    to check and track the populations over time

What dataset type(s) would you choose for a citizen science bioblitz?

  • occurrence
    Bioblitz datasets are mainly shared as occurrence datasets.

  • sampling event
    Depending on the citizen science programme, specific sampling protocols might be used by the volunteers, in which case, the data can be shared as an event dataset.

What dataset type(s) would you choose for a regional species list?

  • checklist
    Geographical or thematic species lists are often used to share information about the species present in a given area; most of the time, these lists also mention the distribution of each species as well as their conservation status in this area. Regional species lists can give a useful insight into a region’s biodiversity and habitats, and need to be shared as taxonomical datasets, with or without associated occurrences.

Data management review solutions

Why is it best to clean your data?

  • to make them as fit for use as possible

  • to achieve your data quality goals

You should always aim to manage and publish data with the highest possible quality. This will improve your day-to-day work (it is easier to work with organized and clean data), as well as the work of potential re-users of your data, who need to understand them and trust their source before using them.

How should you organize your data cleaning workflow?

  • ask your colleagues for expertise

  • work at an institutional level to harmonize data quality workflows

Nobody is expected to know everything about biodiversity data; you should seek help and advice from your colleagues or other knowledgeable people, and ensure that you’re applying the good practices recommended by your institution as you clean your data.

Which is best:

  • prevent errors from occurring

  • correct errors as soon as you find them in your database or spreadsheet

The best way to avoid spreading errors in your data is to prevent them from occurring at the start of the data collecting/recording process.

Of course, mistakes are unavoidable so you should also clean them as soon as you find them, and document the cleaning process.

If you don’t have the time or resources to properly clean your data, it is best to wait until you can do so rather than publish erroneous data that might confuse people.

Whose responsibility is data quality?

  • Everyone involved in the management of data

Every person involved in your data management workflow is at least partly responsible for the quality of the data, from the field technicians to the database manager(s).

People who might later use your data can inform you of any remaining error in your data, and should use them responsibly for their own research, but the initial data quality is not their responsibility.

GBIF can perform automatic checks on your data (e.g. detection of missing values, geographic outliers, unknown scientific names) but should not be held responsible for errors that occurred earlier in the data management process.

Which tools can be used to clean your data?

  • Excel & other spreadsheet management tools

  • OpenRefine

  • Your database software

  • Online tools such as Scientific Names Resolver or Google Maps

All kinds of tools can be used to clean your data, but you should identify which ones will meet your needs in terms of taxonomic name resolution, georeferencing, duplicate removal, and so on. You can find helpful tools listed in the data management section.

Data publishing review solutions

What does data publishing mean in the context of GBIF?

  • Making your biodiversity dataset(s) publicly accessible and discoverable in a standardized format

Data publishing within GBIF means making your biodiversity dataset(s) publicly accessible in a standardized format (most of the time, Darwin Core), so that it can be discovered and reused by other people.

What is an IPT?

  • a tool that helps you publish your data to GBIF

  • a tool that helps you produce a Data paper

The IPT (Integrated Publishing Toolkit) is a Java-based software tool that allows you to upload and publish data to GBIF. It is not meant to be used as a data management or data cleaning tool.

The IPT can also help you with the process of writing and submitting a data paper, thanks to the EML file it generates automatically when you fill in the metadata for your data resource.

Which Creative Commons licences and waivers are recommended by GBIF for data publication?

  • CC0, CC-BY and CC-BY-NC

The Creative Commons licences and waivers recommended for publishing your dataset(s) to GBIF are CC0, CC-BY and CC-BY-NC. They are widely recognized licences and waivers that align with international open-data requirements for data sharing and re-use.

Please note that you should only choose the CC0 waiver or the CC-BY licence for your BID-related dataset(s).

What are the three Cores from which you can choose for an IPT resource?

  • Occurrence Core, Taxon Core, Event Core

You can choose one of the three following Cores for each of your IPT resources: Occurrence, Taxon or Event Core.

The Darwin Core standard also allows you to link extensions to your chosen Core, such as SimpleMultimedia or MeasurementOrFact.

The metadata are entered in a separate section of the IPT and are shared using the EML standard, not Darwin Core (which is used for the data only).

How many Extensions files can a dataset have?

  • as many as needed

Once you have chosen a Core for your IPT resource, you can add Darwin Core extensions to it. You can add one or several extensions, depending on the type of Core you chose and which extensions are compatible with it.

Extensions are not mandatory (you can publish a dataset without any extension) but can be useful if you want to share additional information that you could not map with your chosen Core.

Use Case I suggested solution

suggested solution (PDF 144 KB)