Skip to content
John Waller edited this page Jan 30, 2024 · 1 revision

Rule Based Annotations

image1
Figure 1. Screenshot from rule based annotations demo interface, which illustrates how annotations might end up looking.

1. Summary

Write summary after finished first draft.

2. Introduction

2.1. Data cleaning

The Global Biodiversity Information Facility (GBIF) is a vital data infrastructure for researchers, conservationists, and policymakers across the globe. It aggregates and mediates access to extensive datasets of biodiversity occurrence records, thereby fostering scientific research, conservation efforts, and informed decision-making. Nevertheless, the quality of these records is pivotal for their "fitness for use", and data cleaning becomes an essential process to ensure their reliability and utility.

In recent years, the significance of occurrence data quality in scientific research and decision-making has gained recognition. As the volume and complexity of occurrence data continue to grow, the need for automated data cleaning tools has become more pronounced. R packages like CoordinateCleaner (2018) have played a key role in addressing this need, providing efficient and user-friendly solutions for common data quality issues.

image2
Figure 2. Lions in Europe and North America? It is common for GBIF maps to be confusing for users. Most GBIF users are not interested in records from zoos, fossils, or locations that might just be wrong, and GBIF mediated data is often not consistently rich enough to filter unwanted records.

2.2. Fixing at source

A competing viewpoint with regard to data cleaning is to "fix at source". Fixing GBIF occurrence data at the source, such as reaching out to data publishers to address issues and errors in their datasets, is an ideal approach in theory. However, in practice, this approach often encounters challenges, primarily because publishers may not respond to emails or communication attempts. It’s essential to bear in mind that rule-based annotations can contribute to rectifying data problems at their origin as well. Additionally, it is often the case that records do not need to be fixed, but merely are not acceptable for a certain application, such as species distribution mapping.

2.3. Motivation

A rule is a combination of geographic, taxonomic, and geographic information that facilitates data cleaning or analysis.

Automated solutions, like CoordinateCleaner, while valuable tools for data cleaning, may be considered incomplete in certain contexts due to their limited flexibility and potential to miss edge cases. A rule-based annotation system, on the other hand, allows users to make data quality decisions that fit their use case in a more granular way.

2.4. Complexity vs usability

Any system that attempts to solve every problem will solve none.

Annotation systems, like any software or tool, have the potential to become unusable when they become overly complicated.

One goal of a our rule-based annotation system is to make it accessible to a broad user base, including researchers, scientists, and casual users. If the system becomes overly complex, it can discourage potential users who may not have a deep technical background or a lot of time, but still have valuable feedback.

A rule-based annotation system, especially one used for annotating complex datasets like GBIF occurrence records, must strike a delicate balance between complexity and usability.

2.5. Controlled vocabulary

One of the key ways to increase usability and complexity is to introduce a controlled vocabulary.

penguins
Figure 3. "Penguins released in Norway". While the most accurate description of this event is the sentence above, a more useful rule might be "Penguins in Norway are suspicious".

Using a small controlled vocabulary over in an annotation system offers several advantages to downstream users. While controlled vocabularies offer simplicity, it’s essential to strike a balance. Overly restrictive controlled vocabularies can limit the ability to annotate all concepts. Therefore, finding the right level of granularity and flexibility within the controlled vocabulary is key to reaping the benefits while accommodating the specific needs of the annotation user.

image3
Figure 4. Example annotation that marks any occurrences of lions in Greenland as suspicious. It is left to the users to decide what to do with this information.

2.6. Focus on location

Another way to limited the complexity of an annotation system is too limit the scope.

We’ve made a deliberate choice to concentrate on location rule-based annotations for biodiversity occurrences. This decision stems from our goal to streamline and focus our efforts while addressing the most a prevalent type of feedback we receive at GBIF.

It’s important to note, however, that the concept of rule-based annotations is inherently extensible. While our initial focus centers on location data, the same framework and principles can be applied to other areas of data quality improvement within the GBIF context. This adaptability allows us to remain responsive to evolving user needs and feedback, ensuring that our efforts can be broadened to encompass other data quality challenges in the future. Ultimately, our aim is to create a flexible and scalable solution that can continue to benefit the biodiversity community as a whole.

2.7. Comparison with other species location databases

Other efforts exist to catalogue the ranges of the living world:

While these efforts are useful and well-developed, none of them are expressly focused on data quality. Namely, none of these systems allow users to easily state with a simple controlled vocabulary and rules where occurrences for a species are likely and unlikely.

image4
Figure 5. Our system allows users to annotate at an granular scale. For example, this annoation marks all occurrences that happen to be near this greenhouse as "managed".

3. Technical Details

3.1. Rules

A basic rule in our system looks like this.

ruletaxon in geo-polygon are controlled vocab

In our system a geo-polygon is a Well-Known Text (WKT) object. A geo-polygon could also be the name of a place that eventually maps to a WKT polygon (like a country code or gadm code).

Table 1. simple example rules

ruleLions in Greenland are suspicious

rulePenguins in Norway are suspicious

rulePenguins in WKT are native

ruleLions in Ocean are suspicious

A taxon in our system is going to be a GBIF taxonKey so rules are more likely to look like this in practice.

Table 2. taxonKey rules

rule5219404 in Greenland are suspicious

rule5284 in Norway are suspicious

rule5284 in WKT are native

rule5219404 in Ocean are suspicious

3.2. Rule extensions

We have found in initial testing that only being able to annotate land areas (a geo-polygon) is restrictive, so it is anticipated that certain extensions to this basic formula might be supported.

For example, often occurrence records can be suspicious but still be in a somewhat plausible location. A natural way to handle such cases would be to allow for rules with GBIF datasetKey.

ruletaxon in geo-polygon and datasetKey are controlled vocab

For example,

ruleLions in South Africa and datasetKey are suspicious

Another natural extension might be GBIF basisOfRecord.

For example, country centroid locations are often only suspicious for museum specimens, so a user could define a rule that captures this knowledge.

ruleLions in Centroid of South Africa and Preserved Specimen are suspicious

"Centroid of South Africa" would, of course, be defined by some WKT object like a circle or a polygon.

Finally, there might be other fields that might make good qualifiers/extensions, like year.

3.3. Rulesets

A ruleset is a collection of rules.

For example, a ruleset could be "Annotations of the Genus Leo", and it could look something like the table below.

Table 3. Example ruleset

ruleLions in Greenland are Suspicious

ruleLions in Ocean are Suspicious

ruleLions in South Africa are Native

ruleLions in WKT polygon of National Park are Native

ruleLions in WKT polygon of Zoo are Managed

ruleLions in Centroid of SA and Preserved Specimens are Suspicious

3.4. Projects

A project is a collection of rulesets.

Projects are designed to allow for collaboration between users and logical grouping of rulesets. For example, a ruleset could focus on Lions, but be part of a bigger project about cleaning up Mammal occurrence records.

Table 4. Example Project Mammals

ruleset

Annotations of Lions based on Field Guide

ruleset

Annotations of Mammals that are not in the Ocean

ruleset

Suspicious Zoo Locations of North America

ruleset

Adapted iNaturalist atlases of Mammals

ruleset

Suspicious Centroid locations for Museum Specimens

Note how a project can encode knowledge from other sources into a ruleset, such as iNaturalist atlases.

3.5. Collaboration

We hope that users will collaborate on a project that interests them and create rulesets that are widely beneficial to others within their research community.

Within a project, only users with access, granted by the project creator, will be able to create rules and rulesets. However, rules, rulesets, and projects will all be open and publicly available.

3.6. Sharing rules

It is also anticipated that a desirable feature would allow users to "borrow" rule or geo-polygon from another ruleset and assign a new taxonKey or add a rule extension. This will reduce the storage strains on GBIF and prevent duplicate work.

For example, a common rule might be to mark something in the ocean as suspicious. A user should be able apply this rule to a new taxonKey without creating a new ocean polygon every time.

3.7. Voting

For downstream users, deciding which rule and rulesets to use might become challenging without some quality control. Currently, we imagine a simple upvote-downvote system on rule, ruleset, and perhaps project. With voting users could see what annotations are supported by the broader community, and create cleaning scripts that are only use annotations supported by the community.

Additionally, voting could provide protection against vandalism.

3.8. Higher taxonomy

Another useful feature would be the ability to cast a rule down to all child taxa. Annotating higher taxonomy is harder than annotating at the species level because you have to be confident, the annotation at the higher level fits all child taxa.

amphibians
Figure 6. A map of amphibian occurrences on GBIF. It is well known there are no amphibians in Antarctica. However, we see from the map that one occurrence point still appears there in error.

Given the distribution of Amphibians, a good rule for the high taxon Amphibians would be :

ruleAmphibians in Antarctica are Suspicious

Once challenge is that is is hard to downcast annotations like "Native" to lower levels, since species of a big group tend not to be "Native" to exactly the same areas.

3.9. Exceptions to rules

Creating cast-down annotations can be hard due to several reasons related to the nature of the task and exceptions to the rule. An exclusion rule could be efficient for higher level downcasting of rules.

For example, a rule could exclude a certain group

ruletaxon in geo-polygon are controlled vocabulary except taxon x

ruleAmphibians in Antarctica are Suspicious except Antartic frogs

frogs
Figure 7. Frog article

A work around to rule exceptions could of course be rules that simply conflict.

3.10. Conflicting rules

Inevitably, there are going to be rules created in our system that conflict. For example, a user might mark and area as "Native", while another user will mark the same area as "Suspicious".

In our rule-based system, unlike perhaps other platforms, we are not striving to create a single ground truth. We aim only to have a collection of useful opinions, and we leave it to the end user to decide what to do with the information.

3.11. Rules with more than one taxon

It might be efficient in some circumstances to express rules with more than one taxon:

rule → taxon_1 + taxon_2 …​ in geo-polygon are controlled vocabulary

One useful example would be marking all marine species on land as suspicious.

rule → Marine species on WORMS list in Land Polygon are Suspicious

3.12. Controlled vocabulary

We might consider using the preexisting vocabulary, although we are attempting to annotate land area (ranges) more than we are attempting annotate occurrence records.

Below is the working controlled vocabulary for location-based annotations.

Table 5. Controlled vocabulary for locations
term definition

Native

Refers to the natural geographic range where a species or organism historically evolved and occurs without human intervention.

Introduced

Refers to the geographic area where non-native organisms have been intentionally or accidentally introduced and established

Managed

Encompasses the geographic area where specific species are actively controlled, conserved, or manipulated by human intervention.

former

Denotes the historical geographic area where a species once naturally occurred but no longer does due to various factors.

Vagrant

Describes sporadic occurrences of a species far outside its usual habitat or distribution, often due to rare or accidental dispersal events.

Suspicious

Occurrences occuring in the designated area might be in error in some way.

This vocabulary is meant to be a compromise between modeling species ranges and establishment means accurately, while not being overly complex.

Table 6. Example mappings
concept example

native

extant

native

endemic

native

indigenous

native

breeding

native

non-breeding

introduced

assisted colonization

introduced

invasive

introduced

non native range

managed

location is captive range

managed

location is botanical garden

managed

location is zoo

managed

cultivated in glasshouse

suspicious

location is in the ocean

suspicious

zero-zero coordinate

suspicious

centroid

suspicious

area too far north for taxon

suspicious

area too high elevation for taxon

suspicious

area is natural history museum

former

fossil range

former

extinct

former

historic

vagrant

migrant

The current vocabulary might change in the future. Namely, there has been some discussion introducing hierarchy such that perhaps certain terms map to present or absent for example.

3.13. Why not annotate occurrences directly?

A burning question at this point might be why not annotate occurrences directly?

Annotating land areas (and extensions) provide at least two advanateges over annotating occurrences:

  1. Avoids the use of unstable gbifIds.

  2. Allows for future occurrences to benefit from the annotation.

3.14. Rule license

(Should rules have a usage license?)

4. User Guide

If you are reading this you have been approached as potential pilot annotator.

This section will be completed once the UI matures a bit.

4.1. Good annotations

While there is not absolute definition of a good annotation and a bad one. Good annotations usually have a few properties:

  1. Good annotations usually don’t use extremely complex polygons. If you find yourself needing to trace the coastline of Italy, you might be making a bad annotation. A good annotation should take into account a little bit of buffer to take into account occurrence record uncertainty.

  2. Good annotations take into account future occurrence records. Remember that your annotations should be able to fit future occurrence fairly well.

  3. Good annotations also try to think about higher taxonomy and simplification.

(think about users outside of GBIF)

5. Road Map

Initiated tasks for 2023 in bold.

Explore approaches to annotation capabilities in GBIF.org that enable data corrections, enrichments and user-provided rules that combine taxonomic, geographic and temporal combinations to detect suspicious records

  • User interface (2023)

  • API with storage (2023)

  • R package gbifan interface (2023)

  • Seeding the database (2023)

  • Introduce to pilot users (2023-2024)

  • Start data paper (2024)

  • Integration with a hosted portal (2025)

  • Integration with GBIF.org (2025)

5.1. User Interface

A demo UI has been developed to facilitate the process of creating rule-based annotations on a map. This user-friendly interface allows users to visually interact with a map and define rules or criteria for data annotations.

The current state of the demo UI for rule-based annotations is considered a work in progress, and it is expected to evolve based on valuable feedback gathered during the pilot phase.

5.2. API backend

5.3. R package gbifan

It is anticipated that the most efficient way to interact with the rule store will be via R.

5.4. Introduction to pilot users

Identify potential pilot users who are actively engaged in biodiversity research, data annotation, or related fields. Consider reaching out to academic researchers and enthusiasts.

5.5. Seeding rulesets

Seeding rulesets into our annotation several benefits, especially when it comes to improving user experience. Seeding data can help new users get acquainted with the annotation system more quickly. By providing sample data or pre-existing annotations, users can learn how the system works and what is expected of them in terms of annotation tasks. This can reduce the learning curve and increase user engagement.

Seeding data with known annotations can also serve as a benchmark for quality assurance. Users can compare their annotations with the existing ones to ensure accuracy and consistency. This helps maintain high data quality standards, especially in applications like machine learning where training data quality is crucial. Users can be more productive when they have access to seeded data. They can start their annotation tasks immediately instead of filling in well-known information.

Providing users with seeded data can make the annotation process more engaging and rewarding. When users see that their annotations contribute to a dataset that already contains valuable information, they are more likely to stay engaged and continue using the system over the long term.

Table 7. A table of examples ruleset seeds.
World Checklist of Vascular Plants

GBIF Species Distribution Extension

5.6. Data paper

Carrot to find pilot users.

5.7. Integration with a hosted portal

2025

A hosted portal is a simple, branded and fully customizable website that displays a targeted subset of GBIF-mediated data to support Participant nodes and their partners.

This service is designed to support biodiversity data use and engagement at national, institutional, regional and thematic scales.

5.8. Integration with GBIF.org

2025

6. References