-
Notifications
You must be signed in to change notification settings - Fork 1
Home

The Global Biodiversity Information Facility (GBIF) is a vital data infrastructure for researchers, conservationists, and policymakers across the globe. It aggregates and mediates access to extensive datasets of biodiversity occurrence records, thereby fostering scientific research, conservation efforts, and informed decision-making. Nevertheless, the quality of these records is pivotal for their "fitness for use", and data cleaning becomes an essential process to ensure their reliability and utility.
In recent years, the significance of occurrence data quality in scientific research and decision-making has gained recognition. As the volume and complexity of occurrence data continue to grow, the need for automated data cleaning tools has become more pronounced. R packages like CoordinateCleaner (2018) have played a key role in addressing this need, providing efficient and user-friendly solutions for common data quality issues.

A competing viewpoint with regard to data cleaning is to "fix at source". Fixing GBIF occurrence data at the source, such as reaching out to data publishers to address issues and errors in their datasets, is an ideal approach in theory. However, in practice, this approach often encounters challenges, primarily because publishers may not respond to emails or communication attempts. It’s essential to bear in mind that rule-based annotations can contribute to rectifying data problems at their origin as well. Additionally, it is often the case that records do not need to be fixed, but merely are not acceptable for a certain application, such as species distribution mapping.
Automated solutions, like CoordinateCleaner, while valuable tools for data cleaning, may be considered incomplete in certain contexts due to their limited flexibility and potential to miss edge cases. A rule-based annotation system, on the other hand, allows users to make data quality decisions that fit their use case in a more granular way.
Annotation systems, like any software or tool, have the potential to become unusable when they become overly complicated.
One goal of a our rule-based annotation system is to make it accessible to a broad user base, including researchers, scientists, and casual users. If the system becomes overly complex, it can discourage potential users who may not have a deep technical background or a lot of time, but still have valuable feedback.
A rule-based annotation system, especially one used for annotating complex datasets like GBIF occurrence records, must strike a delicate balance between complexity and usability.
One of the key ways to increase usability and complexity is to introduce a controlled vocabulary.

Using a small controlled vocabulary over in an annotation system offers several advantages to downstream users. While controlled vocabularies offer simplicity, it’s essential to strike a balance. Overly restrictive controlled vocabularies can limit the ability to annotate all concepts. Therefore, finding the right level of granularity and flexibility within the controlled vocabulary is key to reaping the benefits while accommodating the specific needs of the annotation user.

Another way to limited the complexity of an annotation system is too limit the scope.
We’ve made a deliberate choice to concentrate on location rule-based annotations for biodiversity occurrences. This decision stems from our goal to streamline and focus our efforts while addressing the most a prevalent type of feedback we receive at GBIF.
It’s important to note, however, that the concept of rule-based annotations is inherently extensible. While our initial focus centers on location data, the same framework and principles can be applied to other areas of data quality improvement within the GBIF context. This adaptability allows us to remain responsive to evolving user needs and feedback, ensuring that our efforts can be broadened to encompass other data quality challenges in the future. Ultimately, our aim is to create a flexible and scalable solution that can continue to benefit the biodiversity community as a whole.
Other efforts exist to catalogue the ranges of the living world:
While these efforts are useful and well-developed, none of them are expressly focused on data quality. Namely, none of these systems allow users to easily state with a simple controlled vocabulary and rules where occurrences for a species are likely and unlikely.

A basic rule in our system looks like this.
rule
→ taxon
in geo-polygon
are controlled vocab
In our system a geo-polygon
is a Well-Known Text (WKT) object. A geo-polygon
could also be the name of a place that eventually maps to a WKT polygon (like a country code or gadm code).
|
|
|
|
A taxon
in our system is going to be a GBIF taxonKey
so rules are more likely to look like this in practice.
|
|
|
|
We have found in initial testing that only being able to annotate land areas (a geo-polygon) is restrictive, so it is anticipated that certain extensions to this basic formula might be supported.
For example, often occurrence records can be suspicious but still be in a somewhat plausible location. A natural way to handle such cases would be to allow for rules with GBIF datasetKey
.
rule
→ taxon
in geo-polygon
and datasetKey
are controlled vocab
For example,
rule
→ Lions in South Africa and datasetKey are suspicious
Another natural extension might be GBIF basisOfRecord
.
For example, country centroid locations are often only suspicious for museum specimens, so a user could define a rule that captures this knowledge.
rule
→ Lions in Centroid of South Africa and Preserved Specimen are suspicious
"Centroid of South Africa" would, of course, be defined by some WKT object like a circle or a polygon.
Finally, there might be other fields that might make good qualifiers/extensions, like year
.
A ruleset
is a collection of rules
.
For example, a ruleset
could be "Annotations of the Genus Leo", and it could look something like the table below.
|
|
|
|
|
|
A project
is a collection of rulesets
.
Projects are designed to allow for collaboration between users and logical grouping of rulesets
. For example, a ruleset
could focus on Lions, but be part of a bigger project
about cleaning up Mammal occurrence records.
|
Annotations of Lions based on Field Guide |
|
Annotations of Mammals that are not in the Ocean |
|
Suspicious Zoo Locations of North America |
|
Adapted iNaturalist atlases of Mammals |
|
Suspicious Centroid locations for Museum Specimens |
Note how a project
can encode knowledge from other sources into a ruleset
, such as iNaturalist atlases.
We hope that users will collaborate on a project
that interests them and create rulesets
that are widely beneficial to others within their research community.
Within a project
, only users with access, granted by the project creator, will be able to create rules and rulesets. However, rules, rulesets, and projects will all be open and publicly available.
It is also anticipated that a desirable feature would allow users to "borrow" rule
or geo-polygon from another ruleset
and assign a new taxonKey or add a rule extension. This will reduce the storage strains on GBIF and prevent duplicate work.
For example, a common rule
might be to mark something in the ocean as suspicious. A user should be able apply this rule to a new taxonKey without creating a new ocean polygon every time.
For downstream users, deciding which rule
and rulesets
to use might become challenging without some quality control. Currently, we imagine a simple upvote-downvote system on rule
, ruleset
, and perhaps project
. With voting users could see what annotations are supported by the broader community, and create cleaning scripts that are only use annotations supported by the community.
Additionally, voting could provide protection against vandalism.
Another useful feature would be the ability to cast a rule
down to all child taxa. Annotating higher taxonomy is harder than annotating at the species level because you have to be confident, the annotation at the higher level fits all child taxa.

Given the distribution of Amphibians, a good rule for the high taxon Amphibians would be :
rule
→ Amphibians in Antarctica are Suspicious
Once challenge is that is is hard to downcast annotations like "Native" to lower levels, since species of a big group tend not to be "Native" to exactly the same areas.
Creating cast-down annotations can be hard due to several reasons related to the nature of the task and exceptions to the rule. An exclusion rule could be efficient for higher level downcasting of rules.
For example, a rule could exclude a certain group
rule
→ taxon
in geo-polygon
are controlled vocabulary
except taxon x
rule
→ Amphibians in Antarctica are Suspicious except Antartic frogs

A work around to rule exceptions could of course be rules that simply conflict.
Inevitably, there are going to be rules created in our system that conflict. For example, a user might mark and area as "Native", while another user will mark the same area as "Suspicious".
In our rule-based system, unlike perhaps other platforms, we are not striving to create a single ground truth. We aim only to have a collection of useful opinions, and we leave it to the end user to decide what to do with the information.
It might be efficient in some circumstances to express rules with more than one taxon:
rule → taxon_1
+ taxon_2
…
in geo-polygon
are controlled vocabulary
One useful example would be marking all marine species on land as suspicious.
rule → Marine species on WORMS list in Land Polygon are Suspicious
We might consider using the preexisting vocabulary, although we are attempting to annotate land area (ranges) more than we are attempting annotate occurrence records.
Below is the working controlled vocabulary for location-based annotations.
term | definition |
---|---|
Native |
Refers to the natural geographic range where a species or organism historically evolved and occurs without human intervention. |
Introduced |
Refers to the geographic area where non-native organisms have been intentionally or accidentally introduced and established |
Managed |
Encompasses the geographic area where specific species are actively controlled, conserved, or manipulated by human intervention. |
former |
Denotes the historical geographic area where a species once naturally occurred but no longer does due to various factors. |
Vagrant |
Describes sporadic occurrences of a species far outside its usual habitat or distribution, often due to rare or accidental dispersal events. |
Suspicious |
Occurrences occuring in the designated area might be in error in some way. |
This vocabulary is meant to be a compromise between modeling species ranges and establishment means accurately, while not being overly complex.
concept | example |
---|---|
native |
extant |
native |
endemic |
native |
indigenous |
native |
breeding |
native |
non-breeding |
introduced |
assisted colonization |
introduced |
invasive |
introduced |
non native range |
managed |
location is captive range |
managed |
location is botanical garden |
managed |
location is zoo |
managed |
cultivated in glasshouse |
suspicious |
location is in the ocean |
suspicious |
zero-zero coordinate |
suspicious |
centroid |
suspicious |
area too far north for taxon |
suspicious |
area too high elevation for taxon |
suspicious |
area is natural history museum |
former |
fossil range |
former |
extinct |
former |
historic |
vagrant |
migrant |
The current vocabulary might change in the future. Namely, there has been some discussion introducing hierarchy such that perhaps certain terms map to present
or absent
for example.
Annotating land areas (and extensions) provide at least two advanateges over annotating occurrences:
-
Avoids the use of unstable gbifIds.
-
Allows for future occurrences to benefit from the annotation.
If you are reading this you have been approached as potential pilot annotator.
This section will be completed once the UI matures a bit.
While there is not absolute definition of a good annotation and a bad one. Good annotations usually have a few properties:
-
Good annotations usually don’t use extremely complex polygons. If you find yourself needing to trace the coastline of Italy, you might be making a bad annotation. A good annotation should take into account a little bit of buffer to take into account occurrence record uncertainty.
-
Good annotations take into account future occurrence records. Remember that your annotations should be able to fit future occurrence fairly well.
-
Good annotations also try to think about higher taxonomy and simplification.
(think about users outside of GBIF)
Initiated tasks for 2023 in bold.
Explore approaches to annotation capabilities in GBIF.org that enable data corrections, enrichments and user-provided rules that combine taxonomic, geographic and temporal combinations to detect suspicious records
-
User interface (2023)
-
API with storage (2023)
-
R package
gbifan
interface (2023) -
Seeding the database (2023)
-
Introduce to pilot users (2023-2024)
-
Start data paper (2024)
-
Integration with a hosted portal (2025)
-
Integration with GBIF.org (2025)
A demo UI has been developed to facilitate the process of creating rule-based annotations on a map. This user-friendly interface allows users to visually interact with a map and define rules or criteria for data annotations.
The current state of the demo UI for rule-based annotations is considered a work in progress, and it is expected to evolve based on valuable feedback gathered during the pilot phase.
It is anticipated that the most efficient way to interact with the rule store will be via R.
Identify potential pilot users who are actively engaged in biodiversity research, data annotation, or related fields. Consider reaching out to academic researchers and enthusiasts.
Seeding rulesets
into our annotation several benefits, especially when it comes to improving user experience. Seeding data can help new users get acquainted with the annotation system more quickly. By providing sample data or pre-existing annotations, users can learn how the system works and what is expected of them in terms of annotation tasks. This can reduce the learning curve and increase user engagement.
Seeding data with known annotations can also serve as a benchmark for quality assurance. Users can compare their annotations with the existing ones to ensure accuracy and consistency. This helps maintain high data quality standards, especially in applications like machine learning where training data quality is crucial. Users can be more productive when they have access to seeded data. They can start their annotation tasks immediately instead of filling in well-known information.
Providing users with seeded data can make the annotation process more engaging and rewarding. When users see that their annotations contribute to a dataset that already contains valuable information, they are more likely to stay engaged and continue using the system over the long term.
World Checklist of Vascular Plants |
---|
2025
A hosted portal is a simple, branded and fully customizable website that displays a targeted subset of GBIF-mediated data to support Participant nodes and their partners.
This service is designed to support biodiversity data use and engagement at national, institutional, regional and thematic scales.