Possible categorization error: dataset try #270

@beatrizmilz

Hello!

OTN is a great project, thank you all for it.

This issue aims to document a possible error in the "resolved" categorization.

While using the dataset, Thiago @thiago-goncalves-souza and I noticed a possible categorization error in the TRY dataset (https://opentraits.org/datasets/try).

Filtering the OTN data to rows that come from the TRY dataset and are assigned to kingdom Animalia (resolveKingdomName == "Animalia") returns 5,311 rows, which is unexpected for a plant trait database.

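# Download the data from
# https://github.com/open-traits-network/otn-taxon-trait-summary/blob/main/traits.csv.gz
# readr reads gzip-compressed files directly, so the file does not need to be unzipped first.
otn_raw <-
  readr::read_csv("traits.csv.gz")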

otn_dataset_try <- otn_raw |>
  # filter only the animal kingdom
  dplyr::filter(resolveKingdomName == "Animalia") |>
  dplyr::filter(datasetId == "https://opentraits.org/datasets/try")


dplyr::glimpse(otn_dataset_try)
# Rows: 5,311
# Columns: 31
# $ taxonIdVerbatim        <chr> "1669", "1669", "1669", "1669", "1669", "1…
# $ scientificNameVerbatim <chr> "Agathis philippinensis", "Agathis philipp…
# $ resolvedTaxonId        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# $ resolvedTaxonName      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# $ parentTaxonId          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# $ family                 <chr> "Araucariaceae", "Araucariaceae", "Araucar…
# $ phylum                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# $ traitIdVerbatim        <dbl> 37, 3400, 759, 98, 3401, 43, 22, 17, 4, 38…
# $ traitNameVerbatim      <chr> "Leaf phenology type", "Plant growth form …
# $ bucketId               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# $ bucketName             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# $ counts                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# $ datasetId              <chr> "https://opentraits.org/datasets/try", "ht…
# $ numberOfRecords        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 3, …
# $ curator                <chr> "https://opentraits.org/members/brian-s-ma…
# $ accessDate             <date> 2022-08-19, 2022-08-19, 2022-08-19, 2022-…
# $ comment                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# $ relationName           <chr> "HAS_ACCEPTED_NAME", "HAS_ACCEPTED_NAME", …
# $ resolvedExternalId     <chr> "COL:6635V", "COL:6635V", "COL:6635V", "CO…
# $ resolvedName           <chr> "Agathis philippinensis", "Agathis philipp…
# $ resolvedRank           <chr> "species", "species", "species", "species"…
# $ resolvedCommonNames    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# $ resolvedPath           <chr> "Biota | Animalia | Arthropoda | Insecta |…
# $ resolvedPathIds        <chr> "COL:5T6MX | COL:N | COL:RT | COL:H6 | COL…
# $ resolvedPathNames      <chr> "unranked | kingdom | phylum | class | ord…
# $ resolvedExternalUrl    <chr> "https://www.catalogueoflife.org/data/taxo…
# $ resolveKingdomName     <chr> "Animalia", "Animalia", "Animalia", "Anima…
# $ resolvedPhylumName     <chr> "Arthropoda", "Arthropoda", "Arthropoda", …
# $ resolvedFamilyName     <chr> "Braconidae", "Braconidae", "Braconidae", …
# $ providedTraitName      <chr> "Leaf phenology type", "Plant growth form …
# $ resolvedTraitName      <chr> "Phenology", "Morphology", "UNCATEGORIZED_…
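
The glimpse() output already hints at the mismatch: the first record has the verbatim family "Araucariaceae" (a conifer family) but the resolved family "Braconidae" under Animalia. A minimal sketch of how such rows could be flagged is below. It uses base R and a toy data frame so that it runs on its own; the first row copies values from the output above, while the second row is a hypothetical control that is not taken from the dataset. Comparing `family` with `resolvedFamilyName` is an assumption about what counts as a suspicious row, not a rule used by OTN.

```r
# Toy rows: the first copies values shown in the glimpse() output above;
# the second is a hypothetical control row, NOT taken from the dataset.
toy <- data.frame(
  scientificNameVerbatim = c("Agathis philippinensis", "Apis mellifera"),
  family                 = c("Araucariaceae", "Apidae"),
  resolveKingdomName     = c("Animalia", "Animalia"),
  resolvedFamilyName     = c("Braconidae", "Apidae"),
  stringsAsFactors = FALSE
)

# Flag rows where both families are known and they disagree.
family_mismatch <- !is.na(toy$family) &
  !is.na(toy$resolvedFamilyName) &
  toy$family != toy$resolvedFamilyName

suspect <- toy[family_mismatch, ]
print(suspect$scientificNameVerbatim)
```

On the real data, the same comparison could be applied to `otn_dataset_try` in place of `toy`.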

Some of these traits look like plant traits:

otn_dataset_try |>
  dplyr::count(datasetId,
               resolveKingdomName,
               providedTraitName,
               sort = TRUE) |> 
  head() 
| datasetId | resolveKingdomName | providedTraitName | n |
|---|---|---|---|
| https://opentraits.org/datasets/try | Animalia | Plant growth form | 482 |
| https://opentraits.org/datasets/try | Animalia | Leaf type | 257 |
| https://opentraits.org/datasets/try | Animalia | Leaf compoundness | 255 |
| https://opentraits.org/datasets/try | Animalia | Plant woodiness | 255 |
| https://opentraits.org/datasets/try | Animalia | Leaf phenology type | 178 |
| https://opentraits.org/datasets/try | Animalia | Leaf area (in case of compound leaves: leaflet | 161 |
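
To make "looks like a plant trait" a bit more concrete, here is a rough sketch that flags trait names by keyword. The names and counts are copied from the output above (the last name is truncated in the source and kept as-is). The keyword pattern is only an assumption chosen for illustration and would likely need extending for the full dataset.

```r
# Trait names and counts copied from the head() output above.
top_traits <- data.frame(
  providedTraitName = c(
    "Plant growth form",
    "Leaf type",
    "Leaf compoundness",
    "Plant woodiness",
    "Leaf phenology type",
    "Leaf area (in case of compound leaves: leaflet"
  ),
  n = c(482, 257, 255, 255, 178, 161),
  stringsAsFactors = FALSE
)

# Assumed keyword list (illustrative only); matching is case-insensitive.
plant_pattern <- "plant|leaf|wood"

top_traits$looks_like_plant_trait <- grepl(
  plant_pattern, top_traits$providedTraitName, ignore.case = TRUE
)

# Total number of rows behind the flagged trait names.
rows_flagged <- sum(top_traits$n[top_traits$looks_like_plant_trait])
print(top_traits)
print(rows_flagged)
```

A keyword match is a blunt heuristic, so it is better suited to spotting candidates for manual review than to relabeling rows automatically.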

Here are the most frequent combinations of resolvedPhylumName and resolvedName in the same subset:

otn_dataset_try |>
  dplyr::count(datasetId,
               resolveKingdomName,
               resolvedPhylumName,
               resolvedName,
               sort = TRUE) |> 
  head()
| datasetId | resolveKingdomName | resolvedPhylumName | resolvedName | n |
|---|---|---|---|---|
| https://opentraits.org/datasets/try | Animalia | Mollusca | Ficus | 162 |
| https://opentraits.org/datasets/try | Animalia | Chordata | Salix | 118 |
| https://opentraits.org/datasets/try | Animalia | Arthropoda | Eugenia | 117 |
| https://opentraits.org/datasets/try | Animalia | Arthropoda | Inga | 117 |
| https://opentraits.org/datasets/try | Animalia | Arthropoda | Viola | 94 |
| https://opentraits.org/datasets/try | Animalia | Chordata | Phyllanthus | 88 |
