Merge pull request #7 from trias-project/make_new_cubes_without_unverified_occs

Remove unverified occurrences and add cubes of Slovenia and Lithuania
damianooldoni authored Dec 1, 2020
2 parents 2b87b84 + 8b3920c commit ca526b8
Showing 6 changed files with 118 additions and 36 deletions.
19 changes: 14 additions & 5 deletions README.md
@@ -50,15 +50,24 @@ taxonKey | scientificName | numberOfOccurrences | taxonRank | taxonomicStatus
8361333 | Fallopia compacta (Hook.fil.) G.H.Loos & P.Keil | 24 | SPECIES | SYNONYM
7291673 | Polygonum reynoutria (Houtt.) Makino | 3 | SPECIES | SYNONYM

See https://doi.org/10.15468/dl.rej1cz for more details. Note: the table above is just an example and can be outdated.
Table based on this [GBIF download](https://doi.org/10.15468/dl.rej1cz).

Aggregation would lose this information, so alongside the cubes, e.g. `be_species_cube.csv`, we provide taxonomic compendia, e.g. `be_species_info.csv`. For each taxon in the cube they list all the synonyms or infraspecific taxa whose occurrences contribute to the total count. Unlike the data cubes of alien species, these data cubes are built entirely upon the taxonomic relationships of the [GBIF Backbone Taxonomy](https://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c). Both data cubes and taxonomic compendia are saved in `data/processed`.
Aggregation would lose this information, so alongside the cubes, e.g. `be_species_cube.csv` for Belgium, we provide taxonomic compendia, e.g. `be_species_info.csv`. For each taxon in the cube they list all the synonyms or infraspecific taxa whose occurrences contribute to the total count. Unlike the data cubes of alien species, these data cubes are built entirely upon the taxonomic relationships of the [GBIF Backbone Taxonomy](https://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c). Both data cubes and taxonomic compendia are saved in `data/processed`.

For example, _Aedes japonicus (Theobald, 1901)_ is an accepted species present in the Belgian cube: based on the information stored in `occ_belgium_taxa.tsv`, its occurrence count includes occurrences linked to the following taxa:
1. [Aedes japonicus (Theobald, 1901)](https://www.gbif.org/species/1652212)
2. [Ochlerotatus japonicus (Theobald, 1901)](https://www.gbif.org/species/4519733)
3. [Aedes japonicus subsp. japonicus](https://www.gbif.org/species/7346173)
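
As an aside (not part of the repository), such synonyms could be looked up with [rgbif](https://docs.ropensci.org/rgbif/), the package used in `src`; the taxonKey below is taken from the first link above and the returned columns may vary with the rgbif version:

```r
# Sketch: list GBIF Backbone synonyms of Aedes japonicus (taxonKey 1652212).
library(rgbif)
syns <- name_usage(key = 1652212, data = "synonyms")$data
syns[, c("key", "scientificName", "taxonomicStatus")]
```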

We provide the occurrence cube and corresponding taxonomic compendium for the following European countries:

country | countryCode
--- | ---
Belgium | BE
Italy | IT
Slovenia | SI
Lithuania | LT

## Repo structure

The repository structure is based on [Cookiecutter Data Science](http://drivendata.github.io/cookiecutter-data-science/). Files and directories indicated with `GENERATED` should not be edited manually.
@@ -89,14 +98,14 @@ Clone this repository to your computer and open the RStudio project file, `occ-

You can generate a national occurrence data cube by running the [R Markdown files](https://rmarkdown.rstudio.com/) in `src` following the order shown here below:

1. `1_download.Rmd`: trigger a GBIF download and add it to the list of triggered downloads
1. `1_download.Rmd`: trigger a GBIF download for a specific country and add it to the list of triggered downloads
2. `2_create_db.Rmd`: create a sqlite database and perform basic data cleaning
3. `3_assign_grid.Rmd`: assign geographic cell code to occurrence data
4. `4_aggregate.Rmd`: aggregate occurrences per taxon, year and cell code into the _Belgian occurrence data cube_
4. `4_aggregate.Rmd`: aggregate occurrences per taxon, year and cell code into the _national occurrence data cube_

The data cubes are automatically generated in the folder `/data/processed/`.

Install any required packages first.
Install any required package first.
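
As an illustration (not part of the repository), the four steps could be rendered in sequence from the project root, assuming all required packages are installed:

```r
# Illustrative only: render the four R Markdown steps in order.
library(rmarkdown)
steps <- c("1_download.Rmd", "2_create_db.Rmd",
           "3_assign_grid.Rmd", "4_aggregate.Rmd")
for (f in steps) {
  render(file.path("src", f))
}
```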

## Contributors

6 changes: 5 additions & 1 deletion data/raw/gbif_downloads.tsv
@@ -3,4 +3,8 @@ gbif_download_key input_checklist gbif_download_created gbif_download_status gbi
0008507-190621201848488 NA 2019-07-09 08:31:24 SUCCEEDED https://doi.org/10.15468/dl.1eycss
0030713-190918142434337 NA 2019-10-28 09:05:48 SUCCEEDED https://doi.org/10.15468/dl.g1z7y7
0000537-200127171203522 NA 2020-01-28 14:23:32 SUCCEEDED https://doi.org/10.15468/dl.apwtzv
0002883-200127171203522 2020-02-01T18:40:05.112+0000 SUCCEEDED https://doi.org/10.15468/dl.oztfun
0002883-200127171203522 NA 2020-02-01 18:40:05 SUCCEEDED https://doi.org/10.15468/dl.oztfun
0003154-200613084148143 NA 2020-06-17 08:36:35 SUCCEEDED https://doi.org/10.15468/dl.97jyjt
0003160-200613084148143 NA 2020-06-17 08:52:20 SUCCEEDED https://doi.org/10.15468/dl.as99qq
0123848-200613084148143 NA 2020-11-27 14:29:15 SUCCEEDED https://doi.org/10.15468/dl.49ksep
0124676-200613084148143 NA 2020-11-28 19:34:08 SUCCEEDED https://doi.org/10.15468/dl.v5xe9r
16 changes: 8 additions & 8 deletions src/1_download.Rmd
@@ -37,7 +37,7 @@ library(lubridate) # To work with dates
We define the country for which we want to build the data cube:

```{r define_countries}
countries <- c("IT")
countries <- c("LT")
```

## Basis of record
@@ -83,15 +83,15 @@ Trigger download:

```{r trigger_gbif_download}
# Reuse existing download (comment to trigger new download)
gbif_download_key <- "0002883-200127171203522"
# gbif_download_key <- "0124676-200613084148143"
# Trigger new download (commented by default)
# gbif_download_key <- occ_download(
# pred_in("country", countries),
# pred_in("basisOfRecord", basis_of_record),
# pred_gte("year", year_begin),
# pred_lte("year", year_end),
# pred("hasCoordinate", hasCoordinate),
gbif_download_key <- occ_download(
pred_in("country", countries),
pred_in("basisOfRecord", basis_of_record),
pred_gte("year", year_begin),
pred_lte("year", year_end),
pred("hasCoordinate", hasCoordinate))
# user = rstudioapi::askForPassword("GBIF username"),
# pwd = rstudioapi::askForPassword("GBIF password"),
# email = rstudioapi::askForPassword("Email address for notification")
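# Hypothetical follow-up, not part of this file: once triggered, one could
# wait for the download to finish and inspect its metadata:
# occ_download_wait(gbif_download_key)
# occ_download_meta(gbif_download_key)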
75 changes: 66 additions & 9 deletions src/2_create_db.Rmd
@@ -37,8 +37,8 @@ library(RSQLite) # To interact with SQlite databases
Define the key returned in `1_download.Rmd` and the country:

```{r define_key_countries}
key <- "0002883-200127171203522"
countries <- c("IT")
key <- "0003154-200613084148143"
countries <- c("BE")
```

Download the occurrences from GBIF:
@@ -194,8 +194,7 @@ cols_to_use <- c(
"lastInterpreted", "hasCoordinate", "hasGeospatialIssues", "decimalLatitude", "decimalLongitude", "coordinateUncertaintyInMeters",
"coordinatePrecision", "pointRadiusSpatialFit", "verbatimCoordinateSystem",
"verbatimSRS", "eventDate", "startDayOfYear", "endDayOfYear", "year", "month",
"day", "verbatimEventDate", "samplingProtocol", "samplingEffort", "issue",
"taxonKey", "acceptedTaxonKey", "kingdomKey", "phylumKey", "classKey",
"day", "verbatimEventDate", "samplingProtocol", "samplingEffort", "issue", "identificationVerificationStatus", "taxonKey", "acceptedTaxonKey", "kingdomKey", "phylumKey", "classKey",
"orderKey", "familyKey", "genusKey", "subgenusKey", "speciesKey", "species"
)
```
@@ -247,10 +246,24 @@ occurrenceStatus_to_discard <- c(
)
```

We create an index based on these two columns if not already present:
We also discard unverified observations:

```{r identificationVerificationStatus}
identificationVerificationStatus_to_discard <- c(
"unverified",
"unvalidated",
"not able to validate",
"control could not be conclusive due to insufficient knowledge",
"unconfirmed",
"unconfirmed - not reviewed",
"validation requested"
)
```
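
As a minimal illustration (not part of the workflow): the SQL filter applied further below compares `LOWER(identificationVerificationStatus)` against this list, so the match is case-insensitive:

```r
# Minimal illustration of the case-insensitive match used in the SQL filter.
status <- c("Unverified", "Validated", "VALIDATION REQUESTED")
tolower(status) %in% identificationVerificationStatus_to_discard
#> [1]  TRUE FALSE  TRUE
```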

We create an index based on these three columns if not already present:

```{r create_idx_occStatus_issue}
idx_occStatus_issue <- "idx_occStatus_issue"
idx_occStatus_issue <- "idx_verifStatus_occStatus_issue"
# get indexes on table
query <- glue_sql(
"PRAGMA index_list({table_name})",
@@ -265,7 +278,9 @@ if (!idx_occStatus_issue %in% indexes_all$name) {
"CREATE INDEX {`idx`} ON {table_name} ({`cols_idx`*})",
idx = idx_occStatus_issue,
table_name = table_name,
cols_idx = c("occurrenceStatus", "issue"),
cols_idx = c("identificationVerificationStatus",
"occurrenceStatus",
"issue"),
.con = sqlite_occ
)
dbExecute(sqlite_occ, query)
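# For reference, assuming table_name is "occ", the generated statement
# should look roughly like:
# CREATE INDEX `idx_verifStatus_occStatus_issue` ON `occ`
#   (`identificationVerificationStatus`, `occurrenceStatus`, `issue`)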
@@ -302,10 +317,11 @@ if (!table_name_subset %in% dbListTables(sqlite_occ)) {
fields = field_types_subset)
query <- glue_sql(
"INSERT INTO {small_table} SELECT {`some_cols`*} FROM {big_table} WHERE
occurrenceStatus NOT IN ({bad_status*}) AND ", issue_condition,
LOWER(identificationVerificationStatus) NOT IN ({unverified*}) AND LOWER(occurrenceStatus) NOT IN ({bad_status*}) AND ", issue_condition,
small_table = table_name_subset,
some_cols = names(field_types_subset),
big_table = table_name,
unverified = identificationVerificationStatus_to_discard,
bad_status = occurrenceStatus_to_discard,
.con = sqlite_occ
)
@@ -403,7 +419,48 @@ any(map_lgl(issues_to_discard,
}))
```

Overview of alll indexes present on `occ`:
We create an index on `identificationVerificationStatus`:

```{r idx_identificationVerificationStatus}
idx_issue <- "idx_identificationVerificationStatus"
if (!idx_issue %in% indexes$name) {
query <- glue_sql(
"CREATE INDEX {idx} ON {table_name} ({cols_idx})",
idx = idx_issue,
table_name = table_name_subset,
cols_idx = c("identificationVerificationStatus"),
.con = sqlite_occ
)
dbExecute(sqlite_occ, query)
}
```

Identification verification statuses left in the filtered data:

```{r check_identificationVerificationStatus_values}
query <- glue_sql(
"SELECT DISTINCT identificationVerificationStatus FROM {table}",
table = table_name_subset,
.con = sqlite_occ
)
status_verification_left <- dbGetQuery(sqlite_occ, query)
status_verification_left
```

Number of occurrences left:

```{r n_occs}
query <- glue_sql(
"SELECT COUNT() FROM {table}",
table = table_name_subset,
.con = sqlite_occ
)
n_occs <- dbGetQuery(sqlite_occ, query)
n_occs <- n_occs$`COUNT()`
n_occs
```

Overview of all indexes present on `occ`:

```{r index_filtered_table}
query <- glue_sql(
14 changes: 10 additions & 4 deletions src/3_assign_grid.Rmd
@@ -35,8 +35,8 @@ library(RSQLite) # To interact with SQlite databases
Define the key returned in `1_download.Rmd` and the country:

```{r define_key_countries}
key <- "0002883-200127171203522"
countries <- c("IT")
key <- "0003154-200613084148143"
countries <- c("BE")
```

Name and path of `.sqlite` file:
@@ -186,7 +186,13 @@ write_tsv(geodata_df, temp_file_coords, na = "")
remove(geodata_df)
```

We define the function to apply to each chunk:
Set the random number generator seed (this helps reproducibility). We use the numeric identifier of the [Zenodo dataset's DOI](https://doi.org/10.5281/zenodo.3637911) to which the occurrence cube will be published:

```{r set_seed}
set.seed(3637911)
```
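
To illustrate why a fixed seed matters here (the next chunk, as its chunk name suggests, assigns each occurrence a random point within its coordinate-uncertainty circle), here is a hypothetical helper; it is a sketch, not the repository's code:

```r
# Hypothetical sketch, not the repository's code: draw one uniformly
# distributed random point inside a circle of given radius (in metres)
# around a projected coordinate. With a fixed seed, the draw is reproducible.
random_point_in_circle <- function(x, y, radius) {
  angle <- runif(1, min = 0, max = 2 * pi)
  r <- radius * sqrt(runif(1))  # sqrt() keeps points uniform over the disc
  c(x = x + r * cos(angle), y = y + r * sin(angle))
}
random_point_in_circle(4050000, 3200000, radius = 100)
```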

We define the function `reproject_assign()` to apply to each chunk:

```{r transform_to_3035_assign_pts_in_circle}
reproject_assign <- function(df, pos){
@@ -255,7 +261,7 @@ file.remove(temp_file_coords)

## Add grid cell code to sqlite file

We can now add the column `eea_cell_code` to the table `occ_be` of the `.sqlite` file. We first create the new column `eea_cell_code` in the table:
We can now add the column `eea_cell_code` to the table `occ` of the `.sqlite` file. We first create the new column `eea_cell_code` in the table:

```{r add_eaa_cellcode_to_sqlite}
new_col <- "eea_cell_code"
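# In plain SQL, creating the column presumably amounts to something like
# (hypothetical; the table name is assumed to be "occ"):
# ALTER TABLE occ ADD COLUMN eea_cell_code TEXT;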
24 changes: 15 additions & 9 deletions src/4_aggregate.Rmd
@@ -33,8 +33,8 @@ library(RSQLite) # To interact with SQlite databases
Define the key returned in `1_download.Rmd` and the country:

```{r define_key_countries}
key <- "0002883-200127171203522"
countries <- c("IT")
key <- "0003154-200613084148143"
countries <- c("BE")
```

Name and path of `.sqlite` file:
@@ -235,13 +235,19 @@ taxa_species <-
# get unique 'speciesKey'
distinct(speciesKey) %>%
# rename column 'speciesKey' to 'key'
rename(key = speciesKey) %>%
# GBIF query via name_usage with 'key' column as input
pmap_dfr(name_usage, return = "data") %>%
# extract speciesKey
pull(speciesKey) %>%
# GBIF query via name_usage
map(~name_usage(key = .x)) %>%
# Select data
map(~.x[["data"]]) %>%
# Merge all taxa in a data.frame
reduce(full_join) %>%
# select columns of interest
select(speciesKey, scientificName, rank, taxonomicStatus) %>%
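# Note: the map()/reduce(full_join) pattern above replaces
# pmap_dfr(name_usage, return = "data"), whose `return` argument is no
# longer supported in newer rgbif versions; each key is queried separately
# and the `data` elements are merged into one data.frame.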
