Merge pull request #7 from trias-project/make_new_cubes_without_unverified_occs

Remove unverified occurrences and add cubes of Slovenia and Lithuania
damianooldoni authored Dec 1, 2020
2 parents 2b87b84 + 8b3920c commit ca526b8
Showing 6 changed files with 118 additions and 36 deletions.
19 changes: 14 additions & 5 deletions README.md
@@ -50,15 +50,24 @@ taxonKey | scientificName | numberOfOccurrences | taxonRank | taxonomicStatus
8361333 | Fallopia compacta (Hook.fil.) G.H.Loos & P.Keil | 24 | SPECIES | SYNONYM
7291673 | Polygonum reynoutria (Houtt.) Makino | 3 | SPECIES | SYNONYM

See https://doi.org/10.15468/dl.rej1cz for more details. Note: the table above is just an example and can be outdated.
Table based on this [GBIF download](https://doi.org/10.15468/dl.rej1cz).

Aggregation would lose this information, so alongside the cubes, e.g. `be_species_cube.csv`, we provide taxonomic compendia, e.g. `be_species_info.csv`. For each taxon in the cube they list all the synonyms or infraspecific taxa whose occurrences contribute to the total count. Unlike the data cubes of alien species, these data cubes are built entirely upon the taxonomic relationships of the [GBIF Backbone Taxonomy](https://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c). Both data cubes and taxonomic compendia are saved in `data/processed`.
Aggregation would lose this information, so alongside the cubes, e.g. `be_species_cube.csv` for Belgium, we provide taxonomic compendia, e.g. `be_species_info.csv`. For each taxon in the cube they list all the synonyms or infraspecific taxa whose occurrences contribute to the total count. Unlike the data cubes of alien species, these data cubes are built entirely upon the taxonomic relationships of the [GBIF Backbone Taxonomy](https://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c). Both data cubes and taxonomic compendia are saved in `data/processed`.

For example, _Aedes japonicus (Theobald, 1901)_ is an accepted species present in the Belgian cube: based on the information stored in `occ_belgium_taxa.tsv`, its occurrence count includes occurrences linked to the following taxa:
1. [Aedes japonicus (Theobald, 1901)](https://www.gbif.org/species/1652212)
2. [Ochlerotatus japonicus (Theobald, 1901)](https://www.gbif.org/species/4519733)
3. [Aedes japonicus subsp. japonicus](https://www.gbif.org/species/7346173)
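
As an aside (not part of the repository), such synonyms could be looked up with [rgbif](https://docs.ropensci.org/rgbif/), the package used in `src`; the taxonKey below is taken from the first link above and the returned columns may vary with the rgbif version:

```r
# Sketch: list GBIF Backbone synonyms of Aedes japonicus (taxonKey 1652212).
library(rgbif)
syns <- name_usage(key = 1652212, data = "synonyms")$data
syns[, c("key", "scientificName", "taxonomicStatus")]
```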

We provide the occurrence cube and corresponding taxonomic compendium for the following European countries:

country | countryCode
--- | ---
Belgium | BE
Italy | IT
Slovenia | SI
Lithuania | LT

## Repo structure

The repository structure is based on [Cookiecutter Data Science](http://drivendata.github.io/cookiecutter-data-science/). Files and directories indicated with `GENERATED` should not be edited manually.
@@ -89,14 +98,14 @@ Clone this repository to your computer and open the RStudio project file, `occ-

You can generate a national occurrence data cube by running the [R Markdown files](https://rmarkdown.rstudio.com/) in `src` following the order shown here below:

1. `1_download.Rmd`: trigger a GBIF download and add it to the list of triggered downloads
1. `1_download.Rmd`: trigger a GBIF download for a specific country and add it to the list of triggered downloads
2. `2_create_db.Rmd`: create a sqlite database and perform basic data cleaning
3. `3_assign_grid.Rmd`: assign geographic cell code to occurrence data
4. `4_aggregate.Rmd`: aggregate occurrences per taxon, year and cell code into the _Belgian occurrence data cube_
4. `4_aggregate.Rmd`: aggregate occurrences per taxon, year and cell code into the _national occurrence data cube_

The data cubes are automatically generated in the folder `/data/processed/`.

Install any required packages first.
Install any required package first.
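
As an illustration (not part of the repository), the four steps could be rendered in sequence from the project root, assuming all required packages are installed:

```r
# Illustrative only: render the four R Markdown steps in order.
library(rmarkdown)
steps <- c("1_download.Rmd", "2_create_db.Rmd",
           "3_assign_grid.Rmd", "4_aggregate.Rmd")
for (f in steps) {
  render(file.path("src", f))
}
```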

## Contributors

6 changes: 5 additions & 1 deletion data/raw/gbif_downloads.tsv
@@ -3,4 +3,8 @@ gbif_download_key input_checklist gbif_download_created gbif_download_status gbi
0008507-190621201848488 NA 2019-07-09 08:31:24 SUCCEEDED https://doi.org/10.15468/dl.1eycss
0030713-190918142434337 NA 2019-10-28 09:05:48 SUCCEEDED https://doi.org/10.15468/dl.g1z7y7
0000537-200127171203522 NA 2020-01-28 14:23:32 SUCCEEDED https://doi.org/10.15468/dl.apwtzv
0002883-200127171203522 2020-02-01T18:40:05.112+0000 SUCCEEDED https://doi.org/10.15468/dl.oztfun
0002883-200127171203522 NA 2020-02-01 18:40:05 SUCCEEDED https://doi.org/10.15468/dl.oztfun
0003154-200613084148143 NA 2020-06-17 08:36:35 SUCCEEDED https://doi.org/10.15468/dl.97jyjt
0003160-200613084148143 NA 2020-06-17 08:52:20 SUCCEEDED https://doi.org/10.15468/dl.as99qq
0123848-200613084148143 NA 2020-11-27 14:29:15 SUCCEEDED https://doi.org/10.15468/dl.49ksep
0124676-200613084148143 NA 2020-11-28 19:34:08 SUCCEEDED https://doi.org/10.15468/dl.v5xe9r
16 changes: 8 additions & 8 deletions src/1_download.Rmd
@@ -37,7 +37,7 @@ library(lubridate) # To work with dates
We define the country for which we want to build the data cube:

```{r define_countries}
countries <- c("IT")
countries <- c("LT")
```

## Basis of record
@@ -83,15 +83,15 @@ Trigger download:

```{r trigger_gbif_download}
# Reuse existing download (comment to trigger new download)
gbif_download_key <- "0002883-200127171203522"
# gbif_download_key <- "0124676-200613084148143"
# Trigger new download (commented by default)
# gbif_download_key <- occ_download(
# pred_in("country", countries),
# pred_in("basisOfRecord", basis_of_record),
# pred_gte("year", year_begin),
# pred_lte("year", year_end),
# pred("hasCoordinate", hasCoordinate),
gbif_download_key <- occ_download(
pred_in("country", countries),
pred_in("basisOfRecord", basis_of_record),
pred_gte("year", year_begin),
pred_lte("year", year_end),
pred("hasCoordinate", hasCoordinate))
# user = rstudioapi::askForPassword("GBIF username"),
# pwd = rstudioapi::askForPassword("GBIF password"),
# email = rstudioapi::askForPassword("Email address for notification")
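# Hypothetical follow-up, not part of this file: once triggered, one could
# wait for the download to finish and inspect its metadata:
# occ_download_wait(gbif_download_key)
# occ_download_meta(gbif_download_key)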
75 changes: 66 additions & 9 deletions src/2_create_db.Rmd
@@ -37,8 +37,8 @@ library(RSQLite) # To interact with SQlite databases
Define the key returned in `1_download.Rmd` and the country:

```{r define_key_countries}
key <- "0002883-200127171203522"
countries <- c("IT")
key <- "0003154-200613084148143"
countries <- c("BE")
```

Download the occurrences from GBIF:
@@ -194,8 +194,7 @@ cols_to_use <- c(
"lastInterpreted", "hasCoordinate", "hasGeospatialIssues", "decimalLatitude", "decimalLongitude", "coordinateUncertaintyInMeters",
"coordinatePrecision", "pointRadiusSpatialFit", "verbatimCoordinateSystem",
"verbatimSRS", "eventDate", "startDayOfYear", "endDayOfYear", "year", "month",
"day", "verbatimEventDate", "samplingProtocol", "samplingEffort", "issue",
"taxonKey", "acceptedTaxonKey", "kingdomKey", "phylumKey", "classKey",
"day", "verbatimEventDate", "samplingProtocol", "samplingEffort", "issue", "identificationVerificationStatus", "taxonKey", "acceptedTaxonKey", "kingdomKey", "phylumKey", "classKey",
"orderKey", "familyKey", "genusKey", "subgenusKey", "speciesKey", "species"
)
```
@@ -247,10 +246,24 @@ occurrenceStatus_to_discard <- c(
)
```

We create an index based on these two columns if not already present:
We also discard unverified observations:

```{r identificationVerificationStatus}
identificationVerificationStatus_to_discard <- c(
"unverified",
"unvalidated",
"not able to validate",
"control could not be conclusive due to insufficient knowledge",
"unconfirmed",
"unconfirmed - not reviewed",
"validation requested"
)
```
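
As a minimal illustration (not part of the workflow): the SQL filter applied further below compares `LOWER(identificationVerificationStatus)` against this list, so the match is case-insensitive:

```r
# Minimal illustration of the case-insensitive match used in the SQL filter.
status <- c("Unverified", "Validated", "VALIDATION REQUESTED")
tolower(status) %in% identificationVerificationStatus_to_discard
#> [1]  TRUE FALSE  TRUE
```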

We create an index based on these three columns if not already present:

```{r create_idx_occStatus_issue}
idx_occStatus_issue <- "idx_occStatus_issue"
idx_occStatus_issue <- "idx_verifStatus_occStatus_issue"
# get indexes on table
query <- glue_sql(
"PRAGMA index_list({table_name})",
@@ -265,7 +278,9 @@ if (!idx_occStatus_issue %in% indexes_all$name) {
"CREATE INDEX {`idx`} ON {table_name} ({`cols_idx`*})",
idx = idx_occStatus_issue,
table_name = table_name,
cols_idx = c("occurrenceStatus", "issue"),
cols_idx = c("identificationVerificationStatus",
"occurrenceStatus",
"issue"),
.con = sqlite_occ
)
dbExecute(sqlite_occ, query)
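# For reference, assuming table_name is "occ", the generated statement
# should look roughly like:
# CREATE INDEX `idx_verifStatus_occStatus_issue` ON `occ`
#   (`identificationVerificationStatus`, `occurrenceStatus`, `issue`)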
@@ -302,10 +317,11 @@ if (!table_name_subset %in% dbListTables(sqlite_occ)) {
fields = field_types_subset)
query <- glue_sql(
"INSERT INTO {small_table} SELECT {`some_cols`*} FROM {big_table} WHERE
occurrenceStatus NOT IN ({bad_status*}) AND ", issue_condition,
LOWER(identificationVerificationStatus) NOT IN ({unverified*}) AND LOWER(occurrenceStatus) NOT IN ({bad_status*}) AND ", issue_condition,
small_table = table_name_subset,
some_cols = names(field_types_subset),
big_table = table_name,
unverified = identificationVerificationStatus_to_discard,
bad_status = occurrenceStatus_to_discard,
.con = sqlite_occ
)
@@ -403,7 +419,48 @@ any(map_lgl(issues_to_discard,
}))
```

Overview of alll indexes present on `occ`:
We create an index on `identificationVerificationStatus`:

```{r idx_identificationVerificationStatus}
idx_issue <- "idx_identificationVerificationStatus"
if (!idx_issue %in% indexes$name) {
query <- glue_sql(
"CREATE INDEX {idx} ON {table_name} ({cols_idx})",
idx = idx_issue,
table_name = table_name_subset,
cols_idx = c("identificationVerificationStatus"),
.con = sqlite_occ
)
dbExecute(sqlite_occ, query)
}
```

Identification verification statuses left in the filtered data:

```{r check_identificationVerificationStatus_values}
query <- glue_sql(
"SELECT DISTINCT identificationVerificationStatus FROM {table}",
table = table_name_subset,
.con = sqlite_occ
)
status_verification_left <- dbGetQuery(sqlite_occ, query)
status_verification_left
```

Number of occurrences left:

```{r n_occs}
query <- glue_sql(
"SELECT COUNT() FROM {table}",
table = table_name_subset,
.con = sqlite_occ
)
n_occs <- dbGetQuery(sqlite_occ, query)
n_occs <- n_occs$`COUNT()`
n_occs
```

Overview of all indexes present on `occ`:

```{r index_filtered_table}
query <- glue_sql(
14 changes: 10 additions & 4 deletions src/3_assign_grid.Rmd
@@ -35,8 +35,8 @@ library(RSQLite) # To interact with SQlite databases
Define the key returned in `1_download.Rmd` and the country:

```{r define_key_countries}
key <- "0002883-200127171203522"
countries <- c("IT")
key <- "0003154-200613084148143"
countries <- c("BE")
```

Name and path of `.sqlite` file:
@@ -186,7 +186,13 @@ write_tsv(geodata_df, temp_file_coords, na = "")
remove(geodata_df)
```

We define the function to apply to each chunk:
Set the random number generator seed (this helps reproducibility). We use the numeric identifier of the [Zenodo dataset's DOI](https://doi.org/10.5281/zenodo.3637911) to which the occurrence cube will be published:

```{r set_seed}
set.seed(3637911)
```
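
To illustrate why a fixed seed matters here (the next chunk, as its chunk name suggests, assigns each occurrence a random point within its coordinate-uncertainty circle), here is a hypothetical helper; it is a sketch, not the repository's code:

```r
# Hypothetical sketch, not the repository's code: draw one uniformly
# distributed random point inside a circle of given radius (in metres)
# around a projected coordinate. With a fixed seed, the draw is reproducible.
random_point_in_circle <- function(x, y, radius) {
  angle <- runif(1, min = 0, max = 2 * pi)
  r <- radius * sqrt(runif(1))  # sqrt() keeps points uniform over the disc
  c(x = x + r * cos(angle), y = y + r * sin(angle))
}
random_point_in_circle(4050000, 3200000, radius = 100)
```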

We define the function `reproject_assign()` to apply to each chunk:

```{r transform_to_3035_assign_pts_in_circle}
reproject_assign <- function(df, pos){
@@ -255,7 +261,7 @@ file.remove(temp_file_coords)

## Add grid cell code to sqlite file

We can now add the column `eea_cell_code` to the table `occ_be` of the `.sqlite` file. We first create the new column `eea_cell_code` in the table:
We can now add the column `eea_cell_code` to the table `occ` of the `.sqlite` file. We first create the new column `eea_cell_code` in the table:

```{r add_eaa_cellcode_to_sqlite}
new_col <- "eea_cell_code"
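# In plain SQL, creating the column presumably amounts to something like
# (hypothetical; the table name is assumed to be "occ"):
# ALTER TABLE occ ADD COLUMN eea_cell_code TEXT;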
24 changes: 15 additions & 9 deletions src/4_aggregate.Rmd
@@ -33,8 +33,8 @@ library(RSQLite) # To interact with SQlite databases
Define the key returned in `1_download.Rmd` and the country:

```{r define_key_countries}
key <- "0002883-200127171203522"
countries <- c("IT")
key <- "0003154-200613084148143"
countries <- c("BE")
```

Name and path of `.sqlite` file:
@@ -235,13 +235,19 @@ taxa_species <-
# get unique 'speciesKey'
distinct(speciesKey) %>%
# rename column 'speciesKey' to 'key'
rename(key = speciesKey) %>%
# GBIF query via name_usage with 'key' column as input
pmap_dfr(name_usage, return = "data") %>%
# extract speciesKey
pull(speciesKey) %>%
# GBIF query via name_usage
map(~name_usage(key = .x)) %>%
# Select data
map(~.x[["data"]]) %>%
# Merge all taxa in a data.frame
reduce(full_join) %>%
# select columns of interest
select(speciesKey, scientificName, rank, taxonomicStatus) %>%
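# Note: the map()/reduce(full_join) pattern above replaces
# pmap_dfr(name_usage, return = "data"), whose `return` argument is no
# longer supported in newer rgbif versions; each key is queried separately
# and the `data` elements are merged into one data.frame.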
