-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathStateoftheDB.Rmd
660 lines (527 loc) · 29.4 KB
/
StateoftheDB.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
---
title: "State of the Neotoma Paleoecology Database"
shorttitle: "Neotoma Status"
author:
- name: "Simon Goring"
email: "[email protected]"
- affiliation:
- id: "1"
institution: "Department of Geography, University of Wisconsin--Madison"
date: "`r lubridate::today()`"
description: |
A summary of current Neotoma activity, potential data issues and other summary information.
keywords: "SQL, Neotoma Paleoecology Database, paleoecology"
output:
html_document:
theme: flatly
css: "assets/text.css"
toc: true
toc_depth: 4
toc_float: true
number_sections: true
self_contained: true
fig_caption: true
code_folding: hide
---
```{r setup, include=FALSE, result = 'hide', echo=FALSE, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
library(pacman)
p_load(lubridate, RPostgreSQL, DT, ggplot2,
ggthemes, svglite,
httr, jsonlite, dplyr, leaflet, geojsonsf, sf, purrr, forcats, scales,
leaflet.providers, dotenv)
load_dot_env()
lastdate <- today() - years(1)
con <- dbConnect(dbDriver("PostgreSQL"),
dbname = Sys.getenv("DBNAME"),
host = Sys.getenv("HOST"),
port = Sys.getenv("PORT"),
user= Sys.getenv("USER"),
password= Sys.getenv("PASSWORD"))
p <- function(x) format(x, scientific=FALSE, big.mark=',')
```
# Data Summary
## Overall Database Summary
```{r getSummStats, echo=FALSE}
datasets <- dbGetQuery(con, "SELECT COUNT(*) FROM ndb.datasets")
sites <- dbGetQuery(con, "SELECT COUNT(*) FROM ndb.sites")
lastyear <- dbGetQuery(con, "SELECT * FROM ndb.rawbymonth(0,12)")
piset <- dbGetQuery(con, "SELECT COUNT(DISTINCT contactid) FROM ndb.datasetpis;")
analystset <- dbGetQuery(con, "SELECT COUNT(DISTINCT contactid) FROM ndb.sampleanalysts;")
constdb <- dbGetQuery(con, "SELECT COUNT(*) FROM ndb.constituentdatabases;")
```
This report details changes to the [Neotoma Paleoecology database](https://neotomadb.org) since **`r lastdate`**, and is current up to **`r lubridate::today()`**. Full documentation of the database can be found in the [Neotoma Database Manual](open.neotomadb.org/neotoma_bookdown/_book/index.html). Recent snapshots of the database can be obtained from the [Neotoma Snapshot website](https://neotomadb.org/snapshots). This report is generated automatically from an [RMarkdown document hosted on GitHub](https://github.com/NeotomaDB/NeotomaStats).
Neotoma contains data from `r p(datasets)` datasets and `r p(sites)` unique sites. This represents a considerable contribution from members of the scientific community, including `r p(piset)` primary investigators, `r p(analystset)` analysts, and stewards for all `r p(constdb)` constituent databases. There are also invaluable and incalculable contributions from the members of the Neotoma Paleoecology Database Community.
```{r dftable, echo=FALSE, fig.cap=""}
outtable <- t(lastyear) %>%
as.data.frame %>%
rename('New Entries'='1') %>%
mutate('New Entries' = p(`New Entries`))
DT::datatable(outtable,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left;',
"Table 1: ", htmltools::em("Database contributions from the prior 12 months for the Neotoma Paleoecology Database.")
),
options = list(dom = 't', scrollX = "100%"))
```
### Recent Data Updates
#### Site Additions
```{r plotSites, echo=FALSE, warning=FALSE, fig.cap='**Figure 1.** *Locations of newly added sites in Neotoma during the past year. The map is interactive and supports zoom/pan operations. Individual sites can be selected and you will be directed to a link for the [Neotoma Explorer](https://apps.neotomadb.org)*.'}
use_providers()
newSites <- dbGetQuery(con, "SELECT ST_AsGeoJSON(geog)::varchar AS loc FROM ndb.sites WHERE recdatecreated > now() - interval '1 year'") %>%
unlist() %>%
na.omit()
shape <- newSites %>% stringr::str_detect('Polygon')
sf <- geojson_sf(newSites) %>% sf::st_cast("POINT")
map <- leaflet::leaflet(sf) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
leaflet::addCircleMarkers(color='red',
stroke = FALSE,
fillOpacity = 0.6,
clusterOptions = markerClusterOptions())
map
```
Of the `r p(length(newSites))` sites added, `r p(sum(shape == TRUE))` have been entered as polygons, while `r p(sum(shape == FALSE))` are entered as single coordinate points. In general polygons provide more complete information about the site, often representing the particular shape of the depositional environment (lake, archaeological site).
### Dataset Additions
```{r dbcontribds, echo=FALSE, warning=FALSE}
newDsByDB <- dbGetQuery(con, "SELECT * FROM ndb.stewardcontrib(endperiod:=12) ORDER BY counts DESC;")
DT::datatable(newDsByDB,
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left;',
"Table 1: ", htmltools::em("Dataset contributions from the prior 12 months for the Neotoma Paleoecology Database, organized by constituent database.")
),
options = list(dom = 't', scrollX = "100%"))
newDsByDt <- dbGetQuery(con, "SELECT * FROM ndb.datasettypecontrib(endperiod:=12) ORDER BY counts DESC;")
datasettypes <- function(x) {
# We have to set a long timeout here:
call <- httr::GET("http://api.neotomadb.org/v2.0/data/summary/dstypemonth",
query = list(start = x, end = x + 1), httr::timeout(10000))
result <- jsonlite::fromJSON(httr::content(call, as = "text"))$data$data
result$month <- x
result$counts <- as.numeric(result$counts)
return(result)
}
c_trans <- function(a, b, breaks = b$breaks, format = b$format) {
a <- as.trans(a)
b <- as.trans(b)
name <- paste(a$name, b$name, sep = "-")
trans <- function(x) a$trans(b$trans(x))
inv <- function(x) b$inverse(a$inverse(x))
trans_new(name, trans, inv, breaks, format = format)
}
rev_date <- c_trans("reverse", "time")
monthly_change <- 0:18 %>% map(datasettypes) %>% bind_rows()
mc2 <- fct_collapse(monthly_change$datasettype,
biomarker = "biomarker",
charcoal = c("charcoal",
"macrocharcoal",
"microcharcoal",
"charcoal surface sample"),
diatom = c("diatom", "diatom surface sample"),
geochemistry = "geochemistry",
insect = "insect",
LOI = "loss-on-ignition",
macroinvertebrate = "macroinvertebrate",
ostracode = c("ostracode surface sample"),
sedimentology = c("physical sedimentology",
"paleomagnetic",
"X-ray fluorescence (XRF)"),
"phytolith" = "phytolith",
"plant macrofossil" = "plant macrofossil",
pollen = c("pollen", "pollen trap",
"pollen surface sample"),
"stable isotope" = "specimen stable isotope",
"testate amoebae" = c("testate amoebae",
"testate amoebae surface sample"),
"vertebrate fauna" = "vertebrate fauna",
"water chemistry" = "water chemistry")
monthly_change$datasettype <- mc2
monthly_out <- monthly_change %>%
group_by(datasettype, month) %>%
summarise(count = sum(counts)) %>%
mutate(month = (Sys.Date() %m-% months(month, abbreviate = FALSE)))
out <- ggplot(monthly_out, aes(x = month, y = count, fill = datasettype)) +
geom_bar(stat = "identity") +
theme_tufte() +
scale_fill_viridis_d(name = "Dataset Type") +
scale_x_date(date_labels = "%b %Y") +
xlab("Month") +
ylab("Number of Datasets Uploaded") +
theme(axis.title = element_text(face = "bold", size = 18),
axis.text = element_text(family = "sans-serif", size = 10))
ggsave("outputs/datasetsPerMonth.png", out, width = 8, height = 6, units = "in")
ggsave("outputs/datasetsPerMonth.svg", out, width = 8, height = 6, units = "in")
```
Of the `r lastyear$datasets` datasets added to Neotoma over the past year, there have been contributions to `r nrow(newDsByDB)` constituent databases, with the majority from `r newDsByDB[1,1]`. This pattern of contribution is reflected in contributions to dataset types, where we see contributions to `r nrow(newDsByDt)` dataset types.
```{r dbcontribdst}
DT::datatable(newDsByDt,
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left;',
"Table 2: ", htmltools::em("Dataset contributions from the prior 12 months for the Neotoma Paleoecology Database, organized by dataset type.")
),
options = list(dom = 't', scrollX = "100%"))
```
```{r, echo=FALSE, fig.cap="Dataset contributions to Neotoma over the previous 18 months. The large number of Neotoma dataset types makes color coding difficult, however results are detailed specifically in Table 1."}
out
```
### Constituent Databases
Neotoma consists of `r constdb` constituent databases. At any one time some databases may be more active than others.
### Contributors
```{r}
newDsByCt <- dbGetQuery(con, "SELECT * FROM ndb.datasetcontribs(endperiod:=12) ORDER BY counts DESC;")
DT::datatable(newDsByCt,
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left;',
"Table 1: ", htmltools::em("Dataset contributions from the prior 12 months for the Neotoma Paleoecology Database, organized by the dataset contributor.")
),
options = list(dom = 't', scrollX = "100%"))
```
Neotoma relies on significant the efforts of a volunteer group of data stewards and data contributors. Over the last 12 months `r nrow(newDsByCt)` stewards have contributed data to Neotoma, across a range of constituent databases.
### API Calls
```{r pullLogs, echo=FALSE, fig.caption="Volume of API calls to API and Tilia servers, aggregated by day over the last 6 months."}
pullLog <- jsonlite::fromJSON('log_run.json')['results'][[1]] %>%
purrr::map(function(x) tidyr::pivot_wider(x,names_from=field)) %>%
bind_rows() %>%
mutate(date = lubridate::date(date),
calls = as.integer(calls),
volume = as.integer(volume))
pullCalls <- jsonlite::fromJSON('log_run_calls.json')['results'][[1]] %>%
purrr::map(function(x) tidyr::pivot_wider(x,names_from=field)) %>%
bind_rows() %>%
mutate(calls = as.integer(calls),
path = stringr::str_replace(path, '/[0-9,]*$', '')) %>%
group_by(path) %>%
summarize(fullcount = sum(calls)) %>%
arrange(desc(fullcount))
ggplot(pullLog) +
geom_bar(aes(x = date,
y = calls,
fill = explorerCall),
position = 'stack',
stat = 'identity') +
scale_fill_viridis_d() +
theme_tufte() +
scale_y_sqrt() +
labs(y = 'Total API Calls by Day', x= 'Date of API Access')
```
Since the API has been implemented there have been a total of `r p(sum(pullLog$calls))` calls to the Neotoma API. These include calls to the core API ([`api.neotomadb.org`](https://api.neotomadb.org)), calls to support the Neotoma Landing Pages ([`data.neotomadb.org`](https://data.neotomadb.org)) and calls to support Neotoma Explorer ([`apps.neotomadb.org/explorer`](https://data.neotomadb.org)).
The main APIs delivered a total of `r p(floor(sum(pullLog$volume, na.rm=TRUE) /1000000000))` GB of data to users since `r min(pullLog$date)`.
#### Specific API Calls
```{r callFrequency, echo=FALSE, warning=FALSE}
ggplot(pullCalls[1:20,]) +
geom_bar(aes(x = 1:20,
y=fullcount/1000), stat='identity') +
theme_tufte() +
labs(x='', y='Thousands of Calls') +
theme(axis.text.x=element_blank())
```
Several API calls are called thousands of times, but these are not necessarily the fastest, or slowest queries. There is no relationship between speed and the number of times an API endpoint is used. The most frequent API calls are:
```{r freqCall, echo=FALSE}
pullCalls %>%
slice_max(fullcount, n = 20) %>%
select(path, fullcount) %>%
DT::datatable(options=list(dom='tip', scrollX = "100%"), rownames = FALSE)
```
## Data Overview
### Site Spatial Types
Sites can be added as either points or polygons. Of the `r p(length(newSites))` sites added, `r p(sum(shape == TRUE))` of those are entered as site polygons, while `r p(sum(shape == FALSE))` are entered as single coordinate points. In general polygons provide more complete information about the site, often representing the particular shape of the depositional environment (lake, archaeological site).
### Site Metadata
```{r siteMeta, echo=FALSE}
siteMeta <- dbGetQuery(con, "SELECT sitename,
altitude,
area,
sitedescription,
notes
FROM ndb.sites
WHERE recdatecreated > now() - interval '1 year'")
missing <- colSums(is.na(siteMeta)) %>%
data.frame(Missing=.)
DT::datatable(missing,
options = list(dom = 'tip', scrollX = "100%"))
```
Among the `r p(nrow(siteMeta))` sites added to the Neotoma Paleoecology Database in the past year, not all sites were entered with complete metadata. Complete metadata is critical for better understanding data context, particularly when site notes & descriptions are required to better understand data.
#### Dataset Metadata
```{r datasetMeta, echo=FALSE}
datasetMeta <- dbGetQuery(con,
"SELECT cu.handle AS handle,
cu.collunitname AS collunitname,
cu.colldate AS colldate,
cu.colldevice AS colldevice,
cu.waterdepth AS waterdepth,
cu.slopeangle AS slopeangle,
cu.slopeaspect AS aspect,
cu.location AS location,
cu.notes AS collunitnotes,
ds.datasetname AS datasetname,
ds.notes AS datasetnotes
FROM ndb.datasets AS ds
INNER JOIN ndb.collectionunits AS cu ON ds.collectionunitid = cu.collectionunitid
WHERE ds.recdatecreated > now() - interval '1 year'")
missing <- colSums(is.na(datasetMeta)) %>%
data.frame(Missing=.)
DT::datatable(missing,
options = list(dom = 'tip', scrollX = "100%"))
```
### Taxon Overview
```{r totalcount, include=FALSE}
count <- dbGetQuery(con, "SELECT COUNT(*) FROM ndb.taxa")
```
There are `r p(count)` taxa recorded in the Neotoma Taxonomy table. These are not exclusively taxonomic records, but include other variables, such as laboratory measurements and other detected features within samples.
#### Taxon Hierarchy
```{r taxonHierarchy, include=FALSE}
query <- "SELECT * FROM ndb.taxa WHERE taxonid = highertaxonid"
result <- dbGetQuery(con, query)
```
Taxonomic records are structured hierarchically, with `highertaxonid` pointing to the next highest `taxonid` in the database. **These hierarchies do not necessarily reflect taxonomic hierarchy**. Issues with taxon hierarchy may be the result of improper identification of high level taxa, failure to identify high level taxa, or duplicate records were multiple higher level taxa are identified.
##### Highest-Level Taxa
The highest-level taxa can be identified because they have `taxonid==highertaxonid`. Within the database there are `r p(nrow(result))` highest level taxa:
```{r highestOrderTaxa, echo=FALSE}
DT::datatable(result,
options=list(dom='tip', scrollX = "100%"), rownames = FALSE)
```
This table is provided largely for information, to help identify records that are identified as "highest level", that should be otherwise grouped.
##### Taxa with no relationships
```{r terminalLeaves, include=FALSE, warning=FALSE}
termCount <- dbGetQuery(con,
"SELECT tx.*, COUNT(var.variableid) AS count
FROM ndb.taxa AS tx
LEFT JOIN ndb.taxa AS htx ON tx.taxonid = htx.highertaxonid
LEFT JOIN ndb.variables AS var ON var.taxonid = tx.taxonid
WHERE htx.taxonid IS NULL
GROUP BY tx.taxonid")
```
There are `r p(nrow(termCount))` taxa that represent "leaves" in the Neotoma taxon tree. Of these, `r p(sum(termCount$count == 0))` have no recorded counts (the `taxonid` does not appear in the `ndb.variables` table). These are taxa that are not part of a morpohotaxonomic hierarchy (so there are no dependent taxa), and also have no associated sample records:
```{r emptyTerminals, echo=FALSE, warning=FALSE}
termCount[termCount$count==0,] %>%
select(taxonid, taxonname, author, highertaxonid, taxagroupid,count) %>%
DT::datatable(extensions = 'Buttons',
rownames = FALSE,
options = list(dom = 'Bfrtip',
buttons = c('csv', 'print'),
scrollX = "100%")
)
```
##### Taxa with Undefined Higher Taxa
```{r missingTaxonId, echo=FALSE}
missCount <- dbGetQuery(con,
"SELECT tx.*, COUNT(var.variableid) AS count
FROM ndb.taxa AS tx
LEFT JOIN ndb.variables AS var ON var.taxonid = tx.taxonid
WHERE tx.highertaxonid IS NULL
GROUP BY tx.taxonid")
```
Some taxa do not have defined `highertaxonid` values. Currently there is a count of `r nrow(missCount)` taxa without defined higher taxon IDs. It is unclear why these taxa do not have related higher taxonomic elements.
```{r missingHigher, echo=FALSE}
missCount %>%
select(taxonid, taxonname, author, highertaxonid, taxagroupid,count) %>%
DT::datatable(extensions = 'Buttons',
rownames = FALSE,
options = list(dom = 'Bfrtip',
buttons = c('csv', 'print'),
scrollX = "100%")
)
```
#### Duplicated Taxa
```{r dupTaxa, echo=FALSE}
dupTx <- dbGetQuery(con,
"SELECT tx.taxonname, COUNT(*)
FROM ndb.taxa AS tx
GROUP BY tx.taxonname
HAVING COUNT(*) > 1")
dupTxEg <- dbGetQuery(con,
"WITH taxSum AS (
SELECT tx.*,
COUNT(var.*) AS records
FROM ndb.taxa AS tx
LEFT JOIN ndb.variables AS var ON var.taxonid = tx.taxonid
WHERE tx.valid = true
GROUP BY tx.taxonid)
SELECT tx.taxonname, tx.taxagroupid,
json_agg(jsonb_build_object('id', tx.taxonid,
'code', tx.taxoncode,
'count', tx.records))::varchar
FROM taxSum AS tx
GROUP BY tx.taxonname, tx.taxagroupid
HAVING COUNT(*) > 1")
```
Taxa are identified by `taxonname` and `taxagroupid`. There are instances of duplicate `taxonname`, but these should be represented by distinct `taxagroupid` values. There are `r p(nrow(dupTxEg))` taxa where the `taxonname` is duplicated (and the taxon is `valid`).
```{r duptaxaTable, echo=FALSE}
dupTxEg %>%
mutate(json_agg = unlist(purrr::map(dupTxEg$json_agg, prettify))) %>%
DT::datatable(extensions = 'Buttons',
rownames = FALSE,
options = list(dom = 'Bfrtip',
buttons = c('csv', 'print'),
scrollX = "100%")
)
```
##### Duplicated Taxon Codes
```{r dupTaxaCode, echo=FALSE}
dupTc <- dbGetQuery(con,
"SELECT tx.taxoncode, COUNT(*)
FROM ndb.taxa AS tx
GROUP BY tx.taxoncode
HAVING COUNT(*) > 1")
dupTcEg <- dbGetQuery(con,
"WITH taxSum AS (
SELECT tx.*,
COUNT(var.*) AS records
FROM ndb.taxa AS tx
LEFT JOIN ndb.variables AS var ON var.taxonid = tx.taxonid
WHERE tx.valid = true
GROUP BY tx.taxonid)
SELECT tx.taxoncode, tx.taxagroupid,
json_agg(jsonb_build_object('id', tx.taxonid,
'code', tx.taxonname,
'count', tx.records))::varchar
FROM taxSum AS tx
GROUP BY tx.taxoncode, tx.taxagroupid
HAVING COUNT(*) > 1")
```
It is possible to have duplicate taxon codes in the database provided the taxa are within different taxon group IDs. However, there may be instances where a taxon code is repeated within the same group. The following taxon identifiers are repeated multiple times within an ecological group:
```{r taxongroupcodetable, echo=FALSE}
DT::datatable(dupTcEg,
rownames = FALSE,
options=list(dom='tip', scrollX = "100%"))
```
#### Taxon Synonymys
Although taxonomies are continually updated, Neotoma provides the ability to have users enter the original taxonomic information, and then reference particular synonomies, associated with particular publications, or attributed to specific Neotoma stewards or contacts. This relies on several interacting tables, in particular `ndb.synonyms`, and `ndb.synonomy`. `ndb.synonyms` indicates the links between taxa (in this case `validtaxonid` and `invalidtaxonid`).
Critically, there is no direct *PK*/*FK* link between these tables. Thus, it is possible for a synonymy at the dataset level to have no attribution for the synonymy. While `ndb.synonyms` also provides the opportunity to define a `synonymtype`, the `synonymy` does not, except by relating the `validtaxonid` and `invalidtaxonid` in `ndb.synonyms` to the `taxonid` and `reftaxonid` of `ndb.synonymy`.
```{r synonymyCount}
synCount <- dbGetQuery(con, "SELECT COUNT(*) FROM ndb.synonyms")
synds <- dbGetQuery(con, "SELECT COUNT(DISTINCT datasetid) FROM ndb.synonymy")
query <- "SELECT array_agg(DISTINCT syns.synonymid) synids,
array_agg(DISTINCT syn.datasetid) synds,
curtax.taxonname AS current,
reftax.taxonname AS prior,
array_agg(DISTINCT syn.publicationid) pubs,
array_agg(DISTINCT syn.contactid) contacts,
array_agg(DISTINCT sty.synonymtype)
FROM ndb.synonymy AS syn
INNER JOIN ndb.taxa AS curtax ON curtax.taxonid = syn.taxonid
INNER JOIN ndb.taxa AS reftax ON reftax.taxonid = syn.reftaxonid
FULL JOIN ndb.synonyms AS syns ON syns.validtaxonid = syn.taxonid AND syns.invalidtaxonid = syn.reftaxonid
FULL JOIN ndb.synonymtypes AS sty ON sty.synonymtypeid = syns.synonymtypeid
GROUP BY curtax.taxonname, reftax.taxonname"
getSynos <- dbGetQuery(con, query)
```
The database currently contains `r p(synds)` datasets with synonymys, and a total of `r p(synCount)` attributed synonyms. Of the synonyms with associated datastids, there are `r p(nrow(getSynos %>% filter(synids == '{NULL}')))` synonymys without links in the synonyms table. There are `r nrow(getSynos %>% filter(pubs == '{NULL}' & contacts == '{NULL}'))` synonyms where there is no attributed contactid or publication.
There are `r p(nrow(getSynos %>% filter(stringr::str_detect(pubs, ','))))` synonymys where multiple different publications are used to attribute the synonymy. There are also `r p(nrow(getSynos %>% filter(stringr::str_detect(contacts, ','))))` where multiple different individuals are identified as assigning the synonym. There are `r p(nrow(getSynos %>% filter(stringr::str_detect(pubs, 'NULL'))))` synonyms without any associated publication.
#### Duplicated Variables
We use variable IDs (PK: `ndb.variables.variableid`) to link a taxon, the element, context and units. In general, we don't expect that these should ever be duplicated, since we can use the same variable ID over and over again, for the given combination. Having said that, we do see replication, and it's not clear why.
```{r dupvars}
query <- "WITH mult AS (
SELECT DISTINCT UNNEST(array_agg(variableid)) as varid
FROM ndb.variables
GROUP BY taxonid, variableelementid, variableunitsid, variablecontextid
HAVING array_length(array_agg(variableid),1) > 1
), dmult AS (
SELECT var.*, COUNT(dt.*) FROM mult
JOIN ndb.data AS dt ON dt.variableid = mult.varid
JOIN ndb.variables AS var ON var.variableid = mult.varid
GROUP BY var.variableid
ORDER BY var.taxonid, var.variableelementid, var.variableunitsid, var.variablecontextid)
SELECT * FROM dmult;"
synds <- dbGetQuery(con, query)
DT::datatable(synds,
rownames = FALSE,
options=list(dom='tip', scrollX = "100%"))
```
In `r p(nrow(synds))` variables we see that there is duplication of the keys in the `variableids`. Interestingly it seems that this is an issue that primarily affects the mammal records:
```{r vardupissue}
query <- "WITH mult AS (
SELECT DISTINCT UNNEST(array_agg(variableid)) as varid
FROM ndb.variables
GROUP BY taxonid, variableelementid, variableunitsid, variablecontextid
HAVING array_length(array_agg(variableid),1) > 1
), dmult AS (
SELECT var.*, COUNT(dt.*) FROM mult
JOIN ndb.data AS dt ON dt.variableid = mult.varid
JOIN ndb.variables AS var ON var.variableid = mult.varid
GROUP BY var.variableid
ORDER BY var.taxonid, var.variableelementid, var.variableunitsid, var.variablecontextid)
SELECT tx.taxonid, COUNT(*), tx.taxonname FROM dmult JOIN ndb.taxa AS tx ON tx.taxonid = dmult.taxonid GROUP BY tx.taxonid;"
varthing <- dbGetQuery(con, query)
DT::datatable(varthing,
rownames = FALSE,
options=list(dom='tip', scrollX = "100%"))
```
The ground sloth `Paramylodon harlani` seems to have the biggest issues. Some possible reasons for this larger issue may be associated with the ways "specimens" are added to the database, potentially causing a conflict. This issue should possibly be flagged as a situation where we could add a composite primary key to the table.
### Sites and Datasets
Issues with sites include sites with no associated datasets, duplicated sites and, potentially, sites with missing data.
```{r dupsites, echo=FALSE}
dupTc <- dbGetQuery(con,
"WITH dscount AS (
SELECT DISTINCT siteid, jsonb_agg(DISTINCT datasetid) AS dscount
FROM ndb.dslinks
GROUP BY siteid)
SELECT json_agg(DISTINCT jsonb_build_object('siteid', st.siteid,
'sitename', st.sitename,
'db', cdb.databasename,
'datasets', dscount.dscount,
'datemodified', st.recdatemodified,
'datecreated', st.recdatecreated))::varchar,
ST_AsGeoJSON(st.geog)::varchar
FROM ndb.sites AS st
JOIN dscount ON dscount.siteid = st.siteid
JOIN ndb.dslinks AS dsl ON dsl.siteid = st.siteid
JOIN ndb.datasetdatabases AS dsdb ON dsl.datasetid = dsdb.datasetid
JOIN ndb.constituentdatabases AS cdb ON cdb.databaseid = dsdb.databaseid
GROUP BY st.geog
HAVING COUNT(DISTINCT st.siteid) > 1")
```
When we examine sites, we find that there are `r p(nrow(dupTc))` sites with *exactly* duplicated site geometries. These sites are distributed globally, and distributed across constituent databases.
```{r dtDupSites, echo=FALSE}
dupTc %>%
mutate(st_asgeojson = purrr::map(st_asgeojson, function(x) {
ifelse(is.na(x), NA, prettify(x))})) %>%
DT::datatable(options=list(dom='tip', scrollX = "100%"), rownames = FALSE)
```
#### Sites without CollectionUnits or Datasets
Some sites appear to have been submitted, but have no associated collectionunit or dataset data:
```{r missinginfo}
missingds <- dbGetQuery(con,
"SELECT st.siteid, st.sitename, array_agg(cu.collectionunitid) AS collectionunits
FROM ndb.sites AS st
LEFT OUTER JOIN ndb.collectionunits AS cu ON cu.siteid = st.siteid
LEFT OUTER JOIN ndb.datasets AS ds ON ds.collectionunitid = cu.collectionunitid
WHERE cu.collectionunitid IS NULL OR ds.datasetid IS NULL
GROUP BY st.siteid, st.sitename;") %>%
DT::datatable(options=list(dom='tip', scrollX = "100%"), rownames = FALSE)
```
This is likely the result of failed uploads during the Tilia upload process (see [Tilia upload reference here](https://github.com/NeotomaDB/tilia_api/blob/production/tiliauploadref.yaml)). To ensure that these records are properly cleaned we need to validate that there are collection units and datasets associated with the records, and, ultimately, we need to parse the records from Tilia in such a way that we are not committing individual steps within the upload process, but rather, processing the entire file at once.
#### Sites without analysis units
Some sites appear to have been submitted, with collection units but no registered analysis units associated with them:
```{r missinginfoaus}
dbGetQuery(con,
"SELECT st.siteid, st.sitename, cu.collectionunitid AS collectionunits, array_agg(ds.datasetid) AS datasets
FROM ndb.sites AS st
LEFT OUTER JOIN ndb.collectionunits AS cu ON cu.siteid = st.siteid
LEFT OUTER JOIN ndb.analysisunits AS au ON au.collectionunitid = cu.collectionunitid
LEFT OUTER JOIN ndb.datasets AS ds ON ds.collectionunitid = cu.collectionunitid
WHERE au.analysisunitid IS NULL
GROUP BY st.siteid, st.sitename, cu.collectionunitid;") %>%
DT::datatable(options=list(dom='tip', scrollX = "100%"), rownames = FALSE)
```
This is also likely the result of failed uploads during the Tilia upload process. To ensure that these records are properly cleaned we need to validate that there are collection units and datasets associated with the records, and, ultimately, we need to parse the records from Tilia in such a way that we are not committing individual steps within the upload process, but rather, processing the entire file at once.
### Stewards & Tilia Usage
```{r tiliaLogs, echo=FALSE, eval=TRUE}
logTime <- jsonlite::fromJSON('log_run_tilia.json')['results'][[1]] %>%
purrr::map(function(x) tidyr::pivot_wider(x, names_from=field)) %>%
bind_rows() %>%
mutate(date = lubridate::date(date),
calls = as.integer(calls))
ggplot(logTime) +
geom_bar(aes(x = date,
y = calls),
position = 'stack',
stat = 'identity') +
scale_fill_viridis_d() +
theme_tufte() +
scale_y_sqrt() +
labs(y = 'Total Tilia Calls by Day', x= 'Date of Tilia Access')
```
A total of `r p(sum(logTime$calls))` calls to the Tilia API were made since `r min(logTime$date)`.