---
title: "A Simple Workflow"
author: "Simon Goring, Socorro Dominguez Vidaña"
date: "`r Sys.Date()`"
output:
  html_document:
    code_folding: show
    fig_caption: yes
    keep_md: yes
    self_contained: yes
    theme: readable
    toc: yes
    toc_float: yes
    css: "text.css"
  pdf_document:
    pandoc_args: "-V geometry:vmargin=1in -V geometry:hmargin=1in"
---
## Introduction
This document is intended to act as a primer for the use of the new Neotoma R package, `neotoma2`. [The `neotoma2` package](https://github.com/NeotomaDB/neotoma2) is available from GitHub and can be installed in R with the `devtools` package:
```r
devtools::install_github('NeotomaDB/neotoma2')
library(neotoma2)
```
In this tutorial you will learn how to:
* Search for sites using site names and geographic parameters
* Filter results using temporal and spatial parameters
* Obtain sample information for the selected datasets
* Perform basic analysis including the use of climate data from rasters
### Accessing and Manipulating Data with `neotoma2`
For this workbook we use several packages, including `leaflet`, `sf`, and others. We load the packages using the `pacman` package, which will automatically install any packages that are not already installed.
```{r setup}
options(warn = -1)
pacman::p_load(neotoma2, dplyr, ggplot2, sf, geojsonsf, leaflet, raster, DT)
```
Note that R is sensitive to the order in which packages are loaded. Using `neotoma2::` tells R explicitly that you want to use the `neotoma2` package to run a particular function. So, for a function like `filter()`, which exists in other packages such as `dplyr`, you may see an error that looks like:
```bash
Error in UseMethod("filter") :
no applicable method for 'filter' applied to an object of class "sites"
```
In that case it's likely that the wrong package is trying to run `filter()`, so explicitly adding `dplyr::` or `neotoma2::` in front of the function name (i.e., `neotoma2::filter()`) is good practice.
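As a minimal sketch of that practice (the `my_sites` object here is a placeholder, and the `filter()` keys follow the filtering examples later in this document):
```{r namespaceExample, eval=FALSE}
# The same function name, resolved to two different packages:
my_sites <- neotoma2::get_sites(sitename = "%Lait%")
neotoma2::filter(my_sites, datasettype == "pollen")  # neotoma2's method for sites objects
dplyr::filter(mtcars, mpg > 20)                      # dplyr's method for data.frames
```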
### Getting Help with Neotoma
If you're planning on working with Neotoma, please join us on [Slack](https://join.slack.com/t/neotomadb/shared_invite/zt-cvsv53ep-wjGeCTkq7IhP6eUNA9NxYQ), where we manage a channel specifically for questions about the R package. You may also wish to join our Google Groups mailing list; please [contact us](mailto:[email protected]) to be added.
## Site Searches
### `get_sites()`
There are several ways to find sites in `neotoma2`, but we think of `sites` primarily as spatial objects. They have names, locations, and geopolitical contexts, but within the API and the package the site itself has no associated information about taxa, dataset types, or ages. It is simply the container into which we add that information. So, when we search for sites we can search by:
* siteid
* sitename
* location
* altitude (maximum and minimum)
* geopolitical unit
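For example, a hedged sketch of a few of these searches (we believe `siteid`, `altmin`, and `altmax` match the current `get_sites()` arguments; check `?get_sites` for your installed version):
```{r siteSearchSketch, eval=FALSE}
neotoma2::get_sites(siteid = 4180)                 # a single site by its ID
neotoma2::get_sites(sitename = "%Lait%")           # a wildcard site name search
neotoma2::get_sites(altmin = 1000, altmax = 2500)  # sites between 1000 m and 2500 m
```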
#### Site names: `sitename="%Lait%"` {.tabset}
We may know exactly what site we're looking for ("Lac Mouton"), or have an approximate guess for the site name (for example, we know it's something like "Lait Lake", or "Lac du Lait", but we're not sure how it was entered specifically).
We use the general format: `get_sites(sitename="XXXXX")` for searching by name.
PostgreSQL (and the API) uses the percent sign as a wildcard. So `"%Lait%"` would pick up ["Lac du Lait"](https://data.neotomadb.org/4180) for us (and would pick up "Lake Lait" and "This Old **Lait**y Hei-dee-ho Bog" if they existed). Note that the search query is also case insensitive, so you could simply write `"%lait%"`.
##### Code
```{r sitename, eval=FALSE}
spo_sites <- neotoma2::get_sites(sitename = "%Lait%")
plotLeaflet(spo_sites)
```
##### Result
```{r sitenamePlot, echo=FALSE}
spo_sites <- neotoma2::get_sites(sitename = "%Lait%")
plotLeaflet(spo_sites)
```
#### Location: `loc=c()` {.tabset}
The `neotoma` package used a bounding box for locations, structured as a vector of latitude and longitude values: `c(xmin, ymin, xmax, ymax)`. The `neotoma2` R package supports both this simple bounding box and more complex spatial objects, using the [`sf` package](https://r-spatial.github.io/sf/). Using the `sf` package allows us to work more easily with raster and polygon data in R, and to select sites from more complex spatial objects. The `loc` parameter works with the simple vector, [WKT](https://arthur-e.github.io/Wicket/sandbox-gmaps3.html), [geoJSON](http://geojson.io/#map=2/20.0/0.0) objects, and native `sf` objects in R. **Note however** that the `neotoma2` package is a wrapper for a simple API call using a URL ([api.neotomadb.org](https://api.neotomadb.org)), and URL strings can only be 1028 characters long, so the API cannot accept very long or complex spatial objects.
Let's look for sites using a location. Here we build three representations of the Czech Republic as a list with three elements: a geoJSON string, a WKT string, and a bounding box. We then also transform the `cz$geoJSON` element into a native `sf` object. Any of these four spatial representations works with the `neotoma2` package.
```{r boundingBox}
cz <- list(geoJSON = '{"type": "Polygon",
"coordinates": [[
[12.40, 50.14],
[14.10, 48.64],
[16.95, 48.66],
[18.91, 49.61],
[15.24, 50.99],
[12.40, 50.14]]]}',
WKT = 'POLYGON ((12.4 50.14,
14.1 48.64,
16.95 48.66,
18.91 49.61,
15.24 50.99,
12.4 50.14))',
bbox = c(12.4, 48.64, 18.91, 50.99))
cz$sf <- geojsonsf::geojson_sf(cz$geoJSON)[[1]]
cz_sites <- neotoma2::get_sites(loc = cz$geoJSON, all_data = TRUE)
```
You can always simply `plot()` the `sites` objects, but you will lose some of the geographic context. The `plotLeaflet()` function returns a `leaflet()` map, and allows you to further customize it, or add additional spatial data (like our original bounding polygon, `cz$sf`, which works directly with the R `leaflet` package):
##### Code
```{r plotL, eval=FALSE}
neotoma2::plotLeaflet(cz_sites) %>%
leaflet::addPolygons(map = .,
data = cz$sf,
color = "green")
```
##### Result
```{r plotLeaf, echo=FALSE}
neotoma2::plotLeaflet(cz_sites) %>%
leaflet::addPolygons(map = .,
data = cz$sf,
color = "green")
```
#### Site Helpers {.tabset}
*(Figure: UML diagram of the `neotoma2` object classes and their associated helper functions.)*
If we look at the [UML diagram](https://en.wikipedia.org/wiki/Unified_Modeling_Language) for the objects in the `neotoma2` R package, we can see that there is a set of functions that can operate on `sites`. As we add to `sites` objects, using `get_datasets()` or `get_downloads()`, we are able to use more of these helper functions. As it is, we can take advantage of functions like `summary()` to get a more complete sense of the types of data we have in this set of sites. The following code gives the summary table. We do some R magic here to change the way the data is displayed (turning it into a `datatable()` object), but the main piece is the `summary()` call.
##### Code
```{r summary_sites, eval=FALSE}
neotoma2::summary(cz_sites)
```
##### Result
```{r summarySitesTable, eval=TRUE, echo=FALSE}
neotoma2::summary(cz_sites) %>%
DT::datatable(data = ., rownames = FALSE,
options = list(scrollX = "100%", dom = 't'))
```
We can see that there are no chronologies associated with the `site` objects. This is because, at present, we have not pulled in the `dataset` information we need. All we know from `get_sites()` are the kinds of datasets we have.
### Searching for datasets: {.tabset}
We know that collection units and datasets are contained within sites; a `sites` object contains `collectionunits`, which contain `datasets`. From the table above we can see that some of the sites we've looked at contain pollen records. That said, we still only have `sites`; it's just that (for convenience) the `sites` API returns some information about datasets to make it easier to navigate the records.
With a `sites` object we can directly call `get_datasets()`, to pull in more metadata about the datasets. At any time we can use `datasets()` to get more information about any datasets that a `sites` object may contain. Compare the output of `datasets(cz_sites)` to the output of a similar call using the following:
#### Code
```{r datasetsFromSites, eval=FALSE}
cz_datasets <- neotoma2::get_datasets(cz_sites, all_data = TRUE)
datasets(cz_datasets)
```
#### Result
```{r datasetsFromSitesResult, echo=FALSE, message=FALSE}
cz_datasets <- neotoma2::get_datasets(cz_sites, all_data = TRUE)
datasets(cz_datasets) %>%
as.data.frame() %>%
DT::datatable(data = .,
options = list(scrollX = "100%", dom = 't'))
```
### Filter Records {.tabset}
If we choose to pull in information about only a single dataset type, or if there is additional filtering we want to do before we download the data, we can use the `filter()` function. For example, if we only want pollen records, and want records with known chronologies, we can filter:
#### Code
```{r downloads, eval=FALSE}
cz_pollen <- cz_datasets %>%
neotoma2::filter(datasettype == "pollen" & !is.na(age_range_young))
neotoma2::summary(cz_pollen)
```
#### Result
```{r downloadsCode, echo = FALSE}
cz_pollen <- cz_datasets %>%
neotoma2::filter(datasettype == "pollen" & !is.na(age_range_young))
neotoma2::summary(cz_pollen) %>% DT::datatable(data = .,
options = list(scrollX = "100%", dom = 't'))
```
We can see now that the data table looks different, and there are fewer total sites.
### Pulling in `samples()` data
Because sample data adds a lot of overhead (for the Czech pollen data, the object that includes the dataset with samples is 20 times larger than the `dataset` alone), we try to call `get_downloads()` only after we've done our preliminary filtering. After `get_datasets()` you have enough information to filter based on location, time bounds, and dataset type. When we move to `get_downloads()` we can do more fine-tuned filtering at the analysis unit or taxon level.
The following call can take some time, but we've frozen the object as an RDS data file. You can run this command on your own, and let it run for a bit, or you can just load the object in.
```{r taxa}
## This line is commented out because we've already run it for you.
## cz_dl <- cz_pollen %>% get_downloads(all_data = TRUE)
cz_dl <- readRDS('data/czDownload.RDS')
```
Once we've downloaded, we now have information for each site about all the associated collection units, the datasets, and, for each dataset, all the samples associated with the datasets. To extract all the samples we can call:
```{r allSamples}
allSamp <- samples(cz_dl)
```
When we've done this, we get a `data.frame` that is `r nrow(allSamp)` rows long and `r ncol(allSamp)` columns wide. The data are returned in a **long** format, which is why the table is so wide: each row repeats all the information you should need to properly interpret it:
```{r colNamesAllSamp, echo = FALSE}
colnames(allSamp)
```
For some dataset types or analyses, some of these columns may not be needed; for other dataset types they may be critically important. To make the `neotoma2` package as useful as possible for the community, we've included as many as we can.
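Because the table is long, standard `dplyr` grouping works on it directly. A minimal sketch (the column names follow the output above):
```{r longFormatSketch, eval=FALSE}
# Count the number of distinct samples at each site from the long table.
allSamp %>%
  dplyr::group_by(siteid, sitename) %>%
  dplyr::summarise(n_samples = dplyr::n_distinct(sampleid), .groups = "drop")
```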
#### Extracting Taxa {.tabset}
If you want to know what taxa we have in the record, you can use the helper function `taxa()` on the sites object. The `taxa()` function gives us not only the unique taxa, but also two additional columns, `sites` and `samples`, which tell us how many sites and how many samples each taxon appears in, to help us better understand how common individual taxa are.
##### Code
```{r taxa2, eval=FALSE}
neotomatx <- neotoma2::taxa(cz_dl)
```
##### Results
```{r taxaprint, echo=FALSE, message=FALSE}
neotomatx <- neotoma2::taxa(cz_dl)
neotoma2::taxa(cz_dl) %>%
DT::datatable(data = head(neotomatx, n = 20), rownames = FALSE,
options = list(scrollX = "100%", dom = 't'))
```
#### {-}
The `taxonid` values can be linked to the `taxonid` column in the output of `samples()`. This allows us to build taxon harmonization tables if we choose to. You may also note that the `taxonname` is in the field `variablename`. Individual sample counts are reported in Neotoma as [`variables`](https://open.neotomadb.org/manual/taxonomy-related-tables-1.html#Variables). A "variable" may be a species, a laboratory measurement, or a non-organic proxy such as charcoal or XRF measurements, and includes the units of measurement and the value.
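As a brief sketch of that linkage (using the `neotomatx` and `cz_dl` objects created above), a join on `taxonid` ties the taxon summary back to the individual sample rows:
```{r taxonidJoinSketch, eval=FALSE}
# Attach each taxon's total sample count to the long samples table.
neotomatx %>%
  dplyr::select(taxonid, samplecount = samples) %>%
  dplyr::inner_join(samples(cz_dl), by = "taxonid") %>%
  head()
```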
#### Simple Harmonization {.tabset}
Let's say we want all samples in which *Plantago* taxa have been reported to be grouped together into one pseudo-taxon called *Plantago*. There are several ways of doing this: either directly, by exporting the file and editing each individual cell, or by creating an external "harmonization" table (as we did in the prior `neotoma` package).
Programmatically, we can harmonize taxon by taxon using matching and transformation. We're using `dplyr`-style coding here to `mutate()` the column `variablename` so that any time we detect (`str_detect()`) a `variablename` that starts with `Plantago` (the `.*` represents a wildcard for any character [`.`], zero or more times [`*`]) we `replace()` it with the character string `"Plantago"`. Note that this changes *Plantago* entries in the `allSamp` object, but if we were to call `samples()` again, the taxonomy would return to its original form.
We're going to filter the ecological groups to include only *UPHE* (upland/heath) and *TRSH* (trees and shrubs). More information about ecological groups is available from the [Neotoma Online Manual](https://open.neotomadb.org/manual).
```{r simpleTaxonChange, eval=FALSE}
allSamp <- allSamp %>%
dplyr::filter(ecologicalgroup %in% c("UPHE", "TRSH")) %>%
mutate(variablename = replace(variablename,
stringr::str_detect(variablename, "Plantago.*"),
"Plantago"))
```
There were originally `r sum(stringr::str_detect(neotomatx$variablename, 'Plantago.*'))` different taxa identified as being within the genus *Plantago* (including *Plantago*, *Plantago major*, and *Plantago alpina-type*). The above code reduces them all to a single taxonomic group *Plantago*.
If we want to have an artifact of our choices, we can use an external table. For example, a table of pairs (what we want changed, and the name we want it replaced with) can be generated, and it can include regular expressions (if we choose):
| original | replacement |
| -------- | ----------- |
| Abies.* | Abies |
| Vaccinium.* | Ericaceae |
| Typha.* | Aquatic |
| Nymphaea | Aquatic |
| ... | ... |
We can get the list of original names directly from the `taxa()` call, applied to a `sites` object, and then export it using `write.csv()`.
##### Code
```{r countbySitesSamples, eval=FALSE}
taxaplots <- taxa(cz_dl)
# Save the taxon list to file so we can edit it subsequently.
readr::write_csv(taxaplots, "data/mytaxontable.csv")
```
##### Result
```{r PlotTaxonCounts, echo=FALSE, fig.cap="**Figure**. A plot of the number of sites a taxon appears in, against the number of samples a taxon appears in.", message=FALSE}
taxaplots <- taxa(cz_dl)
ggplot(data = taxaplots, aes(x = sites, y = samples)) +
geom_point() +
stat_smooth(method = 'glm',
method.args = list(family = 'poisson')) +
xlab("Number of Sites") +
ylab("Number of Samples") +
theme_bw()
```
#### {-}
The plot is mostly for illustration, but we can see, as a sanity check, that the relationship is as we'd expect.
You can then export either one of these tables and add a column with the replacement names; you could also add extra contextual information, such as the `ecologicalgroup` or `taxongroup`, to help you out. Once you've cleaned up the translation table, you can load it back in and apply the transformation:
```{r translationTable, message=FALSE, eval=FALSE}
translation <- readr::read_csv("data/taxontable.csv")
```
```{r translationDisplay, message=FALSE, echo = FALSE}
translation <- readr::read_csv("data/taxontable.csv")
DT::datatable(translation, rownames = FALSE,
options = list(scrollX = "100%", dom = 't'))
```
You can see we've changed some of the taxon names in the taxon table (don't look too far, I just did this as an example). To replace the names in the `samples()` output, we'll join the two tables using an `inner_join()` (meaning the `variablename` must appear in both tables for the result to be included), and then select only those elements of the samples table that are relevant to our later analysis:
```{r joinTranslation, eval = FALSE}
allSamp <- samples(cz_dl)
allSamp <- allSamp %>%
inner_join(translation, by = c("variablename" = "variablename")) %>%
dplyr::select(!c("variablename", "sites", "samples")) %>%
group_by(siteid, sitename, replacement,
sampleid, units, age,
agetype, depth, datasetid,
long, lat) %>%
summarise(value = sum(value), .groups='keep')
```
```{r harmonizationTableOut, message = FALSE, echo=FALSE}
DT::datatable(head(allSamp, n = 50), rownames = FALSE,
options = list(scrollX = "100%", dom = 't'))
```
## Simple Analytics
### Stratigraphic Plotting
We can use packages like `rioja` to do stratigraphic plotting for a single record, but first we need to do some different data management. Although we could do harmonization again, we're going to simply take the ten most common taxa at a single site and plot them in a stratigraphic diagram.
We're using the `arrange()` call to sort taxa by the number of samples in which they appear within the core. This way we can select the taxa that appear in the first ten rows of the `plottingTaxa` `data.frame`.
```{r stratiplot, message = FALSE}
# Get a particular site, select only taxa identified from pollen (and only trees/shrubs)
plottingSite <- cz_dl[[1]]
plottingTaxa <- taxa(plottingSite) %>%
filter(ecologicalgroup %in% c("TRSH")) %>%
filter(elementtype == "pollen") %>%
arrange(desc(samples)) %>%
head(n = 10)
# Clean up. Select only pollen measured using NISP.
# We repeat the filters for pollen & ecological group on the samples
shortSamples <- samples(plottingSite) %>%
filter(variablename %in% plottingTaxa$variablename) %>%
filter(ecologicalgroup %in% c("TRSH")) %>%
filter(elementtype == "pollen") %>%
filter(units == "NISP")
# Transform to proportion values.
onesite <- shortSamples %>%
group_by(age) %>%
mutate(pollencount = sum(value, na.rm = TRUE)) %>%
group_by(variablename) %>%
mutate(prop = value / pollencount) %>%
arrange(desc(age))
# Spread the data to a "wide" table, with taxa as column headings.
widetable <- onesite %>%
dplyr::select(age, variablename, prop) %>%
mutate(prop = as.numeric(prop))
counts <- tidyr::pivot_wider(widetable,
id_cols = age,
names_from = variablename,
values_from = prop,
values_fill = 0)
```
This appears to be a fairly long set of commands, but the code is pretty straightforward, and it provides you with significant control over the taxa, units and other elements of your data before you get them into the wide matrix (`depth` by `taxon`) that most statistical tools such as the `vegan` package or `rioja` use.
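For instance, a minimal sketch of handing the same wide table to `vegan` (assuming the `vegan` package is installed):
```{r veganSketch, eval=FALSE}
# Drop the age column and convert to the sample-by-taxon matrix vegan expects.
pollen_mat <- as.matrix(counts[, -1])
rownames(pollen_mat) <- counts$age
# Hellinger transformation, a common pre-treatment for proportion data.
pollen_hel <- vegan::decostand(pollen_mat, method = "hellinger")
```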
To plot we can use `rioja`'s `strat.plot()`, sorting the taxa using weighted averaging scores (`wa.order`). I've also added a CONISS plot to the edge of the plot, to show how the new *wide* data frame works with distance metric functions.
```{r plotStrigraph, message=FALSE, warning=FALSE}
clust <- rioja::chclust(dist(sqrt(counts)),
method = "coniss")
plot <- rioja::strat.plot(counts[,-1] * 100, yvar = counts$age,
title = cz_dl[[1]]$sitename,
ylabel = "Calibrated Years BP",
xlabel = "Pollen (%)",
y.rev = TRUE,
clust = clust,
wa.order = "topleft", scale.percent = TRUE)
rioja::addClustZone(plot, clust, 4, col = "red")
```
### Change in Time Across Sites
We now have site information across the Czech Republic, with samples and taxon names. I'm interested in looking at the distributions of taxa across time, based on their presence/absence. I'm going to pick the top 20 taxa (based on the number of sites in which they appear) and look at their distributions in time:
```{r summarizeByTime, message = FALSE}
plottingTaxa <- taxa(plottingSite) %>%
filter(ecologicalgroup %in% c("TRSH")) %>%
filter(elementtype == "pollen") %>%
arrange(desc(sites)) %>%
head(n = 20)
taxabyage <- samples(cz_dl) %>%
filter(variablename %in% plottingTaxa$variablename) %>%
group_by(variablename, "age" = round(age * 2, -3) / 2) %>%
summarise(n = length(unique(siteid)), .groups = 'keep')
samplesbyage <- samples(cz_dl) %>%
filter(variablename %in% plottingTaxa$variablename) %>%
group_by("age" = round(age * 2, -3) / 2) %>%
summarise(samples = length(unique(siteid)), .groups = 'keep')
groupbyage <- taxabyage %>%
inner_join(samplesbyage, by = "age") %>%
mutate(proportion = n / samples)
ggplot(groupbyage, aes(x = age, y = proportion)) +
geom_point() +
geom_smooth(method = 'gam',
method.args = list(family = 'binomial')) +
facet_wrap(~variablename) +
coord_cartesian(xlim = c(20000, 0), ylim = c(0, 1)) +
scale_x_reverse(breaks = c(10000, 20000)) +
xlab("Proportion of Sites with Taxon") +
theme_bw()
```
We can see clear patterns of change, and the smooths are modeled using Generalized Additive Models (GAMs) in R, so we can have more or less control over the actual modeling using the `gam` or `mgcv` packages. Depending on how we divide the data we can also look at shifts in altitude, latitude or longitude to better understand how species distributions and abundances changed over time in this region.
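As a hedged sketch of that control (assuming the `mgcv` package is installed), the same binomial smooth can be fit directly for a single taxon, which exposes the full model object:
```{r gamSketch, eval=FALSE}
library(mgcv)
one_taxon <- groupbyage %>%
  dplyr::filter(variablename == plottingTaxa$variablename[1])
# Weighting by the number of sampled sites treats `proportion` as a
# binomial proportion, mirroring the geom_smooth() call above.
mod <- gam(proportion ~ s(age), weights = samples,
           family = binomial, data = one_taxon)
summary(mod)
```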
### Distributions in Climate (July max temperature) from Rasters
We are often interested in the interaction between taxa and climate, assuming that time is a proxy for changing environments. The development of large-scale global datasets for climate has made it relatively straightforward to access data from the cloud in raster format. R provides a number of tools (in the `sf` and `raster` packages) for managing spatial data, and providing support for spatial analysis of data.
The first step is taking our sample data and turning it into a spatial object using the `sf` package in R:
```{r makeSamplesSpatial}
modern <- samples(cz_dl) %>%
filter(age < 50) %>%
filter(ecologicalgroup == "TRSH" & elementtype == "pollen" & units == "NISP")
spatial <- sf::st_as_sf(modern,
coords = c("long", "lat"),
crs = "+proj=longlat +datum=WGS84")
```
The data are effectively the same: `sf` creates an object, `spatial`, that is a `data.frame` with all the information from `samples()`, plus a `geometry` column that contains the spatial data.
We can use the [`getData()` function](https://www.rdocumentation.org/packages/raster/versions/3.5-15/topics/getData) in the `raster` package to get climate data from WorldClim. The operations that follow here can be applied to any sort of raster data, provided it is loaded into R as a `raster` object.
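As an aside, any local raster file works the same way; a minimal sketch (`my_climate.tif` is a hypothetical file name):
```{r localRasterSketch, eval=FALSE}
# Load a single-layer raster from disk and extract values at the site points.
localclim <- raster::raster("my_climate.tif")
spatial$localval <- raster::extract(localclim, spatial)
```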
Here we pull in the raster data, at a 10 minute resolution for the $T_{max}$ variable, maximum monthly temperature. The raster itself has 12 layers, one for each month. With the `extract()` function we just get information for the seventh month, July.
```{r worldTmax}
worldTmax <- raster::getData('worldclim', var = 'tmax', res = 10)
spatial$tmax7 <- raster::extract(worldTmax, spatial)[,7]
```
This adds a column to the `data.frame` `spatial` that contains the maximum July temperature for each taxon at each site (all taxa at a site will share the same value). We've already filtered to TRSH taxa, but that still leaves us with `r length(unique(spatial$variablename))` distinct taxon names. We're going to use `dplyr`'s `mutate()` function to extract just the genus:
```{r toGenus}
spatial <- spatial %>%
mutate(variablename = stringr::str_replace(variablename, "[[:punct:]]", " ")) %>%
mutate(variablename = stringr::word(variablename, 1)) %>%
group_by(variablename, siteid) %>%
summarise(tmax7 = max(tmax7), .groups = "keep") %>%
group_by(variablename) %>%
filter(n() > 3)
```
#### Setting the Background
We want to get the background distribution of July temperatures across the Czech sites, to plot our taxon distributions against. We do this by taking the maximum temperature at each site; because we used a spatial overlay, all values at a site are the same, so the maximum is simply the July temperature at that site.
```{r topten}
maxsamp <- spatial %>%
dplyr::group_by(siteid) %>%
dplyr::summarise(tmax7 = max(tmax7), .groups = 'keep')
```
Now we're going to plot it out, using `facet_wrap()` to plot each taxon in its own panel:
```{r ggplot}
ggplot() +
geom_density(data = spatial,
aes(x = round(tmax7 / 10, 0)), col = 2) +
facet_wrap(~variablename) +
geom_density(data = maxsamp, aes(x = tmax7 / 10)) +
xlab("Maximum July Temperature") +
ylab("Kernel Density")
```
## Conclusion
So, we've done a lot in this example. We've (1) searched for sites using site names and geographic parameters, (2) filtered results using temporal and spatial parameters, (3) obtained sample information for the selected datasets and (4) performed basic analysis including the use of climate data from rasters. Hopefully you can use these examples as templates for your own future work, or as a building block for something new and cool!