-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathDOI_assignment.Rmd
482 lines (329 loc) · 31.6 KB
/
DOI_assignment.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
---
title: "DOI workflow"
author: "Simon Goring"
date: "May 22, 2016"
output:
html_document:
code_folding: show
highlight: pygment
keep_md: yes
number_sections: no
theme: journal
toc: no
toc_depth: 3
pdf_document:
keep_md: yes
toc: yes
toc_depth: '3'
word_document:
keep_md: yes
toc: no
toc_depth: '3'
reference_docx: styles/word_template.docx
csl: styles/elsevier-harvard.csl
bibliography: styles/doi_publications.bib
---
```{r, echo=FALSE, message = FALSE, warnings=FALSE, tidy=TRUE}
doi_sens <- readr::read_lines("doi_sens.txt")
library(dplyr, quietly = TRUE, verbose = FALSE)
library(RPostgreSQL, quietly = TRUE, verbose = FALSE)
library(httr, quietly = TRUE, verbose = FALSE)
library(XML, quietly = TRUE, verbose = FALSE)
source('builder/R/sql_calls.R')
schema <- xmlSchemaParse('data/metadata.xsd')
#sens <- read.table('doi_sens.txt', stringsAsFactors = FALSE)
con_string <- jsonlite::fromJSON("connect_remote.json")
con <- dbConnect(PostgreSQL(),
host = con_string$host,
port = con_string$port,
user = con_string$user,
password = con_string$password,
dbname = con_string$database)
```
**120 Character Summary**
*Using \@neotomadb as a case study for improving data discoverability through DOIs.*
\pagebreak
#Title
Implementing the Digital Object Identifier system within the Neotoma Paleoecological Database
#Abstract
The Neotoma Paleoecological Database contains over 12000 unique datasets, representing the combined work of thousands of researchers. In an effort to improve recognition and discoverability of the intellectual contributions of contributing researchers, Neotoma implemented Digital Object Identifier names for records within the database. Using the DataCite DOI system and the EZID API, this paper details the methods used and simple case studies to indicate the possibilities now available to both contributing researchers and to individuals interested in exploring the connections made possible through the implementation of DOI names within Neotoma.
\pagebreak
**Key Words**: metadata, discoverability, dois, data publication, digital curation, neotoma, paleoecology, databases
#Introduction
Digital object identifier (DOI) names for data provide a critical resource that ties the intellectual contribution of an individual or team of individuals to derived products beyond the life cycle of the publication. Discoverability of data relies on clear, standardized and fully described metadata, which allow for greater discoverability through progressively more powerful semantic search technologies [@leinfelder2011using]. Greater data discoverability ensures that the rewards for parcticing science in an open and transparent manner are fully realized [@reichman2011challenges;@piwowar2007sharing].
DOI names have been minted for data since 2004 [@brase2015tenth]. While the minting of DOI names are strongly associated with publication, in the case of data, the publication model may be inappropriate [@parsons2013data], and as such a new model was neccessary. The Neotoma Paleoecological Database approximates the publication metaphor by providing two value-added services following data contribution. Neotoma maintains a strong central data structure, defined through collaboration with end-user groups, providing data that adheres to community standards, facilitating data re-use and sustainability [@gallagher2015facilitating].
A key component of long term data-sustainability is the use of domain specific data repositories [@goodman2014ten]. The Neotoma Paleoecological Database is a key domain repository within the biogeosciences, focusing on botanical and faunal fossil data across the Neogene. The core of the Neotoma Paleoecological Database rests on two fundamental data contributions. The first, COHMAP [@anderson1988climatic], was the result of an international, interdisciplinary effort that resulted in the contribution of 241 datasets across 207 publications from 333 allied researchers. This research contribution led to one of the first model-data intercomparison projects, with a lasting legacy for paleoclimate and climate modeling [@wright1993reflections]. The second large data contribution was the result of the FAUNMAP project, centered on assembling fossil records from the United States spanning the Pleistocene. FAUNMAP was the result of contributions representing 3202 researchers, across 2005 unique publications, resulting in fundamental contributions to our understanding of species responses to climate change over long time scales [@graham1996spatial]. In both cases the call for open and unrestricted data sharing was less stringent than it is currently, with fewer enforcement or regulatory elements, but the success of these efforts reflects the enormous value of open data.
Currently, Neotoma contains `r length(neotoma::get_dataset())` datasets (as of `r Sys.Date()`). There are `r nrow(neotoma::get_table("DatasetTypes"))` unique dataset types, and these data products represent the cummulative work of `r nrow(neotoma::get_table("Contacts"))` individuals or organizations. Access to Neotoma is obtained primarily through three platforms. The database itself is a PostgreSQL database that is available as snapshots from the Neotoma website -- [http://neotomadb.org]() -- while data search capacity is supported through a map-based web engine, the Neotoma Explorer - [http://apps.neotomadb.org/explorer](), or through the API and associated R package `neotoma` [@goring2015neotoma]. The datasbase itself is described elsewhere [@neotomamanual].
Here we describe the efforts to facilitate the minting and provisioning of DOI names for the Neotoma Database through DataCite, the steps we have taken to ensure increased discoverability of data, the steps to support ongoing versioning of datasets, and, ultimately, the possibilities that DOIs provide both to Neotoma, to facilitate ongoing database development and evolution, and also to individuals who provide data to Neotoma as part of their data management planning.
#Background
In 2016, Neotoma, through the University of Wisconsin Library System, obtained authorization to mint DOI names for data contained within the database. The DOI system originated through a collaboration among three large publishing associations in 1997 [@doihandbook], and, although it was primarily focused on minting unique identifiers for published materials, the vision of DOI names for data came soon afterwards [@paskin2005digital]. While DOI names do not resolve to the data categorically, they provide metadata (supplied in the process of minting a DOI) and a resolution service that facilitates the long term discoverability and sustainability of data generated by researchers. A review of available object identifiers found that DOIs provide strong support for DOIs, although a clear choice for identifiers remained purpose-specific [@duerr2011utility].
In the case of Neotoma the use of DOI names for individual datasets facilitates serveral key objectives:
1. DOI names allow Neotoma to act as a repository for journals following the Joint Data Archiving Policy (JDAP, [http://datadryad.org/pages/jdap]()), where proof of data archiving often requires the provisioning of a DOI name. This provides additional incentives to researchers contributing data to Neotoma as part of a sustainable Data Management plan [*e.g.*, @michener2015ten;@hsu2015data].
2. DOI names connect Neotoma data to a broader set of services, in this case facilitated through the DataCite metadata search engine ([https://search.datacite.org/]()). Discoverability is facilitated through the use of additional DOI system metadata, and can be linked to unique researchers through services such as ORCID. This provides added value to researchers contributing data to Neotoma as it allows a full and open accounting of research contributions, particularly valuable for early career researchers or researchers involved in interdisciplinary research [@goring2014improving].
3. A DOI provides a unique and persistent identifer to datasets and their associated metadata that can be resolved from any browser connected to the internet, whether or not the user has knowledge of Neotoma and its specific database location, schema or API.
# Use Case
This document lays out the outline for establishing DOIs for the database using the DataCite API and the DataCite metadata schema [REF] as the basis of the process. We have consulted resources describing best practices in scholarly data citation and data provision [@starr2015achieving; @hourcle2012linking], and have made a significant effort to provide the most accurate and complete metadata record for Neotoma data possible.
Given the database structure [@neotomamanual; http://neotoma-manual.readthedocs.io/en/latest/], there is no single database table that maps directly to the DataCite schema. To help provide best-practices support for data archiving, Neotoma provides a "frozen" data object, which reflects the complete data object at the time of submission, stored as a JSON entry in the Neotoma database. The frozen data object is required since Neotoma is a "live" database. Relationships between the dataset object and elements such as taxa, chronologies and publications may change over time as taxonomic standards change, or new chronologies are added for records. The frozen dataset is the atomic unit for DOI assignment and will be refered to as a "dataset" in this document, although a second Neotoma API service exists that refers to "datasets" [@goring2015neotoma]. The API "dataset" service excludes count or sample data for a site, and as such doesn't serve DOI assignment, since it does not provide the complete set for required metadata.
Neotoma, in establishing a DOI service, had two main use cases. (1) Pre-existing records, and records that will provided to Neotoma as a set of datasets (either the inclusion of a new Constituent Database, or as the result of a data synthesis project) will require bulk assignment records. (2) New records are uploaded individually through Tilia ([http://tiliait.com]()) or a data upload web service (in development). For each of these two use cases the process is generally the same for the actual call to DataCite.
In each case the process for minting DOIs and providing suitable, searchable metadata is as follows:
* Discover data for which DOIs have been minted (DataCite API search)
* Freeze data uploaded to Neotoma that has not been "frozen"
*
* The script generates an XML document that conforms to the DataCite 3.1 XML Schema
* On success, the script calls out to DataCite/EZID & generates a DOI and assigns metadata
* On success, a landing page is created programmatically
* The DOI is attached to the `Datasets` table
* An email is sent to the Dataset PI with the dataset DOI & metadata, and to the relevant Data Steward
This system is implemented in R [@Rcitation] for the purposes of educational outreach. For replicability we also chose to use `Rmarkdown` [@rmarkdown], thus providing an open and reproducible research product, following the concepts of the "geoscience paper of the future" [@gil2016towards]. R is widely used within the academic research community, with a long history in the geosciences [@grunsky2002r]. Additionally, R provides functions for working with SQL [@RMySQL;@rodbc2016], XML [@rxml;@xml22016] and curl [@httr].
Initialization for the processing of records is fairly straightforward. The script for the SQL commands is available in the supplemental material (Supplement One, or as a [gist](https://gist.github.com/SimonGoring/defa8bcfd2216f80913afebfa29a901a) for GitHub).
# Implementation
##The Landing Page
![site_page_output.png]
<div style="float: right; width: 300px; margin: 0 0 15px 20px; padding: 15px; border: 1px solid black; text-align: left;">
<img src="site_page_output.png">
<br><b>Figure 1</b>. A sample landing page for a Neotoma record.
</div>
The landing page for each record provides an important resource. The landing pages were written as RMarkdown documents that could be run serially or individually as part of the workflow for assigning DOIs to records. Hourclé &al [@hourcle2012linking] indicate that best practice is to resolve documents to both HTML (human-readable) and machine readable formats through content negotiation systems. DataCite resolves `text/html` content to the unique identifier provided by the data service (Neotoma), wheras other machine readable formats are resolved to DataCite's own metadata service [http://crosscite.org/cn/]() which has a cached version of the XML document used in producing the DOI.
The landing page provides an endpoint for citations, with relevant metadata. It provides an aesthetically clean display with an interactive map element supported through `leaflet` [@cheng2016leaflet]. The DOI resolves directly to the landing page, and the landing page will be updated (along with versioning) if the underlying content changes. In addition, the landing page includes machine readable data, embedded as single-block JSON-LD markup using a Resource Description Framework model, for the record. This provides greater functionality for the entire DOI system implemented within Neotoma.
One limitation for the current procedure is that extended descriptors or laboratory analytic methods for the datasets are unavailable through Neotoma within the current database data model. While individual authors may provide methods, this information is often not stored in Neotoma itself, although certain aspects of the methods may be. This limitation of the current database structure means that methodological information may be lacking. However, content delivered through the Tilia XML is more complete and may provide greater information.
##Building the XML Document:
As mentioned in the Introduction, there is no single object in Neotoma that represents all the metadata associated with a dataset. The dataset, in the sense of a citable unit, is a synthetic construct pulled from `Site`, `Dataset`, `CollectionUnit` and `Sample` information, although a `dataset` table does exist in Neotoma, that helps link these components together [@neotomamanual]. To fill in the metadata fields we need to generate several SQL queries to the database. Here we will use a dataset ID of `1001`, a record for Hail Lake, YT, Canada [@cwynar1995paleovegetation].
To first generate the XML document, we use namespace definitions and the schema provided by DataCite as v3.1 ([https://schema.datacite.org/meta/kernel-3.1/]()).
```{r, results='hide', tidy = TRUE, eval=TRUE}
# Building the XML document:
ds_id <- 1001
# Generating the new XML framework and associated namespaces:
doc <- XML::newXMLDoc()
root <- XML::newXMLNode("resource",
namespaceDefinitions = c("http://datacite.org/schema/kernel-4",
"xsi" = "http://www.w3.org/2001/XMLSchema-instance",
"xml" = "http://www.w3.org/XML/1998/namespace"),
attrs = c("xsi:schemaLocation" = "http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4/metadata.xsd"),
doc = doc)
XML::newXMLNode("version", "1.0", parent = root)
```
##Assigning Metadata Tags
We structure this paper in the same order as the [DataCite Schema Documentation](https://schema.datacite.org/meta/kernel-3/doc/DataCite-MetadataKernel_v3.1.pdf). SQL queries to Neotoma have been wrapped within R functions that use dataset number (`1001` here) as the key parameter (Supplement One). Within Neotoma there is a process that stores submitted datasets as "frozen" objects in JSON format upon submission. This ensure that, even though Neotoma is a living database where elements like taxonomic names may change, the database also maintains a record of the dataset as it was in its original format. This record can be obtained using the Neotoma API.
##Identifier
The `identifier` links directly to the dataset's Landing Page. The generation of the landing page is covered elsewhere.
```{r, results = 'hide', message = FALSE, warning = FALSE, eval=TRUE}
# This is the empty shoulder for assigning DOIs:
XML::newXMLNode("identifier",
attrs = list("identifierType" = "DOI"),
parent = root)
```
###Creators:
While affiliated researchers occur futher in the list, we use the `datasetpi` field from the Neotoma Database to assign the Creator field. The Neotoma API provides an endpoint to obtain all dataset authors for a given dataset. While the majority of datasets have only a single assigned PI, there are `r sum(table(dbGetQuery(conn = con, "SELECT datasetid FROM ndb.datasetpis"))>1)` datasets (of `r dbGetQuery(conn = con, "SELECT COUNT(DISTINCT datasetid) FROM ndb.datasetpis"))` total datasets) with multiple creators. The Dataset PI is the individual (or individuals) who are cited in the data citation.
```{r, results = 'hide', message = FALSE, warning = FALSE, eval=runner}
XML::newXMLNode("creators", parent = root)
lapply(contact,
function(x) {
if(length(x) > 0) {
XML::addChildren(root[["creators"]],
XML::newXMLNode("creator",
.children = list(XML::newXMLNode("creatorName",
x$fullName),
XML::newXMLNode("affiliation",
gsub(pattern = "\r\n", ", ", x$address)))))
}
})
```
###Titles
The title is a synthetic object, produced by combining the site name and the data type. In this case, we generate the name `r default$SiteName[1]`.
```{r, results = 'hide', message = FALSE, warning = FALSE, eval=runner}
title <- paste0(frozen$frozendata$data$dataset$site$sitename, " ",
frozen$frozendata$data$dataset$dataset$datasettype,
" dataset")
XML::newXMLNode("titles", parent = root)
XML::newXMLNode("title",
title,
attrs = c("xml:lang" = "en-us"),
parent = root[["titles"]])
```
###Publisher & Publication Year
Neotoma consists of two levels of publication. The model for data publication in Neotoma is published volumes within a book, where each constituent database is a "chapter" with a unique editor, published as part of the Neotoma Paleoecological Database. Because Neotoma acts as a portal for many unique databases it is important to recognize both the primacy of the original database, but also the effort taken by the managers of those databases in assembling data and building community around those data products. However, for `Publication`, we consider the creation of the DOI the moment of publication and Neotoma, as the licensing agent, the publisher. There is an opportunity to describe dates with more granularity in the metadata schema below.
```{r, results = 'hide', message = FALSE, warning = FALSE, eval=runner}
newXMLNode("publisher", "Neotoma Paleoecological Database", parent = root)
# Number 5:
newXMLNode("publicationYear", format(Sys.Date(), "%Y"), parent = root)
```
###Subject
The Subject and Subject Scheme information is critical for metadata discovery, but while DataCite requests the use of existing subject schemes, some research in the library sciences cautions against the use of formal subject schemes, given that modern web searching has not formally adopted the structured schemes used in library classification schemes.
<-- I've created a document [here](https://docs.google.com/document/d/1wcR6WWAV3COD_MeOgaOEqtpJ3bbe1nxF9omnmmlEA5Q/edit?usp=sharing) that discusses it in more detail.. For now I will assign everything "Paleoecology". -->
```{r, results = 'hide', message = FALSE, warning = FALSE, eval=runner}
newXMLNode("subjects", newXMLNode("subject",
"Paleoecology",
attrs = c("subjectScheme" = "Library of Congress",
"schemeURI" = "http://id.loc.gov/authorities/subjects")),
parent = root)
```
###Contributors
The contributor information comes from the `Contact` tables for the record. There is a query (in `sql_calls.R`) that pulls in contributor information. Right now we are not tagging the name identifiers (e.g., ORCID). Historically Neotoma has not had the ability to add or store identifier systems such as ORCID ([http://orcid.org]()), however this functionality has recently been introduced, although more work is required to fully implement the system.
Contributors to the dataset include individuals beyond the dataset PI (above). When a dataset is entered into Neotoma there may be a number of individuals who interact with the data, including data entry technicians, individuals who may update the age model, individuals who assisted in field collection, etc. Neotoma strongly believes in transparency, and credit for data contribution, however, we believe that this is balanced by clear attribution for data generation. As such, the *Contributors* fields in the metadata lists all individuals associated with the dataset, but the data citation lists only the dataset PI.
```{r results='as-is', echo=FALSE, eval=runner}
contacts <- sqlQuery(con, query = contributor_call(ds_id))
contacts$affiliation <- gsub('\r\n', ', ', contacts$affiliation)
knitr::kable(contacts)
```
```{r, results='hide', eval=runner}
# The contributors come from the DB call. This duplicates
newXMLNode("contributors", parent = root)
lapply(1:nrow(contacts),
function(x) {
newXMLNode("contributor",
attrs = c("contributorType" = contacts$contributorType[x]),
.children = list(newXMLNode("contributorName",
contacts$creatorName[x])), parent = root[["contributors"]])
})
```
###Dates
Dates in Neotoma include datas of original accession to the North American Pollen Database (or other constituent database), accession into Neotoma, and subsequent revision dates.
```{r results='as-is', echo = FALSE, eval=TRUE}
dates <- sqlQuery(con, query = date_call(ds_id))
knitr::kable(dates)
```
```{r, results='hide'}
# Adding the dates in one at a time, we use the lapply to insert them
# into the `dates` node.
newXMLNode("dates", parent = root)
lapply(1:nrow(dates),
function(x) {
newXMLNode("date",
format(as.Date(dates[1,1]), "%Y-%m-%d"),
attrs = c("dateType" = dates[x,2]),
parent = root[["dates"]])
} )
```
###Language and Resource Type
All resources in the database are assigned the default language "English". Neotoma is likely to change this choice as data holdings continue to grow from Europe, Asia, South America and Africa, however, the database data model has no field or table for describing resource language at present.
All data are assigned the default resource type `"Dataset/Paleoecological Sample Data"`.
```{r, echo = TRUE, results='hide', eval=runner}
newXMLNode("language", "English", parent = root)
newXMLNode("resourceType", "Dataset/Paleoecological Sample Data",
attrs = c("resourceTypeGeneral" = "Dataset"), parent = root)
```
###Related Identifiers
`RelatedIdentifiers` are key metadata for the data records. These identifiers will point to the URL of the publications relating to the dataset using `isCitedBy` (as opposed to `IsReferencedBy`), they will point to the DOI for Neotoma using `IsCompiledBy` (Neotoma has been assigned the DOI [http://doi.org/10.17616/R3PD38]() by Re3Data [@pampel2013making]). The Related Identifiers are critical pieces of cyberinfrastructure, explicitly linking the data to other resources through the DOI infrastructure, allowing other resources to build connected networks of data, and to place Neotoma into those networks.
```{r, results = 'hide', message = FALSE, warning = FALSE, eval=runner}
newXMLNode("relatedIdentifiers", parent = root)
newXMLNode("relatedIdentifier", paste0("api.neotomadb.org/v1/downloads/", ds_id),
attrs = list(relationType = "IsVariantFormOf",
relatedIdentifierType = "URL",
relatedMetadataScheme = "json"),
parent = root[["relatedIdentifiers"]])
newXMLNode("relatedIdentifier", "http://doi.org/10.17616/R3PD38",
attrs = list(relationType = "IsCompiledBy",
relatedIdentifierType = "DOI"),
parent = root[["relatedIdentifiers"]])
# Publication DOI tags (if available)
dois <- sqlQuery(con, query = doi_call(ds_id))
if (is.na(dois)) {
# There's no current DOI
} else {
lapply(unlist(dois), function(x){
newXMLNode("relatedIdentifier", paste0("doi:", x),
attrs = list(relationType = "IsDocumentedBy",
relatedIdentifierType = "DOI"),
parent = root[["relatedIdentifiers"]])
})
}
```
###Size
Data object size can be reported for each component of the data source, JSON, XML, TLX or others. This term is straightforward and needs little explanation.
```{r, results='hide', eval=runner}
# Number 13: size
size <- as.numeric(object.size(GET(paste0("api.neotomadb.org/v1/downloads/", ds_id))))
newXMLNode("sizes", newXMLNode("size", paste0('JSON: ', ceiling(size/1000), " KB")), parent = root)
```
###Format
The data, as indicated above, is available through the landing page in only two formats, JSON or TLX. Future formats such JSON-LD, as in the Linked Paleo Data standards [@mckay2016lipd], will be supported. At such a point, metadata for these records will be updated.
```{r, results = 'hide', message = FALSE, warning = FALSE, eval=runner}
# Number 14:
newXMLNode("formats", parent = root)
newXMLNode("format", "XML", parent = root[["formats"]])
newXMLNode("format", "TLX", parent = root[["formats"]])
newXMLNode("format", "JSON", parent = root[["formats"]])
```
###Version
The data Version is defined for all datasets as v1.0, from the date of initial generation of the DOI. Following this, we have developed a script that is triggered when a dataset is modified, indicating the modification date and an incremented version number. One reccomendation for data that may change over time, is to cite an authoratative dataset (with DOI), but to include access date and time to the citation. Netoma will recommend this practice going forward, but this is not currently implemented within the database.
###Rights List
The Neotoma Database provides a User Agreement for the database and constituent datasets. Ultimately, the goal of the database is to provide open exchange of data to researchers and the public. In 2015 the a decision was made to license Neotoma data under a Creative Commons BY4 License -- [http://creativecommons.org/licenses/by/4.0/deed.en_US](). Under this license there is a requirement for appropriate credit, in a reasonable manner, however the license provides broad protection to the licensee for modification and downstream use of the dataset.
```{r, results='hide', eval=runner}
# Number 16
addChildren(newXMLNode("rightsList", parent = root),
children = newXMLNode("rights", "CC-BY4",
attrs = c("rightsURI" = "http://creativecommons.org/licenses/by/4.0/deed.en_US")))
```
###Description
`Description` and `DescriptionType` are described as critical resources in the DataCite metadata schema. Currently the Neotoma Database lacks the ability to link the publication abstract into the metadata schema. As such we generate a synthetic dataset description that includes the site name and dataset type.
```{r, results = 'hide', message = FALSE, warning = FALSE, eval=runner}
newXMLNode("Description",
paste0("Raw data for the ",
default$SiteName,
" obtained from the Neotoma Paleoecological Database."),
parent = root)
newXMLNode("descriptionType",
"Abstract", parent = root[["Description"]])
```
###GeoLocation
The `geoLocation` is defined as the four element bounding box. In cases where the bounding box represents a point, all latitude and longitude values will be, respectively, equivalent.
```{r, results = 'hide', message = FALSE, warning = FALSE, eval=runner}
loc <- sqlQuery(con, query = geoloc_call(1001))
newXMLNode("geoLocations", parent = root)
newXMLNode("geoLocation",
newXMLNode("geoLocationBox", loc, parent = root),
parent = root[["geoLocations"]])
```
##Validation and Upload
```{r, message = FALSE, warning = FALSE, eval=runner}
xmlSchemaValidate('data/metadata.xsd', doc)
```
The schema can be validated against the DataCite metadata schema, and, if it validates properly, the XML document can then be sent through the EZID API to register the DOI.
```{r, results = 'hide', message = FALSE, warning = FALSE, eval=runner}
urlbase = 'https://ezid.cdlib.org/'
# We need to clean up the XML formatting, removing hard returns and changing quotes:
parse_doc <- gsub('\\n', '', paste0('datacite: ',
saveXML(xmlParse(paste0('data/datasets/',
ds_id, '_output.xml')))))
parse_doc <- gsub('\\"', "'", parse_doc)
r = POST(url = paste0(urlbase, 'shoulder/doi:10.5072/FK2'),
authenticate(user = 'apitest', password = 'apitest'),
add_headers(c('Content-Type' = 'text/plain; charset=UTF-8',
'Accept' = 'text/plain')),
body = parse_doc,
format = 'xml')
out_doi <- substr(content(r),
regexpr("doi:", content(r)),
regexpr("\\s\\|", content(r)) - 1)
```
```{r results = 'hide', message = FALSE, warning = FALSE, echo = FALSE}
if (!runner) {
out_doi <- 1001
}
```
##Updating Records
In cases where a data set is updated, Neotoma still directly modifies the record. This will (or may) change. Now that we can build the XML okay we should be able to modify the DOI metadata using:
```{r, results = 'hide', message = FALSE, warning = FALSE, eval=runner}
xmlValue(root[["identifier"]]) <- out_doi
saveXML(doc = doc,
file = paste0('data/datasets/', ds_id, '_output.xml'),
prefix = '<?xml version="1.0" encoding="UTF-8"?>')
```
```
r = POST(url = paste0(urlbase, paste0('identifier/', out_doi)),
authenticate(user = XXXXXXX, password = XXXXXXXX),
add_headers(c('Content-Type' = 'text/plain; charset=UTF-8',
'Accept' = 'text/plain')),
body = paste0('datacite: ',paste(readLines(paste0('data/datasets/',
ds_id, '_output.xml')), collapse = " ")),
format = 'xml')
content(r)
```
#Conclusion
This paper showcases a fully realized implementation of the process required to mint a DOI for datasets within the Neotoma Database. The R function that explicitly calls the RMardown renderer and generates the landing pages, is then run as a `chron` job on the Neotoma Servers, searching for both new records and for records that have had recent changes to the key metadata parameters.
Neotoma lies at the center of a number of scientific disciplines, examplifying the concept of the paleoecoinformatics [@brewer2012paleoecoinformatics]. Along with efforts to link Neotoma contact information to ORCID identifiers, publications to DOIs and datasets to a larger number of resources including the Global Biodiversity Information Facility (GBIF: [http://gbif.org]()), the Paleobiology Database (through the EarthLife Consortium - [http://earthlifeconsortium.org]()), Neotoma is placing itself within a cyberinfrastructure network that has the capacity to improve access, reproducibility and discoverability in a way that was previously unavailable for most researchers.
# Acknowledgements
The authors would like to acknowledge the support of the University of Wisconsin Library System, particularly Brianna Marshall, Chair of Research Data Services. This work was funded through NSF grants 1541002 (SJG, JWW), 1550805 (ECG), and 1550755 (RWG and MA).
# Competing Interests
The authors declare no competing interests.
# References