# Workflow {-}
This page describes the set of tools we use to generate these data and present them in this site.
## GitHub {-}
We use GitHub Pages to serve the website files.
At present, we use a repository associated with the Databrary organization: <https://github.com/databrary/analytics>.
As a result, the analytics site is served at <https://databrary.github.io/analytics/>.
The site is built locally by Rick Gilmore or Andrea Seisler, then pushed to GitHub.
## RStudio {-}
We use [RStudio](https://posit.co/products/open-source/rstudio/) as the integrated development environment for the site.
Most of the code is in Quarto and R, with some CSS and JavaScript.
We use a number of R packages in the workflow.
### Databraryr {-}
The [`databraryr` package](https://github.com/NYU-Databrary/databraryr) provides a set of tools for interacting with the Databrary API and gathering data from the site.
This package may be useful to some analysts whether or not they care about Databrary-specific analytics.
Most of the data and metadata used in these reports are publicly accessible, but data about individual participants require that the user be authorized and logged in to the site via the package's login function, `databraryr::login_db()`.
The package may be installed with `devtools::install_github("NYU-Databrary/databraryr")`.
See <https://databrary.github.io/databraryr/> for documentation about the package.
::: {.callout-warning}
The package is under active development. The documentation may not be up-to-date.
Note that users wishing to script access to Databrary 2.0 must apply for permission to do so.
See <https://databrary.github.io/guide/more-information/api-access.html> for details.
:::
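As a sketch, installing the package and starting an authorized session might look like the following. The `login_db()` function name is assumed from the current package documentation; check the documentation linked above if the API has changed.

```r
# Install databraryr from GitHub (requires the devtools package)
# install.packages("devtools")
devtools::install_github("NYU-Databrary/databraryr")

# Start an authorized session; prompts for Databrary credentials.
# (Function name assumed from the current package documentation.)
databraryr::login_db()
```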
<!-- ### Targets {-} -->
<!-- We use the [`targets`](https://books.ropensci.org/targets/) package to generate data and metadata files that are used to create the visualizations and summaries. -->
<!-- Some of the components are rendered on a regular, time-determined basis, like the weekly report. Others are rendered less often, typically quarterly. -->
<!-- Most of the targets call functions in `R/functions.R`. The specific targets can be viewed in the `_targets.R` file in the root directory of the repository. -->
<!-- A typical workflow to 'make' or update the *data* files, is as follows: -->
<!-- ``` -->
<!-- library(targets) -->
<!-- tar_make() -->
<!-- ``` -->
### Quarto {-}
To generate the site, we use [`Quarto`](https://quarto.org).
A typical command to regenerate the site is the following:
```
quarto render src
```
Configuration files written in YAML control the rendering process.
These files and the Quarto (`.qmd`) source files used to generate the site are in the `src/` directory.
The rendering command creates a full website in the `docs/` directory.
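For orientation, a project configuration along these lines might look like the excerpt below. The keys shown are illustrative, not the repository's actual settings; consult the `_quarto.yml` file in `src/` for the authoritative configuration.

```yaml
# Hypothetical excerpt from src/_quarto.yml
project:
  type: website
  output-dir: ../docs   # render into the docs/ directory served by GitHub Pages
```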
### Package Reproducibility {-}
We use the [`renv`](https://rstudio.github.io/renv/articles/renv.html) package to track package dependencies.
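A typical `renv` workflow, sketched below, records the installed package versions in a lockfile and restores them on another machine; the two calls shown are the package's standard entry points.

```r
# Record the versions of packages the project uses in renv.lock
renv::snapshot()

# On a fresh clone, reinstall the recorded versions
renv::restore()
```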
## Strategy {-}
Some of the data elements in the report change often, but others do not.
We have found it faster and more convenient in many cases to download various data files from Databrary and store copies as comma-separated value (CSV) files in a private (local) directory.
Some of the CSVs contain potentially identifiable human subjects data, so we use a `.gitignore` file to keep those files out of git tracking and prevent them from being uploaded to GitHub.
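The caching pattern described above can be sketched as follows. The file path and the `fetch_from_databrary()` helper are hypothetical placeholders for illustration, not the repository's actual code.

```r
# Illustrative download-and-cache pattern; the path and fetch function
# are hypothetical placeholders.
cache_file <- file.path("src", "csv", "volumes.csv")

if (file.exists(cache_file)) {
  # Reuse the locally cached copy
  volumes <- read.csv(cache_file)
} else {
  # Download fresh data from Databrary (placeholder helper), then cache it
  volumes <- fetch_from_databrary()
  write.csv(volumes, cache_file, row.names = FALSE)
}
```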
These data analyses and visualizations have developed piecemeal over several years.
They could undoubtedly be optimized and improved.
The primary developer (Rick Gilmore) has had a 'git-er-done' attitude toward the project.
Gilmore takes some solace in the following quotation from the father of literate programming, [Donald Knuth](https://en.wikiquote.org/wiki/Donald_Knuth):
>...the real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
<!-- ### File-level dependencies -->
<!-- Several files contain historical data and are used for generating the time series plots in the weekly report: -->
<!-- - `src/csv/volumes-shared-unshared.csv` -->
<!-- - `src/csv/citations-monthly.csv` -->
<!-- - `src/csv/institutions-investigators.csv` -->
<!-- - `src/csv/max-ids.csv` -->
### Roadmap {-}
- Devise visualizations of assets by investigator/institution.
- Create JSON lat, lon file for Databrary home page.
- Report by-user, by-institution summary data.
- Access and report on summary data about "private" volumes.
## Clean-up {-}
Log out of Databrary.
```{r logout-databrary}
# End the Databrary session
l <- databraryr::logout_db()

# Remove any cached credential file left behind by the session
if (file.exists("../.databrary.RData")) {
  unlink("../.databrary.RData", recursive = TRUE)
}
```