---
title       : Data sharing policy
subtitle    : Adapted from "How to share data with a statistician"
author      : Carsten & Osvaldo
job         : Institute of Medical Virology, University of Zurich
framework   : io2012
highlighter : highlight.js
hitheme     : tomorrow
widgets     : []
mode        : selfcontained
---
These slides are an adaptation of How to share data with a statistician by Jeff Leek (Johns Hopkins Bloomberg School of Public Health).
Code available on GitHub
Moreover:
- Reduces errors and iterations (and more iterations mean more time)
- Improves reproducibility (should your analysis be questioned)
- Helps communication
--- &vcenter
- The raw data: because it is the most trustworthy source.
- A tidy data set: because it is directly processable, more on this later.
- A code book describing each variable and its values in the tidy data set: it reduces errors, helps understanding, enforces reproducibility.
- An explicit and exact recipe you used to go from 1 to 2 and 3.
Raw data are often messy, and we can't do much about that. But when we derive other data from them, we can try to do it in a tidy way.
Examples:
- FACS output (the `.fcs` file, before using Flowjo or anything else).
- The `.csv` or `.txt` file from the plate reader (before loading it into Excel).
- Microscopy `.tiff` images.
- NGS sequences in `.fastq` format.
As we will see, this is an example of messy data.
- No software analysis
- No manipulation/removal of data
- Data were not summarised
If manipulated data is reported as raw, the statistician has to perform an autopsy to find out what went wrong.
Autopsies are as fun as being hit by a (large) truck, with the downside of not being a fast process.
(adapted)
Tidy data are easy to clean and analyse.
There is no need to reinvent the wheel for each new dataset.
The development of tidy data has been driven by my struggles working with real-world datasets, which are often organised in bizarre ways. I have spent countless hours struggling to get these datasets organised in a way that makes data analysis possible, let alone easy.
(Hadley Wickham)
A tidy dataset follows three fundamental principles:
- Measured variables in the columns
- Single observations of the variables in the rows
- Different tables for different types of variables
On point 3: do not rely on multiple Excel worksheets; instead, use unique identifiers to link the different tables.
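A minimal sketch of point 3 in plain Python (the table contents are taken from later slides; representing tables as lists of dicts is just for illustration):

```python
# Two tidy tables linked by the unique identifier `id`.
patients = [
    {"id": 25, "date_of_birth": "1979-01-16", "sex": "M"},
    {"id": 64, "date_of_birth": "1984-09-20", "sex": "F"},
]
tests = [
    {"id": 25, "test": "HIV", "viral_load": 3100},
    {"id": 25, "test": "HCV", "viral_load": 45000},
    {"id": 64, "test": "HIV", "viral_load": 0},
    {"id": 64, "test": "HCV", "viral_load": 85000},
]

# A join on `id` recovers the combined information when needed.
by_id = {p["id"]: p for p in patients}
merged = [{**t, **by_id[t["id"]]} for t in tests]
print(merged[0])
# {'id': 25, 'test': 'HIV', 'viral_load': 3100,
#  'date_of_birth': '1979-01-16', 'sex': 'M'}
```

Keeping patient attributes in one table and measurements in another avoids repeating (and possibly contradicting) the patient data on every test row.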
--- &twocol w1:50% w2:50%
The `id` column identifies the patient and will be used to link with the next tables (the first column is the row number).
*** left
##   id         dob male female
## 1 25  1979-01-16  yes
## 2 64 20 sep 1984            y
Values of a variable (male/female) are used as column headers instead of being stored in the rows. Dates and sex are reported inconsistently.
*** right
##   id date_of_birth sex
## 1 25    1979-01-16   M
## 2 64    1984-09-20   F
Dates are reported in the consistent `YYYY-MM-DD` format, and sex is now a variable (reported in its own column) coded consistently (capitalised initial).
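A hedged sketch of how such cleaning could be scripted in Python (the two records come from the slide; the parsing formats and coding rules are assumptions):

```python
from datetime import datetime

# Messy records as they might come out of a spreadsheet.
messy = [
    {"id": 25, "dob": "1979-01-16", "male": "yes", "female": ""},
    {"id": 64, "dob": "20 sep 1984", "male": "", "female": "y"},
]

def normalise_date(s):
    """Try a few known formats and emit ISO YYYY-MM-DD."""
    for fmt in ("%Y-%m-%d", "%d %b %Y"):
        try:
            return datetime.strptime(s, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    raise ValueError(f"unrecognised date: {s!r}")

tidy = [
    {
        "id": r["id"],
        "date_of_birth": normalise_date(r["dob"]),
        # sex becomes one variable, coded by a capitalised initial
        "sex": "M" if r["male"] else "F",
    }
    for r in messy
]
print(tidy)
```

The key point is that the cleaning rules are written down once, in code, instead of being applied by hand cell by cell.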
--- &twocol w1:50% w2:50%
This table reports the results of some virology tests. The `id` column identifies the patient, so it can be used to link to the previous table.
*** left
##   id  HIV   HCV
## 1 25 3100 45000
## 2 64    0 85000
The analysts would need to adapt their tool if, say, another test were added.
*** right
## id test viral_load
## 1 25 HIV 3100
## 2 25 HCV 45000
## 3 64 HIV 0
## 4 64 HCV 85000
Easier to parse and analyse.
parse: analyse (a string or text) into logical syntactic components (Oxford Dictionary)
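The wide-to-long reshape above can be sketched in plain Python (in R one would typically reach for `tidyr` or `reshape2::melt`; the list-of-dicts representation here is just for illustration):

```python
# Wide table from the slide: one column per test.
wide = [
    {"id": 25, "HIV": 3100, "HCV": 45000},
    {"id": 64, "HIV": 0, "HCV": 85000},
]

# "Melt" it into the long, tidy form: one row per observation.
# Adding a new test then adds rows, not columns, so downstream
# code needs no changes.
long = [
    {"id": row["id"], "test": test, "viral_load": row[test]}
    for row in wide
    for test in ("HIV", "HCV")
]
for r in long:
    print(r)
```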
Please remove the space from the file name: `JRCSF all.pzf` becomes `JRCSFall.pzf` or `JRCSF_all.pzf`. Statisticians often use Linux for data analysis, where spaces separate commands and their arguments.
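In practice the rename is a one-line shell command; the `touch` below just creates an example file so the snippet is self-contained (the file name comes from the slide, the rest is illustrative):

```shell
# Create an example file with a space in its name, then rename it.
# The quotes are what keep "JRCSF all.pzf" a single argument;
# without them, the shell would see two separate words.
touch "JRCSF all.pzf"
mv "JRCSF all.pzf" JRCSF_all.pzf
ls JRCSF_all.pzf
```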
Then, applying the principles of tidy data, one has
## inhibitor assay_n log10_conc inhibition_percent other_info
## 1 Cd4IgG2 1 1.3979 99 <NA>
## 2 Cd4IgG2 1 0.7959 90 <NA>
## 3 Cd4IgG2 1 0.1938 66 <NA>
...
## inhibitor assay_n log10_conc inhibition_percent other_info
## 117 PGT145 2 -1.010 76 star
## 118 PGT145 2 -1.612 47 star
## 119 PGT145 2 -2.214 16 star
The code book contains a more detailed description of what is in the tidy dataset.
It should include
- information about the measured/reported variables (e.g. units)
- whether and how measurements were summarised
- information about the experimental design (e.g. study design, instrument used, experimenter).
Generally speaking, variables can be:
- continuous (weight, speed, fluorescence)
- ordinal (discrete, but quantitative: low, medium, high)
- categorical (no order relation given: male/female, vaccinated/non-vaccinated)
- missing (only when you don't know what happened; code with `NA`)
- censored (missing, but you know more or less why; code with `NA` and set an additional column `censored` to `TRUE`)
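A small Python sketch of the missing-vs-censored convention (using `None` in place of R's `NA`; the records and column names are invented for illustration):

```python
# Three measurements illustrating the coding convention.
measurements = [
    # value observed normally
    {"id": 1, "viral_load": 3100, "censored": False},
    # truly missing: sample lost, nobody knows the value
    {"id": 2, "viral_load": None, "censored": False},
    # censored: value unknown, but the reason is known
    # (e.g. below the assay's detection limit)
    {"id": 3, "viral_load": None, "censored": True},
]

# Downstream code can then treat the two kinds of NA differently.
truly_missing = [m["id"] for m in measurements
                 if m["viral_load"] is None and not m["censored"]]
below_limit = [m["id"] for m in measurements
               if m["viral_load"] is None and m["censored"]]
print(truly_missing, below_limit)  # [2] [3]
```

Without the `censored` flag, both kinds of `NA` would look identical and the statistician could not model them differently.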
Do not use any formatting or annotation that would not survive in a plain text file.
One reason why statisticians prefer to write programs/scripts to analyse data is that a set of written instructions can be reproduced exactly, unlike a set of mouse clicks.
If you don't know a programming language and you need to request/describe an analysis, you can use pseudocode: a detailed cooking recipe.
- Take the file for sample A, analyse it with the program X and save column Y
- Repeat for samples B, C, D
- Plot mean and standard deviation (or median, or boxplot) as a function of the property Z of the sample that is listed in file W.
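The recipe above can be sketched as a runnable script; everything here (sample names, values, property Z) is a hypothetical stand-in:

```python
import statistics

# Column Y extracted for each sample ("program X" would produce these).
samples = {
    "A": [1.2, 1.4, 1.1],
    "B": [2.3, 2.1, 2.6],
    "C": [0.9, 1.0, 0.8],
    "D": [3.1, 3.3, 3.0],
}
# Property Z of each sample, as it would be listed in file W.
property_z = {"A": 10, "B": 20, "C": 5, "D": 40}

summary = []
for name, values in samples.items():  # "repeat for B, C, D"
    summary.append({
        "sample": name,
        "z": property_z[name],
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    })

# Sort by Z so the result is ready to plot mean +/- sd against Z.
summary.sort(key=lambda row: row["z"])
for row in summary:
    print(row)
```

The point is not the particular numbers but that the whole recipe, unlike a sequence of mouse clicks, can be re-run verbatim.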
--- &vcenter