<!DOCTYPE html>
<html>
<head>
  <title>Data sharing policy</title>
  <meta charset="utf-8">
  <meta name="description" content="Data sharing policy">
  <meta name="author" content="Carsten &amp; Osvaldo">
  <meta name="generator" content="slidify" />
  <meta name="apple-mobile-web-app-capable" content="yes">
  <meta http-equiv="X-UA-Compatible" content="chrome=1">
  <link rel="stylesheet" href="libraries/frameworks/io2012/css/default.css" media="all" >
  <link rel="stylesheet" href="libraries/frameworks/io2012/phone.css" 
    media="only screen and (max-device-width: 480px)" >
  <link rel="stylesheet" href="libraries/frameworks/io2012/css/slidify.css" >
  <link rel="stylesheet" href="libraries/highlighters/highlight.js/css/tomorrow.css" />
  <base target="_blank"> <!-- This amazingness opens all links in a new tab. -->
  <script data-main="libraries/frameworks/io2012/js/slides" 
    src="libraries/frameworks/io2012/js/require-1.0.8.min.js">
  </script>
  
    <link rel="stylesheet" href = "assets/css/ribbons.css">

</head>
<body style="opacity: 0">
  <slides class="layout-widescreen">
    
    <!-- LOGO SLIDE -->
    <!-- END LOGO SLIDE -->
    

    <!-- TITLE SLIDE -->
    <!-- Should I move this to a Local Layout File? -->
    <slide class="title-slide segue nobackground">
      <hgroup class="auto-fadein">
        <h1>Data sharing policy</h1>
        <h2>Adapted from &quot;How to share data with a statistician&quot;</h2>
        <p>Carsten &amp; Osvaldo<br/>Institute of Medical Virology, University of Zurich</p>
      </hgroup>
          </slide>

    <!-- SLIDES -->
      <slide class="" id="slide-1" style="background:;">
  <hgroup>
    <h2>Introduction</h2>
  </hgroup>
  <article>
    <p>These slides are an adaptation of
<a href="https://github.com/jtleek/datasharing">How to share data with a statistician</a>
by Jeff Leek (Johns Hopkins Bloomberg School of Public Health).</p>

<p>Code available on <a href="https://github.com/ozagordi/DataSharingPolicy">GitHub</a></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-2" style="background:;">
  <hgroup>
    <h2>Why prescribe how to share data</h2>
  </hgroup>
  <article>
    <h4>Chiefly, because it takes more time to make sense of messy data.</h4>

<hr>

<p>Moreover:</p>

<ol>
<li>Reduces errors and iterations (and more iterations mean more time)</li>
<li>Improves reproducibility (should your analysis be questioned)</li>
<li>Helps communication</li>
</ol>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-3">
<hgroup>
  <h2>...and above all</h2>
</hgroup>
<article class = 'flexbox vcenter'>
<h3>It makes the life of the statistician much easier</h3>

</article>
<!-- Presenter Notes -->
</slide>
      <slide class="" id="slide-4" style="background:;">
  <hgroup>
    <h2>What you should deliver and why</h2>
  </hgroup>
  <article>
    <ol>
<li>The raw data: because they are the most trustworthy source.</li>
<li>A tidy data set: because it can be processed directly (more on this later).</li>
<li>A code book describing each variable and its values in the tidy data set:
it reduces errors, helps understanding and supports reproducibility.</li>
<li>The <em>explicit</em> and <em>exact</em> recipe you used to go from 1 to 2 and 3.</li>
</ol>

<p>Raw data are often messy, and we can&#39;t do much about that. But when we derive
other data from them, we can make the derived data tidy.</p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-5" style="background:;">
  <hgroup>
    <h2>The raw data</h2>
  </hgroup>
  <article>
    <p>Examples:</p>

<ol>
<li>FACS output (the <code>.fcs</code> file, before using FlowJo or anything else).</li>
<li>The <code>.csv</code> or <code>.txt</code> file from the plate reader (<em>before</em> loading into Excel).</li>
<li>Microscopy <code>.tiff</code> images.</li>
<li>NGS sequences in <code>.fastq</code> format.</li>
</ol>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-6" style="background:;">
  <hgroup>
    <h2>Example of raw data: file from the plate reader</h2>
  </hgroup>
  <article>
    <p>As we will see, this is an example of <em>messy</em> data.</p>

<p><img src="figures/Pico_screenshot.png" alt="Plate reader"></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-7" style="background:;">
  <hgroup>
    <h2>Raw means:</h2>
  </hgroup>
  <article>
    <ol>
<li>No software analysis</li>
<li>No manipulation or removal of data</li>
<li>No summarising of data</li>
</ol>

<p>If manipulated data are reported as raw, the statistician has to perform an
autopsy to find out what went wrong.</p>

<p>Autopsies are</p>

<blockquote>
<p>as fun as being hit by a (large) truck, with the downside of
not being a fast process.</p>

<p>(adapted)</p>
</blockquote>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-8" style="background:;">
  <hgroup>
    <h2>Tidy data set: why</h2>
  </hgroup>
  <article>
    <p>Tidy data are easy to clean and analyse.</p>

<p>There is no need to reinvent the wheel for each new dataset.</p>

<blockquote>
<p>The development of tidy data has been driven by my struggles working with
real-world datasets, which are often organised in bizarre ways. I have spent
countless hours struggling to get these datasets organised in a way that
makes data analysis possible, let alone easy.</p>

<p>(Hadley Wickham)</p>
</blockquote>

<h4>Tidy datasets are not <em>pretty</em> datasets. They are not meant to be visualised.</h4>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-9" style="background:;">
  <hgroup>
    <h2>Tidy dataset</h2>
  </hgroup>
  <article>
    <p>A tidy dataset follows three fundamental principles:</p>

<ol>
<li>Measured variables in the columns</li>
<li>Single observations of the variables in the rows</li>
<li>Different tables for different types of variables</li>
</ol>

<p>On point 3: <strong>no</strong> multiple worksheets in one Excel file. Keep each table
in its own file and use unique identifiers to link different tables.</p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-10" style="background:;">
  <hgroup>
    <h2>Toy example: patient features</h2>
  </hgroup>
  <article>
    <p>The <code>id</code> column identifies the patient and will be used to link to
the tables that follow (the first, unnamed column is the row number).</p>


<div style='float:left;width:48%;' class='centered'>
  <h4>Messy</h4>

<pre><code>##   id         dob male female
## 1 25  1979-01-16  yes       
## 2 64 20 sep 1984           y
</code></pre>

<p>Values of the variable sex (<code>male</code>, <code>female</code>) are used as column
headers instead of being reported in the rows. Dates and sex are reported inconsistently.</p>


</div>
<div style='float:right;width:48%;'>
  <h4>Tidy</h4>

<pre><code>##   id date_of_birth sex
## 1 25    1979-01-16   M
## 2 64    1984-09-20   F
</code></pre>

<p>Dates are reported in a consistent format <code>YYYY-MM-DD</code>; sex is now a single
variable with its own column, coded consistently (capitalised initial).</p>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>
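
      <slide class="" id="slide-10a" style="background:;">
  <hgroup>
    <h2>Toy example: patient features, tidied in code</h2>
  </hgroup>
  <article>
    <p>A minimal sketch of how the messy patient table could be turned into the tidy
one. It assumes that only the two date formats of the toy example occur and that
month names are in English; the names <code>parse_date</code> and
<code>tidy_patient</code> are hypothetical, chosen for illustration.</p>

```python
from datetime import datetime

# Messy rows as in the toy example: sex is spread over two columns
# and the dates use different formats.
messy = [
    {"id": 25, "dob": "1979-01-16", "male": "yes", "female": ""},
    {"id": 64, "dob": "20 sep 1984", "male": "", "female": "y"},
]

# Assumption: only these two date formats occur in the data.
DATE_FORMATS = ("%Y-%m-%d", "%d %b %Y")


def parse_date(text):
    """Return the date as YYYY-MM-DD, trying each known format in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError("unrecognised date: " + text)


def tidy_patient(row):
    """Collapse the male/female columns into one sex variable."""
    if row["male"].strip():
        sex = "M"
    elif row["female"].strip():
        sex = "F"
    else:
        sex = "NA"  # neither column filled in: the value is missing
    return {"id": row["id"], "date_of_birth": parse_date(row["dob"]), "sex": sex}


tidy = [tidy_patient(row) for row in messy]
for row in tidy:
    print(row["id"], row["date_of_birth"], row["sex"])
```

<p>An unknown date format raises an error instead of being guessed, so
inconsistencies surface early rather than during the analysis.</p>

  </article>
  <!-- Presenter Notes -->
</slide>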

      <slide class="" id="slide-11" style="background:;">
  <hgroup>
    <h2>Toy example: virology diagnosis</h2>
  </hgroup>
  <article>
    <p>This table reports results of some virology tests. The <code>id</code> column identifies
the patient, so it can be used to link to the previous table.</p>


<div style='float:left;width:48%;' class='centered'>
  <h4>Messy</h4>

<pre><code>##   id  HIV   HCV
## 1 25 3100 45000
## 2 64    0 85000
</code></pre>

<p>The analysts would need to adapt their tool if, say, another test were added.</p>


</div>
<div style='float:right;width:48%;'>
  <h4>Tidy</h4>

<pre><code>##   id test viral_load
## 1 25  HIV       3100
## 2 25  HCV      45000
## 3 64  HIV          0
## 4 64  HCV      85000
</code></pre>

<p>Easier to parse and analyse.</p>

<p><em>parse: analyse (a string or text) into logical syntactic components</em> (Oxford
Dictionary)</p>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>
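
      <slide class="" id="slide-11a" style="background:;">
  <hgroup>
    <h2>Toy example: virology diagnosis, tidied in code</h2>
  </hgroup>
  <article>
    <p>A minimal sketch of the reshaping from the messy (one column per test) to the
tidy (one row per measurement) table. The function name <code>to_long</code> is
hypothetical; the values are those of the toy example.</p>

```python
# Messy (wide) table from the toy example: one column per test.
wide = [
    {"id": 25, "HIV": 3100, "HCV": 45000},
    {"id": 64, "HIV": 0, "HCV": 85000},
]


def to_long(rows, id_column="id"):
    """Turn every non-id column into a (test, viral_load) row."""
    long_rows = []
    for row in rows:
        for name, value in row.items():
            if name != id_column:
                long_rows.append(
                    {"id": row[id_column], "test": name, "viral_load": value}
                )
    return long_rows


tidy = to_long(wide)
for row in tidy:
    print(row["id"], row["test"], row["viral_load"])
```

<p>If another test were added, it would only add rows to the tidy table; code
written against the columns <code>id</code>, <code>test</code> and
<code>viral_load</code> would not need to change.</p>

  </article>
  <!-- Presenter Notes -->
</slide>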

      <slide class="" id="slide-12" style="background:;">
  <hgroup>
    <h2>Excerpt of <code>JRCSF all.pzf</code></h2>
  </hgroup>
  <article>
    <p><img src="figures/JRCSF_screenshot.png" alt="JRCSF excerpt"></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-13" style="background:;">
  <hgroup>
    <h2>Excerpt of <code>JRCSF all.pzf</code></h2>
  </hgroup>
  <article>
    <p><img src="figures/JRCSF_screenshot_1.png" alt="JRCSF excerpt"></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-14" style="background:;">
  <hgroup>
    <h2>Excerpt of <code>JRCSF all.pzf</code></h2>
  </hgroup>
  <article>
    <p><img src="figures/JRCSF_screenshot_2.png" alt="JRCSF excerpt"></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-15" style="background:;">
  <hgroup>
    <h2>Excerpt of <code>JRCSF all.pzf</code></h2>
  </hgroup>
  <article>
    <p><img src="figures/JRCSF_screenshot_3.png" alt="JRCSF excerpt"></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-16" style="background:;">
  <hgroup>
    <h2>Excerpt of <code>JRCSF all.pzf</code></h2>
  </hgroup>
  <article>
    <p><img src="figures/JRCSF_screenshot_4.png" alt="JRCSF excerpt"></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-17" style="background:;">
  <hgroup>
    <h2>Excerpt of <code>JRCSF all.pzf</code></h2>
  </hgroup>
  <article>
    <p><img src="figures/JRCSF_screenshot_5b.png" alt="JRCSF excerpt"></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-18" style="background:;">
  <hgroup>
    <h2>Tidy up!</h2>
  </hgroup>
  <article>
    <p>Please remove the space from the file name: <code>JRCSF all.pzf</code> becomes
<code>JRCSFall.pzf</code> or <code>JRCSF_all.pzf</code>. Statisticians often use Linux for data
analysis, and on its command line a space separates the arguments of a command, so a
file name with a space is read as two names unless it is quoted.</p>

<p>Then, applying the principles of tidy data, one has:</p>

<pre><code>##   inhibitor assay_n log10_conc inhibition_percent other_info
## 1   Cd4IgG2       1     1.3979                 99       &lt;NA&gt;
## 2   Cd4IgG2       1     0.7959                 90       &lt;NA&gt;
## 3   Cd4IgG2       1     0.1938                 66       &lt;NA&gt;
</code></pre>

<p><code>...</code></p>

<pre><code>##     inhibitor assay_n log10_conc inhibition_percent other_info
## 117    PGT145       2     -1.010                 76       star
## 118    PGT145       2     -1.612                 47       star
## 119    PGT145       2     -2.214                 16       star
</code></pre>

  </article>
  <!-- Presenter Notes -->
</slide>
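
      <slide class="" id="slide-18a" style="background:;">
  <hgroup>
    <h2>Tidy up! Spaces in file names</h2>
  </hgroup>
  <article>
    <p>A small sketch of why the space in <code>JRCSF all.pzf</code> is a problem. It
uses <code>shlex.split</code> from the Python standard library, which splits a
command line the way a POSIX shell does; the helper <code>safe_name</code> is
hypothetical.</p>

```python
import shlex

# A shell splits an unquoted command line at spaces,
# so the file name is read as two separate words.
unquoted = shlex.split("ls JRCSF all.pzf")
print(unquoted)

# Quoting keeps the name whole, but it is easy to forget.
quoted = shlex.split("ls 'JRCSF all.pzf'")
print(quoted)


def safe_name(file_name):
    """Replace spaces with underscores, as suggested for JRCSF_all.pzf."""
    return file_name.replace(" ", "_")


renamed = safe_name("JRCSF all.pzf")
print(renamed)
```

<p>With the underscore the name survives unquoted use on the command line and in
scripts.</p>

  </article>
  <!-- Presenter Notes -->
</slide>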

      <slide class="" id="slide-19" style="background:;">
  <hgroup>
    <h2>Other resources to share: the code book</h2>
  </hgroup>
  <article>
    <p>The code book contains a more detailed description of what is in the tidy
dataset.</p>

<p>It should include:</p>

<ul>
<li>information about the measured/reported variables (<em>e.g.</em> units)</li>
<li>whether and how measurements were summarised</li>
<li>information about the experimental design (<em>e.g.</em> study design, instrument
used, experimenter).</li>
</ul>

<h4>This will be an invaluable resource for writing the paper later!</h4>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-20" style="background:;">
  <hgroup>
    <h2>How to code variables</h2>
  </hgroup>
  <article>
    <p>Generally speaking, variables can be:</p>

<ol>
<li>continuous (weight, speed, fluorescence)</li>
<li>ordinal (discrete, with a fixed order: low, medium, high)</li>
<li>categorical (no order relation given: male/female, vaccinated/non-vaccinated)</li>
<li>missing (only when you don&#39;t know what happened: code with <code>NA</code>)</li>
<li>censored (missing, but you know more or less why, <em>e.g.</em> below a detection
limit: code with <code>NA</code> and set an additional column <code>censored</code> to <code>TRUE</code>)</li>
</ol>

<p>Do not use anything that would be lost in a plain text file (<em>e.g.</em> colours
or cell highlighting).</p>

  </article>
  <!-- Presenter Notes -->
</slide>
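
      <slide class="" id="slide-20a" style="background:;">
  <hgroup>
    <h2>How to code variables: missing and censored</h2>
  </hgroup>
  <article>
    <p>A minimal sketch of how missing and censored values could be written to a plain
text table. The measurements are hypothetical (including the patient with
<code>id</code> 71), as is the helper <code>encode</code>; the assumption is that the
second value is censored because it fell below a detection limit.</p>

```python
import csv
import io

# Hypothetical measurements: None means that no value was recorded.
measurements = [
    {"id": 25, "viral_load": 3100, "censored": False},
    {"id": 64, "viral_load": None, "censored": True},   # reason known
    {"id": 71, "viral_load": None, "censored": False},  # reason unknown
]

COLUMNS = ("id", "viral_load", "censored")


def encode(value):
    """Write missing values as NA and booleans as TRUE/FALSE."""
    if value is None:
        return "NA"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    return str(value)


# Write to an in-memory buffer; a real script would write to a file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(COLUMNS)
for row in measurements:
    writer.writerow([encode(row[key]) for key in COLUMNS])

text_table = buffer.getvalue()
print(text_table)
```

<p>Everything the analyst needs is in the text itself: nothing depends on colours,
comments or cell formatting.</p>

  </article>
  <!-- Presenter Notes -->
</slide>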

      <slide class="" id="slide-21" style="background:;">
  <hgroup>
    <h2>Reproducibility</h2>
  </hgroup>
  <article>
    <p>One reason why statisticians prefer to write programs/scripts to analyse data
is that a set of written instructions can be reproduced exactly, unlike a set
of mouse clicks.</p>

<p>If you don&#39;t know a programming language and you need to request/describe an
analysis, you can use pseudocode: a detailed cooking recipe.</p>

<ol>
<li>Take the file for sample A, analyse it with the program X and save column Y</li>
<li>Repeat for samples B, C, D</li>
<li>Plot mean and standard deviation (or median, or boxplot) as a function of
the property Z of the sample that is listed in file W.</li>
</ol>

<p><a href="http://www.sciencemag.org/content/334/6060/1226">More on reproducibility</a></p>

  </article>
  <!-- Presenter Notes -->
</slide>
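
      <slide class="" id="slide-21a" style="background:;">
  <hgroup>
    <h2>Reproducibility: from recipe to script</h2>
  </hgroup>
  <article>
    <p>A sketch of how the recipe could look as a script. The file contents, the sample
names and the column name <code>Y</code> are hypothetical stand-ins, and the plot of
step 3 is left out: only the summary that would be plotted is computed.</p>

```python
import csv
import io
import statistics

# Hypothetical stand-ins for the files of samples A and B; in practice
# these would be read from disk. Column "Y" is the one to keep.
files = {
    "A": "Y\n1.0\n2.0\n3.0\n",
    "B": "Y\n2.0\n4.0\n6.0\n",
}


def column_y(text):
    """Step 1: read one file and keep column Y as numbers."""
    return [float(row["Y"]) for row in csv.DictReader(io.StringIO(text))]


def summarise(values):
    """Step 3 (without the plot): mean and sample standard deviation."""
    return {"mean": statistics.mean(values), "sd": statistics.stdev(values)}


# Step 2: repeat exactly the same instructions for every sample.
summary = {sample: summarise(column_y(text)) for sample, text in files.items()}
for sample, stats in summary.items():
    print(sample, stats["mean"], stats["sd"])
```

<p>Running the script again on the same files gives the same summary, which is the
point of writing the recipe down.</p>

  </article>
  <!-- Presenter Notes -->
</slide>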

      <slide class="" id="slide-22">
<hgroup>
  
</hgroup>
<article class = 'flexbox vcenter'>
<p><img src="figures/merci.png" alt="merci"></p>

</article>
<!-- Presenter Notes -->
</slide>
    <slide class="backdrop"></slide>
  </slides>

  <!--[if IE]>
    <script 
      src="http://ajax.googleapis.com/ajax/libs/chrome-frame/1/CFInstall.min.js">  
    </script>
    <script>CFInstall.check({mode: 'overlay'});</script>
  <![endif]-->
</body>
<!-- Grab CDN jQuery, fall back to local if offline -->
<script src="http://ajax.aspnetcdn.com/ajax/jQuery/jquery-1.7.min.js"></script>
<script>window.jQuery || document.write('<script src="libraries/widgets/quiz/js/jquery-1.7.min.js"><\/script>')</script>
<!-- Load Javascripts for Widgets -->
<!-- MathJax: Fall back to local if CDN offline but local image fonts are not supported (saves >100MB) -->
<script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    tex2jax: {
      inlineMath: [['$','$'], ['\\(','\\)']],
      processEscapes: true
    }
  });
</script>
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/2.0-latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<!-- <script src="https://c328740.ssl.cf1.rackcdn.com/mathjax/2.0-latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script> -->
<script>window.MathJax || document.write('<script type="text/x-mathjax-config">MathJax.Hub.Config({"HTML-CSS":{imageFont:null}});<\/script><script src="libraries/widgets/mathjax/MathJax.js?config=TeX-AMS-MML_HTMLorMML"><\/script>')
</script>
<!-- LOAD HIGHLIGHTER JS FILES -->
<script src="libraries/highlighters/highlight.js/highlight.pack.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<!-- DONE LOADING HIGHLIGHTER JS FILES -->
</html>