# Structuring your project
The scientific process is naturally incremental, and many projects start
life as random notes, some code, then a manuscript, and eventually everything
is a bit mixed together. It does not have to be like that.

Objectives:

* To be able to create self-contained projects in RStudio
* To be able to load data from a CSV file
* To be able to transfer things from the console to a file
* To be able to create an R markdown document

## Projects
Most people tend to organize their projects like this:

There are many reasons why we should *ALWAYS* avoid this:

1. It is really hard to tell which version of your data is
the original and which is the modified one.
2. It gets really messy because it mixes files with various
extensions together.
3. It probably takes you a lot of time to actually find
things, and to relate the correct figures to the exact code
that was used to generate them.

A good project layout will ultimately make your life easier:

* It will help ensure the integrity of your data;
* It makes it simpler to share your code with someone else
(a lab-mate, collaborator, or supervisor);
* It allows you to easily upload your code with your manuscript submission;
* It makes it easier to pick the project back up after a break.

Fortunately, there are tools and packages which can help you manage your
work effectively. One of the most powerful and useful aspects of RStudio
is its project management functionality.

> ## Creating a self-contained project
>
> We're going to create a new project in RStudio:
>
> 1. Click the "File" menu button, then "New Project".
> 2. Click "New Directory".
> 3. Click "Empty Project".
> 4. Type in the name of the directory to store your project, e.g. "my_project".
> 5. Click the "Create Project" button.

## Best practices for project organization

Although there is no "best" way to lay out a project, there are some general
principles to adhere to that will make project management easier:

### Treat data as read only

This is probably the most important goal of setting up a project. Data is
typically time consuming and/or expensive to collect. Working with it
interactively (e.g., in Excel) where it can be modified means you are never
sure of where the data came from, or how it has been modified since collection.
It is therefore a good idea to treat your data as "read-only".

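
One way to back this principle up on disk is to remove write permission from the raw files once they are in place. The sketch below is only an illustration and assumes a Unix-like system (on Windows, `Sys.chmod()` only toggles the read-only attribute). The file name `survey.csv` and its contents are made up, and a temporary directory stands in for your project's data folder.

```r
# A temporary folder stands in for the project's data directory (assumption).
data_dir <- tempfile("data_")
dir.create(data_dir)

# A made-up raw data file, purely for illustration.
raw_file <- file.path(data_dir, "survey.csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)), raw_file,
          row.names = FALSE)

# Remove write permission: owner, group and others can only read the file.
Sys.chmod(raw_file, mode = "0444", use_umask = FALSE)

# Reading still works as usual.
raw <- read.csv(raw_file)
nrow(raw)

# Inspect the permission bits of the file.
format(file.info(raw_file)$mode)
```

This does not stop a determined user from changing the permissions back, but it does guard against accidental edits.
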

### Data Cleaning

In many cases your data will be "dirty": it will need significant preprocessing
to get into a format R (or any other programming language) will find useful. This
task is sometimes called "data munging". I find it useful to store the scripts
that do this cleaning in a separate folder, and to create a second "read-only"
data folder to hold the "cleaned" data sets.


### Treat generated output as disposable

Anything generated by your scripts should be treated as disposable: you should
be able to regenerate all of it from your scripts.

There are lots of different ways to manage this output. I find it useful to
have an output folder with different sub-directories for each separate
analysis. This makes it easier later, as many of my analyses are exploratory
and don't end up being used in the final project, and some of the analyses
get shared between projects.

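
The cleaning and output principles can be sketched together as a small pipeline: raw data goes in, cleaned data and results come out, and everything except the raw data can be deleted and rebuilt. This is a minimal sketch, not a prescribed layout. The folder names `data-raw`, `data-clean` and `output/heights`, the file names, and the toy measurements are all made up for illustration. The two functions stand in for separate script files, and a temporary directory stands in for the project folder.

```r
# A temporary folder stands in for the project root (assumption).
project   <- tempfile("project_")
raw_dir   <- file.path(project, "data-raw")           # original data, never edited
clean_dir <- file.path(project, "data-clean")         # written by the cleaning step
out_dir   <- file.path(project, "output", "heights")  # one sub-folder per analysis
dir.create(raw_dir, recursive = TRUE)

# Made-up "dirty" raw data: stray whitespace, mixed case, a missing value.
writeLines(c("site,height",
             " north ,1.2",
             "South,NA",
             "NORTH,1.5",
             "south,0.9"),
           file.path(raw_dir, "plants.csv"))

# Cleaning step: read the raw file, tidy it, write a cleaned copy elsewhere.
clean_data <- function() {
  raw <- read.csv(file.path(raw_dir, "plants.csv"), stringsAsFactors = FALSE)
  raw$site <- tolower(trimws(raw$site))   # normalise the site labels
  clean <- raw[!is.na(raw$height), ]      # drop rows without a measurement
  dir.create(clean_dir, recursive = TRUE, showWarnings = FALSE)
  write.csv(clean, file.path(clean_dir, "plants-clean.csv"), row.names = FALSE)
}

# Analysis step: read the cleaned data and write a results table.
make_output <- function() {
  clean <- read.csv(file.path(clean_dir, "plants-clean.csv"))
  means <- aggregate(height ~ site, data = clean, FUN = mean)
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  write.csv(means, file.path(out_dir, "mean-height.csv"), row.names = FALSE)
}

# Everything except the raw data can be deleted and rebuilt from the code.
unlink(c(clean_dir, dirname(out_dir)), recursive = TRUE)
clean_data()
make_output()

read.csv(file.path(out_dir, "mean-height.csv"))
```

Because the raw file is only ever read, running the two steps again always starts from the same input.
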

## Good Enough Practices for Scientific Computing

[Good Enough Practices for Scientific Computing](https://github.com/swcarpentry/good-enough-practices-in-scientific-computing/blob/gh-pages/good-enough-practices-for-scientific-computing.pdf) gives the following recommendations for project organization:

1. Put each project in its own directory, which is named after the project.
2. Put text documents associated with the project in the `doc` directory.
3. Put raw data and metadata in the `data` directory, and files generated
during cleanup and analysis in a `results` directory.
4. Put source for the project's scripts and programs in the `src` directory.
5. Name all files to reflect their content or function.

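
The directory layout in these recommendations can be created from R itself. This is a minimal sketch, assuming a project called `my_project` (the example name used earlier). It is created under a temporary directory here so that running the example does not touch your real files.

```r
# Project root: in real use this would be the folder RStudio created for you.
project <- file.path(tempfile("demo_"), "my_project")

# The four directories named in the recommendations above.
subdirs <- c("doc", "data", "results", "src")

for (d in subdirs) {
  # recursive = TRUE also creates the project folder itself if needed.
  dir.create(file.path(project, d), recursive = TRUE, showWarnings = FALSE)
}

# List what was created.
list.dirs(project, full.names = FALSE, recursive = FALSE)
```
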

### Challenge

1. Create a new project for today's workshop with folders for the raw data,
scripts, and outputs. Remember where you created the project.
2. Place the workshop data you downloaded earlier in the `data/` folder of
your project. Open the directory containing your new project in Explorer or
Finder to copy the data file.


## Interactive work vs scripts

Running commands interactively is great for learning something new or figuring
out how to get something just right. After a while it becomes tedious to have
to start from scratch every time. As you work more with R you will want to reuse
parts of your analysis or run them automatically. This is where scripts come in.

Scripts are text files that hold the code you write. We will work both with
scripts and the console during this workshop. To create a new script you use
"File > New File > R Script".

When I get a new dataset one of the first things I do is load it into an
interactive session and explore it a bit.
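```{r}
# load the gapminder data from the project's data/ folder
gapminder <- read.csv("data/gapminder-FiveYearData.csv")
# summary statistics for every column, then the first six rows
summary(gapminder)
head(gapminder)
# selecting only the gdpPercap column and keeping its first six values
first_gdps <- head(gapminder$gdpPercap)
sum(first_gdps)
```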
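
Once commands like these work interactively, they can be saved in a script and run in one go with `source()`. The sketch below writes a tiny, made-up script to a temporary file and then runs it. It uses R's built-in `cars` data set so that it works without the workshop data. In a real project you would write the script in the RStudio editor and save it in your scripts folder.

```r
# Write a two-line script to a temporary file (a stand-in for a real script).
script_path <- tempfile(fileext = ".R")
writeLines(c("speeds <- head(cars$speed)",
             "total_speed <- sum(speeds)"),
           script_path)

# Run every line of the script, in order, in the current session.
source(script_path)

# Objects created by the script are now available in the console.
total_speed
```
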

### Challenge

1. Load the data and inspect the **last** six rows interactively. Calculate the
sum of the `gdpPercap` column for the last six rows.
2. Create a script that does the same thing.


## Reports

Humans understand the world through narratives, or stories. Figures, tables
and code by themselves do not tell a story. They are more like a collection of
factoids. However, a story without figures, tables and code is just a rumour or
folklore. The combination of facts and narrative is what allows people to
understand and be convinced.

One way to combine your narrative with figures and tables is
[Rmarkdown](http://rmarkdown.rstudio.com/lesson-1.html). You can use a single
Rmarkdown file to both:

* save and execute code; and
* generate high quality reports that can be shared with people who do not have
R installed.

In fact most of this tutorial is written using Rmarkdown!


> ## R notebooks
>
> At the start of October RStudio gained an ["R notebook"
> interface](https://www.r-bloggers.com/r-notebooks/). If you like Rmarkdown
> check it out in the preview release of RStudio.

There are several different output formats you can choose: HTML, PDF, Word.
Depending on how much time you are willing to invest you can make these
reports look extremely pretty. They are very useful for reports that change
often or need to be produced frequently. Personally I would still give the
individual plots to a graphic designer for an external publication, but for
sharing within the organisation and with friends Rmarkdown is a life saver. It
means you always know exactly what data went into a figure or table and how
it was made. You can also make [presentations from Rmarkdown](https://support.rstudio.com/hc/en-us/articles/200486468-Authoring-R-Presentations).

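
The output format can also be chosen from the console or from a script, using `rmarkdown::render()` and its `output_format` argument. The sketch below is illustrative only: the report text is made up and written to a temporary file, and rendering is skipped when the rmarkdown package or pandoc is not available.

```r
# Three backticks, built with strrep() so this example can sit in a code block.
fence <- strrep("`", 3)

# A made-up, minimal R Markdown document.
rmd_lines <- c(
  "---",
  "title: \"A tiny report\"",
  "---",
  "",
  "The built-in cars data set has the following summary.",
  "",
  paste0(fence, "{r, echo = FALSE}"),
  "summary(cars)",
  fence
)
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# Render to HTML only if the required tools are installed.
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

if (can_render) {
  # Other formats include "pdf_document" (needs LaTeX) and "word_document".
  html_path <- rmarkdown::render(rmd_path, output_format = "html_document",
                                 quiet = TRUE)
  file.exists(html_path)
}
```
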

> ## Output options
>
> Chunk output can be customized with [knitr](http://yihui.name/knitr/options/)
> options, arguments set in the `{}` of a chunk header. Five useful arguments:
>
> * `include = FALSE` prevents code and results from appearing in the
> finished file. Rmarkdown still runs the code in the chunk, and the
> results can be used by other chunks.
> * `echo = FALSE` prevents code, but not the results, from appearing in
> the finished file. This is a useful way to embed figures.
> * `message = FALSE` prevents messages that are generated by code from
> appearing in the finished file.
> * `warning = FALSE` prevents warnings that are generated by code from
> appearing in the finished file.
> * `fig.cap = "..."` adds a caption to graphical results.
>
> Check the [R Markdown Reference
> Guide](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf) [**PDF!**] for more knitr
> options.

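
Options that you want for every chunk can also be set once, near the top of the document, with `knitr::opts_chunk$set()`. Individual chunk headers can still override them. This is a minimal sketch, assuming the knitr package is installed. The particular defaults chosen here are just an example.

```r
if (requireNamespace("knitr", quietly = TRUE)) {
  # Defaults applied to every later chunk in the document.
  knitr::opts_chunk$set(
    echo = FALSE,     # hide the code, keep the results
    message = FALSE,  # hide messages
    warning = FALSE   # hide warnings
  )

  # Check what the current default for echo is.
  knitr::opts_chunk$get("echo")
}
```

In an Rmarkdown file this code would usually go in the first chunk, often with `include = FALSE` in its header so that the setup itself does not show up in the report.
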

### Challenge

1. Create an Rmarkdown file that loads the gapminder data and shows the
first three rows of the data as a table and the sum of the `gdpPercap`
column for the first six rows.
2. Add a sentence to the document explaining what you are doing and why.
3. "Knit" the document and open it in your browser.