---
layout: topic
title: The `data.frame` class
author: Data Carpentry contributors
minutes: 30
---
```{r, echo=FALSE, purl=FALSE, message = FALSE}
source("setup.R")
surveys <- read.csv("data/portal_data_joined.csv")
```
```{r, echo=FALSE, purl=TRUE}
## The data.frame class
```
------------
> ## Learning Objectives
>
> * understand the concept of a `data.frame`
> * use sequences
> * know how to access any element of a `data.frame`
------------
# What are data frames?
`data.frame` is the _de facto_ data structure for most tabular data and what we
use for statistics and plotting.
A `data.frame` is a collection of vectors of identical lengths. Each vector
represents a column, and each vector can be of a different data type (e.g.,
characters, integers, factors). The `str()` function is useful to inspect the
data types of the columns.
A `data.frame` can be created with the functions `read.csv()` or `read.table()`,
that is, when importing spreadsheets from your hard drive (or the web).
In versions of R before 4.0.0, `read.csv()`, `read.table()`, and `data.frame()`
convert (= coerce) columns that contain characters (i.e., text) into the `factor`
data type by default; since R 4.0.0, such columns stay as `character` unless you
ask otherwise. Depending on what you want to do with the data, you may want to
keep these columns as `character`. To make the choice explicit, `read.csv()` and
`read.table()` have an argument called `stringsAsFactors`, which can be set to
`FALSE`:
```{r, eval=FALSE, purl=FALSE}
some_data <- read.csv("data/some_file.csv", stringsAsFactors=FALSE)
```
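To see the effect of `stringsAsFactors` without needing a file on disk, the
sketch below reads a small CSV from a string through the `text` argument of
`read.csv()`. The data and the names `csv_text`, `as_factor`, and `as_character`
are made up for illustration.

```r
# A tiny CSV held in a string instead of a file (made-up data)
csv_text <- "species,count\nmouse,12\nrat,7\nmouse,3"

# Same data, imported twice with the two possible settings
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_character <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(as_factor)     # the species column is a factor
str(as_character)  # the species column is a character vector
```

In both cases the `count` column is read as numbers; only the text column is
affected by the argument.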
<!--- talk about colClasses argument?, row names? --->
You can also create a `data.frame` manually with the function `data.frame()`. This
function also takes the argument `stringsAsFactors`. Compare the output of
these two examples to see the difference between text columns stored as
`factor` and text columns stored as `character`.
```{r, results='show', purl=TRUE}
## Text columns stored as factors (the default before R 4.0.0)
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel=c("furry", "furry", "squishy", "spiny"),
                           weight=c(45, 8, 1.1, 0.8), stringsAsFactors=TRUE)
str(example_data)
## Text columns kept as character
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel=c("furry", "furry", "squishy", "spiny"),
                           weight=c(45, 8, 1.1, 0.8), stringsAsFactors=FALSE)
str(example_data)
```
### Challenge
1. This hand-crafted `data.frame` contains a few mistakes. Can you spot and
fix them? Don't hesitate to experiment!
```
author_book <- data.frame(author_first=c("Charles", "Ernst", "Theodosius"),
author_last=c(Darwin, Mayr, Dobzhansky),
year=c(1942, 1970))
```
```{r, eval=FALSE, purl=TRUE, echo=FALSE}
## Challenge
## This hand-crafted `data.frame` contains a few mistakes.
## Can you spot and fix them? Don't hesitate to experiment!
author_book <- data.frame(author_first=c("Charles", "Ernst", "Theodosius"),
author_last=c(Darwin, Mayr, Dobzhansky),
year=c(1942, 1970))
```
1. Can you predict the class for each of the columns in the following example?
```
country_climate <- data.frame(country=c("Canada", "Panama", "South Africa", "Australia"),
climate=c("cold", "hot", "temperate", "hot/temperate"),
temperature=c(10, 30, 18, "15"),
northern_hemisphere=c(TRUE, TRUE, FALSE, "FALSE"),
has_kangaroo=c(FALSE, FALSE, FALSE, 1))
```
```{r, eval=FALSE, purl=TRUE, echo=FALSE}
## Challenge:
## Can you predict the class for each of the columns in the following example?
## Check your guesses using `str(country_climate)`. Are they what you expected?
## Why? Why not?
country_climate <- data.frame(country=c("Canada", "Panama", "South Africa", "Australia"),
climate=c("cold", "hot", "temperate", "hot/temperate"),
temperature=c(10, 30, 18, "15"),
northern_hemisphere=c(TRUE, TRUE, FALSE, "FALSE"),
has_kangaroo=c(FALSE, FALSE, FALSE, 1))
```
Check your guesses using `str(country_climate)`. Are they what you expected?
Why? Why not?
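When a vector mixes data types, R coerces every element to the most general
type that can represent them all: `logical` values become `numeric` when mixed
with numbers, and both become `character` when mixed with text.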
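The coercion behaviour can be checked directly on small vectors. This is a
minimal sketch; the vector names and values are made up for illustration.

```r
# Each vector mixes two data types; R picks one type for the whole vector
logical_and_number <- c(TRUE, FALSE, 3)   # logical + numeric
number_and_text <- c(1.5, 2, "three")     # numeric + character
logical_and_text <- c(TRUE, "maybe")      # logical + character

class(logical_and_number)  # "numeric"
logical_and_number         # 1 0 3 (TRUE became 1, FALSE became 0)
class(number_and_text)     # "character"
class(logical_and_text)    # "character"
```

Because each column of a `data.frame` is a vector, the same rules decide the
class of a column built from mixed values.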
# Inspecting `data.frame` objects
We already saw how the functions `head()` and `str()` can be useful to check the
content and the structure of a `data.frame`. Here is a non-exhaustive list of
functions to get a sense of the content/structure of the data.
* Size:
    * `dim()` - returns a vector with the number of rows as the first element and
      the number of columns as the second element (the **dim**ensions of the object)
    * `nrow()` - returns the number of rows
    * `ncol()` - returns the number of columns
* Content:
    * `head()` - shows the first 6 rows
    * `tail()` - shows the last 6 rows
* Names:
    * `names()` - returns the column names (synonym of `colnames()` for `data.frame`
      objects)
    * `rownames()` - returns the row names
* Summary:
    * `str()` - structure of the object and information about the class, length and
      content of each column
    * `summary()` - summary statistics for each column
Note: most of these functions are "generic": they can be used on other types of
objects besides `data.frame`.
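The functions listed above can be tried on a small data frame before applying
them to the survey data. The `pets` data frame below is made up for
illustration; it has 7 rows so that `head()` and `tail()` return different
subsets.

```r
# A small, made-up data frame to try the inspection functions on
pets <- data.frame(name = c("Rex", "Tom", "Bubbles", "Polly", "Nemo", "Spike", "Luna"),
                   legs = c(4, 4, 0, 2, 0, 4, 4),
                   stringsAsFactors = FALSE)

dim(pets)       # 7 2
nrow(pets)      # 7
ncol(pets)      # 2
head(pets)      # the first 6 of the 7 rows
tail(pets)      # the last 6 of the 7 rows
names(pets)     # "name" "legs"
rownames(pets)  # "1" through "7"
str(pets)       # class, length and first values of each column
summary(pets)   # summary statistics for each column
```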
# Indexing and sequences
```{r, echo=FALSE, purl=TRUE}
## Indexing and sequences
```
If we want to extract one or several values from a vector, we must provide one
or several indices in square brackets, just as we do in math. For instance:
```{r, results='show', purl=FALSE}
animals <- c("mouse", "rat", "dog", "cat")
animals[2]
animals[c(3, 2)]
animals[2:4]
more_animals <- animals[c(1:3, 2:4)]
more_animals
```
R indices start at 1. Programming languages like Fortran, MATLAB, and R start
counting at 1 because that's what human beings typically do. Languages in the C
family (including C++, Java, Perl, and Python) count from 0 because that's
simpler for computers to do.
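A short sketch of what counting from 1 means in practice, reusing the `animals`
vector from above (the other variable names are made up for illustration):

```r
animals <- c("mouse", "rat", "dog", "cat")

first_animal <- animals[1]               # the first element has index 1
last_animal <- animals[length(animals)]  # the last index equals the length
nothing <- animals[0]                    # index 0 selects no element at all

first_animal     # "mouse"
last_animal      # "cat"
length(nothing)  # 0
```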
`:` is a special function that creates numeric vectors of integers in increasing
or decreasing order; try `1:10` and `10:1`, for instance. The function `seq()`
(for **seq**uence) can be used to create more complex patterns:
```{r, results='show', purl=FALSE}
seq(1, 10, by=2)
seq(5, 10, length.out=3)
seq(50, by=5, length.out=10)
seq(1, 8, by=3) # sequence stops at 7 so that it does not exceed the upper limit
```
Our survey data frame has rows and columns (it has 2 dimensions). If we want to
extract specific data from it, we need to specify the "coordinates" we want.
Row numbers come first, followed by column numbers.
```{r, purl=FALSE}
surveys[1, 1] # first element in the first column of the data frame
surveys[1, 6] # first element in the 6th column
surveys[1:3, 7] # first three elements in the 7th column
surveys[3, ] # the 3rd row (the 3rd element of every column)
surveys[, 8] # the entire 8th column
head_surveys <- surveys[1:6, ] # surveys[1:6, ] is equivalent to head(surveys)
```
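The same row-then-column indexing can be tried on a data frame small enough to
check by eye. The `plots` data frame and its values are made up for
illustration and only stand in for `surveys`.

```r
# A made-up data frame standing in for `surveys`
plots <- data.frame(plot_id = c(1, 2, 3, 4),
                    species = c("DM", "PF", "DM", "OT"),
                    weight = c(40, 9, 44, 24),
                    stringsAsFactors = FALSE)

one_value <- plots[2, 3]      # row 2, column 3: a single value (9)
one_row <- plots[3, ]         # row 3, all columns: a data.frame with 1 row
one_column <- plots[, 2]      # all rows, column 2: a character vector
block <- plots[1:2, c(1, 3)]  # rows 1 and 2, columns 1 and 3: a 2 x 2 data.frame

one_value
one_row
one_column
block
```

Note that selecting a single column this way returns a plain vector, whereas
selecting a row returns a `data.frame`, because a row can hold several data
types.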
### Challenge
1. The function `nrow()` on a `data.frame` returns the number of rows. Use it,
in conjunction with `seq()`, to create a new `data.frame` called
`surveys_by_10` that includes every 10th row of the survey data frame,
starting at row 10 (10, 20, 30, ...).
```{r, echo=FALSE, purl=TRUE}
### The function `nrow()` on a `data.frame` returns the number of
### rows. Use it, in conjunction with `seq()`, to create a new
### `data.frame` called `surveys_by_10` that includes every 10th row
### of the survey data frame, starting at row 10 (10, 20, 30, ...)
```
<!---
```{r, purl=FALSE}
surveys_by_10 <- surveys[seq(10, nrow(surveys), by=10), ]
```
--->