Commit f591e4c

authored
Add files via upload
1 parent 6a4acfe commit f591e4c

17 files changed

+8669
-0
lines changed

wrangling/combining-tables.Rmd

Lines changed: 258 additions & 0 deletions
@@ -0,0 +1,258 @@
1+
## Combining tables
2+
3+
```{r, message=FALSE, warning=FALSE}
4+
library(tidyverse)
5+
library(ggrepel)
6+
library(dslabs)
7+
ds_theme_set()
8+
```
9+
10+
The information we need for a given analysis may not be in just one table. For example, when forecasting elections we used the function `left_join` to combine the information from two tables. Here we use a simpler example to illustrate the general challenge of combining tables.
11+
12+
Suppose we want to explore the relationship between population size for US states, which we have in this table:
13+
14+
```{r}
15+
data(murders)
16+
head(murders)
17+
```
18+
19+
and electoral votes, which we have in this one:
20+
21+
```{r}
22+
data(polls_us_election_2016)  # also provides results_us_election_2016, used below
23+
head(results_us_election_2016)
24+
```
25+
26+
Notice that simply placing these two tables side by side will not work, since the states are not in the same order:
27+
28+
```{r}
29+
identical(results_us_election_2016$state, murders$state)
30+
```
31+
32+
The _join_ functions, described below, are designed to handle this challenge.
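
To see why row order matters, here is a minimal sketch using two hypothetical toy tables (`pop` and `votes` are invented for illustration and are not part of `dslabs`). Combining columns by position pairs the wrong rows, while a join pairs rows by the `state` key:

```{r}
library(dplyr)

# Hypothetical toy tables; the values are illustrative only
pop   <- tibble(state = c("A", "B", "C"), population = c(10, 20, 30))
votes <- tibble(state = c("C", "A", "B"), electoral_votes = c(3, 1, 2))

# Position-based combination pairs state A with the votes recorded for state C
mismatched <- tibble(state = pop$state,
                     electoral_votes = votes$electoral_votes)
mismatched

# A key-based join lines the rows up by state instead
joined <- left_join(pop, votes, by = "state")
joined
```

In `joined`, each state keeps its own electoral votes regardless of the row order in `votes`.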
33+
34+
### Joins
35+
36+
The `join` functions in the `dplyr` package, which are based on SQL joins, make sure that the tables are combined so that matching rows are together.
37+
The general idea is that one needs to identify one or more columns that will serve to match the two tables. Then a new table with the combined information is returned. Note what happens if we join the two tables above by `state` using `left_join`:
38+
39+
```{r}
40+
tab <- left_join(murders, results_us_election_2016, by = "state")
41+
42+
tab %>% select(state, population, electoral_votes) %>%
43+
head()
44+
```
45+
46+
The data has been successfully joined and we can now, for example, make a plot to explore the relationship between population and electoral votes:
47+
48+
```{r}
49+
tab %>% ggplot(aes(population/10^6, electoral_votes, label = abb)) +
50+
geom_point() +
51+
geom_text_repel() +
52+
scale_x_continuous(trans = "log2") +
53+
scale_y_continuous(trans = "log2") +
54+
geom_smooth(method = "lm", se = FALSE) +
55+
xlab("Population (in millions)") +
56+
ylab("Electoral Votes")
57+
```
58+
59+
We see the relationship is close to linear with about 2 electoral votes for every million persons, but with smaller states getting a higher ratio.
60+
61+
62+
In practice, it is not always the case that each row in one table has a matching row in the other. For this reason we have several different ways to join. To illustrate this challenge, take subsets of the tables above:
63+
64+
```{r}
65+
tab1 <- slice(murders, 1:6) %>%
66+
select(state, population)
67+
tab1
68+
```
69+
70+
so that we no longer have the same states in the two tables:
71+
```{r}
72+
tab2 <- slice(results_us_election_2016, c(1:3, 5, 14, 44)) %>%
73+
select(state, electoral_votes)
74+
tab2
75+
```
76+
77+
We will use these two tables as examples.
78+
79+
#### Left join
80+
81+
Suppose we want a table like `tab1`, but with electoral votes added for whichever states have them available. For this we use `left_join` with `tab1` as the first argument.
82+
83+
```{r}
84+
left_join(tab1, tab2)
85+
```
86+
87+
Note that `NA`s are added to the three states not appearing in `tab2`. Also note that this function, as well as all the other joins, can receive the first argument through the pipe:
88+
89+
```{r}
90+
tab1 %>% left_join(tab2)
91+
```
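
When `by` is omitted, `dplyr` joins on all the columns the two tables have in common and prints a message reporting which ones it used. A minimal sketch of naming the key explicitly, using hypothetical toy tables `x` and `y` (not the `tab1` and `tab2` defined above):

```{r}
library(dplyr)

# Hypothetical toy tables; the values are illustrative only
x <- tibble(state = c("A", "B", "C"), population = c(10, 20, 30))
y <- tibble(state = c("A", "C", "D"), electoral_votes = c(1, 3, 4))

# Every row of x is kept; state B has no match in y, so it gets NA
res <- x %>% left_join(y, by = "state")
res
```

Stating `by` explicitly guards against accidentally matching on an extra column that happens to share a name in both tables.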
92+
93+
94+
#### Right join
95+
96+
If instead of a table like `tab1` we want one like `tab2` we can use `right_join`:
97+
98+
```{r}
99+
tab1 %>% right_join(tab2)
100+
```
101+
102+
Notice that now the `NA`s are in the column coming from `tab1`.
103+
104+
#### Inner join
105+
106+
If we want to keep only the rows that have information in both tables we use `inner_join`. You can think of this as an intersection:
107+
108+
```{r}
109+
inner_join(tab1, tab2)
110+
```
111+
112+
#### Full join
113+
114+
And if we want to keep all the rows, and fill the missing parts with NAs, we can use a full join. You can think of this as a union:
115+
116+
```{r}
117+
full_join(tab1, tab2)
118+
```
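
The intersection and union analogy can be checked by counting rows. This is a sketch with hypothetical toy tables `x` and `y`, assuming each state appears at most once per table:

```{r}
library(dplyr)

# Hypothetical toy tables; the values are illustrative only
x <- tibble(state = c("A", "B", "C"), population = c(10, 20, 30))
y <- tibble(state = c("A", "C", "D"), electoral_votes = c(1, 3, 4))

# Only the states present in both tables (A and C)
both <- inner_join(x, y, by = "state")

# Every state present in either table (A, B, C and D)
either <- full_join(x, y, by = "state")

nrow(both)
nrow(either)
```

If a key value is repeated within a table, a join returns one row per matching pair, so the set analogy only holds for unique keys.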
119+
120+
#### Semi join
121+
122+
The `semi_join` function lets us keep the part of the first table for which we have information in the second. It does not add the columns of the second:
123+
124+
```{r}
125+
semi_join(tab1, tab2)
126+
```
127+
128+
129+
#### Anti join
130+
131+
The function `anti_join` is the opposite of `semi_join`. It keeps the elements of the first table for which there is no information in the second:
132+
133+
```{r}
134+
anti_join(tab1, tab2)
135+
```
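
Because `anti_join` is the opposite of `semi_join`, the two together split the first table into matched and unmatched rows. A short sketch with hypothetical toy tables `x` and `y`:

```{r}
library(dplyr)

# Hypothetical toy tables; the values are illustrative only
x <- tibble(state = c("A", "B", "C"), population = c(10, 20, 30))
y <- tibble(state = c("A", "C", "D"), electoral_votes = c(1, 3, 4))

kept    <- semi_join(x, y, by = "state")  # rows of x with a match in y
dropped <- anti_join(x, y, by = "state")  # rows of x without a match in y

# Together they account for every row of x
nrow(kept) + nrow(dropped) == nrow(x)
```

Neither result contains `electoral_votes`, since both functions only filter the first table.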
136+
137+
### Binding
138+
139+
Although we have yet to use it in this course, another common way in which datasets are combined is by _binding_ them. Unlike the join functions, the binding functions do not try to match by a variable but rather just combine datasets. If the dimensions are not compatible, for example when binding columns from datasets with different numbers of rows, one obtains an error.
140+
141+
#### Columns
142+
143+
The `dplyr` function `bind_cols` binds two objects by making them columns in a tibble. For example, if we quickly want to make a data frame consisting of numbers we can use:
144+
145+
```{r}
146+
bind_cols(a = 1:3, b = 4:6)
147+
```
148+
149+
This function requires that we assign names to the columns. Here we chose `a` and `b`.
150+
151+
Note there is an R-base function `cbind` that performs the same function but creates objects other than tibbles.
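
The difference in the returned object can be seen with a minimal sketch. The inputs to `bind_cols` are wrapped in tibbles here, which is a choice made for this illustration:

```{r}
library(dplyr)

# cbind on plain vectors returns a matrix
m <- cbind(a = 1:3, b = 4:6)

# bind_cols returns a tibble
d <- bind_cols(tibble(a = 1:3), tibble(b = 4:6))

is.matrix(m)
inherits(d, "tbl_df")
```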
152+
153+
`bind_cols` can also bind data frames. For example, here we break up the `tab` data frame and then bind them back together:
154+
155+
```{r}
156+
tab1 <- tab[, 1:3]
157+
tab2 <- tab[, 4:6]
158+
tab3 <- tab[, 7:9]
159+
new_tab <- bind_cols(tab1, tab2, tab3)
160+
head(new_tab)
161+
```
162+
163+
164+
#### Rows
165+
166+
The `bind_rows` function is similar but binds rows instead of columns:
167+
168+
```{r}
169+
tab1 <- tab[1:2,]
170+
tab2 <- tab[3:4,]
171+
bind_rows(tab1, tab2)
172+
```
173+
174+
This is based on an R-base function `rbind`.
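
The two binding functions react differently to inputs that do not line up. A sketch using hypothetical toy tibbles `three` and `two`, with `tryCatch` capturing the error so the chunk keeps running:

```{r}
library(dplyr)

# Hypothetical toy tibbles with different row counts and column names
three <- tibble(a = 1:3)
two   <- tibble(b = 1:2)

# bind_cols needs the same number of rows; 3 versus 2 is an error
result <- tryCatch(bind_cols(three, two), error = function(e) "error")
result

# bind_rows fills columns missing from one input with NA
padded <- bind_rows(three, two)
padded
```

So `bind_rows` does not stop when the columns differ; it is worth checking the result for unexpected `NA`s.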
175+
176+
### Set Operators
177+
178+
Another set of commands useful for combining datasets are the set operators. When applied to vectors, these behave as their names suggest. However, if the `tidyverse`, or more specifically `dplyr`, is loaded, these functions can also be used on data frames, not just on vectors.
179+
180+
#### Intersect
181+
182+
You can take intersections of vectors:
183+
184+
```{r}
185+
intersect(1:10, 6:15)
186+
```
187+
188+
```{r}
189+
intersect(c("a","b","c"), c("b","c","d"))
190+
```
191+
192+
But with `dplyr` loaded we can also do this for tables having the same column names:
193+
194+
```{r}
195+
tab1 <- tab[1:5,]
196+
tab2 <- tab[3:7,]
197+
intersect(tab1, tab2)
198+
```
199+
200+
201+
#### Union
202+
203+
Similarly, `union` takes the union:
204+
205+
```{r}
206+
union(1:10, 6:15)
207+
```
208+
209+
```{r}
210+
union(c("a","b","c"), c("b","c","d"))
211+
```
212+
213+
But with `dplyr` loaded we can also do this for tables having the same column names:
214+
215+
```{r}
216+
tab1 <- tab[1:5,]
217+
tab2 <- tab[3:7,]
218+
union(tab1, tab2)
219+
```
220+
221+
222+
#### Set difference
223+
224+
The set difference between a first and second argument can be obtained with `setdiff`. Unlike `intersect` and `union`, this function is not symmetric:
225+
226+
227+
```{r}
228+
setdiff(1:10, 6:15)
229+
setdiff(6:15, 1:10)
230+
```
231+
232+
As with the others above, we can apply it to data frames:
233+
```{r}
234+
tab1 <- tab[1:5,]
235+
tab2 <- tab[3:7,]
236+
setdiff(tab1, tab2)
237+
```
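
The data frame versions treat each row as one element of a set. A sketch with a hypothetical toy tibble `df`, split into overlapping pieces the same way as `tab` above:

```{r}
library(dplyr)

# Hypothetical toy tibble with seven distinct rows
df <- tibble(id = 1:7, letter = letters[1:7])
first  <- df[1:5, ]
second <- df[3:7, ]

common     <- intersect(first, second)  # rows 3 to 5
combined   <- union(first, second)      # rows 1 to 7, no duplicates
only_first <- setdiff(first, second)    # rows 1 and 2
only_second <- setdiff(second, first)   # rows 6 and 7

c(nrow(common), nrow(combined), nrow(only_first), nrow(only_second))
```

Swapping the arguments of `setdiff` returns a different set of rows, which is the asymmetry described above.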
238+
239+
#### `setequal`
240+
241+
Finally, the function `setequal` tells us if two sets are the same, regardless of order. So
242+
243+
```{r}
244+
setequal(1:5, 1:6)
245+
```
246+
247+
but
248+
249+
```{r}
250+
setequal(1:5, 5:1)
251+
```
252+
253+
It also works when applied to data frames, comparing their rows regardless of order. Here `tab1` and `tab2` do not contain the same rows, so they are not equal:
254+
255+
```{r}
256+
setequal(tab1, tab2)
257+
```
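
For contrast, here is a sketch of a case where `setequal` and `identical` disagree, using a hypothetical toy tibble `df` and a reordered copy:

```{r}
library(dplyr)

# Hypothetical toy tibble and a copy with the rows shuffled
df <- tibble(id = 1:3, letter = c("a", "b", "c"))
shuffled <- df[c(3, 1, 2), ]

same_object <- identical(df, shuffled)  # FALSE: row order differs
same_rows   <- setequal(df, shuffled)   # TRUE: same set of rows

c(same_object, same_rows)
```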
258+

wrangling/combining-tables.html

Lines changed: 713 additions & 0 deletions
Large diffs are not rendered by default.
