| 1 | +## Combining tables |
| 2 | + |
| 3 | +```{r, message=FALSE, warning=FALSE} |
| 4 | +library(tidyverse) |
| 5 | +library(ggrepel) |
| 6 | +library(dslabs) |
| 7 | +ds_theme_set() |
| 8 | +``` |
| 9 | + |
| 10 | +The information we need for a given analysis may not be in just one table. For example, when forecasting elections we used the function `left_join` to combine the information from two tables. Here we use a simpler example to illustrate the general challenge of combining tables. |
| 11 | + |
| 12 | +Suppose we want to explore the relationship between population size for US states, which we have in this table: |
| 13 | + |
| 14 | +```{r} |
| 15 | +data(murders) |
| 16 | +head(murders) |
| 17 | +``` |
| 18 | + |
| 19 | +and electoral votes, which we have in this one: |
| 20 | + |
| 21 | +```{r} |
| 22 | +data(polls_us_election_2016) |
| 23 | +head(results_us_election_2016) |
| 24 | +``` |
| 25 | + |
| 26 | +Notice that simply placing these two tables side by side will not work, since the states are not in the same order: |
| 27 | + |
| 28 | +```{r} |
| 29 | +identical(results_us_election_2016$state, murders$state) |
| 30 | +``` |
| 31 | + |
| 32 | +The _join_ functions, described below, are designed to handle this challenge. |
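
The ordering problem described above can be reproduced on a small scale. The following is only a sketch: the tables `pop` and `votes` and their values are made up for illustration and are not part of `dslabs`.

```{r}
library(dplyr)

# Hypothetical toy tables with the same states in a different order
pop   <- tibble(state = c("A", "B", "C"), population = c(10, 20, 30))
votes <- tibble(state = c("C", "A", "B"), votes = c(3, 1, 2))

# Pasting columns side by side pairs rows by position,
# so state A ends up next to the votes that belong to state C
cbind(pop, votes["votes"])

# Joining by state pairs rows by the value of the key column instead
left_join(pop, votes, by = "state")
```

In the joined table each state sits next to its own `votes` value, whatever the row order of the inputs.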
| 33 | + |
| 34 | +### Joins |
| 35 | + |
| 36 | +The `join` functions in the `dplyr` package, which are based on SQL joins, make sure that the tables are combined so that matching rows are together. |
| 37 | +The general idea is that one needs to identify one or more columns that will serve to match the two tables. Then a new table with the combined information is returned. Note what happens if we join the two tables above by `state` using `left_join`: |
| 38 | + |
| 39 | +```{r} |
| 40 | +tab <- left_join(murders, results_us_election_2016, by = "state") |
| 41 | + |
| 42 | +tab %>% select(state, population, electoral_votes) %>% |
| 43 | + head() |
| 44 | +``` |
| 45 | + |
| 46 | +The data has been successfully joined and we can now, for example, make a plot to explore the relationship between population and electoral votes: |
| 47 | + |
| 48 | +```{r} |
| 49 | +tab %>% ggplot(aes(population/10^6, electoral_votes, label = abb)) + |
| 50 | + geom_point() + |
| 51 | + geom_text_repel() + |
| 52 | + scale_x_continuous(trans = "log2") + |
| 53 | + scale_y_continuous(trans = "log2") + |
| 54 | + geom_smooth(method = "lm", se = FALSE) + |
| 55 | + xlab("Population (in millions)") + |
| 56 | + ylab("Electoral Votes") |
| 57 | +``` |
| 58 | + |
| 59 | +We see the relationship is close to linear with about 2 electoral votes for every million persons, but with smaller states getting a higher ratio. |
| 60 | + |
| 61 | + |
| 62 | +In practice, it is not always the case that each row in one table has a matching row in the other. For this reason we have several different ways to join. To illustrate this challenge, take subsets of the tables above: |
| 63 | + |
| 64 | +```{r} |
| 65 | +tab1 <- slice(murders, 1:6) %>% |
| 66 | + select(state, population) |
| 67 | +tab1 |
| 68 | +``` |
| 69 | + |
| 70 | +so that we no longer have the same states in the two tables: |
| 71 | +```{r} |
| 72 | +tab2 <- slice(results_us_election_2016, c(1:3, 5, 14, 44)) %>% |
| 73 | + select(state, electoral_votes) |
| 74 | +tab2 |
| 75 | +``` |
| 76 | + |
| 77 | +We will use these two tables as examples. |
| 78 | + |
| 79 | +#### Left join |
| 80 | + |
| 81 | +Suppose we want a table like `tab1` but adding electoral votes to whatever states we have available. For this we use left join with `tab1` as the first argument. |
| 82 | + |
| 83 | +```{r} |
| 84 | +left_join(tab1, tab2) |
| 85 | +``` |
| 86 | + |
| 87 | +Note that `NA`s are added to the three states not appearing in `tab2`. Also note that this function, as well as all the other joins, can receive the first argument through the pipe: |
| 88 | + |
| 89 | +```{r} |
| 90 | +tab1 %>% left_join(tab2) |
| 91 | +``` |
| 92 | + |
| 93 | + |
| 94 | +#### Right join |
| 95 | + |
| 96 | +If instead of a table like `tab1` we want one like `tab2` we can use `right_join`: |
| 97 | + |
| 98 | +```{r} |
| 99 | +tab1 %>% right_join(tab2) |
| 100 | +``` |
| 101 | + |
| 102 | +Notice that now the NAs are in the column coming from `tab1`. |
| 103 | + |
| 104 | +#### Inner join |
| 105 | + |
| 106 | +If we want to keep only the rows that have information in both tables we use inner join. You can think of this as an intersection: |
| 107 | + |
| 108 | +```{r} |
| 109 | +inner_join(tab1, tab2) |
| 110 | +``` |
| 111 | + |
| 112 | +#### Full join |
| 113 | + |
| 114 | +And if we want to keep all the rows, and fill the missing parts with NAs, we can use a full join. You can think of this as a union: |
| 115 | + |
| 116 | +```{r} |
| 117 | +full_join(tab1, tab2) |
| 118 | +``` |
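
To compare the four joins above side by side, here is a minimal sketch on two small tables. The tables `x` and `y`, the column `key`, and all values are hypothetical. Passing `by = "key"` explicitly avoids the message `dplyr` prints when it has to pick the matching column itself.

```{r}
library(dplyr)

# Hypothetical toy tables: keys "b" and "c" appear in both
x <- tibble(key = c("a", "b", "c"), x_val = 1:3)
y <- tibble(key = c("b", "c", "d"), y_val = 4:6)

nrow(left_join(x, y, by = "key"))   # every row of x
nrow(right_join(x, y, by = "key"))  # every row of y
nrow(inner_join(x, y, by = "key"))  # only keys present in both tables
nrow(full_join(x, y, by = "key"))   # every key from either table
```

With these inputs the row counts are 3, 3, 2, and 4, which matches the intersection and union interpretations given above.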
| 119 | + |
| 120 | +#### Semi join |
| 121 | + |
| 122 | +The `semi_join` function lets us keep the part of the first table for which we have information in the second. It does not add the columns of the second: |
| 123 | + |
| 124 | +```{r} |
| 125 | +semi_join(tab1, tab2) |
| 126 | +``` |
| 127 | + |
| 128 | + |
| 129 | +#### Anti join |
| 130 | + |
| 131 | +The function `anti_join` is the opposite of `semi_join`. It keeps the elements of the first table for which there is no information in the second: |
| 132 | + |
| 133 | +```{r} |
| 134 | +anti_join(tab1, tab2) |
| 135 | +``` |
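
One way to see how `semi_join` and `anti_join` complement each other is to check that, between them, they split the first table into two pieces. This is a sketch on made-up tables (`x`, `y`, and `key` are hypothetical names):

```{r}
library(dplyr)

# Hypothetical toy tables: keys "b" and "c" appear in both
x <- tibble(key = c("a", "b", "c"), x_val = 1:3)
y <- tibble(key = c("b", "c", "d"), y_val = 4:6)

kept    <- semi_join(x, y, by = "key")  # rows of x with a match in y
dropped <- anti_join(x, y, by = "key")  # rows of x without a match in y

# Stacking the two pieces gives back the rows of x (in some order),
# and neither piece has gained columns from y
setequal(bind_rows(kept, dropped), x)
names(kept)
```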
| 136 | + |
| 137 | +### Binding |
| 138 | + |
| 139 | +Although we have yet to use it in this course, another common way in which datasets are combined is by _binding_ them. Unlike the join functions, the binding functions do not try to match by a variable but rather just combine datasets. If the dimensions of the datasets don't match as required, for example when binding columns with different numbers of rows, one obtains an error. |
| 140 | + |
| 141 | +#### Columns |
| 142 | + |
| 143 | +The `dplyr` function `bind_cols` binds two objects by making them columns in a tibble. For example, if we quickly want to make a data frame consisting of numbers we can use: |
| 144 | + |
| 145 | +```{r} |
| 146 | +bind_cols(a = 1:3, b = 4:6) |
| 147 | +``` |
| 148 | + |
| 149 | +This function requires that we assign names to the columns. Here we chose `a` and `b`. |
| 150 | + |
| 151 | +Note that base R has a function, `cbind`, that does the same job but returns matrices or data frames rather than tibbles. |
| 152 | + |
| 153 | +`bind_cols` can also bind data frames. For example, here we break up the `tab` data frame and then bind them back together: |
| 154 | + |
| 155 | +```{r} |
| 156 | +tab1 <- tab[, 1:3] |
| 157 | +tab2 <- tab[, 4:6] |
| 158 | +tab3 <- tab[, 7:9] |
| 159 | +new_tab <- bind_cols(tab1, tab2, tab3) |
| 160 | +head(new_tab) |
| 161 | +``` |
| 162 | + |
| 163 | + |
| 164 | +#### Rows |
| 165 | + |
| 166 | +The `bind_rows` function is similar but binds rows instead of columns. |
| 167 | + |
| 168 | +```{r} |
| 169 | +tab1 <- tab[1:2,] |
| 170 | +tab2 <- tab[3:4,] |
| 171 | +bind_rows(tab1, tab2) |
| 172 | +``` |
| 173 | + |
| 174 | +`bind_rows` is modeled on the base R function `rbind`. |
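
The two binding functions react differently to inputs whose shapes do not line up. The sketch below uses two made-up tibbles, `a` and `b`, and reflects the behavior of recent `dplyr` versions as far as we know. Treat it as an illustration rather than a specification.

```{r}
library(dplyr)

a <- tibble(x = 1:3)
b <- tibble(y = 1:2)   # one row fewer than a

# bind_cols() refuses inputs with different numbers of rows;
# tryCatch() turns the error into a value so the chunk keeps running
result <- tryCatch(bind_cols(a, b), error = function(e) "error")
result

# bind_rows() matches columns by name and pads the gaps with NA
bind_rows(a, b)
```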
| 175 | + |
| 176 | +### Set Operators |
| 177 | + |
| 178 | +Another set of commands useful for combining are the set operators. When applied to vectors, these behave as their names suggest. However, if the `tidyverse`, or more specifically, `dplyr` is loaded, these functions can be used on data frames as opposed to just on vectors. |
| 179 | + |
| 180 | +#### Intersect |
| 181 | + |
| 182 | +You can take intersections of vectors: |
| 183 | + |
| 184 | +```{r} |
| 185 | +intersect(1:10, 6:15) |
| 186 | +``` |
| 187 | + |
| 188 | +```{r} |
| 189 | +intersect(c("a","b","c"), c("b","c","d")) |
| 190 | +``` |
| 191 | + |
| 192 | +But with `dplyr` loaded we can also do this for tables having the same column names: |
| 193 | + |
| 194 | +```{r} |
| 195 | +tab1 <- tab[1:5,] |
| 196 | +tab2 <- tab[3:7,] |
| 197 | +intersect(tab1, tab2) |
| 198 | +``` |
| 199 | + |
| 200 | + |
| 201 | +#### Union |
| 202 | + |
| 203 | +Similarly _union_ takes the union: |
| 204 | + |
| 205 | +```{r} |
| 206 | +union(1:10, 6:15) |
| 207 | +``` |
| 208 | + |
| 209 | +```{r} |
| 210 | +union(c("a","b","c"), c("b","c","d")) |
| 211 | +``` |
| 212 | + |
| 213 | +But with `dplyr` loaded we can also do this for tables having the same column names: |
| 214 | + |
| 215 | +```{r} |
| 216 | +tab1 <- tab[1:5,] |
| 217 | +tab2 <- tab[3:7,] |
| 218 | +union(tab1, tab2) |
| 219 | +``` |
| 220 | + |
| 221 | + |
| 222 | +#### Set difference |
| 223 | + |
| 224 | +The set difference between a first and second argument can be obtained with `setdiff`. Unlike `intersect` and `union`, this function is not symmetric: |
| 225 | + |
| 226 | + |
| 227 | +```{r} |
| 228 | +setdiff(1:10, 6:15) |
| 229 | +setdiff(6:15, 1:10) |
| 230 | +``` |
| 231 | + |
| 232 | +As with the others above, we can apply it to data frames: |
| 233 | +```{r} |
| 234 | +tab1 <- tab[1:5,] |
| 235 | +tab2 <- tab[3:7,] |
| 236 | +setdiff(tab1, tab2) |
| 237 | +``` |
| 238 | + |
| 239 | +#### `setequal` |
| 240 | + |
| 241 | +Finally, the function `setequal` tells us if two sets are the same, regardless of order. So |
| 242 | + |
| 243 | +```{r} |
| 244 | +setequal(1:5, 1:6) |
| 245 | +``` |
| 246 | + |
| 247 | +but |
| 248 | + |
| 249 | +```{r} |
| 250 | +setequal(1:5, 5:1) |
| 251 | +``` |
| 252 | + |
| 253 | +It also works on data frames. Here the two tables contain different rows, so they are not equal regardless of row order: |
| 254 | + |
| 255 | +```{r} |
| 256 | +setequal(tab1, tab2) |
| 257 | +``` |
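
To see the "regardless of order" part at work on data frames, the following sketch compares a made-up tibble `d1` with a copy `d2` whose rows are reversed (both names are hypothetical):

```{r}
library(dplyr)

d1 <- tibble(id = 1:3, grp = c("a", "b", "c"))
d2 <- d1[3:1, ]   # same rows, reversed order

identical(d1, d2)  # compares the tables exactly, so row order matters
setequal(d1, d2)   # compares the tables as sets of rows, so order is ignored
```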
| 258 | + |