datasciencelabs
diff --git a/‎wrangling/combining-tables.Rmd‎
Lines changed: 258 additions & 0 deletions b/‎wrangling/combining-tables.Rmd‎
Lines changed: 258 additions & 0 deletions
diff --git a/‎wrangling/combining-tables.html‎
Lines changed: 713 additions & 0 deletions b/‎wrangling/combining-tables.html‎
Lines changed: 713 additions & 0 deletions
@@ -0,0 +1,258 @@
+## Combining tables
+
+```{r, message=FALSE, warning=FALSE}
+library(tidyverse)
+library(ggrepel)
+library(dslabs)
+ds_theme_set()
+```
+
+The information we need for a given analysis may not be in just one table. For example, when forecasting elections we used the function `left_join` to combine the information from two tables. Here we use a simpler example to illustrate the general challenge of combining tables.
+
+Suppose we want to explore the relationship between population size for US states, which we have in this table
+
+```{r}
+data(murders)
+head(murders)
+```
+
+and electoral votes, which we have in this one:
+
+```{r}
+data(polls_us_election_2016)
+head(results_us_election_2016)
+```
+
+Notice that just joining these two tables together will not work since the order of the states is not quite the same:
+
+```{r}
+identical(results_us_election_2016$state, murders$state)
+```
+
+The _join_ functions, described below, are designed to handle this challenge.
+
+### Joins
+
+The `join` functions in the `dplyr` package, which are based on SQL joins, make sure that the tables are combined so that matching rows are together.
+The general idea is that one needs to identify one or more columns that will serve to match the two tables. Then a new table with the combined information is returned. Note what happens if we join the two tables above by `state` using `left_join`:
+
+```{r}
+tab <- left_join(murders, results_us_election_2016, by = "state")
+
+tab %>% select(state, population, electoral_votes) %>% 
+        head()
+```
+
+The data has been successfully joined and we can now, for example, make a plot to explore the relationship between population and electoral votes:
+
+```{r}
+tab %>% ggplot(aes(population/10^6, electoral_votes, label = abb)) +
+        geom_point() +
+        geom_text_repel() + 
+        scale_x_continuous(trans = "log2") +
+        scale_y_continuous(trans = "log2") +
+        geom_smooth(method = "lm", se = FALSE) +
+	      xlab("Population (in millions)") + 
+	      ylab("Electoral Votes")
+```
+
+We see the relationship is close to linear with about 2 electoral votes for every million persons, but with smaller states getting a higher ratio.
+
+
+In practice, it is not always the case that each row in one table has a matching row in the other. For this reason we have several different ways to join. To illustrate this challenge, take subsets of the matrices above:
+
+```{r}
+tab1 <- slice(murders, 1:6) %>% 
+        select(state, population)
+tab1
+```
+
+so that we no longer have the same states in the two tables:
+```{r}
+tab2 <- slice(results_us_election_2016, c(1:3, 5, 14, 44)) %>% 
+        select(state, electoral_votes)
+tab2
+```
+
+We will use these two tables as examples.
+
+#### Left join
+
+Suppose we want a table like `tab1` but adding electoral votes to whatever states we have available. For this we use left join with `tab1` as the first argument.
+
+```{r}
+left_join(tab1, tab2)
+```
+
+Note that `NA`s are added to the three states not appearing in `tab2`. Also note that this function, as well as all the other joins, can receive the first arguments through the pipe:
+
+```{r}
+tab1 %>% left_join(tab2)
+```
+
+
+#### Right join
+
+If instead of a table like `tab1` we want one like `tab2` we can use `right_join`:
+
+```{r}
+tab1 %>% right_join(tab2)
+```
+
+Notice that now the NAs are in the column coming from `tab1`.
+
+#### Inner join
+
+If we want to keep only the rows that have information in both tables we use inner join. You can think of this an intersection:
+
+```{r}
+inner_join(tab1, tab2)
+```
+
+#### Full join
+
+And if we want to keep all the rows, and fill the missing parts with NAs, we can use a full join. You can think of this as a union:
+
+```{r}
+full_join(tab1, tab2)
+```
+
+#### Semi join
+
+The `semi_join` let's us keep the part of the first table for which we have information in the second. It does not add the columns of the second:
+
+```{r}
+semi_join(tab1, tab2)
+```
+
+
+#### Anti join
+
+The function `anti_join` is the opposite of `semi_join`. It keeps the elements of the first table for which there is no information in the second:
+
+```{r}
+anti_join(tab1, tab2)
+```
+
+### Binding
+
+Although we have yet to use it in this course, another common way in which datasets are combined is by _binding_ them. Unlike the join function, the binding functions do no try to match by a variable but rather just combine datasets. If the datasets don't match by the appropriate dimensions one obtains an error.
+
+#### Columns
+
+The `dplyr` function _bind_cols_ binds two objects by making them columns in a tibble. For example, if we quickly want to make a data frame consisting of numbers we can use.
+
+```{r}
+bind_cols(a = 1:3, b = 4:6)
+```
+
+This function requires that we assign names to the columns. Here we chose `a` and `b`. 
+
+Note there is an R-base function `cbind` that performs the same function but creates objects other than tibbles. 
+
+`bind_cols` can also bind data frames. For example, here we break up the `tab` data frame and then bind them back together:
+
+```{r}
+tab1 <- tab[, 1:3]
+tab2 <- tab[, 4:6]
+tab3 <- tab[, 7:9]
+new_tab <- bind_cols(tab1, tab2, tab3)
+head(new_tab)
+```
+
+
+#### Rows
+
+The `bind_rows` is similar but binds rows instead of columns.
+
+```{r}
+tab1 <- tab[1:2,]
+tab2 <- tab[3:4,]
+bind_rows(tab1, tab2)
+```
+
+This is based on an R-base function `rbind`.
+
+### Set Operators
+
+Another set of commands useful for combing are the set operators. When applied to vectors, these behave as their names suggest. However, if the `tidyverse`, or more specifically, `dplyr` is loaded, these functions can be used on data frames as opposed to just on vectors.
+
+#### Intersect
+
+You can take intersections of vectors:
+
+```{r}
+intersect(1:10, 6:15)
+```
+
+```{r}
+intersect(c("a","b","c"), c("b","c","d"))
+```
+
+But with `dplyr` loaded we can also do this for tables having the same column names:
+
+```{r}
+tab1 <- tab[1:5,]
+tab2 <- tab[3:7,]
+intersect(tab1, tab2)
+```
+
+
+#### Union
+
+Similarly _union_ takes the union:
+
+```{r}
+union(1:10, 6:15)
+```
+
+```{r}
+union(c("a","b","c"), c("b","c","d"))
+```
+
+But with `dplyr` loaded we can also do this for tables having the same column names:
+
+```{r}
+tab1 <- tab[1:5,]
+tab2 <- tab[3:7,]
+union(tab1, tab2)
+```
+
+
+#### Set differrence
+
+The set difference between a first and second argument can be obtained with `setdiff`. Not unlike `instersect` and `union`, this function is not symmetric:
+
+
+```{r}
+setdiff(1:10, 6:15)
+setdiff(6:15, 1:10)
+```
+
+As with the others above, we can apply it to data frames:
+```{r}
+tab1 <- tab[1:5,]
+tab2 <- tab[3:7,]
+setdiff(tab1, tab2)
+```
+
+#### `setequal`
+
+Finally, the function `set_equal` tells us if two sets are the same, regardless of order. So
+
+```{r}
+setequal(1:5, 1:6)
+```
+
+but
+
+```{r}
+setequal(1:5, 5:1)
+```
+
+It also works when applied to data frames that are not equal regardless of order:
+
+```{r}
+setequal(tab1, tab2)
+```
+