---
title: "Manipulating Data with `dplyr`"
author: Mark Edward M. Gonzales^[De La Salle University, Manila, Philippines, [email protected]]
output: html_notebook
---
In this notebook, we will learn how to perform data manipulation — and create data processing pipelines — using a powerful package called `dplyr`.
`dplyr` is part of [`tidyverse`](https://www.tidyverse.org/), a collection of R packages for data science. `tidyverse` also includes `ggplot2`, one of the most widely used data visualization packages.
**Bonus:** If you just need a quick refresher on `dplyr`, you can refer to this cheat sheet: https://github.com/rstudio/cheatsheets/raw/main/data-transformation.pdf.
## Preliminaries
`tidyverse` is not built into R, so we first have to install it:
```
install.packages("tidyverse")
```
Afterwards, we have to tell R that we want to use `tidyverse` (in technical terms, we are loading the package):
```{r}
library("tidyverse")
```
## Goodbye Data Frames! Hello Tibbles
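`tidyverse` provides a modern reimagining of R's built-in data frame, called a **tibble**. Compared with a data frame, a tibble prints more compactly, never partially matches column names, and stays a tibble when subset with `[`.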
💡 If you are interested in diving into the differences between R's built-in data frame and `tidyverse`'s tibble, you may refer to this article: https://jtr13.github.io/cc21fall1/tibble-vs.-dataframe.html
Since this notebook aims to familiarize ourselves with `dplyr` (and `tidyverse`), we will make a shift from data frames to tibbles.
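As a minimal sketch of two of these differences, the chunk below builds the same toy table both ways. The values are hypothetical and are not taken from `phages.tsv`.

```{r}
library(tibble)

# Hypothetical toy data (not from phages.tsv)
df <- data.frame(Accession = c("A1", "A2"), Host = c("Escherichia", "Vibrio"))
tb <- tibble(Accession = c("A1", "A2"), Host = c("Escherichia", "Vibrio"))

# Subsetting one column with [ , ] drops a data frame down to a vector,
# whereas a tibble stays a tibble
class(df[, "Host"])
class(tb[, "Host"])

# $ partially matches column names on a data frame;
# a tibble returns NULL (with a warning) instead
df$Acc
suppressWarnings(tb$Acc)
```

The stricter behavior of tibbles tends to surface typos in column names early instead of silently returning a different column.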
We start by loading our dataset (note that `read_delim()`, from the `readr` package in `tidyverse`, returns a tibble, whereas base R's `read.delim()` returns a data frame):
```{r}
data <- read_delim("phages.tsv")
```
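To see the difference between the two readers without needing `phages.tsv`, here is a small sketch that writes a hypothetical two-row TSV to a temporary file and reads it both ways. The file contents are made up for illustration, and the delimiter is passed explicitly rather than guessed.

```{r}
library(readr)

# Write a tiny hypothetical TSV to a temporary file
path <- tempfile(fileext = ".tsv")
writeLines(c("Accession\tHost", "A1\tEscherichia", "A2\tVibrio"), path)

tb <- read_delim(path, delim = "\t")  # readr: returns a tibble
df <- read.delim(path)                # base R: returns a plain data frame

class(tb)
class(df)
```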
We check the columns of our dataset:
```{r}
str(data)
```
We view our dataset (this opens a new tab in RStudio):
```{r}
View(data)
```
## Selecting Columns & Filtering Rows
To select columns, we use `select()`. The first argument is the dataset, and the succeeding arguments are the columns to be included.
_Use case: Suppose we want to get the accession, family, order, and class of each phage in our dataset._
```{r}
data_subset <- select(data, Accession, Family, Order, Class)
data_subset
```
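Besides listing columns one by one, `select()` also accepts ranges, exclusions, and helper functions. The sketch below uses a hypothetical toy tibble that borrows a few column names from the dataset; the values are placeholders.

```{r}
library(dplyr)

# Hypothetical toy tibble; values are placeholders, not real records
toy <- tibble(
  Accession = c("A1", "A2"),
  Family = c("FamilyA", "Unclassified"),
  Order = c("OrderA", "Unclassified"),
  Class = c("ClassA", "ClassA"),
  Host = c("Escherichia", "Vibrio")
)

select(toy, Family:Class)      # a range of adjacent columns
select(toy, -Host)             # every column except Host
select(toy, starts_with("A"))  # columns whose names start with "A"
```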
To filter rows, we use `filter()`.
_Use case: Suppose we want to keep only the entries whose family, order, and class are all classified (that is, remove an entry if any of the three is unclassified)._
```{r}
data_subset <- filter(data_subset, Family != "Unclassified" & Order != "Unclassified" & Class != "Unclassified")
data_subset
```
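A short sketch of how conditions combine in `filter()`, using a hypothetical toy tibble with placeholder values: conditions separated by commas behave like `&`, while `|` keeps a row if at least one condition holds.

```{r}
library(dplyr)

# Hypothetical toy tibble; values are placeholders
toy <- tibble(
  Accession = c("A1", "A2", "A3", "A4"),
  Family = c("FamilyA", "Unclassified", "FamilyB", "Unclassified"),
  Order = c("OrderA", "OrderB", "Unclassified", "Unclassified")
)

# Comma-separated conditions are combined with AND, the same as &
with_commas <- filter(toy, Family != "Unclassified", Order != "Unclassified")
with_and <- filter(toy, Family != "Unclassified" & Order != "Unclassified")
identical(with_commas, with_and)

# | (OR) keeps rows where at least one of the two ranks is classified
either <- filter(toy, Family != "Unclassified" | Order != "Unclassified")
either
```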
## Pipe: The "Then" Operator
The processing that we just performed — selecting columns then filtering rows — is actually a simple pipeline already! But notice how our code can easily become cluttered if we are to add more intermediate steps.
Fortunately, `dplyr` provides a convenient operator called a pipe, `%>%` (it comes from the `magrittr` package and is re-exported by `dplyr`). In RStudio, the keyboard shortcut for typing it is Ctrl+Shift+M (Cmd+Shift+M on Mac). We can think of `%>%` as equivalent to the English "then."
To illustrate its usage, we rewrite our pipeline like so:
```{r}
data_subset_using_pipe <- data %>%
  select(Accession, Family, Order, Class) %>%
  filter(Family != "Unclassified" & Order != "Unclassified" & Class != "Unclassified")
data_subset_using_pipe
```
Observe how the syntax is mostly the same, with the exception of the first argument of `select()` and `filter()`. Since we already specified `data` at the start of the pipeline, we do not need to pass it anymore as an argument to the data manipulation functions.
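Put differently, `x %>% f(y)` is evaluated as `f(x, y)`. The following sketch, on a hypothetical toy tibble, checks that the nested and piped forms give the same result.

```{r}
library(dplyr)

# Hypothetical toy tibble; values are placeholders
toy <- tibble(
  Accession = c("A1", "A2"),
  Host = c("Escherichia", "Unspecified")
)

# Nested calls: read from the inside out
nested <- filter(select(toy, Accession, Host), Host != "Unspecified")

# Piped calls: read from top to bottom
piped <- toy %>%
  select(Accession, Host) %>%
  filter(Host != "Unspecified")

identical(nested, piped)
```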
## Adding Columns (Mutate)
To add columns, we use `mutate()`.
_Use case: We have a column called `Genome Length (bp)` but we want a new column where the genome length is expressed in terms of kbp._
```{r}
# We enclose column names with spaces in backticks ``
data_with_new_column <- data %>%
  mutate(`Genome Length (kbp)` = `Genome Length (bp)` / 1000)
data_with_new_column
```
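`mutate()` can also create several columns in one call, and a later column may refer to one created earlier in the same call. In this sketch, the toy genome lengths and the `is_large` column (with its 100 kbp cutoff) are hypothetical, chosen only for illustration.

```{r}
library(dplyr)

# Hypothetical toy tibble; lengths are made-up values
toy <- tibble(
  Accession = c("A1", "A2"),
  `Genome Length (bp)` = c(40000, 168500)
)

out <- toy %>%
  mutate(
    `Genome Length (kbp)` = `Genome Length (bp)` / 1000,
    # A column created earlier in the same mutate() call can be reused;
    # the 100 kbp threshold is an arbitrary illustrative cutoff
    is_large = `Genome Length (kbp)` > 100
  )
out
```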
## Getting a Column (Pull)
Suppose we want to get all the accessions of the phages in our dataset.
As we learned earlier, we can use `select()`.
```{r}
data %>% select(Accession)
```
This works, but the result is a one-column tibble. What if we want a vector? We can use `pull()` instead.
```{r}
data %>% pull(Accession)
```
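The difference in return type can be checked directly. A minimal sketch on a hypothetical toy tibble:

```{r}
library(dplyr)

# Hypothetical toy tibble; values are placeholders
toy <- tibble(
  Accession = c("A1", "A2"),
  Host = c("Escherichia", "Vibrio")
)

one_column <- toy %>% select(Accession)  # still a tibble
as_vector <- toy %>% pull(Accession)     # a plain character vector

is.data.frame(one_column)
is.character(as_vector)

# Functions that expect a vector work directly on the output of pull()
length(as_vector)
```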
## Split-Apply-Combine Data Analysis
Let us now try to create more complex pipelines for exploratory data analysis — and, along the way, introduce some functions for aggregating entries and statistics!
Suppose we want to get the number of phages infecting each host genus. This is a good use case for `group_by()` and `n()` (for counting the number of entries in a group):
```{r}
data %>%
  select(Accession, Host) %>%
  filter(Host != "Unspecified") %>%
  group_by(Host) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) # desc() means arrange in descending order
```
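Because counting entries per group is so common, `dplyr` also offers `count()` as a shorthand for `group_by()` followed by `summarize(n())`. The sketch below compares the two forms on a hypothetical toy tibble.

```{r}
library(dplyr)

# Hypothetical toy tibble; values are placeholders
toy <- tibble(
  Accession = c("A1", "A2", "A3", "A4"),
  Host = c("Escherichia", "Escherichia", "Vibrio", "Unspecified")
)

long_form <- toy %>%
  filter(Host != "Unspecified") %>%
  group_by(Host) %>%
  summarize(count = n()) %>%
  arrange(desc(count))

# count() groups, counts, and (with sort = TRUE) orders by descending count
short_form <- toy %>%
  filter(Host != "Unspecified") %>%
  count(Host, name = "count", sort = TRUE)

long_form
short_form
```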
Suppose we want to get the mean and the median guanine-cytosine (GC) content of the phages when grouped by host genus:
```{r}
# We enclose column names with spaces in backticks ``
data %>%
  select(Accession, Host, `molGC (%)`) %>%
  filter(Host != "Unspecified") %>%
  group_by(Host) %>%
  summarize(mean_gc = mean(`molGC (%)`),
            median_gc = median(`molGC (%)`)) %>%
  arrange(desc(mean_gc)) # desc() means arrange in descending order
```
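One caveat worth checking: `mean()` and `median()` return `NA` for any group that contains a missing value unless `na.rm = TRUE` is passed. Whether `phages.tsv` contains missing GC values is not stated here, so the sketch below uses a hypothetical toy tibble with one deliberately missing entry.

```{r}
library(dplyr)

# Hypothetical toy tibble with one deliberately missing GC value
toy <- tibble(
  Host = c("Escherichia", "Escherichia", "Vibrio"),
  `molGC (%)` = c(50.2, NA, 47.5)
)

out <- toy %>%
  group_by(Host) %>%
  summarize(
    mean_gc = mean(`molGC (%)`),                      # NA if the group has any NA
    mean_gc_no_na = mean(`molGC (%)`, na.rm = TRUE)   # missing values are skipped
  )
out
```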