-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path5. Fundamentals of Data Visualization with ggplot2.Rmd
331 lines (247 loc) · 12.2 KB
/
5. Fundamentals of Data Visualization with ggplot2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
---
title: "Fundamentals of Data Visualization with `ggplot2`"
author: Daphne Janelyn L. Go^[De La Salle University, Manila, Philippines, [email protected]]
output: html_notebook
---
The ggplot2 package is based on the principle that all plots consist of a few basic components: data, a coordinate system and a visual representation of the data. In ggplot2, you built plots incrementally, starting with the data and coordinates you want to use and then specifying the graphical features: lines, points, bars, color, etc.
We will be using three datasets in this tutorial. **First is our main dataset on phage host interaction, second is the diamond dataset given by the ggplot package, lastly is the carbon dioxide dataset given by R.**
For a list of preconfigured datasets, simply type **data()**.
```{r}
data()
```
```{r}
dataset <- read.delim("phages.tsv")
```
## Base R Plotting
Before we begin, load the ggplot2 package for R. ggplot2 is a graphics package that provides powerful plotting capabilities beyond R's base plotting functions. We won't actually get into ggplot2 itself quite yet. This will be a basic introduction to plotting in R.
```{r}
library(ggplot2)
```
```{r}
dataset
```
## Histograms
A histogram is a univariate plot (a plot that displays one variable) that groups a numeric variable into bins and displays the number of observations that fall within each bin. A histogram is a useful tool for getting a sense of the distribution of a numeric variable.
```{r}
hist(dataset$Positive.Strand....)
```
Note: When you create a plot in a local RStudio environment, it will appear in the bottom right pane under the "plots" tab. Use the left and right arrows to cycle through the plots you've created.
## Box Plots
Boxplots are another type of univariate plot for summarizing distributions of numeric data graphically.
```{r}
boxplot(dataset$molGC...)
```
The central box of the boxplot represents the middle 50% of the observations, the central bar is the median and the bars at the end of the dotted lines encapsulate the great majority of the observations. Circles that lie beyond the end of the whiskers are data points that may be outliers.
One of the most useful features of the boxplot() function is the ability to make side-by-side boxplots. A side-by-side boxplot takes a numeric variable and splits it on based on some categorical variable, drawing a different boxplot for each level of the categorical variable.
```{r}
boxplot(dataset$molGC... ~ dataset$Molecule) # Plot GC content split based on molecular type
```
## Density Plots
A density plot shows the distribution of a numeric variable with a continuous curve. It is similar to a histogram but without discrete bins, a density plot gives a better picture of the underlying shape of a distribution.
```{r}
plot(density(dataset$molGC...))
```
## Bar Plots
Barplots are graphs that visually display counts of categorical variables.
```{r}
dataset$Jumbophage <- ifelse(dataset$Jumbophage, "Jumbophage", "Not Jumbophage")
```
```{r}
dataset$Jumbophage
```
```{r}
barplot(table(dataset$Molecule))
```
```{r}
barplot(table(dataset$Jumbophage, dataset$Molecule),
legend = levels(dataset$Jumbophage)
)
```
A grouped barplot is an alternative to a stacked barplot that gives each stacked section its own bar. To make a grouped barplot, create a stacked barplot and add the extra argument beside = TRUE.
```{r}
barplot(table(dataset$Jumbophage, dataset$Molecule),
legend = levels(dataset$Jumbophage),
beside = TRUE
) # Group instead of stacking
```
## Scatterplots
Scatterplots are bivariate (two variable) plots that take two numeric variables and plot data points on the x/y plane.
```{r}
plot(dataset$molGC..., dataset$Positive.Strand....)
```
```{r}
plot(dataset$molGC...,
dataset$Positive.Strand....,
col = rgb(red = 0, green = 0, blue = 0, alpha = 0.1)
)
```
Illustrating how we can make our different plots look more presentable.
```{r}
barplot(table(dataset$Jumbophage, dataset$Molecule),
legend = levels(dataset$Jumbophage),
beside = TRUE,
xlab = "Molecular Type",
ylab = "Jumbophage",
main = "Molecular Type, Grouped by Jumbophage",
col = c(
"#FFFFFF", "#F5FCC2", "#E0ED87", "#CCDE57", # Add color*
"#B3C732", "#94A813", "#718200"
)
)
```
## Using ggplot()
The ggplot() function creates plots incrementally in layers. Every ggplot starts with the same basic syntax. Every ggplot starts with a call to the ggplot() function along with an argument specifying the data set to be used and aesthetic mappings from variables in the data set to visual properties of the plot, such as x and y position.
```{r}
# install.packages("tidyverse")
library(tidyverse)
```
We are not going to spend much time learning about qplot() since learning the ggplot() syntax is at the heart of the package. Let's look at one qplot for illustrative purposes and then move on.
```{r}
library(ggplot2)
qplot(
x = carat, # x variable
y = price, # y variable
data = diamonds, # Data set
geom = "point", # Plot type
color = clarity, # Color points by variable clarity
xlab = "Carat Weight", # x label
ylab = "Price", # y label
main = "Diamond Carat vs. Price"
)
# Title
```
```{r}
ggplot(
dataset,
aes(Accession, molGC....)
) +
geom_point()
```
In the code above, we specify the data we want to work with and then assign the variables of interest, Accession and GC Content, to the x and y values of the plot. "aes()" is an aesthetics wrapper used in ggplot to map variables to visual properties. When you want a visual property to change based on the value of a variable, that specification belongs inside an aes() wrapper. If you are setting a fixed value that doesn't change based a variable, it belongs outside of aes().
_Note: Add a new element to a plot by putting a "+" after the preceding element._
The layers you add determine the type of plot you create. In this case, we used geom_point() which simply draws the data as points at the specified x and y coordinates, creating a scatterplot. ggplot2 has a wide range of geoms to create different types of plots. Here is a list of geoms for all the plot types we covered in the last lesson, plus a few more
```{r}
geom_histogram() # histogram
geom_density() # density plot
geom_boxplot() # boxplot
geom_violin() # violin plot (combination of boxplot and density plot)
geom_bar() # bar graph
geom_point() # scatterplot
geom_jitter() # scatterplot with points randomly perturbed to reduce overlap
geom_line() # line graph
geom_errorbar() # Add error bar
geom_smooth() # Add a best-fit line
geom_abline() # Add a line with specified slope and intercept
```
Notice the scatterplot we made above didn't have a nice coloring. We can attribute the colors of the points to its Molecular type.
```{r}
ggplot(dataset, aes(Accession, molGC...., colour = Molecule)) +
geom_point(alpha = 0.5)
```
```{r}
ggplot(dataset, aes(Accession, molGC....)) +
geom_point(aes(color = Molecule), alpha = 0.2)
```
We pass alpha in as an argument outside of the aes() mapping because we are **setting alpha to a fixed value instead of mapping it to a variable**.
By setting alpha to 0.1, each data point has 90% transparency. At such high transparency, single data points are hard to see, but it lets us focus on high density areas.
```{r}
dataset %>%
ggplot(aes(Genome.Length..bp., molGC....)) +
geom_point(aes(colour = Molecule), alpha = 0.5) +
labs(x = "Genome Length", y = "GC Content")
```
**Geosmooth**
Illustrating the use of geosmooth using a built in dataset in R.
In ggplot2, the geom_smooth() function is used to add a smooth line or curve to a plot. It is commonly used to visualize the trend or relationship between variables.
```{r}
sample_DataSet <- CO2
sample_DataSet
```
```{r}
ggplot(sample_DataSet, aes(conc, uptake, colour = Treatment)) +
geom_point(size = 3, alpha = 0.5) +
geom_smooth(method = lm, se = F)
```
If you want to classify it further into types based on your data, use **facet_wrap()**.
```{r}
ggplot(sample_DataSet, aes(conc, uptake, colour = Treatment)) +
geom_point(size = 3, alpha = 0.5) +
geom_smooth(method = lm, se = F) +
facet_wrap(~Type) +
labs(title = "Concentration of CO2") +
theme_bw()
```
## More Plot Examples
Now that we know the basics of creating plots with ggplot(), let's remake some of the plots we created last time and see how they look in ggplot2, starting with a histogram.
### Histograms
```{r}
ggplot(data = dataset, aes(x = molGC....)) +
geom_histogram(
fill = "skyblue",
col = "black"
) +
labs(x = "GC Content")
```
### Box Plots
```{r}
ggplot(dataset, aes(Molecule, Genome.Length..bp.)) +
geom_jitter(
alpha = 0.05, # Add jittered data points
color = "yellow"
) +
geom_boxplot() +
labs(
title = "Genome Length of Different Molecular Types",
y = "Genome Length"
)
```
### Density Plot
```{r}
ggplot(data = dataset, aes(x = molGC....)) +
geom_density(
position = "stack", # Create a stacked density chart
aes(fill = Molecule), # Fill based on cut
alpha = 0.5
) # Set transparency
```
## Multidimensional Plotting and Faceting
One of the most powerful aspects of plots is the ability to visually illustrate relationships between 3 or more variables. When we create a plot, each different dimension (variable) needs to map to a different perceptual feature (aesthetic) such as x position, y position, symbol, size or color. Making use of several of these aesthetics at once lets us make plots involving many dimensions. We've already seen some examples of multidimensional plots, such as the first scatterplot in this lesson that displayed carat weight and price colored by clarity.
Faceting is another way to add an extra dimension to a plot. Faceting breaks a plot up based on a factor variable and draws a different plot for each level of the factor. You can create a faceted plot in ggplot2 by adding a facet_wrap() layer.
```{r}
ggplot(data = diamonds, aes(x = carat, y = price)) + # Initialize plot
geom_point(aes(color = color), # Color based on diamond color
alpha = 0.5
) +
facet_wrap(~clarity) + # Facet on clarity
geom_smooth() + # Add an estimated fit line*
theme(legend.position = c(0.85, 0.16)) # Set legend position
```
## Scales
Scales are parameters in ggplot2 that determine how a plot maps values to visual properties (aesthetics.). If you don't specify a scale for an aesthetic the plot will use a default scale. For instance, the plots we split on color all used a default color scale. You can specify custom scales by adding scale elements to your plot. Scale elements have the following structure:
scale_aesthetic_scaletype()
We already saw an example of a scale when made the grouped barplot above. In that case we wanted to manually set the fill color scale for the bars, so the scale we used was:
scale_fill_manual()
Let's make a new scatterplot with several aesthetic properties and alter some of the scales.
```{r}
ggplot(data = diamonds, aes(x = carat, y = price)) + # Initialize plot
geom_point(aes(
size = carat, # Size points based on carat
color = color, # Color based on diamond color
alpha = clarity
)) + # Set transparency based on clarity
scale_color_manual(values = c(
"#FFFFFF", "#F5FCC2", # Use manual color values
"#E0ED87", "#CCDE57",
"#B3C732", "#94A813",
"#718200"
)) +
scale_alpha_manual(values = c(
0.1, 0.15, 0.2, # Use manual alpha values
0.3, 0.4, 0.6,
0.8, 1
)) +
scale_size_identity() + # Set size values to the actual values of carat*
xlim(0, 2.5) + # Limit x-axis
theme(panel.background = element_rect(fill = "#7FB2B8")) + # Change background color
theme(legend.key = element_rect(fill = "#7FB2B8")) # Change legend background color
```