-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathnotebook.Rmd
234 lines (172 loc) · 6.16 KB
/
notebook.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
---
title: "R Notebook"
output: html_notebook
---
# SimpleGeneExpression
Programs to quantify expression of transcripts from public datasets
### Project team
* [Jose V Die](https://github.com/jdieramon)
* [Moamen Elmassry](https://github.com/MoamenElmassry)
* [Kim LeBlanc](https://github.com/kleblanc5909)
* [Ben Busby](https://github.com/DCGenomics)
<br>
This tutorial is based on the analysis of the Magic-BLAST table output. Please, read the
[documentation](https://ncbi.github.io/magicblast/) to understand how Magic-BLAST
works.
The first three rows of the typical magic-BLAST output file contains some info
related to the software version, the command line used and the fields description:
![](figures/magic_header.png)
<br>
We may want to remove those rows with the command line :
```{}
tail -n +4 my_magic > my_magic
```
Here, the starting point is a dataset made of 24 auxin receptor factors (ARF) genes from the chickpea genome. Code to identify those accessions is available from the [GeneHummus](https://github.com/NCBI-Hackathons/GeneHummus) repository. We will study the frequency of the 24 ARF genes in root tissues of two genotypes under drought stress and control conditions across 4 SRA libraries:
* [SRR5927129](https://www.ncbi.nlm.nih.gov/sra/?term=SRR5927129). Susceptible_Control
* [SRR5927130](https://www.ncbi.nlm.nih.gov/sra/?term=SRR5927130). Susceptible_Drought
* [SRR5927133](https://www.ncbi.nlm.nih.gov/sra/?term=SRR5927133). Tolerant_Control
* [SRR5927134](https://www.ncbi.nlm.nih.gov/sra/?term=SRR5927134). Tolerant_Drought
First, load the functions needed for the analysis.
Load functions
```{r message=FALSE}
source("functions/expression.R")
```
First we load the `my_tabledemo.rda` object. This object contains the first 500,000
lines of the magic-BLAST table output. The original table has been cut for
illustrative purposes. After loading the object into the environment, the `my.magic`
dataset is available.
```{r}
load("data/my_tabledemo.rda")
```
Next, we want to name the columns.
```{r}
names(my.magic) <- c("query.acc", "reference.acc", "identity", "not.used", "not.used.1",
"not.used.2", "query.start", "query.end", "reference.start",
"reference.end", "not.used.3", "not.used.4", "score", "query.strand",
"reference.strand", "query.length", "BTOP", "num.placements",
"not.used.5", "compartment", "left.overhang", "right.overhang",
"mate.reference", "mate.ref..start", "composite.score")
```
Check the dataset dimensions.
```{r}
dim(my.magic)
```
Before analyzing the data, we have to apply a number of filters to tidy the dataset.
## Filter
### Filter 1 : alignement length score
First, check the query lengths (length reads ). We can see that the whole reads dataset is 125 bp length.
```{r qlengths}
table(my.magic$query.length)
```
Now, we check the alignement length scores:
```{r plot_alignlengths}
plot(density(my.magic$score))
```
**Filter1** involves filtering by high alignement length score.
Here we use at least 120 bp as an argument for the function `by_score`.
```{r byscore, cache = TRUE}
df_filtered <- by_score(my.magic, 120)
plot(density(df_filtered$score), col = "red")
```
### Filter 2 : identity
We want high scores for the alignements but also alignements with high identities.
This chunk plots the identity distribution for the filtered data :
```{r}
plot(density(df_filtered$identity))
```
**Filter2** involves filtering by high identities. Here we use at least 99% as
an argument for the function `by_identity`
```{r by_identity, cache = TRUE}
df_filtered <- by_identity(df_filtered, 99)
plot(density(df_filtered$identity), col = "red")
```
The function `tidy_query.acc`creates a new dataset with the number of sequence
counts from the filtered data.
```{r}
df_filtered <- tidy_query.acc(df_filtered)
# Create a tibble object
df_filtered <- as_data_frame(df_filtered)
```
### Plot of Counts
```{r cache = TRUE}
hist(df_filtered$Count)
```
### Plot Counts ~ SRA run
```{r}
boxplot(df_filtered$Count ~ df_filtered$Query,
outline = FALSE,
main = "Sequence Counts per run",
col = c("grey80", "grey50", "grey80", "grey50"))
```
### Optional filter:
Genes with presence at least in *n* SRA libraries
```{r genes4SRA}
df_filtered %>%
group_by(Reference) %>%
summarize(Total = n()) %>%
arrange(Total)
```
Filter by genes present in all 4 libraries.
```{r genes4SRAfilter}
# total number of occurrences for each gene
mydf <- df_filtered %>%
filter(Reference != "LOC101514738" )
mydf
```
Now, add a column with the SRA run size with the `gatherSize` function
```{r runsize, cache = TRUE}
mydf <- mutate(mydf, Runsize = gatherSize(mydf))
```
Look at head and tail of `mydf`
```{r}
mydf %>% head()
```
```{r}
mydf %>% tail()
```
### Normalized counts
```{r normCounts, cache = TRUE}
mydf_norm <- mydf %>%
mutate(norm_Counts = Count*mean(Runsize)/Runsize) %>%
select(Query, Reference, norm_Counts)
mydf_norm
```
## Analysis of normalized counts
**Boxplot of gene counts distribution**
```{r countsboxplot}
mydf_norm %>%
ggplot(aes(x = reorder(Reference, norm_Counts), y = norm_Counts)) +
geom_boxplot()
```
Dataset of normalized **counts per gene and SRA library**
```{r cache = TRUE}
library(tidyr)
bySRA <- spread(mydf_norm, Query, norm_Counts)
bySRA
```
**Density plots of normalized counts**
```{r density, cache = TRUE}
plot(density(bySRA$SRR5927129), col = 1, main = 'Counts', ylim = c(0, 0.00105))
lines(density(bySRA$SRR5927130), col = 2)
lines(density(bySRA$SRR5927133), col = 3)
lines(density(bySRA$SRR5927134), col = 4)
```
**Scatter-plot**, comparing counts of two samples to each other
### Example1
```{r scatterControl, cache = TRUE}
ggplot(bySRA, aes(x = SRR5927129, y = SRR5927133)) +
geom_point() +
ggtitle("Control roots") +
xlab("Tolerant plants") +
ylab("Susceptible plants") +
geom_smooth(method = "lm")
```
### Example2
```{r scatterTolerant, cache = TRUE}
ggplot(bySRA, aes(x = SRR5927134, y = SRR5927133)) +
geom_point() +
ggtitle("Roots", "Tolerant plants") +
xlab("Control") +
ylab("Drought") +
geom_smooth(method = "lm")
```