Skip to content

Commit 0e4ec8f

Browse files
committed
added 10xRNA tutorial
1 parent 3e819e9 commit 0e4ec8f

7 files changed

+302
-8
lines changed
Loading
Loading
63.3 KB
Loading

docs/content/tutorials.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,7 @@ Tutorials for using sincei on the command line
1010
tutorials/sincei_tutorial_sortChIC.rst
1111
tutorials/sincei_tutorial_10x.rst
1212
tutorials/sincei_tutorial_10xATAC.rst
13+
tutorials/sincei_tutorial_10xRNA.rst
1314

1415
Tutorials for using sincei inside python
1516
-----------------------------------------

docs/content/tutorials/sincei_tutorial_10x.rst

Lines changed: 8 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -95,16 +95,17 @@ of the barcode counts.
9595
2. scATAC-seq analysis
9696
----------------------
9797

98-
Please follow :doc:`this tutorial <sincei_tutorial_10xATAC>` for further analysis of scATAC-seq
99-
samples from the above data.
98+
Please follow :doc:`this tutorial <sincei_tutorial_10xATAC>` for further analysis of scATAC-seq samples from the above data.
10099

101100
3. scRNA-seq analysis
102101
---------------------
103102

104-
Please follow :doc: `this tutorial <sincei_tutorial_10xATAC>` for further analysis of scRNA-seq
105-
samples from the above data.
103+
Please follow :doc: `this tutorial <sincei_tutorial_10xRNA>` for further analysis of scRNA-seq samples from the above data.
106104

107-
4. Joint analysis
108-
-----------------
105+
Notes
106+
------------
109107

110-
Tutorial in preparation.
108+
Currently, sincei doesn't provide a method for **doublet estimation and removal**, which is an important step in the analysis of droplet-based data.
109+
Instead, we use simpler filters of min and max number of detected features per cell, which, to some extent mitigates this issue. However, this could
110+
lead to some differences in results compared to the published data in used here. Despite this difference, the major published cell types can be separated
111+
with sincei for both ATAC and RNA fraction of the data, as shown in the 2 tutorials above.

docs/content/tutorials/sincei_tutorial_10xATAC.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -285,7 +285,7 @@ bigwigs with 1kb bins.
285285
--labels rep1_atac_rep1 rep2_atac_rep2 \
286286
-i sincei_output/atac/scClusterCells_UMAP.tsv \
287287
-o sincei_output/atac/sincei_cluster
288-
# creates 6 files with names "sincei_cluster_<X>.bw" where X is 0, 1... 9
288+
# creates 9 files with names "sincei_cluster_<X>.bw" where X is 0, 1... 9
289289
290290
We can now inspect these bigwigs on IGV. We can clearly see some regions
291291
with cell-type specific signal, such as the markers described in the
Lines changed: 292 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,292 @@
1+
Analysis of 10x genomics scRNA-seq data using sincei
2+
====================================================
3+
4+
The data used here is part of our 10x genomics multiome tutorial,
5+
`described here <sincei_tutorial_10x.rst>`__
6+
7+
We will use the BAM file, barcodes and peaks.bed file obtained using the
8+
**cellrange-arc** workflow. Alternatively, the BAM file (tagged with
9+
cell barcode) and peaks.bed can be obtained from a custom mapping/peak
10+
calling workflow, and the list of filtered cell barcodes can be obtained
11+
using sincei (see the parent tutorial for explanation).
12+
13+
Define common bash variables:
14+
15+
.. code:: bash
16+
17+
# create dir
18+
mkdir sincei_output/rna
19+
20+
2. Quality control - I (read-level)
21+
-----------------------------------
22+
23+
In order to define high quality cells, we can use the read-level quality
24+
statistics from scFilterStats. There could be several indications of low
25+
quality cells in this data, such as:
26+
27+
- high PCR duplicates (marked by cellranger workflow, detected using
28+
``--samFlagExclude``)
29+
- high fraction of reads aligned to blacklisted regions (filtered using
30+
``--blacklist``)
31+
- high fraction of reads with poor mapping quality (filtered using
32+
``--minMappingQuality``)
33+
- very high/low GC content of the aligned reads, indicating reads
34+
mostly aligned to low-complexity regions (filtered using
35+
``--GCcontentFilter``)
36+
- high level of secondary/supplementary alignments (filtered using
37+
``--samFlagExclude/Include``)
38+
39+
.. code:: bash
40+
41+
42+
for r in 1 2
43+
do
44+
dir=cellranger_output_rep${r}/outs/
45+
bamfile=${dir}/gex_possorted_bam.bam
46+
barcodes=${dir}/filtered_feature_bc_matrix/barcodes.tsv.gz
47+
# from sincei or cellranger output
48+
zcat ${barcodes} > ${dir}/filtered_barcodes.txt scFilterStats -p 20
49+
–region chr1 \
50+
–GCcontentFilter ‘0.2,0.8’ \
51+
–minMappingQuality 10 \
52+
–samFlagExclude 256 –samFlagExclude 2048 \
53+
–barcodes ${dir}/filtered_barcodes.txt \
54+
--cellTag CB \
55+
--label rna_rep${r}
56+
-o sincei_output/rna/scFilterStats_output_rep${r}.txt
57+
-b ${bamfile}
58+
done
59+
60+
`scFilterStats` summarizes these outputs as a table, which can then be
61+
visualized using the `MultiQC tool <https://multiqc.info/docs/>`__, to select
62+
appropriate list of cells to include for counting.
63+
64+
.. code:: bash
65+
66+
multiqc sincei_output/rna/ # results in multiqc_report.html
67+
68+
3. Signal aggregation (counting reads)
69+
--------------------------------------
70+
71+
Below we use sincei to aggregate signal from single-cells. For the gene
72+
expression data, it makes sense to aggregate signal from genes only. For
73+
this we can use a GTF file, which defines gene annotations.
74+
75+
If needed, we can use the same parameters as in ``scFilterStats`` to
76+
count only high quality reads from our whitelist of barcodes.
77+
78+
.. code:: bash
79+
80+
## Download the GTF file
81+
curl -o sincei_output/hg38.gtf.gz \
82+
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz && gunzip sincei_output/hg38.gtf.gz
83+
84+
# count reads on the GTF file
85+
for r in 1 2 do dir=cellranger_output_rep${r}/outs/
86+
bamfile=${dir}/gex_possorted_bam.bam
87+
barcodes=${dir}/filtered_barcodes.txt scCountReads features -p 20
88+
–BED sincei_output/hg38.gtf –cellTag CB –minMappingQuality 10 \
89+
–samFlagExclude 256 –samFlagExclude 2048 -bc ${barcodes} --cellTag CB \
90+
-o sincei_output/rna/scCounts_rna_genes_rep${r}
91+
–label rna_rep${r}
92+
-b ${bamfile}
93+
done
94+
95+
# Number of bins found: 74538
96+
97+
98+
99+
4. Quality control - II (count-level)
100+
-------------------------------------
101+
102+
Even though we already performed read-level QC before, the counts distribution on our specified regions (bins/genes/peaks) could be different from the whole-genome stats.
103+
104+
**scCountQC**, with the `--outMetrics` option, outputs the count statistics at region and cell level (labelled as <prefix>.regions.tsv and <prefix>.cells.tsv).
105+
Just like `scFilterStats`, these outputs can then be visualized using the
106+
`MultiQC tool <https://multiqc.info/docs/>`__, to select appropriate metrics to
107+
filter out the unwanted cells/regions.
108+
109+
.. code:: bash
110+
111+
# list the metrics we can use to filter cells/regions
112+
for r in 1 2; do scCountQC -i sincei_output/rna/scCounts_rna_genes_rep${r}.loom --describe; done
113+
114+
# export the single-cell level metrices
115+
for r in 1 2; do scCountQC -i sincei_output/rna/scCounts_rna_genes_rep${r}.loom \
116+
-om sincei_output/rna/countqc_rna_genes_rep${r} & done
117+
118+
# visualize output using multiQC
119+
multiqc sincei_output/rna/ # see results in multiqc_report.html
120+
121+
In total >74500 genes have been detected in >13.7K cells here.
122+
123+
Below, we perform a basic filtering using **scCountQC**. We exclude the
124+
cells with <500 and >10000 detected genes (``--filterRegionArgs``).
125+
Also, we exclude the genes that are detected in too few cells (<100) or
126+
too many (approax >90%) of cells (``--filterCellArgs``).
127+
128+
.. code:: bash
129+
130+
for r in 1 2 do scCountQC -i
131+
sincei_output/rna/scCounts_rna_genes_rep\ :math:`{r}.loom \
132+
-o sincei_output/rna/scCounts_rna_genes_filtered_rep`\ {r}.loom
133+
–filterRegionArgs “n_cells_by_counts: 100, 6000”
134+
–filterCellArgs “n_genes_by_counts: 500, 15000” done
135+
136+
## rep 1
137+
#Applying filters
138+
#Remaining cells: 5314
139+
#Remaining features: 48219
140+
141+
## rep 2
142+
#Applying filters
143+
#Remaining cells: 4894
144+
#Remaining features: 48660
145+
146+
147+
5. Combine counts for the 2 replicates
148+
--------------------------------------
149+
150+
While it's useful to perform count QC seperately for each replicate, the counts can now be combined for downstream analysis.
151+
152+
We provide a tool `scCombineCounts`, which can concatenate counts for cells based on common features. Concatenating the filtered cells for the 2 replicates would result in a total of >12K cells.
153+
154+
.. code:: bash
155+
156+
scCombineCounts \
157+
-i sincei_output/rna/scCounts_rna_genes_filtered_rep1.loom \
158+
sincei_output/rna/scCounts_rna_genes_filtered_rep2.loom \
159+
-o sincei_output/rna/scCounts_rna_genes_filtered.merged.loom \
160+
--method multi-sample --labels rep1 rep2
161+
#Combined cells: 10208
162+
#Combined features: 48059
163+
164+
5. Dimentionality reduction and clustering
165+
------------------------------------------
166+
167+
Finally, we will apply glmPCA to this data, assuming the data follows a
168+
poisson distribution (which is nearly appropritate for count-based data
169+
such as scRNA-seq), we will reduce the dimentionality of the data to 20
170+
principle components (the default), followed by a graph-based (louvain)
171+
clustering of the output.
172+
173+
.. code:: bash
174+
175+
scClusterCells -i sincei_output/rna/scCounts_rna_genes_filtered.merged.loom \
176+
-m glmPCA -gf poisson --clusterResolution 1 \
177+
-op sincei_output/rna/scClusterCells_UMAP.png \
178+
-o sincei_output/rna/scCounts_rna_genes_clustered.loom
179+
180+
# Coherence Score: Coherence Score: -0.9893882070519782
181+
# also produces the tsv file "sincei_output/rna/scClusterCells_UMAP.tsv"
182+
183+
(optional) Confirmation of clustering using metadata
184+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
185+
186+
Below, we will load this data in R and compare it to the cell metadata
187+
provided with our files to see if our clustering separates celltypes in
188+
a biologically meaningful way.
189+
190+
We can color our UMAP output from ``scClusterCells`` with the cell-type
191+
information based on the metadata provided by the original authors.
192+
193+
.. code:: r
194+
195+
library(dplyr)
196+
library(magrittr)
197+
library(ggplot2)
198+
library(patchwork)
199+
200+
umap <- read.delim(“sincei_output/rna/scClusterCells_UMAP.tsv”) meta <-
201+
read.csv(“metadata_cd34_rna.csv”, row.names = 1)
202+
umap$celltype <- meta[gsub("rep1_|rep2_", "", umap`\ Cell_ID),
203+
“celltype”]
204+
205+
# keep only cells with published labels
206+
207+
umap %<>% filter(!is.na(celltype)) # remove clusters with low number of
208+
cells cl = umap %>% group_by(cluster) %>%
209+
summarise(Cell_ID = dplyr::n()) %>%
210+
filter(Cell_ID < 50) %>% .$cluster umap %<>%
211+
filter(!(cluster %in% cl))
212+
213+
# make plots
214+
df_center <- group_by(umap, cluster) %>%
215+
summarise(UMAP1 = mean(UMAP1), UMAP2 = mean(UMAP2))
216+
df_center2 <- group_by(umap, celltype) %>%
217+
summarise(UMAP1 = mean(UMAP1), UMAP2 = mean(UMAP2))
218+
219+
# colors for metadata (8 celltypes)
220+
221+
col_pallete <- RColorBrewer::brewer.pal(8, “Paired”)
222+
names(col_pallete) <- unique(umap$celltype) # grey is for NA
223+
224+
# colors for sincei UMAP (10 clusters)
225+
226+
colors_cluster <- RColorBrewer::brewer.pal(10, “Paired”)
227+
names(colors_cluster) <- sort(unique(umap$cluster))
228+
229+
p1 <- umap %>% ggplot(., aes(UMAP1, UMAP2, color=factor(cluster),
230+
label=cluster)) + geom_point() +
231+
geom_label(data = df_center, aes(UMAP1, UMAP2)) +
232+
scale_color_manual(values = colors_cluster) +
233+
theme_void(base_size = 12) + theme(legend.position = “none”) +
234+
235+
p2 <- umap %>% filter(!is.na(celltype)) %>% ggplot(., aes(UMAP1, UMAP2,
236+
color=factor(celltype), label=celltype)) + geom_point() +
237+
geom_label(data = df_center2, aes(UMAP1, UMAP2)) +
238+
scale_color_manual(values = col_pallete) + labs(color=“Cluster”) +
239+
theme_void(base_size = 12) + theme(legend.position = “none”) +
240+
ggtitle(“Published Cell Types”)
241+
242+
pl <- p1 + p2
243+
ggsave(plot=pl, “sincei_output/rna/UMAP_compared_withOrig.png”,
244+
dpi=300, width = 11, height = 6)
245+
246+
247+
.. image:: ./../images/UMAP_compared_withOrig_10xRNA.png
248+
:height: 800px
249+
:width: 1600 px
250+
:scale: 50 %
251+
252+
The figure above shows that we can easily replicate the expected cell-type
253+
results from the scRNA-seq data using **sincei**. However there are some
254+
interesting differences, especially, a separation of the CLP cluster into 2 clusters,
255+
where one of these cluster is similar to the annotated pDC.
256+
257+
This was done using basic pre-processing steps, therefore the better cell/region filtering and optimizing the analysis parameters.
258+
259+
6. Creating bigwigs and visualizing signal on IGV
260+
-------------------------------------------------
261+
262+
For further exploration of data, It's very useful to create in-silico bulk
263+
coverage files (bigwigs) that aggregate the signal across cells in our clusters.
264+
The tool **scBulkCoverage** takes sincei clustered `.tsv` file, along with the
265+
corresponding BAM files, and aggregate the signal to create these bigwigs.
266+
267+
The parameters here are same as other sincei tools that work on BAM files,
268+
except that we can ask for a normalized bulk signal (specified using
269+
`--normalizeUsing` option) . Below, we produce CPM-normalized bigwigs with 100 bp bins.
270+
271+
.. code:: bash
272+
273+
scBulkCoverage -p 20 --cellTag CB --region chr1 \
274+
--normalizeUsing CPM --binSize 100 \
275+
--minMappingQuality 10 --samFlagExclude 2048 \
276+
-b cellranger_output_rep1/outs/gex_possorted_bam.bam \
277+
cellranger_output_rep2/outs/gex_possorted_bam.bam \
278+
--labels rep1_rna_rep1 rep2_rna_rep2 \
279+
-i sincei_output/rna/scClusterCells_UMAP.tsv \
280+
-o sincei_output/rna/sincei_cluster
281+
# creates 6 files with names "sincei_cluster_<X>.bw" where X is 0, 1... 9
282+
283+
We can now inspect these bigwigs on IGV. Looking at the region around one of the markers described in the
284+
original manuscript, **TAL1**, we can see that the CLP (lyphoid) and pDCs lack it's expression, while the
285+
myloid cells and HSCs have the signal. The neighboring gene STIL, which is involved in cell-cycle regulation
286+
is not expressed in the highly proliferative HSC/HMP/MEP cell clusters. Overall this confirms that the signal
287+
coverage extracted from our clusters broadly reflects the biology of the underlying cell types.
288+
289+
.. image:: ./../images/igv_snapshot_10xRNA.png
290+
:height: 500px
291+
:width: 6000 px
292+
:scale: 50 %

0 commit comments

Comments
 (0)