|
| 1 | +Analysis of 10x genomics scRNA-seq data using sincei |
| 2 | +==================================================== |
| 3 | + |
| 4 | +The data used here is part of our 10x genomics multiome tutorial, |
| 5 | +`described here <sincei_tutorial_10x.rst>`__ |
| 6 | + |
| 7 | +We will use the BAM file, barcodes and peaks.bed file obtained using the |
| 8 | +**cellrange-arc** workflow. Alternatively, the BAM file (tagged with |
| 9 | +cell barcode) and peaks.bed can be obtained from a custom mapping/peak |
| 10 | +calling workflow, and the list of filtered cell barcodes can be obtained |
| 11 | +using sincei (see the parent tutorial for explanation). |
| 12 | + |
| 13 | +Define common bash variables: |
| 14 | + |
| 15 | +.. code:: bash |
| 16 | +
|
| 17 | + # create dir |
| 18 | + mkdir sincei_output/rna |
| 19 | +
|
| 20 | +2. Quality control - I (read-level) |
| 21 | +----------------------------------- |
| 22 | + |
| 23 | +In order to define high quality cells, we can use the read-level quality |
| 24 | +statistics from scFilterStats. There could be several indications of low |
| 25 | +quality cells in this data, such as: |
| 26 | + |
| 27 | +- high PCR duplicates (marked by cellranger workflow, detected using |
| 28 | + ``--samFlagExclude``) |
| 29 | +- high fraction of reads aligned to blacklisted regions (filtered using |
| 30 | + ``--blacklist``) |
| 31 | +- high fraction of reads with poor mapping quality (filtered using |
| 32 | + ``--minMappingQuality``) |
| 33 | +- very high/low GC content of the aligned reads, indicating reads |
| 34 | + mostly aligned to low-complexity regions (filtered using |
| 35 | + ``--GCcontentFilter``) |
| 36 | +- high level of secondary/supplementary alignments (filtered using |
| 37 | + ``--samFlagExclude/Include``) |
| 38 | + |
| 39 | +.. code:: bash |
| 40 | +
|
| 41 | +
|
| 42 | + for r in 1 2 |
| 43 | + do |
| 44 | + dir=cellranger_output_rep${r}/outs/ |
| 45 | + bamfile=${dir}/gex_possorted_bam.bam |
| 46 | + barcodes=${dir}/filtered_feature_bc_matrix/barcodes.tsv.gz |
| 47 | + # from sincei or cellranger output |
| 48 | + zcat ${barcodes} > ${dir}/filtered_barcodes.txt scFilterStats -p 20 |
| 49 | + –region chr1 \ |
| 50 | + –GCcontentFilter ‘0.2,0.8’ \ |
| 51 | + –minMappingQuality 10 \ |
| 52 | + –samFlagExclude 256 –samFlagExclude 2048 \ |
| 53 | + –barcodes ${dir}/filtered_barcodes.txt \ |
| 54 | + --cellTag CB \ |
| 55 | + --label rna_rep${r} |
| 56 | + -o sincei_output/rna/scFilterStats_output_rep${r}.txt |
| 57 | + -b ${bamfile} |
| 58 | + done |
| 59 | +
|
| 60 | +`scFilterStats` summarizes these outputs as a table, which can then be |
| 61 | +visualized using the `MultiQC tool <https://multiqc.info/docs/>`__, to select |
| 62 | +appropriate list of cells to include for counting. |
| 63 | + |
| 64 | +.. code:: bash |
| 65 | +
|
| 66 | + multiqc sincei_output/rna/ # results in multiqc_report.html |
| 67 | +
|
| 68 | +3. Signal aggregation (counting reads) |
| 69 | +-------------------------------------- |
| 70 | + |
| 71 | +Below we use sincei to aggregate signal from single-cells. For the gene |
| 72 | +expression data, it makes sense to aggregate signal from genes only. For |
| 73 | +this we can use a GTF file, which defines gene annotations. |
| 74 | + |
| 75 | +If needed, we can use the same parameters as in ``scFilterStats`` to |
| 76 | +count only high quality reads from our whitelist of barcodes. |
| 77 | + |
| 78 | +.. code:: bash |
| 79 | +
|
| 80 | + ## Download the GTF file |
| 81 | + curl -o sincei_output/hg38.gtf.gz \ |
| 82 | + http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz && gunzip sincei_output/hg38.gtf.gz |
| 83 | +
|
| 84 | + # count reads on the GTF file |
| 85 | + for r in 1 2 do dir=cellranger_output_rep${r}/outs/ |
| 86 | + bamfile=${dir}/gex_possorted_bam.bam |
| 87 | + barcodes=${dir}/filtered_barcodes.txt scCountReads features -p 20 |
| 88 | + –BED sincei_output/hg38.gtf –cellTag CB –minMappingQuality 10 \ |
| 89 | + –samFlagExclude 256 –samFlagExclude 2048 -bc ${barcodes} --cellTag CB \ |
| 90 | + -o sincei_output/rna/scCounts_rna_genes_rep${r} |
| 91 | + –label rna_rep${r} |
| 92 | + -b ${bamfile} |
| 93 | + done |
| 94 | +
|
| 95 | + # Number of bins found: 74538 |
| 96 | +
|
| 97 | +
|
| 98 | +
|
| 99 | +4. Quality control - II (count-level) |
| 100 | +------------------------------------- |
| 101 | + |
| 102 | +Even though we already performed read-level QC before, the counts distribution on our specified regions (bins/genes/peaks) could be different from the whole-genome stats. |
| 103 | + |
| 104 | +**scCountQC**, with the `--outMetrics` option, outputs the count statistics at region and cell level (labelled as <prefix>.regions.tsv and <prefix>.cells.tsv). |
| 105 | +Just like `scFilterStats`, these outputs can then be visualized using the |
| 106 | +`MultiQC tool <https://multiqc.info/docs/>`__, to select appropriate metrics to |
| 107 | +filter out the unwanted cells/regions. |
| 108 | + |
| 109 | +.. code:: bash |
| 110 | +
|
| 111 | + # list the metrics we can use to filter cells/regions |
| 112 | + for r in 1 2; do scCountQC -i sincei_output/rna/scCounts_rna_genes_rep${r}.loom --describe; done |
| 113 | +
|
| 114 | + # export the single-cell level metrices |
| 115 | + for r in 1 2; do scCountQC -i sincei_output/rna/scCounts_rna_genes_rep${r}.loom \ |
| 116 | + -om sincei_output/rna/countqc_rna_genes_rep${r} & done |
| 117 | +
|
| 118 | + # visualize output using multiQC |
| 119 | + multiqc sincei_output/rna/ # see results in multiqc_report.html |
| 120 | +
|
| 121 | +In total >74500 genes have been detected in >13.7K cells here. |
| 122 | + |
| 123 | +Below, we perform a basic filtering using **scCountQC**. We exclude the |
| 124 | +cells with <500 and >10000 detected genes (``--filterRegionArgs``). |
| 125 | +Also, we exclude the genes that are detected in too few cells (<100) or |
| 126 | +too many (approax >90%) of cells (``--filterCellArgs``). |
| 127 | + |
| 128 | +.. code:: bash |
| 129 | +
|
| 130 | + for r in 1 2 do scCountQC -i |
| 131 | + sincei_output/rna/scCounts_rna_genes_rep\ :math:`{r}.loom \ |
| 132 | + -o sincei_output/rna/scCounts_rna_genes_filtered_rep`\ {r}.loom |
| 133 | + –filterRegionArgs “n_cells_by_counts: 100, 6000” |
| 134 | + –filterCellArgs “n_genes_by_counts: 500, 15000” done |
| 135 | +
|
| 136 | + ## rep 1 |
| 137 | + #Applying filters |
| 138 | + #Remaining cells: 5314 |
| 139 | + #Remaining features: 48219 |
| 140 | +
|
| 141 | + ## rep 2 |
| 142 | + #Applying filters |
| 143 | + #Remaining cells: 4894 |
| 144 | + #Remaining features: 48660 |
| 145 | +
|
| 146 | +
|
| 147 | +5. Combine counts for the 2 replicates |
| 148 | +-------------------------------------- |
| 149 | + |
| 150 | +While it's useful to perform count QC seperately for each replicate, the counts can now be combined for downstream analysis. |
| 151 | + |
| 152 | +We provide a tool `scCombineCounts`, which can concatenate counts for cells based on common features. Concatenating the filtered cells for the 2 replicates would result in a total of >12K cells. |
| 153 | + |
| 154 | +.. code:: bash |
| 155 | +
|
| 156 | + scCombineCounts \ |
| 157 | + -i sincei_output/rna/scCounts_rna_genes_filtered_rep1.loom \ |
| 158 | + sincei_output/rna/scCounts_rna_genes_filtered_rep2.loom \ |
| 159 | + -o sincei_output/rna/scCounts_rna_genes_filtered.merged.loom \ |
| 160 | + --method multi-sample --labels rep1 rep2 |
| 161 | + #Combined cells: 10208 |
| 162 | + #Combined features: 48059 |
| 163 | +
|
| 164 | +5. Dimentionality reduction and clustering |
| 165 | +------------------------------------------ |
| 166 | + |
| 167 | +Finally, we will apply glmPCA to this data, assuming the data follows a |
| 168 | +poisson distribution (which is nearly appropritate for count-based data |
| 169 | +such as scRNA-seq), we will reduce the dimentionality of the data to 20 |
| 170 | +principle components (the default), followed by a graph-based (louvain) |
| 171 | +clustering of the output. |
| 172 | + |
| 173 | +.. code:: bash |
| 174 | +
|
| 175 | + scClusterCells -i sincei_output/rna/scCounts_rna_genes_filtered.merged.loom \ |
| 176 | + -m glmPCA -gf poisson --clusterResolution 1 \ |
| 177 | + -op sincei_output/rna/scClusterCells_UMAP.png \ |
| 178 | + -o sincei_output/rna/scCounts_rna_genes_clustered.loom |
| 179 | +
|
| 180 | + # Coherence Score: Coherence Score: -0.9893882070519782 |
| 181 | + # also produces the tsv file "sincei_output/rna/scClusterCells_UMAP.tsv" |
| 182 | +
|
| 183 | +(optional) Confirmation of clustering using metadata |
| 184 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 185 | + |
| 186 | +Below, we will load this data in R and compare it to the cell metadata |
| 187 | +provided with our files to see if our clustering separates celltypes in |
| 188 | +a biologically meaningful way. |
| 189 | + |
| 190 | +We can color our UMAP output from ``scClusterCells`` with the cell-type |
| 191 | +information based on the metadata provided by the original authors. |
| 192 | + |
| 193 | +.. code:: r |
| 194 | +
|
| 195 | + library(dplyr) |
| 196 | + library(magrittr) |
| 197 | + library(ggplot2) |
| 198 | + library(patchwork) |
| 199 | +
|
| 200 | + umap <- read.delim(“sincei_output/rna/scClusterCells_UMAP.tsv”) meta <- |
| 201 | + read.csv(“metadata_cd34_rna.csv”, row.names = 1) |
| 202 | + umap$celltype <- meta[gsub("rep1_|rep2_", "", umap`\ Cell_ID), |
| 203 | + “celltype”] |
| 204 | +
|
| 205 | + # keep only cells with published labels |
| 206 | +
|
| 207 | + umap %<>% filter(!is.na(celltype)) # remove clusters with low number of |
| 208 | + cells cl = umap %>% group_by(cluster) %>% |
| 209 | + summarise(Cell_ID = dplyr::n()) %>% |
| 210 | + filter(Cell_ID < 50) %>% .$cluster umap %<>% |
| 211 | + filter(!(cluster %in% cl)) |
| 212 | +
|
| 213 | + # make plots |
| 214 | + df_center <- group_by(umap, cluster) %>% |
| 215 | + summarise(UMAP1 = mean(UMAP1), UMAP2 = mean(UMAP2)) |
| 216 | + df_center2 <- group_by(umap, celltype) %>% |
| 217 | + summarise(UMAP1 = mean(UMAP1), UMAP2 = mean(UMAP2)) |
| 218 | +
|
| 219 | + # colors for metadata (8 celltypes) |
| 220 | +
|
| 221 | + col_pallete <- RColorBrewer::brewer.pal(8, “Paired”) |
| 222 | + names(col_pallete) <- unique(umap$celltype) # grey is for NA |
| 223 | +
|
| 224 | + # colors for sincei UMAP (10 clusters) |
| 225 | +
|
| 226 | + colors_cluster <- RColorBrewer::brewer.pal(10, “Paired”) |
| 227 | + names(colors_cluster) <- sort(unique(umap$cluster)) |
| 228 | +
|
| 229 | + p1 <- umap %>% ggplot(., aes(UMAP1, UMAP2, color=factor(cluster), |
| 230 | + label=cluster)) + geom_point() + |
| 231 | + geom_label(data = df_center, aes(UMAP1, UMAP2)) + |
| 232 | + scale_color_manual(values = colors_cluster) + |
| 233 | + theme_void(base_size = 12) + theme(legend.position = “none”) + |
| 234 | +
|
| 235 | + p2 <- umap %>% filter(!is.na(celltype)) %>% ggplot(., aes(UMAP1, UMAP2, |
| 236 | + color=factor(celltype), label=celltype)) + geom_point() + |
| 237 | + geom_label(data = df_center2, aes(UMAP1, UMAP2)) + |
| 238 | + scale_color_manual(values = col_pallete) + labs(color=“Cluster”) + |
| 239 | + theme_void(base_size = 12) + theme(legend.position = “none”) + |
| 240 | + ggtitle(“Published Cell Types”) |
| 241 | +
|
| 242 | + pl <- p1 + p2 |
| 243 | + ggsave(plot=pl, “sincei_output/rna/UMAP_compared_withOrig.png”, |
| 244 | + dpi=300, width = 11, height = 6) |
| 245 | +
|
| 246 | +
|
| 247 | +.. image:: ./../images/UMAP_compared_withOrig_10xRNA.png |
| 248 | + :height: 800px |
| 249 | + :width: 1600 px |
| 250 | + :scale: 50 % |
| 251 | + |
| 252 | +The figure above shows that we can easily replicate the expected cell-type |
| 253 | +results from the scRNA-seq data using **sincei**. However there are some |
| 254 | +interesting differences, especially, a separation of the CLP cluster into 2 clusters, |
| 255 | +where one of these cluster is similar to the annotated pDC. |
| 256 | + |
| 257 | +This was done using basic pre-processing steps, therefore the better cell/region filtering and optimizing the analysis parameters. |
| 258 | + |
| 259 | +6. Creating bigwigs and visualizing signal on IGV |
| 260 | +------------------------------------------------- |
| 261 | + |
| 262 | +For further exploration of data, It's very useful to create in-silico bulk |
| 263 | +coverage files (bigwigs) that aggregate the signal across cells in our clusters. |
| 264 | +The tool **scBulkCoverage** takes sincei clustered `.tsv` file, along with the |
| 265 | +corresponding BAM files, and aggregate the signal to create these bigwigs. |
| 266 | + |
| 267 | +The parameters here are same as other sincei tools that work on BAM files, |
| 268 | +except that we can ask for a normalized bulk signal (specified using |
| 269 | +`--normalizeUsing` option) . Below, we produce CPM-normalized bigwigs with 100 bp bins. |
| 270 | + |
| 271 | +.. code:: bash |
| 272 | +
|
| 273 | + scBulkCoverage -p 20 --cellTag CB --region chr1 \ |
| 274 | + --normalizeUsing CPM --binSize 100 \ |
| 275 | + --minMappingQuality 10 --samFlagExclude 2048 \ |
| 276 | + -b cellranger_output_rep1/outs/gex_possorted_bam.bam \ |
| 277 | + cellranger_output_rep2/outs/gex_possorted_bam.bam \ |
| 278 | + --labels rep1_rna_rep1 rep2_rna_rep2 \ |
| 279 | + -i sincei_output/rna/scClusterCells_UMAP.tsv \ |
| 280 | + -o sincei_output/rna/sincei_cluster |
| 281 | + # creates 6 files with names "sincei_cluster_<X>.bw" where X is 0, 1... 9 |
| 282 | +
|
| 283 | +We can now inspect these bigwigs on IGV. Looking at the region around one of the markers described in the |
| 284 | +original manuscript, **TAL1**, we can see that the CLP (lyphoid) and pDCs lack it's expression, while the |
| 285 | +myloid cells and HSCs have the signal. The neighboring gene STIL, which is involved in cell-cycle regulation |
| 286 | +is not expressed in the highly proliferative HSC/HMP/MEP cell clusters. Overall this confirms that the signal |
| 287 | +coverage extracted from our clusters broadly reflects the biology of the underlying cell types. |
| 288 | + |
| 289 | +.. image:: ./../images/igv_snapshot_10xRNA.png |
| 290 | + :height: 500px |
| 291 | + :width: 6000 px |
| 292 | + :scale: 50 % |
0 commit comments