Skip to content

Commit 2a8a849

Browse files
committed
Add recount3 citations bib file. Incorporate the quality control fields currently described in the paper supplement. Modify the format from LaTeX to markdown.
1 parent 3dd5582 commit 2a8a849

File tree

3 files changed

+660
-0
lines changed

3 files changed

+660
-0
lines changed

04_qc.Rmd

+102
Original file line numberDiff line numberDiff line change
@@ -2,3 +2,105 @@
22

33
# Quality check fields
44

5+
We used a number of tools to collect potentially useful quality-control (QC) measures. Specifically, we used `seqtk` [@seqtk], the `idxstats` subcommand of `samtools`, the output of `STAR`, our own `megadepth` tool, and `featureCounts`. We examine each in turn, listing the specific QC measures calculated be each.
6+
7+
Monorail runs the `seqtk fqchk` command on input FASTQ files to collect base-quality and base-composition summaries for all sequencing cycles. We distill these into a few QC measures included with every summarized run in `recount3`.
8+
9+
1. **`min_len`**: minimum read length
10+
1. **`max_len`**: maximum read length
11+
1. **`avg_len`**: average read length
12+
1. **`#distinct_quality_values`**: number of different quality scores present in the sequence run's base qualities
13+
1. **`#bases`**: total number of bases across spots (not including both read mates if paired)
14+
1. **`%A`**: percent of bases that are A
15+
1. **`%C`**: percent of bases that are C
16+
1. **`%G`**: percent of bases that are G
17+
1. **`%T`**: percent of bases that are T
18+
1. **`%N`**: percent of bases that are N
19+
1. **`avgQ`**: weighted average over Phred quality scores present in the sequence run, where weights are the Phred quality values themselves
20+
1. **`errQ`**: negatively scaled log of the weighted average Phred quality scores present in the sequence run, where weights are the error probabilities associated with the Phred quality scores
21+
22+
23+
Monorail uses `STAR` to align RNA-seq reads in a spliced fashion to a reference genome, without using any annotation. Files output by `STAR`, particularly the `Log.out` and `Log.final.out`, report a number of measures that can be used for QC. We compile these into a number of QC measures included with every summarized run in recount3. Note that some of these measures are reported separately for the two ends of a paired-end read. We omit the second-end versions of these QC measures here for space reasons.
24+
25+
From the `STAR` manual (version 2.7.2b):
26+
27+
>> Log.final.out: summary mapping statistics after mapping job is complete, very useful for quality control. The statistics are calculated for each read (single- or paired-end) and then summed or averaged over all reads. Note that STAR counts a paired-end read as one read, (unlike the samtools flagstat/idxstats, which count each mate separately). Most of the information is collected about the UNIQUE mappers (unlike samtools flagstat/idxstats which does not separate unique or multi-mappers). Each splicing is counted in the numbers of splices, which would correspond to summing the counts in SJ.out.tab. The mismatch/indel error rates are calculated on a per base basis, i.e. as total number of mismatches/indels in all unique mappers divided by the total number of mapped bases.
28+
29+
Some of the following definitions include text from the `STAR` manual/source code, reprinted here for convenience. Please see the [`STAR` manual](https://github.com/alexdobin/STAR/tree/master/doc) for more in depth information.
30+
31+
1. **`%_of_chimeric_reads`**: Number of chimeric reads divided by number of input reads
32+
1. **`%_of_reads_mapped_to_multiple_loci`**: Number of reads mapped to multiple loci divided by number of input reads
33+
1. **`%_of_reads_mapped_to_too_many_loci`**: Number of reads mapped to (> 10) loci divided by number of input reads
34+
1. **`%_of_reads_unmapped:_other`**: Reads are unmapped due to no acceptable seed/windows divided by number of input reads
35+
1. **`%_of_reads_unmapped:_too_many_mismatches`**: Number of reads where best alignment has more mismatches than max allowed number of mismatches divided by number of input reads
36+
1. **`%_of_reads_unmapped:_too_short`**: Number of reads where best alignment was shorter than min allowed mapped length divided by number of input reads
37+
1. **`all_mapped_reads`**: Total number of reads aligned
38+
1. **`average_input_read_length`**: Average length of a read
39+
1. **`average_mapped_length`**: Average length of an alignment
40+
1. **`deletion_average_length`**: Average length of a genomic deletion, i.e. genomic gaps
41+
1. **`deletion_rate_per_base`**: Genomic deletions per mapped base
42+
1. **`insertion_average_length`**: Average length of a genomic insertion, i.e. read gaps
43+
1. **`insertion_rate_per_base`**: Genomics insertions per mapped base
44+
1. **`mapping_speed_million_of_reads_per_hour`**: How fast it was to align this sample
45+
1. **`mismatch_rate_per_base_%`**: Mismatches per mapped base
46+
1. **`number_of_chimeric_reads`**: Total number of reads which were fragmented on aligning, e.g. fusion potential reads
47+
1. **`number_of_input_reads`**: Total number of reads input to `STAR`
48+
1. **`number_of_reads_mapped_to_multiple_loci`**: Number of reads mapped to multiple loci
49+
1. **`number_of_reads_mapped_to_too_many_loci`**: Number of reads mapped to (> 10) loci
50+
1. **`number_of_reads_unmapped:_other`**: Number of reads left unmapped due to no acceptable seed/windows
51+
1. **`number_of_reads_unmapped:_too_many_mismatches`**: Number of reads where best alignment has more mismatches than max allowed number of mismatches
52+
1. **`number_of_reads_unmapped:_too_short`**: Number of reads where best alignment was shorter than min allowed mapped length
53+
1. **`number_of_splices:_at/ac`**: Number of canonical splices of AT-AC (and reverse)
54+
1. **`number_of_splices:_annotated_(sjdb)`**: Number of splices found that were also in the annotation database
55+
1. **`number_of_splices:_gc/ag`**: Number of canonical splices of GC-AG (and reverse)
56+
1. **`number_of_splices:_gt/ag`**: Number of canonical splices of GT-AG (and reverse)
57+
1. **`number_of_splices:_non-canonical`**: Number of non-canonical splices, anything not GT-AG, AT-AC, GC-AG, or their reverse complement
58+
1. **`number_of_splices:_total`**: Total number of splices
59+
1. **`uniquely_mapped_reads_%`**: Number of reads which mapped to a single locus divided by number of input reads
60+
1. **`uniquely_mapped_reads_number`**: Number of reads which mapped to a single locus
61+
62+
63+
Monorail runs the `samtools idxstats` on the BAM file output by `STAR` to collect statistics about how many reads aligned to each chromosome in the genome assembly. This can be helpful in, for instance, confirming the sex of the individual sequenced based on alignments to sex chromosomes, or measuring effectiveness of ribosomal RNA depletion by considering the fraction of reads aligned to the mitochondrial genome. We compile these into a number of QC measures included with every summarized run in recount3:
64+
65+
66+
1. **`aligned_reads%.chrm`**: Percent of reads aligning to the mitochondrial genome.
67+
1. **`aligned_reads%.chrx`**: Percent of reads aligning to chromosome X.
68+
1. **`aligned_reads%.chry`**: Precent of reads aligning to chromosome Y.
69+
70+
71+
Monorail runs our `megadepth` tool on the BAM files output by `STAR`. The chief function is to convert BAM files to bigWig files that are then added to the recount3 archive. As `megadepth` performs this conversion, it also summarizes the amount of sequencing coverage within the intervals of a provided BED file representing a gene annotation. These quantifications can be useful for quality control, tell us, for example, what fraction of the coverage is within annotated genes.
72+
73+
Fragment length distribution is based on a special read filter only applied for this purpose to be compatible with CSAW's fragment counting approach [@csaw], paired reads in a passing fragment must not be secondary, supplementary, have conflicting read order, be unmapped or be mapped on more than one chromosome. The `bc_` prefix here refers to the previous name of the Megadepth tool ^[Megadepth used to be called BamCount.].
74+
75+
76+
1. **`bc_auc.all_reads_all_bases`**: Area under coverage (total depth of coverage evaluated at all bases) for all alignments
77+
1. **`bc_auc.all_reads_annotated_bases`**: Area under coverage for all alignments, but only for bases in annotated exons
78+
1. **`bc_auc.unique_reads_all_bases`**: Area under coverage for uniquely aligned reads
79+
1. **`bc_auc.unique_reads_annotated_bases`**: Area under coverage for uniquely aligned reads, but only for bases in annotated exons
80+
1. **`bc_auc.all_%`**: `bc_auc.all_reads_annotated_bases` divided by `bc_auc.all_reads_all_bases`
81+
1. **`bc_auc.unique_%`**: `bc_auc.unique_reads_annotated_bases` divided by `bc_auc.unique_reads_all_bases`
82+
1. **`bc_frag.count`**: Total number of read fragments in BAM after filtering
83+
1. **`bc_frag.kallisto_count`**: Number of read fragments (< 1000) bp in length in BAM after filtering
84+
1. **`bc_frag.kallisto_mean_length`**: Mean length of read fragments (< 1000) bp in length in BAM after filtering
85+
1. **`bc_frag.mean_length`**: Mean length of all read fragments in BAM after filtering
86+
1. **`bc_frag.mode_length`**: Mode of the read fragment length of all fragments in BAM after filtering
87+
1. **`bc_frag.mode_length_count`**: Number of read fraqments with the `bc_frag.mode_length` in BAM after filtering
88+
89+
90+
91+
Finally, Monorail runs `featureCounts` on the BAM files output by `STAR`. This provides a `second opinion` on the quantifications produced by `megadepth`. While we have not yet found compelling examples where the `megadepth` and `featureCounts` outputs disagree, we keep summaries of the `featureCounts` quantifications as potential QC measures.
92+
93+
94+
1. **`exon_fc.all_%`**: `exon_fc_count_all.assigned` divided by `all_mapped_reads` (from `STAR`)
95+
1. **`exon_fc.unique_%`**: `exon_fc_count_unique.assigned` divided by `uniquely_mapped_reads_number` (from `STAR`)
96+
1. **`exon_fc_count_all.total`**: Total number of fragments, including multi-mappers, input to `featureCounts`
97+
1. **`exon_fc_count_all.assigned`**: Number of fragments, including multi-mappers, assigned by `featureCounts` to an exon
98+
1. **`exon_fc_count_unique.total`**: Total number of uniquely mapping fragments input to `featureCounts`
99+
1. **`exon_fc_count_unique.assigned`**: Number of uniquely mapping fragments assigned by `featureCounts` to an exon
100+
1. **`gene_fc.all_%`**: `gene_fc_count_all.assigned` divided by `all_mapped_read`s (from `STAR`)
101+
1. **`gene_fc.unique_%`**: `gene_fc_count_unique.assigned` divided by `uniquely_mapped_reads_number` (from `STAR`)
102+
1. **`gene_fc_count_all.total`**: Total number of fragments, including multi-mappers, input to `featureCounts`
103+
1. **`gene_fc_count_all.assigned`**: Number of fragments, including multi-mappers, assigned by `featureCounts` to a gene
104+
1. **`gene_fc_count_unique.total`**: Total number of uniquely mapping fragments input to `featureCounts`
105+
1. **`gene_fc_count_unique.assigned`**: Number of uniquely mapping fragments assigned by `featureCounts` to a gene
106+

index.Rmd

+3
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,9 @@ site: bookdown::bookdown_site
55
documentclass: book
66
output:
77
bookdown::gitbook: default
8+
bibliography: ["recount3.bib"]
9+
biblio-style: "apalike"
10+
link-citations: true
811
---
912

1013
# Overview {-}

0 commit comments

Comments
 (0)