
Commit 5802ab9

Version 3.2.0
- Add v3.2.0 documentation, a polished CCS workflow.
- Update instructions for CCS v4.0.
1 parent 6c86f4c commit 5802ab9

6 files changed

+246
-2
lines changed

README.md

Lines changed: 8 additions & 1 deletion
@@ -20,11 +20,18 @@ for information on Installation, Support, License, Copyright, and Disclaimer.
2020

2121
## Specific Version Documentation
2222

23+
* [Version 3.2, SMRT Link 8.0](README_v3.2.md)
2324
* [Version 3.1, SMRT Link 7.0](README_v3.1.md)
2425
* [Version 3.0, SMRT Link 6.0](README_v3.0.md)
2526

2627
## Changelog
27-
* **3.1.2**
28+
* **3.2.0**
29+
* Add `collapse` step for aligned transcript BAM input
30+
* Enable CCS-only workflow `cluster --use-qvs`
31+
* Add `refine --min-polya-length`
32+
* Add `cluster --singletons` to output unclustered FLNCs; potential sample prep artifacts!
33+
* Fix minimap2 bugs. Outputs might change slightly.
34+
* 3.1.2
2835
* Reduce `polish` memory footprint
2936
* 3.1.1
3037
* Fix edge case where `polish` would stall and never finish

README_v3.0.md

Lines changed: 5 additions & 1 deletion
@@ -37,10 +37,14 @@ Each sequencing run is processed by [*ccs*](https://github.com/PacificBioscience
3737
to generate one representative circular consensus sequence (CCS) for each ZMW. Only ZMWs with
3838
at least one full pass (at least one subread with SMRT adapter on both ends) are
3939
used for the subsequent analysis. Polishing is not necessary
40-
in this step and is by default deactivated through `.
40+
in this step and is deactivated through `--noPolish`.
4141

4242
ccs movie.subreads.bam ccs.bam --noPolish --minPasses 1
4343

44+
For **CCS version ≥ 4.0.0** use this call:
45+
46+
$ ccs movie.subreads.bam ccs.bam --skip-polish --min-passes 1 --draft-mode winpoa --disable-heuristics
47+
4448
### Primer removal and demultiplexing
4549
Removal of cDNA primers and identification of barcodes (if given) is performed using [*lima*](https://github.com/pacificbiosciences/barcoding),
4650
which offers a specialized `--isoseq` mode.

README_v3.1.md

Lines changed: 4 additions & 0 deletions
@@ -59,6 +59,10 @@ used per ZMW; this can decrease run-time (only available in ccs version ≥ 3.1.
5959

6060
$ ccs movieX.subreads.bam movieX.ccs.bam --noPolish --minPasses 1 --maxPoaCoverage 10
6161

62+
For **CCS version ≥ 4.0.0** use this call:
63+
64+
$ ccs movieX.subreads.bam movieX.ccs.bam --skip-polish --min-passes 1 --draft-mode winpoa --disable-heuristics
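Scripts that must run with either CCS generation can pick the flag set from the installed version. The following is a minimal sketch, not part of the official workflow: the function name `ccs_unpolished_flags` and the simple major-version check are assumptions made for illustration, and the flags themselves are copied from the two calls above.

```shell
# Print the ccs flags for unpolished CCS generation, given a version string.
# Hypothetical helper: the name and the major-version test are assumptions.
ccs_unpolished_flags() {
    version="$1"            # e.g. "4.0.0" or "3.4.1"
    major="${version%%.*}"  # keep everything before the first dot
    if [ "$major" -ge 4 ]; then
        echo "--skip-polish --min-passes 1 --draft-mode winpoa --disable-heuristics"
    else
        # --maxPoaCoverage is only available in ccs version >= 3.1
        echo "--noPolish --minPasses 1 --maxPoaCoverage 10"
    fi
}

# Example (not executed here; assumes "ccs --version" prints "ccs X.Y.Z"):
#   version="$(ccs --version | awk '{print $2}')"
#   ccs movieX.subreads.bam movieX.ccs.bam $(ccs_unpolished_flags "$version")
```

The helper only prints flags, so the final `ccs` call stays visible in the calling script.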
65+
6266
### Step 2 - Primer removal and demultiplexing
6367
Removal of primers and identification of barcodes is performed using [*lima*](https://github.com/pacificbiosciences/barcoding),
6468
which offers a specialized `--isoseq` mode.

README_v3.2.md

Lines changed: 229 additions & 0 deletions
@@ -0,0 +1,229 @@
1+
<h1 align="center"><img width="300px" src="doc/img/isoseq3.png"/></h1>
2+
<h1 align="center">IsoSeq 3.2</h1>
3+
<p align="center">Scalable De Novo Isoform Discovery</p>
4+
5+
***
6+
7+
*IsoSeq3* contains the newest tools to identify transcripts in
8+
PacBio single-molecule sequencing data.
9+
Starting in SMRT Link v6.0.0, those tools power the
10+
*IsoSeq3 GUI-based analysis* application.
11+
A composable workflow of existing tools and algorithms, combined with
12+
a new clustering technique, makes it possible to process the ever-increasing yield of PacBio
13+
machines with similar performance to *IsoSeq1* and *IsoSeq2*.
14+
15+
The version 3.2 documentation focuses on processing of polished CCS reads,
16+
the latest feature of *IsoSeq3*. Processing of unpolished CCS reads with final
17+
transcript polishing is still supported; please refer to the
18+
[documentation of version 3.1](README_v3.1.md).
19+
20+
## Availability
21+
The latest version can be installed via the bioconda package `isoseq3`.
22+
23+
Please refer to our [official pbbioconda page](https://github.com/PacificBiosciences/pbbioconda)
24+
for information on Installation, Support, License, Copyright, and Disclaimer.
25+
26+
## Overview
27+
- Workflow Overview: [high](#high-level-workflow) / [mid](#mid-level-workflow) / [low](#low-level-workflow) level
28+
- [Real-World Example](#real-world-example)
29+
- [FAQ](README_v3.1.md#faq)
30+
- [SMRTbell Designs](README_v3.1.md#what-smrtbell-designs-are-possible)
31+
32+
## High-level workflow
33+
34+
The high-level workflow depicts files and processes:
35+
36+
<img width="1000px" src="doc/img/isoseq3.2-end-to-end.png"/>
37+
38+
## Mid-level workflow
39+
40+
The mid-level workflow schematically explains what happens at each stage:
41+
42+
<img width="1000px" src="doc/img/isoseq3.2-workflow.png"/>
43+
44+
## Low-level workflow
45+
46+
The low-level workflow is explained via CLI calls. All necessary dependencies are
47+
installed via bioconda.
48+
49+
### Step 0 - Input
50+
For each SMRT cell a `movieX.subreads.bam` is needed for processing.
51+
52+
### Step 1 - Circular Consensus Sequence calling
53+
Each sequencing run is processed by [*ccs*](https://github.com/PacificBiosciences/unanimity)
54+
to generate one representative circular consensus sequence (CCS) for each ZMW. Only ZMWs with
55+
at least one full pass (at least one subread with SMRT adapter on both ends) are
56+
used for the subsequent analysis. In contrast to older IsoSeq versions,
57+
CCS polishing is required to enable skipping of the transcript polishing.
58+
It is advised to use the latest CCS version 4.0.0 or newer.
59+
60+
$ ccs movieX.subreads.bam movieX.ccs.bam --min-rq 0.9
61+
62+
More information on how to [easily chunk ccs](https://github.com/PacificBiosciences/ccs#how-can-I-parallelize-on-multiple-servers).
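As a rough sketch of what chunking can look like, the helper below prints one `ccs` call per chunk followed by a merge step. It assumes ccs 4.0.0 or newer with the `--chunk` option and `pbmerge` for combining the per-chunk BAM files; the function name `ccs_chunk_commands` and the per-chunk file naming are hypothetical. Nothing is executed, so the printed lines can be reviewed or handed to a job scheduler.

```shell
# Print one ccs call per chunk plus the final merge; nothing is executed.
# Assumes ccs >= 4.0.0 (--chunk) and pbmerge are installed.
# The helper name and the <movie>.ccs.<i>.bam naming are hypothetical.
ccs_chunk_commands() {
    movie="$1"     # movie prefix, e.g. movieX
    chunks="$2"    # number of chunks, e.g. 10
    i=1
    while [ "$i" -le "$chunks" ]; do
        echo "ccs $movie.subreads.bam $movie.ccs.$i.bam --chunk $i/$chunks --min-rq 0.9"
        i=$((i + 1))
    done
    # Merge the per-chunk outputs into a single CCS BAM
    echo "pbmerge -o $movie.ccs.bam $movie.ccs.*.bam"
}

# Example (not executed here):
#   ccs_chunk_commands movieX 10
```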
63+
64+
### Step 2 - Primer removal and demultiplexing
65+
Removal of primers and identification of barcodes is performed using [*lima*](https://github.com/pacificbiosciences/barcoding),
66+
which offers a specialized `--isoseq` mode.
67+
Even in the case that your sample is not barcoded, primer removal is performed
68+
by *lima*.
69+
If there are more than two sequences in your `primer.fasta` file, or more precisely
70+
more than one pair of 5' and 3' primers, please use *lima* with `--peek-guess`
71+
to remove spurious false positive signal.
72+
More information about how to name input primer(+barcode)
73+
sequences is available in this [FAQ](https://github.com/pacificbiosciences/barcoding#how-can-i-demultiplex-isoseq-data).
74+
75+
$ lima movieX.ccs.bam barcoded_primers.fasta movieX.fl.bam --isoseq --no-pbi --peek-guess
76+
77+
**Example 1:**
78+
The following is the `primer.fasta` for the Clontech SMARTer and NEB cDNA library
79+
prep, which are the officially recommended protocols:
80+
81+
>NEB_5p
82+
GCAATGAAGTCGCAGGGTTGGG
83+
>Clontech_5p
84+
AAGCAGTGGTATCAACGCAGAGTACATGGGG
85+
>NEB_Clontech_3p
86+
GTACTCTGCGTTGATACCACTGCTT
87+
88+
**Example 2:**
89+
The following are examples of barcoded primers using a 16 bp barcode followed by
90+
Clontech primer:
91+
92+
>primer_5p
93+
AAGCAGTGGTATCAACGCAGAGTACATGGGG
94+
>brain_3p
95+
CGCACTCTGATATGTGGTACTCTGCGTTGATACCACTGCTT
96+
>liver_3p
97+
CTCACAGTCTGTGTGTGTACTCTGCGTTGATACCACTGCTT
98+
99+
*Lima* will remove unwanted combinations and orient sequences in the 5' → 3' direction.
100+
101+
Output files will be named according to their primer pair. Example for
102+
single sample libraries:
103+
104+
movieX.fl.NEB_5p--NEB_Clontech_3p.bam
105+
106+
If your library contains multiple samples, execute the following workflow
107+
for each primer pair:
108+
109+
movieX.fl.primer_5p--brain_3p.bam
110+
movieX.fl.primer_5p--liver_3p.bam
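Running the remaining steps once per primer pair is easy to script. The following sketch prints the *refine* call for each *lima* output without executing anything. It assumes the `<movie>.fl.<5p>--<3p>.bam` naming shown above; the helper name `refine_commands` and the `<movie>.<pair>.flnc.bam` output naming are hypothetical conventions chosen to keep samples apart.

```shell
# Print one refine call per demultiplexed lima output; nothing is executed.
# Assumption: lima outputs are named <movie>.fl.<5p>--<3p>.bam.
# The <movie>.<pair>.flnc.bam output name is a hypothetical convention.
refine_commands() {
    primers="$1"; shift            # primer FASTA; remaining args are lima BAMs
    for bam in "$@"; do
        stem="${bam%.bam}"         # movieX.fl.primer_5p--brain_3p
        pair="${stem##*.fl.}"      # primer_5p--brain_3p
        movie="${stem%%.fl.*}"     # movieX
        echo "isoseq3 refine $bam $primers $movie.$pair.flnc.bam"
    done
}

# Example (not executed here):
#   refine_commands primers.fasta movieX.fl.*--*.bam
```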
111+
112+
### Step 3 - Refine
113+
Your data now contains full-length reads, but still needs to be refined by:
114+
- [Trimming](https://github.com/PacificBiosciences/trim_isoseq_polyA) of poly(A) tails
115+
- Rapid concatemer [identification](https://github.com/jeffdaily/parasail) and removal
116+
117+
**Input**
118+
The input file for *refine* is one demultiplexed CCS file with full-length reads
119+
and the primer fasta file:
120+
- `<movie>.fl.<primer--pair>.bam` or `<movie>.fl.<primer--pair>.consensusreadset.xml`
121+
- `primers.fasta`
122+
123+
**Output**
124+
The following output files of *refine* contain full-length non-concatemer reads:
125+
- `<movie>.flnc.bam`
126+
- `<movie>.flnc.transcriptset.xml`
127+
128+
Actual command to refine:
129+
130+
$ isoseq3 refine movieX.fl.NEB_5p--NEB_Clontech_3p.bam primers.fasta movieX.flnc.bam
131+
132+
If your sample has poly(A) tails, use `--require-polya`.
133+
This filters for FL reads that have a poly(A) tail
134+
with at least 20 base pairs and removes the identified tail:
135+
136+
$ isoseq3 refine movieX.fl.NEB_5p--NEB_Clontech_3p.bam primers.fasta movieX.flnc.bam --require-polya
137+
138+
### Step 3b - Merge SMRT Cells
139+
If you used more than one SMRT cell, use `dataset` for merging.
140+
Merge all of your `<movie>.flnc.bam` files:
141+
142+
$ dataset create --type TranscriptSet merged.flnc.xml movie1.flnc.bam movie2.flnc.bam movieN.flnc.bam
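With many SMRT cells, the file list can be collected by a glob instead of being typed by hand. This is a small sketch that prints the `dataset` call for all FLNC files in a directory; the helper name `merge_flnc_command` is hypothetical, and it assumes the `<movie>.flnc.bam` naming used above.

```shell
# Print a dataset call merging all FLNC BAMs found in a directory.
# Assumption: files follow the <movie>.flnc.bam naming; helper is hypothetical.
merge_flnc_command() {
    dir="${1:-.}"                   # directory to search, default: current
    set -- "$dir"/*.flnc.bam        # expand the glob into positional parameters
    [ -e "$1" ] || { echo "no *.flnc.bam files in $dir" >&2; return 1; }
    echo "dataset create --type TranscriptSet merged.flnc.xml $*"
}

# Example (not executed here):
#   merge_flnc_command .
```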
143+
144+
### Step 4 - Clustering
145+
Compared to previous IsoSeq approaches, *IsoSeq3* uses a single clustering
146+
technique.
147+
Due to the nature of the algorithm, it can't be efficiently parallelized.
148+
It is advised to give this step as many cores as possible.
149+
The individual steps of *cluster* are as follows:
150+
151+
- Clustering using hierarchical n*log(n) [alignment](https://github.com/lh3/minimap2) and iterative cluster merging
152+
- Polished [POA](https://github.com/rvaser/spoa) sequence generation, using a QV guided consensus approach
153+
154+
**Input**
155+
The input file for *cluster* is one FLNC file:
156+
- `<movie>.flnc.bam` or `merged.flnc.xml`
157+
158+
**Output**
159+
The following output files of *cluster* contain polished isoforms:
160+
- `<prefix>.bam`
161+
- `<prefix>.hq.fasta.gz` with predicted accuracy ≥ 0.99
162+
- `<prefix>.lq.fasta.gz` with predicted accuracy < 0.99
163+
- `<prefix>.bam.pbi`
164+
- `<prefix>.transcriptset.xml`
165+
166+
Example invocation:
167+
168+
$ isoseq3 cluster merged.flnc.xml polished.bam --verbose --use-qvs
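Because *cluster* benefits from as many cores as possible, the thread count can be set explicitly. The sketch below prints such a call and assumes that `-j` sets the number of threads, which should be confirmed with `isoseq3 cluster --help`; the helper name `cluster_command` is hypothetical.

```shell
# Print a cluster call with an explicit thread count; nothing is executed.
# Assumption: -j sets the number of threads (verify with --help).
# Hypothetical helper for illustration.
cluster_command() {
    input="$1"                                   # FLNC BAM or merged XML
    output="$2"                                  # e.g. polished.bam
    cores="${3:-$(getconf _NPROCESSORS_ONLN)}"   # default: all online cores
    echo "isoseq3 cluster $input $output --verbose --use-qvs -j $cores"
}

# Example (not executed here):
#   cluster_command merged.flnc.xml polished.bam
```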
169+
170+
## Real-world example
171+
This is an example of an end-to-end command-line-only workflow to get from
172+
subreads to polished isoforms:
173+
174+
$ wget https://downloads.pacbcloud.com/public/dataset/RC0_1cell_2017/m54086_170204_081430.subreads.bam
175+
$ wget https://downloads.pacbcloud.com/public/dataset/RC0_1cell_2017/m54086_170204_081430.subreads.bam.pbi
176+
$ wget https://downloads.pacbcloud.com/public/dataset/RC0_1cell_2017/m54086_170204_081430.subreadset.xml
177+
178+
$ ccs --version
179+
ccs 4.0.0
180+
181+
$ ccs m54086_170204_081430.subreads.bam m54086_170204_081430.ccs.bam --min-rq 0.9
182+
183+
$ cat primers.fasta
184+
>primer_5p
185+
AAGCAGTGGTATCAACGCAGAGTACATGGGG
186+
>primer_3p
187+
AAGCAGTGGTATCAACGCAGAGTAC
188+
189+
$ lima --version
190+
lima 1.9.0 (commit v1.9.0)
191+
192+
$ lima m54086_170204_081430.ccs.bam primers.fasta m54086_170204_081430.fl.bam \
193+
--isoseq --peek-guess
194+
195+
$ ls m54086_170204_081430.fl*
196+
m54086_170204_081430.fl.json m54086_170204_081430.fl.lima.summary
197+
m54086_170204_081430.fl.lima.clips m54086_170204_081430.fl.primer_5p--primer_3p.bam
198+
m54086_170204_081430.fl.lima.counts m54086_170204_081430.fl.primer_5p--primer_3p.subreadset.xml
199+
m54086_170204_081430.fl.lima.report
200+
201+
$ isoseq3 refine m54086_170204_081430.fl.primer_5p--primer_3p.bam primers.fasta m54086_170204_081430.flnc.bam
202+
203+
$ ls m54086_170204_081430.flnc.*
204+
m54086_170204_081430.flnc.bam m54086_170204_081430.flnc.filter_summary.json
205+
m54086_170204_081430.flnc.bam.pbi m54086_170204_081430.flnc.report.csv
206+
m54086_170204_081430.flnc.consensusreadset.xml
207+
208+
$ isoseq3 cluster m54086_170204_081430.flnc.bam polished.bam --verbose --use-qvs
209+
Read BAM : (197791) 4s 20ms
210+
Convert to reads : 1s 431ms
211+
Sort Reads : 56ms 947us
212+
Aligning Linear : 2m 5s
213+
Read to clusters : 9s 432ms
214+
Aligning Linear : 54s 288ms
215+
Merge by mapping : 36s 138ms
216+
Consensus : 30s 126ms
217+
Merge by mapping : 5s 418ms
218+
Consensus : 3s 597ms
219+
Write output : 1s 134ms
220+
Complete run time : 4m 32s
221+
222+
$ ls polished*
223+
polished.bam polished.hq.fasta.gz
224+
polished.bam.pbi polished.lq.fasta.gz
225+
polished.cluster polished.transcriptset.xml
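A quick way to inspect the result is to count the transcripts in the high- and low-quality FASTA files. The following is a minimal sketch using only `gzip` and `grep`; the helper names are hypothetical, and the file names follow the `<prefix>.hq.fasta.gz` and `<prefix>.lq.fasta.gz` pattern listed in the cluster output.

```shell
# Count FASTA records in a gzip-compressed file such as polished.hq.fasta.gz.
count_fasta_records() {
    gzip -dc "$1" | grep -c '^>'
}

# Report HQ and LQ transcript counts for a cluster output prefix.
# Hypothetical helper; expects <prefix>.hq.fasta.gz and <prefix>.lq.fasta.gz.
summarize_cluster_output() {
    prefix="$1"
    hq="$(count_fasta_records "$prefix.hq.fasta.gz")"
    lq="$(count_fasta_records "$prefix.lq.fasta.gz")"
    echo "hq=$hq lq=$lq"
}

# Example (not executed here):
#   summarize_cluster_output polished
```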
226+
227+
## DISCLAIMER
228+
229+
THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.

doc/img/isoseq3.2-end-to-end.png

459 KB

doc/img/isoseq3.2-workflow.png

219 KB
