Commit 53648a7

committed
CCS 5.0.0
1 parent 442c7d7 commit 53648a7

35 files changed

+871
-405
lines changed

README.md

+4-405
Large diffs are not rendered by default.

docs/CNAME

+1
@@ -0,0 +1 @@
1+
ccs.how

docs/_config.yml

+28
@@ -0,0 +1,28 @@
1+
remote_theme: armintoepfer/just-the-docs
2+
3+
# Aux links for the upper right navigation
4+
aux_links:
5+
"File an issue":
6+
- "https://github.com/PacificBiosciences/pbbioconda/issues/new?template=bug_report.md"
7+
8+
# Makes Aux links open in a new tab. Default is false
9+
aux_links_new_tab: true
10+
11+
color_scheme: custom
12+
13+
# Footer content
14+
# appears at the bottom of every page's main content
15+
footer_content: "THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED \"AS IS,\" WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES."
16+
17+
# Footer last edited timestamp
18+
last_edit_timestamp: true # show or hide edit time - page must have `last_modified_date` defined in the frontmatter
19+
last_edit_time_format: "%b %e %Y at %I:%M %p" # uses ruby's time format: https://ruby-doc.org/stdlib-2.7.0/libdoc/time/rdoc/Time.html
20+
21+
# Footer "Edit this page on GitHub" link text
22+
gh_edit_link: false # show or hide edit this page link
23+
24+
25+
title: "CCS Docs"
26+
tagline: "Generate Highly Accurate Single-Molecule Consensus Reads (HiFi Reads)"
27+
28+
search_enabled: false

docs/_sass/color_schemes/custom.scss

+5
@@ -0,0 +1,5 @@
1+
$link-color: $blue-000;
2+
$content-width: 900px;
3+
$nav-width: 224px;
4+
$nav-width-md: 200px;
5+
$sidebar-color: $grey-lt-000;

docs/changelog.md

+66
@@ -0,0 +1,66 @@
1+
---
2+
layout: default
3+
title: Changelog
4+
nav_order: 99
5+
---
6+
7+
# Version changelog
8+
9+
**5.0.0**
10+
* SMRT Link v10.0 release
11+
* Add `--hifi-kinetics` to average kinetic information for polished reads
12+
* Add `--all-kinetics` to add kinetic information for all ZMWs, except for unpolished draft consensus
13+
* Add `--subread-fallback`; combined with `--all`, it uses a subread instead of a draft as the representative consensus
14+
* Use sDUST to identify tandem repeats
15+
* Output HiFi yield (>= Q20) and Unique Molecular Yield as INFO log
16+
* Set `--top-passes 60` default
17+
* Abort if chemistry information is missing in BAM header
18+
* Add non-blocking temporary file writing
19+
* Add `--input-buffer` to smooth IO fluctuations
20+
* Add `--all` to generate one representative read per ZMW
21+
* Reuse prefix of output file for report files to avoid unintentional clobbering
22+
* Add `zmw_metrics.json`, metrics about each ZMW; file name can be set with `--metrics-json`
23+
* Add JSON output of ccs_reports via `--report-json`
24+
* Add `--suppress-reports` to suppress generating default report and metric files
25+
26+
4.2.0
27+
* SMRT Link v9.0 release
28+
* Speed improvements
29+
* Minor yield improvements, by requiring a percentage of subreads mapping back to draft instead of `--min-passes`
30+
* Add effective coverage `ec` tag
31+
* Lowering `--min-passes` no longer reduces yield
32+
* Add `--batch-size` to better saturate machine with high core counts
33+
* Simplify log output
34+
* Fix bug in predicted accuracy calculation
35+
* Improved `ccs_report.txt` summary
36+
37+
4.1.0
38+
* Minor speed improvements
39+
* Fix `--by-strand` logic, see more [here](https://ccs.how/faq/mode-by-strand)
40+
* Allow vanilla `.xml` output without specifying dataset type
41+
* Compute wall start/end for each output read (future basecaller functionality)
42+
43+
4.0.0
44+
* SMRT Link v8.0 release
45+
* Speed improvements
46+
* Removed support for legacy Python Genomic Consensus, please use [gcpp](https://github.com/PacificBiosciences/gcpp)
47+
* New command-line interface
48+
* New report file
49+
50+
3.4.1
51+
* SMRT Link v7.0 release
52+
* Log used chemistry model to INFO level
53+
54+
3.4.0
55+
* Fixes to unpolished mode for IsoSeq
56+
* Improve runtime when `--minPredictedAccuracy` has been increased
57+
58+
3.3.0
59+
* Add a windowing approach to reduce computational complexity from quadratic to linear
60+
* Improve multi-threading framework to increase throughput
61+
* Enhance XML output, propagate `CollectionMetadata`
62+
* Includes latest chemistry parameters
63+
64+
3.1.0
65+
* Add `--maxPoaCoverage` to decrease runtime for unpolished output, special parameter for IsoSeq workflow
66+
* Chemistry parameters for SMRT Link v6.0

docs/faq/accuracy-vs-passes.md

+53
@@ -0,0 +1,53 @@
1+
---
2+
layout: default
3+
parent: FAQ
4+
title: Accuracy vs. passes
5+
---
6+
7+
## What impacts the number and quality of HiFi reads that are generated?
8+
The longer the polymerase read gets, the more passes of the SMRTbell
9+
are produced and consequently more evidence is accumulated per molecule.
10+
This increase in evidence translates into higher consensus accuracy, as
11+
depicted in the following plot:
12+
13+
<p align="center"><img width="600px" src="../img/ccs-acc.png"/></p>
14+
15+
## How is number of passes computed?
16+
Each read is annotated with a `np` tag that contains the number of
17+
full-length subreads used for polishing. Full-length subreads are flanked by
18+
adapters and thus cover the full insert.
19+
Since the first version of _ccs_, number of passes has only accounted for
20+
full-length subreads. In version v3.3.0, windowing was added, which
21+
takes the minimum number of full-length subreads across all windows.
22+
Starting with version v4.0.0, the minimum was replaced with the mode to get a
23+
better representation across all windows. Only subreads that pass the subread
24+
length filter (please see next FAQ about filters) and were not dropped during
25+
polishing are counted.
26+
27+
Similarly, the tag `ec` reports effective coverage, the average subread coverage
28+
across all windows. This metric includes all subreads, independent of being
29+
full- or partial-length subreads, that pass length filters and did not fail
30+
during polishing. In most cases `ec` will be roughly `np + 1`.
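
The difference between the minimum and the mode across windows, and the averaging behind `ec`, can be illustrated with a small sketch. The per-window counts below are made up for illustration, and the function names are hypothetical; this is not the _ccs_ implementation.

```python
from statistics import mean, mode

# Hypothetical per-window coverage for one ZMW (illustrative numbers only).
full_length_per_window = [8, 8, 7, 8, 6]   # full-length subreads per window
all_subreads_per_window = [9, 9, 8, 9, 8]  # full- and partial-length subreads per window


def np_minimum(full_length):
    """v3.3.0 behaviour as described above: minimum across all windows."""
    return min(full_length)


def np_mode(full_length):
    """v4.0.0 behaviour as described above: mode across all windows."""
    return mode(full_length)


def effective_coverage(all_subreads):
    """`ec` as described above: average subread coverage across all windows."""
    return mean(all_subreads)
```

With these numbers, a single poorly covered window pulls the minimum down to 6, while the mode stays at 8 and the average coverage is 8.6, which is close to `np + 1`.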
31+
32+
## Why do I get more yield if I increase `--min-passes`?
33+
For versions newer than 3.0.0 and older than 4.2.0, we required that after
34+
draft generation, at least `--min-passes` subreads map back to the draft.
35+
Imagine the following scenario: a ZMW with 10 subreads generates a draft to which
36+
only a single subread aligns. This draft is of low quality and does not
37+
represent the ZMW, yet if you ask for `--min-passes 1`, this low-quality draft
38+
is used. Starting with version 4.2.0, we switched to an additional
39+
percentage threshold of more than 50% aligning subreads to avoid this problem.
40+
This fixes the majority of discrepancies for fewer than three passes.
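
The scenario above can be written down as two small predicates. This is a minimal sketch with hypothetical function names; how _ccs_ combines the percentage threshold with its other filters is not modeled here.

```python
def passes_min_passes_rule(n_aligned: int, min_passes: int) -> bool:
    """Rule before 4.2.0: at least `min_passes` subreads map back to the draft."""
    return n_aligned >= min_passes


def passes_fraction_rule(n_aligned: int, n_subreads: int,
                         min_fraction: float = 0.5) -> bool:
    """Rule since 4.2.0: more than 50% of the subreads align to the draft."""
    if n_subreads <= 0:
        return False
    return n_aligned / n_subreads > min_fraction
```

For the ZMW with 10 subreads and a single aligning subread, `passes_min_passes_rule(1, 1)` accepts the low-quality draft, whereas `passes_fraction_rule(1, 10)` rejects it.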
41+
42+
Why do we have this problem at all, shouldn't the draft stage be robust enough?
43+
Robustness comes with inherent speed trade-offs. We have a cascade of different draft
44+
generators, from very fast and unstable to slow and robust. If a ZMW fails
45+
to generate a draft with a fast generator, it falls back multiple times until it
46+
reaches the slower and more robust generator. This approach is still much faster
47+
than always relying on the robust generator.
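
The fallback idea can be sketched as a loop over generators ordered from fast to robust. The two toy generators below are placeholders invented for this example and have nothing to do with the actual draft algorithms in _ccs_; only the control flow is the point.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple

# A draft generator returns a draft sequence, or None if it fails.
DraftGenerator = Callable[[Sequence[str]], Optional[str]]


def fast_generator(subreads: Sequence[str]) -> Optional[str]:
    """Toy 'fast and unstable' stage: succeeds only if all subreads agree."""
    if subreads and len(set(subreads)) == 1:
        return subreads[0]
    return None


def robust_generator(subreads: Sequence[str]) -> Optional[str]:
    """Toy 'slow and robust' stage: picks the most common subread."""
    if not subreads:
        return None
    return Counter(subreads).most_common(1)[0][0]


# Ordered from fastest to most robust (hypothetical stage names).
CASCADE: List[Tuple[str, DraftGenerator]] = [
    ("fast", fast_generator),
    ("robust", robust_generator),
]


def generate_draft(subreads: Sequence[str],
                   cascade: List[Tuple[str, DraftGenerator]] = CASCADE):
    """Try each generator in order and return the first successful draft."""
    for name, generator in cascade:
        draft = generator(subreads)
        if draft is not None:
            return name, draft
    return None  # every stage failed
```

Most inputs are handled by the first stage, so the slow stage only runs for the few ZMWs that need it.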
48+
49+
## Is there an upper limit on number of passes used?
50+
By default, _ccs_ uses at most the top 60 full-length passes after sorting
51+
by median length.
52+
Beyond this threshold, it has been shown that quality does not improve.
53+
You can change this limit with `--top-passes`, where `0` means unlimited.

docs/faq/bam-output.md

+55
@@ -0,0 +1,55 @@
1+
---
2+
layout: default
3+
parent: FAQ
4+
title: BAM output
5+
---
6+
7+
## What BAM tags are generated?
8+
9+
| Tag | Type | Description |
10+
| :---: | :---: | ----------- |
11+
| `ec` | `f` | [Effective coverage](/faq/accuracy-vs-passes#how-is-number-of-passes-computed)|
12+
| `fi` | `B,C` | [Forward IPD (codec V1)](/faq/kinetics)|
13+
| `fn` | `i` | [Forward number of complete passes (zero or more)](/faq/kinetics)|
14+
| `fp` | `B,C` | [Forward PulseWidth (codec V1)](/faq/kinetics)|
15+
| `np` | `i` | [Number of full-length subreads](/faq/accuracy-vs-passes#how-is-number-of-passes-computed)|
16+
| `ri` | `B,C` | [Reverse IPD (codec V1)](/faq/kinetics)|
17+
| `rn` | `i` | [Reverse number of complete passes (zero or more)](/faq/kinetics)|
18+
| `rp` | `B,C` | [Reverse PulseWidth (codec V1)](/faq/kinetics)|
19+
| `rq` | `f` | [Predicted average read accuracy](/how-does-ccs-work#9-qv-calculation)|
20+
| `sn` | `B,f` | Signal-to-noise ratios for each nucleotide|
21+
| `zm` | `i` | ZMW hole number |
22+
| `RG` | `z` | Read group |
23+
24+
25+
## How does the output BAM file size scale with yield?
26+
For each base, the output BAM file size scales as follows:
27+
- 0.5 byte/base for the actual base (4-bit encoding)
28+
- 1 byte/base for the QV
29+
- 1 byte/base for the forward PW
30+
- 1 byte/base for the forward IPD
31+
- 1 byte/base for the reverse PW
32+
- 1 byte/base for the reverse IPD
33+
34+
For a normal _ccs_ run without kinetics, the upper bound is 1.5 bytes/base.
35+
If _ccs_ is run **with** kinetics, the upper bound is 5.5 bytes/base.
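
The two upper bounds follow directly from the per-base list above. A minimal sketch of the arithmetic, with hypothetical names and without the fixed per-read overhead or compression:

```python
# Per-base contributions taken from the list above (uncompressed).
BASE_AND_QV_BYTES = {"base": 0.5, "qv": 1.0}
KINETICS_BYTES = {"fwd_pw": 1.0, "fwd_ipd": 1.0, "rev_pw": 1.0, "rev_ipd": 1.0}


def bytes_per_base(kinetics: bool = False) -> float:
    """Uncompressed upper bound in bytes per base."""
    total = sum(BASE_AND_QV_BYTES.values())
    if kinetics:
        total += sum(KINETICS_BYTES.values())
    return total


def uncompressed_upper_bound(n_bases: int, kinetics: bool = False) -> float:
    """Uncompressed upper bound in bytes for `n_bases` output bases."""
    return n_bases * bytes_per_base(kinetics)
```

For example, one billion output bases are bounded by 1.5 GB without kinetics and by 5.5 GB with kinetics, before compression.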
36+
37+
Per-read meta information adds a fixed amount of 32 bytes per read:
38+
- `ec`,`rq` : float, each 4 bytes
39+
- `sn`: float array, 4x4 bytes
40+
- `np`, `zm`: int32_t, each 4 bytes
41+
- `RG`: string of length 8, 8x1 bytes
42+
43+
The actual output BAM that _ccs_ generates is compressed. Compression is
44+
data-dependent and because of that, upper bounds can't be provided.
45+
For a 19 kb insert library and 30h movie time, the _ccs_ BAM files scale on
46+
average with:
47+
48+
| BAM name | Options | Bytes/<br>Base | Bytes/<br>HiFiBase | Example<br>(GBytes) | Example<br>(GBytes) |
49+
| -------------------- | ------------------------------------------ | :------------: | :----------------: | :-----------------: | :-----------------: |
50+
| hifi.bam | | 0.7 | 0.7 | 100 | 63 |
51+
| hifi.hifikin.bam | `--hifi-kinetics` | 3.7 | 3.7 | 528 | 336 |
52+
| reads.bam | `--all` | 0.55 | 1.1 | 157 | 100 |
53+
| reads.hifikin.bam | `--all --hifi-kinetics` | 2.3 | 4.5 | 642 | 409 |
54+
| reads.allkin.bam | `--all --all-kinetics` | 2.9 | 5.7 | 814 | 518 |
55+
| reads.allkin.sub.bam | `--all --all-kinetics --subread-fallback` | 3.0 | 5.8 | 828 | 527 |
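
The "Bytes/HiFiBase" column can be used for a rough size estimate from an expected HiFi yield. The sketch below only multiplies the table values; the function name is hypothetical, and the averages apply to the library and movie time stated above, so other data may compress differently.

```python
# Average compressed bytes per HiFi base, copied from the table above.
BYTES_PER_HIFI_BASE = {
    "": 0.7,
    "--hifi-kinetics": 3.7,
    "--all": 1.1,
    "--all --hifi-kinetics": 4.5,
    "--all --all-kinetics": 5.7,
    "--all --all-kinetics --subread-fallback": 5.8,
}


def estimate_bam_gbytes(hifi_gbases: float, options: str = "") -> float:
    """Rough BAM size in GBytes for a HiFi yield given in Gbases."""
    return hifi_gbases * BYTES_PER_HIFI_BASE[options]
```

For instance, `estimate_bam_gbytes(100, "--hifi-kinetics")` gives roughly 370 GBytes for 100 Gbases of HiFi yield, compared with roughly 70 GBytes without kinetics.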

docs/faq/bioconda-binary.md

+25
@@ -0,0 +1,25 @@
1+
---
2+
layout: default
3+
parent: FAQ
4+
title: Bioconda binary
5+
---
6+
7+
## The binary does not work on my Linux system!
8+
Contrary to official SMRT Link releases, the `ccs` binary distributed via bioconda
9+
is tuned for performance while sacrificing backward compatibility.
10+
We are aware of the following errors and limitations. If yours is not listed, please
11+
file an issue on our [official pbbioconda page](https://github.com/PacificBiosciences/pbbioconda).
12+
13+
**`Illegal instruction`** Your CPU is not supported.
14+
A modern (post-2008) CPU with support for
15+
[SSE4.1 instructions](https://en.wikipedia.org/wiki/SSE4#SSE4.1) is required.
16+
SMRT Link also has this requirement.
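
On Linux, the CPU feature flags are listed in `/proc/cpuinfo`, where SSE4.1 appears as `sse4_1`. The following is a small sketch for checking this before installing; the function name is hypothetical, and it assumes an x86 Linux system with the usual `flags` line.

```python
def has_sse41(cpuinfo_text: str) -> bool:
    """Return True if the `flags` line of /proc/cpuinfo lists `sse4_1`."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Flags are whitespace separated; match the exact token.
            return "sse4_1" in value.split()
    return False
```

A typical call would be `has_sse41(open("/proc/cpuinfo").read())`.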
17+
18+
**`FATAL: kernel too old`** Your OS or rather your kernel version is not supported.
19+
Since CCS v4.2, we also ship a second binary via bioconda, `ccs-alt`, which does
20+
not bundle a newer `glibc`. Please use this alternative binary.
21+
22+
For CCS v5.0, we offer two binaries in bioconda:
23+
24+
* `ccs`, statically links `glibc` v2.32 and `mimalloc` v1.3.0.
25+
* `ccs-alt`, was built by dynamically linking `glibc` v2.12, but statically links `mimalloc` v1.3.0.

docs/faq/chemistry.md

+56
@@ -0,0 +1,56 @@
1+
---
2+
layout: default
3+
parent: FAQ
4+
title: Chemistry
5+
---
6+
7+
## Help! I am getting "Unsupported ..."!
8+
If you encounter the error `Unsupported chemistries found: (...)` or
9+
`unsupported sequencing chemistry combination`, your _ccs_ binaries do not
10+
support the used sequencing chemistry kit, from here on referred to as "chemistry".
11+
This may be because we removed support for an older chemistry or your binary predates
12+
release of the used chemistry.
13+
This is unlikely to happen with _ccs_ from SMRT Link installations, as SMRT Link
14+
is able to automatically update and install new chemistries.
15+
Thus, the easiest solution is to always use _ccs_ from the SMRT Link version that
16+
shipped with the release of the sequencing chemistry kit.
17+
18+
**Old chemistries:**
19+
With _ccs_ 4.0.0, we have removed support for the last RSII chemistry `P6-C4`.
20+
The only option is to downgrade _ccs_ with `conda install pbccs==3.4`.
21+
22+
**New chemistries:**
23+
It might happen that your _ccs_ version predates the sequencing chemistry kit.
24+
To fix this, install the latest version of _ccs_ with `conda update --all`.
25+
If you are an early access user, follow the [monkey patch tutorial](/faq/chemistry#monkey-patch-ccs-to-support-additional-sequencing-chemistry-kits).
26+
27+
## Monkey patch _ccs_ to support additional sequencing chemistry kits
28+
Please create a directory that is used to inject new chemistry information
29+
into _ccs_:
30+
31+
```sh
32+
mkdir -p /path/to/persistent/dir/
33+
cd /path/to/persistent/dir/
34+
export SMRT_CHEMISTRY_BUNDLE_DIR="${PWD}"
35+
mkdir -p arrow
36+
```
37+
38+
Execute the following step-by-step instructions to fix the error you are observing
39+
and afterwards proceed using _ccs_ as you would normally do. Additional chemistry
40+
information is automatically loaded from the `${SMRT_CHEMISTRY_BUNDLE_DIR}`
41+
environment variable.
42+
43+
### Error: "unsupported sequencing chemistry combination"
44+
Please download the latest out-of-band `chemistry.xml`:
45+
46+
```sh
47+
wget https://raw.githubusercontent.com/PacificBiosciences/pbcore/develop/pbcore/chemistry/resources/mapping.xml -O "${SMRT_CHEMISTRY_BUNDLE_DIR}"/chemistry.xml
48+
```
49+
50+
### Error: "Unsupported chemistries found: (...)"
51+
Please get the latest consensus model `.json` from PacBio and
52+
copy it to:
53+
54+
```sh
55+
cp /some/download/dir/model.json "${SMRT_CHEMISTRY_BUNDLE_DIR}"/arrow/
56+
```
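
Before running _ccs_, it can help to confirm that the bundle directory has the layout built up in the steps above. The helper below is a hypothetical sketch that only checks for the presence of the files; it does not validate their content.

```python
from pathlib import Path


def check_chemistry_bundle(bundle_dir: str) -> dict:
    """Report which of the expected files exist in the bundle directory."""
    root = Path(bundle_dir)
    arrow_dir = root / "arrow"
    return {
        # Out-of-band chemistry mapping, saved as chemistry.xml.
        "chemistry.xml": (root / "chemistry.xml").is_file(),
        # At least one consensus model .json inside arrow/.
        "arrow/*.json": arrow_dir.is_dir() and any(arrow_dir.glob("*.json")),
    }
```

A typical call would be `check_chemistry_bundle(os.environ["SMRT_CHEMISTRY_BUNDLE_DIR"])` after `import os`; any `False` entry points at the step that still needs to be done.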

docs/faq/index.md

+8
@@ -0,0 +1,8 @@
1+
---
2+
layout: default
3+
title: FAQ
4+
nav_order: 4
5+
has_children: true
6+
---
7+
8+
# FAQ

docs/faq/kinetics.md

+25
@@ -0,0 +1,25 @@
1+
---
2+
layout: default
3+
parent: FAQ
4+
title: Kinetics
5+
---
6+
7+
## Is it possible to use HiFi reads to call base modifications?
8+
Base modifications can be inferred from per-base pulse width (PW) and
9+
inter-pulse duration (IPD) kinetics.
10+
Running _ccs_ with `--hifi-kinetics` generates averaged kinetic information
11+
for polished reads, independently for both strands of the insert.
12+
Forward is defined with respect to the orientation represented in ``SEQ`` and
13+
is considered to be the native orientation. As with other PacBio-specific
14+
tags, aligners will not re-orient these fields.
15+
16+
Minor cases exist where a certain orientation may get filtered out entirely
17+
from a ZMW, preventing valid values from being passed for that record. In
18+
these cases, empty lists will be passed for the respective record/orientation
19+
and number of passes will be set to zero.
20+
21+
In order to facilitate the use of HiFi reads with base modifications workflows,
22+
we have added an executable in pbbam called `ccs-kinetics-bystrandify` which
23+
creates a pseudo `--by-strand` BAM with corresponding `pw` and `ip` tags
24+
that imitates a normal, unaligned subreads BAM. You can install pbbam from
25+
Bioconda by calling `conda install pbbam`.

docs/faq/licenses.md

+12
@@ -0,0 +1,12 @@
1+
---
2+
layout: default
3+
parent: FAQ
4+
title: Licenses
5+
---
6+
7+
# Licenses
8+
PacBio® tool _ccs_, distributed via Bioconda, is licensed under
9+
[BSD-3-Clause-Clear](https://spdx.org/licenses/BSD-3-Clause-Clear.html)
10+
and statically links GNU C Library v2.32 licensed under [LGPL](https://spdx.org/licenses/LGPL-2.1-only.html).
11+
Per LGPL 2.1 subsection 6c, you are entitled to request the complete
12+
machine-readable work that uses glibc in object code.

docs/faq/low-complexity.md

+22
@@ -0,0 +1,22 @@
1+
---
2+
layout: default
3+
parent: FAQ
4+
title: Low complexity
5+
---
6+
7+
## Does CCS dislike low-complexity regions?
8+
Low-complexity comes in many shapes and forms.
9+
A particular challenge for _ccs_ are highly enriched tandem repeats, like
10+
hundreds of copies of `AGGGGT`.
11+
Prior to _ccs_ v5.0, inserts with many copies of a small repeat would likely not generate
12+
a consensus sequence.
13+
Since _ccs_ v5.0, every ZMW is tested for whether it contains a tandem repeat
14+
of length `--min-tandem-repeat-length 1000`.
15+
For this, we use [symmetric DUST](https://doi.org/10.1089/cmb.2006.13.1028)
16+
and in particular this [sdust](https://github.com/lh3/sdust) implementation,
17+
but slightly modified.
18+
If a ZMW is flagged as a tandem repeat, internally `--disable-heuristics`
19+
is activated for only this ZMW, and various filters that are known to exclude
20+
low-complexity sequences are disabled.
21+
This recovers most of the low-complexity consensus sequences, without impacting
22+
run time performance.
