Update tool documentation
armintoepfer committed Apr 11, 2023
1 parent 62c6adb commit d9b1fa9
Showing 1 changed file with 97 additions and 11 deletions.
108 changes: 97 additions & 11 deletions README.md
@@ -29,28 +29,114 @@ that those binaries are longer from `pbbam` directly.
## Usage

### `bam2fastx`
Tools `bam2fasta` and `bam2fastq` have identical interfaces and convert one or more PacBio BAM and/or DataSet XML files into a compressed FASTA or FASTQ file, respectively:
```
bam2fasta -o projectName m54008_160330_053509.subreads.bam
bam2fastq -o myEcoliRuns m54008_160330_053509.subreads.bam m54008_160331_235636.subreads.bam
bam2fasta -o myHumanGenome m54012_160401_000001.subreadset.xml
# generates out.fasta.gz
bam2fasta -o out in.bam
bam2fasta -o out in.xml
# generates out.fastq.gz
bam2fastq -o out in_1.bam in_2.bam in_3.xml in_4.bam
```
Option `-u` disables compression (and drops the `.gz` extension), while option `-c <int>` sets the gzip compression level.

Option `-p/--seqid-prefix <str>` adds the provided prefix to each sequence header.

Additionally, the reads from the input files can be split by barcode pair into multiple output files:
```
# generates multiple out.{barcode}_{barcodePair}.fasta.gz
bam2fasta --split-barcodes -o out in1.bam in2.bam
```
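
As a quick sanity check on a compressed FASTA output, the records can be counted without decompressing to disk. This is a minimal sketch using only standard tools; the function name `count_fasta_records` is hypothetical and not part of `bam2fastx`, and it assumes a gzip-compressed FASTA file such as the `out.fasta.gz` from the examples above:

```shell
# Hypothetical helper (not part of bam2fastx): count FASTA records
# in a gzip-compressed file by counting header lines starting with '>'.
count_fasta_records() {
    gzip -dc "$1" | grep -c '^>'
}

# Usage, assuming out.fasta.gz was produced by `bam2fasta -o out in.bam`:
# count_fasta_records out.fasta.gz
```

This only applies to FASTA; FASTQ records span four lines each and need a different count.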

### `ccs-kinetics-bystrandify`

Converts a PacBio BAM or DataSet XML file containing CCS kinetics tags to a pseudo-bystrand file with `pw` and `ip` tags that can be used as a substitute for subreads in applications expecting such kinetics information:
```
ccs-kinetics-bystrandify in.bam out.bam
ccs-kinetics-bystrandify in.xml out.xml
```

Option `--min-coverage <int>` specifies the minimum number of passes per strand (tags `fn` and `rn`) for creating a strand-specific read.

### `extracthifi`

Simple tool for extracting reads with accuracy above QV 20 (0.99) from a given BAM file:
```
extracthifi in.bam out.bam
```

### `pbindex`

Minimalistic tool which creates an index file that enables random access into PacBio BAM files:
```
# generates in.bam.pbi
pbindex in.bam
```

### `pbindexdump`

Tool which transforms PBI files to JSON or C++ format:
```
pbindexdump in.bam.pbi > out.json
pbindexdump --format cpp in.bam.pbi > out.cpp
```

Option `--json-indent-level <int>` defines the indentation of the JSON file, while option `--json-raw` modifies the output JSON file to more closely reflect the PBI file format.

Alternatively, hole numbers in plain text can be reported with:
```
pbindexdump --zmws-only in.bam.pbi > out.txt
```
**Note:** for subreads, the output text file can contain the same hole number multiple times (as opposed to `zmwfilter --show-all`, which reports only unique ones).
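
If unique hole numbers are needed from the `pbindexdump --zmws-only` output, the text file can be deduplicated with standard tools. A minimal sketch, assuming one hole number per line as described above; the function name `unique_zmws` is hypothetical:

```shell
# Hypothetical helper: print each hole number once, in numeric order.
# Assumes the input file holds one hole number per line.
unique_zmws() {
    sort -n -u "$1"
}

# Usage, assuming out.txt came from `pbindexdump --zmws-only in.bam.pbi`:
# unique_zmws out.txt > unique.txt
```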

### `pbmerge`

Simple tool which merges several PacBio BAM files. The inputs can be listed on the command line, or supplied via a DataSet XML or a file containing one file name per line:
```
pbmerge in1.bam in2.bam in3.bam > out.bam
pbmerge -o out.bam in.xml
pbmerge in.fofn > out.bam
```
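
A file with one file name per line, like the `in.fofn` used above, can be written with `printf`. This is a minimal sketch; the helper name `make_fofn` is hypothetical and the BAM names are placeholders taken from the examples:

```shell
# Hypothetical helper: write the given paths, one per line, into a list file.
make_fofn() {
    out="$1"
    shift
    printf '%s\n' "$@" > "$out"
}

make_fofn in.fofn in1.bam in2.bam in3.bam

# The list can then be passed to pbmerge as in the example above:
# pbmerge in.fofn > out.bam
```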

Option `--no-pbi` disables creation of the index file.

### `zmwfilter`
Utility tool for filtering PacBio BAM, DataSet XML or FASTX files. Plain filtering on ZMW hole numbers is supported for any input format, given that the output format is the same, via either an include list or an exclude list. The list can be given either as a comma-separated list on the command line or as a single file containing one hole number per line:
```
# ID list from command line:
zmwfilter --include 1,2,4,8,16 in.bam out.bam
zmwfilter --exclude 42 in.xml out.bam
# ID list from file:
zmwfilter --include hole_numbers.txt in.fasta out.fasta
zmwfilter --exclude hole_numbers.txt in.xml out.fastq
```
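
The two list forms are interchangeable: a file with one hole number per line can be turned into the comma-separated form with `paste`. A minimal sketch; the function name `zmw_csv` is hypothetical, and the file is assumed to contain only hole numbers, one per line:

```shell
# Hypothetical helper: join the lines of a hole-number file with commas.
zmw_csv() {
    paste -s -d, "$1"
}

# Usage, passing the result as a command-line include list:
# zmwfilter --include "$(zmw_csv hole_numbers.txt)" in.bam out.bam
```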

ZMW hole numbers present in a PacBio file can be obtained with option `--show-all` and without providing an output file:
```
zmwfilter --show-all in.bam > out.txt
```

**Note:** Functionality described below is for BAM and DataSet XML files only.

Filtering reads by their names can be achieved by providing a file which contains one read name per line (following PacBio query template name convention):
```
zmwfilter --names read_names.txt in.bam out.bam
```
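
A read-name list can also be reduced to a hole-number list when whole ZMWs should be kept instead of individual reads. This sketch assumes read names of the form `movieName/holeNumber/...` (for example `movie/8/0_100` or `movie/4/ccs`), so the hole number is the second `/`-separated field; the function name `names_to_zmws` is hypothetical:

```shell
# Hypothetical helper: extract unique hole numbers from a read-name list.
# Assumes names look like movieName/holeNumber/..., one per line.
names_to_zmws() {
    cut -d/ -f2 "$1" | sort -n -u
}

# Usage: build an include list of ZMWs from the read names.
# names_to_zmws read_names.txt > hole_numbers.txt
# zmwfilter --include hole_numbers.txt in.bam out.bam
```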

BAM files can also be randomly downsampled to a provided number of ZMWs or to a fraction of the total count (for reproducibility use a fixed seed):
```
zmwfilter --downsample 0.333 in.xml out.bam
zmwfilter --downsample-count 1024 --downsample-seed 42 in.bam out.bam
```

Additionally, filtering can be constrained by providing a minimum number of passes (incompatible with `--names <str>`):
```
zmwfilter --num-passes 2 --include hole_numbers.txt in.bam out.bam
zmwfilter --num-passes 4 --downsample 0.333 in.bam out.bam
```

**Note:** options `--include <str>`, `--exclude <str>`, `--show-all`, `--names <str>`, `--downsample <float>` and `--downsample-count <int>` are all mutually exclusive!

## Changelog
