
SAMTOOLS Output Formats

Hannes Hauswedell edited this page Dec 22, 2016 · 8 revisions

Since version 0.9.2, Lambda also supports the SAM and BAM formats (.sam and .bam). Because SAM and BAM were not originally designed for local alignments, especially of protein sequences, this document describes how Lambda implements the standard.

Please see the official specification if some of the terms used here are not clear to you.

| column | use in Lambda |
|--------|---------------|
| QNAME | name of the query sequence, truncated at first whitespace |
| FLAG | bit 16 and bit 256 implemented in a standard-conforming way |
| RNAME | name of the subject sequence, truncated at first whitespace |
| POS | begin position of the alignment on the subject sequence: on the original untranslated DNA sequence for TBlastN and TBlastX (end position if negative strand); on the protein sequence for BlastP and BlastX |
| MAPQ | 255 |
| CIGAR | query DNA cigar (untranslated DNA sequence for BlastX, TBlastX); * for BlastP, TBlastN; reversed if negative strand/frame |
| RNEXT | * |
| PNEXT | 0 |
| TLEN | 0 |
| SEQ | query DNA sequence (untranslated DNA sequence for BlastX, TBlastX); * for BlastP, TBlastN; reverse-complemented if negative strand/frame; see below for clipping |
| QUAL | * |
| OPT | see below |
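
As a rough illustration of the column layout above, the following Python sketch splits one tab-separated SAM alignment line into its mandatory columns and decodes the two FLAG bits Lambda sets. The `SamRecord` class, the `parse_sam_line` function and the example line are hypothetical and only serve this illustration; they are not part of Lambda.

```python
from dataclasses import dataclass

FLAG_REVERSE = 16     # SAM bit 0x10: SEQ is reverse-complemented
FLAG_SECONDARY = 256  # SAM bit 0x100: secondary alignment


@dataclass
class SamRecord:
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    seq: str
    optional: list  # raw TAG:TYPE:VALUE strings

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & FLAG_REVERSE)

    @property
    def is_secondary(self) -> bool:
        return bool(self.flag & FLAG_SECONDARY)


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line into the 11 mandatory columns plus optional fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamRecord(
        qname=cols[0], flag=int(cols[1]), rname=cols[2], pos=int(cols[3]),
        mapq=int(cols[4]), cigar=cols[5], seq=cols[9], optional=cols[11:],
    )


# Made-up line in the shape described above (CIGAR and SEQ are * as for BlastP).
example = "read1\t272\tsubj1\t100\t255\t*\t*\t0\t0\t*\t*\tAS:i:87"
record = parse_sam_line(example)
print(record.is_reverse, record.is_secondary, record.pos)
```

RNEXT, PNEXT, TLEN and QUAL are skipped in the record because Lambda always writes the constants listed in the table.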

Sequence strings

Following the recommendations of the specification, the SEQ field is only written if it differs from the previous line's SEQ field. This can be changed via Lambda's command line parameter --sam-bam-seq, which can be set to always or never (the latter saves more space). This behaviour also applies to the ZQ tag defined below.
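
A downstream tool that needs the sequence on every line has to carry it forward itself. The sketch below shows one way to do that in Python. It assumes that an omitted SEQ is written as `*` (the SAM placeholder for "not stored") and that all records of one query are consecutive; both are assumptions about the output, and `fill_omitted_seq` is a hypothetical helper, not a Lambda function.

```python
def fill_omitted_seq(records):
    """Yield (qname, seq) pairs, replacing '*' with the last SEQ written for the same query.

    `records` is an iterable of (qname, seq) pairs in file order.
    """
    last_qname, last_seq = None, "*"
    for qname, seq in records:
        if qname != last_qname:
            # New query: never reuse a sequence that belonged to another query.
            last_qname, last_seq = qname, "*"
        if seq != "*":
            last_seq = seq
        yield qname, last_seq


rows = [("q1", "ACGT"), ("q1", "*"), ("q2", "GGCC"), ("q2", "*")]
print(list(fill_omitted_seq(rows)))
```

In modes where SEQ is always `*` (BlastP, TBlastN) there is nothing to carry forward, and the helper passes `*` through unchanged.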

Clipping

Via the --sam-bam-clip parameter you can choose between hard-clipping and soft-clipping. Soft-clipping results in full sequences in the SEQ and ZQ fields, while hard-clipping only shows the locally matching part. Depending on this choice, the CIGAR strings also contain H or S characters. Hard-clipping is the default because it takes up less space.

Please be aware that if the query sequence is translated, those DNA positions that are lost because of frame shifts or incomplete frames (at the end of a sequence) are always hard-clipped. These positions are also not represented in the protein cigar (see the ZQ tag below).
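
To make the difference between the two clipping modes concrete, here is a small Python sketch that counts clipped positions in a CIGAR string. It relies only on general SAM semantics: H positions are absent from SEQ, while S positions are present but unaligned. The function names and the example CIGAR strings are made up for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_CONSUMING = set("MIS=X")  # operations whose bases appear in SEQ


def parse_cigar(cigar):
    """Return a list of (length, operation) pairs; '*' yields an empty list."""
    if cigar == "*":
        return []
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Re-assemble the string to detect characters the pattern skipped over.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def clipping_summary(cigar):
    """Count hard- and soft-clipped positions and the expected SEQ length."""
    ops = parse_cigar(cigar)
    return {
        "hard_clipped": sum(n for n, op in ops if op == "H"),
        "soft_clipped": sum(n for n, op in ops if op == "S"),
        "seq_length": sum(n for n, op in ops if op in QUERY_CONSUMING),
    }


print(clipping_summary("10H20M5H"))  # hard-clipped: SEQ holds only the 20 aligned bases
print(clipping_summary("10S20M5S"))  # soft-clipped: SEQ holds all 35 bases
```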

Optional tags

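| category | tag (lambda) | tag (lambda2) | description |
|----------|--------------|---------------|-------------|
| official | AS | AS | bit score |
| official | OC | OC | query protein cigar (* for BLASTN) |
| official | NM | NM | edit distance (in protein space unless BLASTN) |
| official | IH | IH | number of matches this query has |
| regarding the alignment | ZE | ae | expect value |
| regarding the alignment | ZR | ar | raw score |
| regarding the alignment | ZI | ai | % identity (in protein space unless BLASTN) |
| regarding the alignment | ZP | ap | % positive (in protein space unless BLASTN) |
| regarding the query sequence | ZF | qf | query frame |
| regarding the query sequence | ZQ | qs | query protein sequence (* for BLASTN) |
| regarding the subject sequence | YF | sf | subject frame |
| regarding the subject sequence | n/a | st | subject taxonomy ID(s) separated by ; ([[more info |
| regarding all matches of this query | n/a | ls | lowest common ancestor scientific name ([[more info |
| regarding all matches of this query | n/a | lt | lowest common ancestor taxonomy id ([[more info |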

These tags can be specified with the command line argument --sam-bam-tags. If you would like to see any other tags supported, please don't hesitate to contact us.
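
Optional fields follow the general SAM layout `TAG:TYPE:VALUE`. The Python sketch below turns them into a dictionary. The type codes in the example (`i` for the bit score and frame, `f` for the expect value, `Z` for the protein sequence) are assumptions made for illustration; check them against actual Lambda output before relying on them. `parse_optional_fields` is a hypothetical helper.

```python
def parse_optional_fields(fields):
    """Convert SAM optional fields ('TAG:TYPE:VALUE') into a {tag: value} dict."""
    tags = {}
    for field in fields:
        # Split at most twice so that ':' inside a string value is preserved.
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":
            tags[tag] = int(value)
        elif type_code == "f":
            tags[tag] = float(value)
        else:
            # 'Z', 'A', 'H' and 'B' values are kept as plain strings in this sketch.
            tags[tag] = value
    return tags


# Made-up optional fields using tag names from the table above.
example = ["AS:i:87", "ZE:f:1e-20", "ZF:i:-2", "ZQ:Z:MKVLA"]
print(parse_optional_fields(example))
```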

Header

BAM files require all subject names to be written to the header. For SAM this is not required, so Lambda does not do it automatically in order to save space (especially for protein databases, this is a lot!). If you still want them with SAM, e.g. for better BAM compatibility, use the --sam-with-refheader option.
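
To check whether a given SAM file carries the subject names in its header, you can look for `@SQ` lines, which is where the SAM format stores reference names (in the `SN:` field). The following Python sketch does that; `reference_names` and the sample lines are hypothetical.

```python
def reference_names(sam_lines):
    """Collect the SN: values of all @SQ header lines, stopping at the first alignment line."""
    names = []
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header lines always precede the alignments
        cols = line.rstrip("\n").split("\t")
        if cols[0] != "@SQ":
            continue
        for col in cols[1:]:
            if col.startswith("SN:"):
                names.append(col[3:])
    return names


# Made-up header with two subject sequences, followed by one alignment line.
sample = [
    "@HD\tVN:1.4",
    "@SQ\tSN:subj1\tLN:350",
    "@SQ\tSN:subj2\tLN:120",
    "read1\t0\tsubj1\t100\t255\t*\t*\t0\t0\t*\t*",
]
print(reference_names(sample))
```

An empty result means the header has no subject names, which is what the text above describes for SAM output written without --sam-with-refheader.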
