Commit dbb7dda

Merge pull request samtools#265 from samtools/feature/norm-set-ref
bcftools norm set REF allele
2 parents df1d0a0 + 05e8a49 commit dbb7dda

7 files changed: +241 -460 lines changed

doc/bcftools.1

Lines changed: 26 additions & 2 deletions
@@ -2,12 +2,12 @@
 .\" Title: bcftools
 .\" Author: [see the "AUTHORS" section]
 .\" Generator: DocBook XSL Stylesheets v1.76.1 <http://docbook.sf.net/>
-.\" Date: 04/22/2015
+.\" Date: 05/21/2015
 .\" Manual: \ \&
 .\" Source: \ \&
 .\" Language: English
 .\"
-.TH "BCFTOOLS" "1" "04/22/2015" "\ \&" "\ \&"
+.TH "BCFTOOLS" "1" "05/21/2015" "\ \&" "\ \&"
 .\" -----------------------------------------------------------------
 .\" * Define some portability stuff
 .\" -----------------------------------------------------------------
@@ -1772,6 +1772,18 @@ see
 .sp
 Left\-align and normalize indels, check if REF alleles match the reference, split multiallelic sites into multiple rows; recover multiallelics from multiple rows\&.
 .PP
+\fB\-c, \-\-check\-ref\fR \fIe\fR|\fIw\fR|\fIx\fR|\fIs\fR
+.RS 4
+what to do when incorrect or missing REF allele is encountered: exit (\fIe\fR), warn (\fIw\fR), exclude (\fIx\fR), or set/fix (\fIs\fR) bad sites\&. The
+\fIw\fR
+option can be combined with
+\fIx\fR
+and
+\fIs\fR\&. Note that
+\fIs\fR
+can swap alleles and will update genotypes (GT) and AC counts, but will not attempt to fix PL or other fields\&.
+.RE
+.PP
 \fB\-D, \-\-remove\-duplicates\fR
 .RS 4
 remove duplicate lines of the same type
@@ -1790,6 +1802,18 @@ split multiallelic sites into biallelic records (\fI\-\fR) or join biallelic sit
 \fIany\fR\&.
 .RE
 .PP
+\fB\-N, \-\-do\-not\-normalize\fR
+.RS 4
+the
+\fI\-c s\fR
+option can be used to fix or set the REF allele from the reference
+\fI\-f\fR\&. The
+\fI\-N\fR
+option will not turn on indel normalisation as the
+\fI\-f\fR
+option normally implies
+.RE
+.PP
 \fB\-o, \-\-output\fR \fIFILE\fR
 .RS 4
 see

doc/bcftools.html

Lines changed: 21 additions & 3 deletions
@@ -1,13 +1,13 @@
 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
-<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>bcftools</title><link rel="stylesheet" type="text/css" href="docbook-xsl.css" /><meta name="generator" content="DocBook XSL Stylesheets V1.76.1" /></head><body><div xml:lang="en" class="refentry" title="bcftools" lang="en"><a id="idp135936"></a><div class="titlepage"></div><div class="refnamediv"><h2>Name</h2><p>bcftools — utilities for variant calling and manipulating VCFs and BCFs.</p></div><div class="refsynopsisdiv" title="Synopsis"><a id="_synopsis"></a><h2>Synopsis</h2><p><span class="strong"><strong>bcftools</strong></span> [<span class="emphasis"><em>COMMAND</em></span>] [<span class="emphasis"><em>OPTIONS</em></span>]</p></div><div class="refsect1" title="DESCRIPTION"><a id="_description"></a><h2>DESCRIPTION</h2><p>BCFtools is a set of utilities that manipulate variant calls in the Variant
+<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>bcftools</title><link rel="stylesheet" type="text/css" href="docbook-xsl.css" /><meta name="generator" content="DocBook XSL Stylesheets V1.76.1" /></head><body><div xml:lang="en" class="refentry" title="bcftools" lang="en"><a id="idp25196192"></a><div class="titlepage"></div><div class="refnamediv"><h2>Name</h2><p>bcftools — utilities for variant calling and manipulating VCFs and BCFs.</p></div><div class="refsynopsisdiv" title="Synopsis"><a id="_synopsis"></a><h2>Synopsis</h2><p><span class="strong"><strong>bcftools</strong></span> [<span class="emphasis"><em>COMMAND</em></span>] [<span class="emphasis"><em>OPTIONS</em></span>]</p></div><div class="refsect1" title="DESCRIPTION"><a id="_description"></a><h2>DESCRIPTION</h2><p>BCFtools is a set of utilities that manipulate variant calls in the Variant
 Call Format (VCF) and its binary counterpart BCF. All commands work
 transparently with both VCFs and BCFs, both uncompressed and BGZF-compressed.</p><p>Most commands accept VCF, bgzipped VCF and BCF with filetype detected
 automatically even when streaming from a pipe. Indexed VCF and BCF
 will work in all situations. Un-indexed VCF and BCF and streams will
 work in most, but not all situations.</p><p>BCFtools is designed to work on a stream. It regards an input file "-" as the
 standard input (stdin) and outputs to the standard output (stdout). Several
-commands can thus be combined with Unix pipes.</p><div class="refsect2" title="VERSION"><a id="_version"></a><h3>VERSION</h3><p>This manual page was last updated <span class="strong"><strong>2015-03-20 11:57 GMT</strong></span> and refers to bcftools git version <span class="strong"><strong>1.2-6-ga8d7fe9+</strong></span>.</p></div><div class="refsect2" title="BCF1"><a id="_bcf1"></a><h3>BCF1</h3><p>The BCF1 format output by versions of samtools &lt;= 0.1.19 is <span class="strong"><strong>not</strong></span>
+commands can thus be combined with Unix pipes.</p><div class="refsect2" title="VERSION"><a id="_version"></a><h3>VERSION</h3><p></p></div><div class="refsect2" title="BCF1"><a id="_bcf1"></a><h3>BCF1</h3><p>The BCF1 format output by versions of samtools &lt;= 0.1.19 is <span class="strong"><strong>not</strong></span>
 compatible with this version of bcftools. To read BCF1 files one can use
 the view command from old versions of bcftools packaged with samtools
 versions &lt;= 0.1.19 to convert to VCF, which can then be read by
@@ -573,6 +573,10 @@
 </span></dt><dd>
 convert gVCF to VCF, expanding REF blocks into sites. Only sites
 with FILTER set to "PASS" or "." will be expanded.
+</dd><dt><span class="term">
+<span class="strong"><strong>-f, --fasta-ref</strong></span> <span class="emphasis"><em>file</em></span>
+</span></dt><dd>
+reference sequence in fasta format. Must be indexed with samtools faidx
 </dd></dl></div></div><div class="refsect3" title="HAPS/SAMPLE conversion:"><a id="_haps_sample_conversion"></a><h4>HAPS/SAMPLE conversion:</h4><div class="variablelist"><dl><dt><span class="term">
 <span class="strong"><strong>--hapsample2vcf</strong></span> <span class="emphasis"><em>prefix</em></span> or <span class="emphasis"><em>haps-file</em></span>,<span class="emphasis"><em>sample-file</em></span>
 </span></dt><dd>
@@ -641,7 +645,7 @@
 </dd><dt><span class="term">
 <span class="strong"><strong>-f, --fasta-ref</strong></span> <span class="emphasis"><em>file</em></span>
 </span></dt><dd>
-reference sequence in fasta format
+reference sequence in fasta format. Must be indexed with samtools faidx
 </dd><dt><span class="term">
 <span class="strong"><strong>-s, --samples</strong></span> <span class="emphasis"><em>LIST</em></span>
 </span></dt><dd>
@@ -969,6 +973,14 @@
 </dd></dl></div></div><div class="refsect2" title="bcftools norm [OPTIONS] file.vcf.gz"><a id="norm"></a><h3>bcftools norm [<span class="emphasis"><em>OPTIONS</em></span>] <span class="emphasis"><em>file.vcf.gz</em></span></h3><p>Left-align and normalize indels, check if REF alleles match the reference,
 split multiallelic sites into multiple rows; recover multiallelics from
 multiple rows.</p><div class="variablelist"><dl><dt><span class="term">
+<span class="strong"><strong>-c, --check-ref</strong></span> <span class="emphasis"><em>e</em></span>|<span class="emphasis"><em>w</em></span>|<span class="emphasis"><em>x</em></span>|<span class="emphasis"><em>s</em></span>
+</span></dt><dd>
+what to do when incorrect or missing REF allele is encountered:
+exit (<span class="emphasis"><em>e</em></span>), warn (<span class="emphasis"><em>w</em></span>), exclude (<span class="emphasis"><em>x</em></span>), or set/fix (<span class="emphasis"><em>s</em></span>) bad sites.
+The <span class="emphasis"><em>w</em></span> option can be combined with <span class="emphasis"><em>x</em></span> and <span class="emphasis"><em>s</em></span>. Note that <span class="emphasis"><em>s</em></span>
+can swap alleles and will update genotypes (GT) and AC counts,
+but will not attempt to fix PL or other fields.
+</dd><dt><span class="term">
 <span class="strong"><strong>-D, --remove-duplicates</strong></span>
 </span></dt><dd>
 remove duplicate lines of the same type
@@ -987,6 +999,12 @@
 <span class="emphasis"><em>both</em></span>; if SNPs and indels should be merged into a single record, specify
 <span class="emphasis"><em>any</em></span>.
 </dd><dt><span class="term">
+<span class="strong"><strong>-N, --do-not-normalize</strong></span>
+</span></dt><dd>
+the <span class="emphasis"><em>-c s</em></span> option can be used to fix or set the REF allele from the
+reference <span class="emphasis"><em>-f</em></span>. The <span class="emphasis"><em>-N</em></span> option will not turn on indel normalisation
+as the <span class="emphasis"><em>-f</em></span> option normally implies
+</dd><dt><span class="term">
 <span class="strong"><strong>-o, --output</strong></span> <span class="emphasis"><em>FILE</em></span>
 </span></dt><dd>
 see <span class="strong"><strong><a class="link" href="#common_options" title="Common Options">Common Options</a></strong></span>

doc/bcftools.txt

Lines changed: 12 additions & 0 deletions
@@ -1026,6 +1026,13 @@ Left-align and normalize indels, check if REF alleles match the reference,
 split multiallelic sites into multiple rows; recover multiallelics from
 multiple rows.
 
+*-c, --check-ref* 'e'|'w'|'x'|'s'::
+what to do when incorrect or missing REF allele is encountered:
+exit ('e'), warn ('w'), exclude ('x'), or set/fix ('s') bad sites.
+The 'w' option can be combined with 'x' and 's'. Note that 's'
+can swap alleles and will update genotypes (GT) and AC counts,
+but will not attempt to fix PL or other fields.
+
 *-D, --remove-duplicates*::
 remove duplicate lines of the same type
 
@@ -1041,6 +1048,11 @@ multiple rows.
 'both'; if SNPs and indels should be merged into a single record, specify
 'any'.
 
+*-N, --do-not-normalize*::
+the '-c s' option can be used to fix or set the REF allele from the
+reference '-f'. The '-N' option will not turn on indel normalisation
+as the '-f' option normally implies
+
 *-o, --output* 'FILE'::
 see *<<common_options,Common Options>>*
 
test/norm.setref.out

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+##fileformat=VCFv4.1
+##FILTER=<ID=PASS,Description="All filters passed">
+##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
+##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
+##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled likelihood">
+##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Depth">
+##contig=<ID=1,length=2147483647>
+##contig=<ID=2,length=2147483647>
+##contig=<ID=3,length=2147483647>
+##contig=<ID=4,length=2147483647>
+##contig=<ID=5,length=2147483647>
+##contig=<ID=20,length=2147483647>
+##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
+##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
+##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
+##INFO=<ID=XRF,Number=R,Type=Float,Description="Test Number=AGR in INFO">
+##INFO=<ID=XAF,Number=A,Type=Float,Description="Test Number=AGR in INFO">
+##INFO=<ID=XGF,Number=G,Type=Float,Description="Test Number=AGR in INFO">
+##INFO=<ID=XRI,Number=R,Type=Integer,Description="Test Number=AGR in INFO">
+##INFO=<ID=XAI,Number=A,Type=Integer,Description="Test Number=AGR in INFO">
+##INFO=<ID=XGI,Number=G,Type=Integer,Description="Test Number=AGR in INFO">
+##INFO=<ID=XRS,Number=R,Type=String,Description="Test Number=AGR in INFO">
+##INFO=<ID=XAS,Number=A,Type=String,Description="Test Number=AGR in INFO">
+##INFO=<ID=XGS,Number=G,Type=String,Description="Test Number=AGR in INFO">
+##FORMAT=<ID=FRF,Number=R,Type=Float,Description="Test Number=AGR in FORMAT">
+##FORMAT=<ID=FAF,Number=A,Type=Float,Description="Test Number=AGR in FORMAT">
+##FORMAT=<ID=FGF,Number=G,Type=Float,Description="Test Number=AGR in FORMAT">
+##FORMAT=<ID=FRI,Number=R,Type=Integer,Description="Test Number=AGR in FORMAT">
+##FORMAT=<ID=FAI,Number=A,Type=Integer,Description="Test Number=AGR in FORMAT">
+##FORMAT=<ID=FGI,Number=G,Type=Integer,Description="Test Number=AGR in FORMAT">
+##FORMAT=<ID=FRS,Number=R,Type=String,Description="Test Number=AGR in FORMAT">
+##FORMAT=<ID=FAS,Number=A,Type=String,Description="Test Number=AGR in FORMAT">
+##FORMAT=<ID=FGS,Number=G,Type=String,Description="Test Number=AGR in FORMAT">
+##FORMAT=<ID=FSTR,Number=1,Type=String,Description="Test String in FORMAT">
+##INFO=<ID=ISTR,Number=1,Type=String,Description="Test String in INFO">
+##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
+#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT XY00001 XY00002
+1 105 . TAAACCCTAAA TAACCCTAAA,TAA 999 PASS INDEL;AN=4;AC=0,2;DP=19 GT 0/2 0/2
+2 101 . A c 999 PASS INDEL;AN=4;AC=4 GT:DP 1/1:1 1/1:1
+2 105 . T <DEL> 999 PASS END=112;AN=4;AC=3 GT:DP 0/1:1 1/1:1
+2 115 . c t 999 PASS INDEL;AN=4;AC=0 GT:DP 0/0:1 0/0:1
+20 3 . g c 999 PASS INDEL;AN=4;AC=3 GT 1/1 1/0
+20 3 . gatg gact 999 PASS INDEL;AN=4;AC=2 GT 0/1 0/1
+20 10 . C . 999 PASS INDEL;AN=4;AC=1 GT 1/0 0/0
+20 275 . a c,g,t,aaa 999 PASS INDEL;AN=2;AC=0,0,0,0 GT 0 0

test/norm.setref.vcf

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+##fileformat=VCFv4.1
+##FILTER=<ID=PASS,Description="All filters passed">
+##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
+##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
+##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled likelihood">
+##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Depth">
+##contig=<ID=1,length=2147483647>
+##contig=<ID=2,length=2147483647>
+##contig=<ID=3,length=2147483647>
+##contig=<ID=4,length=2147483647>
+##contig=<ID=5,length=2147483647>
+##contig=<ID=20,length=2147483647>
+##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
+##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
+##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
+##INFO=<ID=XRF,Number=R,Type=Float,Description="Test Number=AGR in INFO">
+##INFO=<ID=XAF,Number=A,Type=Float,Description="Test Number=AGR in INFO">
+##INFO=<ID=XGF,Number=G,Type=Float,Description="Test Number=AGR in INFO">
+##INFO=<ID=XRI,Number=R,Type=Integer,Description="Test Number=AGR in INFO">
+##INFO=<ID=XAI,Number=A,Type=Integer,Description="Test Number=AGR in INFO">
+##INFO=<ID=XGI,Number=G,Type=Integer,Description="Test Number=AGR in INFO">
+##INFO=<ID=XRS,Number=R,Type=String,Description="Test Number=AGR in INFO">
+##INFO=<ID=XAS,Number=A,Type=String,Description="Test Number=AGR in INFO">
+##INFO=<ID=XGS,Number=G,Type=String,Description="Test Number=AGR in INFO">
+##FORMAT=<ID=FRF,Number=R,Type=Float,Description="Test Number=AGR in FORMAT">
+##FORMAT=<ID=FAF,Number=A,Type=Float,Description="Test Number=AGR in FORMAT">
+##FORMAT=<ID=FGF,Number=G,Type=Float,Description="Test Number=AGR in FORMAT">
+##FORMAT=<ID=FRI,Number=R,Type=Integer,Description="Test Number=AGR in FORMAT">
+##FORMAT=<ID=FAI,Number=A,Type=Integer,Description="Test Number=AGR in FORMAT">
+##FORMAT=<ID=FGI,Number=G,Type=Integer,Description="Test Number=AGR in FORMAT">
+##FORMAT=<ID=FRS,Number=R,Type=String,Description="Test Number=AGR in FORMAT">
+##FORMAT=<ID=FAS,Number=A,Type=String,Description="Test Number=AGR in FORMAT">
+##FORMAT=<ID=FGS,Number=G,Type=String,Description="Test Number=AGR in FORMAT">
+##FORMAT=<ID=FSTR,Number=1,Type=String,Description="Test String in FORMAT">
+##INFO=<ID=ISTR,Number=1,Type=String,Description="Test String in INFO">
+##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
+#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT XY00001 XY00002
+1 105 . TAACCCTAAA TAAACCCTAAA,TAA 999 PASS INDEL;AN=4;AC=2,2;DP=19 GT 1/2 1/2
+2 101 . . c 999 PASS INDEL;AN=4;AC=4 GT:DP 1/1:1 1/1:1
+2 105 . n <DEL> 999 PASS END=112;AN=4;AC=3 GT:DP 0/1:1 1/1:1
+2 115 . t c 999 PASS INDEL;AN=4;AC=4 GT:DP 1/1:1 1/1:1
+20 3 . c g 999 PASS INDEL;AN=4;AC=1 GT 0/0 0/1
+20 3 . gact gatg 999 PASS INDEL;AN=4;AC=2 GT 1/0 1/0
+20 10 . . . 999 PASS INDEL;AN=4;AC=1 GT 1/0 0/0
+20 275 . g c,a,t,aaa 999 PASS INDEL;AN=2;AC=0,2,0,0 GT 2 2
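
The before/after pair in test/norm.setref.vcf and test/norm.setref.out suggests how `-c s` treats a record whose true reference allele is listed as one of its ALT alleles: the two alleles trade places, genotype indexes are remapped, and AC is recounted. A minimal Python sketch of that idea follows. It is not the bcftools implementation; `set_ref` and `count_ac` are hypothetical helper names, genotypes are assumed to be fully called lists of allele indexes, and PL and other per-allele fields are deliberately left untouched, in line with the documentation in this commit.

```python
def count_ac(genotypes, n_alt):
    """Recount AC (one counter per ALT allele) from allele indexes."""
    ac = [0] * n_alt
    for gt in genotypes:
        for allele in gt:
            if allele > 0:
                ac[allele - 1] += 1
    return ac


def set_ref(ref, alts, genotypes, fasta_ref):
    """Hypothetical helper: make REF agree with the reference sequence.

    Returns (ref, alts, genotypes, ac). Allele comparison ignores case,
    since the test records mix upper- and lower-case bases.
    """
    if ref.upper() == fasta_ref.upper():
        # REF is already correct; only recount AC.
        return ref, list(alts), genotypes, count_ac(genotypes, len(alts))

    upper_alts = [a.upper() for a in alts]
    if fasta_ref.upper() in upper_alts:
        # The reference allele is listed as ALT number i+1: swap it with REF
        # and exchange allele indexes 0 and i+1 in every genotype.
        i = upper_alts.index(fasta_ref.upper())
        new_alts = list(alts)
        new_ref, new_alts[i] = alts[i], ref
        remap = {0: i + 1, i + 1: 0}
        new_gts = [[remap.get(a, a) for a in gt] for gt in genotypes]
    else:
        # Missing or unrelated REF: overwrite it and keep genotypes as they are.
        new_ref, new_alts, new_gts = fasta_ref, list(alts), genotypes

    return new_ref, new_alts, new_gts, count_ac(new_gts, len(new_alts))
```

With the `20 3 c g` record from norm.setref.vcf and a reference base of `g` (inferred from the expected output), `set_ref("c", ["g"], [[0, 0], [0, 1]], "g")` gives REF `g`, ALT `c`, genotypes 1/1 and 1/0, and AC=3, which matches the corresponding line of norm.setref.out.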

test/test.pl

Lines changed: 1 addition & 0 deletions
@@ -85,6 +85,7 @@
 test_vcf_norm($opts,in=>'norm.split',out=>'norm.split.out',args=>'-m-');
 test_vcf_norm($opts,in=>'norm.merge',out=>'norm.merge.out',args=>'-m+');
 test_vcf_norm($opts,in=>'norm.merge',out=>'norm.merge.strict.out',args=>'-m+ -s');
+test_vcf_norm($opts,in=>'norm.setref',out=>'norm.setref.out',args=>'-Nc s',fai=>'norm');
 test_vcf_view($opts,in=>'view',out=>'view.1.out',args=>'-aUc1 -C1 -s NA00002 -v snps',reg=>'');
 test_vcf_view($opts,in=>'view',out=>'view.2.out',args=>'-f PASS -Xks NA00003',reg=>'-r20,Y');
 test_vcf_view($opts,in=>'view',out=>'view.3.out',args=>'-xs NA00003',reg=>'');
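
The new test case runs `bcftools norm` with `-Nc s` plus a reference. A roughly equivalent command line can be assembled as sketched below. This is only an illustration: the helper name `build_norm_setref_cmd` and the file names are hypothetical, only the `-N`, `-c s`, `-f` and `-o` options documented in this commit are used, and the FASTA file is assumed to have been indexed with samtools faidx beforehand.

```python
import shlex


def build_norm_setref_cmd(vcf_path, fasta_path, out_path=None):
    """Hypothetical helper: argv for fixing REF alleles without indel normalisation."""
    cmd = [
        "bcftools", "norm",
        "-N",              # do not turn on the indel normalisation that -f implies
        "-c", "s",         # set/fix REF alleles that disagree with the reference
        "-f", fasta_path,  # reference sequence (assumed to be faidx-indexed)
    ]
    if out_path is not None:
        cmd += ["-o", out_path]
    cmd.append(vcf_path)
    return cmd


if __name__ == "__main__":
    # File names below are placeholders, not files from this commit.
    print(shlex.join(build_norm_setref_cmd("in.vcf.gz", "ref.fa", "fixed.vcf")))
```

The list form can be passed directly to `subprocess.run` if bcftools is installed; building it separately keeps the option choice easy to inspect before anything is executed.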
