Skip to content

lobSTR: a short tandem repeat profiler for next generation sequencing data

License

Notifications You must be signed in to change notification settings

whitehead/lobstr-code

 
 

Repository files navigation

I. Installation from source
II. Installation from pre-compiled binaries
III. Usage
IV. Getting help
V. Required packages
VI. Output files
VII. Known Bugs
VIII. Building a lobSTR index
IX. Running lobSTR on the Amazon Cloud
X. Quality control checks for lobSTR results
XI. Building the strinfo file
XII. Converting lobSTR output to VCF format

Note, full instructions for the most current version lobSTR are provided on the lobSTR webpage (jura.wi.mit.edu/erlich/lobSTR/). This file is meant to give only basic usage instructions.

I. INSTALLATION FROM SOURCE
 
To install from source, download lobSTR from the github repository:

git clone https://github.com/whitehead/lobstr-code.git lobstr-code

Then do:

cd lobstr-code/
./reconf
./configure
make
make install

II. INSTALLATION FROM PRE-COMPILED BINARIES
tar -xzvf lobstr-xxx.tar.gz
cp lobstr-xxx/bin/* /usr/bin

III. USAGE

Part I: creating an STR alignment

To see usage instructions, just type:

lobSTR --help

for a list of all command line options.

Required inputs are:
* Input files.
For single-end fastq/fasta/bam files:
   -f <file_list>, where $file_list is a comma separated list of files containing raw reads in fasta, fastq, or bam format (for fastq you must add the argument -q, for bam you must add the argument --bam) 
For paired-end files fastq/fasta files:
  --p1 <file_list> --p2 <file_list> where $file_list arguments are comma-separated lists of files in for the first and second ends in fastq or fasta format. For fastq, you must add the argument -q.
For paired-end bam files:
  -f <file_list> where $file_list is a comma separated list of files containing reads in bam format. You must also specify the --bam and --bampair flags. Note, to process paired end bam files, you MUST sort the files by read name first using:
samtools sort -n <oldfile.bam> <newfileprefix>

For fastq or fasta files that are gzipped, specify the --gzip flag.

Several examples of setting the file path parameters:
1. Single end fastq files
-f file1.fq,file2.fq -q
2. Single end fasta files
-f file1.fa,file2.fa,file3.fa
3. Paired-end fastq files
--p1 file1_1.fq,file2_1.fq --p2 file1_2.fq,file2_2.fq -q
4. Single end bam file
-f file1.bam,file2.bam,file3.bam --bam
5. Paired-end bam file
-f file1.sorted_by_name.bam --bampair
6. Gzipped fastq paired end files
--p1 file1_1.fq.gz --p2 file1_2.fq.gz

* -o: prefix to name all output files

* --index-prefix: path and prefix of lobSTR index: You can download the lobSTR index from the lobSTR downloads page. Unzip the index to an folder PATH-TO-INDEX. To specify this as the index, use:
--index-prefix PATH-TO-INDEX/lobSTR_

Change the path accordingly to match the path where the index is stored. An index for hg19 is available for download on the lobSTR website

Part II: allelotyping

To see usage instructions, type:
allelotype --help

for a list of all command line options.

Required inputs are:
* command:
  - "train" builds a stutter noise model using given alignment data. This only works on male samples with a sizable number of reads aligned to the sex chromosomes.
  - "classify" takes in a pre made noise model and produces an allelotype file
  - "both" performs both training and classification
  - "simple" performs allelotyping without using a noise model
* --bam: a bam file consisting of aligned reads created during the alignment step.
* --out: a prefix to name output files
* --noise_model: an example noisemodel is given in the files $PATH_TO_LOBSTR/models/illumina_v2.0.3.* . Or you can create your own using the "train" command described above. 
* --strinfo: a file containing metadata about each STR locus. This is included in the lobSTR resource bundle. To create this file for a custom set of loci or for loci from a different species, see XI: BUILDING THE STRINFO FILE. This is required if command is not "simple".
* --index-prefix: PATH_TO_LOBSTR/lobSTR_, required if command is not "train".

IV. GETTING HELP
For more information on the current version see the website at http://jura.wi.mit.edu/erlich/lobSTR/ and the lobSTR user group mailing list (groups.google.com/group/lobstr-user-group)

V. REQUIREMENTS

Assuming you're using a UNIX environment, the following packages are required, and can be obtained by doing "sudo apt-get install $package_name".

gcc
g++
automake
libtool
pkg-config
fftw3-dev
libboost-dev

Python libraries:
Biopython

For sorting paired-end files, samtools must be installed.

VI. OUTPUT

The following files are output from lobSTR, with formats described fully at http://club/lobSTR/file-formats.html:

$prefix.aligned.bam: aligned reads in bam format
$prefix.vcf: genotype calls in vcf format

VII. KNOWN BUGS

- lobSTR will not produce output for chromosome names containing a "$". For now all "$" characters must be removed from chromosome names.

- For very small files (< several million reads), if lobSTR is run in multiprocessing mode it may fail to write to the BAM output file. Therefore, for small files, it is recommended to run in single processor mode (-p 1).

VIII. Building a lobSTR index
An index built using the Tandem Repeat Finder table from UCSC is included in the index_trf/ directory of the lobSTR download.

You can build your own lobSTR index using the lobstr_index.py script provided in the scripts/ directory of the lobSTR download. 

Usage:

python PATH-TO-LOBSTR/scripts/lobstr_index.py --str <path to trf table> --ref <reference genome in fasta format> --out_dir <path to output index> [--extend <INT>]

Where --str is the table resulting from running the Tandem Repeat Finder tool.
The resulting argument to --index-prefix for lobSTR will then be $out_dir/lobSTR

--extend gives how many bp around an STR locus to include in the reference. By default, this is set to 1000bp in order to allow enough sequence to align mate pairs of STR-containing reads. Note that the --extend parameter must be set to the SAME value for lobSTR alignment as it was set to during index building.

IX. Running lobSTR on the Amazon Cloud

We have made lobSTR available on the Amazon Cloud. the lobSTR tool and index are stored in an S3 bucket at:

s3://lobstr_public/lobstr_2.0.0.linux64.tar.gz
s3://lobstr_public/lobstr_index_trfhg19_extend1000.tar.gz

This bucket is located in the US Standard. Therefore, it is free to transfer the tool and index to any EC2 instance running in the same region. To download the the items in the bucket, from an EC2 instance run:

s3cmd get s3://lobstr_public/lobstr_2.0.0.linux64.tar.gz
s3cmd get s3://lobstr_public/lobstr_index_hg19_extend1000.tar.gz

You can then unpack the tars and run lobSTR by following the instructions above.

lobSTR also has several parameters built for reading files from S3. In this setting, lobSTR will transfer files from the desired S3 bucket, process the file, and then delete it. DO NOT use this option if you do not want the files to be removed from the local hard drive after processing. This option is made for processing many files from S3 in situations where it is not desirable to keep a copy of the raw data around. If this is not your desired behavior, first download the files to the local drive and proceed with lobSTR usage as described above without the s3 options set.

Using the s3 options require that s3cmd be installed and that there exists an s3 config file with the user's credential information. The s3 options are:

--use-s3 <bucket>          Files are read from this s3 bucket
                           WARNING s3 mode DELETES FILES after processing
                           DO NOT USE this option unless you are pulling
                           files from Amazon S3!
--s3config <file>          s3cmd configuration file (created by
                           s3cmd --configure)

For example, to process a genome from the 1000 genomes S3 bucket, you can run:

lobSTR --index-prefix $index_path/lobSTR_ --extend 1000 -q --bwaq 10 --mapq 100 --out NA19675_WGS_paired -q --gzip --p1 SRR058937_1.filt.fastq.gz,SRR058938_1.filt.fastq.gz,SRR058939_1.filt.fastq.gz,SRR058964_1.filt.fastq.gz --p2 SRR058937_2.filt.fastq.gz,SRR058938_2.filt.fastq.gz,SRR058939_2.filt.fastq.gz,SRR058964_2.filt.fastq.gz -v -p 10 --use-s3 s3://1000genomes/data/NA19675/sequence_read --s3config lobstr1kgtools/s3cfg

X. Quality control checks for lobSTR results

lobSTR outputs the files:
<prefix>.aligned.stats
<prefix>.allelotype.stats

with information about each run.

A description of the normal range of values reported, as well as a description of the plots produced, is given on the usage page of the website.


XI. BUILDING THE STRINFO TABLE

The allelotype step uses additional characteristics about each STR (GC content, TRF score, and sequence entropy) in its model of PCR stutter noise. This file is pre-made and included in the lobSTR download for genomes hg18 and hg19. To create it on another set of STR loci:

python $PATH_TO_LOBSTR/scripts/GetSTRInfo.py <str table> <ref.fa> > <strinfo.tab>

e.g.:
python $PATH_TO_LOBSTR/scripts/GetSTRInfo.py reference.bed > reference_strinfo.tab

About

lobSTR: a short tandem repeat profiler for next generation sequencing data

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C++ 69.0%
  • C 25.6%
  • Shell 3.0%
  • Python 2.4%