forked from mgymrek/lobstr-code
-
Notifications
You must be signed in to change notification settings - Fork 0
lobSTR: a short tandem repeat profiler for next generation sequencing data
License
whitehead/lobstr-code
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
I. Installation from source II. Installation from pre-compiled binaries III. Usage IV. Getting help V. Required packages VI. Output files VII. Known Bugs VIII. Building a lobSTR index IX. Running lobSTR on the Amazon Cloud X. Quality control checks for lobSTR results XI. Building the strinfo file XII. Converting lobSTR output to VCF format Note, full instructions for the most current version lobSTR are provided on the lobSTR webpage (jura.wi.mit.edu/erlich/lobSTR/). This file is meant to give only basic usage instructions. I. INSTALLATION FROM SOURCE To install from source, download lobSTR from the github repository: git clone https://github.com/whitehead/lobstr-code.git lobstr-code Then do: cd lobstr-code/ ./reconf ./configure make make install II. INSTALLATION FROM PRE-COMPILED BINARIES tar -xzvf lobstr-xxx.tar.gz cp lobstr-xxx/bin/* /usr/bin III. USAGE Part I: creating an STR alignment To see usage instructions, just type: lobSTR --help for a list of all command line options. Required inputs are: * Input files. For single-end fastq/fasta/bam files: -f <file_list>, where $file_list is a comma separated list of files containing raw reads in fasta, fastq, or bam format (for fastq you must add the argument -q, for bam you must add the argument --bam) For paired-end files fastq/fasta files: --p1 <file_list> --p2 <file_list> where $file_list arguments are comma-separated lists of files in for the first and second ends in fastq or fasta format. For fastq, you must add the argument -q. For paired-end bam files: -f <file_list> where $file_list is a comma separated list of files containing reads in bam format. You must also specify the --bam and --bampair flags. Note, to process paired end bam files, you MUST sort the files by read name first using: samtools sort -n <oldfile.bam> <newfileprefix> For fastq or fasta files that are gzipped, specify the --gzip flag. Several examples of setting the file path parameters: 1. Single end fastq files -f file1.fq,file2.fq -q 2. Single end fasta files -f file1.fa,file2.fa,file3.fa 3. Paired-end fastq files --p1 file1_1.fq,file2_1.fq --p2 file1_2.fq,file2_2.fq -q 4. Single end bam file -f file1.bam,file2.bam,file3.bam --bam 5. Paired-end bam file -f file1.sorted_by_name.bam --bampair 6. Gzipped fastq paired end files --p1 file1_1.fq.gz --p2 file1_2.fq.gz * -o: prefix to name all output files * --index-prefix: path and prefix of lobSTR index: You can download the lobSTR index from the lobSTR downloads page. Unzip the index to an folder PATH-TO-INDEX. To specify this as the index, use: --index-prefix PATH-TO-INDEX/lobSTR_ Change the path accordingly to match the path where the index is stored. An index for hg19 is available for download on the lobSTR website Part II: allelotyping To see usage instructions, type: allelotype --help for a list of all command line options. Required inputs are: * command: - "train" builds a stutter noise model using given alignment data. This only works on male samples with a sizable number of reads aligned to the sex chromosomes. - "classify" takes in a pre made noise model and produces an allelotype file - "both" performs both training and classification - "simple" performs allelotyping without using a noise model * --bam: a bam file consisting of aligned reads created during the alignment step. * --out: a prefix to name output files * --noise_model: an example noisemodel is given in the files $PATH_TO_LOBSTR/models/illumina_v2.0.3.* . Or you can create your own using the "train" command described above. * --strinfo: a file containing metadata about each STR locus. This is included in the lobSTR resource bundle. To create this file for a custom set of loci or for loci from a different species, see XI: BUILDING THE STRINFO FILE. This is required if command is not "simple". * --index-prefix: PATH_TO_LOBSTR/lobSTR_, required if command is not "train". IV. GETTING HELP For more information on the current version see the website at http://jura.wi.mit.edu/erlich/lobSTR/ and the lobSTR user group mailing list (groups.google.com/group/lobstr-user-group) V. REQUIREMENTS Assuming you're using a UNIX environment, the following packages are required, and can be obtained by doing "sudo apt-get install $package_name". gcc g++ automake libtool pkg-config fftw3-dev libboost-dev Python libraries: Biopython For sorting paired-end files, samtools must be installed. VI. OUTPUT The following files are output from lobSTR, with formats described fully at http://club/lobSTR/file-formats.html: $prefix.aligned.bam: aligned reads in bam format $prefix.vcf: genotype calls in vcf format VII. KNOWN BUGS - lobSTR will not produce output for chromosome names containing a "$". For now all "$" characters must be removed from chromosome names. - For very small files (< several million reads), if lobSTR is run in multiprocessing mode it may fail to write to the BAM output file. Therefore, for small files, it is recommended to run in single processor mode (-p 1). VIII. Building a lobSTR index An index built using the Tandem Repeat Finder table from UCSC is included in the index_trf/ directory of the lobSTR download. You can build your own lobSTR index using the lobstr_index.py script provided in the scripts/ directory of the lobSTR download. Usage: python PATH-TO-LOBSTR/scripts/lobstr_index.py --str <path to trf table> --ref <reference genome in fasta format> --out_dir <path to output index> [--extend <INT>] Where --str is the table resulting from running the Tandem Repeat Finder tool. The resulting argument to --index-prefix for lobSTR will then be $out_dir/lobSTR --extend gives how many bp around an STR locus to include in the reference. By default, this is set to 1000bp in order to allow enough sequence to align mate pairs of STR-containing reads. Note that the --extend parameter must be set to the SAME value for lobSTR alignment as it was set to during index building. IX. Running lobSTR on the Amazon Cloud We have made lobSTR available on the Amazon Cloud. the lobSTR tool and index are stored in an S3 bucket at: s3://lobstr_public/lobstr_2.0.0.linux64.tar.gz s3://lobstr_public/lobstr_index_trfhg19_extend1000.tar.gz This bucket is located in the US Standard. Therefore, it is free to transfer the tool and index to any EC2 instance running in the same region. To download the the items in the bucket, from an EC2 instance run: s3cmd get s3://lobstr_public/lobstr_2.0.0.linux64.tar.gz s3cmd get s3://lobstr_public/lobstr_index_hg19_extend1000.tar.gz You can then unpack the tars and run lobSTR by following the instructions above. lobSTR also has several parameters built for reading files from S3. In this setting, lobSTR will transfer files from the desired S3 bucket, process the file, and then delete it. DO NOT use this option if you do not want the files to be removed from the local hard drive after processing. This option is made for processing many files from S3 in situations where it is not desirable to keep a copy of the raw data around. If this is not your desired behavior, first download the files to the local drive and proceed with lobSTR usage as described above without the s3 options set. Using the s3 options require that s3cmd be installed and that there exists an s3 config file with the user's credential information. The s3 options are: --use-s3 <bucket> Files are read from this s3 bucket WARNING s3 mode DELETES FILES after processing DO NOT USE this option unless you are pulling files from Amazon S3! --s3config <file> s3cmd configuration file (created by s3cmd --configure) For example, to process a genome from the 1000 genomes S3 bucket, you can run: lobSTR --index-prefix $index_path/lobSTR_ --extend 1000 -q --bwaq 10 --mapq 100 --out NA19675_WGS_paired -q --gzip --p1 SRR058937_1.filt.fastq.gz,SRR058938_1.filt.fastq.gz,SRR058939_1.filt.fastq.gz,SRR058964_1.filt.fastq.gz --p2 SRR058937_2.filt.fastq.gz,SRR058938_2.filt.fastq.gz,SRR058939_2.filt.fastq.gz,SRR058964_2.filt.fastq.gz -v -p 10 --use-s3 s3://1000genomes/data/NA19675/sequence_read --s3config lobstr1kgtools/s3cfg X. Quality control checks for lobSTR results lobSTR outputs the files: <prefix>.aligned.stats <prefix>.allelotype.stats with information about each run. A description of the normal range of values reported, as well as a description of the plots produced, is given on the usage page of the website. XI. BUILDING THE STRINFO TABLE The allelotype step uses additional characteristics about each STR (GC content, TRF score, and sequence entropy) in its model of PCR stutter noise. This file is pre-made and included in the lobSTR download for genomes hg18 and hg19. To create it on another set of STR loci: python $PATH_TO_LOBSTR/scripts/GetSTRInfo.py <str table> <ref.fa> > <strinfo.tab> e.g.: python $PATH_TO_LOBSTR/scripts/GetSTRInfo.py reference.bed > reference_strinfo.tab
About
lobSTR: a short tandem repeat profiler for next generation sequencing data
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published
Languages
- C++ 69.0%
- C 25.6%
- Shell 3.0%
- Python 2.4%