forked from yangao07/LAMSA
Commit
gaoyan07 committed Dec 4, 2015
1 parent a2985cd, commit cbf9883
Showing 71 changed files with 850 additions and 4,459 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,56 +1,25 @@ | ||
# Object files | ||
*.o | ||
|
||
# Libraries | ||
*.lib | ||
*.a | ||
|
||
# Shared objects (inc. Windows DLLs) | ||
*.dll | ||
*.so | ||
*.so.* | ||
*.dylib | ||
src/*.o | ||
|
||
# Executables | ||
*.exe | ||
*.out | ||
*.app | ||
lsat | ||
lamsa | ||
gdb_lamsa | ||
|
||
# Project files | ||
.cproject | ||
.project | ||
|
||
#ctags | ||
src/tags | ||
tags | ||
|
||
#cscope | ||
src/cscope.* | ||
cscope.* | ||
|
||
#data | ||
Data/* | ||
data/* | ||
out | ||
|
||
#soap2-dp | ||
soap2-dp/* | ||
|
||
#bwa | ||
bwa/* | ||
|
||
#gem | ||
gem/* | ||
test/* | ||
|
||
#output | ||
output/* | ||
out* | ||
|
||
#temp source files | ||
LSAT/* | ||
|
||
#eclipse project | ||
*project | ||
|
||
#err log | ||
error.log | ||
|
||
#shell | ||
lsat.sh | ||
run.sh | ||
mol.sh |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
language: c | ||
compiler: | ||
- gcc | ||
script: make |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,20 +1,22 @@ | ||
The MIT License (MIT) | ||
|
||
Copyright (c) 2013 gaoyan07 | ||
Copyright (c) 2015 Yan Gao | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy of | ||
this software and associated documentation files (the "Software"), to deal in | ||
the Software without restriction, including without limitation the rights to | ||
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of | ||
the Software, and to permit persons to whom the Software is furnished to do so, | ||
subject to the following conditions: | ||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS | ||
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR | ||
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER | ||
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN | ||
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
SOFTWARE. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,25 +1,38 @@ | ||
CC= gcc | ||
CFLAGS= -g -Wall -O3 -Wno-unused-variable -Wno-unused-but-set-variable -Wno-unused-function | ||
OBJS= main.o build_ref.o bntseq.o lsat_heap.o lsat_aln.o lsat_dp_con.o frag_check.o split_mapping.o ksw.o \ | ||
gem_parse.o is.o bwtindex.o bwt_gen.o QSufSort.o\ | ||
./lsat_sam_parse/bam_aux.o ./lsat_sam_parse/bam.o ./lsat_sam_parse/bam_import.o \ | ||
./lsat_sam_parse/kstring.o ./lsat_sam_parse/sam_header.o ./lsat_sam_parse/sam_view.o \ | ||
bwt.o bwt_aln.o utils.o | ||
PROG= lsat | ||
PROG1= ~/bin/lsat | ||
LIB= -lm -lz -lpthread | ||
#MACRO= -D __NEW__ | ||
#MACRO= -D __DEBUG__ | ||
.SUFFIXES:.c .o | ||
CC = gcc | ||
CFLAGS = -g -Wall -O3 -Wno-unused-variable -Wno-unused-but-set-variable -Wno-unused-function | ||
DFLAGS = -g -Wall | ||
LIB = -lm -lz -lpthread | ||
|
||
BIN_DIR = . | ||
SRC_DIR = ./src | ||
|
||
SOURCE = $(wildcard ${SRC_DIR}/*.c) | ||
#SOURCE = main.c build_ref.c bntseq.c lamsa_heap.c lamsa_aln.c lamsa_dp_con.c frag_check.c split_mapping.c ksw.c \ | ||
gem_parse.c is.c bwtindex.c bwt_gen.c QSufSort.c kstring.c \ | ||
bwt.c bwt_aln.c utils.c | ||
OBJS = $(SOURCE:.c=.o) | ||
|
||
BIN = $(BIN_DIR)/lamsa | ||
|
||
DEBUG = $(BIN_DIR)/gdb_lamsa | ||
DMARCRO = -D __DEBUG__ | ||
|
||
.c.o: | ||
$(CC) -c $(CFLAGS) $(MACRO) $< -o $@ | ||
$(CC) -c $(CFLAGS) $< -o $@ | ||
|
||
all:$(PROG) $(PROG1) | ||
$(PROG):$(OBJS) | ||
$(CC) $(CFLAGS) $(OBJS) $(LIB) -o $@ | ||
$(PROG1):$(OBJS) | ||
$(CC) $(CFLAGS) $(OBJS) $(LIB) -o $@ | ||
all: $(SOURCE) $(BIN) | ||
#lamsa: $(SOURCE) $(BIN) | ||
gdb_lamsa: $(SOURCE) $(DEBUG) | ||
|
||
|
||
$(BIN): $(OBJS) | ||
$(CC) $(OBJS) -o $@ $(LIB) | ||
|
||
$(DEBUG): | ||
$(CC) $(DFLAGS) $(SOURCE) $(DMARCRO) -o $@ $(LIB) | ||
|
||
clean: | ||
rm -f *.o lsat ~/bin/lsat | ||
rm -f $(SRC_DIR)/*.o $(BIN) | ||
|
||
clean_debug: | ||
rm -f $(SRC_DIR)/*.o $(DEBUG) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,100 @@ | ||
LSAT | ||
==== | ||
# LAMSA | ||
Long Approximated Matches-based Split Aligner | ||
|
||
## Getting started | ||
git clone https://github.com/gaoyan07/LAMSA.git | ||
cd LAMSA; make | ||
./lamsa index ref.fa | ||
./lamsa aln ref.fa read.fq > aln.sam | ||
|
||
## Introduction | ||
LAMSA (Long Approximated Matches-based Split Aligner) is a novel read alignment approach that is fast and handles both co-linear and non-co-linear structural variation (SV) events well. | ||
|
||
LAMSA takes advantage of the rarity of SVs to implement a specifically designed two-step split read alignment strategy. It first resolves small events efficiently and mitigates the effect of repeats with co-linear alignment, and then handles the relatively large or non-co-linear events with a sparse dynamic programming (SDP)-based split alignment approach. | ||
|
||
LAMSA has high throughput when aligning both simulated and real datasets with various read lengths and sequencing error rates. It is several-fold to over 100-fold faster than state-of-the-art long read aligners. Moreover, it handles various kinds of SV events within a read well. | ||
|
||
LAMSA is open source and free for non-commercial use. | ||
|
||
LAMSA was mainly designed by Bo Liu and Yan Gao and developed by Yan Gao at the Center for Bioinformatics, Harbin Institute of Technology, China. | ||
|
||
## Memory requirement | ||
The memory usage of LAMSA fits the configurations of most modern servers and workstations. Its peak memory footprint depends on read length: 5.1 GB and 7.2 GB for the 5000 bp and 100000 bp datasets, respectively, on a server with an Intel Xeon CPU at 2.00 GHz and 1 TB of RAM running Ubuntu Linux 14.04. These reads were aligned to the GRCh37/hg19 reference genome. | ||
|
||
## Installation | ||
The current version of LAMSA runs on Linux. | ||
The source code is written in C and can be downloaded directly from: https://github.com/gaoyan07/LAMSA | ||
A makefile is included. Run the make command to generate the executable file. | ||
|
||
## Synopsis | ||
|
||
Index the reference sequence and generate auxiliary files | ||
``` | ||
lamsa index ref.fa | ||
``` | ||
|
||
Align long reads to the reference | ||
``` | ||
lamsa aln ref.fa read.fa/fq > aln.sam | ||
``` | ||
|
||
## Commands and options | ||
``` | ||
lamsa aln [-t nThreads] [-l seedLen] [-i seedInv] [-p maxLoci] [-V maxSVLen] | ||
[-m matchScore] [-M mismatchScore] [-O gapOpenPen] [-E gapExtPen] | ||
[-r maxOutputNum] [-g minSplitLen] [-SC] [-o outSAM] | ||
<ref.fa> <read.fa/fq> | ||
Algorithm options: | ||
-t --thread [INT] Number of threads. [1] | ||
-l --seed-len [INT] Seed length. LAMSA uses a short sequence alignment tool (e.g., GEM) to align | ||
seeds and obtain their approximate matches. [50] | ||
-i --seed-inv [INT] Interval size of adjacent seeds. LAMSA extracts seeds at starting | ||
positions every i bp. [100] | ||
-p --max-loci [INT] Maximum allowed number of a seed's locations. If a seed has more than -p | ||
approximate matches, LAMSA considers the seed too repetitive and | ||
discards all the matches. [200] | ||
-V --SV-len [INT] Expected maximum length of SV. If the genomic distance of two seeds is | ||
shorter than -V bp, they can be connected to construct a | ||
skeleton. [10000] | ||
Scoring options: | ||
-m --match-sc [INT] Match score for SW-alignment. [1] | ||
-M --mis-pen [INT] Mismatch penalty for SW-alignment. [3] | ||
-O --open-pen [INT] Gap open penalty for SW-alignment. [5] | ||
-E --ext-pen [INT] Gap extension penalty for SW-alignment. A gap of length k costs O + k*E | ||
(i.e. -O is for opening a zero-length gap). [2] | ||
Output options: | ||
-r --max-out [INT] Maximum number of output records for a specific split read region. For a | ||
specific region, LAMSA reserves the top -r alignment records. The record | ||
with highest alignment score is considered as best alignment, others are | ||
considered as alternative alignments. If the score of an alternative | ||
alignment is less than half of the best alignment, it will not be output. | ||
[10] | ||
-g --gap-split [INT] Minimum length of gap that causes a split-alignment. This avoids generating | ||
insertions (I) or deletions (D) longer than -g bp in the SAM CIGAR. [100] | ||
-S --soft-clip Use soft clipping for supplementary alignment. It is strongly recommended | ||
to turn off this option to reduce the redundancy of output when mapping | ||
relatively long reads. [false] | ||
-C --comment Append FASTQ comment to SAM output. [false] | ||
-o --output [STR] Output file (SAM format). [stdout] | ||
``` | ||
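
The option descriptions above imply a few simple rules, which can be sketched in C. This is an illustrative sketch only, not LAMSA's actual source: all function names are hypothetical, and the boundary handling (the first seed starting at offset 0, each seed fitting entirely inside the read, a non-positive gap length costing nothing) is an assumption.

```c
#include <stddef.h>

/* Hypothetical helpers illustrating the documented option semantics.
 * None of these names come from the LAMSA source code. */

/* -l / -i: count seeds of length seed_len taken every seed_inv bp.
 * Assumption: the first seed starts at offset 0 and each seed must
 * fit entirely inside the read. */
size_t seed_count(size_t read_len, size_t seed_len, size_t seed_inv)
{
    if (seed_len == 0 || seed_inv == 0 || read_len < seed_len)
        return 0;
    return (read_len - seed_len) / seed_inv + 1;
}

/* -p: a seed with more than max_loci approximate matches is treated
 * as too repetitive, and all of its matches are discarded. */
int seed_is_repetitive(int n_matches, int max_loci)
{
    return n_matches > max_loci;
}

/* -O / -E: a gap of length k costs O + k*E.
 * Assumption: a non-positive length means "no gap" and costs 0. */
int gap_cost(int open_pen, int ext_pen, int gap_len)
{
    return gap_len > 0 ? open_pen + gap_len * ext_pen : 0;
}

/* -r: an alternative alignment is not output when its score is less
 * than half of the best score. Comparing 2*alt with best avoids
 * integer-division rounding. */
int keep_alternative(int best_score, int alt_score)
{
    return 2 * alt_score >= best_score;
}
```

Under these assumptions and the default values (-l 50, -i 100, -O 5, -E 2), `seed_count(5000, 50, 100)` yields 50 seeds for a 5000 bp read, and `gap_cost(5, 2, 3)` yields 5 + 3*2 = 11.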
|
||
## Simulation benchmarking | ||
We simulated a series of datasets from a variant human genome. More precisely, we used RSVsim (version 1.10.0) to integrate 4002 SV events into the human reference genome (GRCh37/hg19) to build the donor genome, including 532 duplications, 503 insertions, 2943 deletions and 24 inversions. The ratios of the four categories of SV events were configured by referring to the DGV database. Then, using the simulated donor genome as input, 15 datasets covering 5 read lengths (5000, 10000, 20000, 50000 and 100000 bp) and 3 sequencing error rates (1%, 2% and 4%) were simulated with wgsim (https://github.com/lh3/wgsim). Each dataset contains about 6 Gbp, i.e., nearly 2X coverage of the human genome. These datasets helped us to evaluate the performance of LAMSA. The donor genome and a detailed list of the simulated SV events have been uploaded to Google Drive and can be downloaded through the following link: https://drive.google.com/folderview?id=0B24uQUND9m51UVlNWkJra19BMGs&usp=sharing | ||
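
The figures quoted in the benchmark description can be cross-checked with a little arithmetic. The C sketch below is illustrative only: the function names are hypothetical, and the genome size passed to `coverage` (about 3.1 Gbp for GRCh37/hg19) is an assumption rather than a value stated in the README.

```c
/* SV counts as quoted in the benchmark description. */
enum { N_DUP = 532, N_INS = 503, N_DEL = 2943, N_INV = 24 };

/* Total number of simulated SV events. */
int total_sv_events(void)
{
    return N_DUP + N_INS + N_DEL + N_INV;
}

/* Number of datasets: one per (read length, error rate) pair. */
int dataset_count(int n_read_lengths, int n_error_rates)
{
    return n_read_lengths * n_error_rates;
}

/* Sequencing depth = total sequenced bases / genome size. */
double coverage(double total_bases, double genome_size)
{
    return total_bases / genome_size;
}

/* Approximate number of reads needed to reach total_bases. */
double reads_needed(double total_bases, double read_len)
{
    return total_bases / read_len;
}
```

The four SV categories sum to the stated 4002 events, 5 read lengths times 3 error rates give the 15 datasets, and `coverage(6e9, 3.1e9)` comes out slightly below 2, consistent with "nearly 2X" if the assumed genome size is reasonable.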
|
||
|
||
## Reference | ||
LAMSA: fast split read alignment with long approximate matches. Manuscript in preparation. | ||
|
||
## Contact | ||
For advice, bug reports, and help requests, please contact [email protected] or [email protected] | ||
|
||
|