-
Notifications
You must be signed in to change notification settings - Fork 12
4. Finding target contigs
find_target_contigs.py
Now we need to figure out, which of the randomly assembled contigs from the previous step represent the target loci. The target loci are those genomic regions, which the capture-probes were designed for, and that were (hopefully) selectively amplified during the library preparation. Therefore we first need to dig out the probe-order-file which should be in fasta format. Alternatively you can also create a new fasta file containing all loci that you are potentially interested in. Having the reference fasta-file ready, and the folder containing the assembled contigs for all samples at hand, we can start blasting the reference sequences against our contig-file database. The following script does exactly that.
Let's first check the available options:
python2.7 find_target_contigs.py --h
This results in the following screen output:
usage: match_contigs_to_reference1.3.py [-h] --contigs CONTIGS --reference REFERENCE --output OUTPUT [--verbosity {INFO,WARN,CRITICAL}] [--log-path LOG_PATH] [--min-coverage MIN_COVERAGE] [--min-identity MIN_IDENTITY] [--dupefile DUPEFILE] [--regex REGEX] [--keep-duplicates KEEP_DUPLICATES] [--sqlite3 SQLITE3]
Here a little explanation of all the possible options:
--contigs: The directory containing the assembled contigs in fasta format.
--reference: The fasta-file containing the reference sequences (probe-order-file).
--output: The directory in which to store the resulting SQL database and LASTZ files.
--verbosity: The logging level to use. You can choose between {INFO,WARN,CRITICAL}.
--log-path: The path to a directory to store log-files.
--min-coverage: The minimum percent coverage required for a match between contig and reference locus sequence [default=80].
--min-identity: The minimum percent identity required for a match between contig and reference locus sequence [default=80].
--dupefile: Path to self-to-self lastz results for baits to remove potential duplicate probes (Only relevant for probe-design).
--regex: A regular expression, matching the name-pattern of the sequences in the reference fasta-file. This can be a part of the fasta-headers which you want to have as names in the output table [default='\w+\d+\d+']. The default regex works for a name pattern like this one: SourceOrganismName_255_1
which is a common way to label your bait-sequences, containing the taxonomic name of the organism that the original sequence was extracted from (mostly only the genus name). The numbers are a code for a gene and the id of the exon of the respective gene, separated by an underscore [_].
--keep-duplicates: An optional output file in which to store those loci that appear to be duplicates. Usually those loci are discarded.
--sqlite3: The complete path to sqlite3.
python2.7 find_target_contigs.py --contigs path/to/contig-folder --reference path/to/reference.fasta --output path/to/matches-folder --min-coverage 85 --min-identity 85 --regex "\w+_\d+_\d+" --sqlite3 /path/to/sqlite
You can play around with the settings a bit, particularly with the values for --min-coverage
and --min-identity
and check the log-file or the screen output for the number of contigs that could be matched to the reference and the number of duplicates/multiple hits. You have to find a good compromise between avoiding false hits (when identity thresholds are too low) and ensuring to get the maximum out of your contigs, including those contigs that differ somewhat more from the reference sequences, and would therefore be discarded if you set the thresholds too high.
There is a text-file, which is stored in the output-folder with the name match_table.txt
. Open this file in Excel to get an overview of which exons could be recovered for which samples.