Sub Sequence Query
Hüseyin Tuğrul BÜYÜKIŞIK edited this page Feb 27, 2021 · 5 revisions
The getSequence method extracts a sub-sequence string from a sequence. The first parameter selects the sequence by its index. The second parameter is the starting position of the read; positions count only nucleobase symbols, so line-feed characters are excluded. The third parameter is the length of the read.
FastaGeneIndexer cache("./data/yersinia_pestis_genomic.fna");
std::cout << cache.getDescriptor(3) << std::endl;
// getting 100 nucleobase symbols from the fourth (index=3) sequence, starting at its 101st (index=100) symbol
std::cout << cache.getSequence(3,100,100) << std::endl;
output:
NC_003132.1 Yersinia pestis CO92 plasmid pPCP1, complete sequence
GACAGTTATGGAAATTAAAATCCTGCACAAGCAGGGAATGAGTAGCCGGGCGATTGCCAGAGAACTGGGGATCTCCCGCAATACCGTTAAACGTTATTTG
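To make the position semantics concrete, here is a minimal standalone sketch. It is not how FastaGeneIndexer works internally; subSequence is a hypothetical helper written only for illustration. It assumes the sequence body arrives as raw FASTA text with line breaks, and it shows that the start position and length count nucleobase symbols only.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of FastaGeneIndexer): returns `length`
// symbols starting at symbol index `start`, where indices count only
// sequence symbols and ignore line-feed / carriage-return characters.
std::string subSequence(const std::string& rawFastaBody,
                        std::size_t start,
                        std::size_t length)
{
    // Build a copy of the sequence without any line breaks.
    std::string symbols;
    symbols.reserve(rawFastaBody.size());
    for (char c : rawFastaBody)
    {
        if (c != '\n' && c != '\r')
            symbols.push_back(c);
    }

    // A start position past the end yields an empty result.
    if (start >= symbols.size())
        return std::string();

    // substr clamps the length if fewer symbols remain.
    return symbols.substr(start, length);
}
```

For example, with the three-line body "ACGT\nTTGA\nCCAG\n", calling subSequence(body, 2, 5) returns GTTTG: the read starts at the third symbol and continues across the first line break without counting it.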
If the initDescriptorIndexMapping method is called, sequences can also be accessed by their descriptors:
bool debug = true;
FastaGeneIndexer cache("./data/influenza.fna", debug);
cache.initDescriptorIndexMapping();
std::cout << cache.getDescriptor(1) << std::endl;
std::cout << cache.getSequenceByDescriptor(cache.getDescriptor(1),50,50) << std::endl;
std::cout << cache.getSequence(1,50,50) << std::endl;
Duplicate descriptors are not skipped. When several sequences share the same descriptor, the index of the last duplicate is used for the lookup.
output:
gi|59292|gb|X53029|Influenza A virus (A/USSR/90/1977(H1N1)) genes for matrix proteins 1 and 2, genomic RNA
CGTACGTTCTCTCTATCGTCCCGTCAGGCCCCCTCAAAGCCGAGATCGCA
CGTACGTTCTCTCTATCGTCCCGTCAGGCCCCCTCAAAGCCGAGATCGCA
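The duplicate-descriptor behavior can be illustrated with a small sketch. This is an assumption about the general idea, not the library's actual code: buildDescriptorIndex is a hypothetical function that maps each descriptor to an index using a hash map, where a later duplicate simply overwrites the earlier entry.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of FastaGeneIndexer): maps each descriptor
// to the index of the sequence that carries it. Duplicates are not skipped;
// each later occurrence overwrites the stored index, so the last one wins.
std::unordered_map<std::string, std::size_t>
buildDescriptorIndex(const std::vector<std::string>& descriptors)
{
    std::unordered_map<std::string, std::size_t> mapping;
    for (std::size_t i = 0; i < descriptors.size(); ++i)
    {
        // operator[] inserts the key if missing, then assigns the index.
        mapping[descriptors[i]] = i;
    }
    return mapping;
}
```

With the descriptors {"seqA", "seqB", "seqA"}, the resulting map holds two entries, and "seqA" resolves to index 2 rather than 0. If a file contains repeated descriptors, index-based access with getSequence remains the unambiguous way to reach the earlier copies.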