
Sub Sequence Query

Hüseyin Tuğrul BÜYÜKIŞIK edited this page Feb 27, 2021 · 5 revisions

The getSequence method returns a sub-sequence of a sequence as a string. The first parameter is the index of the sequence. The second parameter is the starting position of the read; positions count only nucleobase symbols, because line-feed characters are excluded from the sequence. The third parameter is the number of symbols to read.

FastaGeneIndexer cache("./data/yersinia_pestis_genomic.fna");
std::cout << cache.getDescriptor(3) << std::endl;

// get 100 nucleobase symbols from the fourth (index=3) sequence, starting at its 101st (index=100) symbol
std::cout << cache.getSequence(3,100,100) << std::endl;

Output:

NC_003132.1 Yersinia pestis CO92 plasmid pPCP1, complete sequence
GACAGTTATGGAAATTAAAATCCTGCACAAGCAGGGAATGAGTAGCCGGGCGATTGCCAGAGAACTGGGGATCTCCCGCAATACCGTTAAACGTTATTTG
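
To illustrate what "position excludes line-feed characters" means, here is a minimal standalone sketch that does not use FastaGeneIndexer at all. The helpers readSequences and subSequence are hypothetical names invented for this illustration, and the sketch assumes that the library's positions behave like indices into the sequence text with all line breaks removed, as described above. It says nothing about how the library stores or caches data internally.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of FastaGeneIndexer): collects the symbols of
// every record in FASTA-formatted text, dropping the line breaks.
std::vector<std::string> readSequences(const std::string& fastaText)
{
    std::vector<std::string> sequences;
    std::istringstream in(fastaText);
    std::string line;
    while (std::getline(in, line))
    {
        if (!line.empty() && line.back() == '\r')
            line.pop_back();              // tolerate Windows line endings
        if (line.empty())
            continue;                     // skip blank lines
        if (line[0] == '>')
            sequences.emplace_back();     // a descriptor line starts a new record
        else if (!sequences.empty())
            sequences.back() += line;     // symbols only, no line feeds
    }
    return sequences;
}

// Hypothetical stand-in for getSequence(index, start, length).
// Throws std::out_of_range if index or start is out of bounds;
// a length that runs past the end is clamped by std::string::substr.
std::string subSequence(const std::vector<std::string>& sequences,
                        std::size_t index, std::size_t start, std::size_t length)
{
    return sequences.at(index).substr(start, length);
}
```

With the text ">seq0\nACGT\nACGT\n>seq1\nTTGA\nCCAG\nGG\n", calling subSequence(readSequences(text), 1, 2, 4) reads across the line break between "TTGA" and "CCAG" and returns "GACC".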

Once the initDescriptorIndexMapping method has been called, sequences can also be accessed by their descriptors:

bool debug = true;
FastaGeneIndexer cache("./data/influenza.fna", debug);
cache.initDescriptorIndexMapping();
std::cout << cache.getDescriptor(1) << std::endl;
std::cout << cache.getSequenceByDescriptor(cache.getDescriptor(1),50,50) << std::endl;
std::cout << cache.getSequence(1,50,50) << std::endl;

Duplicate descriptors are not skipped: when several sequences share the same descriptor, the mapping uses the index of the last one. Output:

gi|59292|gb|X53029|Influenza A virus (A/USSR/90/1977(H1N1)) genes for matrix proteins 1 and 2, genomic RNA
CGTACGTTCTCTCTATCGTCCCGTCAGGCCCCCTCAAAGCCGAGATCGCA
CGTACGTTCTCTCTATCGTCCCGTCAGGCCCCCTCAAAGCCGAGATCGCA
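
The "last duplicate wins" rule can be pictured with a small standalone sketch. The function buildDescriptorIndex is a hypothetical name for this illustration, and using std::map is an assumption; the actual container and logic inside initDescriptorIndexMapping may differ. The sketch only shows one simple way the documented behavior arises, namely by overwriting an existing entry on each insertion.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch (not the library's code): maps each descriptor to the
// index of the sequence it belongs to.
std::map<std::string, std::size_t>
buildDescriptorIndex(const std::vector<std::string>& descriptors)
{
    std::map<std::string, std::size_t> mapping;
    for (std::size_t i = 0; i < descriptors.size(); ++i)
    {
        // operator[] overwrites an existing entry, so when a descriptor
        // occurs more than once, the index of its last occurrence is kept.
        mapping[descriptors[i]] = i;
    }
    return mapping;
}
```

For the descriptors {"a", "b", "a"}, the resulting mapping has two entries: "a" maps to index 2 and "b" maps to index 1. A lookup by a duplicated descriptor therefore never reaches the earlier sequence; use the index-based getSequence for that.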