Sequence Query
Hüseyin Tuğrul BÜYÜKIŞIK edited this page Feb 27, 2021 · 4 revisions
Sequence queries are made with the getSequence method. It takes a single parameter, the zero-based index of the sequence: 0 is the first sequence and N-1 is the last sequence in a FASTA file that contains N sequences.
bool debug = true;
FastaGeneIndexer cache("./data/influenza.fna", debug);
// Print the descriptor, then the sequence, of the record at index 100.
std::cout << cache.getDescriptor(100) << std::endl;
std::cout << cache.getSequence(100) << std::endl;
Output:
gi|221327|gb|D13578|Influenza A virus (A/Kaizuka/2/65(H2N2)) gene for hemagglutinin, partial cds
AATACAACACTACCTTTTCACAATGTCCACCCACTGACAATAGGTGAATGCCCCAAATATGTAAAATCGGAGAAATTGGTCTTAGCAACAGGACTAAGGAATGTTCCCCAGATTGAATCAAGAGGATTGTTTGGGGCAATAGCTGGCTTTATAGAAGGAGGATGGCAAGGAATGGTTGATGGTTGGTATGGATACCATCACAGCAATGACCAGGGATCAGGGTATGCAGCAGACAAAGAATCCACTCAAAAGGCATTTGATGGAATCACCAACAAGGTAAATTCTGTGATTGAAAAGATGAACACCCAATTTGAAGCTGTTGGGAAAGAATTCAATAATTTAGAGAAAAGACTGGAGAACTTGAACAAAAAGATGGAAGACGGGTTTCTAGATG
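To make the zero-based indexing concrete, here is a minimal, self-contained sketch that does not use FastaGeneIndexer at all. The names FastaRecord, parseFasta and sequenceAt are hypothetical, and so is the choice to throw on an out-of-range index; this page does not say how getSequence behaves in that case. Unlike the library, the sketch simply holds every record in memory.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for one FASTA record; not part of FastaGeneIndexer.
struct FastaRecord {
    std::string descriptor; // header line without the leading '>'
    std::string sequence;   // all sequence lines of the record, concatenated
};

// Parse FASTA text into records, preserving file order.
std::vector<FastaRecord> parseFasta(const std::string& text) {
    std::vector<FastaRecord> records;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back(); // tolerate CRLF
        if (line.empty()) continue;
        if (line[0] == '>') {
            records.push_back(FastaRecord{line.substr(1), std::string()});
        } else if (!records.empty()) {
            records.back().sequence += line;
        }
    }
    return records;
}

// Zero-based lookup: valid indices are 0 .. N-1 for N records.
// Throwing on a bad index is an assumption made for this sketch only.
const std::string& sequenceAt(const std::vector<FastaRecord>& records,
                              std::size_t index) {
    if (index >= records.size()) {
        throw std::out_of_range("sequence index out of range");
    }
    return records[index].sequence;
}
```

With two records in the input, sequenceAt(records, 0) returns the first sequence and sequenceAt(records, records.size() - 1) returns the last one, which mirrors the 0 and N-1 convention described above.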
If the initDescriptorIndexMapping method is called first, sequences can also be accessed by their descriptors:
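bool debug = true;
FastaGeneIndexer cache("./data/influenza.fna", debug);
// Build the descriptor-to-index mapping before any lookup by descriptor.
cache.initDescriptorIndexMapping();
std::cout << cache.getDescriptor(1) << std::endl;
// Look up the same record twice: once by its descriptor, once by its index.
std::cout << cache.getSequenceByDescriptor(cache.getDescriptor(1)) << std::endl;
std::cout << cache.getSequence(1) << std::endl;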
The mapping does not skip duplicate descriptors; when a descriptor occurs more than once, the index of its last occurrence is used. Output of the example above:
gi|59292|gb|X53029|Influenza A virus (A/USSR/90/1977(H1N1)) genes for matrix proteins 1 and 2, genomic RNA
AGCAAAAGCAGGTAGATGTTGAAAGATGAGTCTTCTAACCGAGGTCGAAACGTACGTTCTCTCTATCGTCCCGTCAGGCCCCCTCAAAGCCGAGATCGCACAGAGACTTGAAGATGTCTTTGCTGGGAAGAACACCGATCTTGAGGCTCTCATGGAATGGCTAAAGACAAGACCAATCCTGTCACCTCTGACTAAGGGGATTTTAGGATTTGTGTTCACGCTCACCGTGCCCAGTGAGCGAGGACTGCAGCGTAGACGCTTTGTCCAAAATGCCCTTAATGGGAATGGGGATCCAAATAACATGGACAGAGCAGTTAAACTGTATAGAAAGCTTAAGAGGGAGATAACATTCCATGGGGCCAAAGAAATAGCACTCAGTTATTCTGCTGGTGCACTTGCCAGTTGTATGGGCCTCATATACAACAGGATGGGGGCTGTGACCACCGAAGCGGCATTTGGCCTGATATGCGCAACCTGTGAACAGATTGCTGACTCCCAGCATAGGTCTCATAGGCAAATGGTGACAACAACCAATCCACTAATAAGACATGAGAACAGAATGGTTCTGGCCAGCACTACAGCTAAGGCTATGGAGCAAATGGCTGGATCGAGTGAGCAAGCAGCAGAGGCCATGGAGGTTGCTAGTCAGGCCAGGCAAATGGTGCAGGCAATGAGAGCCATTGGGACTCATCCTAGCTCCAGTGCTGGTCTGAAAAATGATCTTCTTGAAAATTTGCAGGCCTATCAGAAACGAATGGGGGTGCAGATGCAACGATTCAAGTGATCCTCTTGTTGTTGCCGCAAGTATCATTGGGATTTTGCACTTGATATTGTGGATTCTTGATCGTCTTTTTTTCAAATGCATTTATCGTCTCTTTAAACACGGTCTGAAAAGAGGGCCTTCTACGGAAGGAGTACCAGAGTCTATGAGGGAAGAATATCGAAAGGAACAGCAGAATGCTGTGGATGCTGACGATAGTCATTTTGTCAACATAGAGCTAGAGTAAAAAACTACCTTGTTTCTACT
AGCAAAAGCAGGTAGATGTTGAAAGATGAGTCTTCTAACCGAGGTCGAAACGTACGTTCTCTCTATCGTCCCGTCAGGCCCCCTCAAAGCCGAGATCGCACAGAGACTTGAAGATGTCTTTGCTGGGAAGAACACCGATCTTGAGGCTCTCATGGAATGGCTAAAGACAAGACCAATCCTGTCACCTCTGACTAAGGGGATTTTAGGATTTGTGTTCACGCTCACCGTGCCCAGTGAGCGAGGACTGCAGCGTAGACGCTTTGTCCAAAATGCCCTTAATGGGAATGGGGATCCAAATAACATGGACAGAGCAGTTAAACTGTATAGAAAGCTTAAGAGGGAGATAACATTCCATGGGGCCAAAGAAATAGCACTCAGTTATTCTGCTGGTGCACTTGCCAGTTGTATGGGCCTCATATACAACAGGATGGGGGCTGTGACCACCGAAGCGGCATTTGGCCTGATATGCGCAACCTGTGAACAGATTGCTGACTCCCAGCATAGGTCTCATAGGCAAATGGTGACAACAACCAATCCACTAATAAGACATGAGAACAGAATGGTTCTGGCCAGCACTACAGCTAAGGCTATGGAGCAAATGGCTGGATCGAGTGAGCAAGCAGCAGAGGCCATGGAGGTTGCTAGTCAGGCCAGGCAAATGGTGCAGGCAATGAGAGCCATTGGGACTCATCCTAGCTCCAGTGCTGGTCTGAAAAATGATCTTCTTGAAAATTTGCAGGCCTATCAGAAACGAATGGGGGTGCAGATGCAACGATTCAAGTGATCCTCTTGTTGTTGCCGCAAGTATCATTGGGATTTTGCACTTGATATTGTGGATTCTTGATCGTCTTTTTTTCAAATGCATTTATCGTCTCTTTAAACACGGTCTGAAAAGAGGGCCTTCTACGGAAGGAGTACCAGAGTCTATGAGGGAAGAATATCGAAAGGAACAGCAGAATGCTGTGGATGCTGACGATAGTCATTTTGTCAACATAGAGCTAGAGTAAAAAACTACCTTGTTTCTACT
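The "last duplicate wins" behaviour can be illustrated with a small sketch. This is not the library's implementation; buildDescriptorIndex is a hypothetical helper, and using std::unordered_map is an assumption chosen only because plain assignment through operator[] reproduces the rule described above.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper, not FastaGeneIndexer code: map each descriptor
// to the index of the record that carries it.
std::unordered_map<std::string, std::size_t>
buildDescriptorIndex(const std::vector<std::string>& descriptors) {
    std::unordered_map<std::string, std::size_t> index;
    for (std::size_t i = 0; i < descriptors.size(); ++i) {
        // Assignment through operator[] overwrites any earlier entry,
        // so a repeated descriptor ends up pointing at its last occurrence.
        index[descriptors[i]] = i;
    }
    return index;
}
```

For the descriptor list {"a", "b", "a"}, the resulting map has two entries and "a" maps to index 2 rather than 0. In a file with repeated descriptors, a lookup by descriptor can therefore return a different record than a lookup by the index the descriptor was read from.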