
Commit 2fceec9

docs: update tips for improving searching speed
1 parent 883b7e9

File tree

3 files changed: +43 -15 lines

faqs/index.html (+17 -7)
@@ -59,7 +59,7 @@
 "url" : "https://bioinf.shenwei.me/LexicMap/faqs/",
 "headline": "FAQs",
 "description": "Table of contents Table of contents Does LexicMap support short reads? Does LexicMap support fungi genomes? How’s the hardware requirement? Can I extract the matched sequences? How can I extract the upstream and downstream flanking sequences of matched regions? Why isn’t the pident 100% when aligning with a sequence from the reference genomes? Why is LexicMap slow for batch searching? Does LexicMap support short reads? LexicMap is mainly designed for sequence alignment with a small number of queries (gene\/plasmid\/virus\/phage sequences) longer than 200 bp by default.",
-"wordCount" : "773",
+"wordCount" : "818",
 "inLanguage": "en",
 "isFamilyFriendly": "true",
 "mainEntityOfPage": {
@@ -1830,21 +1830,31 @@ <h1>FAQs</h1>
 <p>LexicMap is mainly designed for sequence alignment with a small number of queries against a database with a huge number (up to 17 million) of genomes.
 There are some ways to improve the search speed of <code>lexicmap search</code>.</p>
 <ul>
-<li>Increasing the concurrency number
+<li><strong>Increasing the concurrency number</strong>
 <ul>
-<li>Increasing the value of <code>--max-open-files</code> (default 512). You might need to <a
+<li>
+<p>Make sure that the value of <code>-j/--threads</code> (default: all available CPUs) is ≥ the number of seed chunk files (default: all available CPUs in the indexing step), which can be found in the <code>info.toml</code> file, e.g.,</p>
+<pre><code># Seeds (k-mer-value data) files
+chunks = 48
+</code></pre>
+</li>
+<li>
+<p>Increasing the value of <code>--max-open-files</code> (default 512). You might also need to <a
 class="gdoc-markdown__link"
 href="https://stackoverflow.com/questions/34588/how-do-i-change-the-number-of-open-files-limit-in-linux"
->change the open files limit</a>.</li>
-<li>(If you have many queries) Increase the value of <code>-J/--max-query-conc</code> (default 12), it will increase the memory.</li>
+>change the open files limit</a>.</p>
+</li>
+<li>
+<p>(If you have many queries) Increase the value of <code>-J/--max-query-conc</code> (default 12), though it will increase memory usage.</p>
+</li>
 </ul>
 </li>
-<li>Loading the entire seed data into memoy (It&rsquo;s unnecessary if the index is stored in SSD)
+<li><strong>Loading the entire seed data into memory</strong> (It&rsquo;s unnecessary if the index is stored on SSD)
 <ul>
 <li>Setting <code>-w/--load-whole-seeds</code> to load the whole seed data into memory for faster search. For example, for ~85,000 GTDB representative genomes, the memory would be ~260 GB with default parameters.</li>
 </ul>
 </li>
-<li>Returning less results
+<li><strong>Returning fewer results</strong>
 <ul>
 <li>Setting <code>-n/--top-n-genomes</code> to keep the top N genome matches for a query (0 for all) in the chaining phase. For queries with a large number of genome hits, a reasonable value such as 1000 would reduce the computation time.</li>
 </ul>
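
Taken together, the tips above amount to a command line like the following sketch. It assumes the db.lmi index directory and query file names used in the tutorial examples, that info.toml sits at the top level of the index directory, and the example chunk count of 48 from the snippet above; ulimit -n is the standard shell way to raise the open-files limit discussed in the linked answer.

    # Look up the seed chunk count recorded at indexing time (here: 48)
    grep chunks db.lmi/info.toml

    # Raise the per-process open-files limit for this shell session
    ulimit -n 4096

    # Match -j/--threads to the chunk count and allow more open files
    lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
        -j 48 --max-open-files 4096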

search/en.data.min.json (+1 -1)

Large diffs are not rendered by default.

tutorials/search/index.html (+25 -7)
@@ -71,7 +71,7 @@
 "url" : "https://bioinf.shenwei.me/LexicMap/tutorials/search/",
 "headline": "Step 2. Searching",
 "description": "Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Improving searching speed Steps Output Alignment result relationship Output format Examples Summarizing results TL;DR Build a LexicMap index.\nRun:\nFor short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.",
-"wordCount" : "2941",
+"wordCount" : "3088",
 "inLanguage": "en",
 "isFamilyFriendly": "true",
 "mainEntityOfPage": {
@@ -2089,23 +2089,41 @@ <h1>Step 2. Searching</h1>
 <svg class="gdoc-icon gdoc_link"><use xlink:href="#gdoc_link"></use></svg>
 </a>
 </div>
+<p>LexicMap&rsquo;s searching speed depends on many factors:</p>
+<ul>
+<li><strong>The number of similar sequences in the index/database</strong>. More genome hits cost more time, e.g., for the 16S rRNA gene.</li>
+<li><strong>Similarity between query and subject sequences</strong>. Alignment of diverse sequences is slower than that of highly similar sequences.</li>
+<li><strong>The length of the query sequence</strong>. Longer queries take more time.</li>
+<li><strong>The I/O performance and load</strong>. LexicMap is I/O-bound, because seed matching and extracting candidate subsequences for alignment require a large number of parallel file reads.</li>
+<li><strong>CPU frequency and the number of threads</strong>. Faster CPUs and more threads reduce the search time.</li>
+</ul>
 <p>Here are some tips to improve the search speed.</p>
 <ul>
-<li>Increasing the concurrency number
+<li><strong>Increasing the concurrency number</strong>
 <ul>
-<li>Increasing the value of <code>--max-open-files</code> (default 512). You might need to <a
+<li>
+<p>Make sure that the value of <code>-j/--threads</code> (default: all available CPUs) is ≥ the number of seed chunk files (default: all available CPUs in the indexing step), which can be found in the <code>info.toml</code> file, e.g.,</p>
+<pre><code># Seeds (k-mer-value data) files
+chunks = 48
+</code></pre>
+</li>
+<li>
+<p>Increasing the value of <code>--max-open-files</code> (default 512). You might also need to <a
 class="gdoc-markdown__link"
 href="https://stackoverflow.com/questions/34588/how-do-i-change-the-number-of-open-files-limit-in-linux"
->change the open files limit</a>.</li>
-<li>(If you have many queries) Increase the value of <code>-J/--max-query-conc</code> (default 12), it will increase the memory.</li>
+>change the open files limit</a>.</p>
+</li>
+<li>
+<p>(If you have many queries) Increase the value of <code>-J/--max-query-conc</code> (default 12), though it will increase memory usage.</p>
+</li>
 </ul>
 </li>
-<li>Loading the entire seed data into memoy (It&rsquo;s unnecessary if the index is stored in SSD)
+<li>(If you have many queries) <strong>Loading the entire seed data into memory</strong> (It&rsquo;s unnecessary if the index is stored on SSD)
 <ul>
 <li>Setting <code>-w/--load-whole-seeds</code> to load the whole seed data into memory for faster search. For example, for ~85,000 GTDB representative genomes, the memory would be ~260 GB with default parameters.</li>
 </ul>
 </li>
-<li>Returning less results
+<li><strong>Returning fewer results</strong>
 <ul>
 <li>Setting <code>-n/--top-n-genomes</code> to keep the top N genome matches for a query (0 for all) in the chaining phase. For queries with a large number of genome hits, a reasonable value such as 1000 would reduce the computation time.</li>
 </ul>
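
For batch searches with many queries, the remaining tips combine as in this sketch, again assuming the db.lmi index and a hypothetical queries.fasta input. -w/--load-whole-seeds trades memory for speed (only worthwhile when the index is not on SSD), -J/--max-query-conc stays at its default of 12, and -n/--top-n-genomes caps genome matches per query at the suggested 1000.

    # Batch search: load all seed data into memory (~260 GB for ~85,000
    # GTDB representative genomes), process up to 12 queries concurrently,
    # and keep only the top 1000 genome matches per query
    lexicmap search -d db.lmi queries.fasta -o queries.lexicmap.tsv \
        -w -J 12 -n 1000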
