|
71 | 71 | "url" : "https://bioinf.shenwei.me/LexicMap/tutorials/search/",
|
72 | 72 | "headline": "Step 2. Searching",
|
73 | 73 | "description": "Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Improving searching speed Steps Output Alignment result relationship Output format Examples Summarizing results TL;DR Build a LexicMap index.\nRun:\nFor short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.",
|
74 |
| - "wordCount" : "2941", |
| 74 | + "wordCount" : "3088", |
75 | 75 | "inLanguage": "en",
|
76 | 76 | "isFamilyFriendly": "true",
|
77 | 77 | "mainEntityOfPage": {
|
@@ -2089,23 +2089,41 @@ <h1>Step 2. Searching</h1>
|
2089 | 2089 | <svg class="gdoc-icon gdoc_link"><use xlink:href="#gdoc_link"></use></svg>
|
2090 | 2090 | </a>
|
2091 | 2091 | </div>
|
| 2092 | +<p>LexicMap’s searching speed is related to many factors:</p> |
| 2093 | +<ul> |
| 2094 | +<li><strong>The number of similar sequences in the index/database</strong>. More genome hits take more time, e.g., for a 16S rRNA gene query.</li> |
| 2095 | +<li><strong>Similarity between query and subject sequences</strong>. Alignment of diverse sequences is slower than that of highly similar sequences.</li> |
| 2096 | +<li><strong>The length of the query sequence</strong>. Longer queries take more time.</li> |
| 2097 | +<li><strong>The I/O performance and load</strong>. LexicMap is I/O bound, because seed matching and the extraction of candidate subsequences for alignment require a large number of parallel file reads.</li> |
| 2098 | +<li><strong>CPU frequency and the number of threads</strong>. Faster CPUs and more threads reduce search time.</li> |
| 2099 | +</ul> |
2092 | 2100 | <p>Here are some tips to improve the search speed.</p>
|
2093 | 2101 | <ul>
|
2094 |
| -<li>Increasing the concurrency number |
| 2102 | +<li><strong>Increasing the concurrency number</strong> |
2095 | 2103 | <ul>
|
2096 |
| -<li>Increasing the value of <code>--max-open-files</code> (default 512). You might need to <a |
| 2104 | +<li> |
| 2105 | +<p>Make sure that the value of <code>-j/--threads</code> (default: all available CPUs) is ≥ the number of seed chunk files (default: all available CPUs in the indexing step), which can be found in the <code>info.toml</code> file, e.g.,</p> |
| 2106 | +<pre><code># Seeds (k-mer-value data) files |
| 2107 | +chunks = 48 |
| 2108 | +</code></pre> |
| 2109 | +</li> |
| 2110 | +<li> |
| 2111 | +<p>Increasing the value of <code>--max-open-files</code> (default 512). You might also need to <a |
2097 | 2112 | class="gdoc-markdown__link"
|
2098 | 2113 | href="https://stackoverflow.com/questions/34588/how-do-i-change-the-number-of-open-files-limit-in-linux"
|
2099 |
| ->change the open files limit</a>.</li> |
2100 |
| -<li>(If you have many queries) Increase the value of <code>-J/--max-query-conc</code> (default 12); this will increase memory usage.</li> |
| 2114 | +>change the open files limit</a>.</p> |
| 2115 | +</li> |
| 2116 | +<li> |
| 2117 | +<p>(If you have many queries) Increase the value of <code>-J/--max-query-conc</code> (default 12); this will increase memory usage.</p> |
| 2118 | +</li> |
2101 | 2119 | </ul>
|
2102 | 2120 | </li>
|
2103 |
| -<li>Loading the entire seed data into memory (It’s unnecessary if the index is stored on an SSD) |
| 2121 | +<li>(If you have many queries) <strong>Loading the entire seed data into memory</strong> (It’s unnecessary if the index is stored on an SSD) |
2104 | 2122 | <ul>
|
2105 | 2123 | <li>Setting <code>-w/--load-whole-seeds</code> to load the whole seed data into memory for faster search. For example, for ~85,000 GTDB representative genomes, the memory would be ~260 GB with default parameters.</li>
|
2106 | 2124 | </ul>
|
2107 | 2125 | </li>
|
2108 |
| -<li>Returning fewer results |
| 2126 | +<li><strong>Returning fewer results</strong> |
2109 | 2127 | <ul>
|
2110 | 2128 | <li>Setting <code>-n/--top-n-genomes</code> to keep top N genome matches for a query (0 for all) in the chaining phase. For queries with a large number of genome hits, a reasonable value such as 1000 would reduce the computation time.</li>
|
2111 | 2129 | </ul>
|
|