update docs

shenwei356 · shenwei356 · commit 09d3e4d144cc · 2024-08-14T16:08:41.000+01:00
diff --git a/introduction/index.html b/introduction/index.html
@@ -1482,15 +1482,15 @@ <h1>Introduction</h1>
   class="gdoc-markdown__link"
   href="https://bioinf.shenwei.me/LexicMap/introduction/#searching"
 >fast and memory-efficient</a></strong>.</li>
-<li>LexicMap is easy to <a
+<li>LexicMap is <strong>easy to <a
   class="gdoc-markdown__link"
   href="http://bioinf.shenwei.me/LexicMap/installation/"
->install</a>,
+>install</a></strong>,
 we provide <a
   class="gdoc-markdown__link"
   href="https://github.com/shenwei356/LexicMap/releases/"
 >binary files</a> with no dependencies for Linux, Windows, and macOS (x86 and ARM CPUs).</li>
-<li>LexicMap is easy to use (<a
+<li>LexicMap is <strong>easy to use</strong> (<a
   class="gdoc-markdown__link"
   href="http://bioinf.shenwei.me/LexicMap/tutorials/index/"
 >tutorials</a> and <a
@@ -1542,7 +1542,7 @@ <h1>Introduction</h1>
 <li><strong>We added support for suffix matching of seeds, making seeds much more tolerant to mutations</strong>. Any 31-bp seed with a common ≥15 bp prefix or suffix can be matched, which means <strong>seeds are immune to any single SNP</strong>.</li>
 </ol>
 </li>
-<li>A multi-level index enables fast and low-memory variable-length seed matching and chaining.</li>
+<li>A hierarchical index enables fast and low-memory variable-length seed matching and chaining.</li>
 <li>A pseudo alignment algorithm is used to find similar sequence regions from chaining results for alignment.</li>
 <li>A <a
   class="gdoc-markdown__link"
@@ -1761,9 +1761,9 @@ <h1>Introduction</h1>
 <tr>
 <td style="text-align:left">GTDB complete</td>
 <td style="text-align:right">402,538</td>
-<td style="text-align:right">578 GB</td>
+<td style="text-align:right">443 GB</td>
 <td style="text-align:left">LexicMap</td>
-<td style="text-align:right">906 GB</td>
+<td style="text-align:right">973 GB</td>
 <td style="text-align:right">10 h 36 m</td>
 <td style="text-align:right">63.3 GB</td>
 </tr>
@@ -1772,16 +1772,16 @@ <h1>Introduction</h1>
 <td style="text-align:right"></td>
 <td style="text-align:right"></td>
 <td style="text-align:left">Blastn</td>
-<td style="text-align:right">360 GB</td>
+<td style="text-align:right">387 GB</td>
 <td style="text-align:right">3 h 11 m</td>
 <td style="text-align:right">718 MB</td>
 </tr>
 <tr>
 <td style="text-align:left">AllTheBacteria HQ</td>
 <td style="text-align:right">1,858,610</td>
-<td style="text-align:right">3.1 TB</td>
+<td style="text-align:right">2.5 TB</td>
 <td style="text-align:left">LexicMap</td>
-<td style="text-align:right">3.88 TB</td>
+<td style="text-align:right">4.26 TB</td>
 <td style="text-align:right">48 h 08 m</td>
 <td style="text-align:right">88.6 GB</td>
 </tr>
@@ -1790,7 +1790,7 @@ <h1>Introduction</h1>
 <td style="text-align:right"></td>
 <td style="text-align:right"></td>
 <td style="text-align:left">Blastn</td>
-<td style="text-align:right">1.76 TB</td>
+<td style="text-align:right">1.93 TB</td>
 <td style="text-align:right">14 h 03 m</td>
 <td style="text-align:right">2.9 GB</td>
 </tr>
@@ -1806,9 +1806,9 @@ <h1>Introduction</h1>
 <tr>
 <td style="text-align:left">Genbank+RefSeq</td>
 <td style="text-align:right">2,340,672</td>
-<td style="text-align:right">3.5 TB</td>
+<td style="text-align:right">2.7 TB</td>
 <td style="text-align:left">LexicMap</td>
-<td style="text-align:right">4.94 TB</td>
+<td style="text-align:right">5.43 TB</td>
 <td style="text-align:right">54 h 33 m</td>
 <td style="text-align:right">178.3 GB</td>
 </tr>
@@ -1817,7 +1817,7 @@ <h1>Introduction</h1>
 <td style="text-align:right"></td>
 <td style="text-align:right"></td>
 <td style="text-align:left">Blastn</td>
-<td style="text-align:right">2.15 TB</td>
+<td style="text-align:right">2.37 TB</td>
 <td style="text-align:right">14 h 04 m</td>
 <td style="text-align:right">4.3 GB</td>
 </tr>
@@ -1914,8 +1914,8 @@ <h1>Introduction</h1>
 <td style="text-align:left">LexicMap</td>
 <td style="text-align:right">3,867,003</td>
 <td style="text-align:right">2,228,339</td>
-<td style="text-align:right">1,165 s</td>
-<td style="text-align:right">20.2 GB</td>
+<td style="text-align:right">1,254 s</td>
+<td style="text-align:right">21.4 GB</td>
 </tr>
 <tr>
 <td style="text-align:left"></td>
diff --git a/search/en.data.min.json b/search/en.data.min.json
diff --git a/tutorials/index/index.html b/tutorials/index/index.html
@@ -59,7 +59,7 @@
       "url" : "https://bioinf.shenwei.me/LexicMap/tutorials/index/",
       "headline": "Step 1. Building a database",
       "description": "Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output File structure Index size Explore the index TL;DR Prepare input files: Sequences of each reference genome should be saved in separate FASTA\/Q files, with identifiers in the file names. E.g., GCF_000006945.2.fna.gz While if you save a few small (viral) complete genomes (one sequence per genome) in each file, it’s feasible as sequence IDs in search results can help to distinguish target genomes.",
-      "wordCount" : "2840",
+      "wordCount" : "2851",
       "inLanguage": "en",
       "isFamilyFriendly": "true",
       "mainEntityOfPage": {
@@ -2045,12 +2045,12 @@ <h1>Step 1. Building a database</h1>
     </label>
     <div class="gdoc-markdown--nested gdoc-tabs__content">
       <pre><code># 15 genomes
-demo.lmi: 69.89 MB
-  56.65 MB      seeds
-  12.93 MB      genomes
- 312.53 KB      masks.bin
-  375.00 B      genomes.map.bin
-  323.00 B      info.toml
+demo.lmi: 73.30 MB (73,297,328)
+  59.41 MB      seeds
+  13.57 MB      genomes
+ 320.03 kB      masks.bin
+     375 B      genomes.map.bin
+     323 B      info.toml
 </code></pre>
 
     </div>
@@ -2066,12 +2066,12 @@ <h1>Step 1. Building a database</h1>
     </label>
     <div class="gdoc-markdown--nested gdoc-tabs__content">
       <pre><code># 85,205 genomes
-gtdb_repr.lmi: 212.48 GB
- 145.69 GB      seeds
-  66.78 GB      genomes
-   2.03 MB      genomes.map.bin
- 312.53 KB      masks.bin
-  329.00 B      info.toml
+gtdb_repr.lmi: 228.15 GB (228,149,871,198)
+ 156.44 GB      seeds
+  71.71 GB      genomes
+   2.13 MB      genomes.map.bin
+ 320.03 kB      masks.bin
+     329 B      info.toml
 </code></pre>
 
     </div>
@@ -2087,12 +2087,12 @@ <h1>Step 1. Building a database</h1>
     </label>
     <div class="gdoc-markdown--nested gdoc-tabs__content">
       <pre><code># 402,538 genomes
-gtdb_complete.lmi: 906.04 GB
- 543.06 GB      seeds
- 362.98 GB      genomes
-   9.60 MB      genomes.map.bin
- 312.53 KB      masks.bin
-  330.00 B      info.toml
+gtdb_complete.lmi: 972.85 GB (972,854,821,322)
+ 583.10 GB      seeds
+ 389.74 GB      genomes
+  10.06 MB      genomes.map.bin
+ 320.03 kB      masks.bin
+     330 B      info.toml
 </code></pre>
 
     </div>
@@ -2108,12 +2108,13 @@ <h1>Step 1. Building a database</h1>
     </label>
     <div class="gdoc-markdown--nested gdoc-tabs__content">
       <pre><code># 2,340,672 genomes
-genbank_refseq.lmi: 4.94 TB
-   2.77 TB      seeds
-   2.17 TB      genomes
-  55.81 MB      genomes.map.bin
- 312.53 KB      masks.bin
-  332.00 B      info.toml
+genbank_refseq.lmi: 5.43 TB (5,428,824,803,581)
+   3.04 TB      seeds
+   2.38 TB      genomes
+ 821.17 MB      kmers-m12345.tsv
+  58.52 MB      genomes.map.bin
+ 320.03 kB      masks.bin
+     332 B      info.toml
 </code></pre>
 
     </div>
@@ -2129,12 +2130,12 @@ <h1>Step 1. Building a database</h1>
     </label>
     <div class="gdoc-markdown--nested gdoc-tabs__content">
       <pre><code># 1,858,610 genomes
-atb_hq.lmi: 3.88 TB
-   2.11 TB      seeds
-   1.77 TB      genomes
-  39.22 MB      genomes.map.bin
- 312.53 KB      masks.bin
-  332.00 B      info.toml
+atb_hq.lmi: 4.26 TB (4,261,437,129,065)
+   2.32 TB      seeds
+   1.94 TB      genomes
+  41.12 MB      genomes.map.bin
+ 320.03 kB      masks.bin
+     332 B      info.toml
 </code></pre>
 
     </div>
@@ -2144,7 +2145,7 @@ <h1>Step 1. Building a database</h1>
 <li>Directory/file sizes are counted with <a
   class="gdoc-markdown__link"
   href="https://github.com/shenwei356/dirsize"
->https://github.com/shenwei356/dirsize</a>. (base: 1024)</li>
+>https://github.com/shenwei356/dirsize</a> v1.2.1 (<code>dirsize -k</code>, base: 1000).</li>
 <li>Index building parameters: <code>-k 31 -m 40000</code>. Genome batch size: <code>-b 5000</code> for GTDB datasets, <code>-b 25000</code> for others.</li>
 </ul>
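The switch from base 1024 to base 1000 noted above appears to account for most of the size changes in this commit: for example, 906.04 GB × 1.0737 ≈ 972.85 GB for gtdb_complete.lmi. The following is a minimal sketch of the two conventions, not dirsize's actual implementation; the function name `format_size` and the unit labels for base 1024 are assumptions made for illustration. The byte counts are the ones printed in the listings above.

```python
def format_size(num_bytes: int, base: int = 1000) -> str:
    """Format a byte count as a human-readable size (hypothetical helper).

    base=1000 uses decimal units (kB, MB, GB, TB);
    base=1024 uses binary units (KiB, MiB, GiB, TiB).
    """
    if base == 1000:
        units = ["B", "kB", "MB", "GB", "TB", "PB"]
    elif base == 1024:
        units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    else:
        raise ValueError("base must be 1000 or 1024")

    value = float(num_bytes)
    idx = 0
    # Divide by the base until the value fits the current unit.
    while value >= base and idx < len(units) - 1:
        value /= base
        idx += 1

    if idx == 0:
        # Plain bytes are shown without decimals, as in the listings.
        return f"{num_bytes} B"
    return f"{value:.2f} {units[idx]}"


if __name__ == "__main__":
    demo = 73_297_328                # demo.lmi, bytes
    gtdb_complete = 972_854_821_322  # gtdb_complete.lmi, bytes

    print(format_size(demo, 1000))           # 73.30 MB
    print(format_size(gtdb_complete, 1000))  # 972.85 GB
    print(format_size(gtdb_complete, 1024))  # 906.04 GiB
```

The same byte count therefore reads about 7.4% larger in GB (base 1000) than in base-1024 units, and about 10% larger at the TB scale, which is worth keeping in mind when comparing the old and new tables.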
 <div class="flex align-center gdoc-page__anchorwrap">