update doc: add -l/--min-seq-len and --no-desert-filling to index, -f/--only-forward to utils kmers
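
The `index` help text touched by this commit states two numeric rules: sequences shorter than the `-l/--min-seq-len` threshold are filtered out (values <= 0 fall back to k), and a genome must satisfy total_bases + (num_contigs - 1) * 1000 <= 268,435,456. The Python sketch below only illustrates that arithmetic. It is not LexicMap's code: the function names are hypothetical, `k` is passed in by the caller, and the two rules are shown independently because the docs do not say how they interact.

```python
# Illustrative sketch only; function names are hypothetical, not LexicMap's API.

MAX_GENOME_SIZE = 268_435_456  # maximum genome size quoted in the help text
CONTIG_INTERVAL = 1000         # bp of N's inserted between concatenated contigs


def effective_min_seq_len(min_seq_len: int, k: int) -> int:
    """Return the threshold actually used: k when the flag value is <= 0."""
    return k if min_seq_len <= 0 else min_seq_len


def keep_sequences(lengths: list[int], min_seq_len: int, k: int) -> list[int]:
    """Drop sequence lengths shorter than the effective threshold."""
    threshold = effective_min_seq_len(min_seq_len, k)
    return [n for n in lengths if n >= threshold]


def fits_genome_limit(contig_lengths: list[int]) -> bool:
    """Check total_bases + (num_contigs - 1) * 1000 <= 268,435,456."""
    if not contig_lengths:
        return True
    gaps = (len(contig_lengths) - 1) * CONTIG_INTERVAL
    return sum(contig_lengths) + gaps <= MAX_GENOME_SIZE


if __name__ == "__main__":
    # The value of k here is chosen by the caller for illustration.
    print(keep_sequences([10, 31, 500], min_seq_len=-1, k=31))
    print(fits_genome_limit([134_217_728, 134_217_728]))
```

With the default `-1`, the threshold equals whatever `k` the caller supplies, so only sequences shorter than one k-mer are dropped. The second call shows that two contigs whose bases alone reach the limit no longer fit once the 1000-bp interval is counted.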

shenwei356 · shenwei356 · commit 375bbf509dd8 · 2024-08-06T13:36:22.000+01:00
diff --git a/search/en.data.min.json b/search/en.data.min.json
diff --git a/tutorials/index/index.html b/tutorials/index/index.html
@@ -65,7 +65,7 @@
       "url" : "https://bioinf.shenwei.me/LexicMap/tutorials/index/",
       "headline": "Building an index",
       "description": "Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output File structure Index size Explore the index TL;DR Prepare input files: Sequences of each reference genome should be saved in separate FASTA\/Q files, with identifiers in the file names. E.g., GCF_000006945.2.fna.gz Run: From a directory with multiple genome files:\nlexicmap index -I genomes\/ -O db.lmi From a file list with one file per line:\nlexicmap index -X files.",
-      "wordCount" : "2744",
+      "wordCount" : "2763",
       "inLanguage": "en",
       "isFamilyFriendly": "true",
       "mainEntityOfPage": {
@@ -1545,6 +1545,7 @@ <h1>Building an index</h1>
 >GCA_000765055.1</a> has &gt;150 Mb.
 The flag <code>-g/--max-genome</code> (default 15 Mb) is used to skip these input files, and the file list would be written to a file
 via the flag <code>-G/--big-genomes</code>.</li>
+<li><strong>Minimum sequence length</strong>. The flag <code>-l/--min-seq-len</code> filters out sequences shorter than the threshold (default: the <code>k</code> value).</li>
 </ul>
 </li>
 <li><strong>At most 17,179,869,184 (2<sup>34</sup>) genomes are supported</strong>. For more genomes, just build multiple indexes.</li>
diff --git a/usage/index/index.html b/usage/index/index.html
@@ -59,7 +59,7 @@
       "url" : "https://bioinf.shenwei.me/LexicMap/usage/index/",
       "headline": "index",
       "description": "$ lexicmap index -h Generate an index from FASTA\/Q sequences Input: *1. Sequences of each reference genome should be saved in separate FASTA\/Q files, with reference identifiers in the file names. 2. Input plain or gzip\/xz\/zstd\/bzip2 compressed FASTA\/Q files can be given via positional arguments or the flag -X\/--infile-list with a list of input files. Flag -S\/--skip-file-check is optional for skipping file checking if you trust the file list. 3. Input can also be a directory containing sequence files via the flag -I\/--in-dir, with multiple-level sub-directories allowed.",
-      "wordCount" : "1278",
+      "wordCount" : "1324",
       "inLanguage": "en",
       "isFamilyFriendly": "true",
       "mainEntityOfPage": {
@@ -1436,6 +1436,7 @@ <h1>index</h1>
 </span></span><span class="line"><span class="cl">  5. Maximum genome size: 268,435,456.
 </span></span><span class="line"><span class="cl">     More precisely: $total_bases + ($num_contigs - 1) * 1000 &lt;= 268,435,456, as we concatenate contigs with
 </span></span><span class="line"><span class="cl">     1000-bp intervals of N’s to reduce the sequence scale to index.
+</span></span><span class="line"><span class="cl">  6. The flag -l/--min-seq-len filters out sequences shorter than the threshold (default: the k value).
 </span></span><span class="line"><span class="cl">
 </span></span><span class="line"><span class="cl">  Attention:
 </span></span><span class="line"><span class="cl">   *1) ► You can rename the sequence files for convenience, e.g., GCF_000017205.1.fa.gz, because the genome
@@ -1539,6 +1540,9 @@ <h1>index</h1>
 </span></span><span class="line"><span class="cl">                                  assemblies from Genbank) will be skipped. Need to be smaller than the
 </span></span><span class="line"><span class="cl">                                  maximum supported genome size: 268435456 (default 15000000)
 </span></span><span class="line"><span class="cl">      --max-open-files int        ► Maximum opened files, used in merging indexes. (default 512)
+</span></span><span class="line"><span class="cl">  -l, --min-seq-len int           ► Minimum sequence length to index. Values &lt;= 0 are replaced
+</span></span><span class="line"><span class="cl">                                  by k (default -1)
+</span></span><span class="line"><span class="cl">      --no-desert-filling         ► Disable sketching desert filling (only for debugging).
 </span></span><span class="line"><span class="cl">  -O, --out-dir string            ► Output LexicMap index directory.
 </span></span><span class="line"><span class="cl">      --partitions int            ► Number of partitions for indexing seeds (k-mer-value data) files.
 </span></span><span class="line"><span class="cl">                                  (default 512)
diff --git a/usage/utils/kmers/index.html b/usage/utils/kmers/index.html
@@ -59,7 +59,7 @@
       "url" : "https://bioinf.shenwei.me/LexicMap/usage/utils/kmers/",
       "headline": "kmers",
       "description": "$ lexicmap utils kmers -h View k-mers captured by the masks Attention: 1. Mask index (column mask) is 1-based. 2. Prefix means the length of shared prefix between a k-mer and the mask. 3. K-mer positions (column pos) are 1-based. For reference genomes with multiple sequences, the sequences were concatenated to a single sequence with intervals of N\u0027s. 4. Reversed means if the k-mer is reversed for suffix matching. Usage: lexicmap utils kmers [flags] -d \u003cindex path\u003e [-m \u003cmask index\u003e] [-o out.",
-      "wordCount" : "1003",
+      "wordCount" : "1197",
       "inLanguage": "en",
       "isFamilyFriendly": "true",
       "mainEntityOfPage": {
@@ -1443,6 +1443,7 @@ <h1>kmers</h1>
 </span></span><span class="line"><span class="cl">  -h, --help              help for kmers
 </span></span><span class="line"><span class="cl">  -d, --index string      ► Index directory created by &#34;lexicmap index&#34;.
 </span></span><span class="line"><span class="cl">  -m, --mask int          ► View k-mers captured by Xth mask. (0 for all) (default 1)
+</span></span><span class="line"><span class="cl">  -f, --only-forward      ► Only output forward k-mers.
 </span></span><span class="line"><span class="cl">  -o, --out-file string   ► Out file, supports and recommends a &#34;.gz&#34; suffix (&#34;-&#34; for stdout).
 </span></span><span class="line"><span class="cl">                          (default &#34;-&#34;)
 </span></span><span class="line"><span class="cl">
@@ -1489,6 +1490,30 @@ <h1>kmers</h1>
  1      AAAAAAAACCATATTATGTCCGATCCTCACA   4        1        GCF_000392875.1   1060650   +        yes
  1      AAAAAAAACCCTTCGTCAAGCATTATGGAAT   4        1        GCF_000392875.1   1139573   -        yes
 </code></pre>
+<p>Output only forward k-mers.</p>
+<pre><code> $ lexicmap utils kmers --quiet -d demo.lmi/ -f | head -n 20 | csvtk pretty -t
+ mask   kmer                              prefix   number   ref               pos       strand   reversed
+ ----   -------------------------------   ------   ------   ---------------   -------   ------   --------
+ 1      AAAACACCAAAAGCCTCTCCGATAACACCAG   9        1        GCF_002949675.1   2046311   +        no
+ 1      AAAACACCAAAGTTAAAGTGCCGTTTAGCGT   9        1        GCF_003697165.2   1085073   +        no
+ 1      AAAACACCAATTAGTGATTGTGTTTCCTCAA   9        1        GCF_000392875.1   2785764   -        no
+ 1      AAAACACCACAGTGAAAGACAACATTTAATA   9        1        GCF_000392875.1   1132052   -        no
+ 1      AAAACACCACCACAAATGCATAAGAAAACTT   9        1        GCF_003697165.2   2862670   +        no
+ 1      AAAACACCACTCAATCCTTTAAATAAAAACA   9        1        GCF_002949675.1   2467828   -        no
+ 1      AAAACACCACTTTACGGGCGTTTTGTGCAAT   9        1        GCF_003697165.2   4241904   -        no
+ 1      AAAACACCAGCACGTTCAGCACCGCCACCAG   9        1        GCF_000017205.1   4399207   -        no
+ 1      AAAACACCAGCGAACGGAAGAACATCGCGAT   9        1        GCF_003697165.2   248663    +        no
+ 1      AAAACACCAGGCCGGAGCAGAAGGTTATTCT   9        1        GCF_003697165.2   4139632   +        no
+ 1      AAAACACCATAAACGATTGTTGGAATACCCG   10       1        GCF_009759685.1   268158    +        no
+ 1      AAAACACCATCATACACTAAATCAGTAAGTT   10       4        GCF_002949675.1   496925    +        no
+ 1      AAAACACCATCATACACTAAATCAGTAAGTT   10       4        GCF_002949675.1   2254974   +        no
+ 1      AAAACACCATCATACACTAAATCAGTAAGTT   10       4        GCF_002949675.1   2495183   +        no
+ 1      AAAACACCATCATACACTAAATCAGTAAGTT   10       4        GCF_002949675.1   4009312   +        no
+ 1      AAAACACCATGAACGCCAACGCCGCCGAGCT   11       1        GCF_000742135.1   2707622   +        no
+ 1      AAAACACCATGAGCAAACTCCAGCATATCGG   11       1        GCF_000017205.1   2490011   -        no
+ 1      AAAACACCATGCAAAAAACTTCTTTTAGAAA   11       1        GCF_000006945.2   1324151   -        no
+ 1      AAAACACCATGCAGCATGTCATAGCGCTGGA   11       1        GCF_003697165.2   422685    +        no
+</code></pre>
 </li>
 <li>
 <p>Specify the mask.</p>