Commit 020761c

index: remove --ref-name-info
1 parent 4f4794a commit 020761c

5 files changed

+7 -24 lines changed

search/en.data.min.json

+1 -1
Large diffs are not rendered by default.

tutorials/index/index.html

+1 -3
@@ -62,7 +62,7 @@
6262
"url" : "https://bioinf.shenwei.me/LexicMap/tutorials/index/",
6363
"headline": "Step 1. Building a database",
6464
"description": "Terminology differences:\nOn this page and in the LexicMap command line options, the term “mask” is used, following the terminology in the LexicHash paper. In the LexicMap manuscript, however, we use “probe” as it is easier to understand. Because these masks, which consist of thousands of k-mers and capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology. Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output File structure Index size Explore the index TL;DR Prepare input files: Sequences of each reference genome should be saved in separate FASTA\/Q files, with identifiers in the file names.",
65-
"wordCount" : "2973",
65+
"wordCount" : "2913",
6666
"inLanguage": "en",
6767
"isFamilyFriendly": "true",
6868
"mainEntityOfPage": {
@@ -1840,8 +1840,6 @@ <h1>Step 1. Building a database</h1>
18401840
<li><strong>If the RAM is not sufficient</strong>. Please:
18411841
<ul>
18421842
<li><strong>Use a smaller genome batch size</strong>. It decreases indexing memory occupation and has little affection on searching performance.</li>
1843-
<li><strong>Sorting the input file list by species</strong>. So genomes within a batch would be more similar and the memory would be lower.
1844-
For LexicMap v0.4.1 or later versions, a flag <code>--ref-name-info</code> can specify a two-column tab-delimted file for mapping reference names to taxonomic information such as species names, and the input files will be sorted according to the taxonomic information.</li>
18451843
<li>Use a smaller number of masks, e.g., 20,000 performs well for small genomes (&lt;=5 Mb). And if the queries are long (&gt;= 2kb), there&rsquo;s little affection for the alignment results.</li>
18461844
</ul>
18471845
</li>

tutorials/misc/index-genbank/index.html

+2 -8
@@ -65,7 +65,7 @@
6565
"url" : "https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-genbank/",
6666
"headline": "Indexing GenBank+RefSeq",
6767
"description": "Tools:\nhttps:\/\/github.com\/pirovc\/genome_updater, for downloading genomes https:\/\/github.com\/shenwei356\/seqkit, for checking sequence files https:\/\/github.com\/shenwei356\/rush, for running jobs Data:\ntime genome_updater.sh -d \u0022refseq,genbank\u0022 -g \u0022archaea,bacteria\u0022 \\ -f \u0022genomic.fna.gz\u0022 -o \u0022genbank\u0022 -M \u0022ncbi\u0022 -t 12 -m -L curl cd genbank\/2024-02-15_11-00-51\/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name \u0022*.gz\u0022 \\ fd \u0022.gz$\u0022 $genomes \\ | rush --eta \u0027seqkit seq -w 0 {} \u003e \/dev\/null; if [ $? -ne 0 ]; then echo {}; fi\u0027 \\ \u003e failed.",
68-
"wordCount" : "181",
68+
"wordCount" : "156",
6969
"inLanguage": "en",
7070
"isFamilyFriendly": "true",
7171
"mainEntityOfPage": {
@@ -1705,17 +1705,11 @@ <h1>Indexing GenBank&#43;RefSeq</h1>
17051705
# redownload them:
17061706
# run the genome_updater command again, with the flag -i
17071707
</code></pre>
1708-
<p>Taxonomic information (optional), for reducing index memory.</p>
1709-
<pre><code>cut -f 1,8 assembly_summary.txt &gt; ref2species.tsv
1710-
</code></pre>
17111708
<p>Indexing. On a 48-CPU machine, time: 54 h, ram: 178 GB, index size: 4.94 TB.
17121709
If you don&rsquo;t have enough memory, please decrease the value of <code>-b</code>.</p>
1713-
<pre><code># --ref-name-info is available for v0.4.1 or later versions.
1714-
1715-
lexicmap index \
1710+
<pre><code>lexicmap index \
17161711
-I files/ \
17171712
--ref-name-regexp '^(\w{3}_\d{9}\.\d+)' \
1718-
--ref-name-info ref2species.tsv \
17191713
-O genbank_refseq.lmi --log genbank_refseq.lmi.log \
17201714
-b 25000
17211715
</code></pre>

tutorials/misc/index-gtdb/index.html

+2 -8
@@ -65,7 +65,7 @@
6565
"url" : "https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-gtdb/",
6666
"headline": "Indexing GTDB",
6767
"description": "Tools:\nhttps:\/\/github.com\/pirovc\/genome_updater, for downloading genomes https:\/\/github.com\/shenwei356\/seqkit, for checking sequence files https:\/\/github.com\/shenwei356\/rush, for running jobs Data:\ntime genome_updater.sh -d \u0022refseq,genbank\u0022 -g \u0022archaea,bacteria\u0022 \\ -f \u0022genomic.fna.gz\u0022 -o \u0022GTDB_complete\u0022 -M \u0022gtdb\u0022 -t 12 -m -L curl cd GTDB_complete\/2024-01-30_19-34-40\/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name \u0022*.gz\u0022 \\ fd \u0022.gz$\u0022 $genomes \\ | rush --eta \u0027seqkit seq -w 0 {} \u003e \/dev\/null; if [ $? -ne 0 ]; then echo {}; fi\u0027 \\ \u003e failed.",
68-
"wordCount" : "181",
68+
"wordCount" : "156",
6969
"inLanguage": "en",
7070
"isFamilyFriendly": "true",
7171
"mainEntityOfPage": {
@@ -1705,17 +1705,11 @@ <h1>Indexing GTDB</h1>
17051705
# redownload them:
17061706
# run the genome_updater command again, with the flag -i
17071707
</code></pre>
1708-
<p>Taxonomic information (optional), for reducing index memory.</p>
1709-
<pre><code>cut -f 1,8 assembly_summary.txt &gt; ref2species.tsv
1710-
</code></pre>
17111708
<p>Indexing. On a 48-CPU machine, time: 11 h, ram: 64 GB, index size: 906 GB.
17121709
If you don&rsquo;t have enough memory, please decrease the value of <code>-b</code>.</p>
1713-
<pre><code># --ref-name-info is available for v0.4.1 or later versions.
1714-
1715-
lexicmap index \
1710+
<pre><code>lexicmap index \
17161711
-I files/ \
17171712
--ref-name-regexp '^(\w{3}_\d{9}\.\d+)' \
1718-
--ref-name-info ref2species.tsv \
17191713
-O gtdb_complete.lmi --log gtdb_complete.lmi.log \
17201714
-b 5000
17211715
</code></pre>

usage/index/index.html

+1 -4
@@ -59,7 +59,7 @@
5959
"url" : "https://bioinf.shenwei.me/LexicMap/usage/index/",
6060
"headline": "index",
6161
"description": "Terminology differences In the LexicMap source code and command line options, the term “mask” is used, following the terminology in the LexicHash paper. In the LexicMap manuscript, however, we use “probe” as it is easier to understand. Because these masks, which consist of thousands of k-mers and capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology. Usage $ lexicmap index -h Generate an index from FASTA\/Q sequences Input: *1.",
62-
"wordCount" : "1368",
62+
"wordCount" : "1341",
6363
"inLanguage": "en",
6464
"isFamilyFriendly": "true",
6565
"mainEntityOfPage": {
@@ -1801,9 +1801,6 @@ <h1>index</h1>
18011801
</span></span><span class="line"><span class="cl"> --partitions int ► Number of partitions for indexing seeds (k-mer-value data) files.
18021802
</span></span><span class="line"><span class="cl"> The value needs to be the power of 4. (default 1024)
18031803
</span></span><span class="line"><span class="cl"> -s, --rand-seed int ► Rand seed for generating random masks. (default 1)
1804-
</span></span><span class="line"><span class="cl"> --ref-name-info string ► A two-column tab-delimted file for mapping reference names
1805-
</span></span><span class="line"><span class="cl"> (extracted by --ref-name-regexp) to taxonomic information such as
1806-
</span></span><span class="line"><span class="cl"> species names. It helps to reduce memory usage.
18071804
</span></span><span class="line"><span class="cl"> -N, --ref-name-regexp string ► Regular expression (must contains &#34;(&#34; and &#34;)&#34;) for extracting the
18081805
</span></span><span class="line"><span class="cl"> reference name from the filename. Attention: use double quotation
18091806
</span></span><span class="line"><span class="cl"> marks for patterns containing commas, e.g., -p &#39;&#34;A{2,}&#34;&#39; (default
