add a new tutorial of indexing ATB dataset hosted on OSF

shenwei356 · shenwei356 · commit 883b7e985f5b · 2024-09-18T14:11:48.000+01:00
diff --git a/index.html b/index.html
@@ -329,6 +329,14 @@ <h1></h1>
   src="https://img.shields.io/badge/platform-any-ec2eb4.svg?style=flat"
   alt="Cross-platform"
   
+/></a>
+<a
+  class="gdoc-markdown__link--raw"
+  href="https://github.com/shenwei356/taxonkit/blob/master/LICENSE"
+><img
+  src="https://img.shields.io/github/license/shenwei356/taxonkit.svg?maxAge=2592000"
+  alt="license"
+  
 /></a></p>
 <p><font size=5rem>LexicMap is a <strong>nucleotide sequence alignment</strong> tool for efficiently querying <strong>gene, plasmid, virus, or long-read sequences</strong> against up to <strong>millions</strong> of <strong>prokaryotic genomes</strong>.</font></p>
 
diff --git a/index.xml b/index.xml
@@ -47,7 +47,7 @@
       <link>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/</link>
       <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
       <guid>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/</guid>
-      <description>Make sure you have enough disk space, at least 8 TB, &amp;gt;10 TB is preferred.&#xA;Tools:&#xA;https://github.com/shenwei356/rush, for running jobs Info:&#xA;AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/&#xA;mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf.</description>
+      <description>Make sure you have enough disk space, at least 8 TB, &amp;gt;10 TB is preferred.&#xA;Tools:&#xA;https://github.com/shenwei356/rush, for running jobs Info:&#xA;AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF.</description>
     </item>
     <item>
       <title>genomes</title>
diff --git a/introduction/index.html b/introduction/index.html
@@ -1708,6 +1708,14 @@ <h1>Introduction</h1>
   src="https://img.shields.io/badge/platform-any-ec2eb4.svg?style=flat"
   alt="Cross-platform"
   
+/></a>
+<a
+  class="gdoc-markdown__link--raw"
+  href="https://github.com/shenwei356/taxonkit/blob/master/LICENSE"
+><img
+  src="https://img.shields.io/github/license/shenwei356/taxonkit.svg?maxAge=2592000"
+  alt="license"
+  
 /></a></p>
 <p>LexicMap is a <strong>nucleotide sequence alignment</strong> tool for efficiently querying gene, plasmid, viral, or long-read sequences against up to <strong>millions of prokaryotic genomes</strong>.</p>
 <p>Preprint:</p>
diff --git a/search/en.data.min.json b/search/en.data.min.json
diff --git a/tutorials/misc/index-allthebacteria/index.html b/tutorials/misc/index-allthebacteria/index.html
@@ -15,8 +15,7 @@
   <meta name="description" content="Make sure you have enough disk space, at least 8 TB, &gt;10 TB is preferred.
 Tools:
 https://github.com/shenwei356/rush, for running jobs Info:
-AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/
-mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf." />
+AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF." />
 
     <title>Indexing AllTheBacteria | LexicMap: efficient sequence alignment against millions of prokaryotic genomes​</title>
 
@@ -45,8 +44,7 @@
   <meta property="og:description" content="Make sure you have enough disk space, at least 8 TB, &gt;10 TB is preferred.
 Tools:
 https://github.com/shenwei356/rush, for running jobs Info:
-AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/
-mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf." />
+AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF." />
 <meta property="og:type" content="article" />
 <meta property="og:url" content="https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/" />
 
@@ -58,8 +56,7 @@
   <meta name="twitter:description" content="Make sure you have enough disk space, at least 8 TB, &gt;10 TB is preferred.
 Tools:
 https://github.com/shenwei356/rush, for running jobs Info:
-AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/
-mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf." />
+AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF." />
 
 
   <script type="application/ld+json">
@@ -70,8 +67,8 @@
       "name": "Indexing AllTheBacteria",
       "url" : "https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/",
       "headline": "Indexing AllTheBacteria",
-      "description": "Make sure you have enough disk space, at least 8 TB, \u003e10 TB is preferred.\nTools:\nhttps:\/\/github.com\/shenwei356\/rush, for running jobs Info:\nAllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https:\/\/ftp.ebi.ac.uk\/pub\/databases\/AllTheBacteria\/Releases\/0.2\/assembly\/\nmkdir -p atb; cd atb; # assembly file list, 650 files in total wget https:\/\/bioinf.",
-      "wordCount" : "416",
+      "description": "Make sure you have enough disk space, at least 8 TB, \u003e10 TB is preferred.\nTools:\nhttps:\/\/github.com\/shenwei356\/rush, for running jobs Info:\nAllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https:\/\/osf.io\/xv7q9\/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF.",
+      "wordCount" : "744",
       "inLanguage": "en",
       "isFamilyFriendly": "true",
       "mainEntityOfPage": {
@@ -1722,13 +1719,114 @@ <h1>Indexing AllTheBacteria</h1>
   class="gdoc-markdown__link"
   href="https://www.biorxiv.org/content/10.1101/2024.03.08.584059v1"
 >AllTheBacteria - all bacterial genomes assembled, available and searchable</a></li>
+<li>Data on OSF: <a
+  class="gdoc-markdown__link"
+  href="https://osf.io/xv7q9/"
+>https://osf.io/xv7q9/</a></li>
 </ul>
 <div class="flex align-center gdoc-page__anchorwrap">
-    <h2 id="steps-for-v02"
+    <h2 id="steps-for-v02-and-later-versions-hosted-at-osf"
     >
-        Steps for v0.2
+        Steps for v0.2 and later versions hosted at OSF
     </h2>
-    <a data-clipboard-text="https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/#steps-for-v02" class="gdoc-page__anchor clip flex align-center" title="Anchor to: Steps for v0.2" aria-label="Anchor to: Steps for v0.2" href="#steps-for-v02">
+    <a data-clipboard-text="https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/#steps-for-v02-and-later-versions-hosted-at-osf" class="gdoc-page__anchor clip flex align-center" title="Anchor to: Steps for v0.2 and later versions hosted at OSF" aria-label="Anchor to: Steps for v0.2 and later versions hosted at OSF" href="#steps-for-v02-and-later-versions-hosted-at-osf">
+        <svg class="gdoc-icon gdoc_link"><use xlink:href="#gdoc_link"></use></svg>
+    </a>
+</div>
+<p>After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at <a
+  class="gdoc-markdown__link"
+  href="https://osf.io/xv7q9/"
+>OSF</a>.</p>
+<ol>
+<li>
+<p>Downloading the list file of all assemblies in the latest version (v0.2 plus incremental versions). <a
+  class="gdoc-markdown__link"
+  href="https://osf.io/zxfmy/"
+>assemblies</a>.</p>
+<pre><code> mkdir -p atb;
+ cd atb;
+
+ # Note: the URL might change; please check it in the browser.
+ wget https://osf.io/download/4yv85/ -O file_list.all.latest.tsv.gz
+</code></pre>
+<p>If you only need to add assemblies from an incremental version,
+please manually download the file list from the path <code>AllTheBacteria/Assembly/OSF Storage/File_lists</code>.</p>
+</li>
+<li>
+<p>Downloading assembly tarball files.</p>
+<pre><code> # tarball file names and their URLs
+ zcat file_list.all.latest.tsv.gz | awk 'NR&gt;1 {print $3&quot;\t&quot;$4}' | uniq &gt; tar2url.tsv
+
+ # download
+ cat tar2url.tsv | rush --eta -j 2 -c -C download.rush 'wget -O {1} {2}'
+</code></pre>
+</li>
+<li>
+<p>Decompressing all tarballs. The decompressed genomes are stored in plain text,
+so we use <code>gzip</code> (which can be replaced with the faster <code>pigz</code>) to compress them and save disk space.</p>
+<pre><code> # {^.tar.xz} is for removing the suffix &quot;.tar.xz&quot;
+ ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; gzip -f {^.tar.xz}/*.fa'
+
+ cd ..
+</code></pre>
+<p>After that, the assemblies directory would have multiple subdirectories.
+When you give the directory to <code>lexicmap index -I</code>, it can recursively scan (plain or gz/xz/zstd-compressed) genome files.
+You can also give a file list with selected assemblies.</p>
+<pre><code> $ tree atb | more
+ atb
+ ├── atb.assembly.r0.2.batch.1
+ │   ├── SAMD00013333.fa.gz
+ │   ├── SAMD00049594.fa.gz
+ │   ├── SAMD00195911.fa.gz
+ │   ├── SAMD00195914.fa.gz
+</code></pre>
+</li>
+<li>
+<p>Prepare a file list of assemblies.</p>
+<ul>
+<li>
+<p>Just use <code>find</code> or <a
+  class="gdoc-markdown__link"
+  href="https://github.com/sharkdp/fd"
+>fd</a> (much faster).</p>
+<pre><code> # find
+ find atb/ -name &quot;*.fa.gz&quot; &gt; files.txt
+
+ # fd
+ fd '\.fa\.gz$' atb/ &gt; files.txt
+</code></pre>
+<p>What it looks like:</p>
+<pre><code> $ head -n 2 files.txt
+ atb/atb.assembly.r0.2.batch.1/SAMD00013333.fa.gz
+ atb/atb.assembly.r0.2.batch.1/SAMD00049594.fa.gz
+</code></pre>
+</li>
+<li>
+<p>(Optional) Only keep high-quality assemblies.
+Please manually download the <code>hq_set.sample_list.txt.gz</code> file from <a
+  class="gdoc-markdown__link"
+  href="https://osf.io/xv7q9/"
+>this path</a>, e.g., <code>AllTheBacteria/Metadata/OSF Storage/Aggregated/Latest_2024-08/</code> (choose the latest date).</p>
+<pre><code>  find atb/ -name &quot;*.fa.gz&quot; | grep -w -f &lt;(zcat hq_set.sample_list.txt.gz) &gt; files.txt
+</code></pre>
+</li>
+</ul>
+</li>
+<li>
+<p>Creating a LexicMap index. (more details: <a
+  class="gdoc-markdown__link"
+  href="https://bioinf.shenwei.me/LexicMap/tutorials/index/"
+>https://bioinf.shenwei.me/LexicMap/tutorials/index/</a>)</p>
+<pre><code> lexicmap index -S -X files.txt -O atb.lmi -b 25000 --log atb.lmi.log
+</code></pre>
+</li>
+</ol>
+<div class="flex align-center gdoc-page__anchorwrap">
+    <h2 id="steps-for-v02-hosted-at-ebi-ftp"
+    >
+        Steps for v0.2 hosted at EBI ftp
+    </h2>
+    <a data-clipboard-text="https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/#steps-for-v02-hosted-at-ebi-ftp" class="gdoc-page__anchor clip flex align-center" title="Anchor to: Steps for v0.2 hosted at EBI ftp" aria-label="Anchor to: Steps for v0.2 hosted at EBI ftp" href="#steps-for-v02-hosted-at-ebi-ftp">
         <svg class="gdoc-icon gdoc_link"><use xlink:href="#gdoc_link"></use></svg>
     </a>
 </div>
@@ -1760,6 +1858,7 @@ <h1>Indexing AllTheBacteria</h1>
 so we use <code>gzip</code> (can be replaced with faster <code>pigz</code> ) to compress them to save disk space.</p>
 <pre><code> # {^asm.tar.xz} is for removing the suffix &quot;asm.tar.xz&quot;
  ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; gzip -f {^asm.tar.xz}/*.fa'
+
  cd ..
 </code></pre>
 <p>After that, the assemblies directory would have multiple subdirectories.
@@ -1802,8 +1901,8 @@ <h1>Indexing AllTheBacteria</h1>
 
 
 
-# index
-lexicmap index -S -X atb_hq.txt -O atb_hq.lmi -b 25000 --log atb_hq.lmi.log
+ # index
+ lexicmap index -S -X atb_hq.txt -O atb_hq.lmi -b 25000 --log atb_hq.lmi.log
 </code></pre>
 <p>For 1,858,610 HQ genomes, on a 48-CPU machine, time: 48 h, ram: 85 GB, index size: 3.88 TB.
 If you don&rsquo;t have enough memory, please decrease the value of <code>-b</code>.</p>
diff --git a/tutorials/misc/index.xml b/tutorials/misc/index.xml
@@ -26,7 +26,7 @@
       <link>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/</link>
       <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
       <guid>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/</guid>
-      <description>Make sure you have enough disk space, at least 8 TB, &amp;gt;10 TB is preferred.&#xA;Tools:&#xA;https://github.com/shenwei356/rush, for running jobs Info:&#xA;AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/&#xA;mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf.</description>
+      <description>Make sure you have enough disk space, at least 8 TB, &amp;gt;10 TB is preferred.&#xA;Tools:&#xA;https://github.com/shenwei356/rush, for running jobs Info:&#xA;AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF.</description>
     </item>
     <item>
       <title>Indexing GlobDB</title>
diff --git a/tutorials/parameters-general.tsv b/tutorials/parameters-general.tsv
@@ -1,5 +1,5 @@
 Flag	Value	Function	Comment
 **`-w/--load-whole-seeds`**		Load the whole seed data into memory for faster search	Use this if the index is not big and many queries are needed to search.
-**`-n/--top-n-genomes`**	Default 0, 0 for all	Keep top N genome matches for a query in the chaining phase	The final number of genome hits might be smaller than this number as some chaining results might fail to pass the criteria in the alignment step.
+**`-n/--top-n-genomes`**	Default 0, 0 for all	Keep top N genome matches for a query in the chaining phase	A value of 1 is not recommended because the best chaining result does not always yield the best alignment; a value >= 5 is better. The final number of genome hits might be smaller than this number as some chaining results might fail to pass the criteria in the alignment step.
 **`-a/--all`**		Output more columns, e.g., matched sequences.	"Use this if you want to output blast-style format with ""lexicmap utils 2blast"""
 -J/--max-query-conc 	Default 12, 0 for all	Maximum number of concurrent queries	Bigger values do not improve the batch searching speed and consume much memory.
diff --git a/tutorials/search/index.html b/tutorials/search/index.html
@@ -71,7 +71,7 @@
       "url" : "https://bioinf.shenwei.me/LexicMap/tutorials/search/",
       "headline": "Step 2. Searching",
       "description": "Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Improving searching speed Steps Output Alignment result relationship Output format Examples Summarizing results TL;DR Build a LexicMap index.\nRun:\nFor short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.",
-      "wordCount" : "2918",
+      "wordCount" : "2941",
       "inLanguage": "en",
       "isFamilyFriendly": "true",
       "mainEntityOfPage": {
@@ -1938,7 +1938,7 @@ <h1>Step 2. Searching</h1>
 <td style="text-align:left"><strong><code>-n/--top-n-genomes</code></strong></td>
 <td style="text-align:left">Default 0, 0 for all</td>
 <td style="text-align:left">Keep top N genome matches for a query in the chaining phase</td>
-<td style="text-align:left">The final number of genome hits might be smaller than this number as some chaining results might fail to pass the criteria in the alignment step.</td>
+<td style="text-align:left">A value of 1 is not recommended because the best chaining result does not always yield the best alignment; a value &gt;= 5 is better. The final number of genome hits might be smaller than this number as some chaining results might fail to pass the criteria in the alignment step.</td>
 </tr>
 <tr>
 <td style="text-align:left"><strong><code>-a/--all</code></strong></td>
@@ -1950,7 +1950,7 @@ <h1>Step 2. Searching</h1>
 <td style="text-align:left">-J/&ndash;max-query-conc</td>
 <td style="text-align:left">Default 12, 0 for all</td>
 <td style="text-align:left">Maximum number of concurrent queries</td>
-<td style="text-align:left">Bigger values do not improve the batch searching speed and consume much memory</td>
+<td style="text-align:left">Bigger values do not improve the batch searching speed and consume much memory.</td>
 </tr>
 </tbody>
 </table> </div>
@@ -2492,7 +2492,8 @@ <h1>Step 2. Searching</h1>
  Escherichia coli           128071   
  Streptococcus pneumoniae   51971    
  Staphylococcus aureus      44215    
- Pseudomonas aeruginosa     34254</code></pre>
+ Pseudomonas aeruginosa     34254
+</code></pre>
 </li>
 </ol>
 
diff --git a/usage/search/index.html b/usage/search/index.html