Commit 883b7e9

committed
add a new tutorial of indexing ATB dataset hosted on OSF
1 parent a4f727c commit 883b7e9

9 files changed

+140
-23
lines changed

index.html

+8
Original file line number | Diff line number | Diff line change
@@ -329,6 +329,14 @@ <h1></h1>
329329
src="https://img.shields.io/badge/platform-any-ec2eb4.svg?style=flat"
330330
alt="Cross-platform"
331331

332+
/></a>
333+
<a
334+
class="gdoc-markdown__link--raw"
335+
href="https://github.com/shenwei356/taxonkit/blob/master/LICENSE"
336+
><img
337+
src="https://img.shields.io/github/license/shenwei356/taxonkit.svg?maxAge=2592000"
338+
alt="license"
339+
332340
/></a></p>
333341
<p><font size=5rem>LexicMap is a <strong>nucleotide sequence alignment</strong> tool for efficiently querying <strong>gene, plasmid, virus, or long-read sequences</strong> against up to <strong>millions</strong> of <strong>prokaryotic genomes</strong>.</font></p>
334342

index.xml

+1-1
@@ -47,7 +47,7 @@
4747
<link>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/</link>
4848
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
4949
<guid>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/</guid>
50-
<description>Make sure you have enough disk space, at least 8 TB, &amp;gt;10 TB is preferred.&#xA;Tools:&#xA;https://github.com/shenwei356/rush, for running jobs Info:&#xA;AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/&#xA;mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf.</description>
50+
<description>Make sure you have enough disk space, at least 8 TB, &amp;gt;10 TB is preferred.&#xA;Tools:&#xA;https://github.com/shenwei356/rush, for running jobs Info:&#xA;AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF.</description>
5151
</item>
5252
<item>
5353
<title>genomes</title>

introduction/index.html

+8
@@ -1708,6 +1708,14 @@ <h1>Introduction</h1>
17081708
src="https://img.shields.io/badge/platform-any-ec2eb4.svg?style=flat"
17091709
alt="Cross-platform"
17101710

1711+
/></a>
1712+
<a
1713+
class="gdoc-markdown__link--raw"
1714+
href="https://github.com/shenwei356/taxonkit/blob/master/LICENSE"
1715+
><img
1716+
src="https://img.shields.io/github/license/shenwei356/taxonkit.svg?maxAge=2592000"
1717+
alt="license"
1718+
17111719
/></a></p>
17121720
<p>LexicMap is a <strong>nucleotide sequence alignment</strong> tool for efficiently querying gene, plasmid, viral, or long-read sequences against up to <strong>millions of prokaryotic genomes</strong>.</p>
17131721
<p>Preprint:</p>

search/en.data.min.json

+1-1
Large diffs are not rendered by default.

tutorials/misc/index-allthebacteria/index.html

+112-13
@@ -15,8 +15,7 @@
1515
<meta name="description" content="Make sure you have enough disk space, at least 8 TB, &gt;10 TB is preferred.
1616
Tools:
1717
https://github.com/shenwei356/rush, for running jobs Info:
18-
AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/
19-
mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf." />
18+
AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF." />
2019

2120
<title>Indexing AllTheBacteria | LexicMap: efficient sequence alignment against millions of prokaryotic genomes</title>
2221

@@ -45,8 +44,7 @@
4544
<meta property="og:description" content="Make sure you have enough disk space, at least 8 TB, &gt;10 TB is preferred.
4645
Tools:
4746
https://github.com/shenwei356/rush, for running jobs Info:
48-
AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/
49-
mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf." />
47+
AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF." />
5048
<meta property="og:type" content="article" />
5149
<meta property="og:url" content="https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/" />
5250

@@ -58,8 +56,7 @@
5856
<meta name="twitter:description" content="Make sure you have enough disk space, at least 8 TB, &gt;10 TB is preferred.
5957
Tools:
6058
https://github.com/shenwei356/rush, for running jobs Info:
61-
AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/
62-
mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf." />
59+
AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF." />
6360

6461

6562
<script type="application/ld+json">
@@ -70,8 +67,8 @@
7067
"name": "Indexing AllTheBacteria",
7168
"url" : "https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/",
7269
"headline": "Indexing AllTheBacteria",
73-
"description": "Make sure you have enough disk space, at least 8 TB, \u003e10 TB is preferred.\nTools:\nhttps:\/\/github.com\/shenwei356\/rush, for running jobs Info:\nAllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https:\/\/ftp.ebi.ac.uk\/pub\/databases\/AllTheBacteria\/Releases\/0.2\/assembly\/\nmkdir -p atb; cd atb; # assembly file list, 650 files in total wget https:\/\/bioinf.",
74-
"wordCount" : "416",
70+
"description": "Make sure you have enough disk space, at least 8 TB, \u003e10 TB is preferred.\nTools:\nhttps:\/\/github.com\/shenwei356\/rush, for running jobs Info:\nAllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https:\/\/osf.io\/xv7q9\/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF.",
71+
"wordCount" : "744",
7572
"inLanguage": "en",
7673
"isFamilyFriendly": "true",
7774
"mainEntityOfPage": {
@@ -1722,13 +1719,114 @@ <h1>Indexing AllTheBacteria</h1>
17221719
class="gdoc-markdown__link"
17231720
href="https://www.biorxiv.org/content/10.1101/2024.03.08.584059v1"
17241721
>AllTheBacteria - all bacterial genomes assembled, available and searchable</a></li>
1722+
<li>Data on OSF: <a
1723+
class="gdoc-markdown__link"
1724+
href="https://osf.io/xv7q9/"
1725+
>https://osf.io/xv7q9/</a></li>
17251726
</ul>
17261727
<div class="flex align-center gdoc-page__anchorwrap">
1727-
<h2 id="steps-for-v02"
1728+
<h2 id="steps-for-v02-and-later-versions-hosted-at-osf"
17281729
>
1729-
Steps for v0.2
1730+
Steps for v0.2 and later versions hosted at OSF
17301731
</h2>
1731-
<a data-clipboard-text="https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/#steps-for-v02" class="gdoc-page__anchor clip flex align-center" title="Anchor to: Steps for v0.2" aria-label="Anchor to: Steps for v0.2" href="#steps-for-v02">
1732+
<a data-clipboard-text="https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/#steps-for-v02-and-later-versions-hosted-at-osf" class="gdoc-page__anchor clip flex align-center" title="Anchor to: Steps for v0.2 and later versions hosted at OSF" aria-label="Anchor to: Steps for v0.2 and later versions hosted at OSF" href="#steps-for-v02-and-later-versions-hosted-at-osf">
1733+
<svg class="gdoc-icon gdoc_link"><use xlink:href="#gdoc_link"></use></svg>
1734+
</a>
1735+
</div>
1736+
<p>After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at <a
1737+
class="gdoc-markdown__link"
1738+
href="https://osf.io/xv7q9/"
1739+
>OSF</a>.</p>
1740+
<ol>
1741+
<li>
1742+
<p>Downloading the list file of all assemblies in the latest version (v0.2 plus incremental versions). <a
1743+
class="gdoc-markdown__link"
1744+
href="https://osf.io/zxfmy/"
1745+
>assemblies</a>.</p>
1746+
<pre><code> mkdir -p atb;
1747+
cd atb;
1748+
1749+
# Attention: the URL might change; please check it in the browser.
1750+
wget https://osf.io/download/4yv85/ -O file_list.all.latest.tsv.gz
1751+
</code></pre>
1752+
<p>If you only need to add assemblies from an incremental version,
1753+
please manually download the file list from the path <code>AllTheBacteria/Assembly/OSF Storage/File_lists</code>.</p>
1754+
</li>
1755+
<li>
1756+
<p>Downloading assembly tarball files.</p>
1757+
<pre><code> # tarball file names and their URLs
1758+
zcat file_list.all.latest.tsv.gz | awk 'NR&gt;1 {print $3&quot;\t&quot;$4}' | uniq &gt; tar2url.tsv
1759+
1760+
# download
1761+
cat tar2url.tsv | rush --eta -j 2 -c -C download.rush 'wget -O {1} {2}'
1762+
</code></pre>
1763+
</li>
1764+
<li>
1765+
<p>Decompressing all tarballs. The decompressed genomes are stored in plain text,
1766+
so we use <code>gzip</code> (can be replaced with faster <code>pigz</code> ) to compress them to save disk space.</p>
1767+
<pre><code> # {^.tar.xz} is for removing the suffix &quot;.tar.xz&quot;
1768+
ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; gzip -f {^.tar.xz}/*.fa'
1769+
1770+
cd ..
1771+
</code></pre>
1772+
<p>After that, the assemblies directory would have multiple subdirectories.
1773+
When you give the directory to <code>lexicmap index -I</code>, it can recursively scan (plain or gz/xz/zstd-compressed) genome files.
1774+
You can also give a file list with selected assemblies.</p>
1775+
<pre><code> $ tree atb | more
1776+
atb
1777+
├── atb.assembly.r0.2.batch.1
1778+
│   ├── SAMD00013333.fa.gz
1779+
│   ├── SAMD00049594.fa.gz
1780+
│   ├── SAMD00195911.fa.gz
1781+
│   ├── SAMD00195914.fa.gz
1782+
</code></pre>
1783+
</li>
1784+
<li>
1785+
<p>Prepare a file list of assemblies.</p>
1786+
<ul>
1787+
<li>
1788+
<p>Just use <code>find</code> or <a
1789+
class="gdoc-markdown__link"
1790+
href="https://github.com/sharkdp/fd"
1791+
>fd</a> (much faster).</p>
1792+
<pre><code> # find
1793+
find atb/ -name &quot;*.fa.gz&quot; &gt; files.txt
1794+
1795+
# fd
1796+
fd .fa.gz$ atb/ &gt; files.txt
1797+
</code></pre>
1798+
<p>What it looks like:</p>
1799+
<pre><code> $ head -n 2 files.txt
1800+
atb/atb.assembly.r0.2.batch.1/SAMD00013333.fa.gz
1801+
atb/atb.assembly.r0.2.batch.1/SAMD00049594.fa.gz
1802+
</code></pre>
1803+
</li>
1804+
<li>
1805+
<p>(Optional) Only keep high-quality assemblies.
1806+
Please manually download the <code>hq_set.sample_list.txt.gz</code> file from <a
1807+
class="gdoc-markdown__link"
1808+
href="https://osf.io/xv7q9/"
1809+
>this path</a>, e.g., <code>AllTheBacteria/Metadata/OSF Storage/Aggregated/Latest_2024-08/</code> (choose the latest date).</p>
1810+
<pre><code> find atb/ -name &quot;*.fa.gz&quot; | grep -w -f &lt;(zcat hq_set.sample_list.txt.gz) &gt; files.txt
1811+
</code></pre>
1812+
</li>
1813+
</ul>
1814+
</li>
1815+
<li>
1816+
<p>Creating a LexicMap index. (more details: <a
1817+
class="gdoc-markdown__link"
1818+
href="https://bioinf.shenwei.me/LexicMap/tutorials/index/"
1819+
>https://bioinf.shenwei.me/LexicMap/tutorials/index/</a>)</p>
1820+
<pre><code> lexicmap index -S -X files.txt -O atb.lmi -b 25000 --log atb.lmi.log
1821+
</code></pre>
1822+
</li>
1823+
</ol>
1824+
<div class="flex align-center gdoc-page__anchorwrap">
1825+
<h2 id="steps-for-v02-hosted-at-ebi-ftp"
1826+
>
1827+
Steps for v0.2 hosted at EBI ftp
1828+
</h2>
1829+
<a data-clipboard-text="https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/#steps-for-v02-hosted-at-ebi-ftp" class="gdoc-page__anchor clip flex align-center" title="Anchor to: Steps for v0.2 hosted at EBI ftp" aria-label="Anchor to: Steps for v0.2 hosted at EBI ftp" href="#steps-for-v02-hosted-at-ebi-ftp">
17321830
<svg class="gdoc-icon gdoc_link"><use xlink:href="#gdoc_link"></use></svg>
17331831
</a>
17341832
</div>
@@ -1760,6 +1858,7 @@ <h1>Indexing AllTheBacteria</h1>
17601858
so we use <code>gzip</code> (can be replaced with faster <code>pigz</code> ) to compress them to save disk space.</p>
17611859
<pre><code> # {^asm.tar.xz} is for removing the suffix &quot;asm.tar.xz&quot;
17621860
ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; gzip -f {^asm.tar.xz}/*.fa'
1861+
17631862
cd ..
17641863
</code></pre>
17651864
<p>After that, the assemblies directory would have multiple subdirectories.
@@ -1802,8 +1901,8 @@ <h1>Indexing AllTheBacteria</h1>
18021901

18031902

18041903

1805-
# index
1806-
lexicmap index -S -X atb_hq.txt -O atb_hq.lmi -b 25000 --log atb_hq.lmi.log
1904+
# index
1905+
lexicmap index -S -X atb_hq.txt -O atb_hq.lmi -b 25000 --log atb_hq.lmi.log
18071906
</code></pre>
18081907
<p>For 1,858,610 HQ genomes, on a 48-CPU machine, time: 48 h, ram: 85 GB, index size: 3.88 TB.
18091908
If you don&rsquo;t have enough memory, please decrease the value of <code>-b</code>.</p>
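
The two text-processing steps in the OSF workflow above (extracting tarball names and URLs from the file list, and filtering genome files by the high-quality sample list) can be tried on mock data before running them against the real dataset. The following is a minimal sketch, not the author's script: the column layout of the file list, the `example.org` URLs, the `batch.N` names, and the `hq.txt` file are all assumptions made up for illustration. It uses `gzip -dc` and a plain file for the pattern list instead of `zcat` and process substitution so that it runs under a plain POSIX shell.

```shell
set -eu

# Work in a throwaway directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Mock file list. The column layout (sample, file in tarball, tarball name,
# tarball URL) is an assumption inferred from the tutorial's awk command.
{
  printf 'sample\tfilename_in_tar_xz\ttar_xz\ttar_xz_url\n'
  printf 'SAMD00013333\tbatch.1/SAMD00013333.fa\tbatch.1.tar.xz\thttps://example.org/dl/aaa\n'
  printf 'SAMD00049594\tbatch.1/SAMD00049594.fa\tbatch.1.tar.xz\thttps://example.org/dl/aaa\n'
  printf 'SAMD00195911\tbatch.2/SAMD00195911.fa\tbatch.2.tar.xz\thttps://example.org/dl/bbb\n'
} | gzip -c > file_list.tsv.gz

# Tarball names and URLs: skip the header, keep columns 3 and 4, and
# collapse adjacent duplicates (rows are assumed to be grouped by tarball).
gzip -dc file_list.tsv.gz | awk -F '\t' 'NR>1 {print $3"\t"$4}' | uniq > tar2url.tsv

# Mock directory tree standing in for the decompressed, re-gzipped assemblies.
mkdir -p atb/batch.1 atb/batch.2
touch atb/batch.1/SAMD00013333.fa.gz \
      atb/batch.1/SAMD00049594.fa.gz \
      atb/batch.2/SAMD00195911.fa.gz

# Mock high-quality sample list, one sample ID per line (hypothetical content).
printf 'SAMD00013333\nSAMD00195911\n' > hq.txt

# Keep only genome files whose sample ID occurs as a whole word in the list.
find atb -name '*.fa.gz' | grep -w -F -f hq.txt | sort > files.txt

cat tar2url.tsv
cat files.txt
```

Note that `uniq` only merges adjacent duplicate lines; if the real file list is not grouped by tarball, `sort -u` would be the safer choice. The `-w` option matters for the filter because it prevents a sample ID from matching as a substring of a longer ID.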

tutorials/misc/index.xml

+1-1
@@ -26,7 +26,7 @@
2626
<link>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/</link>
2727
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
2828
<guid>https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/</guid>
29-
<description>Make sure you have enough disk space, at least 8 TB, &amp;gt;10 TB is preferred.&#xA;Tools:&#xA;https://github.com/shenwei356/rush, for running jobs Info:&#xA;AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/&#xA;mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf.</description>
29+
<description>Make sure you have enough disk space, at least 8 TB, &amp;gt;10 TB is preferred.&#xA;Tools:&#xA;https://github.com/shenwei356/rush, for running jobs Info:&#xA;AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF.</description>
3030
</item>
3131
<item>
3232
<title>Indexing GlobDB</title>

tutorials/parameters-general.tsv

+1-1
@@ -1,5 +1,5 @@
11
Flag Value Function Comment
22
**`-w/--load-whole-seeds`** Load the whole seed data into memory for faster search Use this if the index is not big and many queries are needed to search.
3-
**`-n/--top-n-genomes`** Default 0, 0 for all Keep top N genome matches for a query in the chaining phase The final number of genome hits might be smaller than this number as some chaining results might fail to pass the criteria in the alignment step.
3+
**`-n/--top-n-genomes`** Default 0, 0 for all Keep top N genome matches for a query in the chaining phase A value of 1 is not recommended, as the best chaining result does not always yield the best alignment; use a value >= 5. The final number of genome hits might be smaller than this number as some chaining results might fail to pass the criteria in the alignment step.
44
**`-a/--all`** Output more columns, e.g., matched sequences. "Use this if you want to output blast-style format with ""lexicmap utils 2blast"""
55
-J/--max-query-conc Default 12, 0 for all Maximum number of concurrent queries Bigger values do not improve the batch searching speed and consume much memory.

tutorials/search/index.html

+5-4
@@ -71,7 +71,7 @@
7171
"url" : "https://bioinf.shenwei.me/LexicMap/tutorials/search/",
7272
"headline": "Step 2. Searching",
7373
"description": "Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Improving searching speed Steps Output Alignment result relationship Output format Examples Summarizing results TL;DR Build a LexicMap index.\nRun:\nFor short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.",
74-
"wordCount" : "2918",
74+
"wordCount" : "2941",
7575
"inLanguage": "en",
7676
"isFamilyFriendly": "true",
7777
"mainEntityOfPage": {
@@ -1938,7 +1938,7 @@ <h1>Step 2. Searching</h1>
19381938
<td style="text-align:left"><strong><code>-n/--top-n-genomes</code></strong></td>
19391939
<td style="text-align:left">Default 0, 0 for all</td>
19401940
<td style="text-align:left">Keep top N genome matches for a query in the chaining phase</td>
1941-
<td style="text-align:left">The final number of genome hits might be smaller than this number as some chaining results might fail to pass the criteria in the alignment step.</td>
1941+
<td style="text-align:left">A value of 1 is not recommended, as the best chaining result does not always yield the best alignment; use a value &gt;= 5. The final number of genome hits might be smaller than this number as some chaining results might fail to pass the criteria in the alignment step.</td>
19421942
</tr>
19431943
<tr>
19441944
<td style="text-align:left"><strong><code>-a/--all</code></strong></td>
@@ -1950,7 +1950,7 @@ <h1>Step 2. Searching</h1>
19501950
<td style="text-align:left">-J/&ndash;max-query-conc</td>
19511951
<td style="text-align:left">Default 12, 0 for all</td>
19521952
<td style="text-align:left">Maximum number of concurrent queries</td>
1953-
<td style="text-align:left">Bigger values do not improve the batch searching speed and consume much memory</td>
1953+
<td style="text-align:left">Bigger values do not improve the batch searching speed and consume much memory.</td>
19541954
</tr>
19551955
</tbody>
19561956
</table> </div>
@@ -2492,7 +2492,8 @@ <h1>Step 2. Searching</h1>
24922492
Escherichia coli 128071
24932493
Streptococcus pneumoniae 51971
24942494
Staphylococcus aureus 44215
2495-
Pseudomonas aeruginosa 34254</code></pre>
2495+
Pseudomonas aeruginosa 34254
2496+
</code></pre>
24962497
</li>
24972498
</ol>
24982499
