|
15 | 15 | <meta name="description" content="Make sure you have enough disk space, at least 8 TB, >10 TB is preferred.
|
16 | 16 | Tools:
|
17 | 17 | https://github.com/shenwei356/rush, for running jobs Info:
|
18 |
| -AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/ |
19 |
| -mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf." /> |
| 18 | +AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF." /> |
20 | 19 |
|
21 | 20 | <title>Indexing AllTheBacteria | LexicMap: efficient sequence alignment against millions of prokaryotic genomes</title>
|
22 | 21 |
|
|
45 | 44 | <meta property="og:description" content="Make sure you have enough disk space, at least 8 TB, >10 TB is preferred.
|
46 | 45 | Tools:
|
47 | 46 | https://github.com/shenwei356/rush, for running jobs Info:
|
48 |
| -AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/ |
49 |
| -mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf." /> |
| 47 | +AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF." /> |
50 | 48 | <meta property="og:type" content="article" />
|
51 | 49 | <meta property="og:url" content="https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/" />
|
52 | 50 |
|
|
58 | 56 | <meta name="twitter:description" content="Make sure you have enough disk space, at least 8 TB, >10 TB is preferred.
|
59 | 57 | Tools:
|
60 | 58 | https://github.com/shenwei356/rush, for running jobs Info:
|
61 |
| -AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/ |
62 |
| -mkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf." /> |
| 59 | +AllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF." /> |
63 | 60 |
|
64 | 61 |
|
65 | 62 | <script type="application/ld+json">
|
|
70 | 67 | "name": "Indexing AllTheBacteria",
|
71 | 68 | "url" : "https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/",
|
72 | 69 | "headline": "Indexing AllTheBacteria",
|
73 |
| - "description": "Make sure you have enough disk space, at least 8 TB, \u003e10 TB is preferred.\nTools:\nhttps:\/\/github.com\/shenwei356\/rush, for running jobs Info:\nAllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Steps for v0.2 Downloading assemblies tarballs here (except these starting with unknown__) to a directory (like atb): https:\/\/ftp.ebi.ac.uk\/pub\/databases\/AllTheBacteria\/Releases\/0.2\/assembly\/\nmkdir -p atb; cd atb; # assembly file list, 650 files in total wget https:\/\/bioinf.", |
74 |
| - "wordCount" : "416", |
| 70 | + "description": "Make sure you have enough disk space, at least 8 TB, \u003e10 TB is preferred.\nTools:\nhttps:\/\/github.com\/shenwei356\/rush, for running jobs Info:\nAllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https:\/\/osf.io\/xv7q9\/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF.", |
| 71 | + "wordCount" : "744", |
75 | 72 | "inLanguage": "en",
|
76 | 73 | "isFamilyFriendly": "true",
|
77 | 74 | "mainEntityOfPage": {
|
@@ -1722,13 +1719,114 @@ <h1>Indexing AllTheBacteria</h1>
|
1722 | 1719 | class="gdoc-markdown__link"
|
1723 | 1720 | href="https://www.biorxiv.org/content/10.1101/2024.03.08.584059v1"
|
1724 | 1721 | >AllTheBacteria - all bacterial genomes assembled, available and searchable</a></li>
|
| 1722 | +<li>Data on OSF: <a |
| 1723 | + class="gdoc-markdown__link" |
| 1724 | + href="https://osf.io/xv7q9/" |
| 1725 | +>https://osf.io/xv7q9/</a></li> |
1725 | 1726 | </ul>
|
1726 | 1727 | <div class="flex align-center gdoc-page__anchorwrap">
|
1727 |
| - <h2 id="steps-for-v02" |
| 1728 | + <h2 id="steps-for-v02-and-later-versions-hosted-at-osf" |
1728 | 1729 | >
|
1729 |
| - Steps for v0.2 |
| 1730 | + Steps for v0.2 and later versions hosted at OSF |
1730 | 1731 | </h2>
|
1731 |
| - <a data-clipboard-text="https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/#steps-for-v02" class="gdoc-page__anchor clip flex align-center" title="Anchor to: Steps for v0.2" aria-label="Anchor to: Steps for v0.2" href="#steps-for-v02"> |
| 1732 | + <a data-clipboard-text="https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/#steps-for-v02-and-later-versions-hosted-at-osf" class="gdoc-page__anchor clip flex align-center" title="Anchor to: Steps for v0.2 and later versions hosted at OSF" aria-label="Anchor to: Steps for v0.2 and later versions hosted at OSF" href="#steps-for-v02-and-later-versions-hosted-at-osf"> |
| 1733 | + <svg class="gdoc-icon gdoc_link"><use xlink:href="#gdoc_link"></use></svg> |
| 1734 | + </a> |
| 1735 | +</div> |
| 1736 | +<p>After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at <a |
| 1737 | + class="gdoc-markdown__link" |
| 1738 | + href="https://osf.io/xv7q9/" |
| 1739 | +>OSF</a>.</p> |
| 1740 | +<ol> |
| 1741 | +<li> |
| 1742 | +<p>Downloading the file list of all assemblies in the latest version (v0.2 plus incremental versions), available under <a |
| 1743 | + class="gdoc-markdown__link" |
| 1744 | + href="https://osf.io/zxfmy/" |
| 1745 | +>assemblies</a>.</p> |
| 1746 | +<pre><code> mkdir -p atb; |
| 1747 | + cd atb; |
| 1748 | + |
| 1749 | + # Note: the URL might change; please check it in the browser. |
| 1750 | + wget https://osf.io/download/4yv85/ -O file_list.all.latest.tsv.gz |
| 1751 | +</code></pre> |
| 1752 | +<p>If you only need to add assemblies from an incremental version, |
| 1753 | +please manually download the file list from the path <code>AllTheBacteria/Assembly/OSF Storage/File_lists</code>.</p> |
| 1754 | +</li> |
| 1755 | +<li> |
| 1756 | +<p>Downloading assembly tarball files.</p> |
| 1757 | +<pre><code> # tarball file names and their URLs |
| 1758 | + zcat file_list.all.latest.tsv.gz | awk 'NR>1 {print $3"\t"$4}' | uniq > tar2url.tsv |
| 1759 | + |
| 1760 | + # download |
| 1761 | + cat tar2url.tsv | rush --eta -j 2 -c -C download.rush 'wget -O {1} {2}' |
| 1762 | +</code></pre> |
| 1763 | +</li> |
| 1764 | +<li> |
| 1765 | +<p>Decompressing all tarballs. The decompressed genomes are stored in plain text, |
| 1766 | +so we use <code>gzip</code> (can be replaced with faster <code>pigz</code> ) to compress them to save disk space.</p> |
| 1767 | +<pre><code> # {^.tar.xz} is for removing the suffix ".tar.xz" |
| 1768 | + ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; gzip -f {^.tar.xz}/*.fa' |
| 1769 | + |
| 1770 | + cd .. |
| 1771 | +</code></pre> |
| 1772 | +<p>After that, the assemblies directory contains multiple subdirectories. |
| 1773 | +When you give the directory to <code>lexicmap index -I</code>, it recursively scans for genome files (plain or gz/xz/zstd-compressed). |
| 1774 | +You can also give a file list of selected assemblies.</p> |
| 1775 | +<pre><code> $ tree atb | more |
| 1776 | + atb |
| 1777 | + ├── atb.assembly.r0.2.batch.1 |
| 1778 | + │ ├── SAMD00013333.fa.gz |
| 1779 | + │ ├── SAMD00049594.fa.gz |
| 1780 | + │ ├── SAMD00195911.fa.gz |
| 1781 | + │ ├── SAMD00195914.fa.gz |
| 1782 | +</code></pre> |
| 1783 | +</li> |
| 1784 | +<li> |
| 1785 | +<p>Preparing a file list of assemblies.</p> |
| 1786 | +<ul> |
| 1787 | +<li> |
| 1788 | +<p>Just use <code>find</code> or <a |
| 1789 | + class="gdoc-markdown__link" |
| 1790 | + href="https://github.com/sharkdp/fd" |
| 1791 | +>fd</a> (much faster).</p> |
| 1792 | +<pre><code> # find |
| 1793 | + find atb/ -name "*.fa.gz" > files.txt |
| 1794 | + |
| 1795 | + # fd |
| 1796 | + fd '\.fa\.gz$' atb/ > files.txt |
| 1797 | +</code></pre> |
| 1798 | +<p>What it looks like:</p> |
| 1799 | +<pre><code> $ head -n 2 files.txt |
| 1800 | + atb/atb.assembly.r0.2.batch.1/SAMD00013333.fa.gz |
| 1801 | + atb/atb.assembly.r0.2.batch.1/SAMD00049594.fa.gz |
| 1802 | +</code></pre> |
| 1803 | +</li> |
| 1804 | +<li> |
| 1805 | +<p>(Optional) Keep only high-quality assemblies. |
| 1806 | +Please manually download the <code>hq_set.sample_list.txt.gz</code> file from <a |
| 1807 | + class="gdoc-markdown__link" |
| 1808 | + href="https://osf.io/xv7q9/" |
| 1809 | +>this path</a>, e.g., <code>AllTheBacteria/Metadata/OSF Storage/Aggregated/Latest_2024-08/</code> (choose the latest date).</p> |
| 1810 | +<pre><code> find atb/ -name "*.fa.gz" | grep -w -f <(zcat hq_set.sample_list.txt.gz) > files.txt |
| 1811 | +</code></pre> |
| 1812 | +</li> |
| 1813 | +</ul> |
| 1814 | +</li> |
| 1815 | +<li> |
| 1816 | +<p>Creating a LexicMap index. (more details: <a |
| 1817 | + class="gdoc-markdown__link" |
| 1818 | + href="https://bioinf.shenwei.me/LexicMap/tutorials/index/" |
| 1819 | +>https://bioinf.shenwei.me/LexicMap/tutorials/index/</a>)</p> |
| 1820 | +<pre><code> lexicmap index -S -X files.txt -O atb.lmi -b 25000 --log atb.lmi.log |
| 1821 | +</code></pre> |
| 1822 | +</li> |
| 1823 | +</ol> |
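As a rough illustration of the tarball-list step above, the sketch below runs an equivalent awk/uniq pipeline on a tiny mock file list. The column names, sample IDs, tarball names, and URLs are hypothetical placeholders; the only thing carried over from the steps above is the assumption that the tarball name and its URL sit in columns 3 and 4. `gzip -dc` is used in place of `zcat` purely for portability.

```shell
# Build a mock, tab-separated file list (all values are hypothetical).
tmpdir=$(mktemp -d)
{
    printf 'sample\tfilename_in_tar\ttar_name\ttar_url\n'
    printf 'S1\tbatch.1/S1.fa\tbatch.1.tar.xz\thttps://example.org/batch1\n'
    printf 'S2\tbatch.1/S2.fa\tbatch.1.tar.xz\thttps://example.org/batch1\n'
    printf 'S3\tbatch.2/S3.fa\tbatch.2.tar.xz\thttps://example.org/batch2\n'
} > "$tmpdir/file_list.tsv"
gzip -f "$tmpdir/file_list.tsv"

# Skip the header, keep columns 3 and 4, and collapse adjacent duplicates,
# so each tarball appears once even though it holds many assemblies.
gzip -dc "$tmpdir/file_list.tsv.gz" \
    | awk 'NR>1 {print $3"\t"$4}' \
    | uniq > "$tmpdir/tar2url.tsv"

cat "$tmpdir/tar2url.tsv"
```

Note that `uniq` only merges adjacent duplicate lines, so this relies on rows of the same tarball being grouped together in the file list; `sort -u` would be a safer choice if that is not guaranteed.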
| 1824 | +<div class="flex align-center gdoc-page__anchorwrap"> |
| 1825 | + <h2 id="steps-for-v02-hosted-at-ebi-ftp" |
| 1826 | + > |
| 1827 | + Steps for v0.2 hosted at EBI ftp |
| 1828 | + </h2> |
| 1829 | + <a data-clipboard-text="https://bioinf.shenwei.me/LexicMap/tutorials/misc/index-allthebacteria/#steps-for-v02-hosted-at-ebi-ftp" class="gdoc-page__anchor clip flex align-center" title="Anchor to: Steps for v0.2 hosted at EBI ftp" aria-label="Anchor to: Steps for v0.2 hosted at EBI ftp" href="#steps-for-v02-hosted-at-ebi-ftp"> |
1732 | 1830 | <svg class="gdoc-icon gdoc_link"><use xlink:href="#gdoc_link"></use></svg>
|
1733 | 1831 | </a>
|
1734 | 1832 | </div>
|
@@ -1760,6 +1858,7 @@ <h1>Indexing AllTheBacteria</h1>
|
1760 | 1858 | so we use <code>gzip</code> (can be replaced with faster <code>pigz</code> ) to compress them to save disk space.</p>
|
1761 | 1859 | <pre><code> # {^asm.tar.xz} is for removing the suffix "asm.tar.xz"
|
1762 | 1860 | ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; gzip -f {^asm.tar.xz}/*.fa'
|
| 1861 | + |
1763 | 1862 | cd ..
|
1764 | 1863 | </code></pre>
|
1765 | 1864 | <p>After that, the assemblies directory contains multiple subdirectories.
|
@@ -1802,8 +1901,8 @@ <h1>Indexing AllTheBacteria</h1>
|
1802 | 1901 |
|
1803 | 1902 |
|
1804 | 1903 |
|
1805 |
| -# index |
1806 |
| -lexicmap index -S -X atb_hq.txt -O atb_hq.lmi -b 25000 --log atb_hq.lmi.log |
| 1904 | + # index |
| 1905 | + lexicmap index -S -X atb_hq.txt -O atb_hq.lmi -b 25000 --log atb_hq.lmi.log |
1807 | 1906 | </code></pre>
|
1808 | 1907 | <p>Indexing 1,858,610 HQ genomes on a 48-CPU machine took 48 h and 85 GB of RAM; the index size is 3.88 TB.
|
1809 | 1908 | If you don’t have enough memory, please decrease the value of <code>-b</code>.</p>
|
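Since the page asks for at least 8 TB of disk space, a quick pre-flight check before downloading and indexing may save a failed run. The following is a minimal sketch, not part of the LexicMap tooling: the variable names are made up for illustration, it checks the current directory only, and it assumes 1 TB = 1024^4 bytes.

```shell
# Minimum space taken from the page's recommendation (at least 8 TB).
required_tb=8

# Available space of the current directory's filesystem, in 1 KiB blocks.
# -P forces the single-line POSIX output format; column 4 is "Available".
avail_kb=$(df -Pk . | awk 'NR==2 {print $4}')

# 1 TB = 1024^3 KiB under the assumption stated above.
required_kb=$((required_tb * 1024 * 1024 * 1024))

if [ "$avail_kb" -lt "$required_kb" ]; then
    echo "warning: less than ${required_tb} TB available here"
else
    echo "ok: at least ${required_tb} TB available here"
fi
```

Run it from the directory where the `atb` folder and the index will be written, since other mount points may have different amounts of free space.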
|