This document describes the databases used by the pipeline.
For building the database, install Kraken2 if it is not already installed. It can be installed using conda with this command:
conda install bioconda::kraken2
Commands for building the Kraken database:
kraken2-build --threads 16 --download-taxonomy --db nt
kraken2-build --threads 16 --download-library nt --db nt
kraken2-build --build --threads 16 --db nt
Warning: building this database requires hundreds of gigabytes of memory.
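Once the build finishes, a quick sanity check is to confirm that the database directory holds the three files Kraken2 loads at classification time (hash.k2d, opts.k2d and taxo.k2d). The sketch below assumes the database directory is the nt directory used in the commands above. The function name check_kraken2_db is made up for illustration and is not part of Kraken2 or this pipeline.

```shell
# Hypothetical helper: report whether a directory looks like a finished
# Kraken2 database by checking for its three core files.
check_kraken2_db() {
    db_dir="$1"
    for f in hash.k2d opts.k2d taxo.k2d; do
        # -s is true only if the file exists and is not empty.
        if [ ! -s "$db_dir/$f" ]; then
            echo "missing or empty: $db_dir/$f" >&2
            return 1
        fi
    done
    echo "Kraken2 database files found in $db_dir"
}

# Example call (assumes the build above was run with --db nt):
# check_kraken2_db nt
```

This only checks that the files exist and are non-empty. It does not validate their contents.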
Download from https://busco-data.ezlab.org/v5/data/.
Download and set up according to the instructions at https://blobtoolkit.genomehubs.org/install/.
Download the nr database protein FASTA file from the NCBI FTP server (wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz) and build the database in the same way as the UniProt Diamond database, following the instructions at https://blobtoolkit.genomehubs.org/install/.
Download the files from the NCBI FTP server and uncompress them:
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_wgs.accession2taxid.gz.md5 \
&& wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_nucl.accession2taxid.gz.md5 \
&& wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz.md5 \
&& wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz.md5 \
&& wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_nucl.accession2taxid.gz \
&& wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_wgs.accession2taxid.gz \
&& wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz \
&& wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz \
&& gunzip *accession2taxid.gz
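The .md5 files fetched above can be used to verify the archives, but this has to happen before the gunzip step, because gunzip removes the .gz files. The following is a minimal sketch, assuming GNU md5sum is available and that each .md5 file uses the `checksum  filename` layout that md5sum -c expects. The function name verify_archives is hypothetical.

```shell
# Hypothetical helper: check every *.accession2taxid.gz archive in a
# directory against its accompanying .md5 file.
verify_archives() {
    dir="${1:-.}"
    (
        # md5sum -c resolves file names relative to the current directory,
        # so run the checks from inside the download directory.
        cd "$dir" || exit 1
        status=0
        for md5file in *.accession2taxid.gz.md5; do
            # If the glob matched nothing, the literal pattern is left in place.
            [ -e "$md5file" ] || { echo "no .md5 files in $dir" >&2; exit 1; }
            md5sum -c "$md5file" || status=1
        done
        exit "$status"
    )
}

# Example: decompress only if every checksum matches.
# verify_archives . && gunzip *accession2taxid.gz
```

The function returns a non-zero status if any archive fails its check, so it can gate the decompression step as shown in the commented example.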
Download instructions are at https://github.com/ncbi/fcs/wiki/FCS-GX-quickstart#download-the-fcs-gx-database.
The FCS-adaptor database is included in the FCS-adaptor installation, so it doesn't need to be downloaded separately.
A FASTA file with the sequences for making a VecScreen database is included in the ASCC repository: vecscreen_adaptors_for_screening_euks.fa in the assets directory of this pipeline.
VecScreen requires a BLAST V4 database as input. It can be generated from the above file with the following command:
makeblastdb -in vecscreen_adaptors_for_screening_euks.fa -parse_seqids -blastdb_version 4 -dbtype nucl
To use this database, point the vecscreen_database_path variable in the input YAML file of the pipeline run to the directory that contains this BLAST database. Use the path of the directory only, without the names of the database files, e.g. /path/to/my/database/files/vecscreen_database/.
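Before setting vecscreen_database_path, it can help to confirm that the chosen directory really contains a nucleotide BLAST database. The sketch below checks for the header, index and sequence files that makeblastdb writes for -dbtype nucl (.nhr, .nin and .nsq). The function name check_blast_nucl_dir and the example path are illustrative assumptions, not part of the pipeline.

```shell
# Hypothetical helper: confirm that a directory holds the core files of a
# nucleotide BLAST database (.nhr, .nin, .nsq).
check_blast_nucl_dir() {
    db_dir="$1"
    for ext in nhr nin nsq; do
        # ls fails when no file with this extension exists in the directory.
        if ! ls "$db_dir"/*."$ext" >/dev/null 2>&1; then
            echo "no .$ext file in $db_dir" >&2
            return 1
        fi
    done
    echo "BLAST nucleotide database found in $db_dir"
}

# Example call with the placeholder path from the text above:
# check_blast_nucl_dir /path/to/my/database/files/vecscreen_database/
```

If the check passes, the same directory path is the value to use for vecscreen_database_path.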
A FASTA file with the sequences of PacBio multiplexing barcodes is included in the ASCC repository: pacbio_adaptors.fa in the assets directory of this pipeline.