This document describes the databases used by the pipeline.
To build the database, first install Kraken2 if it is not already installed. It can be installed with conda using this command:
conda install bioconda::kraken2
Commands for building the Kraken database:
kraken2-build --threads 16 --download-taxonomy --db nt
kraken2-build --threads 16 --download-library nt --db nt
kraken2-build --build --threads 16 --db nt
Warning: building this database requires hundreds of gigabytes of memory.
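Because the build is long and memory-hungry, it can help to confirm that it finished before pointing the pipeline at the result. The following is a minimal sketch, not part of the pipeline: the function name `check_kraken2_db` is hypothetical, and it assumes a completed Kraken2 database directory contains non-empty `hash.k2d`, `opts.k2d` and `taxo.k2d` files.

```shell
# Sketch: confirm that a Kraken2 database directory contains the three
# files that a finished build is expected to produce.
# check_kraken2_db is a hypothetical helper name, not a Kraken2 command.
check_kraken2_db() {
    k2_db_dir="$1"
    k2_db_status=0
    for k2_file in hash.k2d opts.k2d taxo.k2d; do
        # -s is true only if the file exists and is not empty
        if [ ! -s "${k2_db_dir}/${k2_file}" ]; then
            echo "missing or empty: ${k2_db_dir}/${k2_file}" >&2
            k2_db_status=1
        fi
    done
    return "${k2_db_status}"
}
```

For the commands above, this could be called as `check_kraken2_db nt`, since `nt` is the directory passed to `--db`.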
Download from:
Download and set up according to the instructions at
Download the nr database protein FASTA files from the NCBI FTP server (wget
) and build the database in the same way as the UniProt DIAMOND database, following the instructions at
Download the files from the NCBI FTP server and uncompress them:
wget \
&& wget \
&& wget \
&& wget \
&& wget \
&& wget \
&& wget \
&& wget \
&& gunzip *accession2taxid.gz
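After the download and `gunzip` step, a quick check can catch an interrupted transfer or a file that was never uncompressed. This is only a sketch under assumptions: the helper name `check_accession2taxid` is hypothetical, and it assumes the uncompressed files keep names ending in `accession2taxid`, as implied by the `gunzip *accession2taxid.gz` pattern above.

```shell
# Sketch: verify that a directory holds uncompressed accession2taxid files
# and that no compressed copies are left behind.
# check_accession2taxid is a hypothetical helper name.
check_accession2taxid() {
    a2t_dir="${1:-.}"
    # ls exits non-zero when the glob matches nothing
    if ls "${a2t_dir}"/*accession2taxid.gz >/dev/null 2>&1; then
        echo "compressed accession2taxid files still present in ${a2t_dir}" >&2
        return 1
    fi
    if ! ls "${a2t_dir}"/*accession2taxid >/dev/null 2>&1; then
        echo "no uncompressed accession2taxid files in ${a2t_dir}" >&2
        return 1
    fi
    return 0
}
```

Run it from the download directory as `check_accession2taxid .`, or pass the directory path as the first argument.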
Downloading instructions are at
The FCS-adaptor database is included in the FCS-adaptor installation, so it doesn't need to be downloaded separately.
A FASTA file with the sequences for making a VecScreen database is included in the ASCC repository. It is the vecscreen_adaptors_for_screening_euks.fa file in the assets directory of this pipeline.
VecScreen requires a BLAST V4 database as input. To generate one from the file above, run the following command:
makeblastdb -in vecscreen_adaptors_for_screening_euks.fa -parse_seqids -blastdb_version 4 -dbtype nucl
To use this database, point the vecscreen_database_path variable in the input YAML file of the pipeline run to the directory that contains this BLAST database. Use the path of the directory for vecscreen_database_path, not the names of the database files, e.g. /path/to/my/database/files/vecscreen_database/
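Before setting vecscreen_database_path, it may be worth checking that the chosen directory really holds a nucleotide BLAST database. The sketch below is an assumption-laden helper, not part of the pipeline: `check_blast_nucl_db` is a hypothetical name, and it only looks for the core `.nhr`, `.nin` and `.nsq` files that `makeblastdb -dbtype nucl` writes, without validating their contents or the database version.

```shell
# Sketch: check that a directory contains the core files of a nucleotide
# BLAST database (.nhr header, .nin index, .nsq sequence data).
# check_blast_nucl_db is a hypothetical helper name.
check_blast_nucl_db() {
    blast_db_dir="$1"
    blast_db_status=0
    for blast_ext in nhr nin nsq; do
        # ls exits non-zero when the glob matches no file
        if ! ls "${blast_db_dir}"/*."${blast_ext}" >/dev/null 2>&1; then
            echo "no .${blast_ext} file found in ${blast_db_dir}" >&2
            blast_db_status=1
        fi
    done
    return "${blast_db_status}"
}
```

With the example path above, the call would be `check_blast_nucl_db /path/to/my/database/files/vecscreen_database/`.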
A FASTA file with the sequences of PacBio multiplexing barcodes is included in the ASCC repository. It is the pacbio_adaptors.fa file in the assets directory of this pipeline.