Flexible Taxonomy Databases - A cross platform tool for customization and merging of various taxonomic classification sources.
New to FlexTaxD? --> see wiki
Supported input formats in version later than v0.4.2: For details see Formats-wiki
- NCBI
- TSV
- MP-style
- GTDB
- QIIME
- SILVA
- CanSNPer
Supported database build programs
- kraken2
- ganon
- krakenuniq
- centrifuge
The flextaxd (flextaxd) script allows customization of databases from several sources (See supported source formats above) and supports export functions into NCBI formatted names and nodes.dmp files as well as a standard tab separated file (or a selected separation).
- Create your database from source files --taxonomy_file
- and select --taxonomy_type [supported formats]
- Modify taxonomy (--mod_file/----mod_database)
- Modify taxonomy tree from selected node (--parent) Required for modification
- If the new database contains edges that are obsolete use --replace to remove any existing links
- Annotate genomes to database using genomeid2taxid
- Clean up tree, if your tree have many non annotated nodes it may be worthwile removing those from the database (for speed) use --clean_database to remove all non annotated nodes.
- To export the database into a file use --dump
- --dbprogram Select the required output format for downstream programs [supported_programs]
- --dump_prefix Change prefix on outputfiles from names,nodes)
- --dump_sep Set output separator default (NCBI \t|\t) unless --dump_mini is used (default \t sep)
- --dump_descriptions Dump node names to file instead of node ids
All data is kept in a sqlite3 database (.ftd by default) and can be dumped into NCBI formatted names and nodes.dmp (--dump) or tab separated files (--dump_mini, --dump_descriptions). The TSV dump format is similar to the NCBI dump except that it contains a header (parentchild), has parent on the left and only uses tab to separate each column (not <tab>|<tab>).
Python >=3.6
## additional requirements depending on executed functions
ncbi-genome-download - to download additional genomes from NCBI
## Visualisation
required for visualisation (plain newick format can be printed to stdout without biopython).
* biopython - required for newick_vis
* matplotlib - required for tree
* inquirer - required when multiple parents are present
## Database program requirement if create_database is executed
kraken2
krakenuniq
ganon
centrifuge
## Using conda
Install mamba in your base conda environment.
conda install mamba -n base -c conda-forge
We will then use mamba to create a self-contained conda environment with flextaxd:
mamba create -c conda-forge -c bioconda -n flextaxd flextaxd d
(For user new to conda please see the [conda guide])
## Manual install python - download git release https://github.com/FOI-Bioinformatics/flextaxd and run
python setup.py install
Read more about flextaxd on Wiki -> https://github.com/FOI-Bioinformatics/flextaxd/wiki
flextaxd --taxonomy_file taxonomy.tsv --database .ftd
flextaxd --dump
Print statistics --stats will print the number of nodes links and the number of annotated genomes.
Use the --help option for a complete list of parameters
flextaxd --help
The different formats supported and used by FlexTaxD are described in the wiki -> https://github.com/FOI-Bioinformatics/flextaxd/wiki/2a.-File-formats
The database update function can use either a previously built flextaxd database or directly through a TAB separated text file with headers (parent, child, (level))). Using the --parent parameter, all nodes/edges subsequent to that parent will be added (or can replace an existing node see options) with the links supplied. The parent node must exist in the database/tables and must have the same name (ex "Francisella tularensis"). Using the (--replace) parameter all children in the old database under the given parent will be removed, if you only want to replace for example Francisella tularensis be sure not to choose Francisella as parent.
flextaxd --mod_file custom_tree_file.txt --parent "Francisella tularensis" --genomeid2taxid custom_genome_annotations.txt
Create, modify and dump your database (observe modification can only be done simultanously when GTDB is the source, otw the genomeid2taxid is occupied with other nessesary files) Parent gives the name of the node where the modification should happen --replace will remove all child nodes of the given parent.
flextaxd --taxonomy_file taxonomy.tsv --mod_file custom_modification.txt --genomeid2taxid custom_genome_annotations.txt --parent "Francisella tularensis" --dump
Finally there is a quick option to create a kraken2 or a krakenuniq database using your custom taxonomy database by only supplying genome names matching your annotation table given as --genomeid2taxid Requirements: kraken2 or krakenuniq needs to be installed, For the FlexTaxD standard database, source data from genbank or refseq is required (for genomeid2taxid match)
Note: If your genome names are different, you can create a custom genome2taxid file and import into your database to match the names of your genome fasta/fa please avoid naming custom fasta files fna as this will be used to detect if genome names are formatted using refseq/genbank naming. Note that the refseq/genbank genomefiles needs to be gzipped (and end with .gz) in their stored location.
First dump your FlexTaxDatabase into names.dmp and nodes.dmp (default) if that was not already done.
flextaxd --dump -o NCBI_database
Add genomes to a kraken database <my_custom_krakendb> (and create the kraken database using --create_db) -o must be set to where the database names and nodes were dumped (NCBI_database)
flextaxd-create --kraken_db path/to/my_custom_krakendb --genomes_path path/to/genomes/folder --create_db --krakenversion kraken2 -o NCBI_database
The script will find all custom fasta, fa and refseq/genbank fna files in the given path and then add them to the krakendb, if --create_db parameter is given the script will execute the kraken-build --build command.
For a more in depth tutorial on how to merge a database onto the NCBI taxonomy see wiki -> https://github.com/FOI-Bioinformatics/flextaxd/wiki/8d.-Walkthrough--merge-NCBI-and-GTDB
The most common database to start with is the NCBI taxonomy tree, however there are many known caveats to the NCBI tree in particular in the Bacterial kingdom, FlexTaxD allows modifications of the NCBI taxonomy by replacing nodes with correct structures.
Creating a custom taxonomy database using the NCBI taxonomy tree instead of FlexTaxD as base
Required files from NCBI (ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy):
from taxdmp.zip
names.dmp
nodes.dmp
nucl_gb.accession2taxid.gz (ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz)
flextaxd
--taxonomy_file taxonomy/nodes.dmp ## Path to taxonomy nodes.dmp
--taxonomy_type NCBI ## NCBI formatted input
--genomes_path refseq/bacteria/ ## path to ncbi-genome-download folder with bacteria (or other folder structures)
--genomeid2taxid taxonomy/nucl_gb.accession2taxid.gz ## accession numbers to taxid annotation
--database NCBI_taxonomy.db ## Name of the database where all information will be kept (for future use or reuse of flextaxd)
-o NCBI_database ## Output folder
--dump ## Print out names.dmp and nodes.dmp from flextaxd database
Modify the NCBI database using a previously created flextaxd (example database from another source (CanSNPer.db))
Replace the Francisella node in the NCBI database with the node structure from a CanSNP database containing the Francisella CanSNPer tree.
## Step one: create your database containing required modifications
flextaxd --taxonomy_file cansnper_tree.txt --taxonomy_type CanSNPer --genomeid2taxid cansnper_genometotaxid_annotation.txt --database canSNPer_database/CanSNPer.db
## Step two: create your custom taxonomy database from NCBI
flextaxd --database NCBI_taxonomy.db --mod_database canSNPer_database/CanSNPer.db --parent "Francisella"
## Step three: dump your database
flextaxd --dump
Create kraken2 NCBI database (approx 60min with 40 cores with complete bacterial genomes from genbank may 2019 (~9000))
conda activate kraken2
flextaxd-create
--database NCBI_taxonomy.db ## flextaxd database file
-o NCBI_database ## output dir (must be the same as where names and nodes.dmp were exported (--dump))
--kraken_db NCBI_krakendb ## Name of kraken database, relative path from script, not realtive to --outdir (-o)
--genomes_path refseq/bacteria/ ## Path to ncbi-genome-download folder with bacteria (or other folder structures) using the same as when
# the database was created is recommended otw genomes may be missed
--create_db ## Request script to create a kraken database (kraken2 or krakenuniq must be installed)
--krakenversion kraken2 ## Select version of kraken which is installed
--processes 40 ## Number of cores to use for adding genomes to the kraken database as well as the number of cores
used to run kraken*-build (\*2,uniq)
--skip "taxid" ## exclude genomes in taxid see details below
Remove branches, the --skip parameter was implemented for benchmarking purposes as an option to remove branches by taxid, all children of the given taxid will be excluded.
flextaxd-create --skip "taxid,taxid2"
Sundell, D. et al. (2021) ‘FlexTaxD: flexible modification of taxonomy databases for improved sequence classification’, Bioinformatics. Edited by J. Kelso. Bioinformatics. doi: 10.1093/bioinformatics/btab621.