EBI-Metagenomics/assembly-analysis-pipeline

Introduction

MGnify assembly analysis pipeline

This repository contains the MGnify assembly analysis pipeline, from version 6.0.0 onwards. For version 5.0 of the pipeline, please follow this link.

V6 Schema

Pipeline description

Features

The MGnify assembly analysis pipeline, version 6.0.0 and onwards, provides the following key features:

  • Assembly Quality Control: The pipeline performs quality control on the assembled contigs and includes optional decontamination functionality to remove human, PhiX, and custom contaminant sequences.
  • CDS Prediction: The pipeline utilizes the MGnify Combined Gene Caller to predict coding sequences (CDS) within the assembled contigs.
  • Taxonomic Assignment: The pipeline assigns taxonomic classifications to the assembled contigs using Contig Annotation Tool (CAT).
  • Functional Annotation:
    • InterProScan: Identifies protein domains, families, and functional sites.
    • eggNOG-mapper: Assigns Clusters of Orthologous Groups (COG) annotations and eggNOG functional descriptions.
    • GO Slims: Maps the protein sequences to Gene Ontology (GO) Slim terms.
    • run_dbCAN: Annotates carbohydrate-active enzymes.
    • KEGG Orthologs: Assigns KEGG Orthologs (KO) identifiers using HMMER.
    • RHEA: Assigns RHEA reaction identifiers to proteins.
  • Biosynthetic Gene Cluster Annotation: The pipeline uses antiSMASH and SanntiS to identify and annotate biosynthetic gene clusters associated with secondary metabolite production.
  • KEGG Modules completeness: The pipeline analyzes the KEGG Orthologs annotations to infer the presence and completeness of KEGG modules.
  • Consolidated annotation: The pipeline aggregates all the generated annotations into a single consolidated GFF file.

Tools

| Tool | Version | Purpose |
| --- | --- | --- |
| antiSMASH | 8.0.1 | Identification and annotation of secondary metabolite biosynthesis gene clusters |
| boto3 | 1.35.37 | AWS SDK for Python, used to access EBI FIRE S3 storage for assembly file downloads |
| CAT_pack | 6.0 | Taxonomic classification of the contigs in the assembly |
| cmsearchtbloutdeoverlap | 0.09 | Deoverlapping of cmsearch results |
| csvtk | 0.31.0 | A cross-platform, efficient, and practical CSV/TSV toolkit |
| Combined Gene Caller - Merge | 1.2.0 | Merge script that combines the predictions of Pyrodigal and FragGeneScanRs (part of the mgnify-pipelines-toolkit) |
| Diamond | 2.1.11 | Matches predicted CDS against the CAT reference database for the taxonomic classification of the contigs |
| DRAM | 13.5 | Summarizes annotations from multiple tools like KEGG, Pfam, and CAZy |
| easel | 0.49 | Extracts FASTA sequences by name from a cmsearch deoverlap result |
| extractcoords | 1.2.0 | Processes output from easel-sfetch to extract SSU and LSU sequences (part of the mgnify-pipelines-toolkit) |
| FragGeneScanRs | 1.1.0 | CDS calling; this tool specializes in calling fragmented CDS |
| generategaf | 1.2.0 | Generates a GO Annotation File (GAF) from an InterProScan result TSV file (part of the mgnify-pipelines-toolkit) |
| Genome Properties | 2.0 | Uses protein signatures as evidence to determine the presence of each step within a property |
| Infernal - cmscan | 1.1.5 | RNA sequence searching |
| InterProScan | 5.76-107.0 | Functionally characterizes nucleotide or protein sequences by scanning them against the InterPro database |
| HMMER | 3.4 | Annotates CDS with KO |
| Krona | 2.8.1 | Krona chart visualization |
| kegg-pathways-completeness | 1.3.0 | Computes the completeness of each KEGG pathway module based on KEGG orthologue (KO) annotations |
| MGnify pipelines toolkit | 1.2.0 | Collection of tools and scripts used in MGnify pipelines |
| minimap2 | 2.29-r1283 | A versatile pairwise aligner for genomic and spliced nucleotide sequences; used in the assembly decontamination subworkflow |
| MultiQC | 1.29 | Aggregates bioinformatic analysis results |
| Owltools | 2024-06-12T00:00:00Z | Maps GO terms to GO Slims |
| Pyrodigal | 3.6.3 | CDS calling |
| pigz | 2.3.4 | A parallel implementation of gzip for modern multi-processor, multi-core systems |
| QUAST | 5.2.0 | Evaluates genome assemblies; part of the pipeline QC module |
| run_dbCAN | 5.1.2 | Annotation tool for the Carbohydrate-Active enZYmes Database (CAZy) |
| SeqKit | 2.8.0 | Manipulates FASTA files |
| SanntiS | 0.9.4.1 | Identifies biosynthetic gene clusters |
| tabix | 1.21 | Generic indexer for TAB-delimited genome position files |
| Genome Tools - gff3validator | 1.6.5 | Validates the analysis summary GFF file |
| jq | 1.5 | Concatenates the chunked antiSMASH JSON results |

Reference databases

This pipeline uses several reference databases, listed in the following table. The databases marked with * are downloaded and post-processed by the Microbiome Informatics reference-databases-preprocessing-pipeline. Our team also stores ready-to-use versions of these databases on EBI's FTP server.

| Reference database | Version | Purpose | Download |
| --- | --- | --- | --- |
| Rfam covariance models | 15 | rRNA covariance models | ftp://ftp.ebi.ac.uk/pub/databases/Rfam/15.0/Rfam.cm.gz |
| Rfam clan info | 15 | rRNA clan information | ftp://ftp.ebi.ac.uk/pub/databases/Rfam/15.0/Rfam.clanin |
| InterProScan | 5.73-104.0 | InterProScan reference database | ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.73-104.0/ |
| eggNOG-mapper | 5.0.2 | eggNOG-mapper annotation databases and Diamond | https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.12#requirements |
| antiSMASH | 8.0.1 | The antiSMASH reference database | https://docs.antismash.secondarymetabolites.org/install/#antismash-standalone-lite |
| KOFAM* | 2025-04 | KOfam - HMM profiles for KEGG/KO. Our reference generation pipeline generates the required files | https://github.com/EBI-Metagenomics/reference-databases-preprocessing-pipeline |
| GO Slims* | 20160705 | Metagenomics GO Slims | ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/goslim/20160705/goslim_20160705.tar.gz |
| run_dbCAN | 4.1.4-V13 | Pre-built run_dbCAN reference database | ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/dbcan/dbcan_4.1.3_V12.tar.gz |
| CAT_pack | 2025_01 | CAT/BAT/RAT NCBI taxonomy pre-made reference database | https://github.com/MGXlab/CAT_pack?tab=readme-ov-file#downloading-preconstructed-database-files |
| DRAM | 1.3.0 | DRAM databases | https://github.com/WrightonLabCSU/DRAM/wiki#dram-setup |

Reference genomes

The pipeline includes an optional decontamination step that requires reference genomes (e.g., human, PhiX174, or any user-supplied genome). Frequently used reference genomes are available on our FTP server.

Use the following pipeline options to configure references:

  • --reference_genomes_folder: Path to a folder containing all reference genome subfolders.

  • --human_reference, --phyx_reference, --contaminant_reference: Names of the subfolders (not paths) for each specific reference.

Each genome should be organized as follows:

<reference_genomes_folder>/
├── <genome_prefix>/
│   └── <genome_prefix>.fna

Important

FASTA files must use the .fna extension.
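
The layout above can be checked before launching a run. The following is a minimal sketch and not part of the pipeline: the function name `missing_reference_fastas` is hypothetical, and it assumes that each subfolder name doubles as the genome prefix, as in the tree shown.

```python
from pathlib import Path


def missing_reference_fastas(reference_genomes_folder, genome_prefixes):
    """Return the expected .fna paths that are not present on disk.

    Expected layout (taken from the tree above):
        <reference_genomes_folder>/<genome_prefix>/<genome_prefix>.fna
    """
    root = Path(reference_genomes_folder)
    missing = []
    for prefix in genome_prefixes:
        # Only the .fna extension is accepted, so .fasta or .fa files count as missing.
        fasta = root / prefix / f"{prefix}.fna"
        if not fasta.is_file():
            missing.append(fasta)
    return missing
```

Calling it with the folder passed to --reference_genomes_folder and the subfolder names used for the individual references returns an empty list when every expected FASTA file is in place.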

How to run

Requirements

At the moment, the only prerequisites for running the pipeline are Nextflow and Docker or Singularity, since all the Nextflow processes use pre-built containers.

Input shape

The pipeline takes metagenomic assembly FASTA files as input. Specify them in a .csv samplesheet with this format:

sample,assembly_fasta,contaminant_reference,human_reference,phix_reference
ERZ999,/path/to/assembly/ERZ999.fasta.gz,,,
ERZ998,/path/to/assembly/ERZ998.fasta.gz,,,
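
For larger batches, the samplesheet can be generated rather than typed by hand. The sketch below is illustrative only: `build_samplesheet` is a hypothetical helper, and it assumes that the three reference columns may be left empty, as they are in the example above.

```python
import csv
import io

# Column order copied from the samplesheet header shown above.
SAMPLESHEET_COLUMNS = [
    "sample",
    "assembly_fasta",
    "contaminant_reference",
    "human_reference",
    "phix_reference",
]


def build_samplesheet(assemblies):
    """Render a list of row dicts as samplesheet CSV text.

    Each dict must provide "sample" and "assembly_fasta"; the reference
    columns are treated as optional (an assumption) and default to "".
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=SAMPLESHEET_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in assemblies:
        if not row.get("sample") or not row.get("assembly_fasta"):
            raise ValueError(f"sample and assembly_fasta are required: {row!r}")
        writer.writerow({col: row.get(col, "") for col in SAMPLESHEET_COLUMNS})
    return buffer.getvalue()
```

Writing the returned text to a file gives a samplesheet that can be passed to the pipeline with --input.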

FIRE Download Support (EBI Network Only)

Important

This functionality is only enabled on the EBI network, which is accessible only to EBI staff. It makes no functional changes to the annotation; it only affects the assembly download step.

The pipeline includes support for downloading assembly files directly from the EBI FIRE system. This feature is only available when running on the EBI network and is disabled by default (--use_fire_download false).

To use this feature:

  1. Network requirement: You must be connected to the EBI network (only available to EBI staff)
  2. Enable FIRE download: Add the --use_fire_download parameter when running the pipeline
  3. Configure samplesheet: Use ENA FTP URLs or ENA HTTP links in the assembly_fasta column (the script will automatically translate these to FIRE S3 URLs)
  4. Set credentials: Ensure FIRE_ACCESS_KEY and FIRE_SECRET_KEY environment variables are set

Example samplesheet with ENA FTP URLs or ENA HTTP links:

sample,assembly_fasta,contaminant_reference,human_reference,phix_reference
ERZ999,ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/assembly/ERZ999.fasta.gz,,,
ERZ998,https://ftp.ebi.ac.uk/pub/databases/ena/wgs_set/ERZ998/ERZ998.fasta.gz,,,
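
The credential and samplesheet requirements listed above can be verified before a run is submitted. This is a hedged sketch rather than pipeline code: `fire_preflight_problems` is a hypothetical name, and the source does not say whether local paths are accepted when FIRE download is enabled, so the check only reports them.

```python
import csv
import io
import os
from urllib.parse import urlparse

# Environment variable names taken from the steps above.
REQUIRED_ENV_VARS = ("FIRE_ACCESS_KEY", "FIRE_SECRET_KEY")
# ENA FTP URLs and ENA HTTP links are the documented inputs.
REMOTE_SCHEMES = {"ftp", "http", "https"}


def fire_preflight_problems(samplesheet_text, environ=None):
    """Return human-readable problems found before a FIRE-enabled run."""
    environ = os.environ if environ is None else environ
    problems = []
    for name in REQUIRED_ENV_VARS:
        if not environ.get(name):
            problems.append(f"environment variable {name} is not set")
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        # A local path has an empty scheme and is reported here.
        scheme = urlparse(row["assembly_fasta"]).scheme
        if scheme not in REMOTE_SCHEMES:
            problems.append(f"{row['sample']}: assembly_fasta is not an FTP/HTTP URL")
    return problems
```

An empty list means both credentials are set and every assembly_fasta entry is a remote URL. The sketch does not attempt the translation to FIRE S3 URLs, which the pipeline handles itself.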

Execution

You can run the current version of the pipeline with:

nextflow run ebi-metagenomics/assembly-analysis-pipeline \
    -r main \
    --input /path/to/samplesheet.csv \
    --outdir /path/to/outputdir
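
When runs are launched from a script, the same command can be assembled programmatically. The following is a sketch under stated assumptions: `build_nextflow_command` is a hypothetical helper, and `extra_params` is expected to hold only pipeline options named in this README (for example `reference_genomes_folder` or `use_fire_download`).

```python
import shlex


def build_nextflow_command(samplesheet, outdir, revision="main", extra_params=None):
    """Build the argument list for the `nextflow run` command shown above.

    Entries in `extra_params` become `--name value`; a value of True
    produces a bare `--name` flag.
    """
    command = [
        "nextflow", "run", "ebi-metagenomics/assembly-analysis-pipeline",
        "-r", revision,
        "--input", samplesheet,
        "--outdir", outdir,
    ]
    for name, value in (extra_params or {}).items():
        command.append(f"--{name}")
        if value is not True:
            command.append(str(value))
    return command


def render_command(command):
    """Return a shell-safe, single-line rendering of the command."""
    return shlex.join(command)
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues for paths that contain spaces.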

This pipeline supports nf-core shared configuration files.

For a more detailed description of how to use the pipeline, see the usage file.

Outputs

For a more detailed description of the different output files, see the outputs file.

Citations

Richardson L, Allen B, Baldi G, Beracochea M, Bileschi ML, Burdett T, et al. MGnify: the microbiome sequence data analysis resource in 2023 [Internet]. Vol. 51, Nucleic Acids Research. Oxford University Press (OUP); 2022. p. D753–9. Available from: http://dx.doi.org/10.1093/nar/gkac1080

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.