This repository contains the MGnify assembly analysis pipeline, from version 6.0.0 onwards. For version 5.0 of the pipeline, please follow this link.
The MGnify assembly analysis pipeline, version 6.0.0 and onwards, provides the following key features:
- Assembly Quality Control: The pipeline performs quality control on the assembled contigs and includes optional decontamination functionality to remove human, PhiX, and custom contaminant sequences.
- CDS Prediction: The pipeline utilizes the MGnify Combined Gene Caller to predict coding sequences (CDS) within the assembled contigs.
- Taxonomic Assignment: The pipeline assigns taxonomic classifications to the assembled contigs using the Contig Annotation Tool (CAT).
- Functional Annotation:
  - InterProScan: Identifies protein domains, families, and functional sites.
  - eggNOG Mapper: Assigns Clusters of Orthologous Groups (COG) annotations and eggNOG functional descriptions.
  - GO Slims: Maps the protein sequences to Gene Ontology (GO) Slim terms.
  - run_dbCAN: Annotates carbohydrate-active enzymes.
  - KEGG Orthologs: Assigns KEGG Orthologs (KO) identifiers using HMMER.
  - RHEA: Assigns RHEA IDs to proteins.
- Biosynthetic Gene Cluster Annotation: The pipeline uses antiSMASH and SanntiS to identify and annotate biosynthetic gene clusters associated with secondary metabolite production.
- KEGG Modules completeness: The pipeline analyzes the KEGG Orthologs annotations to infer the presence and completeness of KEGG modules.
- Consolidated annotation: The pipeline aggregates all the generated annotations into a single consolidated GFF file.
| Tool | Version | Purpose |
|---|---|---|
| antiSMASH | 8.0.1 | Tool for the identification and annotation of secondary metabolite biosynthesis gene clusters |
| boto3 | 1.35.37 | AWS SDK for Python used to access EBI FIRE S3 storage for assembly file downloads |
| CAT_pack | 6.0 | Taxonomic classification of the contigs in the assembly |
| cmsearchtbloutdeoverlap | 0.09 | Deoverlapping of cmsearch results |
| csvtk | 0.31.0 | A cross-platform, efficient, and practical CSV/TSV toolkit |
| Combined Gene Caller - Merge | 1.2.0 | Combined gene caller merge script used to combine predictions of Pyrodigal and FragGeneScanRS (this tool is part of the mgnify-pipelines-toolkit) |
| Diamond | 2.1.11 | Used to match predicted CDS against the CAT reference database for the taxonomic classification of the contigs |
| DRAM | 13.5 | Summarizes annotations from multiple tools like KEGG, Pfam, and CAZy |
| easel | 0.49 | Extracts FASTA sequences by name from a cmsearch deoverlap result |
| extractcoords | 1.2.0 | Processes output from easel-sfetch to extract SSU and LSU sequences (this tool is part of the mgnify-pipelines-toolkit). |
| FragGeneScanRs | 1.1.0 | CDS calling; this tool specializes in calling fragmented CDS |
| generategaf | 1.2.0 | Script that generates a GO Annotation File (GAF) from an InterProScan result TSV file (this tool is part of the mgnify-pipelines-toolkit). |
| Genome Properties | 2.0 | Uses protein signatures as evidence to determine the presence of each step within a property |
| Infernal - cmscan | 1.1.5 | RNA sequence searching |
| InterProScan | 5.76-107.0 | Functionally characterizes nucleotide or protein sequences by scanning them against the InterPro database. |
| HMMER | 3.4 | Used to annotate CDS with KO |
| Krona | 2.8.1 | Krona chart visualization |
| kegg-pathways-completeness | 1.3.0 | Computes the completeness of each KEGG pathway module based on KEGG orthologue (KO) annotations. |
| MGnify pipelines toolkit | 1.2.0 | Collection of tools and scripts used in MGnify pipelines. |
| minimap2 | 2.29-r1283 | A versatile pairwise aligner for genomic and spliced nucleotide sequences. Used in the assembly decontamination subworkflow |
| MultiQC | 1.29 | Tool to aggregate bioinformatic analysis results. |
| Owltools | 2024-06-12T00:00:00Z | Tool utilized to map GO terms to GO-slims |
| Pyrodigal | 3.6.3 | CDS calling |
| pigz | 2.3.4 | A parallel implementation of gzip for modern multi-processor, multi-core systems |
| QUAST | 5.2.0 | Evaluates genome assemblies; part of the pipeline's QC module. |
| run_dbCAN | 5.1.2 | Annotation tool for the Carbohydrate-Active enZYmes Database (CAZy) |
| SeqKit | 2.8.0 | Used to manipulate FASTA files |
| SanntiS | 0.9.4.1 | Tool used to identify biosynthetic gene clusters |
| tabix | 1.21 | Generic indexer for TAB-delimited genome position files |
| Genome Tools - gff3validator | 1.6.5 | Used to validate the analysis summary GFF file |
| jq | 1.5 | Used to concatenate the chunked antiSMASH json results |
This pipeline uses several reference databases, which are listed in the following table. The databases marked with * are downloaded and post-processed by the Microbiome Informatics reference-databases-preprocessing-pipeline. Our team also stores ready-to-use versions of these databases on EBI's FTP server.
The pipeline includes an optional decontamination step that requires reference genomes (e.g., human, PhiX174, or any user-supplied genome). Frequently used reference genomes are available on our FTP server.
Use the following pipeline options to configure references:
- --reference_genomes_folder: Path to a folder containing all reference genome subfolders.
- --human_reference, --phyx_reference, --contaminant_reference: Names of the subfolders (not paths) for each specific reference.
Each genome should be organized as follows:
<reference_genomes_folder>/
├── <genome_prefix>/
│ └── <genome_prefix>.fna
Important
FASTA files must use the .fna extension.
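The layout above can be checked before launching a run. The following is a minimal sketch and not part of the pipeline: the helper name `missing_reference_fastas` is hypothetical, and it assumes only the documented convention of one subfolder per genome containing `<genome_prefix>.fna`.

```python
from pathlib import Path


def missing_reference_fastas(reference_genomes_folder, genome_prefixes):
    """Return the expected .fna paths that do not exist.

    Hypothetical helper: it mirrors the documented layout
    <reference_genomes_folder>/<genome_prefix>/<genome_prefix>.fna
    and is not part of the pipeline itself.
    """
    root = Path(reference_genomes_folder)
    missing = []
    for prefix in genome_prefixes:
        # Each genome lives in its own subfolder, named after the prefix.
        expected = root / prefix / f"{prefix}.fna"
        if not expected.is_file():
            missing.append(expected)
    return missing
```

An empty list means every requested genome is in place. A non-empty list names the files to add or rename, for example a FASTA saved with a `.fasta` extension instead of `.fna`.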
At the moment, the only prerequisites for running the pipeline are Nextflow and Docker/Singularity, since all the Nextflow processes use pre-built containers.
The pipeline takes metagenomic assembly FASTA files as input. Specify these files in a .csv samplesheet with this format:
sample,assembly_fasta,contaminant_reference,human_reference,phix_reference
ERZ999,/path/to/assembly/ERZ999.fasta.gz,,,
ERZ998,/path/to/assembly/ERZ998.fasta.gz,,,
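For many assemblies, a samplesheet in this format can be generated rather than typed by hand. This is a minimal sketch using only the Python standard library. The function name `write_samplesheet` is hypothetical, and deriving the sample name from the part of the file name before the first dot is an assumption that may not match your naming scheme.

```python
import csv
from pathlib import Path

# Column order taken from the samplesheet format shown above.
COLUMNS = [
    "sample",
    "assembly_fasta",
    "contaminant_reference",
    "human_reference",
    "phix_reference",
]


def write_samplesheet(assembly_paths, output_csv):
    """Write one samplesheet row per assembly, leaving reference columns empty.

    Assumption: the sample name is the file name up to the first dot
    (e.g. ERZ999.fasta.gz -> ERZ999). Adjust if your files differ.
    """
    with open(output_csv, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(COLUMNS)
        for path in sorted(Path(p) for p in assembly_paths):
            sample = path.name.split(".")[0]
            # The three reference columns stay empty, as in the example above.
            writer.writerow([sample, path.as_posix(), "", "", ""])
```

Rows are written in sorted path order so that repeated runs over the same files produce an identical samplesheet.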
Important
This functionality is only enabled on the EBI network (which is only accessible to EBI staff). It makes no functional changes to the annotation; it only affects the assembly download step.
The pipeline includes support for downloading assembly files directly from the EBI FIRE system. This feature is only available when running on the EBI network and is disabled by default (--use_fire_download false).
To use this feature:
- Network requirement: You must be connected to the EBI network (only available for EBI staff)
- Enable FIRE download: Add the --use_fire_download parameter when running the pipeline
- Configure samplesheet: Use ENA FTP URLs or ENA HTTP links in the assembly_fasta column (the script will automatically translate these to FIRE S3 URLs)
- Set credentials: Ensure the FIRE_ACCESS_KEY and FIRE_SECRET_KEY environment variables are set
Example samplesheet with ENA FTP URLs or ENA HTTP links:
sample,assembly_fasta,contaminant_reference,human_reference,phix_reference
ERZ999,ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/assembly/ERZ999.fasta.gz,,,
ERZ998,https://ftp.ebi.ac.uk/pub/databases/ena/wgs_set/ERZ998/ERZ998.fasta.gz,,,
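Because the FIRE download depends on credentials in the environment, it can help to confirm they are set before starting a run. This is a minimal sketch, not the pipeline's own check: the function name `missing_fire_credentials` is hypothetical, and only the two variable names come from the steps above.

```python
import os

# Variable names as listed in the FIRE download steps above.
REQUIRED_VARS = ("FIRE_ACCESS_KEY", "FIRE_SECRET_KEY")


def missing_fire_credentials(environ=None):
    """Return the names of required FIRE variables that are unset or empty.

    Hypothetical helper: pass a dict to test it, or nothing to inspect
    the real process environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_fire_credentials()` with no arguments inspects the current environment. An empty list suggests both variables are present, although it does not verify that the keys are valid.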
You can run the current version of the pipeline with:
nextflow run ebi-metagenomics/assembly-analysis-pipeline \
-r main \
--input /path/to/samplesheet.csv \
--outdir /path/to/outputdir

This pipeline supports nf-core shared configuration files.
For a more detailed description of how to use the pipeline, see the usage file.
For a more detailed description of the different output files, see the outputs file.
Richardson L, Allen B, Baldi G, Beracochea M, Bileschi ML, Burdett T, et al. MGnify: the microbiome sequence data analysis resource in 2023 [Internet]. Vol. 51, Nucleic Acids Research. Oxford University Press (OUP); 2022. p. D753–9. Available from: http://dx.doi.org/10.1093/nar/gkac1080
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
