Note: You must create a custom configuration file and add it with
-profile custom -c tofu.config
to your TOFU-MAaPO call!
Reference databases are mandatory for specific modules:
MetaPhlAn DB: --metaphlan_db
HUMAnN DB: --humann_db
Kraken2 DB: --kraken2_db
Sylph DB: --sylph_db
Salmon DB: --salmon_db
GTDB-Tk DB: --gtdbtk_reference
Set these parameters in the first run of the respective module.
For MetaPhlAn, HUMAnN and GTDB-Tk, the pipeline can download the required database via the following flags:
--updatemetaphlan
Download the MetaPhlAn4 database to the directory set in parameter --metaphlan_db.
--updatehumann
Download the HUMAnN3 database to the directory set in parameter --humann_db. HUMAnN3 requires the MetaPhlAn4 database, too.
--updategtdbtk
Download the GTDB-Tk reference data to the directory set in parameter --gtdbtk_reference.
By default, only the quality control module runs unless additional modules are specified.
Example command:
nextflow run ikmb/tofu-maapo --reads '/path/to/fastqfiles/*_R{1,2}_001.fastq.gz' -profile custom -c tofu.config
Choose one of the following options:
Use the --reads parameter with a glob pattern for your .fastq.gz files as seen above, or
provide a CSV file with the following columns:
id: Sample identifier
read1: Path to the forward reads
read2: Path to the reverse reads (for paired-end data)
For single-end reads, include only id and read1.
Provide SRA accession IDs via the --sra option.
Mandatory: Provide your personal NCBI API key with --apikey.
The pipeline will automatically download the corresponding FASTQ files. Example:
--sra 'SRX1234567' --apikey **YOUR_NCBI_API_KEY**
For multiple IDs, use:
--sra ['ERR908507', 'ERR908506', 'ERR908505'] --apikey **YOUR_NCBI_API_KEY**
Note: The Nextflow API call to NCBI may return extra or missing samples, so verify the downloaded data. Use --exact_matches to allow only exact ID matches (only for run IDs).
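For the CSV input option, the samplesheet can be generated rather than written by hand. The following is a minimal sketch, not part of the pipeline: it assumes paired-end files named like the example glob above (*_R1_001.fastq.gz with a matching *_R2_001.fastq.gz), and the function name make_samplesheet is hypothetical.

```shell
#!/bin/sh
# Sketch: write a samplesheet with the columns id, read1, read2.
# Assumptions (not from the pipeline documentation): forward reads end in
# _R1_001.fastq.gz, mates in _R2_001.fastq.gz, and paths contain no commas.
# Usage: make_samplesheet FASTQ_DIR OUTPUT_CSV
make_samplesheet() {
    fastq_dir=$1
    samplesheet=$2
    echo "id,read1,read2" > "$samplesheet"
    for r1 in "$fastq_dir"/*_R1_001.fastq.gz; do
        [ -e "$r1" ] || continue                 # glob matched nothing
        r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"
        id=$(basename "$r1" _R1_001.fastq.gz)    # sample id = file name prefix
        if [ -e "$r2" ]; then
            echo "$id,$r1,$r2" >> "$samplesheet"
        else
            # Unpaired files are skipped here; single-end input would use
            # only the id and read1 columns.
            echo "warning: no mate found for $r1, skipping" >&2
        fi
    done
}
```

Calling make_samplesheet /path/to/fastqfiles samples.csv writes one row per complete read pair; files without a mate are reported on stderr and left out.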
For analysis, the following modules are available:
--assembly
Run an extended genome assembly workflow with MAGScoT bin refinement.
--humann
Run HUMAnN3, a tool for profiling the abundance of microbial metabolic pathways and other molecular functions.
--metaphlan
Run MetaPhlAn4, a tool for profiling the composition of microbial communities.
--kraken
Run Kraken2, a tool for taxonomic classification.
--bracken
Run Bracken (Bayesian Reestimation of Abundance with KrakEN) after Kraken2. The Kraken2 DB must be Bracken-ready.
--sylph
Run Sylph.
--salmon
Run Salmon. Usage not recommended.
--outdir
Set a custom output directory, default is "results".
-resume
Resume the pipeline, reusing already completed, cached processes.
-profile
Change the configuration of the pipeline. Valid options are medcluster (default), local or custom. You can add a new profile for your compute system by editing the file custom.config in the folder conf or create a new one and add it in the file nextflow.config under 'profiles'.
-work-dir
Set a custom work directory, default is "work".
-r
Use a specific branch or release version of the pipeline.
--publish_rawreads
Publish unprocessed/raw files downloaded from SRA in the output directory.
--getmetadata
When using SRA input, download the matching runinfo metadata.
--cleanreads
Save QC'ed FASTQ files (disabled by default).
--fastp
Perform QC and quality assessment with fastp instead of BBTools and FastQC.
--genome
Set host genome. On the IKMB Medcluster valid options are human, mouse or chimp. In other cases this needs to be pre-configured. See "How to add a host genome to the pipeline?".
--no_qc
Skip the QC module. Only use if your input reads are the output of --cleanreads.
--metaphlan_db
Directory of Metaphlan database. REQUIRED!
--humann_db
Directory of HUMAnN database. REQUIRED!
--assemblymode
Specify assembly mode:
- single (default): Single-sample assembly
- group: Group-based co-assembly (requires input as CSV with a group column)
- all: Cohort-wide co-assembly
We recommend co-assembly with only moderate group sizes (~100 samples) due to hardware restrictions.
--binner
Comma-separated list of binning tools (default: all). Options: concoct,maxbin,semibin,metabat,vamb
--contigsminlength
Minimum contig length (default: 2000).
--semibin_environment
Specify SemiBin2 environment (default: human_gut). See the SemiBin documentation for other options. Choose global if no other environment is appropriate.
--skip_gtdbtk
Skip GTDB-Tk for taxonomic assignment.
--skip_checkm
Skip CheckM bin quality check.
--gtdbtk_reference
Directory of GTDB-Tk reference.
--publish_megahit
Publish assembled MEGAHIT contigs.
--publish_rawbins
Publish the results of all used binning tools in the genome assembly workflow.
--vamb_groupsize
Set subgroup size for VAMB (default: 100). Adjust based on cohort size.
--magscot_min_sharing
Scoring parameter a [default=1]
--magscot_score_a
Scoring parameter a [default=1]
--magscot_score_b
Scoring parameter b [default=0.5]
--magscot_score_c
Scoring parameter c [default=0.5]
--magscot_threshold
Scoring minimum completeness threshold [default=0.5]
--magscot_min_markers
Minimum number of unique markers in bins to be considered as seed for bin merging [default=25]
--magscot_iterations
Number of merging iterations to perform. [default=2]
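For --assemblymode group, the input CSV needs a group column. The sketch below shows one way to add it to an existing samplesheet. It rests on assumptions that are not taken from the pipeline documentation: the group column is appended as the last column, and the groups come from a hypothetical helper file with one "id,group" pair per line and no header. The function name add_group_column is likewise hypothetical.

```shell
#!/bin/sh
# Sketch: append a "group" column to a samplesheet (id,read1,read2).
# Assumptions: GROUPS_CSV is a helper file with lines "id,group" and no
# header; the position of the group column (last) is a guess.
# Usage: add_group_column SAMPLESHEET GROUPS_CSV > grouped_samples.csv
add_group_column() {
    awk -F',' -v OFS=',' '
        NR == FNR     { group[$1] = $2; next }        # first file: id -> group
        FNR == 1      { print $0, "group"; next }     # samplesheet header
        ($1 in group) { print $0, group[$1]; next }   # sample with a known group
        { print "warning: no group for sample " $1 > "/dev/stderr" }
    ' "$2" "$1"
}
```

Samples missing from the helper file are reported on stderr and omitted from the output, so an incomplete mapping is visible before a co-assembly run is started.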
--metaphlan_db
Directory of Metaphlan database. REQUIRED!
--publish_metaphlanbam
Publish the BAM file output of Metaphlan.
--kraken2_db
Directory of used Kraken2 database. Should be Bracken ready for use with Bracken. REQUIRED!
--bracken_length = 100
--bracken_level = "S"
--bracken_threshold = 0
--sylph_db
Set the path to a sylph database.
--sylph_merge
All sylph profiling will be done in one process. Produces a single output for all samples combined.
--sylph_processing
Shortcut for high-throughput data processing with sylph. Skips quality control; no other modules are available in this mode.
Note: The usage of Salmon for metagenomes is experimental.
--salmon_db
Directory of used salmon database. REQUIRED!
--salmon_reference
Path to a tab-separated taxonomy file corresponding to the used salmon database. Not required if the database already contains taxonomic names. The file has two columns and a header line: the first column contains the bin names used in the salmon database, the second the taxonomic assignment by GTDB-Tk in the format "d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli".
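A file in this two-column layout could be derived from a GTDB-Tk summary table. The sketch below is an illustration under assumptions, not the pipeline's own method: it assumes that the first two columns of the GTDB-Tk summary file are the genome name and the classification string, that the genome names match the bin names in the salmon database, and that the header names "bin" and "taxonomy" are acceptable (they are placeholders). The function name make_salmon_reference is hypothetical.

```shell
#!/bin/sh
# Sketch: reduce a GTDB-Tk summary table to a two-column taxonomy file.
# Assumptions: column 1 = genome/bin name, column 2 = GTDB-Tk classification;
# the output header names "bin" and "taxonomy" are placeholders.
# Usage: make_salmon_reference GTDBTK_SUMMARY_TSV > salmon_taxonomy.tsv
make_salmon_reference() {
    awk -F'\t' -v OFS='\t' '
        FNR == 1 { print "bin", "taxonomy"; next }   # replace the original header
        { print $1, $2 }                             # keep bin name and lineage
    ' "$1"
}
```

The resulting file is what --salmon_reference would point to; check that the bin names in the first column are identical to those used when the salmon database was built.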