TOFU-MAaPO Usage Guide

Note: You must create a custom configuration file and pass it to every TOFU-MAaPO call with -profile custom -c tofu.config!

Reference Databases

Reference databases are mandatory for specific modules:
MetaPhlAn DB: --metaphlan_db
HUMAnN DB: --humann_db
Kraken2 DB: --kraken2_db
Sylph DB: --sylph_db
Salmon DB: --salmon_db
GTDB-Tk DB: --gtdbtk_reference
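
A minimal sketch of how these database paths could be kept in the custom configuration file instead of being typed on every call. It relies on standard Nextflow behaviour, where a command-line option such as --metaphlan_db corresponds to an entry of the same name in the params scope. All paths are placeholders, and a real tofu.config will usually contain further settings for your compute system.

```shell
# Write a minimal tofu.config that stores the database locations.
# All paths below are hypothetical placeholders; adjust them to your system.
cat > tofu.config <<'EOF'
params {
    metaphlan_db     = '/path/to/databases/metaphlan'
    humann_db        = '/path/to/databases/humann'
    kraken2_db       = '/path/to/databases/kraken2'
    gtdbtk_reference = '/path/to/databases/gtdbtk'
}
EOF
```

Values given on the command line take precedence over those in the configuration file.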

Initialization options

Set these parameters on the first run of the respective module.

For MetaPhlAn, HUMAnN and GTDB-Tk, the pipeline can download the required database via the following flags:

  • --updatemetaphlan Download the MetaPhlAn4 database to the directory set in parameter --metaphlan_db.
  • --updatehumann Download the HUMAnN3 database to the directory set in parameter --humann_db. HUMAnN3 also requires the MetaPhlAn4 database.
  • --updategtdbtk Download the GTDB-Tk reference data to the directory set in parameter --gtdbtk_reference.
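
As an illustration, a first MetaPhlAn run that also downloads the database might look like the sketch below. The function name run_metaphlan_first_time and both paths are hypothetical; only the pipeline flags are taken from this guide. Wrapping the call in a function simply makes the two variable parts explicit.

```shell
# Sketch: first MetaPhlAn run that also downloads the database.
# $1 = glob pattern (or path) for the reads
# $2 = directory the MetaPhlAn4 database should be downloaded to
run_metaphlan_first_time() {
    nextflow run ikmb/tofu-maapo \
        --reads "$1" \
        --metaphlan \
        --metaphlan_db "$2" \
        --updatemetaphlan \
        -profile custom -c tofu.config
}

# Example call with placeholder paths. The glob is quoted so that
# Nextflow, not the shell, expands it.
# run_metaphlan_first_time '/path/to/fastqfiles/*_R{1,2}_001.fastq.gz' /path/to/databases/metaphlan
```

On later runs the --updatemetaphlan flag can be dropped, since the database is already in place.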

Basic Execution

By default, only quality control is run. All other modules must be enabled explicitly.

Example command:

nextflow run ikmb/tofu-maapo --reads '/path/to/fastqfiles/*_R{1,2}_001.fastq.gz' -profile custom -c tofu.config

Input Options

Choose one of the following options:

1. FASTQ Files

Use the --reads parameter with a glob pattern for your .fastq.gz files as seen above or provide a CSV file with the following columns:

  • id: Sample identifier
  • read1: Path to the forward reads
  • read2: Path to the reverse reads (for paired-end data)

For single-end reads, include only id and read1.
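
A small sketch of such a sample sheet for paired-end data. It assumes comma separators and a header row with the column names listed above; sample names and paths are placeholders. How the file is handed to the pipeline (presumably through --reads) is an assumption to check against the pipeline's own help output.

```shell
# Write a paired-end sample sheet; sample names and paths are placeholders.
cat > samples.csv <<'EOF'
id,read1,read2
sampleA,/path/to/fastqfiles/sampleA_R1_001.fastq.gz,/path/to/fastqfiles/sampleA_R2_001.fastq.gz
sampleB,/path/to/fastqfiles/sampleB_R1_001.fastq.gz,/path/to/fastqfiles/sampleB_R2_001.fastq.gz
EOF

# Sanity check: every row must have as many columns as the header.
awk -F',' 'NR == 1 { n = NF } NF != n { bad = 1 } END { exit bad + 0 }' samples.csv \
    && echo "samples.csv: column counts are consistent"
```

For single-end data, the same file would contain only the id and read1 columns.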

2. SRA Accessions

Provide SRA Accession IDs via the --sra option.
Mandatory: Provide your personal NCBI API key with --apikey.
The pipeline will automatically download the corresponding FASTQ files. Example:

--sra 'SRX1234567' --apikey **YOUR_NCBI_API_KEY**

For multiple IDs, use:

--sra ['ERR908507', 'ERR908506', 'ERR908505'] --apikey **YOUR_NCBI_API_KEY**

Note: The Nextflow API call to NCBI may return extra samples or miss some. Always verify the downloaded data. Use --exact_matches to allow only exact ID matches (works only for run IDs).

Available modules

The following analysis modules are available:

Genome assembly

--assembly Run an extended genome assembly workflow with MAGScoT bin refinement.

Assembly-free metabolic gene abundance estimation

--humann Run HUMAnN3, a tool for profiling the abundance of microbial metabolic pathways and other molecular functions.

Taxonomical abundance tools

  • --metaphlan Run MetaPhlAn4, a tool for profiling the composition of microbial communities.
  • --kraken Run Kraken2, a tool for taxonomic classification.
  • --bracken Run Bracken (Bayesian Reestimation of Abundance with KrakEN) after Kraken2. The Kraken2 DB must be Bracken-ready.
  • --sylph Run Sylph.
  • --salmon Run Salmon. Usage not recommended.

General options

--outdir Set a custom output directory, default is "results".
-resume Resume the pipeline, reusing the cached results of processes that have already completed.
-profile Change the configuration of the pipeline. Valid options are medcluster (default), local or custom. You can add a new profile for your compute system by editing the file custom.config in the folder conf or create a new one and add it in the file nextflow.config under 'profiles'.
-work-dir Set a custom work directory, default is "work".
-r Use a specific branch or release version of the pipeline.
--publish_rawreads Publish unprocessed/raw files downloaded from SRA in the output directory.
--getmetadata When using SRA input, also download the matching runinfo metadata.

Module specific options

QC options

  • --cleanreads Save QC'ed FASTQ files (disabled by default).
  • --fastp Perform QC and quality assessment with fastp instead of BBTools and FastQC.
  • --genome Set the host genome. On the IKMB Medcluster, valid options are human, mouse or chimp. On other systems the host genome needs to be pre-configured (see "How to add a host genome to the pipeline?").
  • --no_qc Skip the QC module. Only use this if your input reads are the output of --cleanreads.

HUMAnN options

  • --metaphlan_db Directory of the MetaPhlAn database. REQUIRED!
  • --humann_db Directory of the HUMAnN database. REQUIRED!

Assembly options

  • --assemblymode Specify assembly mode
    • single (default) Single-sample assembly
    • group Group-based co-assembly (requires input as CSV with group column).
    • all Cohort-wide co-assembly

We recommend co-assembly with only moderate group sizes (~100 samples) due to hardware restrictions.

  • --binner Comma-separated list of binning tools (default: all). Options: concoct,maxbin,semibin,metabat,vamb
  • --contigsminlength Minimum contig length (default: 2000).
  • --semibin_environment Specify SemiBin2 environment (default: human_gut). See the SemiBin Documentation for other options. Choose global if no other environment is appropriate.
  • --skip_gtdbtk Skip GTDB-Tk taxonomic assignment.
  • --skip_checkm Skip the CheckM bin quality check.
  • --gtdbtk_reference Directory of the GTDB-Tk reference data.
  • --publish_megahit Publish the contigs assembled by MEGAHIT.
  • --publish_rawbins Publish the results of all used binning tools in the genome assembly workflow.
  • --vamb_groupsize Set subgroup size for VAMB (default: 100). Adjust based on cohort size.
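
For the group assembly mode described above, the input CSV needs a group column. The sketch below assumes that this column is simply appended to the id, read1 and read2 columns and is named group; sample names, group labels and paths are placeholders. Because co-assembly is only recommended for moderate group sizes, it also counts the samples per group.

```shell
# Write a sample sheet with an additional group column (placeholder values).
cat > samples_grouped.csv <<'EOF'
id,read1,read2,group
sampleA,/path/to/sampleA_R1_001.fastq.gz,/path/to/sampleA_R2_001.fastq.gz,cohort1
sampleB,/path/to/sampleB_R1_001.fastq.gz,/path/to/sampleB_R2_001.fastq.gz,cohort1
sampleC,/path/to/sampleC_R1_001.fastq.gz,/path/to/sampleC_R2_001.fastq.gz,cohort2
EOF

# Count samples per group (column 4), skipping the header line.
# The output is sorted because awk's array iteration order is undefined.
awk -F',' 'NR > 1 { count[$4]++ } END { for (g in count) print g, count[g] }' \
    samples_grouped.csv | sort
```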

MAGScoT options

  • --magscot_min_sharing Minimum share of markers that bin sets must have in common to be considered for merging [default=1]
  • --magscot_score_a Scoring parameter a [default=1]
  • --magscot_score_b Scoring parameter b [default=0.5]
  • --magscot_score_c Scoring parameter c [default=0.5]
  • --magscot_threshold Scoring minimum completeness threshold [default=0.5]
  • --magscot_min_markers Minimum number of unique markers in bins to be considered as seed for bin merging [default=25]
  • --magscot_iterations Number of merging iterations to perform. [default=2]

MetaPhlAn options

  • --metaphlan_db Directory of the MetaPhlAn database. REQUIRED!
  • --publish_metaphlanbam Publish the BAM file output of MetaPhlAn.

Kraken2 options

  • --kraken2_db Directory of the Kraken2 database. It must be Bracken-ready if Bracken is used. REQUIRED!

Bracken options and their defaults

  • --bracken_length = 100
  • --bracken_level = "S"
  • --bracken_threshold = 0
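
A sketch of a combined Kraken2 and Bracken run with the Bracken defaults written out explicitly. The function name run_kraken_bracken and the paths are hypothetical; the flags and default values are the ones listed in this guide.

```shell
# Sketch: Kraken2 classification followed by Bracken re-estimation.
# $1 = glob pattern (or path) for the reads
# $2 = directory of a Bracken-ready Kraken2 database
run_kraken_bracken() {
    nextflow run ikmb/tofu-maapo \
        --reads "$1" \
        --kraken --bracken \
        --kraken2_db "$2" \
        --bracken_length 100 \
        --bracken_level S \
        --bracken_threshold 0 \
        -profile custom -c tofu.config
}

# Example call with placeholder paths:
# run_kraken_bracken '/path/to/fastqfiles/*_R{1,2}_001.fastq.gz' /path/to/databases/kraken2
```

Since the three Bracken values shown are the defaults, they only need to be given when they should differ.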

Sylph options

  • --sylph_db Set the path to a sylph database.
  • --sylph_merge Run all sylph profiling in one process, producing a single combined output for all samples.
  • --sylph_processing Shortcut for high-throughput data processing with sylph. Quality control is skipped and no other modules are available in this mode.

Salmon options

Note: The usage of Salmon for metagenomes is experimental.

  • --salmon_db Directory of the Salmon database. REQUIRED!
  • --salmon_reference Path to a tab-separated taxonomy file corresponding to the Salmon database. Not required if the database already contains taxonomic names. The file has two columns and a header line: the first column contains the bin names used in the Salmon database, the second the taxonomic assignment by GTDB-Tk in the format "d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli".
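
A minimal sketch of the taxonomy file expected by --salmon_reference. The header names bin and taxonomy and the bin name bin_001 are hypothetical; this guide only specifies a header line followed by two tab-separated columns.

```shell
# Write a two-column, tab-separated taxonomy file with a header line.
# Header names and the bin name are hypothetical placeholders.
{
    printf 'bin\ttaxonomy\n'
    printf '%s\t%s\n' \
        'bin_001' \
        'd__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli'
} > salmon_taxonomy.tsv

# Verify that every line has exactly two tab-separated fields.
awk -F'\t' 'NF != 2 { bad = 1 } END { exit bad + 0 }' salmon_taxonomy.tsv \
    && echo "salmon_taxonomy.tsv: two columns on every line"
```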