Merged
41 commits
a87ff76
Added validation
FernandoDuarteF Oct 28, 2025
465c1d8
Added test for each subworkflow
FernandoDuarteF Oct 28, 2025
ca095c7
Added coverM module
FernandoDuarteF Oct 28, 2025
c7ae940
Added assembly submit workflow
FernandoDuarteF Oct 29, 2025
d857e8a
Added params.input to .nf-core.yml
FernandoDuarteF Oct 29, 2025
8184c72
Fixed linting
FernandoDuarteF Oct 29, 2025
5d8b240
Updated assembly submit workflow
FernandoDuarteF Oct 29, 2025
8058ebd
Fixed nextflow.config
FernandoDuarteF Oct 29, 2025
2569abe
Updated assembly submit workflow
FernandoDuarteF Oct 29, 2025
ce8b84e
add draft assembly workflow (not yet tested)
ochkalova Oct 30, 2025
87c30dc
update schema for assembly input validation
ochkalova Oct 30, 2025
f4c9757
update assembly_uploader version in GENERATE_ASSEMBLY_MANIFEST process
ochkalova Oct 30, 2025
72ffa5d
update assembly_uploader version in REGISTERSTUDY process
ochkalova Oct 30, 2025
03f797d
ASSEMBLYSUBMIT workflow fixes after tests
ochkalova Oct 30, 2025
3e49ae3
add ENA_WEBIN and ENA_WEBIN_PASSWORD to env in nextflow.config
ochkalova Oct 30, 2025
eaa3d7b
input schema fixes after tests
ochkalova Oct 30, 2025
8fd8509
append --test flag if params.test_upload = true
ochkalova Oct 30, 2025
4964c4d
linter fixes in REGISTERSTUDY
ochkalova Oct 30, 2025
1321d69
update assembly_uploader to the latest version to fix long alias issue
ochkalova Oct 31, 2025
92de4fb
cleanup, add more TODOs
ochkalova Mar 5, 2026
11b7ae6
use "metagenomic_assemblies" instead of "assemblies" for clarity
ochkalova Mar 5, 2026
ba49cb8
update column name in the samplesheet
ochkalova Mar 5, 2026
26c6296
clean up test configs a little
ochkalova Mar 5, 2026
d068775
linter fixes
ochkalova Mar 6, 2026
1c895ed
update nextflow schema with new parameters
ochkalova Mar 6, 2026
2e97ea5
add TODO to organise tests
ochkalova Mar 6, 2026
b855e46
rename ena_genome_study_accession to submission_study, only
ochkalova Mar 6, 2026
ce6e5be
refactor: only register new study in assembly workflow if submission_…
ochkalova Mar 6, 2026
ac58fff
refactor: update Channel references to use lowercase 'channel'
ochkalova Mar 6, 2026
43ef432
fix: include TPA parameter in assembly manifest generation
ochkalova Mar 6, 2026
4522b71
fix: add 'input' to required fields and update descriptions in nextfl…
ochkalova Mar 6, 2026
c19953d
fix: set upload_tpa to false and mark for removal in genome_uploader …
ochkalova Mar 6, 2026
710d00f
fix: update FASTA file pattern to enforce gzipped format in assembly …
ochkalova Mar 6, 2026
38ab8e7
docs: enhance README with detailed submission modes and add workflow …
ochkalova Mar 6, 2026
a11a9d7
fix: add '--outdir' option to README for output directory specification
ochkalova Mar 6, 2026
5803b9a
fix: rename test_assembly.conf to test.conf to satisfy nf-core linter
ochkalova Mar 6, 2026
39d7414
fix: patch fastavalidator module
ochkalova Mar 6, 2026
6e31241
fix: bugfixes revealed by testing
ochkalova Mar 8, 2026
27b75b2
add more test cases for assemblysubmit
ochkalova Mar 8, 2026
38e68d6
to be fixed: create empty test.config to satisfy linter
ochkalova Mar 8, 2026
bd9a67f
fix: update README for ENA submission examples
ochkalova Mar 9, 2026
2 changes: 2 additions & 0 deletions .nf-core.yml
@@ -2,6 +2,8 @@ lint:
  files_exist:
    - conf/igenomes.config
    - conf/igenomes_ignored.config
  nextflow_config:
    - params.input
  files_unchanged:
    - .github/PULL_REQUEST_TEMPLATE.md
nf_core_version: 3.5.1
180 changes: 140 additions & 40 deletions README.md
@@ -21,85 +21,185 @@

## Introduction

**nf-core/seqsubmit** is a bioinformatics pipeline that submits data to public archives such as [ENA](https://www.ebi.ac.uk/ena/browser/home)
**nf-core/seqsubmit** is a Nextflow pipeline for submitting sequence data to [ENA](https://www.ebi.ac.uk/ena/browser/home).
Currently, the pipeline supports three submission modes. Each mode is routed to one of two workflows, and each workflow requires its own input samplesheet structure:

Pipeline will have several modes
- `mags` for Metagenome Assembled Genomes (MAGs) submission with the `GENOMESUBMIT` workflow
- `bins` for bin submission with the `GENOMESUBMIT` workflow
- `metagenomic_assemblies` for assembly submission with the `ASSEMBLYSUBMIT` workflow

- `mags` for MAGs submission with **genome_submitter** wf
- `bins` for bins submission with **genome_submitter** wf
- `assemblies` for assembly submission with **assembly_submitter** wf
![seqsubmit workflow diagram](assets/seqsubmit_schema.png)

## Requirements

- Webin account registered https://www.ebi.ac.uk/ena/submit/webin/login
- Raw reads submitted into [INSDC](https://www.insdc.org/)
- [Nextflow](https://www.nextflow.io/) `>=25.04.0`
- Webin account registered at https://www.ebi.ac.uk/ena/submit/webin/login
- Raw reads used to assemble the contigs already submitted to [INSDC](https://www.insdc.org/), with the associated accessions available

Set up your environment secrets before running the pipeline:

`nextflow secrets set WEBIN_ACCOUNT "Webin-XXX"`

`nextflow secrets set WEBIN_PASSWORD "XXX"`

Make sure you update with your authorised credentials.
Make sure you update the commands above with your authorised credentials.

## genome_submitter
## Input samplesheets

Workflow to submit MAGs and/or bins to ENA.
### `mags` and `bins` modes (`GENOMESUBMIT`)

It takes input `samplesheet.csv` with fields required for [genome_uploader](https://github.com/EBI-Metagenomics/genome_uploader). Fields described in [docs](https://github.com/EBI-Metagenomics/genome_uploader/blob/main/README.md#input-tsv-and-fields).
For now workflow converts CSV into required TSV.
The input must follow `assets/schema_input_genome.json`.

_Future implementation will consider missing fields (for example completeness and contamination) and would run steps to fill in the gaps._
Required columns:

<!-- TODO nf-core:
Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
major pipeline sections and the types of output it produces. You're giving an overview to someone new
to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
-->
- `sample`
- `fasta` (must end with `.fa.gz` or `.fasta.gz`)
- `accession`
- `assembly_software`
- `binning_software`
- `binning_parameters`
- `stats_generation_software`
- `metagenome`
- `environmental_medium`
- `broad_environment`
- `local_environment`
- `co-assembly`

<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
workflows use the "tube map" design for that. See https://nf-co.re/docs/guidelines/graphic_design/workflow_diagrams#examples for examples. -->
<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
Columns that are required for now but will become optional in the near future:

- `completeness`
- `contamination`
- `genome_coverage`
- `rRNA_presence`
- `NCBI_lineage`

These fields are metadata required by the [genome_uploader](https://github.com/EBI-Metagenomics/genome_uploader) package. They are described in its [documentation](https://github.com/EBI-Metagenomics/genome_uploader/blob/main/README.md#input-tsv-and-fields).

Example `samplesheet_genome.csv`:

```csv
sample,fasta,accession,assembly_software,binning_software,binning_parameters,stats_generation_software,completeness,contamination,genome_coverage,metagenome,co-assembly,broad_environment,local_environment,environmental_medium,rRNA_presence,NCBI_lineage
lachnospira_eligens,data/bin_lachnospira_eligens.fa.gz,SRR24458089,spades_v3.15.5,metabat2_v2.6,default,CheckM2_v1.0.1,61.0,0.21,32.07,sediment metagenome,false,marine,cable_bacteria,marine_sediment,false,d__Bacteria;p__Proteobacteria;s_unclassified_Proteobacteria
```
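
A quick pre-flight check of a genome samplesheet can catch missing columns or a wrong FASTA extension before the pipeline's own schema validation runs. The sketch below is illustrative only and is not part of the pipeline: the helper name `check_genome_samplesheet` is hypothetical, the column list is copied from the lists above, and the FASTA regex is borrowed from `assets/schema_input_assembly.json` on the assumption that the genome schema enforces the same extensions.

```python
import csv
import io
import re

# Columns listed above as required (including those expected to become optional later).
REQUIRED_COLUMNS = [
    "sample", "fasta", "accession", "assembly_software", "binning_software",
    "binning_parameters", "stats_generation_software", "metagenome",
    "environmental_medium", "broad_environment", "local_environment", "co-assembly",
    "completeness", "contamination", "genome_coverage", "rRNA_presence", "NCBI_lineage",
]

# Pattern copied from assets/schema_input_assembly.json.
# Assumption: the genome schema applies the same extension rule.
FASTA_PATTERN = re.compile(r"^([\S\s]*\/)?[^\s\/]+\.f(ast)?a\.gz$")


def check_genome_samplesheet(text):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in header]
    if problems:
        return problems  # row checks are meaningless without the full header
    for line_no, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS:
            if not (row[column] or "").strip():
                problems.append(f"line {line_no}: empty value for {column}")
        if not FASTA_PATTERN.match(row["fasta"] or ""):
            problems.append(f"line {line_no}: fasta must end with .fa.gz or .fasta.gz")
    return problems
```

An empty list means the sheet passed these basic checks. The authoritative validation remains `assets/schema_input_genome.json`.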

### `metagenomic_assemblies` mode (`ASSEMBLYSUBMIT`)

The input must follow `assets/schema_input_assembly.json`.

Required columns:

- `sample`
- `fasta` (must end with `.fa.gz` or `.fasta.gz`)
- `run_accession`
- `assembler`
- `assembler_version`

At least one of the following must be provided per row:

- reads (`fastq_1`, optional `fastq_2` for paired-end)
- `coverage`

If `coverage` is missing and reads are provided, the workflow calculates average coverage with `coverm`.

Example `samplesheet_assembly.csv`:

```csv
sample,fasta,fastq_1,fastq_2,coverage,run_accession,assembler,assembler_version
assembly_1,data/contigs_1.fasta.gz,data/reads_1.fastq.gz,data/reads_2.fastq.gz,,ERR011322,SPAdes,3.15.5
assembly_2,data/contigs_2.fasta.gz,,,42.7,ERR011323,MEGAHIT,1.2.9
```
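
The "reads or coverage" rule above can be sketched as a small classifier over samplesheet rows. This is a hedged illustration, not the workflow's actual code: `coverage_plan` is a hypothetical helper, and treating a row that supplies both reads and coverage as "provided" is an assumption, since the text only describes the case where `coverage` is missing.

```python
import csv
import io


def coverage_plan(text):
    """Map each sample to 'provided', 'compute', or 'invalid' per the rule above."""
    plan = {}
    for row in csv.DictReader(io.StringIO(text)):
        has_reads = bool((row.get("fastq_1") or "").strip())
        has_coverage = bool((row.get("coverage") or "").strip())
        if has_coverage:
            # Assumption: a supplied coverage value takes precedence over reads.
            plan[row["sample"]] = "provided"
        elif has_reads:
            # Coverage is missing but reads exist: it would be calculated from the reads.
            plan[row["sample"]] = "compute"
        else:
            # Neither reads nor coverage: the row fails the schema's anyOf rule.
            plan[row["sample"]] = "invalid"
    return plan
```

Applied to the example samplesheet above, `assembly_1` would fall into the "compute" branch and `assembly_2` into the "provided" branch.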

## Usage

> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.

<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
Explain what rows and columns represent. For instance (please edit as appropriate):
First, prepare a samplesheet with your input data that looks as follows:
`samplesheet.csv`:
```csv
sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
### Required parameters:

| Parameter | Description |
| -------------------- | --------------------------------------------------------------------------------- |
| `--mode` | Type of the data to be submitted. Options: `[mags, bins, metagenomic_assemblies]` |
| `--input` | Path to the samplesheet describing the data to be submitted |
| `--outdir` | Path to the output directory for pipeline results |
| `--submission_study` | ENA study accession (PRJ/ERP) to submit the data to |
| `--centre_name` | Name of the submitter's organisation |

### Optional parameters:

| Parameter | Description |
| ------------------- | ---------------------------------------------------------------------------------------- |
| `--upload_tpa` | Flag to control the type of assembly study (third party assembly or not). Default: false |
| `--test_upload` | Upload to TEST ENA server instead of LIVE. Default: false |
| `--webincli_submit` | If set to false, submissions will be validated, but not submitted. Default: true |

General command template:

```bash
nextflow run nf-core/seqsubmit \
-profile <docker/singularity/...> \
--mode <mags|bins|metagenomic_assemblies> \
--input <samplesheet.csv> \
--centre_name <your_centre> \
--submission_study <your_study> \
--outdir <outdir>
```
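
When submissions are scripted, the command template above can be assembled programmatically. The following is a minimal sketch under stated assumptions: `build_command` is a hypothetical wrapper, not something shipped with the pipeline, and its defaults simply mirror the documented defaults of `--test_upload` (false) and `--webincli_submit` (true).

```python
import shlex


def build_command(mode, samplesheet, study, centre_name, outdir,
                  profile="docker", test_upload=False, webincli_submit=True):
    """Assemble a `nextflow run` argument list from the documented parameters."""
    allowed_modes = {"mags", "bins", "metagenomic_assemblies"}
    if mode not in allowed_modes:
        raise ValueError(f"--mode must be one of {sorted(allowed_modes)}, got {mode!r}")
    return [
        "nextflow", "run", "nf-core/seqsubmit",
        "-profile", profile,
        "--mode", mode,
        "--input", samplesheet,
        "--submission_study", study,
        "--centre_name", centre_name,
        # Booleans are rendered as lowercase true/false, as in the examples in this README.
        "--test_upload", str(test_upload).lower(),
        "--webincli_submit", str(webincli_submit).lower(),
        "--outdir", outdir,
    ]


# Example: a run against the ENA TEST server in `mags` mode.
cmd = build_command("mags", "samplesheet.csv", "<your_study>", "TEST_CENTER",
                    "results/validate_mags", test_upload=True)
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.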

Validation run (submission to the ENA TEST server) in `mags` mode:

```bash
nextflow run nf-core/seqsubmit \
-profile docker \
--mode mags \
--input assets/samplesheet_genomes.csv \
--submission_study <your_study> \
--centre_name TEST_CENTER \
--webincli_submit true \
--test_upload true \
--outdir results/validate_mags
```
Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
-->

Now, you can run the pipeline using:
Validation run (submission to the ENA TEST server) in `metagenomic_assemblies` mode:

<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
```bash
nextflow run nf-core/seqsubmit \
-profile docker \
--mode metagenomic_assemblies \
--input assets/samplesheet_assembly.csv \
--submission_study <your_study> \
--centre_name TEST_CENTER \
--webincli_submit true \
--test_upload true \
--outdir results/validate_assemblies
```

Live submission example:

```bash
nextflow run nf-core/seqsubmit \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--outdir <OUTDIR>
-profile docker \
--mode metagenomic_assemblies \
--input assets/samplesheet_assembly.csv \
--submission_study PRJEB98843 \
--centre_name <your_centre> \
--test_upload false \
--webincli_submit true \
--outdir results/live_assembly
```

> [!WARNING]
> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_; see [docs](https://nf-co.re/docs/usage/getting_started/configuration#custom-configuration-files).

For more details and further functionality, please refer to the [usage documentation](https://nf-co.re/seqsubmit/usage) and the [parameter documentation](https://nf-co.re/seqsubmit/parameters).

<!-- TODO nf-core:
## Pipeline output

To see the results of an example test run with a full size dataset refer to the [results](https://nf-co.re/seqsubmit/results) tab on the nf-core website pipeline page.
For more details about the output files and reports, please refer to the
[output documentation](https://nf-co.re/seqsubmit/output).
-->
Key output locations in `--outdir` (`<mode>` is the value passed to `--mode`):

- `<mode>/upload/manifests/`: generated manifest files for submission
- `<mode>/upload/webin_cli/`: ENA Webin CLI reports
- `<mode>/multiqc/`: MultiQC summary report
- `pipeline_info/`: execution reports, trace, DAG, and software versions

For full details, see the [output documentation](https://nf-co.re/seqsubmit/output).
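
The `publishDir` settings in `conf/modules.config` in this change place the `GENOME_UPLOAD`, `ENA_WEBIN_CLI` and `MULTIQC` outputs under a subdirectory named after `params.mode`. A small sketch of how a downstream script might locate them, assuming those config values; `output_dirs` is a hypothetical helper and not part of the pipeline:

```python
from pathlib import PurePosixPath


def output_dirs(outdir, mode):
    """Return the publish directories configured for the three processes above."""
    base = PurePosixPath(outdir) / mode  # mirrors "${params.outdir}/${params.mode}"
    return {
        "manifests": str(base / "upload" / "manifests"),
        "webin_cli": str(base / "upload" / "webin_cli"),
        "multiqc": str(base / "multiqc"),
    }
```

For instance, `output_dirs("results", "mags")["webin_cli"]` resolves to `results/mags/upload/webin_cli`.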

## Credits

4 changes: 4 additions & 0 deletions assets/samplesheet_assembly.csv
@@ -0,0 +1,4 @@
sample,fasta,fastq_1,fastq_2,coverage,run_accession,assembler,assembler_version
sample1,tests/data/contigs.fasta.gz,tests/data/fastq_1.fastq,tests/data/fastq_2.fastq,,ERR000001,SPAdes,3.15
sample2,tests/data/invalid_assembly.fasta.gz,,,45,ERR000002,Velvet,1.2.10
sample3,tests/data/contigs.fasta.gz,,,30,ERR000003,MEGAHIT,1.2.9
File renamed without changes.
114 changes: 114 additions & 0 deletions assets/schema_input_assembly.json
@@ -0,0 +1,114 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://raw.githubusercontent.com/nf-core/seqsubmit/main/assets/schema_input_assembly.json",
"title": "nf-core/seqsubmit pipeline - params.input schema",
"description": "Schema for the sample sheet provided with params.input if params.mode is set to 'metagenomic_assemblies'",
"type": "array",
"items": {
"type": "object",
"properties": {
"sample": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Sample must be provided and cannot contain spaces",
"meta": ["id"]
},
"fasta": {
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^([\\S\\s]*\\/)?[^\\s\\/]+\\.f(ast)?a\\.gz$",
"errorMessage": "FASTA file must be provided and have extension '.fa.gz' or '.fasta.gz'",
"description": "Metagenomic assembly FASTA file"
},
"fastq_1": {
"anyOf": [
{
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^\\S+\\.(fq|fastq)(\\.gz)?$"
},
{
"type": "string",
"maxLength": 0
}
],
"errorMessage": "FASTQ file must have extension '.fq' or '.fastq' (optionally gzipped)",
"description": "Forward reads if paired-end or single-end reads FASTQ file"
},
"fastq_2": {
"anyOf": [
{
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^\\S+\\.(fq|fastq)(\\.gz)?$"
},
{
"type": "string",
"maxLength": 0
}
],
"errorMessage": "FASTQ file for reverse reads must have extension '.fq' or '.fastq' (optionally gzipped)",
"description": "Reverse reads FASTQ file if paired-end. Leave empty for single-end reads"
},
"coverage": {
"anyOf": [
{
"type": "number",
"minimum": 0
},
{
"type": "string",
"maxLength": 0
}
],
"errorMessage": "Coverage must be a non-negative number or empty",
"description": "Estimated value of assembly coverage"
},
"run_accession": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Accession must be provided and cannot contain spaces",
"description": "Accession of the run used to generate the assembly"
},
"assembler": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Assembler must be provided and cannot contain spaces",
"description": "Name of the assembler software used to generate the assembly"
},
"assembler_version": {
"anyOf": [{ "type": "string" }, { "type": "number" }],
"pattern": "^\\S+$",
"errorMessage": "Assembler version must be provided and cannot contain spaces",
"description": "Version of the assembler software used to generate the assembly"
}
},
"required": ["sample", "fasta", "run_accession", "assembler", "assembler_version"],
"anyOf": [
{
"properties": {
"fastq_1": {
"type": "string",
"minLength": 1
}
},
"required": ["fastq_1"]
},
{
"properties": {
"coverage": {
"type": "number",
"minimum": 0
}
},
"required": ["coverage"]
}
],
"errorMessage": {
"anyOf": "Either reads or coverage must be provided in the sample sheet for each assembly"
}
}
}
4 changes: 2 additions & 2 deletions assets/schema_input.json → assets/schema_input_genome.json
@@ -1,8 +1,8 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://raw.githubusercontent.com/nf-core/seqsubmit/main/assets/schema_input.json",
"$id": "https://raw.githubusercontent.com/nf-core/seqsubmit/main/assets/schema_input_genome.json",
"title": "nf-core/seqsubmit pipeline - params.input schema",
"description": "Schema for the file provided with params.input",
"description": "Schema for the file provided with params.input if params.mode is set to 'mags' or 'bins'",
"type": "array",
"items": {
"type": "object",
Binary file added assets/seqsubmit_schema.png
9 changes: 6 additions & 3 deletions conf/modules.config
@@ -20,15 +20,15 @@ process {

withName: 'GENOME_UPLOAD' {
publishDir = [
path: { "${params.outdir}/upload/manifests" },
path: { "${params.outdir}/${params.mode}/upload/manifests" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: 'ENA_WEBIN_CLI' {
publishDir = [
path: { "${params.outdir}/upload/webin_cli" },
path: { "${params.outdir}/${params.mode}/upload/webin_cli" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
@@ -37,10 +37,13 @@ process {
withName: 'MULTIQC' {
ext.args = { params.multiqc_title ? "--title \"$params.multiqc_title\"" : '' }
publishDir = [
path: { "${params.outdir}/multiqc" },
path: { "${params.outdir}/${params.mode}/multiqc" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: 'GENERATE_ASSEMBLY_MANIFEST|ENA_WEBIN_CLI|REGISTERSTUDY' {
ext.args = { params.test_upload ? "--test" : "" }
}
}