STEC_KMA

STEC_KMA is a bioinformatics tool designed to process sequencing data for Shiga toxin-producing Escherichia coli (STEC). It uses the KMA (K-mer Alignment) algorithm to map sequencing reads to a combined allele database, identify the best hits, extract relevant reads, and confirm results through alignment using BWA (Burrows-Wheeler Aligner). The tool also identifies alleles with insertions and generates comprehensive reports.

Features

Read Baiting: Uses BBMap to bait reads from sequencing data based on a reference allele database.
KMA Mapping: Maps reads to allele databases using KMA to identify the best hits.
BWA Alignment: Aligns reads to reference alleles to confirm results and identify insertions.
Consensus Sequence Extraction: Extracts consensus sequences and identifies insertions from mapped reads.
Stx Profile Calculation: Calculates Shiga toxin (stx) profiles for each sample.
Comprehensive Reporting: Generates detailed reports, including allele matches, percent identity, and insertion details.

Installation

Step 1: Install Conda

Ensure you have Conda installed. If not, download and install Miniconda or Anaconda from https://docs.conda.io/en/latest/miniconda.html.

Step 2: Create the Conda Environment

Create a Conda environment named stec_kma and install the required dependencies:

conda create -n stec_kma olcbioinformatics::stec_kma
conda activate stec_kma

Usage

Run the stec_kma.py script with the required arguments:

python src/stec_kma.py -s <sequence_path> -d <database_path> -r <report_path> -t <threads> -ID <identity> -c <min_coverage>

Arguments

-s, --sequence_path: Path to the folder containing sequencing reads (required).
-d, --database_path: Name and path of the indexed KMA database (required). This is the value provided to the -o argument when running the kma index command. Ensure that the database has been processed with kma index before running this tool.
-r, --report_path: Path to the folder where reports will be written (optional, defaults to sequence_path/reports).
-t, --threads: Number of threads to use (default: number of CPU cores).
-ID, --identity: Minimum identity percentage for KMA hits (default: 90%).
-c, --min_coverage: Minimum fraction of reads required to call a consensus insertion (default: 0.7).
--verbosity: Logging verbosity level (DEBUG, INFO, WARNING, ERROR, CRITICAL; default: INFO).

Pipeline Overview

The stec_kma.py script orchestrates the following steps:

Locate Samples:
- Organizes FASTQ files into a dictionary and creates subdirectories for each sample.
Bait Reads:
- Uses BBMap to bait reads from the input FASTQ files based on the reference allele database.
Reverse Bait Targets:
- Baits targets from the allele database using the previously baited reads.
Index Baited Sequences:
- Indexes the baited allele sequences using KMA.
Map Reads with KMA:
- Maps reads to the indexed allele database using KMA.
Extract Allele Sequences:
- Extracts allele sequences corresponding to the best hits from the KMA reports.
Map Reads with BWA:
- Aligns reads to reference alleles using BWA to confirm results and identify insertions.
Extract Consensus Sequences:
- Extracts consensus sequences and identifies insertions from the BWA output.
Calculate Stx Profiles:
- Calculates Shiga toxin (stx) profiles for each sample.
Generate Reports:
- Writes detailed reports summarizing the results.

Output

The tool generates the following outputs:

Reports:
- A tab-delimited report (stec_kma_report.tsv) summarizing:
  - Sample name
  - Best allele match
  - Percent identity
  - Stx profiles
  - Notes (e.g., insertions, truncations, or internal stop codons)
FASTA Files:
- Nucleotide and amino acid sequences for alleles with insertions.
Intermediate Files:
- Baited reads, mapped reads, and alignment files for further analysis.

Example

Before running stec_kma.py, ensure that kma index has been run on your database. For example:

kma index -i /path/to/allele_db.fasta -o /path/to/indexed_db

Then, run the tool:

python src/stec_kma.py \
    -s /path/to/sequence_data \
    -d /path/to/indexed_db \
    -r /path/to/output_reports \
    -t 8 \
    -ID 95 \
    -c 0.8 \
    --verbosity DEBUG

Logging

The tool supports configurable logging levels (DEBUG, INFO, WARNING, ERROR, CRITICAL). Logs include detailed information about each step of the pipeline, including system commands and outputs.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For questions or issues, please contact:

Author: Adam Koziol
Email: [email protected]

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
.circleci		.circleci
.github/workflows		.github/workflows
recipes		recipes
src		src
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
generate_conda_environment.sh		generate_conda_environment.sh
pyproject.toml		pyproject.toml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

STEC_KMA

Features

Installation

Step 1: Install Conda

Step 2: Create the Conda Environment

Usage

Arguments

Pipeline Overview

Output

Example

Logging

License

Contact

About

Uh oh!

Releases 7

Packages

Uh oh!

Languages

License

OLC-Bioinformatics/STEC_KMA

Folders and files

Latest commit

History

Repository files navigation

STEC_KMA

Features

Installation

Step 1: Install Conda

Step 2: Create the Conda Environment

Usage

Arguments

Pipeline Overview

Output

Example

Logging

License

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 7

Packages 0

Uh oh!

Languages

Packages