STEC_KMA is a bioinformatics tool designed to process sequencing data for Shiga toxin-producing Escherichia coli (STEC). It uses the KMA (K-mer Alignment) algorithm to map sequencing reads to a combined allele database, identify the best hits, extract relevant reads, and confirm results through alignment using BWA (Burrows-Wheeler Aligner). The tool also identifies alleles with insertions and generates comprehensive reports.
- Read Baiting: Uses BBMap to bait reads from sequencing data based on a reference allele database.
- KMA Mapping: Maps reads to allele databases using KMA to identify the best hits.
- BWA Alignment: Aligns reads to reference alleles to confirm results and identify insertions.
- Consensus Sequence Extraction: Extracts consensus sequences and identifies insertions from mapped reads.
- Stx Profile Calculation: Calculates Shiga toxin (stx) profiles for each sample.
- Comprehensive Reporting: Generates detailed reports, including allele matches, percent identity, and insertion details.
Ensure you have Conda installed. If not, download and install Miniconda or Anaconda from https://docs.conda.io/en/latest/miniconda.html.
Create a Conda environment named stec_kma and install the required dependencies:
conda create -n stec_kma olcbioinformatics::stec_kma
conda activate stec_kma
Run the stec_kma.py script with the required arguments:
python src/stec_kma.py -s <sequence_path> -d <database_path> -r <report_path> -t <threads> -ID <identity> -c <min_coverage>
- -s, --sequence_path: Path to the folder containing sequencing reads (required).
- -d, --database_path: Name and path of the indexed KMA database (required). This is the value provided to the -o argument when running the kma index command. Ensure that the database has been processed with kma index before running this tool.
- -r, --report_path: Path to the folder where reports will be written (optional, defaults to sequence_path/reports).
- -t, --threads: Number of threads to use (default: number of CPU cores).
- -ID, --identity: Minimum identity percentage for KMA hits (default: 90%).
- -c, --min_coverage: Minimum fraction of reads required to call a consensus insertion (default: 0.7).
- --verbosity: Logging verbosity level (DEBUG, INFO, WARNING, ERROR, CRITICAL; default: INFO).
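To make the defaults listed above concrete, here is a minimal argparse sketch that mirrors the documented options. It is an illustration only, not the source of stec_kma.py: the function names build_parser and parse_args are hypothetical, and the way the report_path default is filled in is an assumption based on the description above.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(prog="stec_kma.py")
    parser.add_argument("-s", "--sequence_path", required=True,
                        help="Folder containing sequencing reads")
    parser.add_argument("-d", "--database_path", required=True,
                        help="Name and path of the indexed KMA database")
    parser.add_argument("-r", "--report_path", default=None,
                        help="Report folder (default: sequence_path/reports)")
    parser.add_argument("-t", "--threads", type=int, default=os.cpu_count(),
                        help="Number of threads (default: number of CPU cores)")
    parser.add_argument("-ID", "--identity", type=float, default=90.0,
                        help="Minimum identity percentage for KMA hits")
    parser.add_argument("-c", "--min_coverage", type=float, default=0.7,
                        help="Minimum fraction of reads for a consensus insertion")
    parser.add_argument("--verbosity", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
                        help="Logging verbosity level")
    return parser


def parse_args(argv):
    """Parse argv and apply the documented report_path default (assumed logic)."""
    args = build_parser().parse_args(argv)
    if args.report_path is None:
        args.report_path = os.path.join(args.sequence_path, "reports")
    return args
```

With only -s and -d supplied, this sketch yields an identity of 90.0, a min_coverage of 0.7, and a report folder inside the sequence folder, matching the defaults described above.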
The stec_kma.py script orchestrates the following steps:
- Locate Samples: Organizes FASTQ files into a dictionary and creates subdirectories for each sample.
- Bait Reads: Uses BBMap to bait reads from the input FASTQ files based on the reference allele database.
- Reverse Bait Targets: Baits targets from the allele database using the previously baited reads.
- Index Baited Sequences: Indexes the baited allele sequences using KMA.
- Map Reads with KMA: Maps reads to the indexed allele database using KMA.
- Extract Allele Sequences: Extracts allele sequences corresponding to the best hits from the KMA reports.
- Map Reads with BWA: Aligns reads to reference alleles using BWA to confirm results and identify insertions.
- Extract Consensus Sequences: Extracts consensus sequences and identifies insertions from the BWA output.
- Calculate Stx Profiles: Calculates Shiga toxin (stx) profiles for each sample.
- Generate Reports: Writes detailed reports summarizing the results.
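As an illustration of the first step (locating samples), the sketch below groups FASTQ files into a dictionary keyed by sample name and creates one subdirectory per sample. It is not the tool's actual implementation: the function name locate_samples and the file-naming convention encoded in FASTQ_PATTERN (a sample name followed by _R1 or _R2) are assumptions made for this example.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming convention: <sample>_R1.fastq.gz, <sample>_R2_001.fq, etc.
FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+?)_R[12](?:_\d+)?\.(?:fastq|fq)(?:\.gz)?$"
)


def locate_samples(sequence_path):
    """Group FASTQ files by sample name and create a subfolder per sample."""
    root = Path(sequence_path)
    samples = defaultdict(list)
    # Sort so that R1 is listed before R2 for each sample.
    for fastq in sorted(root.iterdir()):
        match = FASTQ_PATTERN.match(fastq.name)
        if fastq.is_file() and match:
            samples[match.group("sample")].append(fastq)
    # One working directory per sample for downstream outputs.
    for sample in samples:
        (root / sample).mkdir(exist_ok=True)
    return dict(samples)
```

Files that do not match the assumed pattern are ignored, so unrelated files in the sequence folder do not create empty sample entries.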
The tool generates the following outputs:
- Reports: A tab-delimited report (stec_kma_report.tsv) summarizing:
  - Sample name
  - Best allele match
  - Percent identity
  - Stx profiles
  - Notes (e.g., insertions, truncations, or internal stop codons)
- FASTA Files: Nucleotide and amino acid sequences for alleles with insertions.
- Intermediate Files: Baited reads, mapped reads, and alignment files for further analysis.
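Because the report is tab-delimited, it can be loaded with Python's standard csv module. The sketch below is one possible way to read it and to pick out rows that carry a note. The exact column headers written by the tool are not stated here, so the notes_column default of "Notes" is an assumption; check the header row of your own report and adjust it.

```python
import csv


def read_report(report_file):
    """Return the rows of a tab-delimited report as a list of dictionaries."""
    with open(report_file, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def rows_with_notes(rows, notes_column="Notes"):
    """Keep rows whose notes column is non-empty.

    The column name is an assumption; pass the real header if it differs.
    """
    return [row for row in rows if (row.get(notes_column) or "").strip()]
```

For example, rows_with_notes(read_report("stec_kma_report.tsv")) would list the samples flagged with insertions, truncations, or internal stop codons, provided the notes column is named as assumed.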
Before running stec_kma.py, ensure that kma index has been run on your database. For example:
kma index -i /path/to/allele_db.fasta -o /path/to/indexed_db
Then, run the tool:
python src/stec_kma.py \
-s /path/to/sequence_data \
-d /path/to/indexed_db \
-r /path/to/output_reports \
-t 8 \
-ID 95 \
-c 0.8 \
--verbosity DEBUG
The tool supports configurable logging levels (DEBUG, INFO, WARNING, ERROR, CRITICAL). Logs include detailed information about each step of the pipeline, including system commands and outputs.
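For readers unfamiliar with how a verbosity string maps onto Python's logging levels, the following is a minimal sketch using the standard logging module. It is not taken from stec_kma.py: the function name setup_logging, the logger name stec_kma_example, and the message format are all hypothetical.

```python
import logging

VALID_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def setup_logging(verbosity="INFO"):
    """Return a logger configured for the requested verbosity level."""
    level_name = verbosity.upper()
    if level_name not in VALID_LEVELS:
        raise ValueError(f"Unknown verbosity: {verbosity}")
    logger = logging.getLogger("stec_kma_example")  # hypothetical logger name
    logger.setLevel(getattr(logging, level_name))
    # Attach a single console handler; avoid duplicates on repeated calls.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

At DEBUG, every message is emitted, which is where per-step details such as system commands would typically appear; at ERROR, only errors and critical messages pass through.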
This project is licensed under the MIT License. See the LICENSE file for details.
For questions or issues, please contact:
- Author: Adam Koziol
- Email: [email protected]