Predicting antimicrobial resistance in Pseudomonas aeruginosa with machine learning-enabled molecular diagnostics

This repository contains instructions to reproduce the main analyses and figures in the paper. The DNAseq and RNAseq data can be downloaded from NCBI’s Gene Expression Omnibus and Sequence Read Archive using the accessions GSE123544 (RNAseq) and PRJNA526797 (DNAseq).
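As a convenience, the two accessions can be turned into NCBI landing-page URLs. The sketch below is not part of this repository: the function names are hypothetical, and the code only builds URLs without downloading anything.

```python
# Accessions quoted in the text above.
GEO_SERIES = "GSE123544"     # RNAseq
BIOPROJECT = "PRJNA526797"   # DNAseq


def geo_series_url(accession: str) -> str:
    """Return the NCBI GEO landing page for a series accession (GSE...)."""
    if not accession.startswith("GSE"):
        raise ValueError(f"not a GEO series accession: {accession}")
    return f"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={accession}"


def bioproject_url(accession: str) -> str:
    """Return the NCBI BioProject landing page for a project accession (PRJ...)."""
    if not accession.startswith("PRJ"):
        raise ValueError(f"not a BioProject accession: {accession}")
    return f"https://www.ncbi.nlm.nih.gov/bioproject/{accession}"


print(geo_series_url(GEO_SERIES))
print(bioproject_url(BIOPROJECT))
```

The raw reads themselves are linked from these pages.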

Processing sequencing data: from raw sequencing data to features with Seq2Geno as input to the machine learning-based AMR prediction

The Seq2Geno package wraps variant calling, phylogenetic tree inference, pan-genome analysis, and related steps. From the raw sequencing data it produces the molecular features used as input for the subsequent antimicrobial resistance classification. For details, see the Seq2Geno repository.

Figure 1: phylogenetic and geographic distribution of Pseudomonas aeruginosa strains. The folder figure01 contains the data and scripts required to produce Figure 1. Specifically, figure_1a.R creates the map showing the origin of the Pseudomonas strains used in this study; figure_1b_bar.R and figure_1b_pie.R visualize the extent of drug resistance across all strains; and tree_visualize.R draws the phylogenetic tree of the strains together with a number of reference isolates.

AMR classification with support vector machine classification using Model-T

The SVM classification was done with Model-T (https://github.com/aweimann/Model-T), which is based on scikit-learn and served as the prediction engine in our previous work on bacterial trait prediction (Weimann et al. mSystems 2016). learning_curves/learning_curves.info, feature_curves/feature_curves.info and mic_misclassified/mic_misclassified.info are bash scripts that reproduce the respective parts of the analysis from the processed sequencing data. Handle with care: they are not intended to be run in one go. For convenience, smaller result tables are included in this repository.
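To illustrate the general idea of SVM-based resistance classification, here is a minimal sketch that calls scikit-learn directly. It does not use Model-T and is not the configuration used in the paper: the toy data, the resistance rule, and all parameter values are assumptions made for illustration only.

```python
import random

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Toy stand-in for a strain-by-feature matrix (1 = feature present).
rng = random.Random(0)
n_strains, n_features = 200, 10
X = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_strains)]

# Assumed toy rule: a strain is resistant (1) iff feature 0 is present.
y = [row[0] for row in X]

# Linear SVM evaluated with stratified 5-fold cross-validation.
clf = LinearSVC(C=1.0, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="f1_macro")

mean_f1 = sum(scores) / len(scores)
print(f"macro F1 over {len(scores)} folds: {mean_f1:.2f}")
```

With real data, X would hold the features computed from the sequencing data and y the measured resistance phenotype for one drug.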

AMR prediction across different combinations of data types and different evaluation schemes

learning_curves/perf_barplot.R produces Figure 3 and Figure 5 of the paper from the classification performance summaries in the tables learning_curves/perf_all.txt and feature_curves/validation_overall.txt.

Performance saturation by number of features

feature_curves/feat_and_cparam2perf.R produces Figure 4 from the classification performance summary in feat_perf.txt, restricted to the best data combinations listed in best_models.txt.
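The restriction to the best models and the notion of saturation can be sketched in a few lines. The row layout, the drug and data-type names, the scores, and the tolerance below are all hypothetical; they are not taken from feat_perf.txt or best_models.txt.

```python
# Hypothetical performance rows: (drug, data_type, n_features, score).
perf = [
    ("drug_A", "type_1", 10, 0.70),
    ("drug_A", "type_1", 50, 0.85),
    ("drug_A", "type_1", 100, 0.90),
    ("drug_A", "type_1", 500, 0.905),
    ("drug_A", "type_2", 10, 0.60),
    ("drug_A", "type_2", 50, 0.65),
    ("drug_B", "type_2", 10, 0.75),
    ("drug_B", "type_2", 50, 0.81),
    ("drug_B", "type_2", 100, 0.78),
]

# Hypothetical best data combination per drug.
best_models = {("drug_A", "type_1"), ("drug_B", "type_2")}


def saturation_point(rows, best, tolerance=0.01):
    """For each best (drug, data_type) pair, return the smallest feature
    count whose score lies within `tolerance` of that pair's maximum."""
    by_model = {}
    for drug, data_type, n_features, score in rows:
        if (drug, data_type) in best:
            by_model.setdefault((drug, data_type), []).append((n_features, score))
    result = {}
    for key, points in by_model.items():
        top = max(score for _, score in points)
        result[key] = min(n for n, score in points if score >= top - tolerance)
    return result


print(saturation_point(perf, best_models))
```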

Performance saturation by number of samples

learning_curves/learning_curves.info is a bash script that contains the entire pipeline for this part of the analysis; it is not intended to be run in one go. learning_curves/plot_learning_curve_data.R produces Figure 6 from the performance summary in the table learning_curves/cv_perf_summary.txt.
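A learning curve of this kind is obtained by repeatedly scoring a model on random subsamples of increasing size. The following sketch shows only that outer loop; the function names, the number of repeats, and the stand-in scoring function are assumptions, and a real run would train and cross-validate a classifier inside `evaluate`.

```python
import random
from statistics import mean


def learning_curve(samples, sizes, evaluate, repeats=5, seed=0):
    """Return {size: mean score} where `evaluate` is applied to `repeats`
    random subsamples (drawn without replacement) of each size."""
    rng = random.Random(seed)
    curve = {}
    for size in sizes:
        if size > len(samples):
            raise ValueError(f"size {size} exceeds {len(samples)} samples")
        scores = [evaluate(rng.sample(samples, size)) for _ in range(repeats)]
        curve[size] = mean(scores)
    return curve


def toy_evaluate(subset):
    """Stand-in score that merely grows with the subsample size."""
    return 1.0 - 1.0 / len(subset)


strains = list(range(100))  # placeholder strain identifiers
curve = learning_curve(strains, sizes=[10, 20, 50, 100], evaluate=toy_evaluate)
for size, score in curve.items():
    print(size, round(score, 3))
```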

Analyzing misclassified samples

mic_misclassified/mic_miscl_barplot.R produces Figure 7 from the drug resistance prediction outcomes of all strains in the table mic_misclassified/miscl_all_w_validation.txt. mic_misclassified/breakpoint_enrichment.R uses the same table to test whether misclassified samples are enriched close to the resistance breakpoint and writes the table mic_misclassified/misclassified_enrichment_sig.txt.
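One common way to test for such an enrichment is a one-sided Fisher's exact test on a 2x2 table of misclassified versus correctly classified samples, split by proximity to the breakpoint. The sketch below illustrates that approach; the statistical test actually used by breakpoint_enrichment.R is not stated here, and the definition of "close" (within one two-fold dilution step) as well as the toy MIC values are assumptions.

```python
from math import comb, log2


def near_breakpoint(mic, breakpoint, max_log2_distance=1.0):
    """True if the MIC lies within `max_log2_distance` two-fold dilution
    steps of the breakpoint (assumed definition of 'close')."""
    return abs(log2(mic) - log2(breakpoint)) <= max_log2_distance


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]]:
    probability of observing at least `a` in the top-left cell."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / comb(n, col1)


def enrichment_p_value(samples, breakpoint):
    """`samples` is an iterable of (mic, misclassified) pairs."""
    a = b = c = d = 0
    for mic, misclassified in samples:
        near = near_breakpoint(mic, breakpoint)
        if misclassified and near:
            a += 1  # misclassified, close to breakpoint
        elif misclassified:
            b += 1  # misclassified, far from breakpoint
        elif near:
            c += 1  # correct, close to breakpoint
        else:
            d += 1  # correct, far from breakpoint
    return fisher_exact_greater(a, b, c, d)


# Toy (MIC, misclassified) pairs with an assumed breakpoint of 4.
toy = [(4, True), (8, True), (2, True), (64, True),
       (4, False), (0.25, False), (32, False), (0.5, False)]
p = enrichment_p_value(toy, breakpoint=4)
print(f"one-sided p-value: {p:.4f}")
```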

misclassified_phylogeny/graphlan.sh produces Supplementary Figures 3-6. It requires GraPhlAn and uses the pre-generated XML files in misclassified_phylogeny (tree_annot_Tobra.xml etc.).

Comparing different ML classifiers with Geno2Pheno

The Geno2Pheno package employs a broad range of classifiers for resistance prediction. See https://github.com/hzi-bifo/Geno2Pheno for details and commands to reproduce the analysis.

