JRaviLab/cyano_adaptation

Genotype-phenotype modeling of light ecotypes in Prochlorococcus reveals genomic signatures of ecotypic divergence

Evan Brenner^, Charmie Vang^, Connah Johnson, Janani Ravi*. bioRxiv 2025. doi: 10.1101/2025.10.13.678797

^Co-primary authors. *Corresponding author: janani.ravi@cuanschutz.edu

Abstract

Prochlorococcus species are the most abundant marine photosynthetic bacteria. Despite broadly shared phenotypic traits and marine habitats, they exhibit remarkable genomic diversity. We ask what genomic signatures underlie their ecotypic divergence into high- and low-light adapted lineages, and whether these signatures can still be recovered from incomplete assemblies. From ~1,000 publicly available Prochlorococcus genomes, we focused on those with information on their light adaptation ecotype (high-light/low-light), phylogenetic clades, and depth of isolation. Across these divisions, we calculated average nucleotide identity and constructed pangenomes to assess cyanobacterial core genes vs. those that separate ecotypes. Despite scant conservation, we observe a sharp taxon separation by light ecotypes. Classical machine learning models trained to predict ecotype achieve near-perfect binary classification accuracy even when predicting on partial genomes (Matthews correlation coefficient = 0.86–1.00), while regression models trained to predict the depth of isolation performed poorly, with high root mean square error values (37.6–42.0 m). For ecotype prediction, we analyzed top gene features across model runs and classes; these features included photosynthesis-associated genes and pathways, as well as many novel markers of unknown function. When separating ecotypes further by previously described phylogenetic clades, genomic content and composition show even clearer separation among clades, supporting the taxonomic breadth of the Prochlorococcus collective. These results emphasize the genomic specialization underlying ecotypic divergence and support the utility of ML approaches for cyanobacterial ecotype prediction from metagenomic data. Expanded sampling will yield novel clade-specific biology. All data, models, and results are available on GitHub: https://github.com/JRaviLab/cyano_adaptation.

Machine learning to predict cyanobacterial ecotype (light) and depth of isolation

Summary

This project uses public data from the cyanobacterial genus Prochlorococcus to build classical machine learning models that use Prochlorococcus gene content to predict the isolate-level light adaptation ecotype (high-light, HL; low-light, LL) or the depth of isolation in meters. The scripts in this directory process data from NCBI's GenBank and metadata from NCBI and the DOE Joint Genome Institute. The processed data are used for machine learning, statistical summary, and plot generation in R.
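For readers new to this kind of modeling, the sketch below illustrates the general idea in plain Python: gene content encoded as a presence/absence matrix, plus the two metrics the manuscript reports (Matthews correlation coefficient for HL/LL classification and root mean square error for depth regression). It is not the repository's code, which is written in R; the function names, genome IDs, and gene labels are all hypothetical placeholders chosen for illustration.

```python
import math


def presence_absence(genomes):
    """Encode {genome_id: set of gene families} as 0/1 feature rows.

    Returns the sorted gene list (column order) and a dict of rows.
    """
    genes = sorted(set().union(*genomes.values()))
    rows = {
        genome_id: [1 if gene in present else 0 for gene in genes]
        for genome_id, present in genomes.items()
    }
    return genes, rows


def mcc(y_true, y_pred, positive="HL"):
    """Matthews correlation coefficient for a binary HL/LL labeling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def rmse(observed, predicted):
    """Root mean square error, in the units of the inputs (here, meters)."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Toy, made-up inputs; not data from the study.
    toy_genomes = {
        "genome_1": {"gene_a", "gene_b"},
        "genome_2": {"gene_b", "gene_c"},
    }
    genes, rows = presence_absence(toy_genomes)
    print(genes, rows)
    print(mcc(["HL", "HL", "LL", "LL"], ["HL", "LL", "LL", "LL"]))
    print(rmse([10.0, 20.0], [13.0, 16.0]))
```

MCC ranges from -1 to 1 and accounts for all four cells of the confusion matrix, which makes it a reasonable summary when the HL and LL classes are unevenly represented.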

Scripts

The scripts in this workflow are numbered so that running them in order reproduces the analysis reported in our manuscript. Each script begins with a header block describing its purpose, inputs, and outputs. Sbatch files (e.g., 03b) are provided as examples for running the more compute-intensive steps on SLURM-based HPC clusters.


License

Cyano_adaptation is made freely available under the terms of the Simplified BSD license.

Copyright 2026 Battelle Memorial Institute and CU Anschutz (JRaviLab)

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

This material was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the United States Department of Energy, nor Battelle, nor any of their employees, nor any jurisdiction or organization that has cooperated in the development of these materials, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, software, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof, or Battelle Memorial Institute. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

PACIFIC NORTHWEST NATIONAL LABORATORY
operated by
BATTELLE
for the
UNITED STATES DEPARTMENT OF ENERGY
under Contract DE-AC05-76RL01830

Funding

This project was partially supported by the University of Colorado Anschutz (CUA), the Northwest Biopreparedness Research Virtual Environment project (NW-BRaVE), and CUA under subcontract to PNNL as part of NW-BRaVE. This project also used resources on the project award (Enhancing biopreparedness through a model system to understand the molecular mechanisms that lead to pathogenesis and disease transmission) from the Environmental Molecular Sciences Laboratory, a DOE Office of Science User Facility sponsored by the Biological and Environmental Research program under Contract No. DE-AC05-76RL01830, and the University of Colorado Boulder's high-performance computing resource, Alpine. Alpine is jointly funded by the University of Colorado Boulder, the University of Colorado Anschutz, Colorado State University, and the National Science Foundation (award 2201538). Support from CUA came from start-up funds from the University of Colorado Anschutz, awarded to JR. Support for NW-BRaVE came from the Department of Energy, Office of Science, Biological and Environmental Research program FWP 81832. Pacific Northwest National Laboratory is a multi-program national laboratory operated by Battelle for the DOE under Contract DE-AC05-76RL01830.
