title | author | output | date |
---|---|---|---|
Extended Data of Habitat Specialization Impacts Clownfish Demographic Resilience to Pleistocene Sea-Level Fluctuations | Alberto Garcia Jimenez | html_document | 2024-12-05 |
This repository contains the data, analyses, and scripts supporting Garcia Jimenez et al. (2025), "Habitat Specialization Impacts Clownfish Demographic Resilience to Pleistocene Sea-Level Fluctuations". Below is an overview of the repository structure and its contents.
The repository includes several raw datasets essential for the analyses:
- Sample information: Dataset containing information on all samples.
- _popList: Species-specific .txt files listing sample IDs and population IDs.
- Genomic data: Genomic data can be retrieved from the NCBI SRA using the SRR accessions listed in samples_info.csv.
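As a rough illustration of the retrieval step, the sketch below reads a sample table and builds one `fasterq-dump` command (from the SRA Toolkit) per run accession. The column name `SRR` and the accessions in the example are assumptions made for illustration; check the header of samples_info.csv before adapting it.

```python
import csv
import io


def sra_download_commands(csv_text, run_column="SRR"):
    """Return one fasterq-dump command per unique run accession.

    `run_column` is an assumed column name, not taken from the repository.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    accessions = []
    for row in reader:
        acc = (row.get(run_column) or "").strip()
        # Skip empty cells and accessions that were already listed.
        if acc and acc not in accessions:
            accessions.append(acc)
    return [f"fasterq-dump --split-files {acc}" for acc in accessions]


# Made-up table standing in for samples_info.csv (placeholder accessions).
example = (
    "sample_id,species,SRR\n"
    "S1,AKA,SRR0000001\n"
    "S2,AKA,SRR0000002\n"
    "S3,AKA,SRR0000002\n"
)
commands = sra_download_commands(example)
for cmd in commands:
    print(cmd)
```

The commands are only printed here, so they can be reviewed or written to a job script before anything is downloaded.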
This folder contains genomic summaries for samples and species, along with their corresponding SRR identifiers.
MSMC2 analyses were conducted to infer population size changes over time and coalescent processes.
- Individual Bootstraps: Estimates of population variability derived from bootstrap replicates based on individual genome selections.
- Multi-Heterozygosity Bootstraps: Assessments of consistency across genomic regions to validate demographic reconstructions.
- Combined Results: Pairwise cross-coalescence rate (CCR) estimates used to calculate population split times.
- Raw MSMC Datasets: Complete results from individual and multi-heterozygosity bootstrap analyses for all species and populations.
- Filtered MSMC Datasets: Cleaned and curated results from bootstrap analyses for all species and populations.
- Geodesic Distance and Split Times: Dataset summarizing estimated geodesic distances and split times between populations.
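To make the link between cross-coalescence and split times concrete, here is a minimal sketch. It assumes the usual MSMC2 combined-output rates (`lambda_00`, `lambda_01`, `lambda_11`) and one common convention, taking the split time as the point where the relative CCR first reaches 0.5. The repository's own scripts may define the split time differently, and the numbers below are made up.

```python
def relative_ccr(lambda_00, lambda_01, lambda_11):
    """Relative cross-coalescence rate: 2 * between / (within_1 + within_2)."""
    return 2.0 * lambda_01 / (lambda_00 + lambda_11)


def split_time_from_ccr(times, ccr, threshold=0.5):
    """Interpolate the time at which CCR first rises to `threshold`.

    `times` must be sorted from the present into the past. Returns None
    if the threshold is never crossed.
    """
    for i in range(1, len(times)):
        lo, hi = ccr[i - 1], ccr[i]
        if lo < threshold <= hi:
            frac = (threshold - lo) / (hi - lo)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    return None


# Made-up example: times in years, CCR rising towards 1 in the past.
times = [1000.0, 5000.0, 10000.0, 50000.0]
ccr = [0.05, 0.2, 0.6, 0.95]
estimate = split_time_from_ccr(times, ccr)
print(estimate)  # close to 8750 years for these toy values
```

Linear interpolation between time boundaries is a simplification; the time intervals in MSMC2 output are unevenly spaced, so interpolating on a log scale is an equally defensible choice.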
The `popgen` folder contains the population genomic data produced for the clownfish species: pairwise genetic differentiation (FST), nucleotide diversity ($\pi$), and genetic divergence ($d_{xy}$). Below is a detailed description of the contents:
- `FST_produced_dataset.csv`: Consolidated average FST values for all species and population pairs, produced with vcftools.
- `Pi_produced_dataset.csv`: Summarized nucleotide diversity ($\pi$) for all species and populations, produced with vcftools.
- `dxy_produced_dataset.csv`: Combined averaged $d_{xy}$ values across all species and population pairs, produced with `popgenWindows.py` from https://github.com/simonhmartin/genomics_general.

These datasets, along with the `pca` and `admixture` results, can be reproduced by following the PopGen pipeline in the corresponding `pipelines` folder.
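For readers unfamiliar with the two diversity statistics, the following toy sketch computes nucleotide diversity within a population and absolute divergence between two populations as average pairwise differences per site. It is a didactic illustration on made-up haplotypes, not the implementation used by vcftools or `popgenWindows.py`.

```python
from itertools import combinations, product


def pairwise_diff(a, b):
    """Number of sites at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def nucleotide_diversity(haps):
    """Mean pairwise differences per site within one population (pi)."""
    pairs = list(combinations(haps, 2))
    n_sites = len(haps[0])
    return sum(pairwise_diff(a, b) for a, b in pairs) / (len(pairs) * n_sites)


def absolute_divergence(haps_x, haps_y):
    """Mean pairwise differences per site between two populations (dxy)."""
    pairs = list(product(haps_x, haps_y))
    n_sites = len(haps_x[0])
    return sum(pairwise_diff(a, b) for a, b in pairs) / (len(pairs) * n_sites)


# Made-up haplotypes, four sites each.
pop_x = ["AAAA", "AAAT"]
pop_y = ["TAAT", "TAAT"]
print(nucleotide_diversity(pop_x))        # 0.25
print(nucleotide_diversity(pop_y))        # 0.0
print(absolute_divergence(pop_x, pop_y))  # 0.375
```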
EEMS analyses were performed for multiple species to estimate migration rates and population isolation patterns. This folder structure contains the input data and results of the EEMS (Estimated Effective Migration Surfaces) analyses for different species, demes, and chains. Each subfolder corresponds to a specific species and contains the output data for different analysis runs, distinguished by the number of demes and the chain number.
- AKA
- AKY
- CLK
- CRP
- MEL
- POL
- PRD
- SAN
- Different deme sizes: 50, 200, and 500.
- Three independent chains for each configuration.
- A summary image for each species showing log posterior distributions.
- Subdirectory containing detailed output files for the specific species.
- Each subfolder represents a unique run defined by the number of demes and the chain number. It includes the following files:
- demes.txt: Information on the demes.
- edges.txt: Details about the edges in the model.
- eemsrun.txt: General run information.
- ipmap.txt: Mapping of each sampled individual to its assigned deme.
- Various `.txt` files related to model parameters and MCMC chains (e.g., `lastdfpars.txt`, `mcmcmhyper.txt`, etc.).
- Coordinates of the sampled locations.
- File generated by the bed2diffs program distributed with EEMS. Contains a pairwise genetic dissimilarity matrix that quantifies the genetic differences between pairs of sampled individuals, computed from input genotype data such as a .bed file. EEMS uses the values in the .diffs file to estimate and visualize patterns of gene flow and barriers to migration across a landscape.
- Order of data points.
- Outer coordinates indicating the geographical boundary of the data.
- Configuration files for runs with 50 demes.
- Configuration files for runs with 200 demes.
- Configuration files for runs with 500 demes.
- Area visualization map.
- List of excluded individuals (if any).
- Plink log file.
- Samples name and sex code.
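To show what a pairwise dissimilarity matrix of this kind looks like, here is a small pure-Python sketch that averages squared genotype differences over the sites where both samples are called. It is written in the spirit of the .diffs input and is an assumption-laden illustration, not a reimplementation of bed2diffs; the genotypes are made up.

```python
def dissimilarity_matrix(genotypes):
    """Symmetric matrix of mean squared genotype differences.

    `genotypes` holds one list per sample with allele counts 0, 1, or 2,
    and None for missing calls. Each pair is averaged over the sites
    where both samples have a call.
    """
    n = len(genotypes)
    diffs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            squared = [
                (a - b) ** 2
                for a, b in zip(genotypes[i], genotypes[j])
                if a is not None and b is not None
            ]
            value = sum(squared) / len(squared) if squared else float("nan")
            diffs[i][j] = diffs[j][i] = value
    return diffs


# Three made-up samples genotyped at four sites (None = missing call).
toy_genotypes = [
    [0, 1, 2, None],
    [0, 1, 0, 1],
    [2, 1, 0, 1],
]
matrix = dissimilarity_matrix(toy_genotypes)
for row in matrix:
    print(row)
```

The diagonal is zero by construction, and the matrix is symmetric, which are the two properties a .diffs-style input is expected to have.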
This folder contains input and results of MAPS (Migration And Population Size Surfaces), designed to infer spatial and temporal heterogeneity in population sizes and migration rates across landscapes.
- AKA
- AKY
- CLK
- CRP
- MEL
- POL
- PRD
- SAN
- Different deme sizes: 50, 200, and 500.
- Three independent chains for each configuration.
- Subdirectory containing detailed input files for the specific species.
- .coord: Coordinates of sampled locations.
- .demes: Information about the demes in the model.
- .edges: Details about the edges in the spatial model.
- .ipmap: Mapping of each sampled individual to its assigned deme.
- .maps.0.1_Inf.sims, .maps.2_6.sims, .maps.6_Inf.sims: Main MAPS input files for different IBD segment length ranges.
- .outer: Outer boundary data.
- _200.demes, _50.demes, _500.demes: Configuration files specific for different deme sizes.
- _ndemes200_params-chain1.ini, _ndemes500_params-chain2.ini, etc.: Configuration files for different chain runs and deme sizes.
- best_deme_bysp.csv: A CSV file summarizing the best number of demes per species based on log posteriors.
- Species Distribution Modeling (SDM): Predictions of current species ranges produced with the ENMTML R package (de Andrade et al. 2020).
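The selection summarized in best_deme_bysp.csv (best number of demes per species based on log posteriors) can be pictured with the sketch below. The nested dictionary layout, the made-up values, and the choice of averaging log posteriors within and then across chains are all assumptions for illustration; the criterion actually used in the repository is not described here.

```python
from statistics import mean


def best_deme_count(log_posteriors):
    """Pick the deme count with the highest mean log posterior.

    `log_posteriors` maps n_demes -> {chain_id: [sampled log posteriors]}.
    """
    scores = {
        n_demes: mean(mean(values) for values in chains.values())
        for n_demes, chains in log_posteriors.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Made-up log posterior samples for one species, three chains per grid.
toy = {
    50: {1: [-1200.0, -1190.0], 2: [-1210.0, -1200.0], 3: [-1205.0, -1195.0]},
    200: {1: [-1105.0, -1095.0], 2: [-1100.0, -1100.0], 3: [-1110.0, -1090.0]},
    500: {1: [-1150.0, -1150.0], 2: [-1160.0, -1140.0], 3: [-1155.0, -1145.0]},
}
best, scores = best_deme_count(toy)
print(best, scores)
```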
Scripts used for data processing, analysis, and visualization are included. Detailed descriptions of each script are available in the `scripts/` directory.
- eems_plots.R: Script for generating plots related to the EEMS model output.
- fst_dxy_pi_plots.R: Script for plotting genetic differentiation metrics such as FST, DXY, and π.
- maps_plots.R: Script for generating plots from MAPS data.
- process_msmc_reconstructions.R: Script for processing MSMC reconstructions data.
- msmc_analysis.R: Script for the analysis of MSMC data and generating plots.
- pca_admx_plots.R: Script for plotting PCA and ADMIXTURE results.
- test_MAPS_recombmaps.R: Script for visualizing MAPS test on various recombination maps.
- convert_recomb_map_2PLINK.py: Python script for converting pyrho recombination maps to PLINK-compatible formats for downstream analyses.
- get_tmrca.py: Python script to compute the Time to Most Recent Common Ancestor (TMRCA) from MSMC2 cross-coalescence results.
- plot_utils.py: A collection of Python utility functions from msmc-tools (https://github.com/stschiff/msmc-tools).
- custom_functions.R: An R script containing reusable functions for statistical analysis, data manipulation, and plotting, tailored for this project.
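As an illustration of the recombination-map conversion handled by convert_recomb_map_2PLINK.py, the sketch below turns interval rates into cumulative genetic positions. It assumes pyrho-style input rows of (start, end, per-base per-generation rate) and a PLINK-style output of chromosome, identifier, position in centimorgans, and base-pair coordinate. The function name, identifier scheme, and example values are hypothetical and do not come from the repository script.

```python
def rates_to_plink_map(intervals, chrom):
    """Convert (start, end, rate) intervals to cumulative cM positions.

    `rate` is assumed to be the recombination rate per base pair per
    generation, so rate * length gives Morgans and * 100 gives cM.
    """
    rows = []
    cumulative_cm = 0.0
    for start, end, rate in intervals:
        cumulative_cm += rate * (end - start) * 100.0
        # Hypothetical identifier scheme: "<chrom>:<bp position>".
        rows.append((chrom, f"{chrom}:{end}", cumulative_cm, end))
    return rows


# Made-up intervals on a hypothetical chromosome name.
toy_intervals = [(0, 1000, 1e-8), (1000, 3000, 2e-8)]
map_rows = rates_to_plink_map(toy_intervals, "chr1")
for chrom, marker, cm, bp in map_rows:
    print(f"{chrom}\t{marker}\t{cm:.6f}\t{bp}")
```

Genetic positions are reported at interval ends here; positions for specific SNPs would need interpolation within intervals.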
The SNPCall pipeline facilitates the identification of Single Nucleotide Polymorphisms (SNPs) from raw sequencing data. It includes preprocessing, mapping reads to a reference genome, variant calling, and filtering steps to generate high-quality SNP datasets. See its own README for more information.
The PopGen pipeline encompasses analyses of population genetics metrics, including nucleotide diversity (π), genetic differentiation (FST, DXY), and population structure (PCA and ADMIXTURE). It integrates genomic datasets and generates plots to visualize population-level patterns. See its own README for more information.
The EEMS pipeline supports the application of the EEMS (Estimated Effective Migration Surfaces) model, which maps genetic differentiation across spatial landscapes. This pipeline automates data formatting, model execution, and the creation of migration surface visualizations. See its own README for more information.
The MAPS (Migration And Population Size Surfaces) pipeline is based on the framework described in Al-Asadi et al., 2017. It integrates genetic data to infer and visualize migration rates and population sizes across spatial landscapes. This pipeline processes input data, estimates migration and population parameters, and generates spatial visualizations to interpret historical and contemporary population dynamics. See its own README for more information.
The MSMC pipeline processes data for Multiple Sequentially Markovian Coalescent (MSMC) analyses. It includes steps to estimate historical effective population size, preprocess raw output, and run cross-population coalescence analyses to estimate population split times. See its own README for more information.
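The step from raw MSMC2 output to historical effective population size relies on a standard rescaling, sketched below: scaled time is divided by the per-generation mutation rate and multiplied by the generation time, and the scaled coalescence rate is inverted to obtain Ne. The mutation rate and generation time in the example are placeholders, not the values used in the manuscript.

```python
def scale_msmc_time(scaled_time, mu, generation_time):
    """Convert MSMC2 scaled time to years: t / mu * generation time."""
    return scaled_time / mu * generation_time


def scale_msmc_ne(scaled_lambda, mu):
    """Convert a scaled coalescence rate to Ne: 1 / (2 * mu * lambda)."""
    return 1.0 / (2.0 * mu * scaled_lambda)


# Placeholder parameters for illustration only.
MU = 1e-8              # assumed mutation rate per site per generation
GENERATION_TIME = 5.0  # assumed generation time in years

years = scale_msmc_time(1e-5, MU, GENERATION_TIME)
ne = scale_msmc_ne(1000.0, MU)
print(years, ne)  # roughly 5000 years and Ne of about 50000
```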
Follow the instructions below to use this repository effectively:
- Navigate to the Analysis Folder: Start by moving to the folder corresponding to the analysis you wish to perform.
- Review Analysis Instructions: Check the subfolders for instructions or documentation specific to each analysis, or refer to the 'Materials and Methods' section of the manuscript for an overview of the analysis protocols and methodology.
- Run Scripts for Analysis:
  - Use the pipelines in the `pipelines` folder to produce results and the scripts in the `scripts/` folder to replicate and visualize analyses. These scripts are designed to process data and generate results as described in the project.
  - Ensure that you set the correct working directory in the scripts by adjusting the path to match the location of your project files. You can do this using `setwd("/path/to/your/project")` in R or `os.chdir("/path/to/your/project")` in Python.
  - Make sure to install and load any required R or Python packages and external tools as outlined in the Required Software section.
- Adapt and Customize:
  - Customize scripts as needed for different datasets or analysis settings. Be mindful of input data formats and required parameters when modifying the code.
  - Refer to the comments within the scripts for additional guidance on how to run specific sections or modify parameters.
- Output and Results:
  - Outputs will be generated in the relevant subfolders and can include figures, log files, or processed data files. Ensure that you have sufficient disk space and the necessary permissions for file output.
By following these steps, you can reproduce analyses and generate results consistent with the project's workflow.
- R and Python are required to run most scripts.
- Additional packages such as `ggplot2`, `dplyr`, `pandas`, `numpy`, etc., may be required depending on the script.
- ADMIXTURE: A software tool for estimating individual ancestry proportions.
- EEMS (Estimated Effective Migration Surfaces): For spatial modeling of gene flow (available at EEMS GitHub).
- MAPS (Migration And Population Size Surfaces): For inferring migration rates and population sizes across landscapes (available at MAPS GitHub).
- MSMC2 (Multiple Sequentially Markovian Coalescent): For estimating demographic history from genomic data (available at MSMC2 GitHub).
- bcftools: For processing VCF files, variant calling, and filtering.
- bamtools: For manipulating and analyzing BAM files.
- vcftools: For analyzing VCF files and performing various genomic data operations.
- plink: For genome-wide association studies and data manipulation.
- Ensure that all tools and packages are properly installed and accessible in your system's PATH.
- Depending on the analysis, you may need access to a high-performance computing environment or cluster to handle large datasets efficiently.
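A quick way to confirm that the external tools are reachable on the PATH is sketched below using Python's standard `shutil.which`. The executable names in the list are assumptions (installations may name binaries differently, for example `plink1.9` or `msmc2_linux64bit`), so adjust them to your system.

```python
import shutil


def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Assumed executable names; edit to match your installation.
REQUIRED_TOOLS = ["bcftools", "bamtools", "vcftools", "plink", "admixture", "msmc2"]

not_found = missing_tools(REQUIRED_TOOLS)
if not_found:
    print("Not found on PATH:", ", ".join(not_found))
else:
    print("All listed tools were found on PATH.")
```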
For installation guides, refer to the documentation of each tool and package. If you encounter any issues, consult the respective support forums or the documentation for troubleshooting.
- Alberto García Jiménez: Produced the reproducible repository and wrote the README files.
- Marion Talbi and Milan Malinsky: Produced the recombination maps shared in the repository.
- Other contributors: Fieldwork expeditions and sample collection by Théo Gaboriau, Lucy M. Fitzgerald, Sara Heim, Anna Marcionetti, Sarah Schmid, Joris Bertrand, Bruno Frédérich, Fabio Cortesi, Marc Kochzius, Ploypallin Rangseethampanya, Phurinat Ruttanachuchote, Wiphawan Aunkhongthong, Sittiporn Pengsakun, Makamas Sutthacheep, Thamasak Yeemin.
- Alberto García Jiménez: Conceptualized and designed the study, conducted research, performed analyses, interpreted results, and wrote the manuscript.
- Marion Talbi and Milan Malinsky: Generated the recombination maps.
- Théo Gaboriau and Nicolas Salamin: Contributed to study design, data interpretation, and manuscript writing.
- Additional authors contributed to fieldwork, data collection, and revision of the manuscript.
This project is licensed under the GNU General Public License (GPL) v3. This means you are free to use, modify, and distribute the code, provided that any derivative works are also licensed under the GPL v3. For more details, please refer to the GPL v3 License.