This repository includes zipped data folders, which can be found in the /data directory:
- env_data containing all the data from SHARKweb (physico-chemical data, phytoplankton & zooplankton microscopy counts, bacterioplankton & picoplankton epifluorescence counts + biovolume + carbon sequestration), as well as metadata for sequencing samples with integrated physico-chemical parameters averaged over sampling depths (physical_chemical_processed_translation.tsv).
- seq_data containing the count tables, ASV sequences, and taxonomic annotation, both raw and after barrnap (https://github.com/tseemann/barrnap) filtering.
The scripts can be found in the /code/ subfolder:
envpredict_core.R - creates and runs all the XGBoost and Random Forest models.
tabpnf_predictions.py - runs the TabPFN predictions.
Prep_data_for_deep_micro.py - prepares data to be used as input for autoencoders to obtain Deep Representations.
createDeepRepresentations.sh - creates Deep Representations with different autoencoders.
Prep_data_for_classifiers_after_deep_micro.py - prepares the Deep Representations for downstream analyses.
Plot_Figure_2.Rmd - compares different prediction algorithms and predictions made on different types of data (16S vs 18S, different taxonomic levels).
envpredict_eval.R - compares predictions of physicochemical data based on metabarcoding and microscopy data, as well as predictions of phyto- and zooplankton based on different data and approaches (Fig. 3 & 4).
interannual_comparison.Rmd - compares predictions of physicochemical parameters for the 2015-2017 dataset based on models trained on the 2019-2020 dataset (different dataset), 2015-2017 dataset (same dataset), or both datasets (Fig. 5, Supplementary Fig. S1 & S2).
interannual_comparison_linear_regression.Rmd - runs an analysis analogous to interannual_comparison.Rmd, but checks whether the actual and predicted values correlate with each other rather than whether they are the same.
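The difference between values that correlate and values that are the same can be illustrated with a minimal sketch. It is not taken from the repository's notebooks: the numbers are made up, and Pearson's r and RMSE are used here only as assumed stand-ins for a correlation measure and an agreement measure.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def rmse(x, y):
    """Root mean squared error: how far the values are from being identical."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Hypothetical measurements and predictions with a constant offset of 2 units.
actual = [5.0, 6.0, 7.0, 8.0, 9.0]
predicted = [a + 2.0 for a in actual]

r = pearson_r(actual, predicted)
err = rmse(actual, predicted)
print(f"Pearson r = {r:.3f}, RMSE = {err:.3f}")
# prints "Pearson r = 1.000, RMSE = 2.000"
```

In this toy case the predictions track the measurements perfectly while never matching them, which is the kind of systematic shift a correlation-based check tolerates and an agreement-based check does not.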
A Snakefile is available in the main directory of the repository, which makes it possible to rerun most of the analysis with snakemake. The scripts that are run by the pipeline are highlighted in the section above, and are visualized in running order in dag.png and rulegraph.png. We decided not to include the analysis of Deep Representations in the pipeline, since obtaining them required a GPU and a considerable amount of computation, and they yielded poor results. We have also not included the Ecological Quality Ratios (EQRs) analysis in the pipeline, as it required extra data preparation steps.
By default, the pipeline runs in a mode that skips the analysis of Deep Representations, as well as the XGBoost- and TabPFN-based predictions and their analysis. These settings can be changed by modifying the config.yml file. Note that adding XGBoost and/or TabPFN will substantially increase the runtime of the pipeline.
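As a rough sketch of how the pipeline could be launched from a script, the snippet below builds a snakemake command line. The helper names `snakemake_command` and `run_pipeline` are hypothetical and not part of this repository; the sketch assumes snakemake is installed and that the command is run from the repository root, where the Snakefile lives. Only the standard `--cores` and `-n` (dry run) options are used.

```python
import shlex
import subprocess

def snakemake_command(cores=1, dry_run=True):
    """Build a snakemake invocation (hypothetical helper, not in the repository)."""
    cmd = ["snakemake", "--cores", str(cores)]
    if dry_run:
        cmd.append("-n")  # dry run: list the planned jobs without executing them
    return cmd

def run_pipeline(cores=1, dry_run=True):
    """Run snakemake in the current directory; requires snakemake on PATH."""
    return subprocess.run(snakemake_command(cores, dry_run), check=True)

# Show the command for a dry run on 4 cores without executing anything.
print(shlex.join(snakemake_command(cores=4)))
# prints "snakemake --cores 4 -n"
```

Starting with a dry run is a cheap way to see which rules the current config.yml settings would trigger before committing to a long run, for example after enabling the XGBoost or TabPFN steps.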
The envpredict_core.R script can be adapted to other datasets by adjusting its data-reading section.
- Unzip code/HEAT.zip
- Run HEAT.R
- Run Plotting_heat.R