Pipeline Scripts

1.process_raw_data.r

Description

This script prepares the raw data for analysis. It converts sequence counts into an abundance table and cleans the data.

Input

raw_data/full_taxa.tsv: The phylogenetic classification of each microbial element sampled.

raw_data/full_table.nochim.txt: The number of reads of each ASV sequence in each cow (a cow may have more than one sample).

raw_data/RuminOmics_Animal_Phenotypes_for_Mizrahi_v2_plus_rt_quantification_with_total_20170921_and_depth.xlsx: The metadata for the sampled cows.

Output

local_output/ASV_full_taxa.csv: The phylogenetic classification of each microbial element sampled, including our internal ASV ID for each sequence.

raw_data/ASV<id>_<id>.tsv: The summed read count (abundance) of each ASV in that ID range, for each cow.

raw_data/all_ASV_sum_reads.csv: The read data for all the ASVs in a single file. The per-range files above are written to preserve progress. (legacy: ASV_data_sum_reads.csv)

local_output/ASV_processed_data.csv: The processed data ready to be analyzed. (legacy: ASV_data_final.csv)
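
The pipeline itself is written in R. As an illustration only, the following Python sketch shows the aggregation step described above, in which several samples from the same cow are collapsed into one summed read count per cow and ASV. The record layout (cow ID, ASV ID, reads) and all names are assumptions made for this example, not the script's actual data structures.

```python
from collections import defaultdict


def sum_reads_per_cow(records):
    """Collapse per-sample read counts into per-cow totals.

    records: iterable of (cow_id, asv_id, reads) tuples, one per sample.
    A cow may appear in several samples. The layout is hypothetical.
    """
    totals = defaultdict(int)
    for cow_id, asv_id, reads in records:
        totals[(cow_id, asv_id)] += reads
    return dict(totals)


# Toy data: cow "c1" was sampled twice for ASV_1.
records = [
    ("c1", "ASV_1", 10),
    ("c1", "ASV_1", 5),
    ("c1", "ASV_2", 3),
    ("c2", "ASV_1", 7),
]
abundance = sum_reads_per_cow(records)
print(abundance[("c1", "ASV_1")])  # 15
```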

2.filter_data.r

Description

This script filters the data to remove scarce microbes. It also compares two filtering methods, by sensitivity and by core microbes; the results use the latter.

Input

local_output/ASV_processed_data.csv: The processed data output by the 1.process_raw_data.r script.

Output

local_output/figures/core_microbes_sensitivity_example.pdf:

local_output/core_ASV_50.csv: The abundance data for the ASVs that appear in at least 50% of the cows in a given farm (50% is an example; the percentage matches the number in the file name). (legacy: ASV_Core_final_05.csv)
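
The core filter described above can be sketched as follows. This is a Python illustration under stated assumptions (the pipeline is in R): the row layout, the function name `core_asvs`, and the rule that a cow counts as carrying an ASV when its abundance is above zero are all hypothetical choices for the example.

```python
from collections import defaultdict


def core_asvs(rows, threshold=0.5):
    """Return, per farm, the ASVs present in at least `threshold` of its cows.

    rows: iterable of (farm, cow_id, asv_id, abundance); hypothetical layout.
    """
    cows_in_farm = defaultdict(set)
    cows_with_asv = defaultdict(set)
    for farm, cow, asv, abundance in rows:
        cows_in_farm[farm].add(cow)
        if abundance > 0:  # assumption: presence means a non-zero abundance
            cows_with_asv[(farm, asv)].add(cow)

    core = defaultdict(set)
    for (farm, asv), cows in cows_with_asv.items():
        if len(cows) / len(cows_in_farm[farm]) >= threshold:
            core[farm].add(asv)
    return dict(core)


# Toy farm with four cows: ASV_1 in 3/4, ASV_2 in 1/4, ASV_3 in 2/4.
rows = [
    ("F1", "c1", "ASV_1", 5), ("F1", "c2", "ASV_1", 2),
    ("F1", "c3", "ASV_1", 1), ("F1", "c4", "ASV_1", 0),
    ("F1", "c1", "ASV_2", 9),
    ("F1", "c1", "ASV_3", 4), ("F1", "c2", "ASV_3", 4),
]
core = core_asvs(rows, threshold=0.5)
print(sorted(core["F1"]))  # ['ASV_1', 'ASV_3']
```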

3.exploratory_analysis.r

Description

This script quantifies the unfiltered and filtered ASV data. It finds kingdom distribution, quantifies abundance and richness, and calculates alpha and beta diversity.

Input

output/ASV_processed_data.csv: The processed data, before filtering, from the 1.process_raw_data.r script. (legacy: ASV_data_final.csv)

output/ASV_full_taxa.csv: The phylogenetic classification of each microbial element sampled.

output/core_ASV_<percent>.csv: The filtered abundance data, by filter percentage. For example, when <percent> equals 05 the data is filtered for ASVs present in at least 5% of cows. (legacy: ASV_Core_005.csv)

local_output/farm_multilayer_pos_30.csv: A file with only the positive networks for 30% filter arranged in a matrix format.

Output

local_output/figures/ASV_Kingdom_unfiltered.png: Shows abundance and richness per kingdom on the unfiltered ASV processed data.

local_output/figures/ASV_Kingdom_filtered.png: Shows abundance and richness per kingdom on the 30% filtered ASV processed data.

local_output/figures/microbe_distribution_in_cows.pdf: A histogram showing the distribution of microbes in cows.

local_output/figures/bacteria_in_farms.pdf: A histogram showing the distribution of microbes in farms.

local_output/figures/ASV_per_cow_per_farm.png: ASV richness median per cow in each farm.

local_output/figures/ASVs_per_farm.png: The number of distinct ASVs per farm. (legacy: ~/GitHub/ASVs per farm.png)

local_output/figures/ASV_relative_reads_per_cow.png: The distribution of relative reads per cow in each farm.

local_output/figures/beta_diversity_unif_jacc.png: A plot showing the beta diversity between farms, calculated with the Jaccard index and UniFrac.

local_output/figures/beta_div_network.pdf: A network showing the beta-diversity between farms.

local_output/figures/shannon_scale.png: A box plot showing the Shannon diversity as a function of scale (country/farm/cow).

local_output/figures/species_richness_scale.png: A box plot showing the species richness as a function of scale (country/farm/cow).

local_output/figures/Shared_microbes_between_farms.png: A heat map showing the number of shared microbes between every two farms.

local_output/figures/cooccurrence_links_per_farm.png: A plot showing the number of co-occurrence links per farm.

local_output/figures/cows_per_farm.png: A plot showing the number of cows in each farm.
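
For readers unfamiliar with the diversity measures named in this section, here is a minimal Python sketch of Shannon diversity (alpha diversity) and the Jaccard index (used for beta diversity). It is an illustration only: the R script may rely on library functions with different conventions (for example a different logarithm base), and the function names here are hypothetical.

```python
import math


def shannon(counts):
    """Shannon diversity H = -sum(p * ln p) over non-zero read counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def jaccard(a, b):
    """Jaccard similarity between two sets of ASV IDs (1 - value is the dissimilarity)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two equally abundant ASVs give H = ln(2).
print(round(shannon([10, 10]), 4))  # 0.6931

# Two farms sharing two of four distinct ASVs.
print(jaccard({"ASV_1", "ASV_2", "ASV_3"}, {"ASV_2", "ASV_3", "ASV_4"}))  # 0.5
```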

HPC/create_network_farm.R

Description

This script uses the core ASV data to build a co-occurrence network. It is written to run on the HPC and produces a log file accordingly. It is meant to run on a single experiment (ASV dataset) at a single level (farm/country).

Input

exp_id: A numeric argument given when calling the script (via HPC job), representing the experiment to be performed.

curr_farm: A character argument given when calling the script (via HPC job), representing the name of the farm for which the network will be built.

experiments.csv: Maps each experiment ID (exp_id) to its corresponding data file.

data_file: The file containing the (core) ASV data used to build the network. The name of this file is looked up in experiments.csv by experiment ID.

Output

_ASV_cow_mat.csv file: A bipartite matrix of ASV occurrence per cow, used to create the network. The identifier in the file name contains the experiment ID, job ID, and farm ID.

_COOC.csv file: A file containing the data and metadata for all the co-occurrences calculated (significant or not) in a specific run. Identifiers as in the file above.

_nodes.csv file: A file containing the node names and their IDs. Identifiers as in the file above.

_mat_pos.csv file: A file with only the positive co-occurrences arranged in a matrix format. Identifiers as in the file above.

_mat_neg.csv file: A file with only the negative co-occurrences arranged in a matrix format. Identifiers as in the file above.

_singletons.csv file: A file with only the singletons (nodes without edges) in the farm. Identifiers as in the file above.

_edge_list.csv file: A file listing only the significant co-occurrences of the network, in an edge-list format. Identifiers as in the file above.

_log.txt file: A file reporting the progress and result of the HPC job run. Its name is generated from the job ID, experiment ID, level, and level name.

run_summary.csv: The script creates this file, or appends a line to the end of it, logging key details of the run.
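
To make the relationship between the bipartite ASV-by-cow matrix and the co-occurrence table more concrete, here is a small Python sketch (the script itself is R). It only counts, for every pair of ASVs, the number of cows in which both occur. The source does not describe how significance or the positive/negative sign is determined, so that step is deliberately left out; the data layout and names are assumptions.

```python
from itertools import combinations


def cooccurrence_counts(occurrence):
    """Count shared cows for every pair of ASVs.

    occurrence: dict mapping ASV ID -> set of cow IDs in which it occurs
    (a hypothetical, sparse form of the bipartite matrix).
    """
    counts = {}
    for a, b in combinations(sorted(occurrence), 2):
        counts[(a, b)] = len(occurrence[a] & occurrence[b])
    return counts


occurrence = {
    "ASV_1": {"c1", "c2", "c3"},
    "ASV_2": {"c2", "c3"},
    "ASV_3": {"c4"},
}
counts = cooccurrence_counts(occurrence)
print(counts[("ASV_1", "ASV_2")])  # 2

# In this raw count table, ASV_3 never co-occurs with any other ASV.
never_shared = [asv for asv in occurrence
                if all(n == 0 for pair, n in counts.items() if asv in pair)]
print(never_shared)  # ['ASV_3']
```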

5.cooccurrence_network_analysis.r

Description

This script analyses the co-occurrence networks generated on the HPC. It combines them into a multilayer network by creating interlayer edges, and then runs an Infomap analysis on it.

Input

run_summary.csv: Contains the details for the runs relevant for the experiment.

HPC output: The script reads all the output files (per farm) that are relevant to a specific experiment ID that ran on the HPC.

Output

output/figures/degree_distribution_in_layers.pdf: A plot showing the degree distribution of nodes in each layer (farm).

local_output/farm_multilayer_pos_30.csv: A file with only the positive networks for 30% filter arranged in a matrix format.

6.modularity_analysis.R

Description

This script performs an Infomap analysis on the observed microbiome network and analyses the results. The Infomap tool is used to find the modules in the multilayer network. The multilayer network is built twice: once using Jaccard values as interlayer edges, and once using UniFrac values.

Input

local_output/farm_multilayer_pos_30.csv: A file with only the positive networks for 30% filter arranged in a matrix format.

Output

local_output/multilayer_jaccard.csv: The multilayer network built using Jaccard interlayer edges.

local_output/multilayer_unif.csv: The multilayer network built using UniFrac interlayer edges.

local_output/farm_modules_pos_30_U.csv: Modules found using UniFrac interlayer edges.

7.shuffle_microbe_distribution.r

Description

This script shuffles the data at several resolutions so that the shuffled data can be compared to the observed data. Currently the data is shuffled 500 times.

Input

local_output/core_ASV_30.csv: The abundance data to be shuffled and compared.

Output

local_output/shuffle_30/shuff_<level>_<id>: A set of files, each containing abundance values shuffled according to <level>. For example, the level 'farm' means the abundance data is shuffled within each farm (across cows) but not between farms. The IDs are the sequential numbers of the shuffles, in the range 1..500.
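
The 'farm' level described above can be illustrated with a short Python sketch (the pipeline is R). It permutes each ASV's abundance values across the cows of one farm and never moves values between farms. This is a simple permutation chosen for the example; the actual script may use a different null model, and the nested-dictionary layout and function name are assumptions.

```python
import random


def shuffle_within_farms(abundance, seed=None):
    """Permute each ASV's values across cows inside every farm.

    abundance: dict farm -> dict ASV ID -> list of abundances, one per cow
    in a fixed cow order (hypothetical layout). The input is not modified.
    """
    rng = random.Random(seed)
    shuffled = {}
    for farm, asvs in abundance.items():
        shuffled[farm] = {}
        for asv, values in asvs.items():
            permuted = list(values)  # copy so the observed data stays intact
            rng.shuffle(permuted)
            shuffled[farm][asv] = permuted
    return shuffled


observed = {
    "F1": {"ASV_1": [5, 0, 2, 0], "ASV_2": [0, 1, 0, 3]},
    "F2": {"ASV_1": [7, 7, 0]},
}
# One shuffled copy per ID, here 3 instead of the 500 used in the pipeline.
shuffles = [shuffle_within_farms(observed, seed=i) for i in range(1, 4)]
```

Each shuffled copy keeps the same set of values per farm and ASV, so farm-level totals are preserved while the assignment to individual cows is randomized.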

HPC/shuffled_partner_fidelity.R

Description

This script is written to run on the HPC. It calculates the partner fidelity of the nodes in the network across layers, using Jaccard and UniFrac.

Input

<network-identifications>_edge_list.csv: Files containing a shuffled network in edge list format. One for each farm.

net_id: A numeric argument given when calling the script (via HPC job), representing the multilayer network to be modulated.

Output

fidelity_shuff_farm_30.csv: Contains the partner fidelity results using Jaccard.

uniFrec_shuff_farm_30.csv.csv: Contains the partner fidelity results using UniFrac.
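
As a rough illustration of Jaccard-based partner fidelity, the Python sketch below (the script itself is R) compares the set of partners a node has in each layer and averages the pairwise Jaccard similarity across layers. The exact definition used by the script is not given on this page, so this formula, the handling of layers where the node has no partners, and all names are assumptions. The UniFrac variant needs a phylogenetic tree and is not shown.

```python
from itertools import combinations


def partners(edges, node):
    """Return the set of nodes linked to `node` in one layer's edge list."""
    result = set()
    for u, v in edges:
        if u == node:
            result.add(v)
        elif v == node:
            result.add(u)
    return result


def partner_fidelity(layers, node):
    """Mean pairwise Jaccard similarity of a node's partner sets across layers.

    layers: dict farm -> list of (u, v) edges (hypothetical layout).
    Layers where the node has no partners are skipped; returns None when
    fewer than two layers remain.
    """
    sets = [p for p in (partners(e, node) for e in layers.values()) if p]
    if len(sets) < 2:
        return None
    sims = [len(a & b) / len(a | b) for a, b in combinations(sets, 2)]
    return sum(sims) / len(sims)


layers = {
    "F1": [("A", "B"), ("A", "C")],
    "F2": [("A", "B"), ("A", "D")],
    "F3": [("B", "C")],
}
fidelity_a = partner_fidelity(layers, "A")  # partner sets {B, C} and {B, D}
```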

9.node_level_analysis.R

Description

This script analyses the clustering coefficient and partner fidelity of each ASV in the network.

Input

farm_multilayer_pos_30.csv: A file with only the positive networks for 30% filter arranged in a matrix format.

HPC/shuffled/shuffle_farm_curveball_30_shuff_500/shuff_farms_files.csv: Files containing shuffled networks in edge list format. One for each farm.

local_output/fitted_asvs_phylo_tree.rds: A file containing the R object with the phylogenetic tree used in the UniFrac analysis.
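
The clustering coefficient mentioned in the description can be sketched in Python as follows (the script is R and may use a library implementation). The code computes the standard local clustering coefficient of each node in one undirected layer: the fraction of pairs of its neighbours that are themselves connected. The edge-list layout and the choice to return 0.0 for nodes with fewer than two neighbours are assumptions for the example.

```python
from collections import defaultdict
from itertools import combinations


def clustering_coefficients(edges):
    """Local clustering coefficient per node of an undirected edge list."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)

    coeffs = {}
    for node, nbrs in neighbors.items():
        k = len(nbrs)
        if k < 2:
            coeffs[node] = 0.0  # assumption: undefined cases reported as 0
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
        coeffs[node] = 2 * links / (k * (k - 1))
    return coeffs


edges = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]
cc = clustering_coefficients(edges)
print(cc["B"])  # 1.0
print(cc["D"])  # 0.0
```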

HPC/modularity_analysis_pf_shuffled.r

Description

This script is written to run on the HPC. It performs a modularity analysis like the earlier modularity script, except that the interlayer edges are constructed from partner fidelity, using Jaccard and UniFrac.

Input

<network-identifications>_edge_list.csv: Files containing a shuffled network in edge list format. One for each farm.

net_id: A numeric argument given when calling the script (via HPC job), representing the multilayer network to be modulated.

Output

_multilayer_pf_jaccard.csv: A file containing the multilayer network as an edge list with Jaccard interlayer edges.

_farm_modules_pf_jaccard.csv: A file containing the modularity results from the Infomap analysis with Jaccard interlayer edges.

_farm_modules_pos_30_J_multilevel.csv: A file containing the multi-level Infomap analysis results for a network with Jaccard interlayer edges.

_multilayer_pf_unif.csv: A file containing the multilayer network as an edge list with UniFrac interlayer edges.

_farm_modules_pf_unif.csv: A file containing the modularity results from the Infomap analysis with UniFrac interlayer edges.

_farm_modules_pos_30_U_multilevel.csv: A file containing the multi-level Infomap analysis results for a network with UniFrac interlayer edges.

../farm_modulation_summary_pf_jaccard.csv: Contains the details of the runs relevant for the Jaccard interlayer edges experiment.

../farm_modulation_summary_pf_unif.csv: Contains the details of the runs relevant for the UniFrac interlayer edges experiment.

11.modularity_analysis_shuffled.R

Description

This script reads the results of the modularity analysis run on the shuffled networks and compares them to the results of the same analysis on the observed data.

Input

_farm_modules_pf_unif.csv: Modularity results of a given shuffled network.

farm_modulation_summary_pf_unif.csv: Summary data for running the modularity analysis on all the shuffled networks.

local_output/farm_modules_pos_30_U.csv: Modules found on the observed network.
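
One common way to compare an observed value with its shuffled counterparts is an empirical one-sided p-value. The Python sketch below illustrates that idea only; this page does not state which statistic or test the R script applies, so the function, the +1 correction, and the toy numbers (which are not results) are all assumptions.

```python
def empirical_p_value(observed, shuffled, greater=True):
    """Fraction of shuffled values at least as extreme as the observed one.

    Uses the (count + 1) / (n + 1) correction so the p-value is never zero.
    """
    if greater:
        extreme = sum(1 for s in shuffled if s >= observed)
    else:
        extreme = sum(1 for s in shuffled if s <= observed)
    return (extreme + 1) / (len(shuffled) + 1)


# Toy example: a hypothetical module count in the observed network
# versus nine shuffled networks.
observed_modules = 12
shuffled_modules = [5, 6, 7, 8, 12, 4, 6, 5, 7]
p = empirical_p_value(observed_modules, shuffled_modules)
print(p)  # 0.2
```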

12.cow_genetic_net.r

Description

This script is necessary for the cows' genetic similarity analysis. It prepares the list of cows to include in the analysis (cows that have both genetic and microbiome data), and processes the similarity results produced by FWASS into a hierarchical clustering by genetic similarity.

Input

raw_data/Cows_SNPs/Ruminomics_NordicRed.fam: List of Nordic Red cows in the SNPs dataset.

raw_data/Cows_SNPs/Ruminomics_Holstein.fam: List of Holstein cows in the SNPs dataset.

local_output/core_ASV_30.csv: The abundance data for the ASVs that appeared in at least 30% of the cows, including cow identity (microbiome dataset).

cows_genetic_results/genmb_similarity_matrix_weighted.csv: Similarity data as calculated by FWASS.

Output

local_output/SNP_micro_intersect_cows.csv: The list of cow IDs that should be in the genetic similarity analysis.

local_output/figures/cow_genetics.pdf: A figure visualizing the hierarchical clustering, including farms and countries as attributes.

13.different_filtration_levels_analysis.R

Description

This script runs the exploratory analysis at different filtration levels: 20%, 10%, and 5%.

Input

local_output/core_ASV_20.csv: The file contains the abundance data for the ASVs that appeared in at least 20% of the cows.

local_output/ASV_full_taxa.csv: The file contains the phylogenetic classification of each microbial element sampled.

HPC/run_summary_sup.csv: Contains the details for the runs relevant for the experiment.

Output

local_output/exploratory_analysis_20.pdf: A PDF containing all the plots for the given filter level.
