The Data
Geut Galai edited this page Mar 28, 2024
·
3 revisions
This project includes data on the bacteria of the bovine rumen from 9 ranches across Europe. This page describes the available data, its format, and its metadata.
The raw data was extracted from a paper published by Wallace et al. in 2019 (DOI: 10.1126/sciadv.aav8391), to be reprocessed in this project. The relevant files:
-
raw_data/full_taxa.tsv
: Phylogenetic metadata on the 16S sequences that were identified while collecting the data. The file is used to map each sequence read to its classified organism. Format (with a couple of examples):
seq16S | Kingdom | Phylum | Class | Order | Family | Genus | Species |
---|---|---|---|---|---|---|---|
TACGCGCTAAAG... | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Prevotellaceae | Prevotella | NA |
CGAAGCGTCGG... | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobrevibacter | NA |
-
raw_data/full_table.nochim.txt
: Contains the number of reads of each seq16S in each of the cows in the study. The data is space-delimited. It is used to calculate occurrence, richness and abundance for each organism. Format (with a couple of examples):
sample | TACGCGCTAAAG... | CGAAGCGTCGG... | TACGCGCTTTTC... | CGGAAAGTCGG... | ... |
---|---|---|---|---|---|
ProkA_R1_FI900.fastq | 0 | 486 | 0 | 1263 | ... |
ProkA_R1_FI901.fastq | 4 | 223 | 0 | 544 | ... |
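As a minimal sketch of how occurrence, richness and abundance might be derived from this table, the Python snippet below parses a small space-delimited sample held in memory. The definitions used here are assumptions, not the project's confirmed definitions: occurrence is the number of samples in which a sequence has at least one read, richness is the number of sequences with at least one read in a sample, and abundance is the total number of reads per sequence. The names `seqA` to `seqD` are hypothetical stand-ins for the full 16S sequences.

```python
import csv
import io

# Hypothetical in-memory stand-in for raw_data/full_table.nochim.txt.
# seqA..seqD replace the real (long) 16S sequences used as column names.
SAMPLE = """sample seqA seqB seqC seqD
ProkA_R1_FI900.fastq 0 486 0 1263
ProkA_R1_FI901.fastq 4 223 0 544
"""

def summarize(handle):
    """Return (occurrence, abundance, richness) under the assumed definitions."""
    reader = csv.reader(handle, delimiter=" ")
    header = next(reader)
    seqs = header[1:]
    occurrence = {s: 0 for s in seqs}   # samples with >0 reads, per sequence
    abundance = {s: 0 for s in seqs}    # total reads, per sequence
    richness = {}                       # sequences with >0 reads, per sample
    for row in reader:
        if not row:
            continue
        sample, counts = row[0], [int(c) for c in row[1:]]
        richness[sample] = sum(1 for c in counts if c > 0)
        for seq, count in zip(seqs, counts):
            abundance[seq] += count
            if count > 0:
                occurrence[seq] += 1
    return occurrence, abundance, richness

occurrence, abundance, richness = summarize(io.StringIO(SAMPLE))
print(occurrence)
print(abundance)
print(richness)
```

To run it on the real file, pass an open file handle instead of the `io.StringIO` sample.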
-
raw_data/RuminOmics_Animal_Phenotypes_for_Mizrahi_v2_plus_rt_quantification_with_total_20170921_and_depth.xlsx
: This file contains all the metadata available on each of the cows participating in the study. It contains 3 sheets, each holding the same data in a different format (the second sheet is for Excel compatibility). It includes each cow's ID and location, breed, lactation, diet, secretion and more (too long to detail here). As of now, it is mainly used to identify the cows. Note that the name of the farm is in its long version (NUDC/Franciosi/etc.) and not the short one used in the paper (UK1/IT2/etc.).
The main data files that were created and used as part of the study:
-
output/ASV_processed_data.csv
: The file details the abundance of each ASV entity, in each cow, in a long format. Farm and country data is included as well. The format of the file:
Country | Farm | Cow_Code | ASV_ID | Abundance |
---|---|---|---|---|
UK | NUDC | UK161 | ASV_00460 | 46 |
IT | Franciosi | IT641 | ASV_00259 | 26 |
Note that here too, the farm name is in a long version.
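Because some files use the long farm name and others the short one, combining them requires a name mapping. The sketch below is a hypothetical helper, not part of the project's code. It lists only the two long/short pairs that can be read off the example rows on this page (the same cows appear under both names). The remaining farms are deliberately left out and would have to be filled in from the metadata file.

```python
# Long-to-short farm names. Only the two pairs that appear in the examples
# on this page are listed; the remaining farms are intentionally left out.
FARM_SHORT_NAMES = {
    "NUDC": "UK1",
    "Franciosi": "IT2",
}

def to_short_farm_name(long_name):
    """Return the short farm code, or None if the farm is not in the mapping."""
    return FARM_SHORT_NAMES.get(long_name)

# Example rows in the layout of output/ASV_processed_data.csv.
rows = [
    {"Country": "UK", "Farm": "NUDC", "Cow_Code": "UK161",
     "ASV_ID": "ASV_00460", "Abundance": 46},
    {"Country": "IT", "Farm": "Franciosi", "Cow_Code": "IT641",
     "ASV_ID": "ASV_00259", "Abundance": 26},
]

# Copy each row, replacing the long farm name with the short one.
short_rows = [{**row, "Farm": to_short_farm_name(row["Farm"])} for row in rows]
print([row["Farm"] for row in short_rows])
```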
-
output/core_ASV_[05/30/50].csv
: The file contains the abundance data of core ASV entities, in a format similar to that of output/ASV_processed_data.csv. An ASV is considered core if it appears in a certain percentage of all the cows, corresponding to the number in the file name (5%/30%/50% of cows). File format:
Country | Farm | Cow_Code | ASV_ID | Abundance |
---|---|---|---|---|
UK | UK1 | UK161 | ASV_00460 | 46 |
IT | IT2 | IT641 | ASV_00259 | 26 |
Note that this time the farm name is in a short version.
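The core-ASV selection described above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual script: an ASV counts as present in a cow when its abundance is greater than zero, and the threshold is treated as inclusive ("at least 5%/30%/50% of cows"). The sample rows are hypothetical (cow `UK162` is invented for the example) and follow the long format of output/ASV_processed_data.csv, assuming a comma-delimited file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical long-format sample in the layout of output/ASV_processed_data.csv.
SAMPLE = """Country,Farm,Cow_Code,ASV_ID,Abundance
UK,NUDC,UK161,ASV_00460,46
UK,NUDC,UK162,ASV_00460,12
IT,Franciosi,IT641,ASV_00460,3
IT,Franciosi,IT641,ASV_00259,26
"""

def core_asvs(rows, threshold):
    """Return the set of ASV IDs present in at least `threshold` of all cows."""
    cows = set()
    cows_per_asv = defaultdict(set)
    for row in rows:
        cows.add(row["Cow_Code"])
        if int(row["Abundance"]) > 0:   # assumption: presence means >0 reads
            cows_per_asv[row["ASV_ID"]].add(row["Cow_Code"])
    return {asv for asv, seen in cows_per_asv.items()
            if len(seen) / len(cows) >= threshold}

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
core_50 = core_asvs(rows, 0.50)

# Keep only the rows that belong to core ASVs, as in core_ASV_50.csv.
core_rows = [r for r in rows if r["ASV_ID"] in core_50]
print(sorted(core_50))
```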
-
HPC/exp_1/[experiment id]_[job id]_Farm_[farm name]_COOC.csv
: The file contains the results of the co-occurrence analysis for each pair of nodes (ASVs) in a specific farm, whether significant or not. The data includes each ASV's name as well as its serial number.
| | sp1 | sp2 | weight | sp1_inc | sp2_inc | obs_cooccur | exp_cooccur | p_lt | p_gt | sp1_name | sp2_name | level | level_name | edge_type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 2 | 0.473 | 184 | 88 | 87 | 87.5 | 0.47568 | 1 | ASV_00001 | ASV_00002 | Farm | IT1 | not_significant |
2 | 1 | 3 | 0.995 | 184 | 185 | 184 | 184 | 1 | 1 | ASV_00001 | ASV_00003 | Farm | IT1 | not_significant |
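As an illustration of how the `p_lt` and `p_gt` columns could relate to `edge_type`, the sketch below labels a pair from its two p-values. It rests on several assumptions that this page does not confirm: that `p_lt` and `p_gt` are the probabilities of co-occurring less often or more often than observed under random distribution (the column names resemble the output of the R `cooccur` package), that the significance threshold is 0.05, and that significant pairs are labelled `pos` and `neg`.

```python
ALPHA = 0.05  # assumed threshold; the value used in the study is not stated here

def classify_pair(p_lt, p_gt, alpha=ALPHA):
    """Label a pair from its p-values (assumed cooccur-style convention)."""
    if p_gt < alpha:
        return "pos"    # co-occur more often than expected by chance
    if p_lt < alpha:
        return "neg"    # co-occur less often than expected by chance
    return "not_significant"

# The two example rows from the table above.
pairs = [
    {"sp1_name": "ASV_00001", "sp2_name": "ASV_00002", "p_lt": 0.47568, "p_gt": 1.0},
    {"sp1_name": "ASV_00001", "sp2_name": "ASV_00003", "p_lt": 1.0, "p_gt": 1.0},
]
labels = [classify_pair(p["p_lt"], p["p_gt"]) for p in pairs]
print(labels)
```

Under these assumptions both example rows come out as `not_significant`, which matches their `edge_type` column.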
-
HPC/exp_1/[experiment id]_[job id]_Farm_[farm name]_edge_list.csv
: The file contains the data for the edges of the network that was built using only the significant values from the co-occurrence analysis (both positive and negative), in edge-list format. These are only intra-layer edges. The nodes are identified by name (and not by ID).
from | to | weight | edge_type | level | level_name |
---|---|---|---|---|---|
ASV_00001 | ASV_00007 | 0.978 | pos | Farm | IT1 |
ASV_00001 | ASV_00035 | 0.978 | pos | Farm | IT1 |
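An edge list in this layout can be turned into an adjacency structure for network analysis. The sketch below assumes a comma-delimited file with the header shown above and treats co-occurrence edges as undirected. Both are assumptions, and the helper name `load_adjacency` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# In-memory sample using the two example rows from the table above.
SAMPLE = """from,to,weight,edge_type,level,level_name
ASV_00001,ASV_00007,0.978,pos,Farm,IT1
ASV_00001,ASV_00035,0.978,pos,Farm,IT1
"""

def load_adjacency(handle):
    """Build {node: {neighbor: (weight, edge_type)}} from an edge-list CSV."""
    adjacency = defaultdict(dict)
    for row in csv.DictReader(handle):
        edge = (float(row["weight"]), row["edge_type"])
        # Assumption: co-occurrence edges are undirected, so store both directions.
        adjacency[row["from"]][row["to"]] = edge
        adjacency[row["to"]][row["from"]] = edge
    return dict(adjacency)

adjacency = load_adjacency(io.StringIO(SAMPLE))

# Node degree: the number of distinct neighbors per ASV.
degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
print(degree)
```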
- More network data files are produced as output, but as they only contain a subset of the data presented here or present the data in a different format, only the most useful files are detailed here.