Skip to content

vgl-hub/rdeval-manuscript

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 

Repository files navigation

rdeval-manuscript

Scripts used for the rdeval manuscript.

Compute raw reads summary statistics for the Vertebrate Genomes Project

The following commands show how to access SRA accession numbers and metadata from NCBI, download the raw reads and generate summary statistics with rdeval. The process is heavily parallelized, with approximately 4 cores used by each rdeval process, and 8 experiments processed in parallel (32 cores total). SRA samples can optionally be randomly sampled.

First, we can use NCBI datasets and jq to parse the VGP umbrella BioProject and collect all SRA accessions associated with its genomes (here is jq's manual):

datasets summary genome accession PRJNA489243 > vgp-metadata.json
cat vgp-metadata.json | jq -r '.reports[] | .accession + "," + .assembly_info.assembly_name + "," + (.assembly_info.biosample.sample_ids[] | select(.db=="SRA").value)' > raw_data_metadata.ls
cat raw_data_metadata.ls

Then we can download the data (requires fasterq-dump in $PATH) and compute summary statistics by generating .rd files (requires rdeval in $PATH):

bash rdeval_parallel.sh raw_data_metadata.ls 0.5 // sampling fraction, 1 for full data set

We aggregate metrics from multiple .rd files by either:

rdeval *.rd

Or if they don't fit in memory (i.e. in the hundreds):

for file in *.rd; do rdeval --tab $file | cut -f2 | sed "1q;d"; done | awk '{sum += $1}END{print sum}' # total number of reads
for file in *.rd; do rdeval --tab $file | cut -f2 | sed "2q;d"; done | awk '{sum += $1}END{print sum}' # total number of bases

About

Scripts used for the rdeval manuscript

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages