rdeval-manuscript

Scripts used for the rdeval manuscript.

Compute raw reads summary statistics for the Vertebrate Genomes Project

The following commands show how to access SRA accession numbers and metadata from NCBI, download the raw reads and generate summary statistics with rdeval. The process is heavily parallelized, with approximately 4 cores used by each rdeval process, and 8 experiments processed in parallel (32 cores total). SRA samples can optionally be randomly sampled.

First, we can use NCBI datasets and jq to parse the VGP umbrella BioProject and collect all SRA accessions associated with its genomes (here is jq's manual):

datasets summary genome accession PRJNA489243 > vgp-metadata.json
cat vgp-metadata.json | jq -r '.reports[] | .accession + "," + .assembly_info.assembly_name + "," + (.assembly_info.biosample.sample_ids[] | select(.db=="SRA").value)' > raw_data_metadata.ls
cat raw_data_metadata.ls

Then we can download the data (requires fasterq-dump in $PATH) and compute summary statistics by generating .rd files (requires rdeval in $PATH):

bash rdeval_parallel.sh raw_data_metadata.ls 0.5 // sampling fraction, 1 for full data set

We aggregate metrics from multiple .rd files by either:

rdeval *.rd

Or if they don't fit in memory (i.e. in the hundreds):

for file in *.rd; do rdeval --tab $file | cut -f2 | sed "1q;d"; done | awk '{sum += $1}END{print sum}' # total number of reads
for file in *.rd; do rdeval --tab $file | cut -f2 | sed "2q;d"; done | awk '{sum += $1}END{print sum}' # total number of bases

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
LICENSE		LICENSE
README.md		README.md
rdeval_parallel.sh		rdeval_parallel.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

rdeval-manuscript

Compute raw reads summary statistics for the Vertebrate Genomes Project

About

Releases

Packages

Languages

License

vgl-hub/rdeval-manuscript

Folders and files

Latest commit

History

Repository files navigation

rdeval-manuscript

Compute raw reads summary statistics for the Vertebrate Genomes Project

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages