Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calculating PacBio Hifi data sample's genome size with MASH #42

Open
danejo3 opened this issue May 3, 2023 · 0 comments
Open

Calculating PacBio Hifi data sample's genome size with MASH #42

danejo3 opened this issue May 3, 2023 · 0 comments

Comments

@danejo3
Copy link
Collaborator

danejo3 commented May 3, 2023

In #37, calculating the genome size of PacBio Hifi-reads were explored. One of the curious cases that came out of it was that the estimated size did not match the expected size. Sometimes the estimated size calculated by MASH was x2, x3, or more.

The data that was used for testing was Ecoli K12. Ecoli K12 has a genome size of ~4.6 mb (see paper here). It is a very well studied bacteria.

Below is a few MASH calculations.

command parameters estimated genome size coverage
mash sketch [fastq] -r 33.6 mb 40x
mash sketch [fastq] -k 32 -r 51.3 mb 24x
mash sketch [fastq] -k 32 -m 3 -r 11 mb 117x
mash sketch [fastq] -k 21 -m 3 -r 8.7 mb 144x

-r = read set
-k = k-mer size
-m = Minimum copies of each k-mer required to pass noise filter for reads.

While the estimation of the genome was quite off, the largest assembled contig was 4.46 mb with Canu (242 contigs) and 4.64 mb with Flye (1 contig)--close to the expected genome size.

Interestingly, Flye was able to assemble the entire ecoli genome!

In addition to ecoli data, tests on SARS-CoV-2 Omicron variant data was tested. This data was collected from PacBio's public data here.

The expected genome size of SARS-CoV-2 is about 29.8 kb to 29.9 kb. Like the ecoli data, MASH was estimating the genome size x2, x3, or more.

Ran the SARS-CoV-2 data with Canu only because the reads were too short for Flye. However, in order to run Canu with the dataset, the following flags were added to the assembly.

canu -pacbio-hifi seq/input/sample.fq.gz maxThreads=16 -p sample -d analysis/canu genomeSize=29.9k minReadLength=550 minOverlapLength=50 minInputCoverage=0.4 stopOnLowCoverage=0.4

After running the assembler, only 3 contigs were assembled with the longest having the length of 2900 bp--way too short.

Talked to the analysts about the dataset and it was determined that the dataset were extremely poor.

As of now, I am unsure why MASH is getting highly overestimated genome sizes.

A suggestion was made to downsample the reads because of the extreme depth of coverage that Hifi reads provides; however, even then, the estimated size was still off depending on the downsampled coverage.

downsample reads = (total reads * coverage) / (average read length)

Used seqtk to downsample.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant