Can sourmash detect contamination from viral and eukaryotic genomes? #1675

asaldivar93 · 2021-07-18T04:58:26Z

Hi, a big thank you to the developers of this useful and versatile tool. I am using sourmash to detect and remove undesired prokaryotes from WGS datasets in my de novo assembly pipeline for prokaryotic genomes. I was thinking I could use it in a similar way to detect and remove viral and eukaryotic contamination. My first thought was to build a sketch file from NCBI's RefSeq database containing genomes from microbial viruses and microbial eukaryotes. However, I have two doubts:

First, I am not sure whether sourmash is adequate for this analysis across such distant lineages.
Second, I am not sure of the computational requirements for such an analysis. I run my pipeline on a standard desktop PC (16 GB RAM), and I don't even know whether building the sketch file on my PC will be possible.

If you have any thoughts or recommendations I would greatly appreciate them.

ctb · 2021-07-18T14:19:06Z

Hi @asaldivar93, the short answer is that it should work technically, within the memory constraints you have, but it may not be able to do what you want scientifically!

We also have slightly out-of-date NCBI databases for GenBank fungi, protozoa, and viruses that I'd be happy to make available to you. They're actually pretty small. (We're working on building new and updated ones, too; I'll take your issue as a vote to work harder on that :) )

Scientifically, you're going to run into two problems in using sourmash for this:

First, the sets of known genomes, especially those in NCBI, aren't really that complete for eukaryotes and viruses.

Second, sourmash's ability to reach out across evolutionary distance to detect matches is not great. At the moment it's effectively a database lookup tool. So if you don't have the right species in the database, you're not going to find it as a contaminant.

Other than those two issues, you should be fine 😁
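
To illustrate the second point, here is a minimal sketch of why k-mer matching behaves like a database lookup. It is not sourmash's implementation: sourmash compares compressed sketches of k-mer hashes, while this toy compares exact k-mer sets on random sequences, and the helper names (`kmers`, `mutate`, `containment`) are made up for the example. The assumptions are a uniform per-base substitution rate and k=31, a k-mer size commonly used with sourmash for DNA. Under that simple model a k-mer survives only if none of its k bases changed, so the shared fraction is roughly (1 - d)^k for divergence d.

```python
import random

def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mutate(seq, rate, rng):
    """Substitute each base independently with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            # Always pick a different base so the position really changes.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def containment(query, reference, k=31):
    """Fraction of the query's k-mers that also occur in the reference."""
    q = kmers(query, k)
    return len(q & kmers(reference, k)) / len(q)

rng = random.Random(42)
# Toy "database genome": 50 kb of random sequence (an assumption, not real data).
reference = "".join(rng.choice("ACGT") for _ in range(50_000))

results = {}
for divergence in (0.0, 0.01, 0.05, 0.20):
    query = mutate(reference, divergence, rng)
    results[divergence] = containment(query, reference)
    print(f"divergence {divergence:.2f}: k=31 containment {results[divergence]:.3f}")
```

In this toy model the shared fraction falls off steeply as divergence grows, which is why a contaminant whose species (or a very close relative) is missing from the database is unlikely to show up as a match.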

BTW, you might also be interested in charcoal. And a while back, @bluegenes and @taylorreiter were talking about using OrthoDB as a source for building databases, but we haven't done much on that since, sorry!

ctb · 2025-01-22T12:54:24Z

Please see #3504 for comprehensive eukaryotic databases.
