Can sourmash detect contamination from viral and eukaryotic genomes? #1675

Open
asaldivar93 opened this issue Jul 18, 2021 · 2 comments
Comments

@asaldivar93

Hi, a big thank you to the developers of this useful and versatile tool. I am using sourmash to detect and remove undesired prokaryotes from WGS datasets in my de novo assembly pipeline for prokaryotic genomes. I was thinking I could use it in a similar way to detect and remove viral and eukaryotic contamination. My first thought was to build a sketch file from the genomes of microbial viruses and microbial eukaryotes in NCBI's RefSeq database. However, I have two doubts:

First, I am not sure whether sourmash is adequate for this kind of analysis on such distant lineages.
Second, I am not sure of the computational requirements to run such an analysis. I run my pipeline on a standard desktop PC (16 GB RAM), and I don't even know whether building the sketch file on my PC will be possible.

If you have any thoughts or recommendations I would greatly appreciate them.

@ctb
Contributor

ctb commented Jul 18, 2021

hi @asaldivar93, the short answer is that it should work technically, within the memory constraints you have, but may not be able to do what you want scientifically!

We do have slightly out of date NCBI databases for GenBank fungi, protozoa, and viruses that I'd be happy to make available to you, also. They're actually pretty small. (We're working on building new & updated ones, too; I'll take your issue as a vote to work harder on that :).
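
To illustrate why sketch databases can stay small enough for a desktop machine: sourmash stores a subsample of k-mer hashes rather than the sequences themselves, keeping roughly one hash in every `scaled` k-mers. The following is only a minimal sketch of that idea under stated assumptions, not sourmash's implementation. It uses `hashlib.blake2b` as a stand-in hash function, a random string as a toy genome, `k=31` and `scaled=1000` as illustrative parameters, and it ignores reverse-complement canonicalization.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (blake2b is a stand-in hash here)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq, k=31, scaled=1000):
    """Keep only hashes below MAX_HASH / scaled, i.e. about 1/scaled of all k-mers."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


# Toy 200 kb "genome" (assumption: random sequence, for illustration only).
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(200_000))

sketch = frac_minhash(genome)
n_kmers = len(genome) - 31 + 1
print(f"{n_kmers} k-mers reduced to {len(sketch)} retained hashes")
```

The retained set holds on the order of `n_kmers / scaled` 64-bit integers, so the storage needed per genome scales with genome size divided by `scaled` rather than with the raw sequence length.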

Scientifically, you're going to run into two problems in using sourmash for this:

first, the sets of known genomes, especially those in NCBI, aren't really that complete for euk/viruses.

second, sourmash's ability to reach out across evolutionary distance to detect matches is not great. At the moment it's effectively a database lookup tool. So if you don't have the right species in the database, you're not going to find it as a contaminant.

Other than those two issues, you should be fine 😁
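
The second problem above can be illustrated with a toy calculation. Exact k-mer matching loses signal quickly as sequences diverge, because a single substitution destroys every k-mer that overlaps it; with independent substitutions at rate p, roughly (1 - p)^k of the k-mers survive. The example below is a hedged sketch, not sourmash code: it compares exact k-mer sets without any hashing or subsampling, uses random sequences, and models divergence as uniform point substitutions only. The helper names (`kmer_set`, `mutate`, `containment`) are hypothetical.

```python
import random


def kmer_set(seq, k=31):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mutate(seq, rate, rng):
    """Substitute each base with a different base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def containment(query, reference, k=31):
    """Fraction of the query's k-mers that also occur in the reference."""
    q = kmer_set(query, k)
    r = kmer_set(reference, k)
    return len(q & r) / len(q)


rng = random.Random(42)
# Toy reference "genome" standing in for a database entry.
reference = "".join(rng.choice("ACGT") for _ in range(50_000))

results = {}
for rate in (0.0, 0.01, 0.05, 0.20):
    # The "contaminant" is a diverged relative of the reference.
    contaminant = mutate(reference, rate, rng)
    results[rate] = containment(contaminant, reference)
    print(f"divergence {rate:.2f}: containment {results[rate]:.3f}")
```

Under these assumptions, containment falls from 1.0 for an identical sequence to well under 1% at 20% divergence, which is why a contaminant whose species is missing from the database is unlikely to be reported.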

BTW, you might also be interested in charcoal. And a while back, @bluegenes and @taylorreiter were talking about using OrthoDB as a source for building databases, but we haven't done much on that since, sorry!

@ctb
Contributor

ctb commented Jan 22, 2025

please see #3504 for comprehensive eukaryotic databases
