how should we handle combining taxonomies with sourmash tax? #1603

Open
bluegenes opened this issue Jun 18, 2021 · 3 comments

@bluegenes
Contributor

For now, if an identifier is found in multiple lineage spreadsheets, the reported lineage comes from the last spreadsheet given as input: reading each spreadsheet updates the taxonomy dictionary, overwriting any lineage already stored for a duplicated identifier.
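
To make the "last spreadsheet wins" behavior concrete, here is a minimal sketch. It is not the sourmash code: the function name `load_lineages`, the column names `ident` and `lineage`, and the identifiers are all made up for illustration, and the spreadsheets are passed as in-memory CSV text.

```python
import csv
import io


def load_lineages(csv_texts):
    """Read lineage tables in order into one dict (hypothetical helper).

    Each table is CSV text with assumed columns 'ident' and 'lineage'.
    A later table silently overwrites the lineage of a repeated ident.
    """
    tax = {}
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            tax[row["ident"]] = row["lineage"]  # overwrite, no warning
    return tax


# Made-up example tables: ident_A appears in both.
gtdb = "ident,lineage\nident_A,d__Bacteria;p__Proteobacteria\n"
ncbi = (
    "ident,lineage\n"
    "ident_A,Bacteria;Pseudomonadota\n"
    "ident_B,Eukaryota;Ascomycota\n"
)

gtdb_then_ncbi = load_lineages([gtdb, ncbi])
ncbi_then_gtdb = load_lineages([ncbi, gtdb])
```

With these inputs, `gtdb_then_ncbi["ident_A"]` holds the NCBI lineage and `ncbi_then_gtdb["ident_A"]` holds the GTDB one, so the result depends only on input order.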

If we want this to work differently, how might we want it to work?

  • e.g. -- prefer gtdb for bacteria and archaea?
    • allow overwriting, as above, or produce separate NCBI spreadsheet with GTDB identifiers removed?
  • produce an aggregated lineages file for all our databases?
@ctb
Contributor

ctb commented Jun 18, 2021

hot takes -

  • this is a somewhat expert use case, for now. so we can alert users but don't need to protect them from bad decisions :)
  • so, in particular, if we overwrite identifiers with different tax spreadsheets, we should alert the user; this can be either a warning (that can be turned into an error with a flag), or an error (that can be turned into a warning with a flag). I like the error-but-can-be-overridden idea
  • we could also explicitly detect "colliding" taxonomies for NCBI and GTDB, e.g. "I see you have both Bacteria and d__Bacteria in these tax spreadsheets slash results, this is probably a bad idea, are you sure?"
    • the use case I see for allowing colliding taxonomies is that GTDB will always be a subset of NCBI, so maybe we want to let users know if there are things found in NCBI Bacteria/Archaea that are not found in GTDB?
  • backing up a bit, my goal here is to allow community comparison with a combination of NCBI taxonomy (for euks and viruses) and GTDB taxonomy (for bac and archaea). This is a cool feature that was sort-of inspired by functionality needed for charcoal ;). So there are real world use cases! And sourmash can do it!
  • so... final thought for now... perhaps we could provide the appropriate GTDB+NCBI taxonomy spreadsheets ourselves, for our databases?
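
A rough sketch of the two checks suggested above: an error on overwritten identifiers that a flag can downgrade, and a check for `Bacteria` appearing alongside `d__Bacteria`. Everything here is hypothetical (the names `merge_lineages`, `mixes_ncbi_and_gtdb`, `LineageCollisionError`, the `force` flag, and the example identifiers); it is not an existing sourmash API, and the tables are plain dicts of ident to lineage string.

```python
class LineageCollisionError(ValueError):
    """Hypothetical error: an ident has different lineages in two tables."""


def merge_lineages(tables, force=False):
    """Merge ident -> lineage dicts in order, later tables winning.

    Raises LineageCollisionError if any ident would be overwritten with a
    different lineage, unless force=True. Returns (merged, collisions) so
    the caller can warn about what was overwritten.
    """
    merged = {}
    collisions = set()
    for table in tables:
        for ident, lineage in table.items():
            if ident in merged and merged[ident] != lineage:
                collisions.add(ident)
            merged[ident] = lineage
    if collisions and not force:
        raise LineageCollisionError(
            f"{len(collisions)} identifier(s) have conflicting lineages: "
            + ", ".join(sorted(collisions))
        )
    return merged, sorted(collisions)


def mixes_ncbi_and_gtdb(lineages):
    """True if both 'Bacteria' and 'd__Bacteria' (or the Archaea pair)
    appear as the first rank among the given lineage strings."""
    tops = {lineage.split(";")[0] for lineage in lineages}
    return any(
        name in tops and "d__" + name in tops
        for name in ("Bacteria", "Archaea")
    )


# Made-up example tables.
gtdb = {"ident_A": "d__Bacteria;p__Proteobacteria"}
ncbi = {
    "ident_A": "Bacteria;Pseudomonadota",
    "ident_B": "Eukaryota;Ascomycota",
}

merged, overwritten = merge_lineages([gtdb, ncbi], force=True)
mixed = mixes_ncbi_and_gtdb(list(gtdb.values()) + list(ncbi.values()))
```

Calling `merge_lineages([gtdb, ncbi])` without `force=True` raises, which matches the error-but-can-be-overridden option; the opposite default (warn unless a strict flag is set) would only change the condition on `force`.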

@ctb
Contributor

ctb commented Apr 21, 2022

@ctb
Contributor

ctb commented Jan 22, 2025

ref #3504
