how should we handle combining taxonomies with `sourmash tax`? #1603

bluegenes · 2021-06-18T01:56:24Z

For now, if an identifier is found in multiple lineage spreadsheets, the reported lineage comes from the last spreadsheet given as input: reading each spreadsheet updates the taxonomy dictionary, overwriting any lineage information already stored for a duplicated identifier.
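A minimal sketch of the "last spreadsheet wins" behavior described above, in plain Python. This is not sourmash's actual loading code: `load_lineages`, the two-column `ident,lineage` layout, and the example identifiers are all hypothetical and simplified for illustration.

```python
import csv
import io


def load_lineages(csv_texts):
    """Hypothetical loader: read each spreadsheet in order into one dict.

    Because every row simply assigns taxonomy[ident], a later spreadsheet
    silently overwrites the lineage stored by an earlier one.
    """
    taxonomy = {}
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            taxonomy[row["ident"]] = row["lineage"]
    return taxonomy


# Simplified stand-ins for an NCBI and a GTDB lineage spreadsheet.
ncbi_csv = (
    "ident,lineage\n"
    "GCF_1,Bacteria;Proteobacteria\n"
    "GCF_2,Eukaryota;Ascomycota\n"
)
gtdb_csv = (
    "ident,lineage\n"
    "GCF_1,d__Bacteria;p__Proteobacteria\n"
)

merged = load_lineages([ncbi_csv, gtdb_csv])
print(merged["GCF_1"])  # lineage from whichever spreadsheet was read last
```

Swapping the input order to `[gtdb_csv, ncbi_csv]` would make the NCBI lineage win for `GCF_1` instead, with no message to the user either way.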

If we want this to work differently, how might we want it to work?

e.g.
- prefer GTDB for bacteria and archaea?
  - allow overwriting, as above, or produce a separate NCBI spreadsheet with GTDB identifiers removed?
- produce an aggregated lineages file for all our databases?

ctb · 2021-06-18T14:23:58Z

hot takes -

- this is a somewhat expert use case, for now. so we can alert users but don't need to protect them from bad decisions :)
- so, in particular, if we overwrite identifiers with different tax spreadsheets, we should alert the user; this can be either a warning (that can be turned into an error with a flag), or an error (that can be turned into a warning with a flag). I like the error-but-can-be-overridden idea
- we could also explicitly detect "colliding" taxonomies for NCBI and GTDB, e.g. "I see you have both Bacteria and d__Bacteria in these tax spreadsheets slash results, this is probably a bad idea, are you sure?"
  - the use case I see for allowing colliding taxonomies is that GTDB will always be a subset of NCBI, so maybe we want to let users know if there are things found in NCBI Bacteria/Archaea that are not found in GTDB?
- backing up a bit, my goal here is to allow community comparison with a combination of NCBI taxonomy (for euks and viruses) and GTDB taxonomy (for bac and archaea). This is a cool feature that was sort-of inspired by functionality needed for charcoal ;). So there are real world use cases! And sourmash can do it!
- so... final thought for now... perhaps we could provide the appropriate GTDB+NCBI taxonomy spreadsheets ourselves, for our databases?
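To make the error-but-can-be-overridden idea and the "colliding taxonomies" check concrete, here is a rough sketch in plain Python. Nothing here is an existing sourmash API or flag: `merge_taxonomies`, its `force` parameter, `LineageCollisionError`, and `mixes_ncbi_and_gtdb` are hypothetical names, and the sketch assumes a collision means the same identifier appearing with two different lineages.

```python
class LineageCollisionError(ValueError):
    """Hypothetical error for identifiers with conflicting lineages."""


def merge_taxonomies(tables, force=False):
    """Merge ident -> lineage dicts in order; later tables win.

    Raises LineageCollisionError on conflicting identifiers unless
    force=True, in which case the colliding identifiers are returned
    so the caller can print a warning instead.
    """
    merged = {}
    collisions = []
    for table in tables:
        for ident, lineage in table.items():
            if ident in merged and merged[ident] != lineage:
                collisions.append(ident)
            merged[ident] = lineage
    if collisions and not force:
        raise LineageCollisionError(
            f"{len(collisions)} identifier(s) have conflicting lineages, "
            f"e.g. {collisions[0]}; pass force=True to overwrite"
        )
    return merged, collisions


def mixes_ncbi_and_gtdb(lineages):
    """True if both 'Bacteria' and 'd__Bacteria' (or the Archaea pair)
    appear as top-level names, i.e. NCBI and GTDB names are mixed."""
    tops = {lineage.split(";")[0] for lineage in lineages}
    return any(
        name in tops and f"d__{name}" in tops
        for name in ("Bacteria", "Archaea")
    )


ncbi = {
    "GCF_1": "Bacteria;Proteobacteria",
    "GCF_2": "Eukaryota;Ascomycota",
    "GCF_3": "Bacteria;Firmicutes",
}
gtdb = {"GCF_1": "d__Bacteria;p__Proteobacteria"}

try:
    merge_taxonomies([ncbi, gtdb])
except LineageCollisionError as exc:
    print(f"error: {exc}")

# Overriding the error: the merge proceeds and collisions are reported.
merged, collisions = merge_taxonomies([ncbi, gtdb], force=True)
print("overwritten:", collisions)
print("mixed NCBI/GTDB names:", mixes_ncbi_and_gtdb(merged.values()))
```

In this toy data, `GCF_3` keeps its NCBI lineage while `GCF_1` switches to GTDB, so the merged result mixes `Bacteria` and `d__Bacteria`, which is exactly the situation the "are you sure?" message would be for.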

ctb · 2022-04-21T15:49:42Z

ref https://twitter.com/shenwei356/status/1517167442094850048

ctb · 2025-01-22T12:53:51Z

ref #3504

bluegenes added the taxonomy label Jun 18, 2021

ctb mentioned this issue Aug 13, 2022

translating between taxonomies - maybe a sourmash tax translate? #2201

Open
