For now, if an identifier is found in multiple lineage spreadsheets, the reported lineage will come from the last spreadsheet input: reading each spreadsheet updates the taxonomy dictionary, overwriting prior lineage information for duplicated identifiers.
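To make the current behavior concrete, here is a minimal sketch of "last spreadsheet wins". It is not the actual sourmash loading code: `load_taxonomy`, the identifiers, and the semicolon-separated lineage strings are all made up for illustration.

```python
# Hypothetical data: identifier -> lineage string, one dict per spreadsheet.
ncbi_sheet = {
    "GCF_000001": "Bacteria;Proteobacteria",
    "GCF_000002": "Eukaryota;Ascomycota",
}
gtdb_sheet = {
    "GCF_000001": "d__Bacteria;p__Proteobacteria",
}


def load_taxonomy(sheets):
    """Merge spreadsheets in input order; later sheets overwrite earlier ones."""
    taxonomy = {}
    for sheet in sheets:
        # dict.update silently replaces the lineage of any duplicated identifier
        taxonomy.update(sheet)
    return taxonomy


merged = load_taxonomy([ncbi_sheet, gtdb_sheet])
print(merged["GCF_000001"])  # lineage from the last sheet that contains this identifier
```

Swapping the input order swaps which lineage is reported for `GCF_000001`, which is why the order of spreadsheets currently matters.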
If we want this to work differently, how might we want it to work?
e.g. prefer GTDB for bacteria and archaea?
allow overwriting, as above, or produce a separate NCBI spreadsheet with GTDB identifiers removed?
produce an aggregated lineages file for all our databases?
this is a somewhat expert use case, for now. so we can alert users but don't need to protect them from bad decisions :)
so, in particular, if we overwrite identifiers with different tax spreadsheets, we should alert the user; this can be either a warning (that can be turned into an error with a flag), or an error (that can be turned into a warning with a flag). I like the error-but-can-be-overridden idea
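A rough sketch of the error-but-can-be-overridden idea, assuming each spreadsheet has already been read into a dict of identifier to lineage string. `merge_lineages`, `LineageCollisionError`, and the `force` parameter are hypothetical names, not existing sourmash API.

```python
import warnings


class LineageCollisionError(ValueError):
    """Raised when an identifier would be overwritten with a different lineage."""


def merge_lineages(sheets, force=False):
    """Merge identifier -> lineage dicts in order.

    A conflicting duplicate is an error by default; with force=True it is
    downgraded to a warning and the later lineage wins.
    """
    taxonomy = {}
    for sheet in sheets:
        for ident, lineage in sheet.items():
            previous = taxonomy.get(ident)
            if previous is not None and previous != lineage:
                msg = (
                    f"identifier {ident!r} has conflicting lineages: "
                    f"{previous!r} vs {lineage!r}"
                )
                if not force:
                    raise LineageCollisionError(msg)
                warnings.warn(msg)
            taxonomy[ident] = lineage
    return taxonomy
```

In this sketch, identical duplicates pass silently, so only genuine disagreements between spreadsheets trigger the error or warning.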
we could also explicitly detect "colliding" taxonomies for NCBI and GTDB, e.g. "I see you have both Bacteria and d__Bacteria in these tax spreadsheets slash results, this is probably a bad idea, are you sure?"
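One possible way to detect that kind of collision, sketched under the assumption that lineages are semicolon-separated strings and that GTDB top-level names carry the `d__` prefix while NCBI names do not. `colliding_domains` is a hypothetical helper.

```python
def colliding_domains(lineages):
    """Return top-level names that appear both bare (NCBI-style) and
    with a 'd__' prefix (GTDB-style) in the given lineage strings."""
    tops = {lineage.split(";")[0].strip() for lineage in lineages}
    return {
        name
        for name in tops
        if not name.startswith("d__") and f"d__{name}" in tops
    }


lineages = [
    "Bacteria;Proteobacteria",
    "d__Bacteria;p__Proteobacteria",
    "Eukaryota;Ascomycota",
]
collisions = colliding_domains(lineages)
if collisions:
    print(f"both NCBI-style and GTDB-style lineages found for: {sorted(collisions)}")
```

A mix of GTDB bacteria with NCBI eukaryotes does not count as a collision here, which matches the intended combined use case.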
the use case I see for allowing colliding taxonomies is that GTDB will always be a subset of NCBI, so maybe we want to let users know if there are things found in NCBI Bacteria/Archaea that are not found in GTDB?
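A sketch of what that report could look like, again assuming identifier-to-lineage dicts with semicolon-separated lineage strings. The function name and the sample identifiers are invented for illustration.

```python
def ncbi_prokaryotes_missing_from_gtdb(ncbi, gtdb):
    """List identifiers that NCBI places in Bacteria or Archaea
    but that have no entry in the GTDB lineages."""
    prokaryote_domains = {"Bacteria", "Archaea"}
    return sorted(
        ident
        for ident, lineage in ncbi.items()
        if lineage.split(";")[0].strip() in prokaryote_domains
        and ident not in gtdb
    )


ncbi = {
    "id1": "Bacteria;Proteobacteria",
    "id2": "Archaea;Euryarchaeota",
    "id3": "Eukaryota;Ascomycota",
}
gtdb = {"id1": "d__Bacteria;p__Proteobacteria"}
missing = ncbi_prokaryotes_missing_from_gtdb(ncbi, gtdb)
print(missing)
```

Eukaryotic and viral identifiers are skipped on purpose, since they are expected to have NCBI lineages only.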
backing up a bit, my goal here is to allow community comparison with a combination of NCBI taxonomy (for euks and viruses) and GTDB taxonomy (for bac and archaea). This is a cool feature that was sort-of inspired by functionality needed for charcoal ;). So there are real world use cases! And sourmash can do it!
so... final thought for now... perhaps we could provide the appropriate GTDB+NCBI taxonomy spreadsheets ourselves, for our databases?