-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix metaphlan profiles with duplicate entries by aggregating them #153
base: dev
Are you sure you want to change the base?
Conversation
thanks for your effort on this :) it seems to not err any longer, which is a great improvement! i tried w a pretty small/ and mostly manufactured test case built from subsetting and modifying one of the test files in the pr:
and i got:
i notice there is no entry for 0, which i was expecting in the result. given what i saw quickly in the comments/ code w this pr, i had expected that these two entries in particular would have wound up there:
was that not the intention, or is it a bug? given the coercion to 'counts' by multiplying by a large int, this loss of data seems particularly bad. it seems to me a person could not reliably recreate the relative abundances they got from metaphlan for the properly classified entities without at min retaining the unclassified and un-ncbi-mappable ones as unclassified. also, as i was looking at this i wondered.. for the two entries just above here, is it possible/ reasonable to assign them to 1239 (Firmicutes) rather than 0 (unclassified)? i didnt immediately think of a reason to not do that, but maybe i missed something. |
Thank you for taking such a thorough look at it. It indeed sounds like the implementation is still buggy. And assigning to the next higher rank that has an ID rather than to unclassified makes so much sense! Thank you for the suggestion |
There was a bug with operator precedence in pandas. Need to always use parentheses... |
I thought about your idea regarding ranks some more and now discovered that both the previous implementation and the one that you propose are wrong. The relative abundance at each rank sums to 100%. In your example,
Firmicutes comprise ~28% of all bacteria. At the next lower rank we have
which sum up to ~100% again. It would be wrong to add those abundances to 1239, as that will break the assumption that each rank sums to 100%. It is equally wrong to add them to the unclassified node. Instead, they should just be dropped from the profile. |
@@ -83,31 +85,35 @@ def transform( | |||
}, | |||
) | |||
) | |||
# MetaPhlAn sometimes assigns a taxon by name only, but no identifier exists. | |||
# We have no choice but to count those entries as unclassified at the moment. | |||
# We drop unclassified entries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I sort of feel we should warn for this if this happens (even if it's included in teh higher classified taxon), I'm not really happy with dropping 'valid' contents from the output of the tool when we just say we are essentially changing teh format...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a minimum in docs, but nice to have would be a warn on CLI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a warning emitted if you look a few lines below.
I see what you mean... And I suppose the two entities in the example are already counted towards 1239 now I think about it. Though removing entries means a rank won't sum to 100 any longer either, it'll just be below 100 rather than above. In the case they're removed, we're likely creating inconsistency in relative abundances from one taxonomic rank to another and also making it impossible to recreate the relative abundances originally produced by metaphlan. Maybe that's the best short term solution anyhow, but it seems a thing to explain carefully to people. I wonder what other options we have? Can we build a taxonomy from SGBs or something? Or other ideas not for now, but for a more complete solution later? |
Yeah, not great either way. I guess, one could think of one unclassified grouping per rank, but I don't know what identifier to assign to each. |
@d-callan would you be up for testing if this works as you would expect, please? You can install this directly with