Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix metaphlan profiles with duplicate entries by aggregating them #153

Open
wants to merge 9 commits into
base: dev
Choose a base branch
from

Conversation

Midnighter
Copy link
Contributor

@d-callan would you be up for testing if this works as you would expect, please? You can install this directly with

pip install https://github.com/taxprofiler/taxpasta/archive/fix-metaphlan.zip

@Midnighter Midnighter self-assigned this Oct 28, 2024
@Midnighter Midnighter requested a review from a team October 28, 2024 10:55
@d-callan
Copy link

thanks for your effort on this :) it seems to not err any longer, which is a great improvement! i tried w a pretty small/ and mostly manufactured test case built from subsetting and modifying one of the test files in the pr:

#mpa_vOct22_CHOCOPhlAnSGB_202212
#/usr/local/bin/metaphlan --nproc 6 --input_type fastq 2612.merged.fastq.gz --bowtie2out 2612_se_metaphlan4-db.metaphlan.bowtie2out.txt --bowtie2db metaphlan4 --index mpa_vOct22_CHOCOPhlAnSGB_202212 --biom 2612_se_metaphlan4-db.metaphlan.biom --output_file 2612_se_metaphlan4-db.metaphlan_profile.txt
#707440 reads processed
#SampleID       Metaphlan_Analysis
#clade_name     NCBI_tax_id     relative_abundance      additional_species
k__Bacteria     2       89.92809        
k__Archaea      2157    10.07191        
k__Bacteria|p__Bacteroidetes    2|976   43.03622        
k__Bacteria|p__Actinobacteria   2|201174        29.04669        
k__Bacteria|p__Firmicutes       2|1239  27.90168        
k__Bacteria|p__Firmicutes|c__Clostridia 2|1239|186801   95.1462 
k__Bacteria|p__Firmicutes|c__CFGB9120   2|1239| 2.26587 
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified    2|1239| 1.82756 
k__Bacteria|p__Firmicutes|c__Bacilli    2|1239|91061    0.90652

and i got:

taxonomy_id     count
2       89928090
2157    10071910
976     43036220
201174  29046690
1239    27901680
186801  95146200
91061   906520

i notice there is no entry for 0, which i was expecting in the result. given what i saw quickly in the comments/ code w this pr, i had expected that these two entries in particular would have wound up there:

k__Bacteria|p__Firmicutes|c__CFGB9120   2|1239| 2.26587 
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified    2|1239| 1.82756

was that not the intention, or is it a bug? given the coercion to 'counts' by multiplying by a large int, this loss of data seems particularly bad. it seems to me a person could not reliably recreate the relative abundances they got from metaphlan for the properly classified entities without at min retaining the unclassified and un-ncbi-mappable ones as unclassified.

also, as i was looking at this i wondered.. for the two entries just above here, is it possible/ reasonable to assign them to 1239 (Firmicutes) rather than 0 (unclassified)? i didnt immediately think of a reason to not do that, but maybe i missed something.

@Midnighter
Copy link
Contributor Author

Thank you for taking such a thorough look at it. It indeed sounds like the implementation is still buggy.

And assigning to the next higher rank that has an ID rather than to unclassified makes so much sense! Thank you for the suggestion ☺️

@Midnighter
Copy link
Contributor Author

There was a bug with operator precedence in pandas. Need to always use parentheses...

@Midnighter
Copy link
Contributor Author

Midnighter commented Oct 30, 2024

I thought about your idea regarding ranks some more and now discovered that both the previous implementation and the one that you propose are wrong. The relative abundance at each rank sums to 100%. In your example,

k__Bacteria|p__Bacteroidetes    2|976   43.03622        
k__Bacteria|p__Actinobacteria   2|201174        29.04669        
k__Bacteria|p__Firmicutes       2|1239  27.90168 

Firmicutes comprise ~28% of all bacteria. At the next lower rank we have

k__Bacteria|p__Firmicutes|c__Clostridia 2|1239|186801   95.1462 
k__Bacteria|p__Firmicutes|c__CFGB9120   2|1239| 2.26587 
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified    2|1239| 1.82756 
k__Bacteria|p__Firmicutes|c__Bacilli    2|1239|91061    0.90652

which sum up to ~100% again. It would be wrong to add those abundances to 1239, as that will break the assumption that each rank sums to 100%. It is equally wrong to add them to the unclassified node. Instead, they should just be dropped from the profile.

@@ -83,31 +85,35 @@ def transform(
},
)
)
# MetaPhlAn sometimes assigns a taxon by name only, but no identifier exists.
# We have no choice but to count those entries as unclassified at the moment.
# We drop unclassified entries.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sort of feel we should warn for this if this happens (even if it's included in teh higher classified taxon), I'm not really happy with dropping 'valid' contents from the output of the tool when we just say we are essentially changing teh format...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a minimum in docs, but nice to have would be a warn on CLI

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a warning emitted if you look a few lines below.

@d-callan
Copy link

I see what you mean... And I suppose the two entities in the example are already counted towards 1239 now I think about it. Though removing entries means a rank won't sum to 100 any longer either, it'll just be below 100 rather than above. In the case they're removed, we're likely creating inconsistency in relative abundances from one taxonomic rank to another and also making it impossible to recreate the relative abundances originally produced by metaphlan. Maybe that's the best short term solution anyhow, but it seems a thing to explain carefully to people.

I wonder what other options we have? Can we build a taxonomy from SGBs or something? Or other ideas not for now, but for a more complete solution later?

@Midnighter
Copy link
Contributor Author

Yeah, not great either way. I guess, one could think of one unclassified grouping per rank, but I don't know what identifier to assign to each.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] MetaPhlAn 4 output with duplicate clade tax id is not supported
3 participants