Fix metaphlan profiles with duplicate entries by aggregating them #153

Midnighter · 2024-10-28T10:47:09Z

fix [BUG] MetaPhlAn 4 output with duplicate clade tax id is not supported #140
description of feature/fix
tests added/passed
add an entry to the changelog

@d-callan would you be up for testing if this works as you would expect, please? You can install this directly with

pip install https://github.com/taxprofiler/taxpasta/archive/fix-metaphlan.zip

d-callan · 2024-10-30T19:43:58Z

thanks for your effort on this :) it seems to not err any longer, which is a great improvement! i tried w a pretty small/ and mostly manufactured test case built from subsetting and modifying one of the test files in the pr:

#mpa_vOct22_CHOCOPhlAnSGB_202212
#/usr/local/bin/metaphlan --nproc 6 --input_type fastq 2612.merged.fastq.gz --bowtie2out 2612_se_metaphlan4-db.metaphlan.bowtie2out.txt --bowtie2db metaphlan4 --index mpa_vOct22_CHOCOPhlAnSGB_202212 --biom 2612_se_metaphlan4-db.metaphlan.biom --output_file 2612_se_metaphlan4-db.metaphlan_profile.txt
#707440 reads processed
#SampleID       Metaphlan_Analysis
#clade_name     NCBI_tax_id     relative_abundance      additional_species
k__Bacteria     2       89.92809        
k__Archaea      2157    10.07191        
k__Bacteria|p__Bacteroidetes    2|976   43.03622        
k__Bacteria|p__Actinobacteria   2|201174        29.04669        
k__Bacteria|p__Firmicutes       2|1239  27.90168        
k__Bacteria|p__Firmicutes|c__Clostridia 2|1239|186801   95.1462 
k__Bacteria|p__Firmicutes|c__CFGB9120   2|1239| 2.26587 
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified    2|1239| 1.82756 
k__Bacteria|p__Firmicutes|c__Bacilli    2|1239|91061    0.90652

and i got:

taxonomy_id     count
2       89928090
2157    10071910
976     43036220
201174  29046690
1239    27901680
186801  95146200
91061   906520

i notice there is no entry for 0, which i was expecting in the result. given what i saw quickly in the comments/ code w this pr, i had expected that these two entries in particular would have wound up there:

k__Bacteria|p__Firmicutes|c__CFGB9120   2|1239| 2.26587 
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified    2|1239| 1.82756

was that not the intention, or is it a bug? given the coercion to 'counts' by multiplying by a large int, this loss of data seems particularly bad. it seems to me a person could not reliably recreate the relative abundances they got from metaphlan for the properly classified entities without at min retaining the unclassified and un-ncbi-mappable ones as unclassified.

also, as i was looking at this i wondered.. for the two entries just above here, is it possible/ reasonable to assign them to 1239 (Firmicutes) rather than 0 (unclassified)? i didnt immediately think of a reason to not do that, but maybe i missed something.

Midnighter · 2024-10-30T22:18:27Z

Thank you for taking such a thorough look at it. It indeed sounds like the implementation is still buggy.

And assigning to the next higher rank that has an ID rather than to unclassified makes so much sense! Thank you for the suggestion ☺️

Midnighter · 2024-10-30T23:27:33Z

There was a bug with operator precedence in pandas. Need to always use parentheses...

Midnighter · 2024-10-30T23:46:50Z

I thought about your idea regarding ranks some more and now discovered that both the previous implementation and the one that you propose are wrong. The relative abundance at each rank sums to 100%. In your example,

k__Bacteria|p__Bacteroidetes    2|976   43.03622        
k__Bacteria|p__Actinobacteria   2|201174        29.04669        
k__Bacteria|p__Firmicutes       2|1239  27.90168

Firmicutes comprise ~28% of all bacteria. At the next lower rank we have

k__Bacteria|p__Firmicutes|c__Clostridia 2|1239|186801   95.1462 
k__Bacteria|p__Firmicutes|c__CFGB9120   2|1239| 2.26587 
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified    2|1239| 1.82756 
k__Bacteria|p__Firmicutes|c__Bacilli    2|1239|91061    0.90652

which sum up to ~100% again. It would be wrong to add those abundances to 1239, as that will break the assumption that each rank sums to 100%. It is equally wrong to add them to the unclassified node. Instead, they should just be dropped from the profile.

jfy133 · 2024-10-31T18:19:04Z

src/taxpasta/infrastructure/application/metaphlan/metaphlan_profile_standardisation_service.py

@@ -83,31 +85,35 @@ def transform(
                },
            )
        )
+        # MetaPhlAn sometimes assigns a taxon by name only, but no identifier exists.
+        # We have no choice but to count those entries as unclassified at the moment.
+        # We drop unclassified entries.


I sort of feel we should warn for this if this happens (even if it's included in teh higher classified taxon), I'm not really happy with dropping 'valid' contents from the output of the tool when we just say we are essentially changing teh format...

At a minimum in docs, but nice to have would be a warn on CLI

There is a warning emitted if you look a few lines below.

d-callan · 2024-10-31T18:45:41Z

I see what you mean... And I suppose the two entities in the example are already counted towards 1239 now I think about it. Though removing entries means a rank won't sum to 100 any longer either, it'll just be below 100 rather than above. In the case they're removed, we're likely creating inconsistency in relative abundances from one taxonomic rank to another and also making it impossible to recreate the relative abundances originally produced by metaphlan. Maybe that's the best short term solution anyhow, but it seems a thing to explain carefully to people.

I wonder what other options we have? Can we build a taxonomy from SGBs or something? Or other ideas not for now, but for a more complete solution later?

Midnighter · 2024-10-31T18:50:35Z

Yeah, not great either way. I guess, one could think of one unclassified grouping per rank, but I don't know what identifier to assign to each.

Midnighter added 4 commits October 23, 2024 13:50

chore: do not remove trailing whitespace from txt files

df8bc9f

chore: add failing profiles for testing

c1d1d37

docs: insert descriptive comments

80e6de0

fix: aggregate all duplicate entries

8c19722

Midnighter self-assigned this Oct 28, 2024

Midnighter requested a review from a team October 28, 2024 10:55

fix: enclose condition in parentheses

0fef340

Midnighter added 4 commits October 31, 2024 01:13

fix: drop unclassified entries

4755f60

fix: invert condition

3efd141

fix: unclassified are empty strings

1db9edd

chore: add small MetaPhlAn test profile

ff470cc

jfy133 reviewed Oct 31, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix metaphlan profiles with duplicate entries by aggregating them #153

Fix metaphlan profiles with duplicate entries by aggregating them #153

Midnighter commented Oct 28, 2024

d-callan commented Oct 30, 2024

Midnighter commented Oct 30, 2024

Midnighter commented Oct 30, 2024

Midnighter commented Oct 30, 2024 •

edited

Loading

jfy133 Oct 31, 2024

jfy133 Oct 31, 2024

Midnighter Oct 31, 2024

d-callan commented Oct 31, 2024

Midnighter commented Oct 31, 2024

Fix metaphlan profiles with duplicate entries by aggregating them #153

Are you sure you want to change the base?

Fix metaphlan profiles with duplicate entries by aggregating them #153

Conversation

Midnighter commented Oct 28, 2024

d-callan commented Oct 30, 2024

Midnighter commented Oct 30, 2024

Midnighter commented Oct 30, 2024

Midnighter commented Oct 30, 2024 • edited Loading

jfy133 Oct 31, 2024

Choose a reason for hiding this comment

jfy133 Oct 31, 2024

Choose a reason for hiding this comment

Midnighter Oct 31, 2024

Choose a reason for hiding this comment

d-callan commented Oct 31, 2024

Midnighter commented Oct 31, 2024

Midnighter commented Oct 30, 2024 •

edited

Loading