Faster FragmentAnnotation equality check #73

Closed
bittremieux opened this issue Nov 24, 2024 · 16 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@bittremieux
Member

bittremieux commented Nov 24, 2024

Checking whether two FragmentAnnotation objects are equal is slow because the comparison goes through the regex-based __str__ representation. The regex can be precompiled, or a more direct equality check can be implemented.

It's important to benchmark the runtime of the old implementation versus any optimized implementation.
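As a rough sketch of the second option (attribute names here are assumptions taken from the discussion below, not the library's confirmed API; the alternative of precompiling the regex with re.compile at module level is not shown), the equality check could compare a tuple of attributes instead of the regex-built string:

```python
def _annotation_key(fa):
    # Attribute names are assumed from the snippets later in this thread;
    # adjust to the actual FragmentAnnotation definition.
    return (
        fa.ion_type,
        fa.neutral_loss,
        fa.isotope,
        fa.charge,
        fa.adduct,
        fa.analyte_number,
        fa.mz_delta,
    )


def annotations_equal(fa1, fa2):
    """Sketch of an equality check that avoids building the __str__ form."""
    return _annotation_key(fa1) == _annotation_key(fa2)
```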

@bittremieux bittremieux added the enhancement New feature or request label Nov 24, 2024
@bittremieux
Member Author

Please don't hijack issues with unrelated questions. We are busily preparing additional information for interested GSoC contributors.

@bittremieux bittremieux added the good first issue Good for newcomers label Mar 4, 2025
@dikshant182004

Hi @bittremieux, I have understood your point and it will not be repeated in the future. However, for the time being, can I contribute to this issue?

@bittremieux
Member Author

Yes, this would certainly be a good issue to start contributing.

@mandeepnh5
Contributor

mandeepnh5 commented Mar 12, 2025

Hi @bittremieux, I have changed the equality check from comparing __str__ representations to comparing the relevant attributes directly.
Here is the code I used for benchmarking:

```python
import timeit
from spectrum_utils.fragment_annotation import FragmentAnnotation

fa1 = FragmentAnnotation(ion_type="y", neutral_loss="-H2O", isotope=1, charge=2, adduct="[M+H+Na]", analyte_number=4, mz_delta=(0.1, "Da"))
fa2 = FragmentAnnotation(ion_type="y", neutral_loss="-H2O", isotope=1, charge=2, adduct="[M+H+Na]", analyte_number=4, mz_delta=(0.1, "Da"))

def old_eq(fa1, fa2):
    return str(fa1) == str(fa2)

def new_eq(fa1, fa2):
    if not isinstance(fa2, FragmentAnnotation):
        return False
    return (
        fa1.ion_type == fa2.ion_type
        and fa1.neutral_loss == fa2.neutral_loss
        and fa1.isotope == fa2.isotope
        and fa1.charge == fa2.charge
        and fa1.adduct == fa2.adduct
        and fa1.analyte_number == fa2.analyte_number
        and fa1.mz_delta == fa2.mz_delta
    )

old_time = timeit.timeit(lambda: old_eq(fa1, fa2), number=100000)
new_time = timeit.timeit(lambda: new_eq(fa1, fa2), number=100000)

print(f"Old time: {old_time:.4f} seconds")
print(f"New time: {new_time:.4f} seconds")
```

Here are the results:
Old time: 0.3438 seconds
New time: 0.0358 seconds

Please let me know if this looks appropriate and I will open a PR.

@dikshant182004

dikshant182004 commented Mar 12, 2025

Hi @bittremieux, I have also benchmarked this in the past using the pytest-benchmark fixture, but those results are a bit different from what @mandeepnh5 is getting with the timeit module. In my results, the old implementation performed better than the new one (direct equality check):

```
Name (time in us)            Min                    Max                Mean              StdDev              Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_old_eq_runtime      40.8001 (1.0)      11,738.2000 (3.20)      50.4980 (1.0)       85.1722 (1.0)       41.9999 (1.0)       2.8000 (1.0)      129;3471       19.8028 (1.0)       19842           1
test_new_eq_runtime     543.5001 (13.32)     3,669.0002 (1.0)      644.0556 (12.75)    175.7144 (2.06)     598.8001 (14.26)    67.6251 (24.15)     110;144        1.5527 (0.08)       1575           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```

So what should we use as our evaluation criteria?

@bittremieux
Member Author

@mandeepnh5 Yes, that looks appropriate for a PR, please go ahead. 🙂

@dikshant182004 The discrepancy is indeed weird. I'm not familiar with pytest-benchmark; is it possible that there is some overhead influencing the results? What is your specific implementation?

Ultimately, we want to use the implementation that gives us the best performance, which also needs to be benchmarked.

@mandeepnh5
Contributor

Thanks, I will create a PR.

@dikshant182004

dikshant182004 commented Mar 17, 2025

Hi @bittremieux, I was not able to find my previous implementation, so I tried it once again by adjusting parts of @mandeepnh5's code to use the pytest-benchmark fixture. Here is the code:

```python
import pytest
from spectrum_utils.fragment_annotation import FragmentAnnotation

def new_equality_check(fa1, fa2):
    if not isinstance(fa2, FragmentAnnotation):
        return False
    return (fa1.ion_type == fa2.ion_type
        and fa1.neutral_loss == fa2.neutral_loss
        and fa1.isotope == fa2.isotope
        and fa1.charge == fa2.charge
        and fa1.adduct == fa2.adduct
        and fa1.analyte_number == fa2.analyte_number
        and fa1.mz_delta == fa2.mz_delta)

def create_fragments():
    frag1 = FragmentAnnotation(
        ion_type='b',
        neutral_loss=18,
        isotope=0,
        charge=1,
        adduct='',
        analyte_number=1,
        mz_delta=(0.0, "Da")
    )
    frag2 = FragmentAnnotation(
        ion_type='b',
        neutral_loss=18,
        isotope=0,
        charge=1,
        adduct='',
        analyte_number=1,
        mz_delta=(0.0, "Da")
    )
    return frag1, frag2

@pytest.mark.parametrize("equality_fn", [
    lambda f1, f2: FragmentAnnotation.__eq__,
    lambda f1, f2: new_equality_check(f1, f2)
])
def test_eq_runtime(benchmark, equality_fn):
    frag1, frag2 = create_fragments()

    def run_eq():
        for _ in range(1000):
            equality_fn(frag1, frag2)

    result = benchmark(run_eq)
    return result
```

So the benchmarking results are expected to be the same as I discussed earlier:

```
--------------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------------
Name (time in us)                   Min                   Max                Mean              StdDev              Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_eq_runtime[<lambda>0]      84.1999 (1.0)      2,422.7998 (1.0)      140.9994 (1.0)      101.5418 (1.0)       86.8002 (1.0)      120.8005 (1.0)        519;71        7.0922 (1.0)        6362           1
test_eq_runtime[<lambda>1]     621.0003 (7.38)     5,900.1003 (2.44)     922.7282 (6.54)     376.3427 (3.71)     692.7000 (7.98)     666.1997 (5.51)        362;3        1.0837 (0.15)       1550           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
```

@dikshant182004

dikshant182004 commented Mar 17, 2025

I feel that using the pytest-benchmark fixture would be more suitable than the timeit module, because it integrates with the testing framework and provides statistical insights that can help decide whether the performance improvement is significant and consistent across multiple runs.
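As a point of reference, a minimal version of such a benchmark with the pytest-benchmark fixture might look as follows (a sketch only; the constructor arguments are copied from the snippets above and may need adjusting):

```python
# Minimal pytest-benchmark sketch; run with pytest.
from spectrum_utils.fragment_annotation import FragmentAnnotation


def _make_annotation():
    # Constructor arguments mirror the snippets earlier in this thread.
    return FragmentAnnotation(
        ion_type="y", neutral_loss="-H2O", isotope=1, charge=2,
        adduct="[M+H+Na]", analyte_number=4, mz_delta=(0.1, "Da"),
    )


def test_eq_benchmark(benchmark):
    fa1, fa2 = _make_annotation(), _make_annotation()
    # Benchmark the equality check directly; pytest-benchmark handles
    # calibration, repetition, and the summary statistics itself.
    assert benchmark(lambda: fa1 == fa2)
```

Benchmarking the comparison call directly, rather than wrapping it in a manual 1000-iteration loop, also removes one potential source of overhead from the measurement.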

@bittremieux
Member Author

Yes, I agree that pytest-benchmark seems like a more principled approach. I'm not really familiar with it though. Does it also support running timing comparisons between successive versions of the code to flag excessive slowdowns that would be newly introduced?

The discrepancy in results is very weird though. Intuitively I'd expect the new implementation to be faster because it's less complex (e.g. no regex matching) and because of the short-circuit evaluation of the and expression. Do you have any insights into why this is not the case in your results?

@dikshant182004

dikshant182004 commented Mar 17, 2025

However, I am also not that familiar with pytest-benchmark, but after doing some research I found that it does support tracking performance changes over time, using options like --benchmark-save and --benchmark-compare to persist benchmark results across runs and to detect and flag any excessive slowdowns introduced in new versions.
Here are some insights that I feel may explain the discrepancy:
- Python's string comparisons are implemented in C and are extremely optimized; once the string representation is computed, the comparison itself is fast.
- Maybe the regex used inside the __str__ method is precompiled or cached, reducing its runtime cost on subsequent calls.
- In the new implementation, each attribute is accessed and compared individually in Python, so the overhead of multiple Python-level comparisons can add up.
These are just assumptions based on my research, so they may be wrong. What are your insights on this?
I have also previously tried a cached-key approach, which was only 15 percent slower than the old implementation but should be more reliable in the long run, @bittremieux.
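For illustration, one possible shape of such a cached-key approach (a sketch only, not the attempt referenced above; it assumes instances are immutable, have a __dict__, and use the attribute names from the earlier snippets):

```python
import functools


class CachedKeyEqMixin:
    """Sketch: compute the comparison key once per instance and reuse it."""

    @functools.cached_property
    def _eq_key(self):
        # Attribute names are assumed from the snippets earlier in this thread.
        return (
            self.ion_type, self.neutral_loss, self.isotope, self.charge,
            self.adduct, self.analyte_number, self.mz_delta,
        )

    def __eq__(self, other):
        return isinstance(other, type(self)) and self._eq_key == other._eq_key

    def __hash__(self):
        return hash(self._eq_key)
```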

@bittremieux
Member Author

> it does support tracking performance changes over time using options like --benchmark-save and --benchmark-compare to persist benchmark results across runs to detect and flag any excessive slowdowns introduced in new versions.

That could be interesting to set up for all time-sensitive functions at some point, to ensure that newer versions don't introduce (excessive) slowdowns.

Direct comparison should always be faster than creating a string, let alone when a regex is used. In contrast, in your test it's not just a bit slower, but slower by up to an order of magnitude, which doesn't make sense to me at all. We'll have to evaluate that in more detail, and @Janne98 will do an independent evaluation of #80.
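As an aside, the regression-tracking setup mentioned above could be sketched roughly as follows (the baseline name and failure threshold are illustrative assumptions; exact flag behavior should be checked against the pytest-benchmark documentation):

```python
# Sketch: persist a baseline once, then compare later runs against it and
# fail on excessive slowdowns. Equivalent command-line usage:
#   pytest --benchmark-save=baseline
#   pytest --benchmark-compare --benchmark-compare-fail=mean:10%
import pytest

if __name__ == "__main__":
    pytest.main(["--benchmark-compare", "--benchmark-compare-fail=mean:10%"])
```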

@dikshant182004

Okay @bittremieux, I will try to learn more about pytest-benchmark. Additionally, I am planning to use CPython for this issue; I feel it will give more promising benchmarking results. Do you have any other suggestions that would be faster than direct comparison?

@bittremieux
Member Author

What do you mean by CPython? That's the default Python backend. Or do you mean Cython? In which case I don't think it's such a good idea, because it will make installation on different systems much more challenging. If we want to compile anything, Numba is a better approach.

@dikshant182004

dikshant182004 commented Mar 18, 2025

Yeah, Numba could be a good option to try; I will give it a go. And Cython is probably not a good idea, since, as you said, it would make things more challenging.

@bittremieux
Member Author

Fixed in #80.
