
[ENH] Repository benchmarking #3026


Draft · wants to merge 10 commits into main

Conversation

@chrisholder (Contributor) commented Aug 12, 2025

Reference Issues/PRs

What does this implement/fix?

This PR adds performance benchmarking capabilities (runtime-focused) to the repository, enabling us to measure improvements or regressions over time and easily compare performance between branches. This allows PRs to be accompanied by clear, reproducible performance data and visualisations.

Benchmarking is implemented using airspeed velocity (asv), the same tool used by SciPy. Our configuration closely follows SciPy's, with some of their utility methods adapted for aeon.

An example benchmark has been added for the distance module. Example output tables and graphs are included in the comments below.


New dependency

  • asv (developer-only dependency)

Usage example

Once merged into main, you can define benchmark classes to measure performance across shapes, datasets, or algorithms. For example, here is the Euclidean distance benchmark, which lives at benchmarks/benchmarks/distances.py:

# Imports used by this benchmark. Benchmark is the suite's shared base class
# defined within the benchmarks package.
from aeon.distances import euclidean_distance, euclidean_pairwise_distance
from aeon.testing.data_generation import make_example_3d_numpy


class Euclidean(Benchmark):
    # Each parameter is a (n_cases, n_channels, n_timepoints) shape.
    params = [[
        (10, 1, 10), (100, 1, 100), (100, 1, 500),
        (10, 3, 10), (100, 3, 100), (100, 3, 500)
    ]]
    param_names = ["shape"]

    def setup(self, shape):
        self.a = make_example_3d_numpy(*shape, return_y=False, random_state=1)
        self.b = make_example_3d_numpy(*shape, return_y=False, random_state=2)

    def time_dist(self, shape):
        euclidean_distance(self.a[0], self.b[0])

    def time_pairwise_dist(self, shape):
        euclidean_pairwise_distance(self.a)

    def time_one_to_multiple_dist(self, shape):
        euclidean_pairwise_distance(self.a[0], self.b)

    def time_multiple_to_multiple_dist(self, shape):
        euclidean_pairwise_distance(self.a, self.b)

Running benchmarks

From the benchmarks directory:

# Run a specific benchmark class
asv run --bench "distances.Euclidean"

# Run a specific method
asv run --bench "distances.Euclidean.time_dist"

# Run all benchmarks
asv run --bench .

# Skip already-run commits
asv run --skip-existing-commits --bench "distances.Euclidean.time_dist"
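
When adding a new benchmark, it can help to sanity-check it before a full timed run. If I am reading the ASV CLI options correctly, --quick runs each benchmark only once, which is useful for checking setup code and parameter grids but not for real timings:

# Quick single-sample run (assumed --quick behaviour; not for real timings)
asv run --quick --bench "distances.Euclidean"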

Comparing branches

After running the benchmarks on main, check out the other branch, run them again, and compare:

git checkout some-other-branch
asv run --bench "distances.Euclidean"
asv compare main some-other-branch

Example output:

All benchmarks:

| Change   | Before [45b9b86f] <aeon-benchmarking>   | After [2b253ef0] <some-other-branch>   |   Ratio | Benchmark (Parameter)                                             |
|----------|-----------------------------------------|-------------------------------|---------|-------------------------------------------------------------------|
| +        | 300±2ns                                 | 645±6ns                       |    2.15 | distances.Euclidean.time_indv_dist((10, 1, 10))                   |
| +        | 308±1ns                                 | 648±10ns                      |    2.1  | distances.Euclidean.time_indv_dist((10, 3, 10))                   |
| +        | 327±4ns                                 | 656±6ns                       |    2.01 | distances.Euclidean.time_indv_dist((100, 1, 100))                 |
| +        | 424±6ns                                 | 756±7ns                       |    1.79 | distances.Euclidean.time_indv_dist((100, 1, 500))                 |
| +        | 350±5ns                                 | 701±7ns                       |    2    | distances.Euclidean.time_indv_dist((100, 3, 100))                 |
| +        | 585±10ns                                | 941±4ns                       |    1.61 | distances.Euclidean.time_indv_dist((100, 3, 500))                 |
| +        | 194±20μs                                | 261±0.9μs                     |    1.35 | distances.Euclidean.time_multiple_to_multiple_dist((10, 1, 10))   |
| +        | 181±3μs                                 | 261±2μs                       |    1.44 | distances.Euclidean.time_multiple_to_multiple_dist((10, 3, 10))   |
| +        | 595±20μs                                | 21.6±0.3ms                    |   36.34 | distances.Euclidean.time_multiple_to_multiple_dist((100, 1, 100)) |
| +        | 1.34±0.2ms                              | 21.8±0.1ms                    |   16.29 | distances.Euclidean.time_multiple_to_multiple_dist((100, 1, 500)) |
| +        | 842±4μs                                 | 21.3±0.2ms                    |   25.32 | distances.Euclidean.time_multiple_to_multiple_dist((100, 3, 100)) |
| +        | 2.89±0.03ms                             | 24.2±0.1ms                    |    8.37 | distances.Euclidean.time_multiple_to_multiple_dist((100, 3, 500)) |
| -        | 168±2μs                                 | 62.9±0.3μs                    |    0.37 | distances.Euclidean.time_one_to_multiple_dist((10, 1, 10))        |
| -        | 175±4μs                                 | 64.0±0.5μs                    |    0.37 | distances.Euclidean.time_one_to_multiple_dist((10, 3, 10))        |
| +        | 234±1μs                                 | 328±4μs                       |    1.4  | distances.Euclidean.time_one_to_multiple_dist((100, 1, 100))      |
| +        | 243±3μs                                 | 331±7μs                       |    1.37 | distances.Euclidean.time_one_to_multiple_dist((100, 1, 500))      |
| +        | 238±5μs                                 | 334±10μs                      |    1.41 | distances.Euclidean.time_one_to_multiple_dist((100, 3, 100))      |
| +        | 268±2μs                                 | 356±2μs                       |    1.33 | distances.Euclidean.time_one_to_multiple_dist((100, 3, 500))      |
| -        | 161±6μs                                 | 133±0.4μs                     |    0.83 | distances.Euclidean.time_pairwise_dist((10, 1, 10))               |
| -        | 168±8μs                                 | 133±2μs                       |    0.79 | distances.Euclidean.time_pairwise_dist((10, 3, 10))               |
| +        | 345±3μs                                 | 11.3±0.5ms                    |   32.89 | distances.Euclidean.time_pairwise_dist((100, 1, 100))             |
| +        | 706±7μs                                 | 11.8±0.06ms                   |   16.79 | distances.Euclidean.time_pairwise_dist((100, 1, 500))             |
| +        | 492±10μs                                | 11.5±0.1ms                    |   23.27 | distances.Euclidean.time_pairwise_dist((100, 3, 100))             |
| +        | 1.51±0ms                                | 12.4±0.04ms                   |    8.2  | distances.Euclidean.time_pairwise_dist((100, 3, 500))             |
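
If I am reading the ASV docs correctly, asv continuous performs the same run-then-compare workflow in a single step and can be made to fail when a regression exceeds a chosen factor; a hedged example:

asv continuous --bench "distances.Euclidean" main some-other-branch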

Visualisation

You can generate interactive performance graphs:

asv publish
asv preview

This launches a local web server with plots and historical trends.

Examples of some pages and graphs include:

[screenshots of example ASV report pages and graphs]

For a more detailed example, see the sample HTML report that the asv documentation links to: https://pv.github.io/numpy-bench/

The web interface provides:

  • Performance trends over time
  • Detailed comparison graphs between branches
  • Parameter-specific performance analysis
  • Historical regression tracking

Benefits

  • Quantitative PR Reviews: Reviewers can objectively assess performance impacts
  • Regression Detection: Automatically identify performance degradations
  • Optimisation Validation: Verify that performance improvements work as intended
  • Historical Tracking: Monitor performance evolution across releases
  • Comprehensive Testing: Test performance across various input configurations

Future Work

  • Extend benchmark coverage to additional modules
  • Set up CI integration for automatic performance testing
  • Establish performance regression thresholds
  • Create developer documentation for writing benchmarks

@aeon-actions-bot added the "enhancement" label (new feature, improvement request or other non-bug code enhancement) on Aug 12, 2025
@aeon-actions-bot (Contributor) commented:

Thank you for contributing to aeon

I have added the following labels to this PR based on the title: [ enhancement ].

The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.

If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.

Don't hesitate to ask questions on the aeon Slack channel if you have any.

PR CI actions

These checkboxes will add labels to enable/disable CI functionality for this PR. This may not take effect immediately, and a new commit may be required to run the new configuration.

  • Run pre-commit checks for all files
  • Run mypy typecheck tests
  • Run all pytest tests and configurations
  • Run all notebook example tests
  • Run numba-disabled codecov tests
  • Stop automatic pre-commit fixes (always disabled for drafts)
  • Disable numba cache loading
  • Push an empty commit to re-run CI checks

@chrisholder (Contributor, Author) commented:

I want to add a web page tutorial similar to: https://docs.scipy.org/doc/scipy/dev/contributor/benchmarking.html

@chrisholder (Contributor, Author) commented Aug 12, 2025

This is a further example of how we could use this with our estimators.

Here is a benchmark for KMeans:

import aeon.clustering as aeon_clust  # provides TimeSeriesKMeans


class KMeansBenchmark(EstimatorBenchmark):
    ks = [2, 4, 8]
    inits = ["random", "kmeans++"]
    distances = ["euclidean", "dtw"]
    average_methods = ["mean", "ba"]

    # extend the base grid
    params = EstimatorBenchmark.params + [ks, inits, distances, average_methods]
    param_names = EstimatorBenchmark.param_names + ["k", "init", "distance", "average_method"]

    def _build_estimator(self, k, init, distance, average_method) -> BaseEstimator:
        return aeon_clust.TimeSeriesKMeans(
            n_clusters=k,
            init=init,
            distance=distance,
            averaging_method=average_method,
            n_init=1,
            random_state=1
        )

I have also defined a base estimator benchmark class that generates the data and provides fit/predict timing methods.

from abc import ABC, abstractmethod

from aeon.testing.data_generation import make_example_3d_numpy

# Benchmark is the suite's shared base class; BaseEstimator refers to aeon's
# estimator base class (import paths omitted here).


class EstimatorBenchmark(Benchmark, ABC):
    # Base grid (shared across all estimators)
    shapes = [
        (10, 1, 10),
        (100, 1, 100),
        (10, 3, 10),
        (100, 3, 100),
    ]

    # Subclasses will append their own grids to these:
    params = [shapes]
    param_names = ["shape"]

    def setup(self, shape, *est_params):
        # Data
        self.X_train = make_example_3d_numpy(*shape, return_y=False, random_state=1)
        self.X_test = make_example_3d_numpy(*shape, return_y=False, random_state=2)

        # Pre-fit once for predict timing
        self.prefit_estimator = self._build_estimator(*est_params)
        self.prefit_estimator.fit(self.X_train)

    def time_fit(self, shape, *est_params):
        est = self._build_estimator(*est_params)  # fresh each run
        est.fit(self.X_train)

    def time_predict(self, shape, *est_params):
        self.prefit_estimator.predict(self.X_test)

    @abstractmethod
    def _build_estimator(self, *est_params) -> BaseEstimator:
        """Return an unfitted estimator configured with the given params."""
        ...

For each data shape, a KMeans estimator is run for three values of k, two inits, two distances, and two averaging methods.

For fit this produces a table that looks like:

[75.00%] ··· clustering.KMeansBenchmark.time_fit                                                                                                                                                                                                                                                                 ok
[75.00%] ··· ============= === =========================== ========================= ===================== =================== ============================= =========================== ======================= =====================
             --                                                                                                    init / distance / average_method                                                                                   
             ----------------- -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                 shape      k   random / euclidean / mean   random / euclidean / ba   random / dtw / mean   random / dtw / ba   kmeans++ / euclidean / mean   kmeans++ / euclidean / ba   kmeans++ / dtw / mean   kmeans++ / dtw / ba 
             ============= === =========================== ========================= ===================== =================== ============================= =========================== ======================= =====================
              (10, 1, 10)   2            721±20μs                 1.44±0.01ms               926±10μs           2.01±0.08ms              1.09±0.02ms                   2.46±0.1ms               1.09±0.03ms            2.13±0.03ms     
              (10, 1, 10)   4            783±50μs                 1.37±0.02ms               791±10μs           1.94±0.02ms              1.26±0.01ms                   2.22±0.2ms               1.33±0.03ms            2.61±0.09ms     
              (10, 1, 10)   8            608±40μs                 1.34±0.08ms               883±40μs           1.36±0.04ms              2.01±0.05ms                   2.55±0.3ms               2.17±0.05ms            2.48±0.05ms     
              (10, 3, 10)   2            906±20μs                 2.55±0.04ms             1.01±0.08ms          2.00±0.03ms              1.26±0.02ms                  2.51±0.09ms                 945±20μs              1.84±0.2ms     
              (10, 3, 10)   4            745±20μs                 1.97±0.01ms               773±20μs           1.24±0.02ms              1.63±0.02ms                  2.89±0.04ms               1.52±0.02ms            2.58±0.04ms     
              (10, 3, 10)   8            555±7μs                    741±10μs                848±10μs           1.28±0.05ms              2.36±0.05ms                  2.60±0.02ms                2.32±0.1ms             2.73±0.1ms     
             ============= === =========================== ========================= ===================== =================== ============================= =========================== ======================= =====================
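
Assuming the benchmark module is benchmarks/benchmarks/clustering.py (which the benchmark names in the table above suggest), it can be run on its own in the same way as the distances example:

asv run --bench "clustering.KMeansBenchmark.time_fit"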

@TonyBagnall (Contributor) commented:

I like the look of it. I don't suppose it also profiles? I would very much like to be able to profile algorithms.

@chrisholder (Contributor, Author) commented Aug 13, 2025

ASV has built-in support for profiling individual benchmarks. For example, to run a profile of the KMeansBenchmark.time_fit you would run:

asv profile clustering.KMeansBenchmark.time_fit aeon-benchmarking --gui snakeviz

This runs the benchmark under cProfile on the specified branch (aeon-benchmarking in this case) and opens the results directly in SnakeViz. SnakeViz is a new dev dependency I'll also add that lets you visualise the cProfile output. Happy to use a different GUI if anyone has suggestions. We can also output .prof files for later inspection, or load them programmatically via asv.results to explore in Python.

Profiling works for any benchmark in the suite, so we can do this for clustering, distances, or any other algorithm we have defined a benchmark for (see above for defining a benchmark).
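
As a sketch of the "later inspection" route (the profile.prof path below is just a placeholder for wherever the profile data is saved), a cProfile output file can be examined with the standard library:

import pstats

# Load a saved cProfile output file (hypothetical path) and print the
# 10 entries with the largest cumulative time.
stats = pstats.Stats("profile.prof")
stats.sort_stats("cumulative").print_stats(10)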

Here is an example of a visual profile I have for Kmeans:

[screenshot of a SnakeViz profile for KMeans]

@SebastianSchmidl (Member) left a comment:

Love this ❤️

Review comment on the following lines of the KMeans benchmark:

ks = [2, 4, 8]
inits = ["random", "kmeans++"]
distances = tuple(DISTANCES_DICT.keys()) # all supported distances
distances = ["euclidean"]
@SebastianSchmidl (Member) commented Aug 14, 2025:

Probably, you want to remove the distances overwrite before the merge

@chrisholder (Contributor, Author) replied:

I've had some thoughts since writing this benchmark about how we should handle models that could have a lot of different parameters. I'll write a new comment below this outlining my thoughts after writing a few benchmarks.

Comment on lines +9 to +10
otherwise. Some of the benchmarking features in `spin` also tell ASV to use the aeon
compiled by `spin`. To run the benchmarks, you will need to install the "dev"
@SebastianSchmidl (Member) commented:

What is spin?

@chrisholder (Contributor, Author) replied:

Ah, I need to remove this from the docs. That's something SciPy uses, and I decided we probably don't need it.

@chrisholder (Contributor, Author) commented Aug 14, 2025

Just going to put this here for future reference on how ASV runs benchmarks:

General ASV settings / rules:

  1. Benchmark duration
    Each individual benchmark case (one parameter combination) should complete in under 60 seconds to avoid hitting ASV's default timeout (this can be raised per benchmark; see the sketch after this list).

  2. Repetitions and statistical uncertainty
    ASV runs each benchmark multiple times until the statistical uncertainty (coefficient of variation) is below a threshold, or until it reaches a maximum repeat limit.
    By default:

    • Each benchmark is run at least 5 times
    • Up to 20 times if the target uncertainty is not reached
  3. Result aggregation
    The final reported benchmark time is the median of all runs, not the mean — this reduces the impact of outliers on results.

  4. Warm-up phase
    ASV includes a built-in warm-up period before timing starts, so we generally don’t need to manually pre-cache or pre-run functions inside the benchmark code.
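
These settings can be tuned per benchmark through class attributes. A minimal sketch, assuming the attribute names documented by ASV (timeout, repeat, warmup_time) and the Benchmark base class from this PR:

class SlowAlgorithm(Benchmark):
    # Raise the per-case limit above ASV's default 60 s timeout.
    timeout = 180
    # (min_repeat, max_repeat, max_time_seconds): repeat at least 3 and at
    # most 10 times, or until 30 s of measurement time has been spent.
    repeat = (3, 10, 30.0)
    # Extra warm-up before timing starts, e.g. for numba-compiled code.
    warmup_time = 1.0

    def time_something(self):
        ...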

@chrisholder (Contributor, Author) commented:

I also wanted to define a small set of initial guidelines for how we should conduct benchmarking in aeon.

What should we benchmark?

Any public-facing function where monitoring performance would be valuable.

Examples:

  • Estimators
  • Distance functions
  • Dataset loading functions

It may also be useful to benchmark functions that are used frequently internally.

Examples:

  • Conversion/Preprocessing time series logic
  • Numba utilities
  • Base class logic

How should we benchmark?

When designing benchmarks, be mindful of runtime.
Only parameterise an estimator with variables that are independent of other benchmarks.

For example, KMeans accepts a distance parameter, so it might seem reasonable to benchmark it with every possible distance. However, we already have dedicated benchmarks for the distance functions, so repeating them here would be redundant. Each benchmark should highlight performance that is specific to the thing being benchmarked.

We should also consider the input data shapes.
At a minimum, all benchmarks should be conducted over a range of:

  • Number of cases
  • Number of timepoints
  • Number of channels (if supported)

This range does not need to be huge. For example, for estimators and distances, I currently use the following shapes:

(10, 1, 10),
(10, 1, 1000),
(50, 1, 100),
(10, 3, 10),
(10, 3, 1000),
(50, 3, 100),

Due to how ASV runs benchmarks, even with relatively modest changes in the number of cases (e.g., up to a few thousand time series), this range is sufficient to detect performance changes across different dataset sizes.

I am currently only testing with NumPy format. Other formats are fine to include if you believe they could show different performance characteristics.

@SebastianSchmidl (Member) commented:

Overall, I agree with what you said above. My five cents:

What should we benchmark?

Any public-facing function where monitoring performance would be valuable.

Examples:

  • Estimators
  • Distance functions
  • Dataset loading functions

▶️ For the dataset loading, we need to make sure to exclude all network calls. It does not make sense to benchmark the network connection. Most interesting would be the parser for .ts-files, IMO.

For example, for KMeans which accepts a distance parameter, it might seem reasonable to benchmark every possible distance. However, we already have dedicated benchmarks for distance functions, so repeating them here would be redundant. Our goal with the benchmark is to highlight performance that is specific to that benchmark.

▶️ Is it reasonable to define default dummy functions in the benchmarking module? If we want to benchmark the overhead of the estimator rather than its dependencies/components, it would make sense to have no-op stubs for the components. E.g., for a distance d(x, y), just return x[0] or something similar?
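
A minimal sketch of that idea, assuming TimeSeriesKMeans accepts a callable for its distance parameter (worth double-checking against the current API before relying on it):

import numpy as np

from aeon.clustering import TimeSeriesKMeans


def noop_distance(x: np.ndarray, y: np.ndarray) -> float:
    # Constant-time stand-in for a real distance, so the benchmark measures
    # the estimator's own overhead rather than the distance computation.
    return float(abs(x[0, 0] - y[0, 0]))


est = TimeSeriesKMeans(
    n_clusters=2,
    distance=noop_distance,
    averaging_method="mean",
    n_init=1,
    random_state=1,
)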
