
[ENH] Repository benchmarking #3026


Draft · wants to merge 10 commits into main

Conversation

@chrisholder (Contributor) commented Aug 12, 2025

Reference Issues/PRs

What does this implement/fix?

This PR adds performance benchmarking capabilities (runtime-focused) to the repository, enabling us to measure improvements or regressions over time and easily compare performance between branches. This allows PRs to be accompanied by clear, reproducible performance data and visualisations.

Benchmarking is implemented using airspeed velocity (asv), the same tool used by SciPy. Our configuration closely follows SciPy's, with some of their utility methods adapted for aeon.

An example benchmark has been added for the distance module. Example output tables and graphs are included in the comments below.


New dependency

  • asv (developer-only dependency)

Usage example

Once merged into main, you can define benchmark classes to measure performance across shapes, datasets, or algorithms. For example, here is the Euclidean distance benchmark, which lives at benchmarks/benchmarks/distances.py:

# Imports used by this benchmark. Benchmark is the suite's shared base class
# defined within the benchmarks package.
from aeon.distances import euclidean_distance, euclidean_pairwise_distance
from aeon.testing.data_generation import make_example_3d_numpy


class Euclidean(Benchmark):
    # Each parameter is a (n_cases, n_channels, n_timepoints) shape.
    params = [[
        (10, 1, 10), (100, 1, 100), (100, 1, 500),
        (10, 3, 10), (100, 3, 100), (100, 3, 500)
    ]]
    param_names = ["shape"]

    def setup(self, shape):
        self.a = make_example_3d_numpy(*shape, return_y=False, random_state=1)
        self.b = make_example_3d_numpy(*shape, return_y=False, random_state=2)

    def time_dist(self, shape):
        euclidean_distance(self.a[0], self.b[0])

    def time_pairwise_dist(self, shape):
        euclidean_pairwise_distance(self.a)

    def time_one_to_multiple_dist(self, shape):
        euclidean_pairwise_distance(self.a[0], self.b)

    def time_multiple_to_multiple_dist(self, shape):
        euclidean_pairwise_distance(self.a, self.b)

Running benchmarks

From the benchmarks directory:

# Run a specific benchmark class
asv run --bench "distances.Euclidean"

# Run a specific method
asv run --bench "distances.Euclidean.time_dist"

# Run all benchmarks
asv run --bench .

# Skip already-run commits
asv run --skip-existing-commits --bench "distances.Euclidean.time_dist"
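
When adding a new benchmark, it can help to sanity-check it before a full timed run. If I am reading the ASV CLI options correctly, --quick runs each benchmark only once, which is useful for checking setup code and parameter grids but not for real timings:

# Quick single-sample run (assumed --quick behaviour; not for real timings)
asv run --quick --bench "distances.Euclidean"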

Comparing branches

After running the benchmarks on main, check out the other branch, run them again, and compare:

git checkout some-other-branch
asv run --bench "distances.Euclidean"
asv compare main some-other-branch

Example output:

All benchmarks:

| Change   | Before [45b9b86f] <aeon-benchmarking>   | After [2b253ef0] <some-other-branch>   |   Ratio | Benchmark (Parameter)                                             |
|----------|-----------------------------------------|-------------------------------|---------|-------------------------------------------------------------------|
| +        | 300±2ns                                 | 645±6ns                       |    2.15 | distances.Euclidean.time_indv_dist((10, 1, 10))                   |
| +        | 308±1ns                                 | 648±10ns                      |    2.1  | distances.Euclidean.time_indv_dist((10, 3, 10))                   |
| +        | 327±4ns                                 | 656±6ns                       |    2.01 | distances.Euclidean.time_indv_dist((100, 1, 100))                 |
| +        | 424±6ns                                 | 756±7ns                       |    1.79 | distances.Euclidean.time_indv_dist((100, 1, 500))                 |
| +        | 350±5ns                                 | 701±7ns                       |    2    | distances.Euclidean.time_indv_dist((100, 3, 100))                 |
| +        | 585±10ns                                | 941±4ns                       |    1.61 | distances.Euclidean.time_indv_dist((100, 3, 500))                 |
| +        | 194±20μs                                | 261±0.9μs                     |    1.35 | distances.Euclidean.time_multiple_to_multiple_dist((10, 1, 10))   |
| +        | 181±3μs                                 | 261±2μs                       |    1.44 | distances.Euclidean.time_multiple_to_multiple_dist((10, 3, 10))   |
| +        | 595±20μs                                | 21.6±0.3ms                    |   36.34 | distances.Euclidean.time_multiple_to_multiple_dist((100, 1, 100)) |
| +        | 1.34±0.2ms                              | 21.8±0.1ms                    |   16.29 | distances.Euclidean.time_multiple_to_multiple_dist((100, 1, 500)) |
| +        | 842±4μs                                 | 21.3±0.2ms                    |   25.32 | distances.Euclidean.time_multiple_to_multiple_dist((100, 3, 100)) |
| +        | 2.89±0.03ms                             | 24.2±0.1ms                    |    8.37 | distances.Euclidean.time_multiple_to_multiple_dist((100, 3, 500)) |
| -        | 168±2μs                                 | 62.9±0.3μs                    |    0.37 | distances.Euclidean.time_one_to_multiple_dist((10, 1, 10))        |
| -        | 175±4μs                                 | 64.0±0.5μs                    |    0.37 | distances.Euclidean.time_one_to_multiple_dist((10, 3, 10))        |
| +        | 234±1μs                                 | 328±4μs                       |    1.4  | distances.Euclidean.time_one_to_multiple_dist((100, 1, 100))      |
| +        | 243±3μs                                 | 331±7μs                       |    1.37 | distances.Euclidean.time_one_to_multiple_dist((100, 1, 500))      |
| +        | 238±5μs                                 | 334±10μs                      |    1.41 | distances.Euclidean.time_one_to_multiple_dist((100, 3, 100))      |
| +        | 268±2μs                                 | 356±2μs                       |    1.33 | distances.Euclidean.time_one_to_multiple_dist((100, 3, 500))      |
| -        | 161±6μs                                 | 133±0.4μs                     |    0.83 | distances.Euclidean.time_pairwise_dist((10, 1, 10))               |
| -        | 168±8μs                                 | 133±2μs                       |    0.79 | distances.Euclidean.time_pairwise_dist((10, 3, 10))               |
| +        | 345±3μs                                 | 11.3±0.5ms                    |   32.89 | distances.Euclidean.time_pairwise_dist((100, 1, 100))             |
| +        | 706±7μs                                 | 11.8±0.06ms                   |   16.79 | distances.Euclidean.time_pairwise_dist((100, 1, 500))             |
| +        | 492±10μs                                | 11.5±0.1ms                    |   23.27 | distances.Euclidean.time_pairwise_dist((100, 3, 100))             |
| +        | 1.51±0ms                                | 12.4±0.04ms                   |    8.2  | distances.Euclidean.time_pairwise_dist((100, 3, 500))             |
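
If I am reading the ASV docs correctly, asv continuous performs the same run-then-compare workflow in a single step and can be made to fail when a regression exceeds a chosen factor; a hedged example:

asv continuous --bench "distances.Euclidean" main some-other-branch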

Visualisation

You can generate interactive performance graphs:

asv publish
asv preview

This launches a local web server with plots and historical trends.

Examples of some pages and graphs include:

[screenshots of example ASV report pages and graphs]

For a more detailed example, see the sample HTML report that the asv documentation links to: https://pv.github.io/numpy-bench/

The web interface provides:

  • Performance trends over time
  • Detailed comparison graphs between branches
  • Parameter-specific performance analysis
  • Historical regression tracking

Benefits

  • Quantitative PR Reviews: Reviewers can objectively assess performance impacts
  • Regression Detection: Automatically identify performance degradations
  • Optimisation Validation: Verify that performance improvements work as intended
  • Historical Tracking: Monitor performance evolution across releases
  • Comprehensive Testing: Test performance across various input configurations

Future Work

  • Extend benchmark coverage to additional modules
  • Set up CI integration for automatic performance testing
  • Establish performance regression thresholds
  • Create developer documentation for writing benchmarks

@aeon-actions-bot added the "enhancement" label (new feature, improvement request or other non-bug code enhancement) on Aug 12, 2025
@aeon-actions-bot (Contributor) commented:

Thank you for contributing to aeon

I have added the following labels to this PR based on the title: [ enhancement ].

The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.

If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.

Don't hesitate to ask questions on the aeon Slack channel if you have any.

PR CI actions

These checkboxes will add labels to enable/disable CI functionality for this PR. This may not take effect immediately, and a new commit may be required to run the new configuration.

  • Run pre-commit checks for all files
  • Run mypy typecheck tests
  • Run all pytest tests and configurations
  • Run all notebook example tests
  • Run numba-disabled codecov tests
  • Stop automatic pre-commit fixes (always disabled for drafts)
  • Disable numba cache loading
  • Push an empty commit to re-run CI checks

@chrisholder (Contributor, Author) commented:

I want to add a web page tutorial similar to: https://docs.scipy.org/doc/scipy/dev/contributor/benchmarking.html

@chrisholder (Contributor, Author) commented Aug 12, 2025

This is a further example of how we could use this with our estimators.

Here is a benchmark for KMeans:

import aeon.clustering as aeon_clust  # provides TimeSeriesKMeans


class KMeansBenchmark(EstimatorBenchmark):
    ks = [2, 4, 8]
    inits = ["random", "kmeans++"]
    distances = ["euclidean", "dtw"]
    average_methods = ["mean", "ba"]

    # extend the base grid
    params = EstimatorBenchmark.params + [ks, inits, distances, average_methods]
    param_names = EstimatorBenchmark.param_names + ["k", "init", "distance", "average_method"]

    def _build_estimator(self, k, init, distance, average_method) -> BaseEstimator:
        return aeon_clust.TimeSeriesKMeans(
            n_clusters=k,
            init=init,
            distance=distance,
            averaging_method=average_method,
            n_init=1,
            random_state=1
        )

I have also defined a base estimator benchmark class that generates the data and provides fit/predict timing methods.

from abc import ABC, abstractmethod

from aeon.testing.data_generation import make_example_3d_numpy

# Benchmark is the suite's shared base class; BaseEstimator refers to aeon's
# estimator base class (import paths omitted here).


class EstimatorBenchmark(Benchmark, ABC):
    # Base grid (shared across all estimators)
    shapes = [
        (10, 1, 10),
        (100, 1, 100),
        (10, 3, 10),
        (100, 3, 100),
    ]

    # Subclasses will append their own grids to these:
    params = [shapes]
    param_names = ["shape"]

    def setup(self, shape, *est_params):
        # Data
        self.X_train = make_example_3d_numpy(*shape, return_y=False, random_state=1)
        self.X_test = make_example_3d_numpy(*shape, return_y=False, random_state=2)

        # Pre-fit once for predict timing
        self.prefit_estimator = self._build_estimator(*est_params)
        self.prefit_estimator.fit(self.X_train)

    def time_fit(self, shape, *est_params):
        est = self._build_estimator(*est_params)  # fresh each run
        est.fit(self.X_train)

    def time_predict(self, shape, *est_params):
        self.prefit_estimator.predict(self.X_test)

    @abstractmethod
    def _build_estimator(self, *est_params) -> BaseEstimator:
        """Return an unfitted estimator configured with the given params."""
        ...

For each data shape, a KMeans estimator is run for three values of k, two inits, two distances, and two averaging methods.

For fit this produces a table that looks like:

[75.00%] ··· clustering.KMeansBenchmark.time_fit                                                                                                                                                                                                                                                                 ok
[75.00%] ··· ============= === =========================== ========================= ===================== =================== ============================= =========================== ======================= =====================
             --                                                                                                    init / distance / average_method                                                                                   
             ----------------- -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                 shape      k   random / euclidean / mean   random / euclidean / ba   random / dtw / mean   random / dtw / ba   kmeans++ / euclidean / mean   kmeans++ / euclidean / ba   kmeans++ / dtw / mean   kmeans++ / dtw / ba 
             ============= === =========================== ========================= ===================== =================== ============================= =========================== ======================= =====================
              (10, 1, 10)   2            721±20μs                 1.44±0.01ms               926±10μs           2.01±0.08ms              1.09±0.02ms                   2.46±0.1ms               1.09±0.03ms            2.13±0.03ms     
              (10, 1, 10)   4            783±50μs                 1.37±0.02ms               791±10μs           1.94±0.02ms              1.26±0.01ms                   2.22±0.2ms               1.33±0.03ms            2.61±0.09ms     
              (10, 1, 10)   8            608±40μs                 1.34±0.08ms               883±40μs           1.36±0.04ms              2.01±0.05ms                   2.55±0.3ms               2.17±0.05ms            2.48±0.05ms     
              (10, 3, 10)   2            906±20μs                 2.55±0.04ms             1.01±0.08ms          2.00±0.03ms              1.26±0.02ms                  2.51±0.09ms                 945±20μs              1.84±0.2ms     
              (10, 3, 10)   4            745±20μs                 1.97±0.01ms               773±20μs           1.24±0.02ms              1.63±0.02ms                  2.89±0.04ms               1.52±0.02ms            2.58±0.04ms     
              (10, 3, 10)   8            555±7μs                    741±10μs                848±10μs           1.28±0.05ms              2.36±0.05ms                  2.60±0.02ms                2.32±0.1ms             2.73±0.1ms     
             ============= === =========================== ========================= ===================== =================== ============================= =========================== ======================= =====================
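
Assuming the benchmark module is benchmarks/benchmarks/clustering.py (which the benchmark names in the table above suggest), it can be run on its own in the same way as the distances example:

asv run --bench "clustering.KMeansBenchmark.time_fit"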

@TonyBagnall (Contributor) commented:

I like the look of it. I don't suppose it also profiles? I would very much like to be able to profile algorithms.

@chrisholder (Contributor, Author) commented Aug 13, 2025

ASV has built-in support for profiling individual benchmarks. For example, to run a profile of the KMeansBenchmark.time_fit you would run:

asv profile clustering.KMeansBenchmark.time_fit aeon-benchmarking --gui snakeviz

This runs the benchmark under cProfile on the specified branch (aeon-benchmarking in this case) and opens the results directly in SnakeViz. SnakeViz is a new dev dependency I'll also add that lets you visualise the cProfile output. Happy to use a different GUI if anyone has suggestions. We can also output .prof files for later inspection, or load them programmatically via asv.results to explore in Python.

Profiling works for any benchmark in the suite, so we can do this for clustering, distances, or any other algorithm we have defined a benchmark for (see above for defining a benchmark).
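
As a sketch of the "later inspection" route (the profile.prof path below is just a placeholder for wherever the profile data is saved), a cProfile output file can be examined with the standard library:

import pstats

# Load a saved cProfile output file (hypothetical path) and print the
# 10 entries with the largest cumulative time.
stats = pstats.Stats("profile.prof")
stats.sort_stats("cumulative").print_stats(10)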

Here is an example of a visual profile I have for Kmeans:

[screenshot of a SnakeViz profile for KMeans]

@SebastianSchmidl (Member) left a comment:

Love this ❤️

Review comment on the following lines of the KMeans benchmark:

ks = [2, 4, 8]
inits = ["random", "kmeans++"]
distances = tuple(DISTANCES_DICT.keys()) # all supported distances
distances = ["euclidean"]
@SebastianSchmidl (Member) commented Aug 14, 2025:

Probably, you want to remove the distances overwrite before the merge

@chrisholder (Contributor, Author) replied:

I've had some thoughts since writing this benchmark about how we should handle models that could have a lot of different parameters. I'll write a new comment below this outlining my thoughts after writing a few benchmarks.

Comment on lines +9 to +10
otherwise. Some of the benchmarking features in `spin` also tell ASV to use the aeon
compiled by `spin`. To run the benchmarks, you will need to install the "dev"
@SebastianSchmidl (Member) commented:

What is spin?

@chrisholder (Contributor, Author) replied:

Ah, I need to remove this from the docs. That's something SciPy uses, and I decided we probably don't need it.

@chrisholder (Contributor, Author) commented Aug 14, 2025

Just going to put this here for future reference on how ASV runs benchmarks:

General ASV settings / rules:

  1. Benchmark duration
    Each individual benchmark case (one parameter combination) should complete in under 60 seconds to avoid hitting ASV's default timeout (this can be raised per benchmark; see the sketch after this list).

  2. Repetitions and statistical uncertainty
    ASV runs each benchmark multiple times until the statistical uncertainty (coefficient of variation) is below a threshold, or until it reaches a maximum repeat limit.
    By default:

    • Each benchmark is run at least 5 times
    • Up to 20 times if the target uncertainty is not reached
  3. Result aggregation
    The final reported benchmark time is the median of all runs, not the mean — this reduces the impact of outliers on results.

  4. Warm-up phase
    ASV includes a built-in warm-up period before timing starts, so we generally don’t need to manually pre-cache or pre-run functions inside the benchmark code.
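
These settings can be tuned per benchmark through class attributes. A minimal sketch, assuming the attribute names documented by ASV (timeout, repeat, warmup_time) and the Benchmark base class from this PR:

class SlowAlgorithm(Benchmark):
    # Raise the per-case limit above ASV's default 60 s timeout.
    timeout = 180
    # (min_repeat, max_repeat, max_time_seconds): repeat at least 3 and at
    # most 10 times, or until 30 s of measurement time has been spent.
    repeat = (3, 10, 30.0)
    # Extra warm-up before timing starts, e.g. for numba-compiled code.
    warmup_time = 1.0

    def time_something(self):
        ...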

@chrisholder (Contributor, Author) commented:

I also wanted to define a small set of initial guidelines for how we should conduct benchmarking in aeon.

What should we benchmark?

Any public-facing function where monitoring performance would be valuable.

Examples:

  • Estimators
  • Distance functions
  • Dataset loading functions

It may also be useful to benchmark functions that are used frequently internally.

Examples:

  • Conversion/Preprocessing time series logic
  • Numba utilities
  • Base class logic

How should we benchmark?

When designing benchmarks, be mindful of runtime.
Only parameterise an estimator with variables that are independent of other benchmarks.

For example, KMeans accepts a distance parameter, so it might seem reasonable to benchmark it with every possible distance. However, we already have dedicated benchmarks for the distance functions, so repeating them here would be redundant. Each benchmark should highlight performance that is specific to the thing being benchmarked.

We should also consider the input data shapes.
At a minimum, all benchmarks should be conducted over a range of:

  • Number of cases
  • Number of timepoints
  • Number of channels (if supported)

This range does not need to be huge. For example, for estimators and distances, I currently use the following shapes:

(10, 1, 10),
(10, 1, 1000),
(50, 1, 100),
(10, 3, 10),
(10, 3, 1000),
(50, 3, 100),

Due to how ASV runs benchmarks, even with relatively modest changes in the number of cases (e.g., up to a few thousand time series), this range is sufficient to detect performance changes across different dataset sizes.

I am currently only testing with NumPy format. Other formats are fine to include if you believe they could show different performance characteristics.

@SebastianSchmidl (Member) commented:

Overall, I agree with what you said above. My five cents:

What should we benchmark?

Any public-facing function where monitoring performance would be valuable.

Examples:

  • Estimators
  • Distance functions
  • Dataset loading functions

▶️ For the dataset loading, we need to make sure to exclude all network calls. It does not make sense to benchmark the network connection. Most interesting would be the parser for .ts-files, IMO.

For example, for KMeans which accepts a distance parameter, it might seem reasonable to benchmark every possible distance. However, we already have dedicated benchmarks for distance functions, so repeating them here would be redundant. Our goal with the benchmark is to highlight performance that is specific to that benchmark.

▶️ Is it reasonable to define default dummy functions in the benchmarking module? If we want to benchmark the overhead of the estimator rather than its dependencies/components, it would make sense to have no-op stubs for the components. E.g., for a distance d(x, y), just return x[0] or something similar?
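
A minimal sketch of that idea, assuming TimeSeriesKMeans accepts a callable for its distance parameter (worth double-checking against the current API before relying on it):

import numpy as np

from aeon.clustering import TimeSeriesKMeans


def noop_distance(x: np.ndarray, y: np.ndarray) -> float:
    # Constant-time stand-in for a real distance, so the benchmark measures
    # the estimator's own overhead rather than the distance computation.
    return float(abs(x[0, 0] - y[0, 0]))


est = TimeSeriesKMeans(
    n_clusters=2,
    distance=noop_distance,
    averaging_method="mean",
    n_init=1,
    random_state=1,
)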
