interp - Prefer broadcast over reindex when possible #10554

Status: Open. Illviljan wants to merge 17 commits into main.

@Illviljan (Contributor) commented Jul 21, 2025

When a variable is effectively a scalar, broadcasting it is faster than using reindex. This PR uses broadcasting in that case during dataset interpolation.

import numpy as np
import dask.array as da

import xarray as xr

ds = xr.Dataset(
    data_vars={
        "variable_name": (
            "time",
            da.from_array(np.array(["test"], dtype=str), chunks=(1,)),
        )
    },
    coords={"time": ("time", np.array([0]))},
)

%timeit ds.interp(time=np.linspace(0, 10, 50))
%timeit ds.interp(time=np.linspace(0, 10, 100))
%timeit ds.interp(time=np.linspace(0, 10, 1000))

Main:

16.8 ms ± 620 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
27.5 ms ± 2.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
214 ms ± 7.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

PR:

5.3 ms ± 138 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.29 ms ± 87 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.29 ms ± 188 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@Illviljan changed the title from "Prefer broadcast over reindex when possible" to "interp - Prefer broadcast over reindex when possible" on Jul 21, 2025
@Illviljan marked this pull request as ready for review on July 21, 2025 13:33
to_broadcast = (var.copy().squeeze(),) + tuple(
    dest for index, dest in use_indexers.values()
)
variables[name] = broadcast_variables(*to_broadcast)[0]
Contributor

This changes the semantics from copies to views. We'll have to manually deepcopy these vars to avoid confusing downstream errors.
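The copy-versus-view concern can be illustrated with plain NumPy. This is only a sketch of the general behaviour of broadcasting, assuming the broadcast ends up in something like `np.broadcast_to`; it does not reproduce xarray's internal code path, and the target length of 5 is arbitrary.

```python
import numpy as np

# A length-1 string array, similar to the variable in the benchmark above.
scalar = np.array(["test"], dtype=str)

# Broadcasting the squeezed (0-d) value to a longer shape creates a view:
# no data is copied, every element points at the same memory.
view = np.broadcast_to(scalar.squeeze(), (5,))

shares = np.shares_memory(view, scalar)  # the view aliases the original buffer
writeable = view.flags.writeable         # broadcast views are read-only

# An explicit copy owns its data and can be modified safely.
independent = view.copy()
independent[0] = "edit"

print(shares, writeable, scalar[0], independent[0])
```

Code that later tries to write into the broadcast result in place would hit the read-only view, which is the kind of downstream confusion a deep copy avoids.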

Contributor Author

broadcast_variables(*to_broadcast)[0].copy(deep=True) should do the trick I think.
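A rough sketch of the same idea using only public `xarray.Variable` methods. It uses `Variable.set_dims` as a stand-in for the internal `broadcast_variables` helper, and `n_new` is a hypothetical target size rather than anything taken from the PR.

```python
import numpy as np
import xarray as xr

# A length-1 string variable along "time", as in the benchmark above.
var = xr.Variable(("time",), np.array(["test"], dtype=str))

# Hypothetical number of points in the interpolation destination.
n_new = 5

# Drop the length-1 dimension, then broadcast the 0-d value to the new size.
broadcasted = var.squeeze().set_dims({"time": n_new})

# Deep-copy so the result owns its own buffer and can be edited in place.
detached = broadcasted.copy(deep=True)
detached[0] = "edit"

print(detached.values, var.values)
```

After the deep copy, editing `detached` leaves `var` untouched, which is the behaviour callers of `interp` would have relied on with the reindex path.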

Contributor

Yes that would work. This could be a good opportunity to look for optimizations in reindex if you have the bandwidth.

Contributor Author

Using copy(deep=True) now. I couldn't see a noticeable difference with the example above.

The last time I followed the reindex path, Dask was the bottleneck, though I'm not very familiar with those functions.
I recall reindex being about the same speed as interpolation a few years ago.

@Illviljan added the "plan to merge" (Final call for comments) label on Jul 23, 2025
Labels: plan to merge (Final call for comments), topic-performance

Successfully merging this pull request may close these issues:
reindex is very slow with small chunksizes

2 participants