Skip to content

Add Index.validate_dataarray_coord #10137

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 27 commits into
base: main
Choose a base branch
from

Conversation

benbovy
Copy link
Member

@benbovy benbovy commented Mar 17, 2025

  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

Same goal than #10116 using the alternative approach suggested in #10116 (comment) where the propagation (validation) of coordinates in a DataArray is delegated to their index (if any).

I find this approach cleaner than #10116. Here is the same notebook example adapted for this PR.

Index API alternatives

  1. Index.validate_dataarray_coord(self, name: Hashable, var: Variable, dims: set[Hashable]) -> None

    • check one index coordinate at a time, thus called multiple times for a multi-coordinate index
  2. Index.validate_dataarray_coords(self, variables: dict[Hashable, Variable], dims: set[Hashable]) -> None

    • check all coordinates of an index at once
    • perhaps could make things easier for developers of custom Xarray indexes (perhaps not?), but it would make Xarray internals more complicated with possible significant impact on performance (we'd have to rely on Indexes.group_by_index while in option 1 we simply need to iterate over _variables, _coords and/or _indexes).
  3. Index.validate_dataarray_coords(self, variables: dict[Hashable, Variable], dims: set[Hashable]) -> Coordinates

    • Same than option 2 but allow returning a new set of coordinate variables and their indexes.
    • This has much more flexibility (e.g., convert a custom 2D spatial index to a 1D default PandasIndex when the result has only one of the two spatial dimensions), but I find it also complicated and prone to confusion.

Option 1, the simplest and working well with IntervalIndex, is currently implemented in this PR.

cc @shoyer @dcherian

@benbovy benbovy marked this pull request as ready for review March 17, 2025 12:19
Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some quick thoughts

benbovy added 4 commits March 18, 2025 12:09
Functions with a leading underscore are marked by pyright as unused if
they are not used from within the module in which they are defined.

Also remove unneeded nested import.
Move check_dataarray_coords in xarray.core.coordinates module and rename
it to validate_dataarray_coords (name consistent with
Index.validate_dataarray_coord).

Move CoordinateValidationError from xarray.core.indexes to
xarray.core.coordinates module.
@benbovy
Copy link
Member Author

benbovy commented Mar 18, 2025

Thanks @shoyer for taking a look!

This is now ready for (another round of) review.

Compared to #10116 this PR still represents a major data model change for DataArray (in the sense that it allows overriding strict enforcement of the DataArray model in certain cases), but I think that the risk is mitigated since it is here done explicitly via Index.validate_dataarray_coord and the default behavior is to keep conforming to the DataArray model. Limiting an index to (in)validate its coordinates may also be good to keep things simple and predictable for end-users.

@dcherian regarding your #10116 (comment), in the example notebook the DataArray reprs shown without any list of dimensions for the Coordinates section do not strike me much. That said, the Coordinates repr is still basic and might benefit from such addition (it doesn't have any html repr BTW).

Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@dcherian
Copy link
Contributor

dcherian commented Mar 29, 2025

That said, the Coordinates repr is still basic and might benefit from such addition (it doesn't have any html repr BTW).

Yeah I think it could be improved, but certainly not a blocker here.

IMO we can merge. I'm not a huge fan of the method name validate_dataarray_coord but I couldn't come up with anything better

benbovy and others added 2 commits March 31, 2025 09:14
Co-authored-by: Deepak Cherian <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>
@benbovy benbovy added the plan to merge Final call for comments label Mar 31, 2025
@benbovy
Copy link
Member Author

benbovy commented Mar 31, 2025

Thanks @dcherian and @shoyer. I added "plan to merge" but I'm open to revisit anything here.

@dcherian dcherian mentioned this pull request Apr 4, 2025
13 tasks
@shoyer
Copy link
Member

shoyer commented Apr 23, 2025

Thanks @benbovy for this change, and apologies for the slow review.

My one big question is if validate_dataarray_coord could be replaced by a method that returns a boolean rather than potentially raising an error. It's not clear to me what the advantages of raising an error are.

The fall-back logic of checking coordinate dimensions could also potentially be implemented on the base Index class, which would allow for customizing it if desired.

Instead of adding the ``Index.validate_dataarray_coord`` method that may
raise an error, add ``Index.should_add_coord_in_datarray`` method that
returns a boolean.
@benbovy
Copy link
Member Author

benbovy commented Apr 24, 2025

@shoyer I like your suggestion and find it cleaner.

In 3e55af0 I replaced the validate_dataarray_coord with should_add_coord_in_dataarray (couldn't find a better name, suggestions welcome!).

I also removed the fall-back in Dataset._construct_dataarray. The logic of checking coordinate dimensions is implemented in the Index base class (in Index.should_add_coord_in_dataarray).

Is that what you had in mind?

benbovy added 3 commits April 24, 2025 11:53
Also do not add a coordinate in a DataArray indexed from a Dataset if
``index.should_add_coord_in_datarray`` returns False, even if the
coordinate variable dimensions are compatible with the DataArray
dimensions (note: there's no test for this case yet, I don't know if it
represents any real use case).
@dcherian
Copy link
Contributor

Yeah this method's a hard one to name. The only thing that comes to mind is should_propagate_variable_with_dataarray :/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants