Feat: let nw.Enum accept categories, map pandas ordered categorical to Enum (only in main namespace, not stable.v1) #2192

Open
wants to merge 40 commits into
base: main

Member

@camriddell camriddell commented Mar 11, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Adds support for the nw.Enum datatype for pandas (backed by pandas.CategoricalDtype(…, ordered=True)).

The current implementation diverges from pandas/Polars in two broad ways

  1. We do not check for None, NaN, or Null (both pandas and Polars raise when they construct a CategoricalDtype/Enum with these in the categories list).
  2. pandas allows arbitrary (hashable) objects to be stored as the categories, whereas Polars only allows strings. The current implementation is type-hinted to follow suit with pandas, but we do not perform this check, instead letting the backend library raise as needed.
>>> import narwhals as nw
>>> import pandas as pd
>>> s = nw.new_series('foo', ['a', 'b', 'c'], dtype=nw.Enum(['a', 'b', 'c', 'd']), native_namespace=pd)
>>> s
┌───────────────────────────────────────────────┐
|                Narwhals Series                |
|-----------------------------------------------|
|0    a                                         |
|1    b                                         |
|2    c                                         |
|Name: foo, dtype: category                     |
|Categories (4, object): ['a' < 'b' < 'c' < 'd']|
└───────────────────────────────────────────────┘

- add conversion from native to pandas
- add conversion from native to Polars
@camriddell camriddell changed the title from "Feat:" to "Feat: nw.Enum support for pandas" Mar 11, 2025
@camriddell camriddell requested a review from MarcoGorelli March 12, 2025 16:03
@camriddell camriddell requested a review from FBruzzesi March 13, 2025 18:16
@MarcoGorelli
Member

thanks! it's encouraging that this doesn't break downstream tests

sorry i didn't get round to it for tomorrow's release, will try to get it in for next week's one 👍

    except ImportError as exc:  # pragma: no cover
        msg = f"Unable to convert to {dtype} due to the following exception: {exc.msg}"
        raise ImportError(msg) from exc
    return pd.CategoricalDtype(categories=dtype.categories, ordered=True)
Member

not sure if we can do something pandas-specific here, as this is used by cudf and modin too - could we generalise?

Member

@dangotbanned dangotbanned Mar 29, 2025

I'm a little confused by this.
pandas is already a module-level import?

import pandas as pd

Member Author

@dangotbanned you're right- I pulled this code from a pretty old branch I had so that must have just been leftover. I'll delete it.

@MarcoGorelli I'll look into generalizing cudf and modin

Comment on lines 359 to 362
if dtype == "category":
    if native_dtype.ordered:
        return dtypes.Enum(categories=native_dtype.categories)
    return dtypes.Categorical()
Member

this would be a breaking change, so i'm not totally sure about it - could we preserve the current behaviour in v1 and only make this change in the main namespace? the version variable is available in this function, you can use that
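The version gating requested here could look roughly like the sketch below. It is only an illustration: `Version`, `Categorical`, `Enum`, and `categorical_to_dtype` are hypothetical stand-ins defined in the snippet, not the Narwhals internals.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum as PyEnum, auto


class Version(PyEnum):
    # Hypothetical stand-in for the `version` variable mentioned in the review.
    V1 = auto()
    MAIN = auto()


@dataclass(frozen=True)
class Categorical:
    """Placeholder for the Categorical dtype."""


@dataclass(frozen=True)
class Enum:
    """Placeholder for the Enum dtype."""

    categories: tuple


def categorical_to_dtype(categories, ordered: bool, version: Version):
    # Only the main namespace maps an ordered categorical to Enum;
    # stable.v1 keeps returning Categorical, so existing code is unaffected.
    if ordered and version is Version.MAIN:
        return Enum(tuple(categories))
    return Categorical()
```

With this shape, the breaking behaviour is opt-in by namespace: callers on the stable API never see an Enum coming back from an ordered categorical.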

dangotbanned added a commit that referenced this pull request Mar 29, 2025
I noticed a new one in (#2192) and thought I'd get them all in one sweep
MarcoGorelli pushed a commit that referenced this pull request Mar 29, 2025
* chore(typing): Resolve `_polars.utils` dtype ignores

I noticed a new one in (#2192) and thought I'd get them all in one sweep

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* chore: "coverage"

Just replacing the original `getattr`, there was already no coverage for that

https://github.com/narwhals-dev/narwhals/actions/runs/14145863466/job/39633072966?pr=2312

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@MarcoGorelli MarcoGorelli added the enhancement New feature or request label Apr 4, 2025
@MarcoGorelli
Member

thanks Cam - looks like there's an xpass

FAILED tests/series_only/cast_test.py::test_cast_to_enum_v1[modin[pyarrow]]

@@ -72,6 +72,36 @@ def __hash__(self: Self) -> int:
        return hash(self.__class__)


class Enum(NwEnum):
    """A fixed categorical encoding of a unique set of strings.
Member Author

@dangotbanned the typing here gets a bit wonky as we currently need v1._dtypes... implementations to inherit from what is defined in nw.dtypes. However nw.dtypes.Enum had its call signature changed which should not be propagated down to v1._dtypes.Enum so I implemented this functionality to skip a level of inheritance on its defined methods.

It feels like I may have something backwards here though? Would love to hear your thoughts.

Member

Thanks for the ping @camriddell, will take a look in the morning

Member

@dangotbanned dangotbanned Apr 9, 2025

Definitely a tricky one, but I have a few ideas I'm gonna try out today.

I did a search of existing usage and looked at what we allow in tests.

I think our main concern should be preserving the behavior of isinstance(..., nw.Enum).
The cases with dtype == nw.Enum are simple to handle without subclassing.

I haven't tried out customizing-instance-and-subclass-checks yet - but have thought about it for another DType issue (#2050 (comment))
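For context, customizing instance checks in Python is done with a metaclass that defines `__instancecheck__`. The sketch below shows the mechanism only; `_EnumMeta`, `MainEnum`, `V1Enum`, and the `_is_enum_dtype` flag are hypothetical names and not anything that exists in Narwhals.

```python
class _EnumMeta(type):
    # Hypothetical metaclass: isinstance(x, MainEnum) succeeds for any object
    # whose class carries the `_is_enum_dtype` flag, with no inheritance needed.
    def __instancecheck__(cls, instance):
        return getattr(type(instance), "_is_enum_dtype", False)


class MainEnum(metaclass=_EnumMeta):
    _is_enum_dtype = True

    def __init__(self, categories):
        self.categories = tuple(categories)


class V1Enum:
    # Deliberately NOT a subclass of MainEnum, so it can keep its own
    # constructor signature without overriding anything.
    _is_enum_dtype = True
```

Here `isinstance(V1Enum(), MainEnum)` is true even though `issubclass(V1Enum, MainEnum)` is false, which is the property that would let the two namespaces diverge in signature while keeping instance checks working.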

Member

I'm wanting to avoid subclassing, since this is a pretty clear Liskov substitution principle violation (not your fault, just how v1 inheriting from main works)

Member

i think it's ok to allow Enum to accept categories in v1 as well, so long as == nw.Enum keeps working - can you check what we do for Datetime and Duration? I think something similar might work?
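As a rough illustration of the idea, a parametrised dtype can compare equal to its own bare class while still distinguishing between parametrisations. The `Duration` class below is a hypothetical toy written for this sketch; it is an assumption about the general pattern, not a copy of how Narwhals implements Datetime or Duration.

```python
class Duration:
    """Hypothetical parametrised dtype, for illustration only."""

    def __init__(self, time_unit: str = "us") -> None:
        self.time_unit = time_unit

    def __eq__(self, other: object) -> bool:
        # Comparing against a class (rather than an instance) ignores the
        # parameters, so `dtype == Duration` keeps working.
        if isinstance(other, type):
            return issubclass(other, Duration)
        return isinstance(other, Duration) and self.time_unit == other.time_unit

    def __hash__(self) -> int:
        # Defining __eq__ removes the default __hash__, so restore one.
        return hash((Duration, self.time_unit))
```

The reflected comparison also works: `Duration == Duration("ms")` falls back to the instance's `__eq__` because the class-level comparison returns `NotImplemented`.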

Member

> i think it's ok to allow Enum to accept categories in v1 as well, so long as == nw.Enum keeps working - can you check what we do for Datetime and Duration? I think something similar might work?

Lol @MarcoGorelli the timing on this 😅 (105e394)

Let this be a warning, check your files after you use a linter.
@dangotbanned dangotbanned self-requested a review April 9, 2025 11:30
Comment on lines 452 to 464
    def __init__(self, categories: Iterable[Hashable] | type[enum.Enum]) -> None:
        # TODO(Unassigned): pandas errors on NaN, NA, NaT OR duplicated value category
        #   Polars errors on Null, NaN OR duplicated OR any non-string category
        #   should the intersection of the above be caught at the narwhals layer?
        if isinstance(categories, type) and issubclass(categories, enum.Enum):
            categories = (getattr(v, "value", v) for v in categories.__members__.values())
        self.categories = [*categories]

    def __eq__(self: Self, other: object) -> bool:
        # allow comparing object instances to class
        if type(other) is type:
            return other is Enum
        return isinstance(other, type(self)) and self.categories == other.categories
Member

@camriddell you're using a list to describe:

A fixed categorical encoding of a unique set of strings.

pl.Enum, for reference, uses a pl.Series and makes quite a lot of checks to ensure they can keep that promise

https://github.com/pola-rs/polars/blob/py-1.26.0/py-polars/polars/datatypes/classes.py#L645-L727

This all type checks fine, and passes through at runtime - despite being the wrong types and non-unique:

import narwhals as nw

very_invalid = nw.Enum(
    ["beluga", "narwhal", "orca", (), nw.Enum([]), "narwhal", ((),), "narwhal", "orca"]
)
>>> very_invalid
Enum(categories=['beluga', 'narwhal', 'orca', (), Enum(categories=[]), 'narwhal', ((),), 'narwhal', 'orca'])

Since a list was used, I can break the "fixed" promise with:

very_invalid.categories.insert(0, [9, 2, 3, 5])
>>> very_invalid
Enum(categories=[[9, 2, 3, 5], 'beluga', 'narwhal', 'orca', (), Enum(categories=[]), 'narwhal', ((),), 'narwhal', 'orca'])

Or go further with:

very_invalid.categories.sort(key=lambda x: not isinstance(x,str))
>>> very_invalid
Enum(categories=['beluga', 'narwhal', 'orca', 'narwhal', 'narwhal', 'orca', [9, 2, 3, 5], (), Enum(categories=[]), ((),)])

Member

I think we might need to restrict inputs to a nw.Series - since we'd need access to a backend to make the same checks.

Possibly raise on Enum.__setattr__ as well 🤔

Member Author

@camriddell camriddell Apr 9, 2025

  1. Agreed on performing uniqueness checks during instantiation like Polars.
  2. I'm agnostic on enforcing that inputs should be only strings, apologies for ignoring my own docstring on that one. While this is core to Polars behavior, pandas differs here by allowing any hashable object.
    • Originally, my thought was to let the backends figure out what they can/can't handle when we convert to them during runtime. But this would mean that code that works for one backend simply wouldn't work for another, so perhaps taking the strictest set of requirements is favorable?

I'm still a fan of using basic data structures here for agnostic data modeling as it is less complex than having a dtype that needs a backend for creation. For immutability, we can coerce to a tuple instead of a list at instantiation so that the above manipulations become disallowed.
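A minimal sketch of the tuple-coercion idea, combined with the earlier suggestion to raise on `__setattr__`. `FrozenEnum` is a hypothetical class written for this illustration and is not the code in the PR.

```python
class FrozenEnum:
    """Hypothetical dtype that stores its categories immutably."""

    __slots__ = ("_categories",)

    def __init__(self, categories):
        # Coerce any iterable to a tuple so list methods such as
        # insert() and sort() are simply not available afterwards.
        object.__setattr__(self, "_categories", tuple(categories))

    @property
    def categories(self):
        return self._categories

    def __setattr__(self, name, value):
        # Block rebinding of attributes after construction.
        raise AttributeError(f"{type(self).__name__} is immutable")

    def __repr__(self):
        return f"FrozenEnum(categories={self._categories!r})"
```

With a tuple, the `categories.insert(...)` and `categories.sort(...)` calls from the earlier example fail with `AttributeError`, which at least signals the intent even though Python cannot fully prevent determined mutation.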

As an aside, Polars doesn't make guarantees about immutability. I don't know if we should go out of our way to force users to not manipulate data in this manner either.

>>> e = pl.Enum(['a', 'b', 'c'])
>>> e.categories
shape: (3,)
Series: 'category' [str]
[
        "a"
        "b"
        "c"
]
>>> e.categories[0] = 'd'
>>> e
Enum(categories=['d', 'b', 'c'])

Member

Thanks for following up on the mutability bit!

If we have a choice of only list or tuple, then I think choosing the latter at least hints at the intent 🙂

Member Author

I'll push an update factoring this in and we can see if it is satisfactory.

Member

not sure we can really use nw.Series as lazy backends don't support it

not sure i see the issue with tuple?

Member

> not sure we can really use nw.Series as lazy backends don't support it

@MarcoGorelli For the test I'm talking about testing this, which all works and passes type checking:

import polars as pl

import narwhals as nw

categories = pl.Series(["a", "b", "c"])
enum = pl.Enum(categories)
>>> enum
Enum(categories=['a', 'b', 'c'])

categories_nw = nw.from_native(categories, series_only=True)
enum_nw = nw.Enum(categories_nw)
>>> enum_nw
Enum(categories=('a', 'b', 'c'))

enum_nw_direct = nw.Enum(pl.Series(["a", "b", "c"]))
>>> enum_nw_direct
Enum(categories=('a', 'b', 'c'))

We could align the __repr__ with polars as well - but not too important

Member

> lazy backends don't support it

That part isn't an issue here, since we're just relying on an object with __iter__ defined when we do

https://github.com/camriddell/narwhals/blob/3357c551f23d261d5b6f04ea06db72540bef7af2/narwhals/dtypes.py#L460

            self.categories = tuple(categories)
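To illustrate why any iterable works as input, here is a small standalone sketch of the normalization step. `normalize_categories`, `Animal`, and `Wrapper` are hypothetical names for this example; the enum-class branch mirrors the `__members__` handling quoted earlier in the thread.

```python
import enum


def normalize_categories(categories):
    # An enum.Enum subclass contributes the values of its members.
    if isinstance(categories, type) and issubclass(categories, enum.Enum):
        return tuple(member.value for member in categories.__members__.values())
    # Anything else only needs to define __iter__ (list, generator, Series-like).
    return tuple(categories)


class Animal(enum.Enum):
    BELUGA = "beluga"
    NARWHAL = "narwhal"


class Wrapper:
    """Toy stand-in for a Series-like object that only defines __iter__."""

    def __init__(self, values):
        self._values = values

    def __iter__(self):
        return iter(self._values)
```

Because the function relies only on the iteration protocol, a Series wrapper passes through without the dtype needing to know anything about the backend.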

Member

ah sorry I thought you were suggesting to store categories in a nw.Series, instead perhaps you're suggesting that nw.Series should work as an input to categories=? if so, i agree, and I think that that should indeed work

Member

@dangotbanned dangotbanned Apr 13, 2025

Yeah apologies for the confusion, I did suggest that before! 😅

Just wanting to be extra cautious, since it currently works but isn't documented or included in tests - so could easily be lost in a refactor later

Update

Little test added in (84ea789)

Important

There's still other stuff in the thread that isn't resolved

    """

    def __init__(self, categories: Iterable[Hashable] | type[enum.Enum]) -> None:
        # TODO(Unassigned): pandas errors on NaN, NA, NaT OR duplicated value category
        #   Polars errors on Null, NaN OR duplicated OR any non-string category
Member Author

@MarcoGorelli @dangotbanned any thoughts on whether we should perform a NaN value check at instantiation? Or should we let each backend handle their cases as to these kinds of checks.

Note that @dangotbanned and I are contemplating the duplicated/non-string categories in another thread.

Member

i'd suggest leaving all of these to the backends

Member

@MarcoGorelli MarcoGorelli left a comment

awesome, this looks very close

@MarcoGorelli MarcoGorelli changed the title from "Feat: nw.Enum support for pandas" to "Feat: let nw.Enum accept categories, map pandas ordered categorical to Enum (only in main namespace, not stable.v1)" Apr 13, 2025
Member

@MarcoGorelli MarcoGorelli left a comment

thanks @camriddell , and @dangotbanned for reviews!

i think this is mostly ready?

@dangotbanned
Member

dangotbanned commented Apr 13, 2025

> thanks @camriddell , and @dangotbanned for reviews!
>
> i think this is mostly ready?

Thanks @MarcoGorelli

Yeah I think after (#2192 (comment)) - my only other unresolved note was adding some guards like in https://github.com/pola-rs/polars/blob/py-1.26.0/py-polars/polars/datatypes/classes.py#L645-L727

@camriddell seemed on board with the suggestion - we've just been a bit more active 😉

My concern there is if we leave it to raise when it reaches each backend - we won't have consistent errors and the traceback might not be super clear where the origin was

@camriddell
Member Author

camriddell commented Apr 14, 2025

> > thanks @camriddell , and @dangotbanned for reviews!
> > i think this is mostly ready?
>
> Thanks @MarcoGorelli
>
> Yeah I think after (#2192 (comment)) - my only other unresolved note was adding some guards like in https://github.com/pola-rs/polars/blob/py-1.26.0/py-polars/polars/datatypes/classes.py#L645-L727
>
> @camriddell seemed on board with the suggestion - we've just been a bit more active 😉
>
> My concern there is if we leave it to raise when it reaches each backend - we won't have consistent errors and the traceback might not be super clear where the origin was

I think I'm in favor of taking the strictest route here, and that is to mimic Polars again:

  1. Only allow containers of strings as valid categories
  2. Explicitly disallow None and float('nan')

Adding onto this discussion:

If we only let backends raise, we will hit an issue where some code only works with specific backends, which undermines the purpose of Narwhals. With the current Enum targeting the pandas_like and Polars backends, I see this primarily happening where code written with a pandas backend in mind breaks when a user passes in a Polars DataFrame, because the Enum(…) had non-string categories.

The current checks we want to enforce are:

  • no duplicates.
  • no NaN or Null values

If we accept categories as Iterable[Any], then for those checks:

  • duplicates should be straightforward
  • NaN can likely be detected by checking value != value, since this follows IEEE-754 semantics and should be fairly uniform across backends (I believe we also use this semantic in other places internally).
  • Null may be problematic since we would need to verify None, pandas.NA, pyarrow.NullScalar, and who knows how many other Null-value representation future backends may have.
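The first two checks in this list can be sketched with the standard library alone. `is_nan` and `duplicated` are hypothetical helper names for this illustration; the null case is intentionally left out, since it would need one branch per backend-specific null representation.

```python
from collections import Counter
from typing import Any, Hashable, Iterable


def is_nan(value: Any) -> bool:
    # Under IEEE-754, NaN is not equal to itself.
    try:
        return bool(value != value)
    except (TypeError, ValueError):
        # Some objects refuse to coerce a comparison result to bool;
        # this sketch treats them as "not NaN" and leaves them to a null check.
        return False


def duplicated(values: Iterable[Hashable]) -> list:
    # Counter preserves first-seen order, so duplicates are reported
    # in the order they first appear.
    counts = Counter(values)
    return [value for value, n in counts.items() if n > 1]
```

Note that `is_nan(None)` is false, which is exactly why null detection would need separate, explicit handling under the `Iterable[Any]` approach.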

If we limit users to Iterable[str], then for the checks themselves:

  • duplicates is very doable (as it was before).
  • Null/NaN values can be explicitly checked against float('nan') and None.
  • All disallowed values can be found via not isinstance(value, str). Barring some overrides against isinstance this should be fairly robust.
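Under the stricter `Iterable[str]` option, a single `isinstance` check covers None, NaN, and every other non-string value at once. The helper below is a sketch of that approach; `category_errors` is a hypothetical name, and it returns a list of problems instead of raising so the behaviour is easy to inspect.

```python
from __future__ import annotations

from typing import Iterable


def category_errors(categories: Iterable[object]) -> list[str]:
    """Return a description of every problem found; empty means valid."""
    cats = tuple(categories)
    errors: list[str] = []

    # One check rejects None, float('nan'), tuples, nested dtypes, and so on.
    for value in cats:
        if not isinstance(value, str):
            errors.append(f"non-string category: {value!r}")

    # Report each duplicated string once.
    seen: set[str] = set()
    reported: set[str] = set()
    for value in cats:
        if not isinstance(value, str):
            continue
        if value in seen and value not in reported:
            errors.append(f"duplicated category: {value!r}")
            reported.add(value)
        seen.add(value)
    return errors
```

A constructor could raise a single, backend-independent exception whenever this list is non-empty, which addresses the concern about inconsistent errors and unclear tracebacks.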

Labels
dtypes enhancement New feature or request

3 participants