-
Notifications
You must be signed in to change notification settings - Fork 136
Feat: let nw.Enum
accept categories, map pandas ordered categorical to Enum (only in main namespace, not stable.v1)
#2192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- add conversion from native to pandas - add conversion from native to Polars
thanks! it's encouraging that this doesn't break downstream tests sorry i didn't get round to it for tomorrow's release, will try to get it in for next week's one π |
narwhals/_pandas_like/utils.py
Outdated
except ImportError as exc: # pragma: no cover | ||
msg = f"Unable to convert to {dtype} to to the following exception: {exc.msg}" | ||
raise ImportError(msg) from exc | ||
return pd.CategoricalDtype(categories=dtype.categories, ordered=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if we can do something pandas-specific here, as this is used by cudf and modin too - could we generalise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused by this.
pandas
is already a module-level import?
narwhals/narwhals/_pandas_like/utils.py
Line 13 in 5550ad8
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dangotbanned you're right- I pulled this code from a pretty old branch I had so that must have just been leftover. I'll delete it.
@MarcoGorelli I'll look into generalizing cudf
and modin
narwhals/_pandas_like/utils.py
Outdated
if dtype == "category": | ||
if native_dtype.ordered: | ||
return dtypes.Enum(categories=native_dtype.categories) | ||
return dtypes.Categorical() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would be a breaking change, so i'm not totally sure about it - could we preserve the current behaviour in v1
and only make this change in the main namespace? the version
variable is available in this function, you can use that
I noticed a new one in (#2192) and thought I'd get them all in one sweep
* chore(typing): Resolve `_polars.utils` dtype ignores I noticed a new one in (#2192) and thought I'd get them all in one sweep * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * chore: "coverage" Just replacing the original `getattr`, there was already no coverage for that https://github.com/narwhals-dev/narwhals/actions/runs/14145863466/job/39633072966?pr=2312 --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
This will be fixed in a future modin release modin-project/modin#7487
thanks Cam - looks like there's a xpass
|
@@ -72,6 +72,36 @@ def __hash__(self: Self) -> int: | |||
return hash(self.__class__) | |||
|
|||
|
|||
class Enum(NwEnum): | |||
"""A fixed categorical encoding of a unique set of strings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dangotbanned the typing here gets a bit wonky as we currently need v1._dtypes...
implementations to inherit from what is defined in nw.dtypes
. However nw.dtypes.Enum
had its call signature changed which should not be propagated down to v1._dtypes.Enum
so I implemented this functionality to skip a level of inheritance on its defined methods.
If feels like I may have something backwards here though? Would love to hear you thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the ping @camriddell, will take a look in the morning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely a tricky one, but I have a few ideas I'm gonna try out today.
I did a search of existing usage and looked at what we allow in tests.
I think our main concern should be preserving the behavior of isinstance(..., nw.Enum)
.
The cases with dtype == nw.Enum
are simple to handle without subclassing.
I haven't tried out customizing-instance-and-subclass-checks yet - but have thought about it for another DType
issue (#2050 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wanting to avoid subclassing, since this is a pretty clear Liskov substitution principle violation (not your fault, just how v1
inheriting from main
works)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it's ok to allow Enum
to accept categories
in v1 as well, so long as == nw.Enum
keeps working - can you check what we do for Datetime
and Duration
? I think something similar might work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it's ok to allow
Enum
to acceptcategories
in v1 as well, so long as== nw.Enum
keeps working - can you check what we do forDatetime
andDuration
? I think something similar might work?
Lol @MarcoGorelli the timing on this π (105e394)
Let this be a warning, check your files after you use a linter.
narwhals/dtypes.py
Outdated
def __init__(self, categories: Iterable[Hashable] | type[enum.Enum]) -> None: | ||
# TODO(Unassigned): pandas errors on NaN, NA, NaT OR duplicated value category | ||
# Polars errors on Null, NaN OR duplicated OR any non-string category | ||
# should the intersection of the above be caught at the narwhals layer? | ||
if isinstance(categories, type) and issubclass(categories, enum.Enum): | ||
categories = (getattr(v, "value", v) for v in categories.__members__.values()) | ||
self.categories = [*categories] | ||
|
||
def __eq__(self: Self, other: object) -> bool: | ||
# allow comparing object instances to class | ||
if type(other) is type: | ||
return other is Enum | ||
return isinstance(other, type(self)) and self.categories == other.categories |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@camriddell you're using a list
to describe:
A fixed categorical encoding of a unique set of strings.
pl.Enum
, for reference, uses a pl.Series
and makes quite a lot of checks to ensure they can keep that promise
https://github.com/pola-rs/polars/blob/py-1.26.0/py-polars/polars/datatypes/classes.py#L645-L727
This all type checks fine, and passes through at runtime - despite being the wrong types and non-unique:
import narwhals as nw
very_invalid = nw.Enum(
["beluga", "narwhal", "orca", (), nw.Enum([]), "narwhal", ((),), "narwhal", "orca"]
)
>>> very_invalid
Enum(categories=['beluga', 'narwhal', 'orca', (), Enum(categories=[]), 'narwhal', ((),), 'narwhal', 'orca'])
Since a list
was used, I can break the "fixed" promise with:
very_invalid.categories.insert(0, [9, 2, 3, 5])
>>> very_invalid
Enum(categories=[[9, 2, 3, 5], 'beluga', 'narwhal', 'orca', (), Enum(categories=[]), 'narwhal', ((),), 'narwhal', 'orca'])
Or go further with:
very_invalid.categories.sort(key=lambda x: not isinstance(x,str))
>>> very_invalid
Enum(categories=['beluga', 'narwhal', 'orca', 'narwhal', 'narwhal', 'orca', [9, 2, 3, 5], (), Enum(categories=[]), ((),)])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might need to restrict inputs to a nw.Series
- since we'd need access to a backend to make the same checks.
Possibly raise on Enum.__setattr__
as well π€
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Agreed on performing uniqueness checks during instantiation like Polars.
- I'm agnostic on enforcing that inputs should be only strings, apologies for ignoring my own docstring on that one. While this is core to Polars behavior, pandas differs here by allowing any hashable object.
- Originally, my thought was to let the backends figure out what they can/can't handle when we convert to them during runtime. But this would mean that code that works for one backend simply wouldn't work for another, so perhaps taking the strictest set of requirements is favorable?
I'm still a fan of using basic data structures here for agnostic data modeling as it is less complex than having a dtype that needs a backend for creation. For immutability, we can coerce to a tuple instead of a list at instantiation so that the above manipulations become disallowed.
As an aside, Polars doesn't make guarantees about immutability. I don't know if we should go out of our way to force users to not manipulate data in this manner either.
>>> e = pl.Enum(['a', 'b', 'c'])
>>> e.categories
shape: (3,)
Series: 'category' [str]
[
"a"
"b"
"c"
]
>>> e.categories[0] = 'd'
>>> e
Enum(categories=['d', 'b', 'c'])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for following up on the mutability bit!
If we have a choice of only list
or tuple
, then I think choosing the later at least hints the intent π
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll push an update factoring this in and we can see if it is satisfactory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure we can really use nw.Series
as lazy backends don't support it
not sure i see the issue with tuple
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure we can really use
nw.Series
as lazy backends don't support it
@MarcoGorelli For the test I'm talking about testing this, which all works and passes type checking:
import polars as pl
import narwhals as nw
categories = pl.Series(["a", "b", "c"])
enum = pl.Enum(categories)
>>> enum
Enum(categories=['a', 'b', 'c'])
categories_nw = nw.from_native(categories, series_only=True)
enum_nw = nw.Enum(categories_nw)
>>> enum_nw
Enum(categories=('a', 'b', 'c'))
enum_nw_direct = nw.Enum(pl.Series(["a", "b", "c"]))
>>> enum_nw_direct
Enum(categories=('a', 'b', 'c'))
We could align the __repr__
with polars
as well - but not too important
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lazy backends don't support it
That part isn't an issue here, since we're just relying on an object with __iter__
defined when we do
self.categories = tuple(categories)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah sorry I thought you were suggesting to store categories
in a nw.Series
, instead perhaps you're suggesting that nw.Series
should work as an input to categories=
? if so, i agree, and I think that that should indeed work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah apologies for the confusion, I did suggest that before! π
Just wanting to be extra cautious, since it currently works but isn't documented or included in tests - so could easily be lost in a refactor later
Update
Little test added in (84ea789)
Important
There's still other stuff in the thread that isn't resolved
""" | ||
|
||
def __init__(self, categories: Iterable[Hashable] | type[enum.Enum]) -> None: | ||
# TODO(Unassigned): pandas errors on NaN, NA, NaT OR duplicated value category | ||
# Polars errors on Null, NaN OR duplicated OR any non-string category |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarcoGorelli @dangotbanned any thoughts on whether we should perform a NaN value check at instantiation? Or should we let each backend handle their cases as to these kinds of checks.
Note that @dangotbanned and I are contemplating the duplicated/non-string categories in another thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd suggest leaving all of these to the backends
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome, this looks very close
nw.Enum
accept categories, map pandas ordered categorical to Enum (only in main namespace, not stable.v1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MarcoGorelli Yeah I think after (#2192 (comment)) - my only other unresolved note was adding some guards like in https://github.com/pola-rs/polars/blob/py-1.26.0/py-polars/polars/datatypes/classes.py#L645-L727 @camriddell seemed on board with the suggestion - we've just been a bit more active π
My concern there is if we leave it to raise when it reaches each backend - we won't have consistent errors and the traceback might not be super clear where the origin was |
I think I'm in favor of taking the strictest route here, and that is to mimic Polars again:
Adding onto this discussion: If we only let backends raise, we will hit an issue where some code only work with specific backends which reduces the purpose of Narwhals. With the current Enum targeting the The current checks we want to enforce are:
If we enforce categories as Iterable[Any] then those checks
If we limit users to Iterable[str], then for the checks themselves:
|
What type of PR is this? (check all applicable)
Related issues
Enum
take arguments, allow it in constructionΒ #1541Checklist
If you have comments or can explain your changes, please do so below
Adds support for the
nw.Enum
datatype for pandas (backed bypandas.CategoricalDtype(β¦, ordered=True)
The current implementation diverges from pandas/Polars in two broad ways