ENH: Introduce pandas.col
#62103
When this is added, and then released,
One comment is that I'm not sure it will support some basic arithmetic, such as:
result = df.assign(addcon=pd.col("a") + 10)
Or alignment with other series:
b = df["b"]  # or this could be from a different DF
result = df.assign(add2=pd.col("a") + b)
Also, don't you need to add some tests?
Thanks for taking a look!
Yup, they're both supported:
In [8]: df = pd.DataFrame({'a': [1,2,3]})
In [9]: s = pd.Series([90,100,110], index=[2,1,0])
In [10]: df.assign(
    ...:     b=pd.col('a')+10,
    ...:     c=pd.col('a')+s,
    ...: )
Out[10]:
   a   b    c
0  1  11  111
1  2  12  102
2  3  13   93
😄 Definitely, I just wanted to test the waters first, as I think this would be perceived as a significant API change
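For comparison, the same two columns can be produced with the callable form that `assign` already accepts. This is only a sketch of the equivalent lambda spelling, not code from the PR; it reuses the `df` and `s` from the session above and relies on ordinary index alignment when the computed Series is assigned back.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
s = pd.Series([90, 100, 110], index=[2, 1, 0])

# Each lambda receives the frame being built by assign().
result = df.assign(
    b=lambda d: d["a"] + 10,  # scalar arithmetic
    c=lambda d: d["a"] + s,   # aligned on index labels, not on position
)
print(result)
```

The values match the `pd.col` output shown in the session: `c` pairs label 0 with 110, label 1 with 100, and label 2 with 90.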
I don't see it as a "change", more like an addition to the API that makes it easier to use. The existing way of using
Is assign the main use case?
Currently it would only work in places that accept Getting it to work in
628a3b0 to b41b99d
I haven't seen any objections, so I'll work on adding docs + user guide + tests. If anyone intends to block this, then I'd appreciate it if you could speak out as soon as possible (also going to cc @mroeschke here in case you were against this)
I would be OK adding this API.
pandas/core/col.py
Outdated
if not isinstance(col_name, Hashable):
    msg = f"Expected Hashable, got: {type(col_name)}"
    raise TypeError(msg)
return Expr(lambda df: df[col_name], f"col({col_name!r})")
What's the traceback like when col_name doesn't exist in df? e.g.
df.drop("a", axis=1).assign(a_other=pd.col("a") + 1)
good one, thanks, I've added a proper error message
In [2]: df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
In [3]: df.assign(c=pd.col('name').mean())
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[3], line 1
----> 1 df.assign(c=pd.col('name').mean())
File ~/pandas-dev/pandas/core/frame.py:5328, in DataFrame.assign(self, **kwargs)
5325 data = self.copy(deep=False)
5327 for k, v in kwargs.items():
-> 5328 data[k] = com.apply_if_callable(v, data)
5329 return data
File ~/pandas-dev/pandas/core/common.py:384, in apply_if_callable(maybe_callable, obj, **kwargs)
373 """
374 Evaluate possibly callable input using obj and kwargs if it is callable,
375 otherwise return as it is.
(...) 381 **kwargs
382 """
383 if callable(maybe_callable):
--> 384 return maybe_callable(obj, **kwargs)
386 return maybe_callable
File ~/pandas-dev/pandas/core/col.py:67, in Expr.__call__(self, df)
66 def __call__(self, df: DataFrame) -> Any:
---> 67 return self._func(df)
File ~/pandas-dev/pandas/core/col.py:166, in Expr.__getattr__.<locals>.wrapper.<locals>.<lambda>(df)
163 args_str = ", ".join(all_args)
164 repr_str = f"{self._repr_str}.{attr}({args_str})"
--> 166 return Expr(lambda df: func(df, *args, **kwargs), repr_str)
File ~/pandas-dev/pandas/core/col.py:145, in Expr.__getattr__.<locals>.func(df, *args, **kwargs)
143 parsed_args = _parse_args(df, *args)
144 parsed_kwargs = _parse_kwargs(df, **kwargs)
--> 145 return getattr(self(df), attr)(*parsed_args, **parsed_kwargs)
File ~/pandas-dev/pandas/core/col.py:67, in Expr.__call__(self, df)
66 def __call__(self, df: DataFrame) -> Any:
---> 67 return self._func(df)
File ~/pandas-dev/pandas/core/col.py:290, in col.<locals>.func(df)
285 if col_name not in df.columns:
286 msg = (
287 f"Column '{col_name}' not found in given DataFrame.\n\n"
288 f"Hint: did you mean one of {df.columns.tolist()} instead?"
289 )
--> 290 raise ValueError(msg)
291 return df[col_name]
ValueError: Column 'name' not found in given DataFrame.
Hint: did you mean one of ['a', 'b'] instead?
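To make the mechanism in the traceback easier to follow, here is a minimal sketch of a deferred column expression with the same kind of missing-column check. `ToyExpr` and `toy_col` are hypothetical names invented for this illustration and only support `+`; the PR's actual `Expr` in pandas/core/col.py is more general (the traceback shows it forwarding method calls through `__getattr__`).

```python
from __future__ import annotations

from collections.abc import Callable, Hashable
from typing import Any

import pandas as pd


class ToyExpr:
    """Hypothetical stand-in for the PR's Expr: wraps a function of a DataFrame."""

    def __init__(self, func: Callable[[pd.DataFrame], Any], repr_str: str) -> None:
        self._func = func
        self._repr_str = repr_str

    def __call__(self, df: pd.DataFrame) -> Any:
        # Evaluation is deferred until a DataFrame is supplied.
        return self._func(df)

    def __add__(self, other: Any) -> ToyExpr:
        def func(df: pd.DataFrame) -> Any:
            # Evaluate a nested expression against the same frame first.
            rhs = other(df) if isinstance(other, ToyExpr) else other
            return self(df) + rhs

        return ToyExpr(func, f"({self._repr_str} + {other!r})")

    def __repr__(self) -> str:
        return self._repr_str


def toy_col(col_name: Hashable) -> ToyExpr:
    """Hypothetical analogue of pd.col for illustration only."""
    if not isinstance(col_name, Hashable):
        raise TypeError(f"Expected Hashable, got: {type(col_name)}")

    def func(df: pd.DataFrame) -> Any:
        if col_name not in df.columns:
            msg = (
                f"Column '{col_name}' not found in given DataFrame.\n\n"
                f"Hint: did you mean one of {df.columns.tolist()} instead?"
            )
            raise ValueError(msg)
        return df[col_name]

    return ToyExpr(func, f"col({col_name!r})")


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
expr = toy_col("a") + 10
result = df.assign(c=expr)  # assign() calls expr(df) because expr is callable

try:
    df.assign(c=toy_col("name") + 1)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Because the object is callable, `DataFrame.assign` treats it like any other callable, which is why the lookup failure surfaces from inside `apply_if_callable` in the traceback above.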
Just a few questions - and I caught one typo in the docs
pandas/core/col.py
Outdated
class Expr:
    """
    Class representing a deferred column.

    This is not meant to be instantiated directly. Instead, use :meth:`pandas.col`.
    """
Maybe make this a private class, i.e., class _Expr?
Thanks for your review!
I thought about that, although the return type of col is Expr, and so if anyone wanted to check if an object is an expression or annotate a parameter to a function, then they'd need this in the public API
Then I think it should go in pandas.api.typing. That's what we have done with things like DataFrameGroupBy
sure, thanks - done ✅
xref @jbrockmendel's comment #56499 (comment)
I'd also discussed this with @phofl, @WillAyd, and @jorisvandenbossche (who originally showed us something like this in Basel at EuroSciPy 2023)
Demo:
Output:
Repr demo:
What's here should be enough for it to be usable. For the type hints to show up correctly, extra work should be done in pandas-stubs. But I think it should be possible to develop tooling to automate the Expr docs and types based on the Series ones (going to cc @Dr-Irv here too then).
As for the "col" name, that's what PySpark, Polars, Daft, and Datafusion use, so I think it'd make sense to follow the convention.
I'm opening as a request for comments. Would people want this API to be part of pandas?
One of my main motivations for introducing it is that it avoids common issues with scoping. For example, if you use assign to increment two columns' values by 10 and try to write
df.assign(**{col: lambda df: df[col] + 10 for col in ('a', 'b')})
then you'll be in for a big surprise
whereas with pd.col, you get what you were probably expecting:
Further advantages:
<function __main__.<lambda>(df)
Expected objections:
TODO:
Potential follow-ups (if there's interest):
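The scoping surprise described in the motivation above can be reproduced with plain lambdas. The sketch below is not taken from the PR: it uses a small made-up frame and shows the usual default-argument workaround rather than pd.col itself, so it runs on pandas versions that predate this change.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Late binding: each lambda looks up `col` only when it is called, by which
# time the comprehension has finished and `col` is 'b' for both of them.
surprising = df.assign(
    **{col: lambda df: df[col] + 10 for col in ("a", "b")}
)

# Workaround without pd.col: capture the current value as a default argument.
expected = df.assign(
    **{col: lambda df, col=col: df[col] + 10 for col in ("a", "b")}
)

print(surprising)  # both columns hold b + 10
print(expected)    # a + 10 and b + 10
```

An expression built from a column name captures that name when it is created, so it avoids this pitfall without the `col=col` idiom.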