ENH: Introduce pandas.col #62103


Open
wants to merge 21 commits into main

Conversation

@MarcoGorelli (Member) commented Aug 13, 2025

xref @jbrockmendel's comment in #56499 (comment)

I'd also discussed this with @phofl, @WillAyd, and @jorisvandenbossche (who originally showed us something like this in Basel at EuroSciPy 2023).

Demo:

import pandas as pd
from datetime import datetime

df = pd.DataFrame(
    {
        "a": [1, -2, 3],
        "b": [4, 5, 6],
        "c": [datetime(2020, 1, 1), datetime(2025, 4, 2), datetime(2026, 12, 3)],
        "d": ["fox", "beluga", "narwhal"],
    }
)

result = df.assign(
    # The usual Series methods are supported
    a_abs=pd.col("a").abs(),
    # And can be combined
    a_centered=pd.col("a") - pd.col("a").mean(),
    a_plus_b=pd.col("a") + pd.col("b"),
    # Namespaces are supported too
    c_year=pd.col("c").dt.year,
    c_month_name=pd.col("c").dt.strftime("%B"),
    d_upper=pd.col("d").str.upper(),
).loc[pd.col("a_abs") > 1]  # This works in `loc` too

print(result)

Output:

   a  b          c        d  a_abs  a_centered  a_plus_b  c_year c_month_name  d_upper
1 -2  5 2025-04-02   beluga      2   -2.666667         3    2025        April   BELUGA
2  3  6 2026-12-03  narwhal      3    2.333333         9    2026     December  NARWHAL

Repr demo:

In [4]: pd.col('value')
Out[4]: col('value')

In [5]: pd.col('value') * pd.col('weight')
Out[5]: (col('value') * col('weight'))

In [6]: (pd.col('value') - pd.col('value').mean()) / pd.col('value').std()
Out[6]: ((col('value') - col('value').mean()) / col('value').std())

In [7]: pd.col('timestamp').dt.strftime('%B')
Out[7]: col('timestamp').dt.strftime('%B')

What's here should be enough for it to be usable. For the type hints to show up correctly, extra work will be needed in pandas-stubs, but I think it should be possible to develop tooling to automate the Expr docs and types based on the Series ones (cc @Dr-Irv here too).
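
Purely as an illustration of the kind of tooling I have in mind (hypothetical, not part of this PR), a rough sketch could enumerate the public Series methods and emit one stub line per method, with the return annotation rewritten to Expr afterwards:

import inspect
import pandas as pd

# Hypothetical sketch only: list the public, callable Series attributes and
# print a stub line for each; a real tool would also rewrite Series return
# types and parameter docs to their Expr equivalents.
for name, member in inspect.getmembers(pd.Series, callable):
    if name.startswith("_"):
        continue
    try:
        sig = inspect.signature(member)
    except (TypeError, ValueError):
        continue
    # Drop the original return annotation so Expr can be appended uniformly.
    sig = sig.replace(return_annotation=inspect.Signature.empty)
    print(f"def {name}{sig} -> Expr: ...")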

As for the "col" name: that's what PySpark, Polars, Daft, and DataFusion use, so I think it makes sense to follow the convention.


I'm opening as a request for comments. Would people want this API to be part of pandas?

One of my main motivations for introducing it is that it avoids common scoping pitfalls. For example, if you use assign to increment two columns' values by 10 and write df.assign(**{col: lambda df: df[col] + 10 for col in ('a', 'b')}), you'll be in for a surprise:

In [19]: df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})

In [20]: df.assign(**{col: lambda df: df[col] + 10 for col in ('a', 'b')})
Out[20]:
    a   b
0  14  14
1  15  15
2  16  16

whereas with pd.col, you get what you were probably expecting:

In [4]: df.assign(**{col: pd.col(col) + 10 for col in ('a', 'b')})
Out[4]: 
    a   b
0  11  14
1  12  15
2  13  16
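
The surprise comes from Python's late-binding closures: both lambdas look up col when they are called, not when they are defined, so each one sees the final loop value 'b'. For comparison, a sketch of the conventional default-argument workaround (which pd.col makes unnecessary):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Binding the loop variable eagerly via a default argument restores the
# intended behaviour, at the cost of an extra, easy-to-forget parameter.
fixed = df.assign(**{col: lambda df, col=col: df[col] + 10 for col in ('a', 'b')})
print(fixed)
#     a   b
# 0  11  14
# 1  12  15
# 2  13  16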

Further advantages:

  • expressions are introspectable, so the repr can be made to look nice, whereas an anonymous lambda will always render as something like <function __main__.<lambda>(df)>
  • the syntax looks cleaner and is more aligned with other modern dataframe tools

Expected objections:

  • this expands the pandas API even further. I don't disagree, but I think this is a common and longstanding enough request that it's worth it

TODO:

  • tests, API docs, user guide. But first, I just wanted to get a feel for people's thoughts, and to see if anyone's opposed to it

Potential follow-ups (if there's interest):

  • serialise / deserialise expressions

@MarcoGorelli changed the title from "ENH: Introduce pandas.col" to "RFC: Introduce pandas.col" on Aug 13, 2025
@Dr-Irv (Contributor) commented Aug 13, 2025

For the type hints to show up correctly, extra work will be needed in pandas-stubs, but I think it should be possible to develop tooling to automate the Expr docs and types based on the Series ones (cc @Dr-Irv here too)

When this is added, and then released, pandas-stubs can be updated with proper stubs.

One comment is that I'm not sure it will support some basic arithmetic, such as:

result = df.assign(addcon=pd.col("a") + 10)

Or alignment with other series:

b = df["b"]  # or this could be from a different DF
result = df.assign(add2=pd.col("a") + b)

Also, don't you need to add some tests??

@MarcoGorelli (Member, Author)

Thanks for taking a look!

One comment is that I'm not sure it will support some basic arithmetic [...] Or alignment with other series:

Yup, they're both supported:

In [8]: df = pd.DataFrame({'a': [1,2,3]})

In [9]: s = pd.Series([90,100,110], index=[2,1,0])

In [10]: df.assign(
    ...:     b=pd.col('a')+10,
    ...:     c=pd.col('a')+s,
    ...: )
Out[10]: 
   a   b    c
0  1  11  111
1  2  12  102
2  3  13   93

Also, don't you need to add some tests??

😄 Definitely, I just wanted to test the waters first, as I think this would be perceived as a significant API change

@Dr-Irv (Contributor) commented Aug 13, 2025

Definitely, I just wanted to test the waters first, as I think this would be perceived as a significant API change

I don't see it as a "change", more like an addition to the API that makes it easier to use. The existing way of using df.assign(foo=lambda df: df["a"] + df["b"]) would still work, but df.assign(foo=pd.col("a") + pd.col("b")) is cleaner.

@jbrockmendel (Member)

Is assign the main use case?

@MarcoGorelli (Member, Author)

Currently it would only work in places that accept DataFrame -> Series callables, which, as far as I know, means only DataFrame.assign and boolean filtering with DataFrame.loc.

Getting it to work in GroupBy.agg is more complex, but it is possible, albeit with some restrictions.
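
To make that concrete, a rough sketch against this branch; note that an expression is itself a DataFrame -> Series callable, so evaluating it by hand also works:

import pandas as pd

df = pd.DataFrame({"a": [1, -2, 3], "b": [4, 5, 6]})

expr = pd.col("a").abs() > 1

# assign and boolean .loc already accept DataFrame -> Series callables,
# so an expression drops straight in:
print(df.assign(is_big=expr))
print(df.loc[expr])

# Calling the expression evaluates it against the frame (Expr.__call__ simply
# applies the wrapped function, as the traceback further down shows):
print(expr(df))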

@MarcoGorelli marked this pull request as ready for review on August 14, 2025 at 10:09
@MarcoGorelli (Member, Author) commented Aug 15, 2025

I haven't seen any objections, so I'll work on adding docs + user guide + tests

If anyone intends to block this, I'd appreciate it if you could speak up as soon as possible (also cc'ing @mroeschke here in case you are against this).

@mroeschke (Member)

I would be OK adding this API.

if not isinstance(col_name, Hashable):
    msg = f"Expected Hashable, got: {type(col_name)}"
    raise TypeError(msg)
return Expr(lambda df: df[col_name], f"col({col_name!r})")
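
For reference, a quick sketch of what the guard above buys: a non-hashable argument fails immediately rather than at evaluation time.

import pandas as pd

# A list is not Hashable, so this raises straight away:
pd.col(["a", "b"])
# TypeError: Expected Hashable, got: <class 'list'>
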
Member:

What's the traceback like when col_name doesn't exist in df? e.g.

df.drop("a", axis=1).assign(a_other=pd.col("a") + 1)

Member Author:

Good one, thanks - I've added a proper error message:

In [2]: df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})

In [3]: df.assign(c=pd.col('name').mean())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 1
----> 1 df.assign(c=pd.col('name').mean())

File ~/pandas-dev/pandas/core/frame.py:5328, in DataFrame.assign(self, **kwargs)
   5325 data = self.copy(deep=False)
   5327 for k, v in kwargs.items():
-> 5328     data[k] = com.apply_if_callable(v, data)
   5329 return data

File ~/pandas-dev/pandas/core/common.py:384, in apply_if_callable(maybe_callable, obj, **kwargs)
    373 """
    374 Evaluate possibly callable input using obj and kwargs if it is callable,
    375 otherwise return as it is.
   (...)    381 **kwargs
    382 """
    383 if callable(maybe_callable):
--> 384     return maybe_callable(obj, **kwargs)
    386 return maybe_callable

File ~/pandas-dev/pandas/core/col.py:67, in Expr.__call__(self, df)
     66 def __call__(self, df: DataFrame) -> Any:
---> 67     return self._func(df)

File ~/pandas-dev/pandas/core/col.py:166, in Expr.__getattr__.<locals>.wrapper.<locals>.<lambda>(df)
    163 args_str = ", ".join(all_args)
    164 repr_str = f"{self._repr_str}.{attr}({args_str})"
--> 166 return Expr(lambda df: func(df, *args, **kwargs), repr_str)

File ~/pandas-dev/pandas/core/col.py:145, in Expr.__getattr__.<locals>.func(df, *args, **kwargs)
    143 parsed_args = _parse_args(df, *args)
    144 parsed_kwargs = _parse_kwargs(df, **kwargs)
--> 145 return getattr(self(df), attr)(*parsed_args, **parsed_kwargs)

File ~/pandas-dev/pandas/core/col.py:67, in Expr.__call__(self, df)
     66 def __call__(self, df: DataFrame) -> Any:
---> 67     return self._func(df)

File ~/pandas-dev/pandas/core/col.py:290, in col.<locals>.func(df)
    285 if col_name not in df.columns:
    286     msg = (
    287         f"Column '{col_name}' not found in given DataFrame.\n\n"
    288         f"Hint: did you mean one of {df.columns.tolist()} instead?"
    289     )
--> 290     raise ValueError(msg)
    291 return df[col_name]

ValueError: Column 'name' not found in given DataFrame.

Hint: did you mean one of ['a', 'b'] instead?

@Dr-Irv (Contributor) left a comment

Just a few questions - and I caught one typo in the docs

Comment on lines 53 to 58
class Expr:
    """
    Class representing a deferred column.

    This is not meant to be instantiated directly. Instead, use :meth:`pandas.col`.
    """
Contributor:

Maybe make this a private class, i.e., class _Expr?

@MarcoGorelli (Member, Author) commented Aug 17, 2025

Thanks for your review!

I thought about that, but the return type of col is Expr, so if anyone wanted to check whether an object is an expression, or to annotate a function parameter, they'd need the class in the public API.

Contributor:

Then I think it should go in pandas.api.typing. That's what we've done with things like DataFrameGroupBy.

Member Author:

Sure, thanks - done ✅
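
For anyone wanting to annotate helper functions, a minimal sketch of what that enables, assuming the class is exposed as pandas.api.typing.Expr as discussed above:

import pandas as pd
from pandas.api.typing import Expr  # assumed location per the discussion above

def zscore(column: str) -> Expr:
    # Build a reusable deferred expression; nothing is evaluated until the
    # expression is handed to DataFrame.assign or .loc.
    expr = pd.col(column)
    return (expr - expr.mean()) / expr.std()

df = pd.DataFrame({"value": [1.0, 2.0, 3.0]})
print(df.assign(value_z=zscore("value")))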

@MarcoGorelli changed the title from "RFC: Introduce pandas.col" to "ENH: Introduce pandas.col" on Aug 17, 2025