ENH: Introduce pandas.col #62103


Open
wants to merge 21 commits into main

Conversation

@MarcoGorelli (Member) commented Aug 13, 2025

xref @jbrockmendel's comment in #56499 (comment)

I'd also discussed this with @phofl, @WillAyd, and @jorisvandenbossche (who originally showed us something like this in Basel at EuroSciPy 2023).

Demo:

import pandas as pd
from datetime import datetime

df = pd.DataFrame(
    {
        "a": [1, -2, 3],
        "b": [4, 5, 6],
        "c": [datetime(2020, 1, 1), datetime(2025, 4, 2), datetime(2026, 12, 3)],
        "d": ["fox", "beluga", "narwhal"],
    }
)

result = df.assign(
    # The usual Series methods are supported
    a_abs=pd.col("a").abs(),
    # And can be combined
    a_centered=pd.col("a") - pd.col("a").mean(),
    a_plus_b=pd.col("a") + pd.col("b"),
    # Namespaces are supported too
    c_year=pd.col("c").dt.year,
    c_month_name=pd.col("c").dt.strftime("%B"),
    d_upper=pd.col("d").str.upper(),
).loc[pd.col("a_abs") > 1]  # This works in `loc` too

print(result)

Output:

   a  b          c        d  a_abs  a_centered  a_plus_b  c_year c_month_name  d_upper
1 -2  5 2025-04-02   beluga      2   -2.666667         3    2025        April   BELUGA
2  3  6 2026-12-03  narwhal      3    2.333333         9    2026     December  NARWHAL

Repr demo:

In [4]: pd.col('value')
Out[4]: col('value')

In [5]: pd.col('value') * pd.col('weight')
Out[5]: (col('value') * col('weight'))

In [6]: (pd.col('value') - pd.col('value').mean()) / pd.col('value').std()
Out[6]: ((col('value') - col('value').mean()) / col('value').std())

In [7]: pd.col('timestamp').dt.strftime('%B')
Out[7]: col('timestamp').dt.strftime('%B')

What's here should be enough for it to be usable. For the type hints to show up correctly, extra work will be needed in pandas-stubs, but I think it should be possible to develop tooling to automate the Expr docs and types based on the Series ones (cc @Dr-Irv here too).
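
Purely as an illustration of the kind of tooling I have in mind (hypothetical, not part of this PR), a rough sketch could enumerate the public Series methods and emit one stub line per method, with the return annotation rewritten to Expr afterwards:

import inspect
import pandas as pd

# Hypothetical sketch only: list the public, callable Series attributes and
# print a stub line for each; a real tool would also rewrite Series return
# types and parameter docs to their Expr equivalents.
for name, member in inspect.getmembers(pd.Series, callable):
    if name.startswith("_"):
        continue
    try:
        sig = inspect.signature(member)
    except (TypeError, ValueError):
        continue
    # Drop the original return annotation so Expr can be appended uniformly.
    sig = sig.replace(return_annotation=inspect.Signature.empty)
    print(f"def {name}{sig} -> Expr: ...")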

As for the "col" name: that's what PySpark, Polars, Daft, and DataFusion use, so I think it makes sense to follow the convention.


I'm opening as a request for comments. Would people want this API to be part of pandas?

One of my main motivations for introducing it is that it avoids common scoping pitfalls. For example, if you use assign to increment two columns' values by 10 and write df.assign(**{col: lambda df: df[col] + 10 for col in ('a', 'b')}), you'll be in for a surprise:

In [19]: df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})

In [20]: df.assign(**{col: lambda df: df[col] + 10 for col in ('a', 'b')})
Out[20]:
    a   b
0  14  14
1  15  15
2  16  16

whereas with pd.col, you get what you were probably expecting:

In [4]: df.assign(**{col: pd.col(col) + 10 for col in ('a', 'b')})
Out[4]: 
    a   b
0  11  14
1  12  15
2  13  16
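
The surprise comes from Python's late-binding closures: both lambdas look up col when they are called, not when they are defined, so each one sees the final loop value 'b'. For comparison, a sketch of the conventional default-argument workaround (which pd.col makes unnecessary):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Binding the loop variable eagerly via a default argument restores the
# intended behaviour, at the cost of an extra, easy-to-forget parameter.
fixed = df.assign(**{col: lambda df, col=col: df[col] + 10 for col in ('a', 'b')})
print(fixed)
#     a   b
# 0  11  14
# 1  12  15
# 2  13  16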

Further advantages:

  • expressions are introspectable, so the repr can be made to look nice, whereas an anonymous lambda will always render as something like <function __main__.<lambda>(df)>
  • the syntax looks cleaner and is more aligned with other modern dataframe tools

Expected objections:

  • this expands the pandas API even further. I don't disagree, but I think this is a common and longstanding enough request that it's worth it

TODO:

  • tests, API docs, user guide. But first, I just wanted to get a feel for people's thoughts, and to see if anyone's opposed to it

Potential follow-ups (if there's interest):

  • serialise / deserialise expressions

@MarcoGorelli changed the title from "ENH: Introduce pandas.col" to "RFC: Introduce pandas.col" on Aug 13, 2025
@Dr-Irv (Contributor) commented Aug 13, 2025

For the type hints to show up correctly, extra work will be needed in pandas-stubs, but I think it should be possible to develop tooling to automate the Expr docs and types based on the Series ones (cc @Dr-Irv here too)

When this is added, and then released, pandas-stubs can be updated with proper stubs.

One comment is that I'm not sure it will support some basic arithmetic, such as:

result = df.assign(addcon=pd.col("a") + 10)

Or alignment with other series:

b = df["b"]  # or this could be from a different DF
result = df.assign(add2=pd.col("a") + b)

Also, don't you need to add some tests??

@MarcoGorelli (Member, Author)

Thanks for taking a look!

One comment is that I'm not sure it will support some basic arithmetic [...] Or alignment with other series:

Yup, they're both supported:

In [8]: df = pd.DataFrame({'a': [1,2,3]})

In [9]: s = pd.Series([90,100,110], index=[2,1,0])

In [10]: df.assign(
    ...:     b=pd.col('a')+10,
    ...:     c=pd.col('a')+s,
    ...: )
Out[10]: 
   a   b    c
0  1  11  111
1  2  12  102
2  3  13   93

Also, don't you need to add some tests??

😄 Definitely, I just wanted to test the waters first, as I think this would be perceived as a significant API change

@Dr-Irv (Contributor) commented Aug 13, 2025

Definitely, I just wanted to test the waters first, as I think this would be perceived as a significant API change

I don't see it as a "change", more like an addition to the API that makes it easier to use. The existing way of using df.assign(foo=lambda df: df["a"] + df["b"]) would still work, but df.assign(foo=pd.col("a") + pd.col("b")) is cleaner.

@jbrockmendel (Member)

Is assign the main use case?

@MarcoGorelli (Member, Author)

Currently it would only work in places that accept DataFrame -> Series callables, which, as far as I know, means only DataFrame.assign and boolean filtering with DataFrame.loc.

Getting it to work in GroupBy.agg is more complex, but it is possible, albeit with some restrictions.
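
To make that concrete, a rough sketch against this branch; note that an expression is itself a DataFrame -> Series callable, so evaluating it by hand also works:

import pandas as pd

df = pd.DataFrame({"a": [1, -2, 3], "b": [4, 5, 6]})

expr = pd.col("a").abs() > 1

# assign and boolean .loc already accept DataFrame -> Series callables,
# so an expression drops straight in:
print(df.assign(is_big=expr))
print(df.loc[expr])

# Calling the expression evaluates it against the frame (Expr.__call__ simply
# applies the wrapped function, as the traceback further down shows):
print(expr(df))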

@MarcoGorelli marked this pull request as ready for review on August 14, 2025 at 10:09
@MarcoGorelli (Member, Author) commented Aug 15, 2025

I haven't seen any objections, so I'll work on adding docs + user guide + tests

If anyone intends to block this, I'd appreciate it if you could speak up as soon as possible (also cc'ing @mroeschke here in case you are against this).

@mroeschke (Member)

I would be OK adding this API.

if not isinstance(col_name, Hashable):
    msg = f"Expected Hashable, got: {type(col_name)}"
    raise TypeError(msg)
return Expr(lambda df: df[col_name], f"col({col_name!r})")
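
For reference, a quick sketch of what the guard above buys: a non-hashable argument fails immediately rather than at evaluation time.

import pandas as pd

# A list is not Hashable, so this raises straight away:
pd.col(["a", "b"])
# TypeError: Expected Hashable, got: <class 'list'>
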
Member:

What's the traceback like when col_name doesn't exist in df? e.g.

df.drop("a", axis=1).assign(a_other=pd.col("a") + 1)

Member Author:

Good one, thanks - I've added a proper error message:

In [2]: df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})

In [3]: df.assign(c=pd.col('name').mean())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 1
----> 1 df.assign(c=pd.col('name').mean())

File ~/pandas-dev/pandas/core/frame.py:5328, in DataFrame.assign(self, **kwargs)
   5325 data = self.copy(deep=False)
   5327 for k, v in kwargs.items():
-> 5328     data[k] = com.apply_if_callable(v, data)
   5329 return data

File ~/pandas-dev/pandas/core/common.py:384, in apply_if_callable(maybe_callable, obj, **kwargs)
    373 """
    374 Evaluate possibly callable input using obj and kwargs if it is callable,
    375 otherwise return as it is.
   (...)    381 **kwargs
    382 """
    383 if callable(maybe_callable):
--> 384     return maybe_callable(obj, **kwargs)
    386 return maybe_callable

File ~/pandas-dev/pandas/core/col.py:67, in Expr.__call__(self, df)
     66 def __call__(self, df: DataFrame) -> Any:
---> 67     return self._func(df)

File ~/pandas-dev/pandas/core/col.py:166, in Expr.__getattr__.<locals>.wrapper.<locals>.<lambda>(df)
    163 args_str = ", ".join(all_args)
    164 repr_str = f"{self._repr_str}.{attr}({args_str})"
--> 166 return Expr(lambda df: func(df, *args, **kwargs), repr_str)

File ~/pandas-dev/pandas/core/col.py:145, in Expr.__getattr__.<locals>.func(df, *args, **kwargs)
    143 parsed_args = _parse_args(df, *args)
    144 parsed_kwargs = _parse_kwargs(df, **kwargs)
--> 145 return getattr(self(df), attr)(*parsed_args, **parsed_kwargs)

File ~/pandas-dev/pandas/core/col.py:67, in Expr.__call__(self, df)
     66 def __call__(self, df: DataFrame) -> Any:
---> 67     return self._func(df)

File ~/pandas-dev/pandas/core/col.py:290, in col.<locals>.func(df)
    285 if col_name not in df.columns:
    286     msg = (
    287         f"Column '{col_name}' not found in given DataFrame.\n\n"
    288         f"Hint: did you mean one of {df.columns.tolist()} instead?"
    289     )
--> 290     raise ValueError(msg)
    291 return df[col_name]

ValueError: Column 'name' not found in given DataFrame.

Hint: did you mean one of ['a', 'b'] instead?

@Dr-Irv (Contributor) left a comment

Just a few questions - and I caught one typo in the docs

Comment on lines 53 to 58
class Expr:
    """
    Class representing a deferred column.

    This is not meant to be instantiated directly. Instead, use :meth:`pandas.col`.
    """
Contributor:

Maybe make this a private class, i.e., class _Expr?

@MarcoGorelli (Member, Author) commented Aug 17, 2025

Thanks for your review!

I thought about that, but the return type of col is Expr, so if anyone wanted to check whether an object is an expression, or to annotate a function parameter, they'd need the class in the public API.

Contributor:

Then I think it should go in pandas.api.typing. That's what we've done with things like DataFrameGroupBy.

Member Author:

Sure, thanks - done ✅
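
For anyone wanting to annotate helper functions, a minimal sketch of what that enables, assuming the class is exposed as pandas.api.typing.Expr as discussed above:

import pandas as pd
from pandas.api.typing import Expr  # assumed location per the discussion above

def zscore(column: str) -> Expr:
    # Build a reusable deferred expression; nothing is evaluated until the
    # expression is handed to DataFrame.assign or .loc.
    expr = pd.col(column)
    return (expr - expr.mean()) / expr.std()

df = pd.DataFrame({"value": [1.0, 2.0, 3.0]})
print(df.assign(value_z=zscore("value")))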

@MarcoGorelli changed the title from "RFC: Introduce pandas.col" to "ENH: Introduce pandas.col" on Aug 17, 2025