Raise NotImplementedError for groupby.agg if duplicate columns would be created #17956

mroeschke · 2025-02-07T23:17:18Z

Description

xref #17649

For cudf.pandas, we will dispatch to pandas instead of silently dropping the duplicate column
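
A minimal sketch of the fallback pattern this relies on, assuming the accelerated path signals "unsupported" by raising and the caller then retries on the CPU. The names `fast_agg`, `slow_agg`, and `dispatch` are hypothetical stand-ins and not cudf or cudf.pandas APIs, and the sketch assumes each column maps to a list of aggregation names.

```python
def fast_agg(aggs):
    """Hypothetical accelerated path: refuses requests it cannot represent."""
    for column, funcs in aggs.items():
        if len(funcs) != len(set(funcs)):
            # Raising (rather than dropping the duplicate) lets the caller react.
            raise NotImplementedError(
                f"duplicate aggregations requested for column {column!r}"
            )
    return ("fast", aggs)


def slow_agg(aggs):
    """Hypothetical CPU path: stands in for handing the request to pandas."""
    return ("slow", aggs)


def dispatch(aggs):
    """Try the accelerated path first and fall back if it raises."""
    try:
        return fast_agg(aggs)
    except NotImplementedError:
        return slow_agg(aggs)


backend, _ = dispatch({"b": ["sum", "sum"]})
unique_backend, _ = dispatch({"b": ["sum", "min"]})
```

Here `backend` ends up as `"slow"` because the duplicate request is rejected by the accelerated path, while the request without duplicates stays on the fast path.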

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

vyasr

I'm approving, and deferring to you on whether to add a warning in the non-pandas_compatible case as well.

Do we need to add similar logic for duplicate columns in a dataframe itself? i.e. to prevent something like

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df.rename({'b': 'a'}, axis=1)

or any other way that duplicate names could manifest? That could be done in another PR of course.

vyasr · 2025-02-08T01:37:07Z

python/cudf/cudf/core/groupby/groupby.py

+                    for values in aggs.values()
+                ):
+                    # In non pandas_compatible mode, we would just drop the duplicate aggregation.
+                    # Should we issue a UserWarning?


Are you asking if we should start issuing a warning in non-pandas_compatible mode, i.e. in an else clause? I would support that.

Yup, exactly. Thanks, I added a UserWarning in non-pandas_compatible mode.
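
The behavior agreed on in this thread could be sketched roughly as follows. This is not the code from the PR: `check_duplicate_aggs` and its `pandas_compatible` argument are hypothetical, the messages are made up, and only list or tuple values are inspected.

```python
import warnings


def check_duplicate_aggs(aggs, pandas_compatible):
    """Hypothetical helper: react to columns whose aggregation list repeats an entry."""
    has_duplicates = any(
        len(values) != len(set(values))
        for values in aggs.values()
        if isinstance(values, (list, tuple))
    )
    if not has_duplicates:
        return
    if pandas_compatible:
        # Raising gives a caller the chance to fall back to pandas.
        raise NotImplementedError(
            "Duplicate aggregations per column are not supported."
        )
    # Otherwise keep going, but tell the user a column will be dropped.
    warnings.warn(
        "Duplicate aggregations per column found; duplicates will be dropped.",
        UserWarning,
    )
```

With `pandas_compatible=True` the duplicate request raises; with `pandas_compatible=False` it only emits a `UserWarning`.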


mroeschke · 2025-02-08T19:48:03Z

any other way that duplicate names could manifest?

Yeah, ColumnAccessor already has checks that raise if duplicate columns would be created (e.g. your example raises in cudf).

The harder-to-catch cases are when we first preprocess column-related operations with mappings (dicts), which won't alert us if duplicates are introduced.
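
A plain-Python illustration of why dict preprocessing hides the problem. No cudf internals are used here; `names`, `data`, and `rename` are made-up example values.

```python
names = ["a", "b", "c"]
data = {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]}
rename = {"b": "a"}

# Building the renamed mapping key by key: the second "a" silently
# overwrites the first, and no error is raised.
renamed = {rename.get(name, name): data[name] for name in names}

# Checking the new labels before building the dict exposes the collision.
new_names = [rename.get(name, name) for name in names]
has_duplicates = len(new_names) != len(set(new_names))
```

After this runs, `renamed` has only two keys and the original column `a` is gone, while `has_duplicates` is `True`.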

mroeschke · 2025-02-10T19:24:27Z

/merge

Raise NotImplementedError for if duplicate columns would be created

b905486

mroeschke added the bug, Python, and non-breaking labels Feb 7, 2025

mroeschke self-assigned this Feb 7, 2025

mroeschke requested a review from a team as a code owner February 7, 2025 23:17

mroeschke requested review from vyasr and brandon-b-miller February 7, 2025 23:17

vyasr approved these changes Feb 8, 2025


mroeschke added 2 commits February 8, 2025 11:32

Merge remote-tracking branch 'upstream/branch-25.04' into bug/gb_agg/…

4ebbf04

…dup_column

Add userwarning for dropping duplicate columns

4cc9b48

rapids-bot bot merged commit 1643e0a into rapidsai:branch-25.04 Feb 10, 2025
108 checks passed

mroeschke deleted the bug/gb_agg/dup_column branch February 10, 2025 19:24
