Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise NotImplementedError for groupby.agg if duplicate columns would be created #17956

Merged
merged 3 commits into from
Feb 10, 2025

Conversation

mroeschke
Copy link
Contributor

Description

xref #17649

For cudf.pandas, we will dispatch to pandas instead of silently dropping the duplicate column

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added bug Something isn't working Python Affects Python cuDF API. non-breaking Non-breaking change labels Feb 7, 2025
@mroeschke mroeschke self-assigned this Feb 7, 2025
@mroeschke mroeschke requested a review from a team as a code owner February 7, 2025 23:17
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm approving, deferring to you on to add a warning in the non-pandas_compatible case as well.

Do we need to add similar logic for duplicate columns in a dataframe itself? i.e. to prevent something like

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df.rename({'b': 'a'}, axis=1)

or any other way that duplicate names could manifest? That could be done in another PR of course.

for values in aggs.values()
):
# In non pandas_compatible mode, we would just drop the duplicate aggregation.
# Should we issue a UserWarning?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you asking if we should start issuing a warning in non-pandas_compatible mode, i.e. in an else clause? I would support that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, exactly. Thanks added a UserWarning in non-pandas_compatible mode.

@mroeschke
Copy link
Contributor Author

mroeschke commented Feb 8, 2025

any other way that duplicate names could manifest?

Yeah we have some checks like this in ColumnAccessor already that would check if duplicate columns would be created (e.g. your example raises in cudf).

The harder-to-catch cases are when we're preprocessing column-related operations with mappings (dicts) first which won't alert us if duplicates are introduced.

@mroeschke
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 1643e0a into rapidsai:branch-25.04 Feb 10, 2025
108 checks passed
@mroeschke mroeschke deleted the bug/gb_agg/dup_column branch February 10, 2025 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants