-
-
Notifications
You must be signed in to change notification settings - Fork 18.7k
DOC: Improve documentation for DataFrame.__setitem__ and .loc assignment from Series #61804
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@WillAyd Any thought on this resolution? |
Thanks @niruta25 for the PR
I see that "align" is found 16 times when searching "Intro to data structures" section of the docs. This chapter is only preceded by "10 minutes to pandas" so i'm not sure that the linked issue which states "The current documentation is incomplete and vague about how Series alignment works in assignments." is correct that this fundamental paradigm of pandas is not covered in the documentation. I'm not a member of the documentation team so others may be more positive to these changes, but if I was to review this PR, I would prefer to see more discussion on the issue itself before proceeding to the PR stage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is OK; from a quick glance I didn't see anything in the preceding text that was as clear
doc/source/user_guide/indexing.rst
Outdated
#If you want positional assignment instead of index alignment: | ||
# Convert Series to array/list for positional assignment | ||
|
||
df['positional'] = s1.values # or s1.tolist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using .values
is typically discouraged, so I think should remove this
doc/source/user_guide/indexing.rst
Outdated
df['positional'] = s1.values # or s1.tolist() | ||
|
||
# Or reset the Series index to match DataFrame index | ||
df['reset_index'] = s1.reindex(df.index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit but I think naming the column reset_index
here is distracting; maybe just call s1_values
?
by index labels, not by position. This means: | ||
|
||
* Values from the Series are matched to DataFrame rows by index label | ||
* If a Series index label doesn't exist in the DataFrame index, it's ignored |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow the difference between this and the line directly following it with the distinction of ignored versus NaN - can you help me understand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Series index labels NOT in DataFrame → IGNORED
df = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'y', 'z'])
s = pd.Series([10, 20, 30, 40, 50], index=['x', 'y', 'a', 'b', 'z'])
df['B'] = s
df
A B
x 1 10
y 2 20
z 3 50
# Values for 'a' and 'b' are completely ignored!
- DataFrame index labels NOT in Series → NaN
df = pd.DataFrame({'A': [1, 2, 3, 4]}, index=['x', 'y', 'z', 'w'])
s = pd.Series([100, 300], index=['x', 'z']) # Missing 'y' and 'w'
df['B'] = s
df
A B
x 1 100.0
y 2 NaN
z 3 300.0
w 4 NaN
# Rows 'y' and 'w' get NaN because they're missing from Series
Added the ignored example to documentation.
pandas/core/indexing.py
Outdated
**Assignment with Series** | ||
|
||
When assigning a Series to .loc[row_indexer, col_indexer], pandas aligns | ||
the Series by index labels, not by order or position. This is consistent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can remove the line This is consistent with pandas' general alignment behavior.
pandas/tests/frame/test_api.py
Outdated
@@ -378,3 +378,22 @@ def test_inspect_getmembers(self): | |||
# GH38740 | |||
df = DataFrame() | |||
inspect.getmembers(df) | |||
|
|||
def test_setitem_series_alignment_documentation(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely appreciate you adding tests, but since this is a documentation change you shouldn't need to add anything here. Feel free to remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok! Will generate another PR to add these tests in. Would be good to have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#61822 for follow up for adding tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was there an original discussion on tests being required? From a glance at them I would think our existing test base already covers those use cases, but maybe I am overlooking something
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.The core problem is that when assigning a Series, pandas aligns on index and values in the Series that don't match an index label will result in NaN DOC: Improve documentation for DataFrame.setitem and .loc assignment from Series · Issue #61662 · pandas-dev/pandas, but this behavior is poorly documented.
My proposed solution addresses the issue comprehensively by:
The fix emphasizes that pandas performs index-based alignment rather than positional assignment, which is the source of confusion for many users. The documentation will now clearly explain that when you assign a Series to a DataFrame column, pandas matches values by index labels, not by position, and missing labels result in NaN values.
This solution follows pandas' documentation conventions and provides both reference documentation and practical examples that will help users understand and correctly use this important feature.