(feat): Support for pandas ExtensionArray (#8723)
Changes from 63 commits
```diff
@@ -129,6 +129,7 @@ module = [
     "opt_einsum.*",
     "pandas.*",
     "pooch.*",
+    "pyarrow.*",
     "pydap.*",
     "pytest.*",
     "scipy.*",
```
```diff
@@ -24,6 +24,7 @@
 from typing import IO, TYPE_CHECKING, Any, Callable, Generic, Literal, cast, overload

 import numpy as np
+from pandas.api.types import is_extension_array_dtype

 # remove once numpy 2.0 is the oldest supported version
 try:
```
```diff
@@ -6835,10 +6836,13 @@ def reduce(
         if (
             # Some reduction functions (e.g. std, var) need to run on variables
             # that don't have the reduce dims: PR5393
-            not reduce_dims
-            or not numeric_only
-            or np.issubdtype(var.dtype, np.number)
-            or (var.dtype == np.bool_)
+            (
+                not reduce_dims
+                or not numeric_only
+                or np.issubdtype(var.dtype, np.number)
+                or (var.dtype == np.bool_)
+            )
+            and not is_extension_array_dtype(var.dtype)
         ):
             # prefer to aggregate over axis=None rather than
             # axis=(0, 1) if they will be equivalent, because
```
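The effect of this gate can be exercised in isolation. The sketch below uses a hypothetical `participates_in_reduction` helper (not part of the PR) that mirrors the condition; it checks `is_extension_array_dtype` first so that `np.issubdtype` never receives a pandas dtype:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_extension_array_dtype


def participates_in_reduction(dtype, reduce_dims=True, numeric_only=True):
    # Hypothetical helper mirroring the gate in Dataset.reduce():
    # extension-array-backed variables never participate in a reduction.
    if is_extension_array_dtype(dtype):
        return False
    # Otherwise, numeric/bool variables (or any variable, when reduce_dims
    # or numeric_only relax the check) are included.
    return bool(
        not reduce_dims
        or not numeric_only
        or np.issubdtype(dtype, np.number)
        or dtype == np.bool_
    )


print(participates_in_reduction(np.dtype("float64")))   # numeric: included
print(participates_in_reduction(pd.CategoricalDtype())) # extension array: excluded
```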
```diff
@@ -7151,13 +7155,37 @@ def to_pandas(self) -> pd.Series | pd.DataFrame:
         )

     def _to_dataframe(self, ordered_dims: Mapping[Any, int]):
-        columns = [k for k in self.variables if k not in self.dims]
+        columns_in_order = [k for k in self.variables if k not in self.dims]
+        non_extension_array_columns = [
+            k
+            for k in columns_in_order
+            if not is_extension_array_dtype(self.variables[k].data)
+        ]
+        extension_array_columns = [
+            k
+            for k in columns_in_order
+            if is_extension_array_dtype(self.variables[k].data)
+        ]
         data = [
             self._variables[k].set_dims(ordered_dims).values.reshape(-1)
-            for k in columns
+            for k in non_extension_array_columns
         ]
         index = self.coords.to_index([*ordered_dims])
-        return pd.DataFrame(dict(zip(columns, data)), index=index)
+        broadcasted_df = pd.DataFrame(
+            dict(zip(non_extension_array_columns, data)), index=index
+        )
+        for extension_array_column in extension_array_columns:
+            extension_array = self.variables[extension_array_column].data.array
+            index = self[self.variables[extension_array_column].dims[0]].data
+            extension_array_df = pd.DataFrame(
+                {extension_array_column: extension_array},
+                index=self[self.variables[extension_array_column].dims[0]].data,
+            )
+            extension_array_df.index.name = self.variables[extension_array_column].dims[
+                0
+            ]
+            broadcasted_df = broadcasted_df.join(extension_array_df)
+        return broadcasted_df[columns_in_order]

     def to_dataframe(self, dim_order: Sequence[Hashable] | None = None) -> pd.DataFrame:
         """Convert this dataset into a pandas.DataFrame.
```

Review thread on lines +7194 to +7204 (some comments were truncated in extraction; gaps marked […]):

- Calling […]
- pandas-dev/pandas#57676. Not sure what to do. I don't think […]
- Also not sure […]
- It'd be good to sort this out.
- @shoyer Could you maybe give some details on using […]
- Let's open an issue to remind ourselves to make this more efficient. I guess the core problem is that extension arrays cannot be broadcast to nD with […]
- #8950 done!
- I think this is true.
- I think this currently handles the case where this is >1, so why error out? I think […]
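The join-based strategy in `_to_dataframe` can be sketched at the pandas level: plain columns are broadcast into one frame, and each extension-array column is kept in its own 1-D frame and attached via `DataFrame.join` on the shared index, so the extension dtype survives. The variable names below are illustrative, not from the PR:

```python
import numpy as np
import pandas as pd

# Stand-in for the broadcast result over the "x" dimension.
index = pd.Index(["a", "b", "c"], name="x")
broadcasted_df = pd.DataFrame({"temp": np.array([1.0, 2.0, 3.0])}, index=index)

# Extension-array column kept in its own frame, indexed by its one
# dimension, then joined on rather than reshaped through NumPy.
cat = pd.Categorical(["low", "high", "low"])
extension_array_df = pd.DataFrame({"label": cat}, index=index)
extension_array_df.index.name = "x"

result = broadcasted_df.join(extension_array_df)

# The categorical dtype is preserved through the join.
print(result["label"].dtype)  # category
```

Reordering at the end (`broadcasted_df[columns_in_order]` in the hunk) then restores the original column order.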
```diff
@@ -7303,11 +7331,14 @@ def from_dataframe(cls, dataframe: pd.DataFrame, sparse: bool = False) -> Self:
                 "cannot convert a DataFrame with a non-unique MultiIndex into xarray"
             )

-        # Cast to a NumPy array first, in case the Series is a pandas Extension
-        # array (which doesn't have a valid NumPy dtype)
-        # TODO: allow users to control how this casting happens, e.g., by
-        # forwarding arguments to pandas.Series.to_numpy?
-        arrays = [(k, np.asarray(v)) for k, v in dataframe.items()]
+        arrays = [
+            (k, np.asarray(v))
+            for k, v in dataframe.items()
+            if not is_extension_array_dtype(v)
+        ]
+        extension_arrays = [
+            (k, v) for k, v in dataframe.items() if is_extension_array_dtype(v)
+        ]

         indexes: dict[Hashable, Index] = {}
         index_vars: dict[Hashable, Variable] = {}
```
```diff
@@ -7321,6 +7352,8 @@ def from_dataframe(cls, dataframe: pd.DataFrame, sparse: bool = False) -> Self:
                 xr_idx = PandasIndex(lev, dim)
                 indexes[dim] = xr_idx
                 index_vars.update(xr_idx.create_variables())
+            arrays += [(k, np.asarray(v)) for k, v in extension_arrays]
+            extension_arrays = []
         else:
             index_name = idx.name if idx.name is not None else "index"
             dims = (index_name,)
```
```diff
@@ -7334,7 +7367,9 @@ def from_dataframe(cls, dataframe: pd.DataFrame, sparse: bool = False) -> Self:
             obj._set_sparse_data_from_dataframe(idx, arrays, dims)
         else:
             obj._set_numpy_data_from_dataframe(idx, arrays, dims)
-        return obj
+        for name, extension_array in extension_arrays:
+            obj[name] = (dims, extension_array)
+        return obj[dataframe.columns] if len(dataframe.columns) else obj

     def to_dask_dataframe(
         self, dim_order: Sequence[Hashable] | None = None, set_index: bool = False
```
Review thread:

- What happens if there's an ExtensionArray data var in a dataset and you call `ds.mean()`? Are we silently dropping it or raising an error? I think we should raise a nice error and ask the user to drop it themselves.
- Ok, but this is different from the behavior with a numpy object array, isn't it?
- Good point. Thanks!
- I guess we'd want to be careful about allowing extension arrays with ints, floats, datetimes, etc., looking at the list here: https://pandas.pydata.org/docs/reference/api/pandas.array.html — but this can be a followup.
- Agreed. For now I will implement the drop rather than an error, then.
- I think this is fine for now?
- @dcherian implemented, but is there a test case for dropping non-numerics I can add this to? I couldn't find one.
- xarray/xarray/tests/test_dataset.py, line 5459 in cf36559
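The drop behavior the thread settles on matches what pandas itself does under `numeric_only=True`: non-numeric columns are silently excluded from the reduction rather than raising. A pandas-level illustration (the data is made up for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "value": np.array([1.0, 2.0, 3.0]),
        "label": pd.Categorical(["x", "y", "x"]),  # extension-array column
    }
)

# With numeric_only=True, pandas drops the categorical column silently,
# which is the precedent for Dataset.mean() dropping extension-array
# variables instead of erroring.
means = df.mean(numeric_only=True)
print(list(means.index))  # ['value']
print(means["value"])     # 2.0
```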