Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/TST: fix and test for timezone drop in GroupBy.shift/bfill/ffill #27992

Merged
merged 7 commits into from
Sep 9, 2019
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.25.1.rst
Original file line number Diff line number Diff line change
Expand Up @@ -120,6 +120,7 @@ Groupby/resample/rolling
- Bug in :meth:`pandas.core.groupby.DataFrameGroupBy.transform` where applying a timezone conversion lambda function would drop timezone information (:issue:`27496`)
- Bug in windowing over read-only arrays (:issue:`27766`)
- Fixed segfault in `pandas.core.groupby.DataFrameGroupBy.quantile` when an invalid quantile was passed (:issue:`27470`)
- Bug in :meth:`pandas.core.groupby.GroupBy.shift`, :meth:`pandas.core.groupby.GroupBy.bfill` and :meth:`pandas.core.groupby.GroupBy.ffill` where timezone information would be dropped (:issue:`xxxxx`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is not an existing issue for this bug, use the number of this PR.

-

Reshaping
Expand Down
12 changes: 7 additions & 5 deletions pandas/core/groupby/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -2220,26 +2220,28 @@ def _get_cythonized_result(
base_func = getattr(libgroupby, how)

for name, obj in self._iterate_slices():
values = obj._data._values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I'm not sure relying on underlying block values is the best way to go about this. Is it possible to just work with the obj here instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my lack of understanding about underlying block values.
Should I leave the code using obj.values except for the last result = algorithms.take_nd(obj.values, result)?


if aggregate:
result_sz = ngroups
else:
result_sz = len(obj.values)
result_sz = len(values)

if not cython_dtype:
cython_dtype = obj.values.dtype
cython_dtype = values.dtype

result = np.zeros(result_sz, dtype=cython_dtype)
func = partial(base_func, result, labels)
inferences = None

if needs_values:
vals = obj.values
vals = values
if pre_processing:
vals, inferences = pre_processing(vals)
func = partial(func, vals)

if needs_mask:
mask = isna(obj.values).view(np.uint8)
mask = isna(values).view(np.uint8)
func = partial(func, mask)

if needs_ngroups:
Expand All @@ -2248,7 +2250,7 @@ def _get_cythonized_result(
func(**kwargs) # Call func to modify indexer values in place

if result_is_index:
result = algorithms.take_nd(obj.values, result)
result = algorithms.take_nd(values, result)

if post_processing:
result = post_processing(result, inferences)
Expand Down
112 changes: 112 additions & 0 deletions pandas/tests/groupby/test_timezone.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,112 @@
import pytest

from pandas import DataFrame, Timestamp
from pandas.util.testing import assert_frame_equal


@pytest.mark.parametrize(
"data, expected_shift, expected_bfill, expected_ffill",
[
(
{
"id": ["A", "B", "A", "B", "A", "B"],
"time": [
Timestamp("2019-01-01 12:00:00"),
Timestamp("2019-01-01 12:30:00"),
None,
None,
Timestamp("2019-01-01 14:00:00"),
Timestamp("2019-01-01 14:30:00"),
],
},
{
"time": [
None,
None,
Timestamp("2019-01-01 12:00:00"),
Timestamp("2019-01-01 12:30:00"),
None,
None,
]
},
{
"time": [
Timestamp("2019-01-01 12:00:00"),
Timestamp("2019-01-01 12:30:00"),
Timestamp("2019-01-01 14:00:00"),
Timestamp("2019-01-01 14:30:00"),
Timestamp("2019-01-01 14:00:00"),
Timestamp("2019-01-01 14:30:00"),
]
},
{
"time": [
Timestamp("2019-01-01 12:00:00"),
Timestamp("2019-01-01 12:30:00"),
Timestamp("2019-01-01 12:00:00"),
Timestamp("2019-01-01 12:30:00"),
Timestamp("2019-01-01 14:00:00"),
Timestamp("2019-01-01 14:30:00"),
]
},
),
(
{
"id": ["A", "B", "A", "B", "A", "B"],
"time": [
Timestamp("2019-01-01 12:00:00", tz="Asia/Tokyo"),
Timestamp("2019-01-01 12:30:00", tz="Asia/Tokyo"),
None,
None,
Timestamp("2019-01-01 14:00:00", tz="Asia/Tokyo"),
Timestamp("2019-01-01 14:30:00", tz="Asia/Tokyo"),
],
},
{
"time": [
None,
None,
Timestamp("2019-01-01 12:00:00", tz="Asia/Tokyo"),
Timestamp("2019-01-01 12:30:00", tz="Asia/Tokyo"),
None,
None,
]
},
{
"time": [
Timestamp("2019-01-01 12:00:00", tz="Asia/Tokyo"),
Timestamp("2019-01-01 12:30:00", tz="Asia/Tokyo"),
Timestamp("2019-01-01 14:00:00", tz="Asia/Tokyo"),
Timestamp("2019-01-01 14:30:00", tz="Asia/Tokyo"),
Timestamp("2019-01-01 14:00:00", tz="Asia/Tokyo"),
Timestamp("2019-01-01 14:30:00", tz="Asia/Tokyo"),
]
},
{
"time": [
Timestamp("2019-01-01 12:00:00", tz="Asia/Tokyo"),
Timestamp("2019-01-01 12:30:00", tz="Asia/Tokyo"),
Timestamp("2019-01-01 12:00:00", tz="Asia/Tokyo"),
Timestamp("2019-01-01 12:30:00", tz="Asia/Tokyo"),
Timestamp("2019-01-01 14:00:00", tz="Asia/Tokyo"),
Timestamp("2019-01-01 14:30:00", tz="Asia/Tokyo"),
]
},
),
],
)
def test_shift_bfill_ffill_tz(data, expected_shift, expected_bfill, expected_ffill):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this test in test_groupby.py

Also you can parameterize over shift, bfill, and ffill

# GHxxxxx: Check that timezone does not drop in shift, bfill, and ffill
df = DataFrame(data)

result = df.groupby("id").shift()
expected = DataFrame(expected_shift)
assert_frame_equal(result, expected)

result = df.groupby("id").bfill()
expected = DataFrame(expected_bfill)
assert_frame_equal(result, expected)

result = df.groupby("id").ffill()
expected = DataFrame(expected_ffill)
assert_frame_equal(result, expected)