Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/TST: fix and test for timezone drop in GroupBy.shift/bfill/ffill #27992

Merged
merged 7 commits into from
Sep 9, 2019

Conversation

noritada
Copy link
Contributor

@noritada noritada commented Aug 18, 2019

closes #19995

Timezone info is dropped in GroupBy.shift, bfill, and ffill
since index calculated by Cythonized functions are applied to
NumPy representation of values without timezone.

Could you please find and review this fix?

Many thanks,

-noritada

In [1]: import pandas as pd 
   ...: pd.__version__ 
Out[1]: '0.24.2'

In [2]: df = pd.DataFrame([ 
   ...:     ['a', pd.Timestamp('2019-01-01 00:00:00+09:00')], 
   ...:     ['b', pd.Timestamp('2019-01-01 00:05:00+09:00')], 
   ...:     ['a', None], 
   ...:     ['b', None], 
   ...:     ['a', pd.Timestamp('2019-01-01 00:20:00+09:00')], 
   ...:     ['b', pd.Timestamp('2019-01-01 00:25:00+09:00')], 
   ...: ], columns=['id', 'time']) 

In [3]: df['time'].shift(1) 
Out[3]: 
0                         NaT
1   2019-01-01 00:00:00+09:00
2   2019-01-01 00:05:00+09:00
3                         NaT
4                         NaT
5   2019-01-01 00:20:00+09:00
Name: time, dtype: datetime64[ns, pytz.FixedOffset(540)]

In [4]: df.groupby('id')['time'].shift(1) 
Out[4]: 
0                   NaT
1                   NaT
2   2018-12-31 15:00:00
3   2018-12-31 15:05:00
4                   NaT
5                   NaT
Name: time, dtype: datetime64[ns]

In [5]: df.groupby('id')['time'].bfill() 
Out[5]: 
0   2018-12-31 15:00:00
1   2018-12-31 15:05:00
2   2018-12-31 15:20:00
3   2018-12-31 15:25:00
4   2018-12-31 15:20:00
5   2018-12-31 15:25:00
Name: time, dtype: datetime64[ns]

In [6]: df.groupby('id')['time'].ffill() 
Out[6]: 
0   2018-12-31 15:00:00
1   2018-12-31 15:05:00
2   2018-12-31 15:00:00
3   2018-12-31 15:05:00
4   2018-12-31 15:20:00
5   2018-12-31 15:25:00
Name: time, dtype: datetime64[ns]
  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@@ -120,6 +120,7 @@ Groupby/resample/rolling
- Bug in :meth:`pandas.core.groupby.DataFrameGroupBy.transform` where applying a timezone conversion lambda function would drop timezone information (:issue:`27496`)
- Bug in windowing over read-only arrays (:issue:`27766`)
- Fixed segfault in `pandas.core.groupby.DataFrameGroupBy.quantile` when an invalid quantile was passed (:issue:`27470`)
- Bug in :meth:`pandas.core.groupby.GroupBy.shift`, :meth:`pandas.core.groupby.GroupBy.bfill` and :meth:`pandas.core.groupby.GroupBy.ffill` where timezone information would be dropped (:issue:`xxxxx`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is not an existing issue for this bug, use the number of this PR.

),
],
)
def test_shift_bfill_ffill_tz(data, expected_shift, expected_bfill, expected_ffill):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this test in test_groupby.py

Also you can parameterize over shift, bfill, and ffill

@mroeschke mroeschke added Bug Groupby Timezones Timezone data dtype labels Aug 18, 2019
@noritada
Copy link
Contributor Author

@mroeschke
Thank you very much for your reviews.
I've updated as per your suggestions.
Could you please check again?

Many thanks,

-noritada

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice. One small comment otherwise LGTM.

@@ -1882,3 +1882,69 @@ def test_groupby_axis_1(group_name):
results = df.groupby(group_name, axis=1).sum()
expected = df.T.groupby(group_name).sum().T
assert_frame_equal(results, expected)


@pytest.mark.parametrize("tz", [None, "Asia/Tokyo"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a tz_naive_fixture you can use in the function signature that will parameterize over more timezones.

def test_shift_bfill_ffill_tz(tz_naive_fixture, op, expected):
    tz = tz_naive_fixture
    ...

@noritada
Copy link
Contributor Author

@mroeschke
Thank you so much again.
I should have found this function.
Test code updated again.

Many thanks,

-noritada

@@ -2220,26 +2220,28 @@ def _get_cythonized_result(
base_func = getattr(libgroupby, how)

for name, obj in self._iterate_slices():
values = obj._data._values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I'm not sure relying on underlying block values is the best way to go about this. Is it possible to just work with the obj here instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my lack of understanding about underlying block values.
Should I leave the code using obj.values except for the last result = algorithms.take_nd(obj.values, result)?

@WillAyd
Copy link
Member

WillAyd commented Aug 23, 2019

Can you update whatsnew for 0.25.2 as well?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Aug 23, 2019 via email

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. can you do an issue search to see if we have anything about this (IIRC we do). if so, pls reference these issues.

@@ -120,6 +120,7 @@ Groupby/resample/rolling
- Bug in :meth:`pandas.core.groupby.DataFrameGroupBy.transform` where applying a timezone conversion lambda function would drop timezone information (:issue:`27496`)
- Bug in windowing over read-only arrays (:issue:`27766`)
- Fixed segfault in `pandas.core.groupby.DataFrameGroupBy.quantile` when an invalid quantile was passed (:issue:`27470`)
- Bug in :meth:`pandas.core.groupby.GroupBy.shift`, :meth:`pandas.core.groupby.GroupBy.bfill` and :meth:`pandas.core.groupby.GroupBy.ffill` where timezone information would be dropped (:issue:`27992`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move to 0.25.2

@jreback jreback added this to the 0.25.2 milestone Sep 8, 2019
@jreback
Copy link
Contributor

jreback commented Sep 8, 2019

also pls merge master

@noritada
Copy link
Contributor Author

noritada commented Sep 9, 2019

@WillAyd @TomAugspurger @jreback
Thanks for comments and sorry to be absent for 3 weeks.
I've merged master, move a whatsnew entry to v0.25.2 and document #19995 .

Could you please check this again?

Many thanks,

-noritada

@jreback jreback merged commit e0c63b4 into pandas-dev:master Sep 9, 2019
@jreback
Copy link
Contributor

jreback commented Sep 9, 2019

thanks @noritada

@lumberbot-app
Copy link

lumberbot-app bot commented Sep 9, 2019

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
$ git checkout 0.25.x
$ git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
$ git cherry-pick -m1 e0c63b4cfaa821dfe310f4a8a1f84929ced5f5bd
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
$ git commit -am 'Backport PR #27992: BUG/TST: fix and test for timezone drop in GroupBy.shift/bfill/ffill'
  1. Push to a named branch :
git push YOURFORK 0.25.x:auto-backport-of-pr-27992-on-0.25.x
  1. Create a PR against branch 0.25.x, I would have named this PR:

"Backport PR #27992 on branch 0.25.x"

And apply the correct labels and milestones.

Congratulation you did some good work ! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instruction are inaccurate, feel free to suggest an improvement.

@TomAugspurger
Copy link
Contributor

@jreback I think the plan was for 0.25.2 to just be Python 3.8 compat and 0.25 regressions: #28110. I don't believe this was a regression in 0.25.

@jreback
Copy link
Contributor

jreback commented Sep 9, 2019

ok sure np

alright then prob just need to move the actual release note

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Groupby Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Shifting a datetime column with timezone after groupby loses the timezone.
5 participants