Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt out of bottleneck for nanmean #47716

Merged
merged 8 commits into from
Jul 18, 2022

Conversation

sebasv
Copy link
Contributor

@sebasv sebasv commented Jul 14, 2022

@pep8speaks
Copy link

pep8speaks commented Jul 14, 2022

Hello @sebasv! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-07-17 13:11:07 UTC

@sebasv
Copy link
Contributor Author

sebasv commented Jul 14, 2022

I added 2 unit tests to validate that numerical precision is bounded by log(n). If anyone has suggestions how to improve these tests let me know

@mroeschke mroeschke added Dependencies Required and optional dependencies Reduction Operations sum, mean, min, max, etc. labels Jul 14, 2022
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirming that this should be disabled for integer types too? i.e. the precision loss is similar for integers and therefore should also be disabled?

clarify that there might be a performance decrease experienced from disabling `mean` for bottleneck

Co-authored-by: Matthew Roeschke <[email protected]>
@sebasv
Copy link
Contributor Author

sebasv commented Jul 14, 2022

@mroeschke I will verify this. Addition is pure for integers so depending on the bottleneck implementation it should be almost exact

@JMBurley
Copy link
Contributor

Glad to see this update. Should it be targeting pandas 1.5.0 or 1.4.4 release?

@mroeschke
Copy link
Member

Glad to see this update. Should it be targeting pandas 1.5.0 or 1.4.4 release?

1.5. Point releases are reserved for regression in behavior from recent releases.

@sebasv
Copy link
Contributor Author

sebasv commented Jul 15, 2022

Integer mean should also be opted out of. In the case of integer arrays it seems that bottleneck accumulates the sum in a float64 type, wich means that the problem persists there as well. I'm adding int types to the unit test as well

Example

>>> a = np.full((1_000_000_000, 1), 1_000_000_000_000, dtype=np.int64)
>>> bottleneck.nanmean(a)
1000000022905.4894
>>> np.nanmean(a)
1000000000000.0

@mroeschke mroeschke added this to the 1.5 milestone Jul 15, 2022
@mroeschke mroeschke merged commit cf4758f into pandas-dev:main Jul 18, 2022
@mroeschke
Copy link
Member

Thanks @sebasv and @JMBurley for review

@sebasv sebasv deleted the opt-out-of-bottleneck-for-mean branch July 18, 2022 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dependencies Required and optional dependencies Reduction Operations sum, mean, min, max, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: np.mean(pd.Series) != np.mean(pd.Series.values)
4 participants