More stable algorithm for variance, standard deviation #456
flox/core.py
```diff
@@ -1251,7 +1252,8 @@ def chunk_reduce(
     # optimize that out.
     previous_reduction: T_Func = ""
     for reduction, fv, kw, dt in zip(funcs, fill_values, kwargss, dtypes):
-        if empty:
+        # UGLY! but this is because the `var` breaks our design assumptions
+        if empty and reduction is not var_chunk:
```
this code path is an "optimization" for chunks that don't contain any valid groups, so `group_idx` is all -1.

We will need to override `full` in MultiArray. Look up what the `like` kwarg does here; it dispatches to the appropriate array type. The next issue will be that `fill_value` is a scalar like `np.nan`, but that doesn't work for all our intermediates (e.g. the "count").
- My first thought is that MultiArray will need to track a default fill_value per array. For `var`, this can be initialized to `(None, None, 0)`. If `None`, we use the `fill_value` passed in; else the default.
- The other way would be to hardcode some behaviour in `_initialize_aggregation` so that `agg.fill_value["intermediate"] = ((fill_value, fill_value, 0),)`, and then multi-array can receive that tuple and do the "right thing".

The other place this will matter is in `reindex_numpy`, which is executed at the combine step. I suspect the second tuple approach is the best.

This bit is hairy, and ill-defined. Let me know if you want me to work through it.
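To make the first option concrete, here is a minimal sketch of what a per-array default fill value could look like. Everything here is hypothetical (the class name `MultiArray` is from the PR, but the `full_like` method and `default_fill_values` attribute are illustrative inventions, not flox's actual API):

```python
import numpy as np


class MultiArray:
    """Hypothetical sketch of option 1: each wrapped array carries its own
    default fill value; None means "use the fill_value the caller passed"."""

    def __init__(self, arrays, default_fill_values=(None, None, 0)):
        self.arrays = tuple(arrays)
        self.default_fill_values = default_fill_values

    def full_like(self, shape, fill_value):
        # For var the intermediates are (sum_sq_dev, sum, count); the count
        # must be filled with 0, never NaN, so the per-array default
        # overrides the scalar fill_value where it is not None.
        filled = [
            np.full(shape, fill_value if default is None else default, dtype=arr.dtype)
            for arr, default in zip(self.arrays, self.default_fill_values)
        ]
        return MultiArray(filled, self.default_fill_values)


ma = MultiArray([np.zeros(3), np.zeros(3), np.zeros(3, dtype=np.int64)])
out = ma.full_like((2,), np.nan)
# first two intermediates are NaN-filled, the count array is 0-filled
```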
I'm partway through implementing something to work here.

- How do I trigger this code pathway without brute-force overwriting `if empty:` with `if True:`?
- When `np.full` is called, `like` is a np array, not a MultiArray, because it's (I think) the chunk data and bypassing `var_chunk` (could also be an artefact of the `if True:` override above?). In a pinch, I guess I could add an `elif` that catches the `empty and reduction is var_chunk` case and coerce that into a MultiArray, but it's also ugly, so I'm hoping you might have better ideas.
Thinking some more, I may have misinterpreted what fill_value is used for. When is it needed for intermediates?
This is great progress! Now we reach some much harder parts. I pushed a commit to show where I think the "chunk" function should go and left a few comments. I think the next steps should be to
Co-authored-by: Deepak Cherian <[email protected]>
for more information, see https://pre-commit.ci
```diff
@@ -355,7 +354,7 @@ def var_chunk(group_idx, array, *, engine: str, axis=-1, size=None, fill_value=N
         engine=engine,
         axis=axis,
         size=size,
-        fill_value=fill_value[2],  # Unpack fill value bc it's currently defined for multiarray
+        fill_value=0,  # Unpack fill value bc it's currently defined for multiarray
```
Am I understanding correctly that this would overwrite whatever is passed through in `fill_value` when the aggregation is defined? And we're assuming that in no instance would a different value of `fill_value` be wanted?
If the concern is that `None[2]` isn't a thing, wouldn't it make more sense to have `(None, None, None)` be the default and keep the unpacking?
Yes, I think the hardcoding is fine here. It's probably fine to just set `fill_value=(np.nan,)`.
flox/aggregations.py
```python
den = multiarray.arrays[2] - ddof
# preserve NaNs for groups with 0 obs; so these values are -ddof
den[den < 0] = 0
return multiarray.arrays[0] / den
```
Am I correct that this will throw a divide by zero warning for groups with zero obs? Is that the intended behaviour?
yes, I was relying on it to set NaNs; but you're right, it's probably better to use a mask: `result[den < 0] = np.nan`
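A minimal sketch of what the masked version could look like. The function name `var_finalize` and its exact signature are illustrative assumptions, not the PR's actual code; the point is masking `den <= 0` rather than relying on a 0/0 divide:

```python
import numpy as np


def var_finalize(sum_sq_dev, counts, ddof=0):
    # groups with <= ddof observations have no defined variance;
    # mask them to NaN instead of relying on a divide-by-zero warning
    den = counts - ddof
    with np.errstate(invalid="ignore", divide="ignore"):
        result = sum_sq_dev / den
    result[den <= 0] = np.nan
    return result


var_finalize(np.array([2.0, 0.0]), np.array([3.0, 0.0]), ddof=1)
# first group: 2.0 / (3 - 1) = 1.0; second group (0 obs): NaN
```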
```diff
@@ -163,6 +166,9 @@ def notnull(data):


+def isnull(data: Any):
+    if isinstance(data, tuple) and len(data) == 3 and data == (0, 0, 0):
```
Out of curiosity, what are these lines (and the associated lines above) doing? I guess `pd.isnull` can't cope with three numbers as a single entity? In which case, why is only the `fill_value` something that needs checking?
it didn't like the tuple of 0s, so it's a hack
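For context, the hack can be sketched roughly like this (the fallback to `pd.isnull` is an assumption based on the surrounding diff; the special case simply short-circuits the tuple-of-zeros intermediate fill value that pandas can't treat as one scalar):

```python
from typing import Any

import pandas as pd


def isnull(data: Any):
    # hack: the MultiArray intermediate fill_value (0, 0, 0) is not a
    # missing value, but pd.isnull would elementwise-check the tuple
    # instead of treating it as a single entity
    if isinstance(data, tuple) and len(data) == 3 and data == (0, 0, 0):
        return False
    return pd.isnull(data)
```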
```python
assert_equal(expected, no_offset, tol)
assert_equal(expected_offset, with_offset, tol)
if exponent < 10:
```
If the line above is passing (`assert_equal(expected_offset, with_offset, tol)`), then it looks like the if statement shouldn't be needed here at all. There's no reason for the numpy implementation to be wrong in the same way as the new flox implementation.
oh interesting! this is a numpy failure; great, let's delete it.
Does `var_combine` need a `skipna` too (not just `var_chunk`)? Happy to write this, just checking that my understanding is correct before I do so.
Thank you! (for the code & the support)
Happy to write some more tests in a separate PR and merge this one if you're happy with that. I've left a couple of minor questions above that we'd ideally address first (in particular, I think that `var_combine` needs a `skipna` keyword argument, but it's possible I've missed something in the implementation you've written). Also, I haven't done much manual testing of using var/std in different circumstances, and would be relying on the automated testing to catch problems. Automated testing/CI isn't something I've used before, so I don't know if this is reasonable or if I should poke it some more and make sure the new changes are robust to a bunch of different use cases. Happy to trust your assessment.
Also, I'd be keen to leave some documentation somewhere summarizing where the maths that goes into `var_chunk` and `var_combine` came from. Is the best place for this a small paragraph in a docstring at the start of `var_chunk`? Or adding something to the flox docs? Or another location?
I would have thought so, but perhaps since the intermediates use `fill_value=0`, the non-NaN-skipping versions are fine.
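To illustrate the point above with a toy example (the `(sum_sq_dev, sum, count)` triples here are hypothetical hand-picked values, not output of the PR's code): if an empty chunk produces zero-filled intermediates, a plain sum at the combine step already ignores it, so no NaN-skipping is needed.

```python
import numpy as np

# hypothetical per-chunk intermediates: (sum_sq_dev, sum, count)
chunk_a = (np.array([8.0]), np.array([6.0]), np.array([3]))
chunk_empty = (np.array([0.0]), np.array([0.0]), np.array([0]))

# a plain (non-NaN-skipping) sum of zero-filled intermediates is a no-op
total_count = chunk_a[2] + chunk_empty[2]  # still 3
total_sum = chunk_a[1] + chunk_empty[1]    # still 6.0
# the pairwise correction term also vanishes, since it scales with
# n_a * n_b / (n_a + n_b) and n_b is 0 for the empty chunk
```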
Yes, documenting in-code sounds right to me. A docs bit would be nice too if you're keen.
found another problem :) This assumption is broken with:

```python
import numpy as np
import xarray as xr
import flox

testdata = (
    xr.DataArray(np.ones((4, 2)), dims=("d0", "d1"))
    .chunk({"d0": 2})
    .assign_coords({"g": xr.DataArray(np.arange(4) % 2, dims="d0")})
)
testdata.groupby("g").var(("d0", "d1")).load()
```

In the examples I've tried so far, only one of the axis dimensions actually has a length greater than 1, so we could pick out that dimension for the cumsum/clip_first/clip_last? But I don't know what other situations might occur and whether that's actually a generic fix.

Interesting that none of the automated tests picked this up; might write something to fix that when writing your other suggested fixes.
Ah yes, we will need to add that
will drop you in a debugger prompt that you can use to explore the problem. We'll need to generalize this to nD. What happens is that we combine the intermediates and concatenate them over the axes being reduced over, and then reduce over those axes. At the end, the result will have size 1 along those axes, which are then squeezed out (see
The property test failure indicates some kind of dtype-dependent behaviour:

```python
import numpy as np
import dask.array

numpy_array = np.array([5592407.0, 5592407.0, 5592407.0], dtype=np.float32)
by = np.array([0, 0, 0], dtype=np.int8)
array = dask.array.from_array(numpy_array, chunks=1)

np.var(numpy_array)  # 0.25!!!!
np.var(numpy_array, dtype=np.float64)  # 0
np.var(array).compute()  # 0
groupby_reduce(numpy_array, by, func="var", engine="numpy")[0]  # 0.1111!
groupby_reduce(numpy_array, by, func="var", engine="flox")[0]  # 0.25!
groupby_reduce(numpy_array, by, func="var", engine="numbagg")[0]  # 0.1111!
groupby_reduce(array, by, func="var", engine="numpy")[0].compute()  # 0.
```

Clearly we should be accumulating to float64 intermediates and then downcasting at the end if needed.
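The float32 failure above is easy to reproduce in isolation. This sketch shows the suggested remedy (accumulate in float64, downcast at the end); it uses plain `np.var` rather than flox's intermediates, so it only demonstrates the dtype effect, not the PR's actual code path:

```python
import numpy as np

# three identical values, exactly representable in float32
x32 = np.array([5592407.0, 5592407.0, 5592407.0], dtype=np.float32)

naive = np.var(x32)                     # nonzero: float32 round-off in the accumulation
stable = np.var(x32, dtype=np.float64)  # 0.0: accumulate in float64
result = stable.astype(x32.dtype)       # downcast at the end if the final dtype matters
```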
Unfortunately I think we need to convert to float64 before calculating the intermediates, so the current `dtypes` parameter doesn't work? It's using `agg.dtypes['numpy']`, which is set to the final dtype here? Otherwise you end up with float32 being used for the `var_chunk` calculations, which is the issue. The following commit works, but I don't like promoting to float64 as a final dtype.
Just to make sure I understand: there are situations in which `var_combine` needs to be able to handle combining intermediates along multiple axes at once? If so, that's tricky to handle, because the equation I'm using merges MultiArray sets of intermediates pairwise. I think the best way to handle it is probably to stack those axes? (Or is that bad for memory?) Or possibly deal with dimensions one after the other? Or we could do something awful with a 2D cumulative sum that loops around for calculating the adjustment terms, but that'd pretty much guarantee that the code is unintelligible to anybody else.
we could explicitly accumulate to float64 intermediates in
Yes, I think a reshape would be fine; the axes you are reshaping will be contiguous, so this will end up being a view of the original, with no extra memory. Just make sure to reshape back to the same dimensionality at the end.
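A sketch of the reshape trick being suggested. The function `combine_over_axes` is a hypothetical stand-in (a plain `sum` replaces the PR's pairwise variance combine); the real logic of interest is collapsing the contiguous reduced axes into one via a view, then restoring the original dimensionality with size-1 axes:

```python
import numpy as np


def combine_over_axes(intermediate, axes):
    # assumes `axes` are contiguous, so the reshape is a view (no copy)
    axes = sorted(axes)
    shape = intermediate.shape
    merged = int(np.prod([shape[a] for a in axes]))
    new_shape = shape[: axes[0]] + (merged,) + shape[axes[-1] + 1 :]
    stacked = intermediate.reshape(new_shape)

    # stand-in for the single-axis pairwise combine
    reduced = stacked.sum(axis=axes[0], keepdims=True)

    # reshape back to the original dimensionality, size 1 along reduced axes
    out_shape = tuple(1 if a in axes else s for a, s in enumerate(shape))
    return reduced.reshape(out_shape)


combine_over_axes(np.ones((2, 3, 4)), [1, 2]).shape  # (2, 1, 1)
```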
Jemma, please let me know how I can help here. I'm happy to tackle some of these failing tests if you prefer.
This reverts commit d77c132.
Thanks Deepak. I'm pretty busy for the next few weeks, which is just to say that if I drop off for a couple of days it's not because I've lost interest. Back to normal 23rd September, but I'll try to get this across the line this week or early next.

I've got a sense of how to implement the reshape/multiple axes for var_chunk, and am hoping to get to this soon. I think this will fix most of the failing tests?

Regarding casting to float64, I'm not confident that I've thought through all the edge cases. i.e., we probably wouldn't want to cast np.longdouble back to np.float64? Or is it a safe assumption that np.float64 is a good idea for intermediates no matter what the inputs were? If you have a solution that you're happy with, then feel free to fix this one; otherwise I'll get to it when I can.

It's probably pretty obvious, but I've got basically no familiarity with pytest, so I'm developing that familiarity while trying to use it here. I'm happy to keep working my way slowly through it, but I'm also happy for you to take on other failing tests.

What do you see as the outstanding tasks to get this PR finalised? Is it just to address the causes of all the failing tests? I think we might have had a couple of unresolved comment threads from code reviews that I'll try to track down. Any suggestions for how you'd normally keep track of the last few things to do?
for more information, see https://pre-commit.ci
pytest is an acquired taste hehe, and it's used in complex ways here. Some tricks that will help:
No one's really using that, or if they are, we can fix it when they complain.
yes, happy to merge when tests pass. Looks like you're just down to the property tests failing?! If so, I can clear those up.
I think so? I wrapped the part of my code that combines along a single axis in a for loop, so it only has to handle one axis at a time. All the rest of the ideas I tried were nasty to implement.
Please :)
Okay, fair. I was deliberately trying to avoid triggering the "invalid value encountered in divide" numpy warning, though I concede it was a bit of a hack. Do we want to force `den == 0` also to be NaN (as opposed to inf)?
Updated algorithm for nanvar to use an adapted version of the Schubert and Gertz (2018) paper mentioned in #386, following discussion in #422.
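For readers unfamiliar with the approach, the core idea can be sketched as a chunk/combine pair over `(sum_sq_dev, sum, count)` intermediates. This is the classic pairwise merge (the Chan et al. update discussed by Schubert and Gertz), not the PR's actual `var_chunk`/`var_combine`, which also handle groups, engines, and fill values; names and signatures here are illustrative:

```python
import numpy as np


def var_chunk_sketch(x):
    # per-chunk intermediates: sum of squared deviations from the chunk
    # mean, the chunk sum, and the chunk count
    n = x.size
    s = x.sum(dtype=np.float64)
    m2 = ((x - s / n) ** 2).sum(dtype=np.float64)
    return m2, s, n


def var_combine_sketch(a, b):
    # pairwise merge: the delta**2 correction accounts for the two
    # chunks having different means
    m2a, sa, na = a
    m2b, sb, nb = b
    n = na + nb
    delta = sb / nb - sa / na
    m2 = m2a + m2b + delta**2 * na * nb / n
    return m2, sa + sb, n


x = np.arange(10.0)
m2, s, n = var_combine_sketch(var_chunk_sketch(x[:4]), var_chunk_sketch(x[4:]))
m2 / n  # matches np.var(x)
```

Working in shifted/deviation form rather than raw sums of squares is what gives the numerical stability this PR is after: the float32 catastrophic-cancellation example above is exactly the failure mode of the naive `E[x**2] - E[x]**2` formula.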