Conversation

jemmajeffree
Updated algorithm for nanvar to use an adapted version of the Schubert and Gertz (2018) algorithm mentioned in #386, following discussion in #422.

flox/core.py Outdated
@@ -1251,7 +1252,8 @@ def chunk_reduce(
# optimize that out.
previous_reduction: T_Func = ""
for reduction, fv, kw, dt in zip(funcs, fill_values, kwargss, dtypes):
if empty:
# UGLY! but this is because the `var` breaks our design assumptions
if empty and reduction is not var_chunk:
Collaborator

@dcherian dcherian Jul 18, 2025

This code path is an "optimization" for chunks that don't contain any valid groups, so group_idx is all -1.
We will need to override full in MultiArray. Look up what the like kwarg does here; it dispatches to the appropriate array type.


The next issue will be that fill_value is a scalar like np.nan but that doesn't work for all our intermediates (e.g. the "count").

  1. My first thought is that MultiArray will need to track a default fill_value per array. For var, this can be initialized to (None, None, 0). If None we use the fill_value passed in; else the default.
  2. The other way would be to hardcode some behaviour in _initialize_aggregation so that agg.fill_value["intermediate"] = ( (fill_value, fill_value, 0), ), and then multi-array can receive that tuple and do the "right thing".

The other place this will matter is in reindex_numpy, which is executed at the combine step. I suspect the second tuple approach is the best.

This bit is hairy, and ill-defined. Let me know if you want me to work through it.
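A minimal sketch of what overriding full could look like. This MultiArray is a hypothetical stand-in, not flox's actual class, and it assumes the second option above: a tuple with one fill value per intermediate array.

```python
import numpy as np

class MultiArray:
    """Hypothetical stand-in for flox's MultiArray: a tuple of
    per-intermediate arrays (e.g. sum-of-squares, sum, count for var)."""

    def __init__(self, arrays):
        self.arrays = tuple(arrays)

    def __array_function__(self, func, types, args, kwargs):
        # np.full(shape, fill_value, like=multiarray) dispatches here via
        # NEP 35; fill_value arrives as one scalar per intermediate,
        # e.g. (np.nan, np.nan, 0) for var.
        if func is np.full:
            shape, fill_value = args[0], args[1]
            return MultiArray(np.full(shape, fv) for fv in fill_value)
        return NotImplemented

# Calling the protocol directly, as NumPy would for np.full(..., like=ma):
ma = MultiArray([np.zeros(2)] * 3)
out = ma.__array_function__(
    np.full, (MultiArray,), ((4,), (np.nan, np.nan, 0.0)), {}
)
```

The per-intermediate fill values never have to live on the scalar fill_value path this way; the container itself knows how to expand them.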

Author

I'm partway through implementing something to work here.

  • How do I trigger this code pathway without brute-force overwriting if empty: with if True:?
  • When np.full is called, like is a np array, not a MultiArray, because it's (I think) the chunk data, bypassing var_chunk (could also be an artefact of the if True override above?). In a pinch, I guess I could add an elif that catches the empty and reduction is var_chunk case and coerces it into a MultiArray, but that's also ugly, so I'm hoping you might have better ideas.

Author


Thinking some more, I may have misinterpreted what fill_value is used for. When is it needed for intermediates?

@dcherian
Collaborator

This is great progress! Now we reach some much harder parts. I pushed a commit to show where I think the "chunk" function should go and left a few comments. I think the next steps should be to:

  1. address those comments;
  2. add a new test to test_core.py with your reproducer (though modified to work with pure numpy arrays);
  3. implement np.full for MultiArray;
  4. dig a bit more into the "fill value" bits. You'll see that test_groupby_reduce_all fails in a couple of places to do with fill_value and setitem. This will take some work to fix, but basically it has to do with adding a "fill value" for groups that have no value up to this point;
  5. figure out another confusing failure where the MultiArray only has 2 arrays instead of 3. I don't understand how that happens.

@@ -355,7 +354,7 @@ def var_chunk(group_idx, array, *, engine: str, axis=-1, size=None, fill_value=N
engine=engine,
axis=axis,
size=size,
fill_value=fill_value[2], # Unpack fill value bc it's currently defined for multiarray
fill_value=0, # Unpack fill value bc it's currently defined for multiarray
Author


Am I understanding correctly that this would overwrite whatever is passed through in fill_value when the aggregation is defined? And we're assuming that in no instance would a different value of fill_value be wanted?

Author


If the concern is that None[2] isn't a thing, wouldn't it make more sense to have (None, None, None) be the default and keep the unpacking?

Collaborator

@dcherian dcherian Aug 19, 2025


Yes, I think the hardcoding is fine here. It's probably fine to just set fill_value=(np.nan,)

den = multiarray.arrays[2] - ddof
# preserve nans for groups with 0 obs; so these values are -ddof
den[den < 0] = 0
return multiarray.arrays[0] / den
Author


Am I correct that this will throw a divide by zero warning for groups with zero obs? Is that the intended behaviour?

Collaborator


Yes, I was relying on it to set NaNs, but you're right; it's probably better to use a mask: result[den < 0] = np.nan
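A sketch of that masked finalize step, assuming the intermediates carry a summed-squared-deviations array and a count array (the names here are illustrative, not flox's):

```python
import numpy as np

def finalize_var(sum_sq_dev, counts, ddof=1):
    # Hypothetical finalize: divide by (count - ddof), and mask groups with
    # too few observations rather than relying on a divide-by-zero warning.
    den = counts - ddof
    with np.errstate(divide="ignore", invalid="ignore"):
        result = sum_sq_dev / den
    result[den <= 0] = np.nan  # count <= ddof: variance undefined
    return result
```

Masking den <= 0 (rather than only den < 0) also answers the later question about den == 0: a group with exactly ddof observations gets NaN instead of inf.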

@@ -163,6 +166,9 @@ def notnull(data):


def isnull(data: Any):
if isinstance(data, tuple) and len(data) == 3 and data == (0, 0, 0):
Author


Out of curiosity, what are these lines (and associated lines above) doing?
I guess pd.isnull can't cope with three numbers as a single entity? In which case, why is only the fill_value something that needs checking?

Collaborator


It didn't like the tuple of 0s, so it's a hack.


assert_equal(expected, no_offset, tol)
assert_equal(expected_offset, with_offset, tol)
if exponent < 10:
Author


If the line above (assert_equal(expected_offset, with_offset, tol)) is passing, then the if statement shouldn't be needed here at all. There's no reason for the numpy implementation to be wrong in the same way as the new flox implementation.

Collaborator


Oh interesting! This is a numpy failure; great, let's delete it.

@jemmajeffree
Author

Does var_combine need a skipna too? (not just var_chunk?)

Happy to write this, just checking if my understanding is correct before I do so.

@jemmajeffree
Author

Very nice, it took very little work to push this across the line.

Thank you! (for the code & the support)

One more thing to do could be to add some dask chunking to the new test you've added.
I'm happy to merge asap unless there was more you wanted to do here.
Extending your current test to a property test could be fun if you're up for it . This could very easily be done in a new PR.

Happy to write some more tests in a separate PR and merge this one if you're happy with that.

I've left a couple of minor questions above that we'd ideally address first (in particular, I think that var_combine needs a skipna keyword argument, but it's possible I've missed something in the implementation you've written).

Also, I haven't done much manual testing of using var/std in different circumstances, and would be relying on the automated testing to catch problems. Automated testing/CI isn't something I've used before, so I don't know if this is reasonable or if I should poke at it some more and make sure the new changes are robust to a bunch of different use cases. Happy to trust your assessment.

@jemmajeffree jemmajeffree marked this pull request as ready for review August 19, 2025 06:40
@jemmajeffree
Author

Also, I'd be keen to leave some documentation somewhere summarizing where the maths came from that goes into var_chunk and var_combine. Is the best place to do this writing a small paragraph into a docstring at the start of var_chunk? Or adding something to the flox docs? Or another location?

@dcherian
Collaborator

dcherian commented Aug 19, 2025

Does var_combine need a skipna too? (not just var_chunk?)

I would have thought so, but perhaps since the intermediates are using fill_value=0, the non-NaN skipping versions are fine.
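A small illustration of why zero-filled intermediates can make the plain (non-NaN-skipping) combine safe. This is a sketch of the idea, not flox's code: NaNs are dropped at the chunk stage, so the combine only ever sums finite numbers.

```python
import numpy as np

# At the chunk stage, NaNs contribute 0 to the sum and are excluded from
# the count, so downstream combines never see a NaN.
chunk = np.array([1.0, np.nan, 3.0])
valid = ~np.isnan(chunk)
partial_sum = np.where(valid, chunk, 0.0).sum()
partial_count = valid.sum()
```

If this reasoning holds, skipna only matters in var_chunk, where the raw data (and its NaNs) are first seen.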

Also, I'd be keen to leave some documentation somewhere summarizing where the maths came from that goes into var_chunk and var_combine. Is the best place to do this writing a small paragraph into a docstring at the start of var_chunk? Or adding something to the flox docs? Or another location?

Yes, documenting in-code sounds right to me. A docs bit would be nice too if you're keen.

@jemmajeffree
Author

jemmajeffree commented Aug 19, 2025

Found another problem :)
When I wrote var_combine, I assumed that combine is only called along one direction at a time:
https://github.com/jemmajeffree/flox/blob/d5c59e35e748ef960c4a50bf5f46d188b86c9cfa/flox/aggregations.py#L408
assert len(axis) == 1, "Assuming that the combine function is only in one direction at once", since the components need to be combined one at a time.

This assumption is broken with:

import numpy as np
import xarray as xr
import flox

testdata = xr.DataArray(np.ones((4,2)),dims=('d0','d1')).chunk({'d0':2}).assign_coords({'g':xr.DataArray(np.arange(4)%2,dims='d0')})

testdata.groupby('g').var(('d0','d1')).load()

In the examples I've tried so far, only one of the axis dimensions actually has a length greater than 1, so we could pick out that dimension for the cumsum/clip_first/clip_last? But I don't know what other situations might occur and whether that's actually a generic fix

Interesting that none of the automated tests picked this up; I might write something to fix that when writing your other suggested fixes.

@dcherian
Collaborator

Ah yes, we will need to add that

pytest "tests/test_core.py::test_cohorts_nd_by[flox-cohorts-None-nanvar]" --pdb

will drop you in a debugger prompt that you can use to explore the problem. We'll need to generalize this to nD.

What happens is that we combine the intermediates by concatenating them over the axes being reduced, and then reducing over those axes. At the end, the result will have size 1 along those axes, which are then squeezed out (see _squeeze_intermediates).

@dcherian
Collaborator

dcherian commented Aug 20, 2025

The property test failure indicates some kind of dtype-dependent behaviour: pytest "tests/test_properties.py::test_groupby_reduce_numpy_vs_other" --pdb

import numpy as np
import dask.array
from flox.core import groupby_reduce

numpy_array = np.array([5592407., 5592407., 5592407.], dtype=np.float32)
by = np.array([0, 0, 0], dtype=np.int8)
array = dask.array.from_array(numpy_array, chunks=1)

np.var(numpy_array)  # 0.25!!!!
np.var(numpy_array, dtype=np.float64)  # 0
np.var(array).compute()  # 0

groupby_reduce(numpy_array, by, func="var", engine="numpy")[0]  # 0.1111!
groupby_reduce(numpy_array, by, func="var", engine="flox")[0]  # 0.25!
groupby_reduce(numpy_array, by, func="var", engine="numbagg")[0]  # 0.1111!
groupby_reduce(array, by, func="var", engine="numpy")[0].compute()  # 0.

Clearly we should be accumulating to float64 intermediates and then downcasting at the end if needed.

@jemmajeffree
Author

jemmajeffree commented Aug 20, 2025

Clearly we should be accumulating to float64 intermediates and then downcasting at the end if needed.

Unfortunately I think we need to cast to float64 before calculating the intermediates, which means the current dtypes parameter doesn't work: it uses agg.dtypes['numpy'], which is set to the final dtype here. Otherwise float32 ends up being used for the var_chunk calculations, which is the issue.

The following commit works, but I don't like promoting to float64 as a final dtype
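The accumulate-in-float64, downcast-at-the-end idea can be shown in pure numpy (this is only the dtype behaviour, not the flox code path):

```python
import numpy as np

x = np.array([5592407.0, 5592407.0, 5592407.0], dtype=np.float32)

# Accumulating entirely in float32 loses precision; the true variance is 0.
v32 = np.var(x)

# Promote only the intermediates to float64, then cast the result back,
# so the caller still sees a float32.
v64 = np.var(x, dtype=np.float64).astype(np.float32)
```

The final dtype stays float32 either way; only the accumulation precision differs.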

@jemmajeffree
Author

Ah yes, we will need to add that

pytest "tests/test_core.py::test_cohorts_nd_by[flox-cohorts-None-nanvar]" --pdb
will drop you in a debugger prompt that you can use to explore the problem. We'll need to generalize this to nD.

What happens is that we combine the intermediates by concatenating them over the axes being reduced, and then reducing over those axes. At the end, the result will have size 1 along those axes, which are then squeezed out (see _squeeze_intermediates).

Just to make sure I understand: there are situations in which var_combine needs to handle combining intermediates along multiple axes at once? If so, that's tricky, because the equation I'm using merges MultiArray sets of intermediates pairwise. I think the best way to handle it is probably to stack those axes? (Or is that bad for memory?) Or possibly deal with the dimensions one after the other? Or we could do something awful with a 2D cumulative sum that loops around for calculating the adjustment terms, but that would pretty much guarantee the code is unintelligible to anybody else.

@dcherian
Collaborator

Unfortunately I think we need to turn into float64 before calculating intermediates,

We could explicitly accumulate to float64 intermediates in _var_chunk; chunk_reduce calls astype(dtype) on the result anyway. Also note that the cumsum bits in _var_combine are the likeliest source of problems here, so those should be accumulating to a 64-bit type.

I think the best way to handle it is probably to stack those axes? (or is that bad for memory?)

Yes, I think a reshape would be fine; the axes you are reshaping will be contiguous, so the result will be a view of the original, with no extra memory. Just make sure to reshape back to the same dimensionality at the end.
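A quick check of the reshape-is-a-view claim (generic numpy, hypothetical shapes):

```python
import numpy as np

x = np.arange(24.0).reshape(2, 3, 4)  # suppose we reduce over the last two axes
flat = x.reshape(2, -1)               # collapsing contiguous axes yields a view
assert np.shares_memory(flat, x)      # no copy was made
restored = flat.reshape(2, 3, 4)      # reshape back at the end
```

Because the reduced axes are adjacent in memory, numpy can reinterpret the strides without copying; a reshape over non-contiguous axes would silently copy instead.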

@dcherian
Collaborator

Jemma, please let me know how I can help here. I'm happy to tackle some of these failing tests if you prefer

@jemmajeffree
Author

Thanks Deepak. I'm pretty busy the next few weeks, which is just to say if I drop off for a couple days it's not because I've lost interest. Back to normal 23rd September, but I'll try and get this across the line this week or early next.

I've got a sense of how to implement the reshape/multiple axes for var_combine, and am hoping to get to it soon. I think this will fix most of the failing tests?

Regarding casting to float64, I'm not confident that I've thought through all the edge cases. I.e., we probably wouldn't want to cast np.longdouble back to np.float64? Or is it a safe assumption that np.float64 is a good idea for intermediates no matter what the inputs were? If you have a solution that you're happy with, feel free to fix this one; otherwise I'll get to it when I can.

It's probably pretty obvious, but I've got basically no familiarity with pytest so I'm developing that familiarity while trying to use it here. I'm happy to keep working my way slowly through it, but I'm also happy for you to take on other failing tests.

What do you see as the outstanding tasks to get this PR finalised? Is it just to address the causes of all the failing tests? I think we might have had a couple of unresolved comment threads from code reviews that I'll try to track down. Any suggestions for how you'd normally keep track of the last few things to do?

@dcherian
Collaborator

but I've got basically no familiarity with pytest so I'm developing that familiarity while trying to use it here. I'm happy to keep working my way slowly through it, but I'm also happy for you to take on other failing tests.

pytest is an acquired taste hehe, and it's used in complex ways here. Some tricks that will help:

  1. pip install pdbpp for pdb++, a better debugger.
  2. pytest --pdb will drop you in a debugger prompt at a failure. This will let you inspect the state, and run python statements. Usually I move up (type u for short), till I see code I am familiar with, and then move "down" with d to understand what's happening.
  3. To insert a debugger breakpoint use import pdb; pdb.set_trace() anywhere in the code. Run with pytest -s for it to wait for input at this prompt. So my usual trick is insert that breakpoint and then pytest -s --pdb TEST_NAME.

cast np.longdouble back to np.float64

No one's really using that, or if they are, we can fix it when they complain.

Is it just to address the causes of all the failing tests?

yes happy to merge when tests pass.

Looks like you're just down to the property tests failing?! If so, I can clear those up.

@jemmajeffree
Author

Looks like you're just down to the property tests failing?!

I think so? I wrapped the part of my code that combines along a single axis in a for loop, so it only has to handle one axis at a time. All the other ideas I tried were nasty to implement.
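That loop-over-axes approach can be sketched like this; combine_one_axis is a trivial stand-in (a sum) for the real pairwise merge of MultiArray intermediates:

```python
import numpy as np

def combine_one_axis(arr, axis):
    # Stand-in for the real single-axis pairwise combine.
    return arr.sum(axis=axis, keepdims=True)

def combine(arr, axes):
    # Reduce the multi-axis case to repeated single-axis combines;
    # keepdims=True keeps later axis indices valid inside the loop.
    for ax in axes:
        arr = combine_one_axis(arr, ax)
    return arr
```

The dimensionality is preserved throughout, so the size-1 reduced axes can still be squeezed out afterwards as usual.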

If so, I can clear those up.

Please :)

Author

@jemmajeffree jemmajeffree left a comment


Okay, fair.
I was deliberately trying to avoid triggering the "invalid value encountered in np divide" warning, though I concede it was a bit of a hack.
Do we want to force den == 0 also to be NaN? (as opposed to inf?)
