Resampling with dynamic schema #2217
base: master
Conversation
The comment on line 666 is now outdated, and the `reserve` on line 657 is less useful.
The comment on line 52 is now out of date. Worth adding a comment about why no `else` is required on this `if`.
```python
elif np.issubdtype(dtype, np.str_):
    return 0 if aggregation == "count" else None
elif pd.api.types.is_bool_dtype(dtype):
    return np.nan if aggregation == "mean" else 0
```
`else False`?
```python
assert aggregation in ALL_AGGREGATIONS
if is_float_dtype(dtype):
    return 0 if aggregation == "count" else np.nan
```
Doesn't `sum` default to 0 for floats as well?
```python
@pytest.mark.parametrize("dtype", [np.float32, np.float64, np.uint32, np.int32, np.int64, bool, "datetime64[ns]", str])
def test_bucket_spans_two_segments(self, lmdb_version_store_dynamic_schema_v1, dtype):
```
Worth also testing the case where the first segment has column `b` and the second segment doesn't.
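A minimal sketch of the suggested case, assuming the fixture semantics and the QueryBuilder resample API used elsewhere in this PR (data values, the test name, and the expected result are illustrative):

```python
import pandas as pd
from arcticdb import QueryBuilder

def test_first_segment_has_column_second_does_not(lmdb_version_store_dynamic_schema_v1):
    lib = lmdb_version_store_dynamic_schema_v1
    # First row-slice contains column "b"; the appended slice does not.
    idx1 = pd.date_range("2024-01-01", periods=2, freq="min")
    idx2 = pd.date_range("2024-01-01 00:02", periods=2, freq="min")
    lib.write("sym", pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=idx1))
    lib.append("sym", pd.DataFrame({"a": [5.0, 6.0]}, index=idx2))

    q = QueryBuilder()
    q = q.resample("10min").agg({"b": "sum"})
    result = lib.read("sym", query_builder=q).data
    # The single bucket spans both slices; only the first contributes to "b".
    assert result["b"].iloc[0] == 7.0
```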
```python
@pytest.mark.parametrize("dtype", [np.float32, np.float64, np.uint32, np.int32, np.int64, bool, "datetime64[ns]", str])
def test_bucket_spans_two_segments(self, lmdb_version_store_dynamic_schema_v1, dtype):
```
Also worth having a test where there are 3 row-slices in a bucket, with the first and last containing the aggregation column, but the middle one missing it.
Are both branches of `responsible_for_first_overlapping_bucket` covered in testing?
Reference Issues/PRs
This is a rework of #2201
What does this implement or fix?
This enables using the QueryBuilder to resample symbols in libraries where dynamic schema is enabled. This requires handling two cases that differ from static schema.
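For context, a sketch of the kind of usage this enables, assuming a library created with dynamic schema (symbol, column names, and data are illustrative):

```python
import pandas as pd
from arcticdb import Arctic, LibraryOptions, QueryBuilder

ac = Arctic("lmdb://./arctic_data")
lib = ac.get_library("demo", create_if_missing=True,
                     library_options=LibraryOptions(dynamic_schema=True))

# With dynamic schema, rows written at different times may carry different column sets.
lib.write("sym", pd.DataFrame({"price": [1.0, 2.0]},
                              index=pd.date_range("2024-01-01", periods=2, freq="min")))
lib.append("sym", pd.DataFrame({"price": [3.0], "volume": [10.0]},
                               index=pd.date_range("2024-01-01 00:02", periods=1, freq="min")))

q = QueryBuilder()
q = q.resample("1h").agg({"price": "mean", "volume": "sum"})
df = lib.read("sym", query_builder=q).data
```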
Handling missing columns
The aggregate method takes as input an array of `std::optional` columns; an empty optional marks a missing column. The first step of the aggregation is to determine the common type (which handles type propagation): `generate_common_input_type` skips missing columns and computes the common type across all available columns. The result is a `std::optional` type. If the optional is empty, every column was skipped (because they were all missing), and at that point the aggregate function can terminate, returning an empty `std::optional` for the resulting column. The resampling clause will ignore empty optional columns returned by aggregate.
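A minimal Python sketch of that flow (the actual implementation is C++; numpy arrays and `None` stand in for the real column and optional-type machinery):

```python
from typing import Optional, Sequence
import numpy as np

def generate_common_input_type(columns: Sequence[Optional[np.ndarray]]) -> Optional[np.dtype]:
    # Skip missing (None) columns and promote the dtypes of those present.
    present = [col.dtype for col in columns if col is not None]
    if not present:
        return None  # every input column was missing
    return np.result_type(*present)

def aggregate(columns: Sequence[Optional[np.ndarray]], agg):
    common = generate_common_input_type(columns)
    if common is None:
        # All columns were missing, so return an "empty optional"; the
        # resampling clause ignores such results.
        return None
    values = np.concatenate([col.astype(common) for col in columns if col is not None])
    return agg(values)
```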
Missing behavior

The current implementation has a flaw related to using `date_range`. If `date_range` is passed to the QueryBuilder and the date range does not contain some column (which is otherwise in the data frame), we will return either an empty dataframe (if no other columns are being aggregated) or a dataframe not containing that column. More sensible behavior would be to backfill the missing buckets with a default value based on the dtype and the aggregator (see the sketch below).
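A sketch of how such a default could be chosen, mirroring the helper visible in this PR's diff above (the function name and the exact bool/str defaults are still under discussion in the review):

```python
import numpy as np
import pandas as pd

def default_value(dtype, aggregation):
    # The fill value depends on both the column dtype and the aggregator.
    if aggregation == "count":
        return 0
    if pd.api.types.is_float_dtype(dtype):
        return np.nan
    if np.issubdtype(dtype, np.str_):
        return None
    if pd.api.types.is_bool_dtype(dtype):
        return np.nan if aggregation == "mean" else False
    return 0

# Hypothetical application: fill buckets whose rows all miss the column.
# resampled[col] = resampled[col].fillna(default_value(dtype, agg))
```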
Any other comments?

Checklist
Checklist for code changes...