Resampling with dynamic schema #2217

Open · wants to merge 2 commits into master

Conversation

vasil-pashov (Collaborator)

Reference Issues/PRs

This is a rework of #2201

What does this implement or fix?

This enables using the QueryBuilder to resample symbols in libraries where dynamic schema is enabled. This requires handling two cases that differ from static schema (a usage sketch follows the list below).

  1. Type propagation. It turns out this works without any functional changes to the code.
  2. Missing columns. Dynamic schema allows segments to contain only a subset of all columns in the symbol. This case needed additional care.
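
A minimal usage sketch of the feature described above (hedged: the URI, library name, symbol, and data are illustrative; QueryBuilder resample/agg and dynamic_schema are the only pieces this PR concerns):

import arcticdb as adb
import pandas as pd

ac = adb.Arctic("lmdb://resample_demo")  # illustrative URI
lib = ac.get_library("demo", create_if_missing=True,
                     library_options=adb.LibraryOptions(dynamic_schema=True))

# Two row-slices with different column subsets, which dynamic schema permits.
lib.write("sym", pd.DataFrame({"a": [1.0, 2.0, 3.0]},
                              index=pd.date_range("2025-01-01", periods=3, freq="min")))
lib.append("sym", pd.DataFrame({"a": [4.0, 5.0, 6.0], "b": [1, 2, 3]},
                               index=pd.date_range("2025-01-01 00:03", periods=3, freq="min")))

# Resample into 2-minute buckets; "b" is absent from the first row-slice.
q = adb.QueryBuilder()
q = q.resample("2min").agg({"a": "sum", "b": "mean"})
print(lib.read("sym", query_builder=q).data)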

Handling missing columns
The aggregate method takes as input an array of std::optional columns; an empty optional marks a missing column. The first step of the aggregation is to determine the common type (which handles type propagation): generate_common_input_type skips missing columns and computes the common type across all available columns. The result is a std::optional type. If that optional is empty, every column was skipped (because all were missing), and at that point the aggregate function can terminate, returning an empty std::optional for the resulting column. The resampling clause ignores empty optional columns returned by aggregate. A sketch of this flow follows.
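
A minimal Python sketch of the control flow described above (the real implementation is C++; the function names mirror the description, while the dtype promotion and the aggregation body are illustrative stand-ins):

from typing import Optional, Sequence
import numpy as np

def generate_common_input_type(columns: Sequence[Optional[np.ndarray]]) -> Optional[np.dtype]:
    # Skip missing (None) columns and promote the rest to a common dtype.
    present = [col.dtype for col in columns if col is not None]
    return np.result_type(*present) if present else None

def aggregate(columns: Sequence[Optional[np.ndarray]]) -> Optional[np.ndarray]:
    common_type = generate_common_input_type(columns)
    if common_type is None:
        # All input columns were missing; the resampling clause will
        # ignore this empty result for the output column.
        return None
    # Placeholder for the real bucket aggregation, performed at common_type.
    return np.concatenate([col.astype(common_type) for col in columns if col is not None])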

Missing behavior
The current implementation has flaws related to using date_range:

  1. If a date_range is passed to the QueryBuilder and no segment within that range contains some column (which is otherwise present in the dataframe), we return either an empty dataframe (if no other columns are being aggregated) or a dataframe not containing that column. More sensible behavior would be to backfill the missing buckets with a default value based on the dtype and the aggregator (illustrated in the sketch after this list).
  2. Type propagation takes into account only the segments used in the aggregation and the dtypes from the segment descriptors. This means that different date ranges can result in dataframes whose columns have different dtypes. More sensible behavior would be to use the timeseries descriptor.
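
An illustration of flaw 1, continuing the usage sketch above (the date range and data are illustrative):

import datetime

# Column "b" only exists in rows from 00:03 onwards (see the append above).
q = adb.QueryBuilder()
q = q.resample("1min").agg({"b": "sum"})
# Restricting the read to the first two minutes selects no segment containing "b",
# so the current implementation returns an empty dataframe instead of buckets
# backfilled with the aggregator's default value.
res = lib.read("sym", query_builder=q,
               date_range=(datetime.datetime(2025, 1, 1, 0, 0),
                           datetime.datetime(2025, 1, 1, 0, 2))).data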

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

vasil-pashov added the minor (Feature change, should increase minor version) label on Mar 5, 2025
Collaborator:

Comment on line 666 is now outdated, and reserve on line 657 is less useful

Collaborator:

Comment on line 52 is now out of date. Worth adding a comment about why no else is required on this if

elif np.issubdtype(dtype, np.str_):
    return 0 if aggregation == "count" else None
elif pd.api.types.is_bool_dtype(dtype):
    return np.nan if aggregation == "mean" else 0
Collaborator:

else False?


assert aggregation in ALL_AGGREGATIONS
if is_float_dtype(dtype):
    return 0 if aggregation == "count" else np.nan
Collaborator:

Doesn't sum default to 0 for floats as well?
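
For context, the two fragments quoted above appear to come from a single test helper mapping a (dtype, aggregation) pair to the expected backfill default. A hedged reconstruction (the function name, import locations, branch order, and the ALL_AGGREGATIONS set are assumptions):

import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype

ALL_AGGREGATIONS = {"sum", "mean", "min", "max", "first", "last", "count"}  # assumed set

def expected_default(dtype, aggregation):
    # Hypothetical reconstruction stitched together from the quoted fragments.
    assert aggregation in ALL_AGGREGATIONS
    if is_float_dtype(dtype):
        return 0 if aggregation == "count" else np.nan
    elif np.issubdtype(dtype, np.str_):
        return 0 if aggregation == "count" else None
    elif pd.api.types.is_bool_dtype(dtype):
        return np.nan if aggregation == "mean" else 0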


@pytest.mark.parametrize("dtype", [np.float32, np.float64, np.uint32, np.int32, np.int64, bool, "datetime64[ns]", str])
def test_bucket_spans_two_segments(self, lmdb_version_store_dynamic_schema_v1, dtype):
Collaborator:

Worth also testing the case where the first segment has column b and the second segment doesn't


@pytest.mark.parametrize("dtype", [np.float32, np.float64, np.uint32, np.int32, np.int64, bool, "datetime64[ns]", str])
def test_bucket_spans_two_segments(self, lmdb_version_store_dynamic_schema_v1, dtype):
Collaborator:

Also worth having a test where there are 3 row-slices in a bucket, with the first and last containing the aggregation column, but the middle one missing it
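
A sketch of the suggested test (the fixture name comes from the quoted signature; the data, bucket size, and expected values are illustrative assumptions, and the same shape covers the two-segment suggestion above):

import arcticdb as adb
import pandas as pd

def test_bucket_spans_three_slices_middle_missing_column(lmdb_version_store_dynamic_schema_v1):
    lib = lmdb_version_store_dynamic_schema_v1
    sym = "sym"
    # Three row-slices inside one 10-minute bucket; the middle slice lacks column "b".
    lib.write(sym, pd.DataFrame({"a": [1.0], "b": [1.0]}, index=pd.DatetimeIndex(["2025-01-01 00:00"])))
    lib.append(sym, pd.DataFrame({"a": [2.0]}, index=pd.DatetimeIndex(["2025-01-01 00:01"])))
    lib.append(sym, pd.DataFrame({"a": [3.0], "b": [3.0]}, index=pd.DatetimeIndex(["2025-01-01 00:02"])))
    q = adb.QueryBuilder()
    q = q.resample("10min").agg({"b": "sum"})
    result = lib.read(sym, query_builder=q).data
    # The slice missing "b" should simply be skipped: 1.0 + 3.0.
    assert result["b"].iloc[0] == 4.0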

Collaborator:

Are both branches of responsible_for_first_overlapping_bucket covered in testing?
