Resampling with dynamic schema #2217

Open · wants to merge 2 commits into master

Conversation

vasil-pashov (Collaborator)

Reference Issues/PRs

This is a rework of #2201

What does this implement or fix?

This enables using the QueryBuilder to resample symbols in libraries where dynamic schema is enabled. This requires handling two cases that differ from static schema (a usage sketch follows the list below).

  1. Type propagation. It turns out this works without any functional changes to the code.
  2. Missing columns. Dynamic schema allows segments to contain only a subset of all columns in the symbol. This case needed additional care.
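
A minimal usage sketch of the feature described above (hedged: the URI, library name, symbol, and data are illustrative; QueryBuilder resample/agg and dynamic_schema are the only pieces this PR concerns):

import arcticdb as adb
import pandas as pd

ac = adb.Arctic("lmdb://resample_demo")  # illustrative URI
lib = ac.get_library("demo", create_if_missing=True,
                     library_options=adb.LibraryOptions(dynamic_schema=True))

# Two row-slices with different column subsets, which dynamic schema permits.
lib.write("sym", pd.DataFrame({"a": [1.0, 2.0, 3.0]},
                              index=pd.date_range("2025-01-01", periods=3, freq="min")))
lib.append("sym", pd.DataFrame({"a": [4.0, 5.0, 6.0], "b": [1, 2, 3]},
                               index=pd.date_range("2025-01-01 00:03", periods=3, freq="min")))

# Resample into 2-minute buckets; "b" is absent from the first row-slice.
q = adb.QueryBuilder()
q = q.resample("2min").agg({"a": "sum", "b": "mean"})
print(lib.read("sym", query_builder=q).data)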

Handling missing columns
The aggregate method takes as input an array of std::optional columns; an empty optional marks a missing column. The first step of the aggregation is to determine the common type (which handles type propagation): generate_common_input_type skips missing columns and computes the common type across all available columns. The result is a std::optional type. If that optional is empty, every column was skipped (because all were missing), and at that point the aggregate function can terminate, returning an empty std::optional for the resulting column. The resampling clause ignores empty optional columns returned by aggregate. A sketch of this flow follows.
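
A minimal Python sketch of the control flow described above (the real implementation is C++; the function names mirror the description, while the dtype promotion and the aggregation body are illustrative stand-ins):

from typing import Optional, Sequence
import numpy as np

def generate_common_input_type(columns: Sequence[Optional[np.ndarray]]) -> Optional[np.dtype]:
    # Skip missing (None) columns and promote the rest to a common dtype.
    present = [col.dtype for col in columns if col is not None]
    return np.result_type(*present) if present else None

def aggregate(columns: Sequence[Optional[np.ndarray]]) -> Optional[np.ndarray]:
    common_type = generate_common_input_type(columns)
    if common_type is None:
        # All input columns were missing; the resampling clause will
        # ignore this empty result for the output column.
        return None
    # Placeholder for the real bucket aggregation, performed at common_type.
    return np.concatenate([col.astype(common_type) for col in columns if col is not None])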

Missing behavior
The current implementation has flaws related to using date_range:

  1. If a date_range is passed to the QueryBuilder and no segment within that range contains some column (which is otherwise present in the dataframe), we return either an empty dataframe (if no other columns are being aggregated) or a dataframe not containing that column. More sensible behavior would be to backfill the missing buckets with a default value based on the dtype and the aggregator (illustrated in the sketch after this list).
  2. Type propagation takes into account only the segments used in the aggregation and the dtypes from the segment descriptors. This means that different date ranges can result in dataframes whose columns have different dtypes. More sensible behavior would be to use the timeseries descriptor.
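
An illustration of flaw 1, continuing the usage sketch above (the date range and data are illustrative):

import datetime

# Column "b" only exists in rows from 00:03 onwards (see the append above).
q = adb.QueryBuilder()
q = q.resample("1min").agg({"b": "sum"})
# Restricting the read to the first two minutes selects no segment containing "b",
# so the current implementation returns an empty dataframe instead of buckets
# backfilled with the aggregator's default value.
res = lib.read("sym", query_builder=q,
               date_range=(datetime.datetime(2025, 1, 1, 0, 0),
                           datetime.datetime(2025, 1, 1, 0, 2))).data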

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

vasil-pashov added the minor (Feature change, should increase minor version) label on Mar 5, 2025
Collaborator:

Comment on line 666 is now outdated, and reserve on line 657 is less useful

Collaborator:

Comment on line 52 is now out of date. Worth adding a comment about why no else is required on this if

elif np.issubdtype(dtype, np.str_):
    return 0 if aggregation == "count" else None
elif pd.api.types.is_bool_dtype(dtype):
    return np.nan if aggregation == "mean" else 0
Collaborator:

else False?


assert aggregation in ALL_AGGREGATIONS
if is_float_dtype(dtype):
    return 0 if aggregation == "count" else np.nan
Collaborator:

Doesn't sum default to 0 for floats as well?
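
For context, the two fragments quoted above appear to come from a single test helper mapping a (dtype, aggregation) pair to the expected backfill default. A hedged reconstruction (the function name, import locations, branch order, and the ALL_AGGREGATIONS set are assumptions):

import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype

ALL_AGGREGATIONS = {"sum", "mean", "min", "max", "first", "last", "count"}  # assumed set

def expected_default(dtype, aggregation):
    # Hypothetical reconstruction stitched together from the quoted fragments.
    assert aggregation in ALL_AGGREGATIONS
    if is_float_dtype(dtype):
        return 0 if aggregation == "count" else np.nan
    elif np.issubdtype(dtype, np.str_):
        return 0 if aggregation == "count" else None
    elif pd.api.types.is_bool_dtype(dtype):
        return np.nan if aggregation == "mean" else 0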


@pytest.mark.parametrize("dtype", [np.float32, np.float64, np.uint32, np.int32, np.int64, bool, "datetime64[ns]", str])
def test_bucket_spans_two_segments(self, lmdb_version_store_dynamic_schema_v1, dtype):
Collaborator:

Worth also testing the case where the first segment has column b and the second segment doesn't


@pytest.mark.parametrize("dtype", [np.float32, np.float64, np.uint32, np.int32, np.int64, bool, "datetime64[ns]", str])
def test_bucket_spans_two_segments(self, lmdb_version_store_dynamic_schema_v1, dtype):
Collaborator:

Also worth having a test where there are 3 row-slices in a bucket, with the first and last containing the aggregation column, but the middle one missing it
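
A sketch of the suggested test (the fixture name comes from the quoted signature; the data, bucket size, and expected values are illustrative assumptions, and the same shape covers the two-segment suggestion above):

import arcticdb as adb
import pandas as pd

def test_bucket_spans_three_slices_middle_missing_column(lmdb_version_store_dynamic_schema_v1):
    lib = lmdb_version_store_dynamic_schema_v1
    sym = "sym"
    # Three row-slices inside one 10-minute bucket; the middle slice lacks column "b".
    lib.write(sym, pd.DataFrame({"a": [1.0], "b": [1.0]}, index=pd.DatetimeIndex(["2025-01-01 00:00"])))
    lib.append(sym, pd.DataFrame({"a": [2.0]}, index=pd.DatetimeIndex(["2025-01-01 00:01"])))
    lib.append(sym, pd.DataFrame({"a": [3.0], "b": [3.0]}, index=pd.DatetimeIndex(["2025-01-01 00:02"])))
    q = adb.QueryBuilder()
    q = q.resample("10min").agg({"b": "sum"})
    result = lib.read(sym, query_builder=q).data
    # The slice missing "b" should simply be skipped: 1.0 + 3.0.
    assert result["b"].iloc[0] == 4.0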

Collaborator:

Are both branches of responsible_for_first_overlapping_bucket covered in testing?
