Conversation
// Time grain of the rollup. If unspecified, defaults to the base metrics view's smallest_time_grain during validation.
TimeGrain time_grain = 5;
Reviewer:
What is the reasoning for this default? I'm wondering if it would be safer to make it required. Worried about three problems:
- If it is misconfigured, you will get broken/empty results (fails silently instead of with a clear error)
- If not explicitly set, smallest_time_grain defaults to seconds, which is unlikely to be used for rollups; smallest_time_grain is supposed to indicate the granularity of the raw table, but won't the rollup tables usually have a different granularity than the raw table?
- We can add this default later if needed, but we cannot easily make it required later. So it might be safer to make it required now.
Author:
It's required and does not default to anything; the comment is out of date. See this in parse_metrics_view.go:
if rollup.TimeGrain == "" {
return fmt.Errorf(`rollup[%d]: "time_grain" is required`, i)
}
Will fix the comment.
priority int
rt *runtime.Runtime
instanceID string
metricsViewName string
Reviewer:
The reason we haven't passed the name in before is because executor is supposed to be isolated from the catalog/resolver calls. This has two advantages:
- It keeps logic clean/flattened, avoiding e.g. resolvers calling an executor that calls a resolver
- It enables the executor to safely be used for specs that are not valid in the catalog, which currently the metrics view reconciler uses to validate new specs
I see this is being used to call a resolver for time ranges. I understand that in practice it works since it avoids circular code paths, but given how complex this package is, I think we should try to stick to the above principles about the package being isolated to avoid circular code.
We had a similar problem earlier for effectively resolving time expressions. At that time, we introduced the BindQuery function, which solved the problem in a clean way:
- If the query is not bound, the executor directly fetches the timestamps (which ensures correctness, but doesn't have any caching)
- The caller can optionally bind the query using cached timestamps to speed things up. See an example here – it works out quite cleanly in practice:
Could we do something similar here? Or maybe just extend BindQuery to bind a full set of timestamps for all the base/rollup tables?
Author:
OK, let me have a look. I thought that since we are already passing the mv spec, passing its name should be fine, but I get your point.
whereDims := collectWhereDimensions(qry.Where)

// Determine whether the query has a non-zero time range
hasTimeRange := qry.TimeRange != nil && (!qry.TimeRange.Start.IsZero() || !qry.TimeRange.End.IsZero())
Reviewer:
There are other fields to check. Consider using qry.TimeRange.IsZero()
Author:
At this point the query time range will already be resolved, since e.RewriteQueryTimeRanges is called before this; the other fields will be unset and only Start and End will be set. And we actually need it to be resolved, because we need the query start and end for the further checks, so this explicitly checks only those fields, and falls back to the base table unless the rollup table has the same coverage as the base.
But if you want, I can change it to use qry.TimeRange.IsZero() and add a comment that this rewrite should only be called after RewriteQueryTimeRanges, or add explicit checks for it.
eligible, reason, err := rollupEligible(rollup, qry, queryGrain, whereDims, e.metricsView.TimeDimension, e.metricsView.FirstDayOfWeek)
if err != nil {
    candidateSpan.End()
    return nil, err
Reviewer:
If an error happens during a span, I think you'd expect it to be added to the span. Otherwise it looks like the error came after the span, which is confusing.
This applies in many of the cases below
if hasTimeRange {
    // Clamp query range to the base table's actual data range.
    // This ensures a rollup isn't rejected when the query extends beyond both the base table and rollup.
    effectiveStart := qry.TimeRange.Start
    if !effectiveStart.IsZero() && baseMin.After(effectiveStart) {
        effectiveStart = baseMin
    }
    effectiveEnd := qry.TimeRange.End
    if !effectiveEnd.IsZero() && baseMax.Before(effectiveEnd) {
        effectiveEnd = baseMax
    }
Reviewer:
What if the qry contains a Rill time expression? Would have expected it to resolve the time range to a fixed start/end using the normal code paths for that.
Author:
Yes, this rewrite is called after e.RewriteQueryTimeRanges, so only Start and End will be set. See the related comment above.
// Clear rollups to prevent recursion
synth.Rollups = nil

return synth
Reviewer:
Shouldn't it also consume the rollup's dimension/measure selectors to filter out the dimensions/measures that are not in the rollup? Otherwise, the synthetic spec contains invalid dimensions/measures that would fail if queried.
Author:
Those things are already checked in the rollupEligible method, which is called well before this, so only rollups satisfying the selection criteria (which include the dimension/measure checks) will be present here; that keeps the creation of this spec simple. What do you think?
| args["database_schema"] = databaseSchema | ||
| } | ||
|
|
||
| res, _, err := e.rt.Resolve(ctx, &runtime.ResolveOptions{ |
Reviewer:
My earlier comment applies to the resolver call here.
I'm not sure, but I was wondering if we should solve this problem by refactoring Executor.Timestamps to query the min/max/watermark timestamps for the base and all the rollup tables in the metrics view, and return them in one go? It would keep things simple and integrate nicely with the existing caching facilitated by BindQuery (externally of the package) and e.timestamps (internally in the package). It would also enable us to return time ranges for different grains to the UI, which it may need in the future if we want to optimize rollup usage better.
The downside would be it runs some queries eagerly, which might seem like it could have performance implications, but I wonder if it's okay given a) the output will be cached, b) several of the time ranges are anyway needed for the subsequent rollup checks.
Just an idea, let me know what you think.
Author:
Yeah, there were a couple of things:
- If a rollup fails the basic eligibility test, then there is no need to fetch timestamps for it.
- Also, the executor will have the user's security policy, so the timestamps may not reflect the actual data boundaries, which are needed for the further eligibility checks.
- Directly calling e.timestamps only populates the executor-level cache, which is request-specific and cannot be reused by subsequent requests; that's why creating a resolver can lead to usage of cached timestamps until cache_timestamps_ttl_seconds.
const defaultTimestampsCacheTTL = 5 * time.Minute

func init() {
    runtime.RegisterResolverInitializer("metrics_timestamps", newMetricsTimestampsResolver)
Reviewer:
We already have a metrics_time_range resolver, which sounds almost the same as this. Would be nice if we could build on that. (Relates to my comment above also.)
Author:
I have put a comment on this resolver explaining how it's different from metrics_time_range. Let me paste it here:
// This has two main differences from the metrics_time_range resolver:
// 1. It bypasses row-level security because timestamps reflect physical data boundaries, not user-visible data.
// Access is restricted to internal callers and admins.
// 2. It implements cache_timestamps_ttl_seconds based caching for efficiency purposes.
So essentially no security-based row policies are used here, since this is used for checking rollup coverage and eligibility; and second, it adds an additional cache that we use to prevent flooding the OLAP engine with timestamp queries, especially for streaming models, which have caching disabled by default. Its cache key also relies on the metrics_cache_key resolver, so it can utilize cache_key_ttl and cache_key_sql, but it adds additional logic on top of that to avoid re-querying more often than cache_timestamps_ttl_seconds.
Because of these reasons I created a separate resolver, as adding all this logic to metrics_time_range would make the logic complicated. What do you think?