
[PLAT-440] roll up routing in metrics view#9180

Open
pjain1 wants to merge 19 commits into main from rollup_mv

Conversation


@pjain1 pjain1 commented Apr 3, 2026

  • Add rollup table config to metrics view proto and YAML parser
  • Implement query routing: eligible rollups are selected based on grain derivability, dimension/measure coverage, timezone match, time range alignment, and time coverage
  • Prefer coarsest grain among eligible rollups; break ties by smallest data range
  • For no-time-range queries ("all data"), verify the rollup covers the base table's full range rather than skipping coverage checks

Checklist:

  • Covered by tests
  • Ran it and it works as intended
  • Reviewed the diff before requesting a review
  • Checked for unhandled edge cases
  • Linked the issues it closes
  • Checked if the docs need to be updated. If so, create a separate Linear DOCS issue
  • Intend to cherry-pick into the release branch
  • I'm proud of this work!

@pjain1 pjain1 closed this Apr 3, 2026
@pjain1 pjain1 reopened this Apr 3, 2026
@pjain1 pjain1 changed the title roll up routing in metrics view [PLAT-440] roll up routing in metrics view Apr 5, 2026
@pjain1 pjain1 requested a review from begelundmuller April 6, 2026 05:32
@pjain1 pjain1 requested a review from begelundmuller April 13, 2026 16:00
Comment on lines +309 to +310
// Time grain of the rollup. If unspecified, defaults to the base metrics view's smallest_time_grain during validation.
TimeGrain time_grain = 5;
Contributor

What is the reasoning for this default? I'm wondering if it would be safer to make it required. I'm worried about a few problems:

  • If it is misconfigured, you will get broken/empty results (it fails silently instead of failing with a clear error)
  • If not explicitly set, smallest_time_grain defaults to seconds, which is unlikely to be used for rollups
  • smallest_time_grain is supposed to indicate the granularity of the raw table; but won't the rollup tables usually have a different granularity than the raw table?
  • We can add this default later if needed, but we cannot easily make it required later. So might be safer to make it required now.

Member Author

@pjain1 pjain1 Apr 14, 2026


It's required and does not default to anything; the comment is out of date. See this in parse_metrics_view.go:

  if rollup.TimeGrain == "" {
  	return fmt.Errorf(`rollup[%d]: "time_grain" is required`, i)
  }

Will fix the comment.

priority int
rt *runtime.Runtime
instanceID string
metricsViewName string
Contributor

The reason we haven't passed the name in before is because executor is supposed to be isolated from the catalog/resolver calls. This has two advantages:

  • It keeps logic clean/flattened, avoiding e.g. resolvers calling an executor that calls a resolver
  • It enables the executor to safely be used for specs that are not valid in the catalog, which currently the metrics view reconciler uses to validate new specs

I see this is being used to call a resolver for time ranges. I understand that in practice it works since it avoids circular code paths, but given how complex this package is, I think we should try to stick to the above principles about the package being isolated to avoid circular code.

We had a similar problem earlier for effectively resolving time expressions. At that time, we introduced the BindQuery function, which solved the problem in a clean way:

  • If the query is not bound, the executor directly fetches the timestamps (which ensures correctness, but doesn't have any caching)
  • The caller can optionally bind the query using cached timestamps to speed things up. See an example here – it works out quite cleanly in practice:
    e, err := executor.New(ctx, rt, instanceID, q.MetricsViewName, mv.ValidSpec, mv.Streaming, security, opts.Priority, userAttrs)

Could we do something similar here? Or maybe just extend BindQuery to bind a full set of timestamps for all the base/rollup tables?
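
For illustration, the general shape of that pattern (hypothetical types; not Rill's actual BindQuery signature) is: the executor fetches timestamps itself unless the caller has bound cached ones, which keeps the executor isolated from resolvers.

```go
package main

import "fmt"

// query is a hypothetical trimmed-down query; the caller may bind cached
// timestamps onto it before execution.
type query struct {
	boundMin, boundMax int64 // cached timestamps, only valid if bound
	bound              bool
}

// Bind attaches externally cached timestamps (the fast path).
func (q *query) Bind(min, max int64) {
	q.boundMin, q.boundMax = min, max
	q.bound = true
}

type executor struct {
	fetches int // counts direct timestamp queries against the OLAP engine
}

// timeRange uses bound timestamps when available; otherwise it falls back to
// fetching them directly, which is always correct but uncached.
func (e *executor) timeRange(q *query) (int64, int64) {
	if q.bound {
		return q.boundMin, q.boundMax
	}
	e.fetches++ // slow but correct path
	return 0, 100
}

func main() {
	e := &executor{}
	q := &query{}
	q.Bind(10, 90)
	min, max := e.timeRange(q)
	fmt.Println(min, max, e.fetches) // bound query: no direct fetch needed
}
```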

Member Author

OK, let me have a look. I thought that since we are already passing the mv spec, passing its name should be fine too, but I get your point.

whereDims := collectWhereDimensions(qry.Where)

// Determine whether the query has a non-zero time range
hasTimeRange := qry.TimeRange != nil && (!qry.TimeRange.Start.IsZero() || !qry.TimeRange.End.IsZero())
Contributor

There are other fields to check. Consider using qry.TimeRange.IsZero()

Member Author

At this point the query time range will already be resolved, since e.RewriteQueryTimeRanges is called before this, so the other fields will be unset and only start and end will be set. We actually need it resolved, because the query start and end are required for the further checks; this explicitly checks only those and falls back to the base table unless a rollup table has the same coverage as the base.

But if you want, I can change it to use qry.TimeRange.IsZero() and add a comment that this rewrite should only be called after RewriteQueryTimeRanges, or add explicit checks for it.

Comment on lines +153 to +156
eligible, reason, err := rollupEligible(rollup, qry, queryGrain, whereDims, e.metricsView.TimeDimension, e.metricsView.FirstDayOfWeek)
if err != nil {
candidateSpan.End()
return nil, err
Contributor

If an error happens during a span, I think you'd expect it to be added to the span. Otherwise it looks like the error came after the span, which is confusing.

This applies in many of the cases below

Member Author

OK, will fix.

Comment on lines +192 to +202
if hasTimeRange {
// Clamp query range to the base table's actual data range.
// This ensures a rollup isn't rejected when the query extends beyond both the base table and rollup.
effectiveStart := qry.TimeRange.Start
if !effectiveStart.IsZero() && baseMin.After(effectiveStart) {
effectiveStart = baseMin
}
effectiveEnd := qry.TimeRange.End
if !effectiveEnd.IsZero() && baseMax.Before(effectiveEnd) {
effectiveEnd = baseMax
}
Contributor

What if the qry contains a Rill time expression? Would have expected it to resolve the time range to a fixed start/end using the normal code paths for that.

Member Author

Yes, this rewrite is called after e.RewriteQueryTimeRanges, so only start and end will be set. See the related comment above.

Comment on lines +384 to +387
// Clear rollups to prevent recursion
synth.Rollups = nil

return synth
Contributor

Shouldn't it also consume the rollup's dimension/measure selectors to filter out the dimensions/measures that are not in the rollup? Otherwise, the synthetic spec contains invalid dimensions/measures that would fail if queried.

Member Author

Those things are already checked in the rollupEligible method, which is called well before this, so only rollups satisfying the selection criteria (which include dimension/measure checks) will be present here; that keeps it simple when creating this spec. What do you think?

args["database_schema"] = databaseSchema
}

res, _, err := e.rt.Resolve(ctx, &runtime.ResolveOptions{
Contributor

My earlier comment applies to the resolver call here.

I'm not sure, but I was wondering if we should solve this problem by refactoring Executor.Timestamps to query the min/max/watermark timestamps for the base and all the rollup tables in the metrics view, and return them in one go? It would keep things simple and integrate nicely with the existing caching facilitated by BindQuery (external to the package) and e.timestamps (internal to the package). It would also enable us to return time ranges for different grains to the UI, which it may need in the future if we want to optimize rollup usage better.

The downside would be it runs some queries eagerly, which might seem like it could have performance implications, but I wonder if it's okay given a) the output will be cached, b) several of the time ranges are anyway needed for the subsequent rollup checks.

Just an idea, let me know what you think.

Member Author

Yeah, there were a couple of things:

  1. If a rollup fails the basic eligibility test, then there is no need to fetch timestamps for it.
  2. Also, the executor will have the user's security policy, so the timestamps may not reflect the actual data boundaries, which are needed for the further eligibility checks.
  3. Directly calling e.timestamps only populates the executor-level cache, which is request-specific and cannot be used for subsequent requests; that's why creating a resolver allows cached timestamps to be reused until cache_timestamps_ttl_seconds.

const defaultTimestampsCacheTTL = 5 * time.Minute

func init() {
runtime.RegisterResolverInitializer("metrics_timestamps", newMetricsTimestampsResolver)
Contributor

We already have a metrics_time_range resolver, which sounds almost the same as this. It would be nice if we could build on that. (This relates to my comment above also.)

Member Author

I have put a comment on this resolver explaining how it's different from metrics_time_range. Let me paste it here:

// This has two main differences from the metrics_time_range resolver:
//  1. It bypasses row-level security because timestamps reflect physical data boundaries, not user-visible data.
//     Access is restricted to internal callers and admins.
//  2. It implements cache_timestamps_ttl_seconds based caching for efficiency purposes.

So essentially, no security-based row policies are used here, since this is used for checking rollup coverage and eligibility; second, it adds an additional cache that we use to prevent flooding the OLAP engine with timestamp queries, especially for streaming models, which have caching disabled by default. Its cache key also relies on the metrics_cache_key resolver, so it can utilize cache_key_ttl and cache_key_sql, but it adds additional logic on top to avoid querying more often than cache_timestamps_ttl_seconds allows.

For these reasons I created a separate resolver, since adding all this logic to metrics_time_range would complicate it. What do you think?
