catalog: console cluster utilization indexes #37348
8affbd4 to e4e8020
        .to_string(),
    )
    .or_insert_with(|| "true".to_string());
For turning on a feature flag in CI (e.g. in SLT), we usually add it to get_minimal_system_parameters and get_variable_system_parameters. Is there a reason for making this one a special case by turning it on here in sqllogictest.rs?
(Btw. I'm thinking I'll create a "feature flags" skill, because my Claude also often gets confused about how feature flags work.)
    "enable_console_cluster_utilization_24h_index",
    false,
    "Create the 24h console cluster utilization overview index on mz_catalog_server.",
);
Would be good to mention in the doc comment that this takes effect only at an envd restart.
Also, I think we don't have a precedent for feature-flagging the existence of a builtin object. My Claude tells me this should work, but to be sure, could you please add a test that checks what happens when this flag is turned on and off and envd is restarted?
To fix the upgrade test failure, this will need a migration for 0dt Upgrade Smoke Test failure — missing
e4e8020 to c4b186d
c4b186d to 6fe9248
The Console's cluster-detail page polls a replica-utilization rollup that,
for every timeframe except "Last 14 days", was recomputed ad-hoc on each
request: it builds the whole-fleet rollup (per-replica metrics aggregation,
five per-bucket arg-max top-1s, a multi-way join) and only filters to the one
selected cluster at the very end. That recompute is CPU-bound on
mz_catalog_server, so it serializes under concurrent users and its cost grows
linearly with the number of clusters in the deployment.
This adds two new indexed views alongside the existing 14-day overview so the
Console can read a maintained, per-cluster indexed lookup for every timeframe
it offers instead of recomputing:
- mz_console_cluster_utilization_overview_3h (1-minute buckets, 3h window)
- mz_console_cluster_utilization_overview_24h (5-minute buckets, 24h window)
- mz_console_cluster_utilization_overview (now 1-hour buckets, 14d;
previously 8-hour buckets)
All three share one SQL body (console_cluster_utilization_overview_sql) and
output relation so they stay in sync, and each is indexed on cluster_id in
mz_catalog_server. The view body is also cleaned up to drop a redundant
re-join of replica_history that the binning step never needed.
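The "one shared SQL body" idea can be illustrated with a minimal sketch. This is
not the actual console_cluster_utilization_overview_sql: the function name
overview_sql_sketch, the relation replica_metrics_sketch, and the column names
are all hypothetical, and only a single metric is aggregated. It only shows how
one template, instantiated with a bucket width and a window, keeps the output
relation identical across timeframes.

```rust
/// Hypothetical stand-in for a shared view body: one template, instantiated
/// once per timeframe. Relation and column names are illustrative only.
fn overview_sql_sketch(bucket: &str, window: &str) -> String {
    format!(
        "SELECT cluster_id, replica_id, \
         date_bin(INTERVAL '{bucket}', occurred_at, TIMESTAMP '1970-01-01') AS bucket_start, \
         max(cpu_percent) AS max_cpu_percent \
         FROM replica_metrics_sketch \
         WHERE mz_now() <= occurred_at + INTERVAL '{window}' \
         GROUP BY 1, 2, 3"
    )
}
```

Under these assumptions, the 24h view would be built from
overview_sql_sketch("5 minutes", "24 hours") and the 14-day view from
overview_sql_sketch("1 hour", "14 days"); the two strings differ only in the
interval literals.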
The two new views read only a 3h/24h window of metrics, so they are cheap to
maintain (~10MB and ~510MB of arrangements, ~1-10s hydration at 100 replicas).
The 14-day view dominates cost (input-bound on 14 days of 1-minute samples);
the 8h->1h change roughly doubles its arrangement memory but leaves hydration
unchanged.
The Console gates use of these views on the environment version, falling back
to the ad-hoc query on older environments.
Updates the catalog snapshot tests (oid, information_schema_tables,
mz_catalog_server_index_accounting, catalog_server_explain, autogenerated
mz_internal) and the catalog/indexes/explain-analyze testdrive files to cover
the new objects.
Adds ENABLE_CONSOLE_CLUSTER_UTILIZATION_RECENT_INDEX dyncfg + a bootstrap filter that skips gated builtin indexes when off (read from txn and system_parameter_defaults). Indexes are ephemeral, so flag-off drops the arrangement; default-on preserves current behavior.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Short cluster-detail windows (<=3h) now read an un-binned base view that the Console bins client-side, replacing the server-side five-pass top-k. This drops the maintained binned arrangement for that tier and lets the Console SUBSCRIBE to the live window.
- Repoint mz_console_cluster_utilization_overview_3h to a new un-binned base (one row per replica metric sample, no date_bin or top-k). replica_history is deduped to one row per replica so the Console SUBSCRIBE upsert key (replica_id, occurred_at) stays unique even if a replica changed size.
- Gate the 24h overview index behind enable_console_cluster_utilization_24h_index (default off) so a resource-constrained or self-managed environment does not build its maintained arrangement or stall an upgrade at hydration. The bootstrap gate uses the flag's effective value and falls back to the compiled default. Cloud enables it via system parameter.
- sqllogictest enables the gated index in its test defaults so the catalog-server snapshots still cover it. Regenerates catalog_server_explain.slt and mz_catalog_server_index_accounting.slt.
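The bootstrap gate described in these commits can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: the types, the helper names (effective_flag, indexes_to_create, BuiltinIndexSketch), the index names, and the lookup order (persisted value, then system parameter default, then compiled default) are hypothetical readings of the commit message.

```rust
use std::collections::BTreeMap;

/// Flag name taken from the commit message; the source says it ships off.
const FLAG: &str = "enable_console_cluster_utilization_24h_index";
const COMPILED_DEFAULT: bool = false;

/// Parse a boolean parameter value; unknown spellings yield None.
fn parse_bool(s: &str) -> Option<bool> {
    match s.trim().to_ascii_lowercase().as_str() {
        "true" | "on" => Some(true),
        "false" | "off" => Some(false),
        _ => None,
    }
}

/// Assumed precedence: a persisted value wins, then the configured system
/// parameter default, then the compiled-in default.
fn effective_flag(
    persisted: &BTreeMap<String, String>,
    parameter_defaults: &BTreeMap<String, String>,
) -> bool {
    [persisted, parameter_defaults]
        .iter()
        .find_map(|m| m.get(FLAG).and_then(|v| parse_bool(v)))
        .unwrap_or(COMPILED_DEFAULT)
}

/// Hypothetical description of a builtin index and whether the flag gates it.
struct BuiltinIndexSketch {
    name: &'static str,
    gated: bool,
}

/// Skip gated indexes when the flag is off; ungated ones are always kept.
fn indexes_to_create(all: &[BuiltinIndexSketch], flag_on: bool) -> Vec<&'static str> {
    all.iter()
        .filter(|ix| flag_on || !ix.gated)
        .map(|ix| ix.name)
        .collect()
}
```

Because the decision is made once at bootstrap in this sketch, a changed flag value would only be observed on the next restart, which matches the reviewer's note that the flag takes effect only at an envd restart.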
6fe9248 to f8a5cf7
Motivation
Adds maintained, cluster_id-indexed views on mz_catalog_server that back the Console's cluster-detail replica-utilization charts. A per-cluster read becomes a point lookup on the index instead of an ad-hoc query that recomputes the whole fleet's rollup and only filters to the viewed cluster at the end, so its cost no longer scales with deployment size on the single shared catalog-server replica.
Fixes: CNS-104
Views
- mz_console_cluster_utilization_overview: 14 days, 1-hour buckets.
- mz_console_cluster_utilization_overview_24h: 24 hours, 5-minute buckets.
- mz_console_cluster_utilization_overview_3h: last 3 hours, un-binned.
The two binned views share one SQL body. The 3h view is un-binned: it returns raw per-(replica, sample) rows and the Console bins client-side, instead of a server-side five-pass top-k. That keeps the live-tier arrangement small and lets the Console SUBSCRIBE to it. replica_history is deduped to one row per replica so the (replica_id, occurred_at) key the SUBSCRIBE upserts on stays unique even if a replica ever had two sizes in history.
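To make the client-side step concrete, here is a minimal sketch of binning raw samples and of upserting on the (replica_id, occurred_at) key. It is written in Rust for consistency with the rest of this PR and is not the Console's implementation; the Sample fields, the single cpu_percent metric, and the function names are assumptions for illustration.

```rust
use std::collections::BTreeMap;

/// One raw row, as the un-binned view might return it (field names assumed).
#[derive(Debug, Clone, PartialEq)]
struct Sample {
    replica_id: String,
    occurred_at_secs: i64, // seconds since the Unix epoch
    cpu_percent: f64,
}

/// Start of the fixed-width bucket that contains `ts`.
fn bucket_start(ts: i64, width_secs: i64) -> i64 {
    ts - ts.rem_euclid(width_secs)
}

/// Per bucket, keep the sample with the highest cpu_percent (arg-max top-1).
/// On ties the sample seen first is kept.
fn bin_by_max_cpu(samples: &[Sample], width_secs: i64) -> BTreeMap<i64, Sample> {
    let mut bins: BTreeMap<i64, Sample> = BTreeMap::new();
    for s in samples {
        bins.entry(bucket_start(s.occurred_at_secs, width_secs))
            .and_modify(|best| {
                if s.cpu_percent > best.cpu_percent {
                    *best = s.clone();
                }
            })
            .or_insert_with(|| s.clone());
    }
    bins
}

/// Apply one SUBSCRIBE-style update keyed on (replica_id, occurred_at).
/// A later row with the same key overwrites the earlier one.
fn upsert(state: &mut BTreeMap<(String, i64), Sample>, s: Sample) {
    state.insert((s.replica_id.clone(), s.occurred_at_secs), s);
}
```

The upsert helper shows why the dedup matters: if the view emitted two rows for the same replica and timestamp (one per historical size), a keyed upsert would silently keep only one of them.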
The per-timeframe views are cherry-picked from #37307.
Tests
Regenerated catalog_server_explain.slt, mz_catalog_server_index_accounting.slt, autogenerated/mz_internal.slt, oid.slt, and information_schema_tables.slt, and updated catalog.td, explain-analyze.td, and indexes.td. sqllogictest and the CI system-parameter defaults force-enable the gated index so the catalog-server snapshots cover it even though it ships off.