catalog: console cluster utilization indexes #37348

Open
leedqin wants to merge 3 commits into MaterializeInc:main from leedqin:console-cluster-utilization-indexes

@leedqin leedqin commented Jun 29, 2026

Contributor

Motivation

Adds maintained, cluster_id-indexed views on mz_catalog_server that back the Console's cluster-detail replica-utilization charts. A per-cluster read becomes a point lookup on the index. Previously it was an ad-hoc query that recomputed the whole fleet's rollup and filtered to the viewed cluster only at the end, so its cost scaled with deployment size on the single shared catalog-server replica.

Fixes: CNS-104

Views

mz_console_cluster_utilization_overview: 14 days, 1-hour buckets.
mz_console_cluster_utilization_overview_24h: 24 hours, 5-minute buckets.
mz_console_cluster_utilization_overview_3h: last 3 hours, un-binned.
The two binned views share one SQL body. The 3h view is un-binned: it returns raw per-(replica, sample) rows and the Console bins client-side, instead of a server-side five-pass top-k. That keeps the live-tier arrangement small and lets the Console SUBSCRIBE to it. replica_history is deduped to one row per replica so the (replica_id, occurred_at) key the SUBSCRIBE upserts on stays unique even if a replica ever had two sizes in history.
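
As an illustration of what "the Console bins client-side" could mean, here is a minimal Rust sketch of binning raw samples into fixed-width buckets and keeping the highest-CPU sample per (replica, bucket). The `Sample` struct, its field names, and `bin_max_cpu` are hypothetical; the Console's actual implementation and the view's real columns are not shown in this PR description.

```rust
use std::collections::BTreeMap;

/// One raw utilization sample, roughly as an un-binned view might return it.
/// Field names are illustrative assumptions, not the view's actual columns.
#[derive(Debug, Clone, PartialEq)]
pub struct Sample {
    pub replica_id: String,
    pub occurred_at_secs: u64, // seconds since the Unix epoch
    pub cpu_percent: f64,
}

/// Groups samples into fixed-width time buckets per replica and keeps the
/// sample with the highest CPU in each bucket (an arg-max top-1).
pub fn bin_max_cpu(samples: &[Sample], bucket_secs: u64) -> BTreeMap<(String, u64), Sample> {
    assert!(bucket_secs > 0, "bucket width must be positive");
    let mut bins: BTreeMap<(String, u64), Sample> = BTreeMap::new();
    for s in samples {
        // Round the timestamp down to the start of its bucket.
        let bucket_start = s.occurred_at_secs - s.occurred_at_secs % bucket_secs;
        let key = (s.replica_id.clone(), bucket_start);
        // Replace the stored sample only if this one has strictly higher CPU.
        let replace = match bins.get(&key) {
            Some(best) => s.cpu_percent > best.cpu_percent,
            None => true,
        };
        if replace {
            bins.insert(key, s.clone());
        }
    }
    bins
}
```

Because the raw rows are keyed by (replica_id, occurred_at), a client can re-run a function like this over its current upserted state whenever the SUBSCRIBE delivers an update.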

The per-timeframe views are cherry-picked from #37307.

Tests

Regenerated catalog_server_explain.slt, mz_catalog_server_index_accounting.slt, autogenerated/mz_internal.slt, oid.slt, and information_schema_tables.slt, and updated catalog.td, explain-analyze.td, and indexes.td. sqllogictest and the CI system-parameter defaults force-enable the gated index so the catalog-server snapshots cover it even though it ships off.

@leedqin leedqin requested a review from jubrad June 29, 2026 18:37
@leedqin leedqin requested review from a team and ggevay as code owners June 29, 2026 18:37
@leedqin leedqin force-pushed the console-cluster-utilization-indexes branch 3 times, most recently from 8affbd4 to e4e8020 Compare June 30, 2026 14:10
.to_string(),
)
.or_insert_with(|| "true".to_string());

Contributor

For turning on a feature flag in CI (e.g. in SLT), we usually add it to get_minimal_system_parameters and get_variable_system_parameters. Is there a reason for making this one a special case by turning it on here in sqllogictest.rs?

(Btw. I'm thinking I'll create a "feature flags" skill, because my Claude also often gets confused about how feature flags work.)
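
For readers unfamiliar with the pattern in the quoted diff: `entry(..).or_insert_with(..)` on a standard-library map writes a value only when the key is absent, so an explicitly configured value wins. A minimal sketch, assuming the defaults are held in a `BTreeMap<String, String>` (the real map type in sqllogictest.rs is not visible here, and `force_enable_default` is a hypothetical helper name):

```rust
use std::collections::BTreeMap;

/// Sets `flag` to "true" in the defaults map unless a value is already present.
/// An existing entry, including an explicit "false", is left untouched.
pub fn force_enable_default(defaults: &mut BTreeMap<String, String>, flag: &str) {
    defaults
        .entry(flag.to_string())
        .or_insert_with(|| "true".to_string());
}
```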

Comment thread src/adapter-types/src/dyncfgs.rs Outdated
"enable_console_cluster_utilization_24h_index",
false,
"Create the 24h console cluster utilization overview index on mz_catalog_server.",
);

Contributor

Would be good to mention in the doc comment that this takes effect only at an envd restart.

@ggevay ggevay Jun 30, 2026

Contributor

Also, I think we don't have a precedent for feature flagging the existence of a builtin object. My Claude tells me this should work, but to be sure, could you please add a test that checks what happens when this flag is turned on and off and envd is restarted?

ggevay commented Jun 30, 2026

Contributor

To fix the upgrade test failure, this will need a migration for mz_indexes. Claude:

0dt Upgrade Smoke Test failure — missing mz_indexes migration step

The 0dt Upgrade Smoke Test failure is deterministic and caused by this PR (the ReadyToPromote timeout / DB-64 match is just the downstream symptom). The new generation panics at boot in update_fingerprints:

thread 'main' panicked at builtin_schema_migration.rs:1043:
fingerprint mismatch for builtin MaterializedView(mz_indexes ...)

Root cause: mz_indexes is a BuiltinMaterializedView whose SQL is generated by make_mz_indexes, which inlines the entire builtin-index set as a literal VALUES (...) list. Adding the mz_console_cluster_utilization_overview_3h_ind / ..._24h_ind builtins changes that VALUES list, hence mz_indexes's SQL text, hence its fingerprint. Because mz_indexes is a materialized view (durable persist state, not ephemeral like a plain view/index), a fingerprint change requires an explicit migration step. On a release→dev upgrade, plan_migration only selects steps with version > source_version, finds none for mz_indexes, and update_fingerprints panics. It only surfaces in upgrade tests because a fresh boot has no old persisted fingerprint to mismatch against — which is why the unit tests passed.
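
The root cause can be illustrated with a small sketch: if a definition inlines a list of names as literal VALUES, adding a name changes the SQL text and therefore anything derived from that text. Everything below is a stand-in: `render_view_sql` only loosely imitates what the comment attributes to make_mz_indexes, and `fingerprint` uses the standard library's `DefaultHasher`, not the fingerprint function the catalog actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Builds a definition that inlines index names as a literal VALUES list.
/// Hypothetical shape; the real generated SQL is not shown in this thread.
pub fn render_view_sql(index_names: &[&str]) -> String {
    let rows: Vec<String> = index_names.iter().map(|n| format!("('{n}')")).collect();
    format!("SELECT * FROM (VALUES {}) AS t (name)", rows.join(", "))
}

/// Stand-in fingerprint: a hash of the SQL text.
pub fn fingerprint(sql: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    sql.hash(&mut hasher);
    hasher.finish()
}
```

Rendering the definition with one extra index name yields different SQL text, so a persisted fingerprint from the older generation would no longer match.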

This is exactly the issue #37137 fixed for mz_cluster_replica_size_internal_ind (SQL-398); CI flagged it as a potential regression of that closed issue.

Fix: add a replacement step for mz_indexes at the current dev version in MIGRATION_STEPS (src/adapter/src/catalog/open/builtin_schema_migration.rs):

// Required because the console cluster-utilization 3h/24h builtin indexes
// were added. make_mz_indexes inlines the builtin-index set as VALUES, so any
// add/remove changes mz_indexes's SQL fingerprint and requires an explicit
// replacement step.
MigrationStep::replacement(
    "26.32.0-dev.0",
    CatalogItemType::MaterializedView,
    MZ_CATALOG_SCHEMA,
    "mz_indexes",
),

26.32.0-dev.0 matches the current version in src/environmentd/Cargo.toml and the existing mz_comments step at that version. Only mz_indexes is affected — the new views aren't inlined into any non-ephemeral MV, so they need no step.
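
A minimal sketch of the selection rule described above (steps apply only when their version is greater than the source version), showing why a missing step for a changed object surfaces as a mismatch. The `Step` type, `select_steps`, and `check_fingerprints` are hypothetical simplifications, and versions are reduced to (major, minor, patch) tuples with pre-release tags such as -dev.0 ignored; this is not the code in builtin_schema_migration.rs.

```rust
/// (major, minor, patch); pre-release tags are ignored in this sketch.
pub type Version = (u64, u64, u64);

/// A simplified migration step: "replace `object` when upgrading past `version`".
pub struct Step {
    pub version: Version,
    pub object: &'static str,
}

/// Keeps only the steps whose version is strictly greater than the source version.
/// Tuples compare lexicographically, which matches major/minor/patch ordering.
pub fn select_steps(steps: &[Step], source: Version) -> Vec<&Step> {
    steps.iter().filter(|s| s.version > source).collect()
}

/// Every object whose definition changed must be covered by a selected step;
/// otherwise the upgrade reports a mismatch (the real code panics instead).
pub fn check_fingerprints(changed_objects: &[&str], selected: &[&Step]) -> Result<(), String> {
    for obj in changed_objects {
        if !selected.iter().any(|s| s.object == *obj) {
            return Err(format!("fingerprint mismatch for builtin {obj}"));
        }
    }
    Ok(())
}
```

With only an mz_comments step selected, a changed mz_indexes is reported as a mismatch; adding an mz_indexes step at a version above the source makes the check pass.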

@leedqin leedqin force-pushed the console-cluster-utilization-indexes branch from e4e8020 to c4b186d Compare June 30, 2026 15:32
@leedqin leedqin requested a review from def- June 30, 2026 15:47
@leedqin leedqin force-pushed the console-cluster-utilization-indexes branch from c4b186d to 6fe9248 Compare June 30, 2026 21:02
jubrad and others added 3 commits June 30, 2026 17:11
The Console's cluster-detail page polls a replica-utilization rollup that,
for every timeframe except "Last 14 days", was recomputed ad-hoc on each
request: it builds the whole-fleet rollup (per-replica metrics aggregation,
five per-bucket arg-max top-1s, a multi-way join) and only filters to the one
selected cluster at the very end. That recompute is CPU-bound on
mz_catalog_server, so it serializes under concurrent users and its cost grows
linearly with the number of clusters in the deployment.

This adds two new indexed views alongside the existing 14-day overview so the
Console can read a maintained, per-cluster indexed lookup for every timeframe
it offers instead of recomputing:

  - mz_console_cluster_utilization_overview_3h   (1-minute buckets, 3h window)
  - mz_console_cluster_utilization_overview_24h  (5-minute buckets, 24h window)
  - mz_console_cluster_utilization_overview       (now 1-hour buckets, 14d;
    previously 8-hour buckets)

All three share one SQL body (console_cluster_utilization_overview_sql) and
output relation so they stay in sync, and each is indexed on cluster_id in
mz_catalog_server. The view body is also cleaned up to drop a redundant
re-join of replica_history that the binning step never needed.

The two new views read only a 3h/24h window of metrics, so they are cheap to
maintain (~10MB and ~510MB of arrangements, ~1-10s hydration at 100 replicas).
The 14-day view dominates cost (input-bound on 14 days of 1-minute samples);
the 8h->1h change roughly doubles its arrangement memory but leaves hydration
unchanged.

The Console gates use of these views on the environment version, falling back
to the ad-hoc query on older environments.

Updates the catalog snapshot tests (oid, information_schema_tables,
mz_catalog_server_index_accounting, catalog_server_explain, autogenerated
mz_internal) and the catalog/indexes/explain-analyze testdrive files to cover
the new objects.

Adds ENABLE_CONSOLE_CLUSTER_UTILIZATION_RECENT_INDEX dyncfg + a bootstrap filter that skips gated builtin indexes when off (read from txn and system_parameter_defaults). Indexes are ephemeral, so flag-off drops the arrangement; default-on preserves current behavior.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Short cluster-detail windows (<=3h) now read an un-binned base view that the
Console bins client-side, replacing the server-side five-pass top-k. This drops
the maintained binned arrangement for that tier and lets the Console SUBSCRIBE
to the live window.

- Repoint mz_console_cluster_utilization_overview_3h to a new un-binned base
  (one row per replica metric sample, no date_bin or top-k). replica_history is
  deduped to one row per replica so the Console SUBSCRIBE upsert key
  (replica_id, occurred_at) stays unique even if a replica changed size.
- Gate the 24h overview index behind enable_console_cluster_utilization_24h_index
  (default off) so a resource-constrained or self-managed environment does not
  build its maintained arrangement or stall an upgrade at hydration. The
  bootstrap gate uses the flag's effective value and falls back to the compiled
  default. Cloud enables it via system parameter.
- sqllogictest enables the gated index in its test defaults so the catalog-server
  snapshots still cover it.
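
The bootstrap gate described in the bullets above can be sketched roughly as follows. This assumes a precedence of persisted value, then system parameter defaults, then the compiled default; that ordering, the string-valued maps, and the function names are illustrative assumptions rather than the adapter's actual code.

```rust
use std::collections::BTreeMap;

pub const FLAG: &str = "enable_console_cluster_utilization_24h_index";
/// Compiled default: the index ships off.
pub const COMPILED_DEFAULT: bool = false;

/// Resolves the flag at bootstrap: a persisted value wins, then a system
/// parameter default, then the compiled default. Only "true"/"false" parse;
/// an unparsable value falls back to the compiled default in this sketch.
pub fn effective_flag(
    persisted: &BTreeMap<String, String>,
    param_defaults: &BTreeMap<String, String>,
) -> bool {
    persisted
        .get(FLAG)
        .or_else(|| param_defaults.get(FLAG))
        .and_then(|raw| raw.parse::<bool>().ok())
        .unwrap_or(COMPILED_DEFAULT)
}

/// Filters builtin indexes given as (name, gated) pairs: ungated indexes are
/// always created, gated ones only when the flag is on.
pub fn indexes_to_create<'a>(all: &[(&'a str, bool)], flag_on: bool) -> Vec<&'a str> {
    all.iter()
        .filter(|entry| !entry.1 || flag_on)
        .map(|entry| entry.0)
        .collect()
}
```

Since the gate is evaluated at bootstrap, a change to the flag would only alter the set of created indexes on the next restart, which is the behavior the requested on/off restart test would exercise.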

Regenerates catalog_server_explain.slt and mz_catalog_server_index_accounting.slt.
@leedqin leedqin force-pushed the console-cluster-utilization-indexes branch from 6fe9248 to f8a5cf7 Compare June 30, 2026 21:28