fix(charts): cap group-by time-series to top-N series to prevent OOM#2429
Conversation
High-cardinality group-by time charts could pull hundreds of thousands of series into a single tile (only ~60 were ever drawn), zero-filling a dense buckets×series matrix in memory and OOMing the browser tab. Add an opt-in query-level `seriesLimit`: when set on a group-by + granularity chart, renderChartConfig emits a TopGroups CTE that keeps only the top-N series by max value in any bucket and restricts the outer query to those groups. The time-chart display path sets seriesLimit=60 (matching HARD_LINES_LIMIT); alerts and other renderChartConfig consumers leave it unset, so their series evaluation is unchanged.
🦋 Changeset detectedLatest commit: 14e9921 The changes in this PR will be included in the next version bump. This PR includes changesets to release 4 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
🔴 Tier 4 — CriticalTouches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD. Why this tier:
Review process: Deep review from a domain expert. Synchronous walkthrough may be required. Stats
|
Greptile SummaryThis PR caps group-by time-series queries to a configurable top-N series to prevent browser OOM on high-cardinality group-bys. It introduces a
Confidence Score: 5/5Safe to merge — the CTE is opt-in, the alert evaluation path is unchanged, and the implementation is backed by both unit and integration tests covering alias stripping, NULL exclusion, and multi-column group-bys. The core CTE logic is well-guarded and thoroughly tested. The only finding is that the chart editor SQL preview calls packages/app/src/components/DBEditTimeChartForm/utils.ts — the one call site not updated to pass the team series limit. Important Files Changed
Sequence DiagramsequenceDiagram
participant UI as DBTimeChart / SearchTotalCountChart
participant API as api.useMe()
participant CU as convertToTimeChartConfig
participant RC as renderChartConfig
participant CH as ClickHouse
UI->>API: fetch team settings
API-->>UI: "{ seriesLimit: N (or undefined) }"
UI->>CU: convertToTimeChartConfig(config, N)
Note over CU: seriesLimit = max(1, N ?? 100)
CU-->>UI: config + seriesLimit
UI->>RC: renderChartConfig(config+seriesLimit)
Note over RC: renderSeriesLimitCte() emits<br/>__hdx_series_limit CTE<br/>(top-N by max per bucket)
RC-->>UI: SQL with WITH __hdx_series_limit AS (...)
UI->>CH: execute SQL
CH-->>UI: "<= N series"
Reviews (12): Last reviewed commit: "Merge branch 'main' into sao-paulo" | Re-trigger Greptile |
E2E Test Results✅ All tests passed • 199 passed • 3 skipped • 1330s
Tests ran across 4 shards in parallel. |
Add a `seriesLimit` team setting alongside the existing ClickHouse client settings so teams can tune the top-N series cap that bounds time-chart memory. `convertToTimeChartConfig` now reads `me.team.seriesLimit`, falling back to the default (60) when unset and floored at 1. No upper bound — teams may intentionally fetch more series than the 60 rendered at once (the surplus is available in the series selector). The setting flows through the existing team-settings pipeline (Zod schema, Mongoose model, /me, PATCH /team/clickhouse-settings, ClickhouseSettingForm); no controller or endpoint changes were needed.
karl-power
left a comment
There was a problem hiding this comment.
Just missing a changeset
…Config Replace the `baseWithClauses` rename and two nested ternaries with a plain `let withClauses`/`let where` plus a single `if (seriesCap)` fold. Behaviour is unchanged — the ranking CTE is still appended to the already-rendered WITH clause (rather than chartConfig.with, which would disable the main SELECT's materialized-column optimization). Byte-identical SQL; all snapshots unchanged.
A multi-column string group-by (e.g. "LogAttributes['cap'],ServiceName") made the series-cap ranking CTE emit an invalid two-argument toString() and a malformed NULL check, because renderSelectList returns a comma-joined string as a single expression. Split it into per-column expressions with splitAndTrimWithBracket (which respects []/()/quotes) so the NULL/empty filter applies per column. Array and single-column-string group-bys are unchanged (snapshots unchanged); covered by a renderChartConfig unit test and a real-CH integration test mirroring the failing Map-access case.
…es cap The series-cap ranking CTE previously dropped both NULL and empty-string group values. Empty-string groups (e.g. a missing Map key like LogAttributes['x'] -> '') are real data, so silently hiding them was surprising. Now only NULL components are excluded — which is the genuine technical need, since the outer `tuple(...) IN (...)` is NULL-unsafe (transform_null_in=0) and would otherwise waste a top-N slot on a group it can't match. Empty-string groups now compete for a slot like any other value.
If a group-by item carried an alias, renderSelectList appended ` AS "alias"`, which then leaked into tuple(...) and `(... IS NOT NULL)` in the series-cap CTE — both invalid SQL there (the outer GROUP BY tolerates aliases, these positions do not). Strip the alias before building the tuple and null filter, matching how the rank value is already rendered. No alias is set on group-bys today, so this is a defensive fix; covered by a new unit test.
…rray, metric gating) Add integration tests against real ClickHouse for the corner cases the recent series-cap fixes target but unit tests can't fully prove: - NULL group component excluded (NULL-unsafe `tuple() IN (...)` would otherwise waste the only top-N slot and yield an empty chart); - multi-column *array* group-by with an alias (proves the 2-col tuple()/IN executes and the alias is stripped inside the CTE yet preserved in output). Add a unit test that a metric source does not emit the series-cap CTE (gating).
Bumps DEFAULT_SERIES_LIMIT 60 -> 100. Since MAX_TIME_CHART_SERIES and HARD_LINES_LIMIT derive from it, this raises both the default query-time cap and the rendered-line cap (and ServicesDashboard's MAX_NUM_SERIES) to 100. Teams can still override the query cap via the seriesLimit setting.
b672cf2 to
7d363be
Compare
… tooltip Drop the explanatory comments added across the series-cap changes (keeping only the ones inside renderSeriesLimitCte), and simplify the "Time Chart Series Limit" tooltip to "Maximum number of series fetched per time chart."
|
Want your agent to iterate on Greptile's feedback? Try greploops. |
Summary
High-cardinality group-by time charts could pull hundreds of thousands of series into a single tile — only ~60 were ever drawn (
HARD_LINES_LIMIT), but every series was still fetched, JSON-parsed, and zero-filled into a dense buckets×series matrix in memory, OOMing the browser tab. This adds an opt-in query-levelseriesLimit: when set on a group-by + granularity chart,renderChartConfigemits a CTE (__hdx_series_limit) that keeps only the top-N series ranked by max value in any bucket and restricts the outer query to those groups (mirroring the existingaggFn=increaseTopGroups path). The time-chart display path (convertToTimeChartConfig) sets this cap, while alerts and otherrenderChartConfigconsumers leave it unset so their series evaluation is unchanged.The cap is also configurable per team: a new
seriesLimitteam setting (alongside the existing ClickHouse client settings) lets teams tune it, defaulting to 100 when unset and floored at 1 (no upper bound, so teams can intentionally fetch more series). The default and the rendered-line cap (HARD_LINES_LIMIT) move together; a team value only governs how many series are fetched, with any surplus available in the series selector. Verified end-to-end against real ClickHouse (50 series → top 5) plus unit tests; all existing SQL snapshots are unchanged.How to test on Vercel preview
Preview routes: /team
Steps:
References