feat: emit parent_query_id to link nested SPI queries#95
Open
JoshDreamland wants to merge 10 commits into
Open
feat: emit parent_query_id to link nested SPI queries#95JoshDreamland wants to merge 10 commits into
JoshDreamland wants to merge 10 commits into
Conversation
Each event now carries the queryid of its calling query (0 for top-level). Lets aggregations filter to top-level work with WHERE parent_query_id = 0, avoiding the double-counting that happens when plpgsql functions issue SPI statements that themselves emit events. The Event payload's prior `bool top_level` is replaced in-place by `uint64 parent_query_id`; static asserts in ring_entry.h continue to verify layout equivalence with the on-wire ring slot. Single statics for rusage_start, query_start_ts, and current_query_is_top_level are consolidated into a fixed-size query_stack[16] indexed by nesting_level. The previous single rusage_start was clobbered on nested SPI getrusage() calls, so nested CPU deltas were measuring just the inner portion and the outer's CPU was wrong; per-frame baselines fix that alongside parent linkage. The PG_TRY/nesting_level pattern in ExecutorRun/Finish/ProcessUtility is untouched. Schema gains a parent_query_id UInt64 column; migrations/001_add_parent_query_id.sql covers existing deployments. This supersedes #61 (which mixed the feature with broader refactor churn — vector→array→array+counter iterations, helper extraction, depth-cap retunes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> EOF )
Contributor
There was a problem hiding this comment.
Pull request overview
This PR adds parent_query_id to emitted events so nested SPI queries can be linked to their caller and excluded from “top-level only” aggregations, while also fixing nested CPU timing by switching from single baselines to a per-nesting frame stack.
Changes:
- Replaced
top_levelwithparent_query_idin the in-memory event/ring-entry payload. - Introduced a fixed-size
query_stackto track per-nesting queryid, CPU baseline, and start timestamp. - Exported and persisted
parent_query_idvia ClickHouse schema changes plus a migration.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/queue/ring_entry.h | Replaces top_level with parent_query_id in the ring buffer entry while preserving the block-copy prefix layout checks. |
| src/queue/event.h | Replaces top_level with parent_query_id in the public event struct. |
| src/hooks/hooks.c | Implements per-nesting query frames and populates parent_query_id during executor/utility/log event emission. |
| src/export/stats_exporter.cc | Adds a parent_query_id column to exported event rows. |
| docker/init/00-schema.sql | Adds parent_query_id column to the canonical ClickHouse schema. |
| migrations/001_add_parent_query_id.sql | Adds a migration for adding parent_query_id to existing ClickHouse tables. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+11
to
+12
| ALTER TABLE pg_stat_ch.events_raw | ||
| ADD COLUMN IF NOT EXISTS parent_query_id UInt64 DEFAULT 0; |
Comment on lines
888
to
892
| static void CaptureLogEvent(ErrorData* edata) { | ||
| PschEvent event; | ||
| InitBaseEvent(&event, GetCurrentTimestamp(), (nesting_level == 0), PSCH_CMD_UNKNOWN); | ||
| InitBaseEvent(&event, GetCurrentTimestamp(), GetParentQueryId(), PSCH_CMD_UNKNOWN); | ||
|
|
||
| UnpackSqlState(edata->sqlerrcode, event.err_sqlstate); |
clang-format wants the column-aligned trailing comments above parent_query_id to share its (wider) alignment column. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot review on #95: 1. parent_query_id was UInt64 while query_id is Int64; comparisons/joins between the two would require explicit casts in ClickHouse. Switch the ClickHouse column, migration, and exporter to Int64. Keep the in-memory struct field as uint64 (matches PG's queryId type) and cast at append time the same way query_id already does. 2. At nesting_level >= PSCH_MAX_NESTING_DEPTH, the PschExecutorEnd frame pointer is NULL, leaving start_ts at 0 (the PG epoch). Any path that computed GetCurrentTimestamp() - start_ts would yield ~25 years of us. Fall back to GetCurrentTimestamp() so deltas are ~0 instead. In practice query_desc->totaltime supplies a real duration via instrumentation, so the fallback subtraction is only used when totaltime wasn't allocated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
028_parent_query_id.pl exercises the three semantic invariants by
inspecting actual exported rows in ClickHouse:
1. Top-level queries report parent_query_id = 0.
2. Nested SPI queries report parent_query_id = outer caller's query_id
(verified by self-join on the events_raw table).
3. A log/error event captured during nested execution reports
parent_query_id = outer caller (not the running query itself) and a
non-zero query_id (the running statement). This catches the
CaptureLogEvent off-by-one: nesting_level is bumped inside
ExecutorRun, so slot nesting_level - 1 holds the running query and
its caller lives at nesting_level - 2.
Filters key off distinctive table/function names rather than constants
or comments — those survive query normalization, where literals would
be replaced with $N placeholders.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The CaptureLogEvent off-by-one came from nesting_level moving in places
where the frame stack didn't — Run/Finish bumped/decremented it around
the body, while the frame slot was written in Start and only read in
End. Inside the body (where CaptureLogEvent fires) nesting_level was
one higher than it was at End, so the same "slot[nesting_level - 1]"
expression meant different things in each place.
Bind the two together: nesting_level moves exactly where a frame is
pushed or popped, and parent_query_id is captured into the frame at
push time. Now slot[nesting_level - 1] is the currently-active query
at every emit site — End, ProcessUtility, CaptureLogEvent — and parent
is read directly from frame->parent_query_id without offset arithmetic.
Mechanics:
- PushQueryFrame in PschExecutorStart bumps nesting_level after
writing the slot. PopQueryFrame in PschExecutorEnd decrements.
- PschExecutorRun and PschExecutorFinish no longer touch nesting_level
on the success path; each just wraps its chain in PG_TRY/PG_CATCH
so a longjmp out of the body pops the frame on the unwind and
keeps depth balanced. Each PG_CATCH on the unwind path decrements
once per level — three deep nesting with an error at the bottom
cascades through three PG_CATCH blocks for three pops, exactly
matching three pushes. No subxact callback needed.
- PschProcessUtility brackets its body with PG_TRY/PG_FINALLY in the
same function, so push at the top and PopQueryFrame in PG_FINALLY
balance regardless of how the body exits. ExecuteUtilityWithNesting
is gone.
- CaptureLogEvent now reads TopQueryFrame() and uses both its queryid
(the running query, attached as event.queryid for attribution) and
its parent_query_id (the running query's caller, attached as
event.parent_query_id with strict semantics — previously this slot
was misread as the running query's own id).
PeekQueryStack and its offset parameter are gone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The totaltime allocation runs after standard_ExecutorStart returns, between our PG_END_TRY and the close of PschExecutorStart. InstrAlloc goes through MemoryContextAllocZero, which can ereport(ERROR) on OOM — that longjmp would unwind past our PG_END_TRY without firing the PG_CATCH, leaving the pushed frame unpopped. Move the allocation inside the existing PG_TRY block so the same PG_CATCH cleans up the frame. No behavior change on the happy path; new robustness against any future code added between the push and the function's return. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The push had been doing more conditional work than necessary to figure
out whether a parent existed. Convert nesting_level so that:
- -1 is the resting state (no active query),
- n in [0, PSCH_MAX_NESTING_DEPTH-1] is the slot index of the
currently-active frame on top of the stack,
- >= PSCH_MAX_NESTING_DEPTH is the overflow region.
Push now increments first and bails on overflow, so the cap check
collapses to a single comparison. The parent_query_id lookup still
needs one conditional ("do we have a parent at all"), but the previous
double-condition (positive AND within cap) is gone. TopQueryFrame /
PopQueryFrame benefit too: < 0 doubles as a nullity check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cleanups around the runaway-nesting case:
- PushQueryFrame now emits a WARNING when it returns NULL because the
cap was exceeded. Previously the overflow was silent — telemetry
just went missing for the frame. WARNING is loud enough to notice
if it happens in practice but doesn't fail the query.
- PschProcessUtility's start_ts fallback was still 0 (PG epoch) when
frame was NULL. Match PschExecutorEnd by falling back to
GetCurrentTimestamp() so the emitted event carries an approximately
correct ts_start rather than 1970.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3232e8e to
c632cdc
Compare
3 tasks
Earlier commits on this branch added parent_query_id to PschEvent and to the ClickHouse-native exporter, plus the ClickHouse schema migration, but arrow_batch.cc keeps its own hard-coded column list and was missed. The result: the production export path (OTel + Arrow IPC) silently dropped parent_query_id on every event, so downstream consumers saw 0 across the board. Caught via the OTel/Arrow quickstart-validate harness (see PR #96). Local script flips from 4/8 passing to all-green once this lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
serprex
approved these changes
May 14, 2026
31eadfe to
18a6eae
Compare
18a6eae to
85266f0
Compare
t/028 already exercises parent_query_id linkage through the ClickHouse-native exporter. Add t/029 to exercise the same invariants through the Arrow/OTel export path (the production pathway), using pg_stat_ch.debug_arrow_dump_dir to capture each Arrow IPC batch to disk before the gRPC send (which we point at a non-existent collector so it fails harmlessly). Same trick as t/026_arrow_dump. t/029 guards specifically against the earlier surgical-PR regression where arrow_batch.cc was missing parent_query_id entirely (the ClickHouse-native exporter had it, so t/028 alone would have let the gap through). Test 3 in both files uses RAISE WARNING inside a nested SPI call to exercise CaptureLogEvent's queryid/parent_query_id assignment. We cannot use a caught ERROR here: errfinish PG_RE_THROWs at elog.c:539 for ERROR-level events without calling EmitErrorReport, so emit_log_hook only fires later from PostgresMain's top-level catch (after all frames have been popped via PG_CATCH unwinding) — or never at all, for errors caught by a plpgsql EXCEPTION block. WARNING-level events go through EmitErrorReport directly, so our hook fires while the inner SPI's frame is still on the stack, which is the only scenario where CaptureLogEvent's slot choice is observable. The assertions distinguish "inner SPI queryid" from "outer caller queryid" so a regression that attributes the log event to the wrong slot (the off-by-one we're guarding against) would be caught. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
85266f0 to
f52d76c
Compare
Contributor
Author
|
I missed stuff, because Arrow is a different code path with different implications. I think that's addressed but I'm feeling less sure of myself. |
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Assigns a
parent_query_idto events to avoid polluting metrics for nested queries (when plpgsql functions issue SPI statements that emit events within events).Changes
PschEvent::top_level→PschEvent::parent_query_id(same slot in event/ring_entry; static asserts still hold).rusage_start,query_start_ts,current_query_is_top_level) consolidated into one fixed-sizequery_stack[16]of{queryid, rusage_start, query_start_ts}indexed bynesting_level. The existing PG_TRY/nesting_levelpattern in ExecutorRun/Finish/ProcessUtility is unchanged.RecordUInt64+Appendfor the new column instats_exporter.cc.parent_query_id UInt64column indocker/init/00-schema.sql;migrations/001_add_parent_query_id.sqlfor existing deployments.Bug fix
The prior single static
rusage_startwas overwritten by nestedgetrusage()calls, so the outer query's CPU delta only measured the post-SPI portion. Per-frame baselines fix that alongside parent linkage.Test plan
mise run test:regressmise run test:isolation