Skip to content

adapter: linearize timestamp off the coordinator loop for EXPLAIN TIMESTAMP and SUBSCRIBE#37355

Draft
aljoscha wants to merge 2 commits into
MaterializeInc:mainfrom
aljoscha:sql-425-linearized-oracle-read-off-loop
Draft

adapter: linearize timestamp off the coordinator loop for EXPLAIN TIMESTAMP and SUBSCRIBE#37355
aljoscha wants to merge 2 commits into
MaterializeInc:mainfrom
aljoscha:sql-425-linearized-oracle-read-off-loop

Conversation

@aljoscha

@aljoscha aljoscha commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Motivation

Since we keep hitting environmentd stalls, this is one of the proactive
fixes from the SQL-42x batch.

EXPLAIN TIMESTAMP and SUBSCRIBE performed their linearized
timestamp-oracle read (oracle.read_ts().await) inline on the
single-threaded coordinator loop
(explain_timestamp.rs and
subscribe.rs). The oracle backing store can be slow, and an inline await
parks the loop, so a slow oracle wedged every other session for the whole
round-trip. The peek path does not have this problem: it does the read in its
own LinearizeTimestamp stage that spawns the round-trip off the loop
(peek_linearize_timestamp).

The issue includes a reproducer (coord-oracle-latency) that routes only the
timestamp-oracle connection through toxiproxy and injects latency. With the
old code, k concurrent EXPLAIN TIMESTAMPs serialize on the loop and scale
~kx the single-call latency (measured: 1x ≈ 500ms, 8x ≈ 8.7s at 500ms
injected latency). A normal peek reads off the loop and batches.

What this does

Make EXPLAIN TIMESTAMP and SUBSCRIBE follow the existing peek pattern.
Each gains a LinearizeTimestamp stage that spawns the read_ts()
round-trip off the coordinator loop and threads the resulting
oracle_read_ts forward as plain data. The subsequent stages
(determine_timestamp / sequence_peek_timestamp and the read-hold
bookkeeping) stay on the loop and consume that value unchanged, so timestamp
selection is identical, just no longer blocking the loop.

The "should we linearize, and against which oracle" decision that was
duplicated across the inline oracle_read_ts helper and
peek_linearize_timestamp is factored into a single
Coordinator::linearized_read_ts_oracle helper that returns the oracle to
read from, if any. All three spawn sites (peek, subscribe, explain timestamp)
now share it, so this removes duplication rather than adding a fourth copy.

Correctness

This is timing-neutral with respect to strict serializability: the
linearized read still happens during the query's real-time interval (after
arrival, before response), against the same backing oracle. It is the same
call the peek path already makes off the loop. No caching or shared-atomic
shortcut is introduced (see the rejected optimizations in
doc/developer/guide-adapter.md). Read holds are still acquired on the loop
in the stage that runs after the spawned read returns, matching peek. The new
spawned reads are cancelable via the existing cancel_enabled/handle_spawn
wiring, also matching peek.

Tests

No new tests in this PR. Correctness of timestamp selection for EXPLAIN TIMESTAMP and SUBSCRIBE is covered by the existing functional suites. The
issue's coord-oracle-latency reproducer is a manual latency diagnostic (its
assertion is written to detect the bug, i.e. on-loop serialization); I'd
like to discuss whether to add an inverted-assertion version as a benchmark,
since latency assertions in a multi-container mzcompose tend to be flaky in
CI.

Fixes SQL-425


Local verification (reproducer)

Ran the issue's setup locally: environmentd with catalog/persist direct on CockroachDB and the timestamp oracle routed through toxiproxy with 500ms injected latency. Built main and this branch from the same tree and compared.

EXPLAIN TIMESTAMP (the issue's clean signal): clear improvement.

metric (8 concurrent, 500ms oracle latency) main this branch
wall time 4079ms (~7.9x of 1x) 1390ms (~2.7x)
read_ts backing batches for 8 reads n/a 8 ops -> 2 batches (reads now overlap/batch)
victim SELECT 1 latency while 8 sessions hammer it ~26 s ~3.5 s

The "victim" row is the direct test of the headline symptom ("every other session stalls"): a constant SELECT 1 (no oracle read) on a separate connection. On main the coordinator loop is wedged by the inline oracle reads and the victim stalls ~26s; with the fix the read is off-loop and the victim recovers to ~3.5s.

Scope note / separate issue. The reproducer also shows that statements which actually read data (SELECT peeks, and SUBSCRIBE) still stall the coordinator under a slow oracle through a different on-loop cost (frontier advancement / dataflow lifecycle), independent of the linearized read this PR moves off-loop:

  • SELECT peek victim stall ~8.5s (this path was already off-loop for the read and is unchanged by this PR).
  • SUBSCRIBE victim stall ~24.7s on both main and this branch.

This matches the issue's own note that "peeks have their own on-loop oracle costs too." This PR fixes the specific defect it names (the inline linearized read for EXPLAIN TIMESTAMP and SUBSCRIBE) and is fully effective for EXPLAIN TIMESTAMP. The broader "data reads stall the coordinator under a slow oracle" cost affects peeks too and should be tracked as a separate follow-up rather than folded in here.

…ESTAMP and SUBSCRIBE

EXPLAIN TIMESTAMP and SUBSCRIBE performed their linearized timestamp-oracle
read (`oracle.read_ts().await`) inline on the single-threaded coordinator
loop. The oracle backing store can be slow, and an inline await parks the
loop, so a slow oracle wedged every other session for the whole round-trip.
The peek path already avoids this: it does the read in its own
`LinearizeTimestamp` stage that spawns the round-trip off the loop.

Make EXPLAIN TIMESTAMP and SUBSCRIBE follow the same pattern. Each gains a
`LinearizeTimestamp` stage that spawns the `read_ts()` round-trip and threads
the resulting `oracle_read_ts` forward as plain data. The subsequent stages
(`determine_timestamp` / `sequence_peek_timestamp` and the read-hold
bookkeeping they do) stay on the loop and consume that value unchanged, so
timestamp selection is identical, just no longer blocking.

This is timing-neutral with respect to strict serializability: the linearized
read still happens during the query's real-time interval (after arrival,
before response), against the same backing oracle. It is the same call the
peek path already makes off the loop. No caching or shared-atomic shortcut is
introduced.

The "should we linearize, and against which oracle" decision that was
duplicated across the inline `oracle_read_ts` helper and
`peek_linearize_timestamp` is factored into a single
`Coordinator::linearized_read_ts_oracle` helper that returns the oracle to
read from, if any. All three spawn sites share it.

Fixes SQL-425
@aljoscha aljoscha added the ci-nightly PR CI control: also trigger Nightly label Jun 30, 2026
@aljoscha aljoscha marked this pull request as ready for review June 30, 2026 11:20
@aljoscha aljoscha requested a review from a team as a code owner June 30, 2026 11:20
@aljoscha aljoscha requested a review from def- June 30, 2026 11:20
@aljoscha aljoscha marked this pull request as draft June 30, 2026 11:30
When a SUBSCRIBE starts or stops we record it in the mz_subscriptions
introspection table. We did this with a synchronous group commit
(`execute`/`blocking`) on the coordinator loop, which acquires a write
timestamp from the timestamp oracle inline. When the oracle backend is
slow, that round trip blocks the loop and stalls every other session,
once per SUBSCRIBE start and once per stop.

Instead, defer both the insert and the retraction to a (batched) group
commit and wait for the write to become durable off the loop, in a
spawned task, before delivering the client-visible response.

On start, `implement_subscribe` hands the write-notify future back to its
callers, which await it before sending the SUBSCRIBE response. So
mz_subscriptions stays synchronously consistent: another session sees the
row immediately after the response.

On stop, `remove_active_compute_sink` returns the retraction's notify,
`drop_compute_sinks` threads it through, and `retire_compute_sinks`
awaits it before calling `sink.retire`. So a client that observes a
subscribe's retirement (a cancellation or dependency-dropped error) does
not then read a stale row. The controller collections are still dropped
on the loop. For sinks that write no introspection row (internal
subscribes, COPY TO) the notify is already resolved.

The cluster test reads mz_subscriptions from a separate session whose
read timestamp can lag the commit, so adapt it to poll.

`add_active_compute_sink` no longer commits inline, so it is no longer
async. Also remove the now-unused `BuiltinTableAppend::blocking`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-nightly PR CI control: also trigger Nightly

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant