adapter: linearize timestamp off the coordinator loop for EXPLAIN TIMESTAMP and SUBSCRIBE by aljoscha · Pull Request #37355 · MaterializeInc/materialize

aljoscha · 2026-06-30T10:46:11Z

Motivation

Since we keep hitting environmentd stalls, this is one of the proactive
fixes from the SQL-42x batch.

EXPLAIN TIMESTAMP and SUBSCRIBE performed their linearized
timestamp-oracle read (oracle.read_ts().await) inline on the
single-threaded coordinator loop (explain_timestamp.rs and
subscribe.rs). The oracle backing store can be slow, and an inline await
parks the loop, so a slow oracle wedged every other session for the whole
round-trip. The peek path does not have this problem: it does the read in its
own LinearizeTimestamp stage that spawns the round-trip off the loop
(peek_linearize_timestamp).

The issue includes a reproducer (coord-oracle-latency) that routes only the
timestamp-oracle connection through toxiproxy and injects latency. With the
old code, k concurrent EXPLAIN TIMESTAMPs serialize on the loop and scale
~kx the single-call latency (measured: 1x ≈ 500ms, 8x ≈ 8.7s at 500ms
injected latency). A normal peek reads off the loop and batches.

What this does

Make EXPLAIN TIMESTAMP and SUBSCRIBE follow the existing peek pattern.
Each gains a LinearizeTimestamp stage that spawns the read_ts()
round-trip off the coordinator loop and threads the resulting
oracle_read_ts forward as plain data. The subsequent stages
(determine_timestamp / sequence_peek_timestamp and the read-hold
bookkeeping) stay on the loop and consume that value unchanged, so timestamp
selection is identical, just no longer blocking the loop.

The "should we linearize, and against which oracle" decision that was
duplicated across the inline oracle_read_ts helper and
peek_linearize_timestamp is factored into a single
Coordinator::linearized_read_ts_oracle helper that returns the oracle to
read from, if any. All three spawn sites (peek, subscribe, explain timestamp)
now share it, so this removes duplication rather than adding a fourth copy.

Correctness

This is timing-neutral with respect to strict serializability: the
linearized read still happens during the query's real-time interval (after
arrival, before response), against the same backing oracle. It is the same
call the peek path already makes off the loop. No caching or shared-atomic
shortcut is introduced (see the rejected optimizations in
doc/developer/guide-adapter.md). Read holds are still acquired on the loop
in the stage that runs after the spawned read returns, matching peek. The new
spawned reads are cancelable via the existing cancel_enabled/handle_spawn
wiring, also matching peek.

Tests

No new tests in this PR. Correctness of timestamp selection for EXPLAIN TIMESTAMP and SUBSCRIBE is covered by the existing functional suites. The
issue's coord-oracle-latency reproducer is a manual latency diagnostic (its
assertion is written to detect the bug, i.e. on-loop serialization); I'd
like to discuss whether to add an inverted-assertion version as a benchmark,
since latency assertions in a multi-container mzcompose tend to be flaky in
CI.

Fixes SQL-425

Local verification (reproducer)

Ran the issue's setup locally: environmentd with catalog/persist direct on CockroachDB and the timestamp oracle routed through toxiproxy with 500ms injected latency. Built main and this branch from the same tree and compared.

EXPLAIN TIMESTAMP (the issue's clean signal): clear improvement.

metric (8 concurrent, 500ms oracle latency)	main	this branch
wall time	4079ms (~7.9x of 1x)	1390ms (~2.7x)
`read_ts` backing batches for 8 reads	n/a	8 ops -> 2 batches (reads now overlap/batch)
victim `SELECT 1` latency while 8 sessions hammer it	~26 s	~3.5 s

The "victim" row is the direct test of the headline symptom ("every other session stalls"): a constant SELECT 1 (no oracle read) on a separate connection. On main the coordinator loop is wedged by the inline oracle reads and the victim stalls ~26s; with the fix the read is off-loop and the victim recovers to ~3.5s.

Scope note / separate issue. The reproducer also shows that statements which actually read data (SELECT peeks, and SUBSCRIBE) still stall the coordinator under a slow oracle through a different on-loop cost (frontier advancement / dataflow lifecycle), independent of the linearized read this PR moves off-loop:

SELECT peek victim stall ~8.5s (this path was already off-loop for the read and is unchanged by this PR).
SUBSCRIBE victim stall ~24.7s on both main and this branch.

This matches the issue's own note that "peeks have their own on-loop oracle costs too." This PR fixes the specific defect it names (the inline linearized read for EXPLAIN TIMESTAMP and SUBSCRIBE) and is fully effective for EXPLAIN TIMESTAMP. The broader "data reads stall the coordinator under a slow oracle" cost affects peeks too and should be tracked as a separate follow-up rather than folded in here.

…ESTAMP and SUBSCRIBE EXPLAIN TIMESTAMP and SUBSCRIBE performed their linearized timestamp-oracle read (`oracle.read_ts().await`) inline on the single-threaded coordinator loop. The oracle backing store can be slow, and an inline await parks the loop, so a slow oracle wedged every other session for the whole round-trip. The peek path already avoids this: it does the read in its own `LinearizeTimestamp` stage that spawns the round-trip off the loop. Make EXPLAIN TIMESTAMP and SUBSCRIBE follow the same pattern. Each gains a `LinearizeTimestamp` stage that spawns the `read_ts()` round-trip and threads the resulting `oracle_read_ts` forward as plain data. The subsequent stages (`determine_timestamp` / `sequence_peek_timestamp` and the read-hold bookkeeping they do) stay on the loop and consume that value unchanged, so timestamp selection is identical, just no longer blocking. This is timing-neutral with respect to strict serializability: the linearized read still happens during the query's real-time interval (after arrival, before response), against the same backing oracle. It is the same call the peek path already makes off the loop. No caching or shared-atomic shortcut is introduced. The "should we linearize, and against which oracle" decision that was duplicated across the inline `oracle_read_ts` helper and `peek_linearize_timestamp` is factored into a single `Coordinator::linearized_read_ts_oracle` helper that returns the oracle to read from, if any. All three spawn sites share it. Fixes SQL-425

When a SUBSCRIBE starts or stops we record it in the mz_subscriptions introspection table. We did this with a synchronous group commit (`execute`/`blocking`) on the coordinator loop, which acquires a write timestamp from the timestamp oracle inline. When the oracle backend is slow, that round trip blocks the loop and stalls every other session, once per SUBSCRIBE start and once per stop. Instead, defer both the insert and the retraction to a (batched) group commit and wait for the write to become durable off the loop, in a spawned task, before delivering the client-visible response. On start, `implement_subscribe` hands the write-notify future back to its callers, which await it before sending the SUBSCRIBE response. So mz_subscriptions stays synchronously consistent: another session sees the row immediately after the response. On stop, `remove_active_compute_sink` returns the retraction's notify, `drop_compute_sinks` threads it through, and `retire_compute_sinks` awaits it before calling `sink.retire`. So a client that observes a subscribe's retirement (a cancellation or dependency-dropped error) does not then read a stale row. The controller collections are still dropped on the loop. For sinks that write no introspection row (internal subscribes, COPY TO) the notify is already resolved. The cluster test reads mz_subscriptions from a separate session whose read timestamp can lag the commit, so adapt it to poll. `add_active_compute_sink` no longer commits inline, so it is no longer async. Also remove the now-unused `BuiltinTableAppend::blocking`.

aljoscha added the ci-nightly PR CI control: also trigger Nightly label Jun 30, 2026

aljoscha marked this pull request as ready for review June 30, 2026 11:20

aljoscha requested a review from a team as a code owner June 30, 2026 11:20

aljoscha requested a review from def- June 30, 2026 11:20

aljoscha marked this pull request as draft June 30, 2026 11:30

aljoscha mentioned this pull request Jun 30, 2026

adapter: take timestamp-oracle round trips off the coordinator loop (SQL-425, SQL-447) #37363

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

adapter: linearize timestamp off the coordinator loop for EXPLAIN TIMESTAMP and SUBSCRIBE#37355

adapter: linearize timestamp off the coordinator loop for EXPLAIN TIMESTAMP and SUBSCRIBE#37355
aljoscha wants to merge 2 commits into
MaterializeInc:mainfrom
aljoscha:sql-425-linearized-oracle-read-off-loop

aljoscha commented Jun 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

aljoscha commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

What this does

Correctness

Tests

Local verification (reproducer)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

aljoscha commented Jun 30, 2026 •

edited

Loading