[codex] Add query progress reporting #18649

Open
xiangfu0 wants to merge 1 commit into apache:master from xiangfu0:codex/query-progress

xiangfu0 (Contributor) commented Jun 2, 2026

Summary

Adds query progress reporting for long-running Pinot queries across the broker, controller, server, V1 execution, V2 execution, query console, and Pinot CLI.

The progress model reports processed work units over total work units. V1 uses server segment progress; V2 estimates work from multi-stage operators and stage execution progress. The controller exposes progress by clientQueryId, the query console polls it while a query is running, and the CLI renders adaptive progress: SSE/simple responses stay one compact line, while MSE responses can render aggregate progress plus labeled component rows.
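
As a rough illustration of the processed-over-total model described above, here is a minimal Java sketch. `ProgressSketch` and its methods are hypothetical names, not the PR's `QueryProgressStats` API, and the clamping and zero-total behavior are assumptions made for the example.

```java
import java.util.List;

/** Hypothetical stand-in for a progress snapshot; not the PR's QueryProgressStats. */
final class ProgressSketch {
  final long processed;
  final long total;

  ProgressSketch(long processed, long total) {
    this.processed = processed;
    this.total = total;
  }

  /** Percent complete, clamped to [0, 100]; 0 when the total is unknown (<= 0). */
  double percent() {
    if (total <= 0) {
      return 0.0;
    }
    double pct = 100.0 * processed / total;
    return Math.max(0.0, Math.min(100.0, pct));
  }

  /** Sums numerators and denominators, so larger components weigh more. */
  static ProgressSketch aggregate(List<ProgressSketch> parts) {
    long processed = 0;
    long total = 0;
    for (ProgressSketch p : parts) {
      processed += p.processed;
      total += p.total;
    }
    return new ProgressSketch(processed, total);
  }
}
```

With 1 of 7 work units processed, `percent()` comes out at roughly 14.3. Summing raw counts rather than averaging percentages keeps a small, finished component from dominating the aggregate.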

User impact

  • Query console now shows numeric query progress while a query is in RUNNING state.
  • MSE progress responses can include labeled detail rows, and both Query Console and Pinot CLI render stacked progress bars when those details are present.
  • pinot-cli supports --progress-interval-ms and config key progress-interval-ms.
  • CLI progress is disabled with --progress-interval-ms=0 and is only rendered for interactive terminals, so redirected output/logs stay clean.
  • README includes usage notes and V1/V2 quickstart sample queries.
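
The CLI gating rule in the list above (no progress when the interval is 0, and none when output is not an interactive terminal) could look roughly like the following. This is a sketch only: `ProgressGate` is a hypothetical class, and the `System.console()` check is an assumed heuristic, not necessarily what `PinotCli` does. On newer JDKs `System.console()` may be non-null even when output is redirected, so the heuristic may need adjusting there.

```java
/** Hypothetical helper; the real PinotCli logic is not shown in this PR description. */
final class ProgressGate {
  /** Progress is drawn only when polling is enabled and the terminal is interactive. */
  static boolean shouldRender(long progressIntervalMs, boolean interactive) {
    return progressIntervalMs > 0 && interactive;
  }

  /** Assumed heuristic: on many JDKs System.console() is null when output is redirected. */
  static boolean looksInteractive() {
    return System.console() != null;
  }
}
```

Keeping the decision in a pure function of two inputs makes the "redirected output stays clean" behavior easy to unit test without a real terminal.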

Screenshot

Query console progress while a V2 quickstart query is running:

Query console progress bar showing 14.3% and 1/7 work units

Notes

The CLI injects a generated clientQueryId as a quoted query option so progress polling can correlate the client request with running query state.
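
A minimal sketch of what injecting a generated clientQueryId as a quoted query option might look like. The exact statement the CLI emits is not shown in this PR description, so the `SET clientQueryId='...'` prefix form and the `ClientQueryIdSketch` helper are assumptions for illustration.

```java
import java.util.UUID;

/** Hypothetical helper; the CLI's real injection code may differ. */
final class ClientQueryIdSketch {
  /** Generates a fresh id for correlating the request with progress polling. */
  static String newClientQueryId() {
    return UUID.randomUUID().toString();
  }

  /** Prepends a SET statement carrying the id as a single-quoted string literal. */
  static String withClientQueryId(String sql, String clientQueryId) {
    // Double any embedded single quote so the literal stays well-formed.
    String escaped = clientQueryId.replace("'", "''");
    return "SET clientQueryId='" + escaped + "'; " + sql;
  }
}
```

For example, `withClientQueryId("SELECT 1", "abc")` yields `SET clientQueryId='abc'; SELECT 1`, and the same id can then be used in the progress polling call.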

For MSE, progress keeps missing or non-responsive workers as labeled unknown rows instead of dropping them from the aggregate denominator. That prevents a partially reported query from falsely showing 100% complete.
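
To make the denominator argument concrete, here is a small sketch of aggregating per-worker progress while keeping silent workers in the total. `UnknownWorkerSketch` is hypothetical, and it assumes the work units assigned to each worker are known at dispatch time, which the PR description does not state explicitly.

```java
import java.util.Map;

/** Hypothetical aggregation; not the PR's actual MSE code. */
final class UnknownWorkerSketch {
  /**
   * @param assignedUnits work units assigned to each worker (assumed known at dispatch time)
   * @param reportedDone  processed units for the workers that answered the progress call
   * @return {processed, total}; silent workers add to the total but not to processed
   */
  static long[] aggregate(Map<String, Long> assignedUnits, Map<String, Long> reportedDone) {
    long processed = 0;
    long total = 0;
    for (Map.Entry<String, Long> e : assignedUnits.entrySet()) {
      total += e.getValue();
      Long done = reportedDone.get(e.getKey());
      if (done != null) {
        processed += Math.min(done, e.getValue());
      }
    }
    return new long[] {processed, total};
  }
}
```

With two workers assigned 4 units each and only one reporting 4 done, the result is 4/8 (50%). Dropping the silent worker would instead yield 4/4, the false 100% the note warns about.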

Validation

  • ./mvnw -pl pinot-controller,pinot-clients/pinot-cli -am -DskipTests -DskipITs -Dmaven.javadoc.skip=true compile
  • ./mvnw -pl pinot-broker -am -DskipTests -DskipITs -Dmaven.javadoc.skip=true compile
  • ./mvnw -pl pinot-spi -Dtest=QueryProgressStatsTest test
  • ./mvnw -pl pinot-query-runtime -am -Dtest=OpChainSchedulerServiceTest -Dsurefire.failIfNoSpecifiedTests=false test
  • ./mvnw -pl pinot-core -am -Dtest=InstanceRequestHandlerTest -Dsurefire.failIfNoSpecifiedTests=false test
  • ./mvnw -pl pinot-clients/pinot-cli -DskipTests -DskipITs -Dmaven.javadoc.skip=true package
  • spotless:apply, license:format, license:check, and checkstyle:check on affected modules
  • git diff --check
  • Local quickstart smoke test with query console and Pinot CLI progress query

Copilot AI (Contributor) left a comment

Pull request overview

Adds end-to-end query progress reporting for long-running Pinot queries, exposing a unified progress model (processed work units / total work units) across SSE (segment-based) and MSE (operator/stage-based) execution paths, and surfacing it via REST/gRPC, Query Console UI, and Pinot CLI.

Changes:

  • Introduces QueryProgressStats in pinot-spi, plus progress counters in QueryExecutionContext.
  • Implements progress tracking and retrieval across servers/brokers/controller (including new REST endpoints and a new gRPC Progress RPC for MSE).
  • Adds polling + rendering in Query Console and an interactive CLI progress line / progress bar.

Reviewed changes

Copilot reviewed 35 out of 35 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| pinot-spi/src/test/java/org/apache/pinot/spi/query/QueryProgressStatsTest.java | Adds unit tests for percent calculation, aggregation, JSON round-trip, and execution context accumulation. |
| pinot-spi/src/main/java/org/apache/pinot/spi/query/QueryProgressStats.java | New progress stats model with JSON support, aggregation, and derived percent. |
| pinot-spi/src/main/java/org/apache/pinot/spi/query/QueryExecutionContext.java | Adds atomic progress counters and APIs to mutate/read progress. |
| pinot-server/src/main/java/org/apache/pinot/server/api/resources/QueryResource.java | Adds server REST endpoint to fetch per-query progress and aggregate OFFLINE/REALTIME. |
| pinot-query-runtime/src/test/java/org/apache/pinot/query/runtime/executor/OpChainSchedulerServiceTest.java | Adds test coverage for progress tracking on completed op-chains. |
| pinot-query-runtime/src/main/java/org/apache/pinot/query/service/server/QueryServer.java | Adds gRPC Progress RPC handler for MSE worker progress. |
| pinot-query-runtime/src/main/java/org/apache/pinot/query/service/dispatch/QueryDispatcher.java | Adds broker-side dispatch logic to query MSE workers for progress and aggregate responses. |
| pinot-query-runtime/src/main/java/org/apache/pinot/query/service/dispatch/DispatchClient.java | Adds client call implementation for the new gRPC progress RPC. |
| pinot-query-runtime/src/main/java/org/apache/pinot/query/runtime/QueryRunner.java | Exposes execution-context tracking and progress retrieval via OpChainSchedulerService. |
| pinot-query-runtime/src/main/java/org/apache/pinot/query/runtime/plan/server/ServerPlanRequestUtils.java | Plumbs a shared QueryExecutionContext into leaf-stage ServerQueryRequests for progress attribution. |
| pinot-query-runtime/src/main/java/org/apache/pinot/query/runtime/executor/OpChainSchedulerService.java | Tracks execution contexts and increments processed work units on op-chain completion/failure. |
| pinot-core/src/test/java/org/apache/pinot/core/transport/InstanceRequestHandlerTest.java | Updates tests for renamed/cached execution-context retrieval API. |
| pinot-core/src/main/java/org/apache/pinot/core/transport/InstanceRequestHandler.java | Uses cached execution context and exposes server-side progress stats lookup. |
| pinot-core/src/main/java/org/apache/pinot/core/query/scheduler/QueryScheduler.java | Uses cached execution context when opening QueryThreadContext. |
| pinot-core/src/main/java/org/apache/pinot/core/query/request/ServerQueryRequest.java | Adds execution-context caching + setter to support shared context plumbing. |
| pinot-core/src/main/java/org/apache/pinot/core/query/executor/ServerQueryExecutorV1Impl.java | Adds total segment accounting to drive SSE progress denominators. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/combine/SortedGroupByCombineOperator.java | Marks segments as processed during combine execution to advance progress. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/combine/SequentialSortedGroupByCombineOperator.java | Marks segments as processed for sequential sorted group-by combine. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/combine/MinMaxValueBasedSelectionOrderByCombineOperator.java | Marks segments as processed (including skipped segments) for progress accuracy. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/combine/GroupByCombineOperator.java | Marks processed segments during group-by combine. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/combine/BaseSingleBlockCombineOperator.java | Marks segments as processed when producing results blocks. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/combine/BaseCombineOperator.java | Adds shared helper to increment processed-segment progress via thread context. |
| pinot-controller/src/main/resources/app/requests/index.ts | Adds Query Console API call for controller clientQueryId progress endpoint. |
| pinot-controller/src/main/resources/app/pages/Query.tsx | Adds clientQueryId injection, progress polling, and progress UI (numbers + bar). |
| pinot-controller/src/main/resources/app/Models.ts | Adds QueryProgressStats type to UI model definitions. |
| pinot-controller/src/main/java/org/apache/pinot/controller/api/resources/PinotRunningQueryResource.java | Adds controller REST endpoint to fetch progress by clientQueryId by polling brokers. |
| pinot-common/src/main/proto/worker.proto | Adds gRPC Progress RPC and request/response messages for MSE worker progress. |
| pinot-clients/pinot-cli/src/main/java/org/apache/pinot/cli/PinotCli.java | Adds CLI progress polling/rendering, config + flag, and clientQueryId injection. |
| pinot-clients/pinot-cli/README.md | Documents CLI/query-console progress behavior and usage examples. |
| pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/MultiStageBrokerRequestHandler.java | Tracks MSE execution contexts and aggregates broker+server progress for MSE queries. |
| pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/BrokerRequestHandlerDelegate.java | Routes broker progress requests to MSE handler first, then SSE handler. |
| pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/BrokerRequestHandler.java | Extends broker handler interface with getQueryProgressStats(...). |
| pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/BaseSingleStageBrokerRequestHandler.java | Implements SSE progress retrieval by polling servers’ new progress endpoint. |
| pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/BaseBrokerRequestHandler.java | Adds default getQueryProgressStats(...) method stub + precondition for clientQueryId mapping. |
| pinot-broker/src/main/java/org/apache/pinot/broker/api/resources/PinotClientRequest.java | Adds broker REST endpoint to fetch progress by internal requestId or clientQueryId. |

Comment thread pinot-controller/src/main/resources/app/pages/Query.tsx Outdated
xiangfu0 force-pushed the codex/query-progress branch 3 times, most recently from 6cb4cc3 to d601f43 on June 2, 2026 06:25

codecov-commenter commented Jun 2, 2026

Codecov Report

❌ Patch coverage is 37.26415% with 266 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.44%. Comparing base (edfbf69) to head (5cb38b9).
⚠️ Report is 31 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| .../pinot/query/service/dispatch/QueryDispatcher.java | 0.00% | 53 Missing ⚠️ |
| ...oller/api/resources/PinotRunningQueryResource.java | 0.00% | 47 Missing ⚠️ |
| ...sthandler/BaseSingleStageBrokerRequestHandler.java | 0.00% | 38 Missing ⚠️ |
| ...pinot/broker/api/resources/PinotClientRequest.java | 0.00% | 27 Missing ⚠️ |
| ...ache/pinot/server/api/resources/QueryResource.java | 0.00% | 26 Missing ⚠️ |
| ...apache/pinot/query/service/server/QueryServer.java | 16.66% | 25 Missing ⚠️ |
| ...requesthandler/MultiStageBrokerRequestHandler.java | 4.00% | 24 Missing ⚠️ |
| ...r/requesthandler/BrokerRequestHandlerDelegate.java | 0.00% | 6 Missing ⚠️ |
| ...e/pinot/core/transport/InstanceRequestHandler.java | 85.18% | 2 Missing and 2 partials ⚠️ |
| ...e/pinot/query/service/dispatch/DispatchClient.java | 0.00% | 4 Missing ⚠️ |
| ... and 5 more | | |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18649      +/-   ##
============================================
+ Coverage     64.39%   64.44%   +0.04%     
  Complexity     1291     1291              
============================================
  Files          3364     3372       +8     
  Lines        207935   208973    +1038     
  Branches      32467    32638     +171     
============================================
+ Hits         133906   134675     +769     
- Misses        63255    63499     +244     
- Partials      10774    10799      +25     
| Flag | Coverage Δ |
| --- | --- |
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-21 | 64.44% <37.26%> (+0.04%) ⬆️ |
| temurin | 64.44% <37.26%> (+0.04%) ⬆️ |
| unittests | 64.44% <37.26%> (+0.04%) ⬆️ |
| unittests1 | 56.92% <61.90%> (+0.11%) ⬆️ |
| unittests2 | 37.06% <7.07%> (-0.07%) ⬇️ |

xiangfu0 force-pushed the codex/query-progress branch from d601f43 to 1940943 on June 2, 2026 09:17
xiangfu0 marked this pull request as ready for review June 2, 2026 10:48

xiangfu0 (Contributor Author) left a comment

Found one high-signal issue; see inline comment.

xiangfu0 force-pushed the codex/query-progress branch 2 times, most recently from 5dda15c to 44ae161 on June 2, 2026 20:22

xiangfu0 (Contributor Author) left a comment

Found 1 high-signal issue; see inline comment.

gortiz self-requested a review June 3, 2026 15:22

gortiz (Contributor) commented Jun 4, 2026

I'll take a look today, but keep in mind that I have also been working on #18458, which should help produce more precise reports for MSE.

gortiz (Contributor) commented Jun 4, 2026

This is a really useful feature — having progress while a query runs is something users ask for constantly. A few thoughts on how it interacts with #18458 (SubmitWithStream bidi stats), which I think is worth addressing before merge since the two PRs modify some of the same infrastructure.

Merge conflict in OpChainSchedulerService

Both PRs add fields and lifecycle logic to OpChainSchedulerService. Concretely:

This PR adds:

  • _executionContextCache (Guava Cache<Long, QueryExecutionContext>, time-based eviction)
  • _completedProgressStatsCache (Guava Cache<Long, QueryProgressStats>, time-based eviction)
  • trackExecutionContext() / retainCompletedProgressStatsIfFinished() in the FutureCallback

#18458 adds:

  • _executionContextByRequest (ConcurrentMap<Long, QueryExecutionContext>, ref-counted)
  • _activeOpChainsByRequest (ConcurrentMap<Long, AtomicInteger>, reference counter)
  • decrementActiveOpChains() in the same FutureCallback
  • OpChainCompletionListener — a per-request callback that fires on op-chain completion with the full MultiStageQueryStats payload

The _executionContextByRequest in #18458 is a strictly better version of _executionContextCache here: it keeps the context alive for exactly as long as op-chains are running (ref-counted) rather than until a timer fires. A Guava time-based cache can either evict a live context prematurely (returning null for a running query) or retain a completed context longer than needed. The ref-counted map avoids both failure modes.
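
To illustrate the ref-counting idea in isolation, here is a minimal sketch. It is not the code from #18458: that PR is described as using two maps (a context map plus an AtomicInteger counter), whereas this hypothetical `RefCountedContexts` folds both into one immutable entry per request for brevity.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/** Hypothetical ref-counted registry; names and shape are assumptions. */
final class RefCountedContexts<C> {
  private static final class Entry<C> {
    final C context;
    final int refs;

    Entry(C context, int refs) {
      this.context = context;
      this.refs = refs;
    }
  }

  private final ConcurrentMap<Long, Entry<C>> byRequest = new ConcurrentHashMap<>();

  /** Called when an op-chain starts: creates the context or bumps its count. */
  C acquire(long requestId, Supplier<C> factory) {
    return byRequest.compute(requestId,
        (id, e) -> e == null ? new Entry<>(factory.get(), 1) : new Entry<>(e.context, e.refs + 1)).context;
  }

  /** Called when an op-chain finishes: the last release removes the entry. */
  void release(long requestId) {
    byRequest.computeIfPresent(requestId,
        (id, e) -> e.refs == 1 ? null : new Entry<>(e.context, e.refs - 1));
  }

  /** Returns the live context, or null once every op-chain has released it. */
  C get(long requestId) {
    Entry<C> e = byRequest.get(requestId);
    return e == null ? null : e.context;
  }
}
```

Because `compute` and `computeIfPresent` run atomically per key, the context exists exactly while at least one op-chain holds it, with no timer involved.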

I'd suggest deferring this PR until after #18458 merges, then replacing _executionContextCache with _executionContextByRequest and building the progress lifecycle on top of OpChainCompletionListener — which brings me to the next point.

OpChainCompletionListener enables better MSE progress

The current MSE progress model counts op-chains as work units. That means progress only moves when an op-chain finishes. In a typical pipeline, leaf scan op-chains finish early while join and aggregation op-chains run for the full query duration:

time ───────────────────────────────────────────────────────────────►
  leaf-0:  ████░░░░░░░░░░░░░░░░░░░░░░░   finishes at ~30% of wall-clock
  leaf-1:  ██████░░░░░░░░░░░░░░░░░░░░░   finishes at ~40%
  join-0:  ░░░░░░████████████████████░   runs for nearly the whole query
  join-1:  ░░░░░░████████████████████░   runs for nearly the whole query
  agg-0:   ░░░░░░░░░░░░░░░░░░█████████   runs near the end

With op-chain counting: progress reads 2/5 = 40% for most of the query, then 5/5 = 100% in rapid succession. The bar sits still for the vast majority of the query duration.
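
The stall can be reproduced with a few lines of Java. The finish times below are assumptions read loosely off the timeline (30 and 40 for the leaves come from the diagram labels; 95, 95, and 100 for the joins and the aggregation are guesses), and `OpChainCountingSketch` is a hypothetical name.

```java
/** Hypothetical model of op-chain counting: one work unit per finished op-chain. */
final class OpChainCountingSketch {
  /** Percent complete at time t, given each op-chain's finish time in the same unit. */
  static double percentAt(double t, double[] finishTimes) {
    int done = 0;
    for (double finish : finishTimes) {
      if (finish <= t) {
        done++;
      }
    }
    return 100.0 * done / finishTimes.length;
  }

  public static void main(String[] args) {
    // Assumed finish times as a percentage of wall-clock: leaf-0, leaf-1, join-0, join-1, agg-0.
    double[] finishTimes = {30, 40, 95, 95, 100};
    for (double t : new double[] {20, 50, 90, 100}) {
      System.out.println("t=" + t + " -> " + percentAt(t, finishTimes) + "%");
    }
  }
}
```

Under these assumed times the reported value is identical at t=50 and t=90 (2 of 5 op-chains), which is the flat stretch described above.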

OpChainCompletionListener (from #18458) fires with the actual MultiStageQueryStats — including rows scanned, CPU time, rows emitted. This opens up a much better model:

// At query start: use leaf segment count as the denominator (exact, known upfront)
ctx.addTotalWorkUnits(totalLeafSegments);

// In OpChainCompletionListener (fires per op-chain, with stats):
if (isLeafStage(opChainId)) {
    long rowsScanned = stats.get(LeafOperator.StatKey.NUM_DOCS_SCANNED);
    ctx.addProcessedWorkUnits(rowsScanned);
}
// Non-leaf op-chains don't contribute — they're bounded by what the leaves produce

This makes progress increase smoothly as leaf segments are scanned, which is both more accurate and more informative. It also naturally fixes the double-counting issue where addTotalSegmentsToProcess (in ServerQueryExecutorV1Impl) calls addTotalWorkUnits on the same context that QueryServer.submitInternal already called addTotalWorkUnits(opChainCount) on.

Rows-per-second as the primary signal (optional)

A related idea worth considering: rather than a percentage (which requires a reliable denominator), expose rowsPerSecond alongside processedRows. This is useful even when the total is unknown:

Scanning... 42.3M rows  |  1.2M rows/s  |  ~35s remaining

This is the model that both Trino and ClickHouse have converged on:

  • Trino CLI (StatusPrinter.java) computes rows/s and bytes/s from each polling response and displays them at every tick:
    0:13 [6.45M rows, 560MB] [473K rows/s, 41.1MB/s] [=========>> ] 20%
    The REST API does not have dedicated throughput fields — rates are derived client-side from processedRows / elapsedTimeMillis. No server changes were needed to add this.

  • ClickHouse HTTP interface streams Progress packets (read_rows, read_bytes, elapsed_ns) as the query executes, and clickhouse-client computes and displays:
    Progress: 5.3M rows, 2.4GB (234K rows/s., 234MB/s.)
    The server provides the raw counters; the rate is computed at the display layer.

Both approaches show that rows/s is valuable even without a perfect denominator. When a percentage is available (Trino has progressPercentage, ClickHouse has total_rows_to_read), it appears alongside the throughput; when not, the throughput alone is shown.

For Pinot, the simplest path is the Trino approach: add rowsProcessed and elapsedMs to QueryProgressStats, then compute rowsPerSecond in the CLI/UI from successive responses. No server changes needed for V1. When totalWorkUnits is known, ETA follows from (total - processed) / rowsPerSecond. When it isn't, rows/s alone tells the user whether the query is making progress and at what speed — arguably more actionable than a percentage built on a plan-cardinality estimate that may be off by an order of magnitude.
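
A minimal sketch of the client-side computation suggested above, assuming two successive poll responses that each carry a row count and an elapsed time. `ThroughputSketch` and its parameter names are hypothetical, and the ETA only makes sense when the total and the processed count are in the same unit (rows here).

```java
import java.util.OptionalDouble;

/** Hypothetical display-layer helper; no such class exists in the PR. */
final class ThroughputSketch {
  /** Rows per second between two polling samples; empty if no time has elapsed. */
  static OptionalDouble rowsPerSecond(long prevRows, long prevElapsedMs, long rows, long elapsedMs) {
    long deltaMs = elapsedMs - prevElapsedMs;
    if (deltaMs <= 0) {
      return OptionalDouble.empty();
    }
    return OptionalDouble.of((rows - prevRows) * 1000.0 / deltaMs);
  }

  /** Seconds remaining = (total - processed) / rate; empty when total or rate is unusable. */
  static OptionalDouble etaSeconds(long totalRows, long processedRows, double rowsPerSecond) {
    if (totalRows <= 0 || rowsPerSecond <= 0) {
      return OptionalDouble.empty();
    }
    return OptionalDouble.of(Math.max(0L, totalRows - processedRows) / rowsPerSecond);
  }
}
```

As an example with made-up numbers: at 42.3M rows processed, 1.2M rows/s, and an assumed total of 84.3M rows, `etaSeconds` gives 35 seconds. When the total is unknown, the caller simply shows the rate and omits the ETA.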

gortiz (Contributor) commented Jun 4, 2026

The polling chain this PR introduces is functional but has a cost-multiplier property that becomes significant at scale. Flagging it here as a design discussion point rather than a blocker, since fixing it is a larger change that can be done in a follow-up.

The problem with polling

When a client calls GET /clientQuery/{id}/progress, the chain is:

client → controller (fan-out to all brokers)
       → broker     (fan-out HTTP to all servers, SSE; or gRPC Progress RPC, MSE)
       → servers    (Guava cache lookup, return JSON)

For a 30-second query with a 1-second poll interval and 3 servers:

30 client ticks
  → 30 × N controller→broker calls (N = brokers; fan-out to find the right one)
  → 30 × 3 broker→server calls
  = 90+ network calls whose only content is a tiny JSON payload

At 100 concurrent queries: ~9,000 extra calls per 30-second window, or roughly 18,000 per minute under sustained load. Each tick also requires the server to have live progress state accessible at any time (the Guava caches in OpChainSchedulerService and InstanceRequestHandler). The caches are sized and timed to stay alive long enough to answer the next poll, which introduces the eviction races noted in other comments.

A push alternative

The natural fix is to flip the direction: client opens one persistent connection, broker pushes events as they arrive.

Client                  Broker                       Servers
  │                       │                             │
  │── GET /query/X/stream ►│                             │
  │  (SSE, stays open)    │                             │
  │                       │◄─ OpChainComplete (gRPC) ───│  ← already flowing via #18458
  │◄── data: {rows:12M} ──│                             │
  │                       │◄─ OpChainComplete (gRPC) ───│
  │◄── data: {rows:34M} ──│                             │
  │                       │◄─ OpChainComplete (gRPC) ───│
  │◄── data: {rows:100M} ─│                             │
  │◄── event: complete ───│  (stream closes)            │

Cost for the same 30-second query:

1 client connection (open once, reused throughout)
0 new controller→broker calls
0 new broker→server calls  ← #18458's SubmitWithStream already delivers this data

The broker SSE endpoint just fans out events it already holds in StreamingQuerySession. No additional Guava caches on servers. No eviction races. No controller fan-out per tick.

Why this is feasible with #18458 in place

#18458 introduces a long-lived gRPC bidi channel (SubmitWithStream) between broker and servers that stays open for the query duration. The broker's StreamingQuerySession already accumulates per-op-chain stats as they complete. The missing piece is an outbound channel from broker to client. SSE provides exactly that with standard JAX-RS (SseEventSink):

// In StreamingQuerySession, when OpChainComplete arrives (already called by #18458):
public void onOpChainComplete(...) {
    mergeStats(...);                     // existing #18458 logic
    broadcastProgressSnapshot();         // new: push to SSE subscribers
}

// New broker endpoint:
@GET
@Path("query/{id}/progress/stream")
@Produces(MediaType.SERVER_SENT_EVENTS)
public void streamProgress(@PathParam("id") long queryId, @Context SseEventSink sink) {
    _queryDispatcher.subscribeProgressStream(queryId, sink);
}

Connection cleanup is handled automatically: when the SSE connection drops, sink.isClosed() returns true and the subscriber is removed on the next push attempt. When the query completes, the broker sends a final event with complete: true and closes the stream.
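
The cleanup-on-next-push behavior described above can be sketched without any JAX-RS dependency. The `Sink` interface below is a hypothetical stand-in for the two `SseEventSink` operations the text relies on (the real `send` takes an `OutboundSseEvent`), and `ProgressBroadcasterSketch` is an illustrative name, not part of either PR.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical broadcaster; a sketch of lazy subscriber cleanup only. */
final class ProgressBroadcasterSketch {
  /** Simplified stand-in for SseEventSink. */
  interface Sink {
    boolean isClosed();

    void send(String data);
  }

  // CopyOnWriteArrayList iterates over a snapshot, so removing during iteration is safe.
  private final List<Sink> subscribers = new CopyOnWriteArrayList<>();

  void subscribe(Sink sink) {
    subscribers.add(sink);
  }

  /** Pushes one snapshot; sinks found closed are dropped instead of written to. */
  void broadcast(String snapshotJson) {
    for (Sink sink : subscribers) {
      if (sink.isClosed()) {
        subscribers.remove(sink);
      } else {
        sink.send(snapshotJson);
      }
    }
  }

  int subscriberCount() {
    return subscribers.size();
  }
}
```

A disconnected client therefore costs nothing beyond one `isClosed()` check on the next progress event, and no separate reaper thread is needed.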

What this PR should do now

This is a non-trivial change that I wouldn't block the current PR on. But it would be worth:

  1. Keeping the polling endpoint as-is (it's correct and useful for languages/clients that can't hold persistent connections)
  2. Adding a GET /query/{id}/progress/stream SSE endpoint alongside it in a follow-up
  3. Having the CLI and Query Console prefer SSE when available

The main thing to avoid is designing the server-side state (Guava caches, eviction timings) in a way that makes it hard to remove when the push path lands. The ref-counted _executionContextByRequest from #18458 is already the right shape for that.

xiangfu0 force-pushed the codex/query-progress branch from 44ae161 to 5cb38b9 on June 4, 2026 23:39

xiangfu0 (Contributor Author) commented Jun 4, 2026

Addressed the adaptive progress display path in the latest push.

  • SSE/simple progress responses still render as one compact CLI/UI row.
  • MSE progress can now carry labeled detail rows, so CLI and Query Console render a top-level aggregate plus component bars when details are present.
  • Missing or non-responsive MSE workers are retained as unknown rows, so the aggregate no longer shrinks the denominator or falsely reaches 100%.

I kept the rows/s metric and broker-pushed progress stream as follow-up scope. The current shape keeps the aggregate fields backward compatible while allowing richer MSE status when the response includes details.

xiangfu0 (Contributor Author) left a comment

Found one correctness issue; see inline comment.

}
}
if (!serverProgressStats.isEmpty()) {
return QueryProgressStats.aggregate(serverProgressStats);

This still returns a partial aggregate over only the SSE servers that replied with 200. If one targeted server times out or is temporarily unreachable, its unfinished work disappears from the denominator and the broker can report inflated progress, including 100%, even though the query is still blocked on that server. The MSE path now avoids that by treating missing servers as unknown progress; SSE needs the same treatment here (or retained last-known totals) instead of returning a partial aggregate.
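
One way to picture the "retained last-known totals" option is the sketch below. `LastKnownProgressSketch` is hypothetical and not taken from the PR; it assumes the broker knows the full set of targeted servers and treats a server that has never reported as making the aggregate unknown.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-query tracker that survives individual poll failures. */
final class LastKnownProgressSketch {
  // server -> {processed, total} from the most recent successful response
  private final Map<String, long[]> lastKnown = new HashMap<>();

  /** Records a successful progress response from one server. */
  void onResponse(String server, long processed, long total) {
    lastKnown.put(server, new long[] {processed, total});
  }

  /**
   * Aggregates over every targeted server, reusing the last snapshot for servers
   * that failed to answer this round. Returns null if any targeted server has
   * never reported, because the denominator is then unknown.
   */
  long[] aggregate(List<String> targetedServers) {
    long processed = 0;
    long total = 0;
    for (String server : targetedServers) {
      long[] snapshot = lastKnown.get(server);
      if (snapshot == null) {
        return null;
      }
      processed += snapshot[0];
      total += snapshot[1];
    }
    return new long[] {processed, total};
  }
}
```

If server `a` reports 4/4 and server `b` last reported 1/4 before timing out, the aggregate stays at 5/8 rather than jumping to 4/4.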

4 participants