
Conversation

@benchaplin (Contributor)

Resolves #134151, #130821.

Background

A bug was introduced in #121885 by the following code, which handles exceptions from a batched query request, including those caused by a failed partial reduction on the data node:

@Override
public void handleException(TransportException e) {
    Exception cause = (Exception) ExceptionsHelper.unwrapCause(e);
    logger.debug("handling node search exception coming from [" + nodeId + "]", cause);
    if (e instanceof SendRequestTransportException || cause instanceof TaskCancelledException) {
        // two possible special cases here where we do not want to fail the phase:
        // failure to send out the request -> handle things the same way a shard would fail with unbatched execution
        // as this could be a transient failure and partial results we may have are still valid
        // cancellation of the whole batched request on the remote -> maybe we timed out or so, partial results may
        // still be valid
        onNodeQueryFailure(e, request, routing);
    } else {
        // Remote failure that wasn't due to networking or cancellation means that the data node was unable to reduce
        // its local results. Failure to reduce always fails the phase without exception so we fail the phase here.
        if (results instanceof QueryPhaseResultConsumer queryPhaseResultConsumer) {
            queryPhaseResultConsumer.failure.compareAndSet(null, cause);
        }
        onPhaseFailure(getName(), "", cause);
    }
}

Raising a phase failure in this way leads to a couple of issues:

  1. It can be called more than once (as seen in #134151: "[Search] Exceptions in datanodes leading to assertFirstRun() failures").
  2. The subsequent freeing of contexts can miss concurrent in-flight queries, resulting in open contexts after the failure (as seen in #130821: "[CI] SearchWithRejectionsIT testOpenContextsAfterRejections failing").

Solution

Problem 1 could be resolved with a simple flag, as proposed in #131085. Problem 2 could be resolved with some careful use of the same flag to clean up contexts upon receiving stale query results. However, in the interest of stability, I propose a solution that more closely resembles how a reduction failure is handled by the non-batched query phase. In the non-batched case, a reduction failure is held in the QueryPhaseResultConsumer until shard fanout is complete; only later, during final reduction at the beginning of the fetch phase, do we fail the search.
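
To make that concrete, here is a minimal sketch of the "hold the failure, fail at final reduce" pattern. It is illustrative only: the real class is QueryPhaseResultConsumer, which exposes a failure AtomicReference (visible in the snippet above); every other name below is invented.

import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch only; names other than the `failure` field are made up for illustration.
final class SketchResultConsumer {

    // The real QueryPhaseResultConsumer holds the first reduction failure in a field like this.
    final AtomicReference<Exception> failure = new AtomicReference<>();

    // A partial reduction failed: remember it, but do not fail the phase yet,
    // so the fanout to the remaining shards can complete.
    void onReductionFailure(Exception e) {
        failure.compareAndSet(null, e);
    }

    // Final reduction, run at the beginning of the fetch phase once all shards
    // have been queried. Only here does a held reduction failure fail the search.
    Object reduce() throws Exception {
        Exception held = failure.get();
        if (held != null) {
            throw held;
        }
        return doFinalReduce();
    }

    private Object doFinalReduce() {
        return new Object(); // stand-in for the real merge of shard results
    }
}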

Fast failure and proper task cancellation are worthy goals for the future. I am tracking these as follow-up improvements for after the release of batched query execution.

This PR:

  1. Alters the batched query request so that the data node still responds with shard results when a reduction failure occurs (the failure is now conditionally included in the NodeQueryResponse). A rough sketch follows below this list.
  2. Removes the early phase failure on the coord node. The coord's QueryPhaseResultConsumer will hold onto the failure and fail eventually during the fetch phase, the same as in the non-batched case.
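
A rough sketch of the data node's side of point 1, using the two write methods that show up in the review threads below; the surrounding structure and the way the failure is looked up are assumptions, not the actual handler code:

// Rough sketch, not the actual handler: branch between the two response shapes.
Exception reductionFailure = queryPhaseResultConsumer.failure.get(); // assumed lookup
if (reductionFailure == null) {
    // Same bytes as before this PR, plus a trailing writeBoolean(false)
    // signalling "no reduction failure".
    writeSuccessfulResponse(out);
} else {
    // Shard results are still written, followed by writeBoolean(true) and the
    // exception, so the coordinator can finish fanout and fail the phase later.
    writeReductionFailureResponse(out, reductionFailure);
}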

@benchaplin added the >bug, Team:Search Foundations, :Search Foundations/Search, and v9.1.7 labels on Oct 21, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine (Collaborator)

Hi @benchaplin, I've created a changelog YAML for you.

this.results = in.readArray(i -> i.readBoolean() ? new QuerySearchResult(i) : i.readException(), Object[]::new);
this.mergeResult = QueryPhaseResultConsumer.MergeResult.readFrom(in);
this.topDocsStats = SearchPhaseController.TopDocsStats.readFrom(in);
boolean hasReductionFailure = in.readBoolean();
Contributor

Since we're changing the shape of this message, do we need to create a new transport version or is that taken care of for us?

Contributor Author

Yes I believe I do, once I learn how 😂
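
(For context: the usual way to handle this is to gate the new wire field behind a transport version check, roughly like the sketch below; the version constant name is invented.)

// Hypothetical sketch: only read the new flag from nodes new enough to write it.
// TransportVersions.BATCHED_REDUCTION_FAILURE is a made-up constant name.
boolean hasReductionFailure = false;
if (in.getTransportVersion().onOrAfter(TransportVersions.BATCHED_REDUCTION_FAILURE)) {
    hasReductionFailure = in.readBoolean();
}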

@chrisparrinello (Contributor) left a comment

LGTM

@benchaplin added the auto-backport and v9.2.1 labels on Oct 22, 2025
@benchaplin marked this pull request as draft on October 22, 2025 21:50
@benchaplin marked this pull request as ready for review on October 28, 2025 16:48
@javanna (Member) left a comment

I left a couple of minor comments. LGTM otherwise.

);
}

private void writeSuccessfulResponse(RecyclerBytesStreamOutput out) throws IOException {
Member

Is it necessary to refactor this serialization code? Moving it around makes it more difficult to eyeball somehow.

Member

Disregard my previous comment. I understand why you did things the way you did. It's fine as-is. As for the review, I simply trust that the successful writing is a plain copy of the previous code we had, with no changes.

Contributor Author

That's right, it's the same as the previous code except for the additional out.writeBoolean(false), which tells us there's no reduction failure.

}
out.writeBoolean(true); // does have a reduction failure
out.writeException(reductionFailure);
releaseAllResultsContexts();
Member

Should this be in a finally block?

Contributor Author

In the case of an IOException, the caller releases the contexts in a catch block. So I think this is alright; otherwise we'd be releasing twice.
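
Roughly the shape being described, with the caller's structure assumed, to show why an extra finally would release the contexts twice:

// Assumed caller shape (illustrative only): writeReductionFailureResponse releases
// the contexts itself on success, and the caller's catch releases them if
// serialization throws, so a finally inside the method would double-release.
try {
    writeReductionFailureResponse(out, reductionFailure);
} catch (IOException e) {
    releaseAllResultsContexts();
}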

NodeQueryResponse.writeMergeResult(out, mergeResult, queryPhaseResultConsumer.topDocsStats);
}

private void writeReductionFailureResponse(RecyclerBytesStreamOutput out, Exception reductionFailure) throws IOException {
Member

Maybe add a comment about where the corresponding read code for this can be found? Future readers may be looking for it and not easily find it, given that we write to an opaque bytes transport response.

Contributor Author

Good idea, done.

if (failure != null) {
handleMergeFailure(failure, channelListener, namedWriteableRegistry);
releaseAllResultsContexts();
channelListener.onFailure(failure);
Member

I wonder if I am missing something: is this doing the same thing that handleMergeFailure was previously doing?

Contributor Author

Yes, the exact same. handleMergeFailure doesn't make sense to keep in the new serialization scheme, so I decided to remove the method.

out.writeBoolean(true);
writeTopDocs(out, topDocsAndMaxScore);
} else {
assert isPartiallyReduced();
Contributor Author

A failure might occur during the final data node reduction. In this case, due to the central change of this PR, we still send back QuerySearchResults to sit on the coord node. Therefore we can no longer assert that a QuerySearchResult has been reduced when serializing it.

@benchaplin merged commit a01ab1e into elastic:main on Nov 5, 2025
34 checks passed
@elasticsearchmachine (Collaborator)

💔 Backport failed

Branch 9.1: commit could not be cherrypicked due to conflicts.
The backport operation could not be completed due to the following error: an unhandled error occurred. Please consult the logs.

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 136889

benchaplin added a commit to benchaplin/elasticsearch that referenced this pull request Nov 5, 2025
benchaplin added a commit to benchaplin/elasticsearch that referenced this pull request Nov 6, 2025
elasticsearchmachine pushed a commit that referenced this pull request Nov 6, 2025
* Backport #136889

* Fix

* [CI] Auto commit changes from spotless

---------

Co-authored-by: elasticsearchmachine <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Nov 6, 2025
* Backport #136889

* [CI] Auto commit changes from spotless

---------

Co-authored-by: elasticsearchmachine <[email protected]>
afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request Nov 6, 2025
elastic#121885 attempted to shortcut a phase failure caused by a reduction
failure on the data node by failing the query phase in the batched
query action response listener. Before batching the query phase, we
did not fail the phase immediately upon a reduction failure. We held
on to the failure and continued querying all shards, only failing
during final reduction at the beginning of the fetch phase.

I can't think of anything inherently wrong with this approach,
besides the fact that the phase cannot be failed multiple times
(elastic#134151). However, certain cleanup aspects of the code (specifically
releasing reader contexts and query search results, see: elastic#130821,
elastic#122707) rely on the assumption that all shards are queried before
failing the phase.

This commit reworks batched requests to fail in the same way: only
after all shards are queried. To do this, we must include results in the
transport response even when a reduction failure occurred.
szybia added a commit to szybia/elasticsearch that referenced this pull request Nov 6, 2025
…-json

* upstream/main:
  Mute org.elasticsearch.xpack.inference.action.filter.ShardBulkInferenceActionFilterBasicLicenseIT testLicenseInvalidForInference {p0=false} elastic#137691
  Mute org.elasticsearch.xpack.inference.action.filter.ShardBulkInferenceActionFilterBasicLicenseIT testLicenseInvalidForInference {p0=true} elastic#137690
  [LTR] Fix feature display order when using explain. (elastic#137671)
  Remove extra RemoteClusterService instances in unit test (elastic#137647)
  Fix `ComponentTemplatesFileSettingsIT.testSettingsApplied` (elastic#137669)
  Consolidates troubleshooting content into the "Returning semantic field embeddings in _source" section (elastic#137233)
  Update bundled JDK to 25.0.1 (elastic#137640)
  resolve indices for prefixed _all expressions (elastic#137330)
  ESQL: Add TopN support for exponential histograms (elastic#137313)
  allows field caps to be cross project (elastic#137530)
  ESQL: Add exponential histogram percentile function (elastic#137553)
  Wait for nodes to have downloaded databases in `GeoIpDownloaderIT` (elastic#137636)
  Tighten on when THROTTLE decision can be returned (elastic#136794)
  Mute org.elasticsearch.xpack.esql.qa.single_node.GenerativeMetricsIT test elastic#137655
  Add a test for two little known conditional processor paths (elastic#137645)
  Extract a common ORIGIN constant (elastic#137612)
  Remove early phase failure in batched (elastic#136889)
  Returning correct index mode from get data streams api (elastic#137646)
  [ML] Manage AD results indices (elastic#136065)
szybia added a commit to szybia/elasticsearch that referenced this pull request Nov 6, 2025

Labels

auto-backport (Automatically create backport pull requests when merged), backport pending, >bug, :Search Foundations/Search (Catch all for Search Foundations), Team:Search Foundations (Meta label for the Search Foundations team in Elasticsearch), v9.1.7, v9.2.1, v9.3.0


Development

Successfully merging this pull request may close these issues.

[Search] Exceptions in datanodes leading to assertFirstRun() failures

5 participants