
Conversation

@benchaplin (Contributor)

Resolves #134151, #130821.

Background

A bug was introduced in #121885 by the following code, which handles exceptions from a batched query request, including those caused by a failed partial reduction on the data node:

@Override
public void handleException(TransportException e) {
    Exception cause = (Exception) ExceptionsHelper.unwrapCause(e);
    logger.debug("handling node search exception coming from [" + nodeId + "]", cause);
    if (e instanceof SendRequestTransportException || cause instanceof TaskCancelledException) {
        // two possible special cases here where we do not want to fail the phase:
        // failure to send out the request -> handle things the same way a shard would fail with unbatched execution
        // as this could be a transient failure and partial results we may have are still valid
        // cancellation of the whole batched request on the remote -> maybe we timed out or so, partial results may
        // still be valid
        onNodeQueryFailure(e, request, routing);
    } else {
        // Remote failure that wasn't due to networking or cancellation means that the data node was unable to reduce
        // its local results. Failure to reduce always fails the phase without exception so we fail the phase here.
        if (results instanceof QueryPhaseResultConsumer queryPhaseResultConsumer) {
            queryPhaseResultConsumer.failure.compareAndSet(null, cause);
        }
        onPhaseFailure(getName(), "", cause);
    }
}

Raising a phase failure in this way leads to a couple of issues:

  1. It can be called more than once (as seen in #134151: "[Search] Exceptions in datanodes leading to assertFirstRun() failures").
  2. The subsequent freeing of contexts can miss concurrent in-flight queries, resulting in open contexts after the failure (as seen in #130821: "[CI] SearchWithRejectionsIT testOpenContextsAfterRejections failing").

Solution

Problem 1 could be resolved with a simple flag, as proposed in #131085. Problem 2 could be resolved with some careful use of the same flag to clean up contexts upon receiving stale query results. However, in the interest of stability, I propose a solution that more closely resembles how a reduction failure is handled by the non-batched query phase. In the non-batched case, a reduction failure is held in the QueryPhaseResultConsumer until shard fanout is complete; only later, during final reduction at the beginning of the fetch phase, do we fail the search.
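
To make that concrete, here is a minimal sketch of the "hold the failure, fail at final reduce" pattern. It is illustrative only: the real class is QueryPhaseResultConsumer, which exposes a failure AtomicReference (visible in the snippet above); every other name below is invented.

import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch only; names other than the `failure` field are made up for illustration.
final class SketchResultConsumer {

    // The real QueryPhaseResultConsumer holds the first reduction failure in a field like this.
    final AtomicReference<Exception> failure = new AtomicReference<>();

    // A partial reduction failed: remember it, but do not fail the phase yet,
    // so the fanout to the remaining shards can complete.
    void onReductionFailure(Exception e) {
        failure.compareAndSet(null, e);
    }

    // Final reduction, run at the beginning of the fetch phase once all shards
    // have been queried. Only here does a held reduction failure fail the search.
    Object reduce() throws Exception {
        Exception held = failure.get();
        if (held != null) {
            throw held;
        }
        return doFinalReduce();
    }

    private Object doFinalReduce() {
        return new Object(); // stand-in for the real merge of shard results
    }
}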

Fast failure and proper task cancellation are worthy goals for the future. I am tracking these as follow-up improvements for after the release of batched query execution.

This PR:

  1. Alters the batched query request so that the data node still responds with shard results when a reduction failure occurs (the failure is now conditionally included in the NodeQueryResponse). A rough sketch follows below this list.
  2. Removes the early phase failure on the coord node. The coord's QueryPhaseResultConsumer will hold onto the failure and fail eventually during the fetch phase, the same as in the non-batched case.
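
A rough sketch of the data node's side of point 1, using the two write methods that show up in the review threads below; the surrounding structure and the way the failure is looked up are assumptions, not the actual handler code:

// Rough sketch, not the actual handler: branch between the two response shapes.
Exception reductionFailure = queryPhaseResultConsumer.failure.get(); // assumed lookup
if (reductionFailure == null) {
    // Same bytes as before this PR, plus a trailing writeBoolean(false)
    // signalling "no reduction failure".
    writeSuccessfulResponse(out);
} else {
    // Shard results are still written, followed by writeBoolean(true) and the
    // exception, so the coordinator can finish fanout and fail the phase later.
    writeReductionFailureResponse(out, reductionFailure);
}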

@benchaplin added the >bug, Team:Search Foundations, :Search Foundations/Search, and v9.1.7 labels on Oct 21, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine (Collaborator)

Hi @benchaplin, I've created a changelog YAML for you.

this.results = in.readArray(i -> i.readBoolean() ? new QuerySearchResult(i) : i.readException(), Object[]::new);
this.mergeResult = QueryPhaseResultConsumer.MergeResult.readFrom(in);
this.topDocsStats = SearchPhaseController.TopDocsStats.readFrom(in);
boolean hasReductionFailure = in.readBoolean();
Contributor

Since we're changing the shape of this message, do we need to create a new transport version or is that taken care of for us?

Contributor Author

Yes I believe I do, once I learn how 😂
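
(For context: the usual way to handle this is to gate the new wire field behind a transport version check, roughly like the sketch below; the version constant name is invented.)

// Hypothetical sketch: only read the new flag from nodes new enough to write it.
// TransportVersions.BATCHED_REDUCTION_FAILURE is a made-up constant name.
boolean hasReductionFailure = false;
if (in.getTransportVersion().onOrAfter(TransportVersions.BATCHED_REDUCTION_FAILURE)) {
    hasReductionFailure = in.readBoolean();
}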

@chrisparrinello (Contributor) left a comment

LGTM

@benchaplin added the auto-backport and v9.2.1 labels on Oct 22, 2025
@benchaplin marked this pull request as draft on October 22, 2025 21:50
@benchaplin marked this pull request as ready for review on October 28, 2025 16:48
@javanna (Member) left a comment

I left a couple of minor comments. LGTM otherwise.

);
}

private void writeSuccessfulResponse(RecyclerBytesStreamOutput out) throws IOException {
Member

Is it necessary to refactor this serialization code? Moving it around makes it more difficult to eyeball somehow.

Member

Disregard my previous comment. I understand why you did things the way you did. It's fine as-is. As for the review, I simply trust that the successful writing is a plain copy of the previous code we had, with no changes.

Contributor Author

That's right, it's the same as the previous code except for the additional out.writeBoolean(false), which tells us there's no reduction failure.

}
out.writeBoolean(true); // does have a reduction failure
out.writeException(reductionFailure);
releaseAllResultsContexts();
Member

Should this be in a finally block?

Contributor Author

In the case of an IOException, the caller releases the contexts in a catch block. So I think this is alright; otherwise we'd be releasing twice.
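
Roughly the shape being described, with the caller's structure assumed, to show why an extra finally would release the contexts twice:

// Assumed caller shape (illustrative only): writeReductionFailureResponse releases
// the contexts itself on success, and the caller's catch releases them if
// serialization throws, so a finally inside the method would double-release.
try {
    writeReductionFailureResponse(out, reductionFailure);
} catch (IOException e) {
    releaseAllResultsContexts();
}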

NodeQueryResponse.writeMergeResult(out, mergeResult, queryPhaseResultConsumer.topDocsStats);
}

private void writeReductionFailureResponse(RecyclerBytesStreamOutput out, Exception reductionFailure) throws IOException {
Member

Maybe add a comment about where the corresponding read code for this can be found? Future readers may be looking for it and not easily find it, given that we write to an opaque bytes transport response.

Contributor Author

Good idea, done.

if (failure != null) {
handleMergeFailure(failure, channelListener, namedWriteableRegistry);
releaseAllResultsContexts();
channelListener.onFailure(failure);
Member

I wonder if I am missing something: is this doing the same thing that handleMergeFailure was previously doing?

Contributor Author

Yes, the exact same. handleMergeFailure doesn't make sense to keep in the new serialization scheme, so I decided to remove the method.

out.writeBoolean(true);
writeTopDocs(out, topDocsAndMaxScore);
} else {
assert isPartiallyReduced();
Contributor Author

A failure might occur during the final data node reduction. In this case, due to the central change of this PR, we still send back QuerySearchResults to sit on the coord node. Therefore we can no longer assert that a QuerySearchResult has been reduced when serializing it.

@benchaplin merged commit a01ab1e into elastic:main on Nov 5, 2025
34 checks passed
@elasticsearchmachine (Collaborator)

💔 Backport failed

Branch 9.1: commit could not be cherrypicked due to conflicts.
The backport operation could not be completed due to the following error: an unhandled error occurred. Please consult the logs.

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 136889

benchaplin added a commit to benchaplin/elasticsearch that referenced this pull request Nov 5, 2025
benchaplin added a commit to benchaplin/elasticsearch that referenced this pull request Nov 6, 2025
elasticsearchmachine pushed a commit that referenced this pull request Nov 6, 2025
* Backport #136889

* Fix

* [CI] Auto commit changes from spotless

---------

Co-authored-by: elasticsearchmachine <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Nov 6, 2025
* Backport #136889

* [CI] Auto commit changes from spotless

---------

Co-authored-by: elasticsearchmachine <[email protected]>
afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request Nov 6, 2025
elastic#121885 attempted to shortcut a phase failure caused by a reduction
failure on the data node by failing the query phase in the batched
query action response listener. Before batching the query phase, we
did not fail the phase immediately upon a reduction failure. We held
on to the failure and continued querying all shards, only failing
during final reduction at the beginning of the fetch phase.

I can't think of anything inherently wrong with this approach,
besides the fact that the phase cannot be failed multiple times
(elastic#134151). However, certain cleanup aspects of the code (specifically
releasing reader contexts and query search results, see: elastic#130821,
elastic#122707) rely on the assumption that all shards are queried before
failing the phase.

This commit reworks batched requests to fail in the same way: only
after all shards are queried. To do this, we must include results in the
transport response even when a reduction failure occurred.
szybia added a commit to szybia/elasticsearch that referenced this pull request Nov 6, 2025
…-json

* upstream/main:
  Mute org.elasticsearch.xpack.inference.action.filter.ShardBulkInferenceActionFilterBasicLicenseIT testLicenseInvalidForInference {p0=false} elastic#137691
  Mute org.elasticsearch.xpack.inference.action.filter.ShardBulkInferenceActionFilterBasicLicenseIT testLicenseInvalidForInference {p0=true} elastic#137690
  [LTR] Fix feature display order when using explain. (elastic#137671)
  Remove extra RemoteClusterService instances in unit test (elastic#137647)
  Fix `ComponentTemplatesFileSettingsIT.testSettingsApplied` (elastic#137669)
  Consolidates troubleshooting content into the "Returning semantic field embeddings in _source" section (elastic#137233)
  Update bundled JDK to 25.0.1 (elastic#137640)
  resolve indices for prefixed _all expressions (elastic#137330)
  ESQL: Add TopN support for exponential histograms (elastic#137313)
  allows field caps to be cross project (elastic#137530)
  ESQL: Add exponential histogram percentile function (elastic#137553)
  Wait for nodes to have downloaded databases in `GeoIpDownloaderIT` (elastic#137636)
  Tighten on when THROTTLE decision can be returned (elastic#136794)
  Mute org.elasticsearch.xpack.esql.qa.single_node.GenerativeMetricsIT test elastic#137655
  Add a test for two little known conditional processor paths (elastic#137645)
  Extract a common ORIGIN constant (elastic#137612)
  Remove early phase failure in batched (elastic#136889)
  Returning correct index mode from get data streams api (elastic#137646)
  [ML] Manage AD results indices (elastic#136065)
szybia added a commit to szybia/elasticsearch that referenced this pull request Nov 6, 2025

Labels

auto-backport (Automatically create backport pull requests when merged), backport pending, >bug, :Search Foundations/Search (Catch all for Search Foundations), Team:Search Foundations (Meta label for the Search Foundations team in Elasticsearch), v9.1.7, v9.2.1, v9.3.0


Development

Successfully merging this pull request may close these issues.

[Search] Exceptions in datanodes leading to assertFirstRun() failures

5 participants