Address CompoundRetrieverBuilder Failure Handling #136732

john-wagster · 2025-10-17T02:39:22Z

Exploring options for improving error handling in CompoundRetrieverBuilder, particularly in the case where a sub-retriever's response is 2xx but some shards failed to retrieve data.

addresses: #136529

server/src/main/java/org/elasticsearch/search/retriever/CompoundRetrieverBuilder.java

john-wagster · 2025-10-17T02:44:24Z

@pmpailis curious if this is roughly what you were thinking in terms of improving error handling, and whether you had other thoughts about how this should behave or about additional checks. What else would be nice to have here? I thought about enhancing the error message handling within the if (false == failures.isEmpty()) { block to include a list of the failure messages as well. How do you feel about that?
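A minimal sketch of how the failure messages could be folded into a single error message inside that block. This is a standalone illustration, not the PR's code: `FailureSummary`, `summarize`, and the message format are hypothetical, and plain `Exception` stands in for whatever type the real `failures` list holds.

```java
import java.util.List;
import java.util.stream.Collectors;

public class FailureSummary {
    // Hypothetical helper: builds one message that lists every collected failure.
    public static String summarize(List<? extends Exception> failures) {
        String details = failures.stream()
            .map(e -> e.getClass().getSimpleName() + ": " + e.getMessage())
            .collect(Collectors.joining("; "));
        return "[" + failures.size() + "] inner retriever failure(s): " + details;
    }

    public static void main(String[] args) {
        List<Exception> failures = List.of(
            new IllegalArgumentException("simulated failure"),
            new IllegalStateException("shard unavailable")
        );
        // Same guard style as the block discussed above.
        if (false == failures.isEmpty()) {
            System.out.println(summarize(failures));
        }
    }
}
```

Keeping the count up front makes it easy to see at a glance how many failures were collected, even if the joined details end up truncated in logs.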

server/src/main/java/org/elasticsearch/search/retriever/CompoundRetrieverBuilder.java

pmpailis · 2025-10-17T07:17:44Z

Thanks @john-wagster for picking this up! ❤️ Yeah, this is pretty much what I was thinking as well; only minor comments on the type of exception that we would throw (to avoid always returning a 5xx).

Will take a look at adding a test case as well.
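One possible way to avoid always returning a 5xx is to derive the final status from the statuses of the individual failures. The sketch below assumes a "highest status wins" policy; `FailureStatus` and `resolve` are hypothetical names, and the actual policy in the PR may differ.

```java
import java.util.List;

public class FailureStatus {
    // Hypothetical policy: report the highest status code seen across failures,
    // so a request that only hit client-side (4xx) errors is not reported as 5xx.
    public static int resolve(List<Integer> failureStatuses) {
        int status = 0;
        for (int s : failureStatuses) {
            status = Math.max(status, s);
        }
        // No status information at all: fall back to a generic server error.
        return status == 0 ? 500 : status;
    }

    public static void main(String[] args) {
        System.out.println(resolve(List.of(400, 400)));
        System.out.println(resolve(List.of(400, 503)));
    }
}
```

With this policy a set of purely 4xx failures stays a 4xx, while any server-side failure still surfaces as a 5xx.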

elasticsearchmachine · 2025-10-17T23:04:32Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

elasticsearchmachine · 2025-10-17T23:05:08Z

Hi @john-wagster, I've created a changelog YAML for you.

john-wagster · 2025-10-17T23:12:41Z

@pmpailis I created a mock test, which helped with cleaning up the code. Let me know what you think about that and the current state of the code. I also realized I have no idea which version labels this should actually be applied to, so I just put them all on here. I think that's everything that's currently supported? Thoughts on which releases this should target would be welcome.

pmpailis · 2025-10-20T09:06:50Z

Thanks @john-wagster! This looks really nice! In addition to using mocks, we could also add a custom query through a test plugin and use it as part of the integration tests (e.g. in LinearRetrieverIT#nodePlugins, RRFRetrieverBuilderIT#nodePlugins, etc.).

E.g.

    // Test-only query that randomly fails on even-numbered shards and matches all docs otherwise.
    private static class ShardFailingQueryBuilder extends AbstractQueryBuilder<ShardFailingQueryBuilder> {
        private static final String NAME = "shard_failing_query";

        // The query has no parameters, so there is nothing to read from the parser.
        private static ShardFailingQueryBuilder fromXContent(XContentParser parser) {
            return new ShardFailingQueryBuilder();
        }

        ShardFailingQueryBuilder() {}

        ShardFailingQueryBuilder(StreamInput in) throws IOException {
            super(in);
        }

        @Override
        public String getWriteableName() {
            return NAME;
        }

        @Override
        public TransportVersion getMinimalSupportedVersion() {
            return TransportVersion.current();
        }

        @Override
        protected void doWriteTo(StreamOutput out) throws IOException {
            // no state to serialize
        }

        @Override
        protected void doXContent(XContentBuilder builder, Params params) throws IOException {
            builder.startObject(NAME);
            builder.endObject();
        }

        @Override
        protected Query doToQuery(SearchExecutionContext context) throws IOException {
            // Simulate a partial failure: only even-numbered shards can fail, and only some of the time.
            if (frequently() && context.getShardId() % 2 == 0) {
                throw new IllegalArgumentException("simulated failure");
            } else {
                return new MatchAllDocsQuery();
            }
        }

        @Override
        protected boolean doEquals(ShardFailingQueryBuilder other) {
            return true;
        }

        @Override
        protected int doHashCode() {
            return 0;
        }
    }

    // Registers the failing query so it can be used by nodes in integration tests.
    public static class FailingQueryPlugin extends Plugin implements SearchPlugin {
        public FailingQueryPlugin() {}

        @Override
        public List<QuerySpec<?>> getQueries() {
            return List.of(
                new QuerySpec<QueryBuilder>(
                    ShardFailingQueryBuilder.NAME,
                    ShardFailingQueryBuilder::new,
                    ShardFailingQueryBuilder::fromXContent
                )
            );
        }
    }

And then have a test like:

    public void testLinearInnerRetrieverPartialSearchErrors() {
        final int rankWindowSize = 100;
        SearchSourceBuilder source = new SearchSourceBuilder();
        StandardRetrieverBuilder standard0 = new StandardRetrieverBuilder(new ShardFailingQueryBuilder());
        StandardRetrieverBuilder standard1 = new StandardRetrieverBuilder(new MatchAllQueryBuilder());
        source.retriever(
            new LinearRetrieverBuilder(
                Arrays.asList(
                    new CompoundRetrieverBuilder.RetrieverSource(standard0, null),
                    new CompoundRetrieverBuilder.RetrieverSource(standard1, null)
                ),
                rankWindowSize
            )
        );
        SearchRequestBuilder req = client().prepareSearch(INDEX).setSource(source);
        var resp = req.get();

elasticsearchmachine · 2025-10-20T14:10:35Z

Hi @john-wagster, I've updated the changelog YAML for you.

benwtrent · 2025-10-20T14:31:17Z

server/src/main/java/org/elasticsearch/search/retriever/CompoundRetrieverBuilder.java

-                            innerRetrievers.get(i).retriever().setRankDocs(rankDocs);
-                            topDocs.add(rankDocs);
+                            if (item.getResponse().getFailedShards() > 0) {
+                                statusCode = handleShardFailures(item.getResponse(), statusCode, failures);


Hey, by default, we allow partial results and return a 2xx. Does this break that? Meaning, if there is a failed shard, do we still return a 2xx by default?

We should:

- Allow partial results
- If partial results are desired, we should indicate in the final result that some shards failed, and return 2xx
- If partial results are NOT desired, we should return something other than 2xx
- If all shards failed, we should not return a 2xx and indicate the failure
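The rules above can be sketched as a small decision function. This is an illustration only, not the PR's implementation: `PartialResultsPolicy`, `Outcome`, and `decide` are hypothetical names, and the `allowPartialResults` flag is assumed to mirror the request-level `allow_partial_search_results` setting.

```java
public class PartialResultsPolicy {
    // Hypothetical outcomes: a clean 2xx, a 2xx that reports shard failures, or a non-2xx error.
    public enum Outcome { OK, OK_WITH_SHARD_FAILURES, ERROR }

    public static Outcome decide(int totalShards, int failedShards, boolean allowPartialResults) {
        if (failedShards == 0) {
            return Outcome.OK;
        }
        if (failedShards >= totalShards) {
            // Every shard failed: never a 2xx, regardless of the partial-results flag.
            return Outcome.ERROR;
        }
        // Some shards failed: 2xx with failures reported if partial results are allowed, error otherwise.
        return allowPartialResults ? Outcome.OK_WITH_SHARD_FAILURES : Outcome.ERROR;
    }

    public static void main(String[] args) {
        System.out.println(decide(5, 0, true));
        System.out.println(decide(5, 2, true));
        System.out.println(decide(5, 2, false));
        System.out.println(decide(5, 5, true));
    }
}
```

Checking the all-shards-failed case before consulting the flag keeps that rule independent of whether partial results were requested.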

john-wagster added 2 commits October 16, 2025 21:36

added a check for shard failures

888b6e5

spotless

169d4f5

john-wagster added the WIP label Oct 17, 2025

elasticsearchmachine added the v9.3.0 label Oct 17, 2025

john-wagster commented Oct 17, 2025

pmpailis reviewed Oct 17, 2025

pmpailis reviewed Oct 17, 2025

john-wagster added 3 commits October 17, 2025 14:57

passing back message and status code from each failed shard instead

4ed8ca2

added a mock test for now to at least get some coverage / validation of this logic

ababf26

spotless

aa7f868

john-wagster requested a review from pmpailis October 17, 2025 23:03

Merge branch 'main' into better_retriever_errors_on_shard_failure

211ce84

john-wagster marked this pull request as ready for review October 17, 2025 23:03

john-wagster added :Search Relevance/Search Catch all for Search Relevance and removed WIP labels Oct 17, 2025

elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Oct 17, 2025

john-wagster added >bug and removed Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch labels Oct 17, 2025

elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Oct 17, 2025

Update docs/changelog/136732.yaml

c1f8cd5

john-wagster added v9.2.1 v9.1.6 and removed Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch labels Oct 17, 2025

elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Oct 17, 2025

john-wagster added v8.20.0 v8.18.9 labels Oct 17, 2025

john-wagster added the v8.19.6 label Oct 17, 2025

john-wagster added auto-backport Automatically create backport pull requests when merged v9.0.9 v8.17.11 v8.18.9 and removed v8.17.11 v8.18.9 labels Oct 17, 2025

add issue

30597dc

john-wagster removed v9.0.9 v8.18.9 labels Oct 20, 2025

Update docs/changelog/136732.yaml

c0c467b

benwtrent reviewed Oct 20, 2025

elasticsearchmachine added v9.1.7 v8.19.7 and removed v9.1.6 v8.19.6 labels Oct 22, 2025
