Conversation

@john-wagster
Contributor

Exploring options for improving error handling in CompoundRetrieverBuilder, particularly for the case where a sub-retriever's response is 2xx but some shards failed to retrieve data.

addresses: #136529

@john-wagster
Contributor Author

@pmpailis curious if this is roughly what you were thinking in terms of improving error handling and if you had other thoughts about how this should behave or additional checks. What else would be nice to have here? I thought about enhancing the error message handling within the if (false == failures.isEmpty()) { block to include a list of the failure messages as well. How do you feel about that?
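
As a rough illustration of the idea of listing the failure messages in the error, a minimal sketch follows. It is not the PR's code: the class name, the `summarize` helper, and the message format are all hypothetical, and the per-shard failure reasons are assumed to be available as plain strings.

```java
import java.util.List;
import java.util.stream.Collectors;

public class FailureMessageSketch {

    // Hypothetical helper: builds one summary message from per-shard failure reasons.
    // Identical reasons reported by several shards are listed only once.
    static String summarize(String retrieverName, List<String> failureReasons) {
        String joined = failureReasons.stream()
            .distinct()
            .collect(Collectors.joining("; "));
        return "[" + retrieverName + "] search failed on " + failureReasons.size() + " shard(s): " + joined;
    }

    public static void main(String[] args) {
        // Example input is made up for illustration.
        System.out.println(summarize("linear", List.of("simulated failure", "simulated failure", "timeout")));
    }
}
```

De-duplicating the reasons keeps the message short when many shards fail for the same cause, while the shard count still shows how widespread the failure was.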

@pmpailis
Contributor

Thanks @john-wagster for picking this up! ❤️ Yeah, this is pretty much what I was thinking as well; I only have minor comments on the type of exception we would throw (to avoid always returning a 5xx).

Will take a look at adding a test case as well.
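
To make the concern about always returning a 5xx concrete, here is a minimal sketch of one possible rule for picking the status of the thrown exception. It assumes each shard failure has already been mapped to an HTTP status code; the class and method names are hypothetical, and this is not necessarily the rule the PR ends up using.

```java
import java.util.List;

public class StatusSelectionSketch {

    // Hypothetical rule: report the highest status code among the failures.
    // Failures caused only by bad requests then stay in the 4xx range,
    // while a single server-side fault still surfaces as a 5xx.
    static int selectStatus(List<Integer> failureStatuses) {
        if (failureStatuses.isEmpty()) {
            throw new IllegalArgumentException("at least one failure status is required");
        }
        return failureStatuses.stream().mapToInt(Integer::intValue).max().getAsInt();
    }

    public static void main(String[] args) {
        System.out.println(selectStatus(List.of(400, 404)));
        System.out.println(selectStatus(List.of(400, 503)));
    }
}
```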

@john-wagster john-wagster requested a review from pmpailis October 17, 2025 23:03
@john-wagster john-wagster marked this pull request as ready for review October 17, 2025 23:03
@john-wagster john-wagster added :Search Relevance/Search Catch all for Search Relevance and removed WIP labels Oct 17, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Oct 17, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@john-wagster john-wagster added >bug and removed Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch labels Oct 17, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Oct 17, 2025
@elasticsearchmachine
Collaborator

Hi @john-wagster, I've created a changelog YAML for you.

@john-wagster john-wagster added v9.2.1 v9.1.6 and removed Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch labels Oct 17, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Oct 17, 2025
@john-wagster
Contributor Author

@pmpailis I created a mock test, which helped with cleaning up the code. Let me know what you think about that and the current state of the code. I also realized I have no idea which version labels this should actually be applied to, so I just put them all on here. I think that's everything that's currently supported? Thoughts on what releases this should target would be welcome.

@john-wagster john-wagster added auto-backport Automatically create backport pull requests when merged v9.0.9 v8.17.11 v8.18.9 and removed v8.17.11 v8.18.9 labels Oct 17, 2025
@pmpailis
Contributor

Thanks @john-wagster! This looks really nice! In addition to using mocks, we could also register a custom query through a test plugin and use it as part of the integration tests (e.g. in LinearRetrieverIT#nodePlugins, RRFRetrieverBuilderIT#nodePlugins, etc.).

E.g.

// Test-only query that fails on some shards so that partial shard failures can be exercised.
private static class ShardFailingQueryBuilder extends AbstractQueryBuilder<ShardFailingQueryBuilder> {
    private static final String NAME = "shard_failing_query";

    private static ShardFailingQueryBuilder fromXContent(XContentParser parser) {
        return new ShardFailingQueryBuilder();
    }

    ShardFailingQueryBuilder() {}

    ShardFailingQueryBuilder(StreamInput in) throws IOException {
        super(in);
    }

    @Override
    public String getWriteableName() {
        return NAME;
    }

    @Override
    public TransportVersion getMinimalSupportedVersion() {
        return TransportVersion.current();
    }

    @Override
    protected void doWriteTo(StreamOutput out) throws IOException {
        // no state to serialize
    }

    @Override
    protected void doXContent(XContentBuilder builder, Params params) throws IOException {
        builder.startObject(NAME);
        builder.endObject();
    }

    @Override
    protected Query doToQuery(SearchExecutionContext context) throws IOException {
        // Fail (most of the time) on even-numbered shards; match everything otherwise.
        if (frequently() && context.getShardId() % 2 == 0) {
            throw new IllegalArgumentException("simulated failure");
        } else {
            return new MatchAllDocsQuery();
        }
    }

    @Override
    protected boolean doEquals(ShardFailingQueryBuilder other) {
        return true;
    }

    @Override
    protected int doHashCode() {
        return 0;
    }
}

// Test plugin that registers the failing query so integration tests can use it.
public static class FailingQueryPlugin extends Plugin implements SearchPlugin {
    public FailingQueryPlugin() {}

    @Override
    public List<QuerySpec<?>> getQueries() {
        return List.of(
            new QuerySpec<QueryBuilder>(
                ShardFailingQueryBuilder.NAME,
                ShardFailingQueryBuilder::new,
                ShardFailingQueryBuilder::fromXContent
            )
        );
    }
}

And then have a test like:

    public void testLinearInnerRetrieverPartialSearchErrors() {
        final int rankWindowSize = 100;
        SearchSourceBuilder source = new SearchSourceBuilder();
        StandardRetrieverBuilder standard0 = new StandardRetrieverBuilder(new ShardFailingQueryBuilder());
        StandardRetrieverBuilder standard1 = new StandardRetrieverBuilder(new MatchAllQueryBuilder());
        source.retriever(
            new LinearRetrieverBuilder(
                Arrays.asList(
                    new CompoundRetrieverBuilder.RetrieverSource(standard0, null),
                    new CompoundRetrieverBuilder.RetrieverSource(standard1, null)
                ),
                rankWindowSize
            )
        );
        SearchRequestBuilder req = client().prepareSearch(INDEX).setSource(source);
        var resp = req.get();          

@elasticsearchmachine
Collaborator

Hi @john-wagster, I've updated the changelog YAML for you.

innerRetrievers.get(i).retriever().setRankDocs(rankDocs);
topDocs.add(rankDocs);
if (item.getResponse().getFailedShards() > 0) {
statusCode = handleShardFailures(item.getResponse(), statusCode, failures);
Member

Hey, by default, we allow partial results and return a 2xx. Does this break that? Meaning, if there is a failed shard, do we still return a 2xx by default?

We should:

  • Allow partial results
  • If partial results are desired, we should indicate in the final result that some shards failed, and return 2xx
  • If partial results are NOT desired, we should return something other than 2xx
  • If all shards failed, we should not return a 2xx and indicate the failure
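
The four rules above can be sketched as a small decision function. This is only an illustration under stated assumptions: the `Outcome` enum, the `decide` method, and its parameters are hypothetical names, and `allowPartialResults` stands in for whatever setting the request uses to opt in or out of partial results (presumably `allow_partial_search_results`).

```java
public class PartialResultsPolicySketch {

    // Hypothetical outcomes: plain 2xx, 2xx that reports failed shards, or a non-2xx error.
    enum Outcome { OK, OK_WITH_SHARD_FAILURES, ERROR }

    static Outcome decide(int totalShards, int failedShards, boolean allowPartialResults) {
        if (failedShards < 0 || failedShards > totalShards) {
            throw new IllegalArgumentException("failedShards must be between 0 and totalShards");
        }
        if (failedShards == 0) {
            return Outcome.OK;                      // nothing failed
        }
        if (failedShards == totalShards) {
            return Outcome.ERROR;                   // all shards failed: never a 2xx
        }
        // Some, but not all, shards failed: the result depends on the partial-results setting.
        return allowPartialResults ? Outcome.OK_WITH_SHARD_FAILURES : Outcome.ERROR;
    }

    public static void main(String[] args) {
        System.out.println(decide(5, 2, true));
        System.out.println(decide(5, 2, false));
        System.out.println(decide(5, 5, true));
    }
}
```

Keeping the decision in one place makes it easy to check that a compound retriever and a plain search treat the same combination of failed shards and settings the same way.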

Labels

auto-backport Automatically create backport pull requests when merged >bug :Search Relevance/Search Catch all for Search Relevance Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v8.19.7 v8.20.0 v9.1.7 v9.2.1 v9.3.0

4 participants