
Conversation

jpountz
Collaborator

@jpountz jpountz commented Apr 10, 2025

While applications often display only 20 to 100 hits, their candidate retrieval phase may retrieve many more hits before further reranking and selection happen. I would like to increase the topN in Lucene's nightly benchmarks to better reflect the sort of values that are actually used nowadays.

This is a big change. With topN=100, disjunctive queries run as conjunctive queries in practice because the scorer quickly figures out that only documents that contain all terms may be competitive. This is no longer the case at topN=1,000.
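The pruning effect described here can be illustrated with a toy top-k collector. This is a simplified model of dynamic pruning (in the spirit of WAND/MAXSCORE), not Lucene's actual implementation: it assumes each candidate document comes with a score upper bound, and skips documents whose bound cannot beat the current k-th best score. With a small k the threshold climbs quickly and far more documents can be skipped than with a large k:

```python
import heapq
import random

def count_pruned(scores, k):
    """Collect top-k scores with a min-heap; count how many documents
    can be skipped because their score upper bound is at or below the
    current k-th best score (the dynamic-pruning threshold)."""
    heap = []  # min-heap of the k best scores seen so far
    pruned = 0
    for upper_bound, score in scores:
        threshold = heap[0] if len(heap) >= k else float("-inf")
        if upper_bound <= threshold:
            pruned += 1  # cannot possibly enter the top-k: skip
            continue
        if len(heap) < k:
            heapq.heappush(heap, score)
        elif score > heap[0]:
            heapq.heapreplace(heap, score)
    return pruned

random.seed(42)
# (upper_bound, actual_score) pairs; the bound overestimates the score,
# loosely mimicking per-term max-score impacts.
docs = [(s + random.random(), s) for s in (random.random() for _ in range(100_000))]
# A small k lets the threshold climb quickly, so many more docs are pruned.
print("k=10:", count_pruned(docs, 10), "k=1000:", count_pruned(docs, 1000))
```

The numbers are synthetic, but the shape of the result matches the point above: the larger k is, the longer the threshold stays low and the fewer documents a disjunction can skip.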

@jpountz
Collaborator Author

jpountz commented Apr 10, 2025

This change is potentially controversial, I'm keen to get opinions.

@jpountz
Collaborator Author

jpountz commented Apr 24, 2025

@rmuir @mikemccand I wonder if you have an opinion on this.

@rmuir
Collaborator

rmuir commented Apr 24, 2025

I don't know what impact it would have on benchmark times, but I know those are already substantial; I defer to Mike on that. Obviously a (potentially huge?) part of the time is just indexing, which is not impacted.

Not sure if it would be possible to separate out "normal topN" from "big topN" either? It would be nice to track both use cases: traditional search performance, as used by e.g. embedded search cases with a reasonable top-N, but also more heavy-duty re-ranking stuff.

I don't know the logic well enough to know if it's possible or could be done in reasonable time. For the "big topN", maybe we wouldn't need to rerun all the search benchies, just some of them? I definitely don't think the increased topN would help for e.g. MultiTermQueries or similar.


github-actions bot commented May 9, 2025

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label May 9, 2025
@mikemccand
Owner

Oh no, sorry, I'm only now catching up on this.

I'm not worried about the added runtime (I think lol). @jpountz how about we try the change (merge to main) and see what we learn?

@github-actions github-actions bot removed the Stale label May 10, 2025
@jpountz
Collaborator Author

jpountz commented May 11, 2025

I'm not worried about slowdowns either; time is dominated by slow queries like sloppy phrase queries or faceting queries. I guess the main downside is losing coverage for the case where dynamic pruning is easy because k is small.

For reference, this is motivated by the fact that Lucene does very well at TOP_10 but less well at TOP_1000 at https://tantivy-search.github.io/bench/.

@jpountz
Collaborator Author

jpountz commented May 11, 2025

I'll merge once apache/lucene#14630 is resolved.

@uschindler

In general, a "huge topN" may often be used in Elasticsearch's typical use cases, but real full-text search, as used by most Lucene users, won't pull a topN of 1,000.

So as Robert suggested: Let's have 2 benchmarks for that.

@jpountz
Collaborator Author

jpountz commented May 12, 2025

@mikemccand I'm curious if you're allowed to share how many candidate hits are fetched from Lucene before being fed to rescorers on amazon.com?

@mikemccand
Owner

> @mikemccand I'm curious if you're allowed to share how many candidate hits are fetched from Lucene before being fed to rescorers on amazon.com?

It is indeed quite high, on the order of 2-3,000 hits pulled from the first phase (matching + simple relevance)... and then we (Amazon product search) do additional phases of more costly ranking + whittling down. It (the topN for the phase-0 search) does depend on how many shards are in the index.

Running all the query tasks, for 20 JVM invocations, takes ~2.5 hours now. We could double-run every task for two topN and maybe the added slowdown could be OK?

> I'll merge once apache/lucene#14630 is resolved.

+1
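The multi-phase setup described above (a cheap first pass retrieving a few thousand candidates, then costlier reranking whittling them down) can be sketched as a toy example. The phase sizes and scoring functions here are illustrative stand-ins, not Amazon's or Lucene's actual pipeline:

```python
# Toy two-phase ranking: a cheap first pass retrieves a large candidate
# set, then a costlier second pass reranks it down to the hits that are
# actually displayed. Candidate/display sizes here are illustrative.

def first_phase(index, query, candidates):
    # Cheap first-phase score: number of query terms matched.
    scored = [(sum(t in doc for t in query), doc_id)
              for doc_id, doc in enumerate(index)]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:candidates]]

def second_phase(index, query, doc_ids, top):
    # Stand-in for an expensive reranker: weight matched terms by length.
    def rerank_score(doc_id):
        return sum(len(t) for t in query if t in index[doc_id])
    return sorted(doc_ids, key=rerank_score, reverse=True)[:top]

index = [{"fast", "search"}, {"fast"}, {"search", "engine"}, {"engine"}]
query = {"fast", "search"}
candidates = first_phase(index, query, candidates=3)  # "2-3,000" in practice
hits = second_phase(index, query, candidates, top=1)  # e.g. the displayed page
print(candidates, hits)
```

The benchmark question in this thread is about the first phase only: how large its `candidates` value should be for the measurement to reflect real deployments.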

@rmuir
Collaborator

rmuir commented May 12, 2025

Maybe at least we could run just one or two queries with a small top-N to make sure that WAND-type optimizations still work?

Not sure how hard that would be, but that's the "interesting" piece that it would be sad to lose. I feel like top-N size might be boring for a ton of the queries run here (wildcards, etc.).

