
Conversation

jpountz
Collaborator

@jpountz jpountz commented Apr 10, 2025

While applications often display only 20 to 100 hits, their candidate retrieval phase may retrieve many more hits before further reranking and selection happen. I would like to increase the topN in Lucene's nightly benchmarks to better reflect the sort of values that are actually used nowadays.

This is a big change. With topN=100, disjunctive queries run as conjunctive queries in practice because the scorer quickly figures out that only documents that contain all terms may be competitive. This is no longer the case at topN=1,000.
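The pruning effect described here can be illustrated with a toy top-k collector. This is a simplified model of dynamic pruning (in the spirit of WAND/MAXSCORE), not Lucene's actual implementation: it assumes each candidate document comes with a score upper bound, and skips documents whose bound cannot beat the current k-th best score. With a small k the threshold climbs quickly and far more documents can be skipped than with a large k:

```python
import heapq
import random

def count_pruned(scores, k):
    """Collect top-k scores with a min-heap; count how many documents
    can be skipped because their score upper bound is at or below the
    current k-th best score (the dynamic-pruning threshold)."""
    heap = []  # min-heap of the k best scores seen so far
    pruned = 0
    for upper_bound, score in scores:
        threshold = heap[0] if len(heap) >= k else float("-inf")
        if upper_bound <= threshold:
            pruned += 1  # cannot possibly enter the top-k: skip
            continue
        if len(heap) < k:
            heapq.heappush(heap, score)
        elif score > heap[0]:
            heapq.heapreplace(heap, score)
    return pruned

random.seed(42)
# (upper_bound, actual_score) pairs; the bound overestimates the score,
# loosely mimicking per-term max-score impacts.
docs = [(s + random.random(), s) for s in (random.random() for _ in range(100_000))]
# A small k lets the threshold climb quickly, so many more docs are pruned.
print("k=10:", count_pruned(docs, 10), "k=1000:", count_pruned(docs, 1000))
```

The numbers are synthetic, but the shape of the result matches the point above: the larger k is, the longer the threshold stays low and the fewer documents a disjunction can skip.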

@jpountz
Collaborator Author

jpountz commented Apr 10, 2025

This change is potentially controversial, I'm keen to get opinions.

@jpountz
Collaborator Author

jpountz commented Apr 24, 2025

@rmuir @mikemccand I wonder if you have an opinion on this.

@rmuir
Collaborator

rmuir commented Apr 24, 2025

I don't know what impact it would have on benchmark times, but I know those are already substantial; I defer to Mike on that. Obviously a (potentially huge?) part of the time is just indexing, which is not impacted.

Not sure if it would be possible to separate out "normal topN" from "big topN" either? It would be nice to track both use cases: traditional search performance, as used by e.g. embedded search cases with a reasonable top-N, but also more heavy-duty re-ranking stuff.

I don't know the logic well enough to know if it's possible or could be done in reasonable time. For the "big topN", maybe we wouldn't need to rerun all the search benchies, just some of them? I definitely don't think the increased topN would help for e.g. MultiTermQueries or similar.


github-actions bot commented May 9, 2025

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label May 9, 2025
@mikemccand
Owner

Oh no, sorry, I'm only now catching up on this.

I'm not worried about the added runtime (I think lol). @jpountz how about we try the change (merge to main) and see what we learn?

@github-actions github-actions bot removed the Stale label May 10, 2025
@jpountz
Collaborator Author

jpountz commented May 11, 2025

I'm not worried about slowdowns either; time is dominated by slow queries like sloppy phrase queries or faceting queries. I guess the main downside is losing coverage for the case where dynamic pruning is easy because k is small.

For reference, this is motivated by the fact that Lucene does very well at TOP_10 but less well at TOP_1000 at https://tantivy-search.github.io/bench/.

@jpountz
Collaborator Author

jpountz commented May 11, 2025

I'll merge once apache/lucene#14630 is resolved.

@uschindler

In general, a "huge topN" may often be used in Elasticsearch's typical use cases, but real full-text search, as used by most Lucene users, won't pull a topN of 1,000.

So as Robert suggested: Let's have 2 benchmarks for that.

@jpountz
Collaborator Author

jpountz commented May 12, 2025

@mikemccand I'm curious if you're allowed to share how many candidate hits are fetched from Lucene before being fed to rescorers on amazon.com?

@mikemccand
Owner

> @mikemccand I'm curious if you're allowed to share how many candidate hits are fetched from Lucene before being fed to rescorers on amazon.com?

It is indeed quite high, on the order of 2-3,000 hits pulled from the first phase (matching + simple relevance)... and then we (Amazon product search) do additional phases of more costly ranking + whittling down. It (the topN for the phase-0 search) does depend on how many shards are in the index.

Running all the query tasks, for 20 JVM invocations, takes ~2.5 hours now. We could double-run every task for two topN and maybe the added slowdown could be OK?

> I'll merge once apache/lucene#14630 is resolved.

+1
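The multi-phase setup described above (a cheap first pass retrieving a few thousand candidates, then costlier reranking whittling them down) can be sketched as a toy example. The phase sizes and scoring functions here are illustrative stand-ins, not Amazon's or Lucene's actual pipeline:

```python
# Toy two-phase ranking: a cheap first pass retrieves a large candidate
# set, then a costlier second pass reranks it down to the hits that are
# actually displayed. Candidate/display sizes here are illustrative.

def first_phase(index, query, candidates):
    # Cheap first-phase score: number of query terms matched.
    scored = [(sum(t in doc for t in query), doc_id)
              for doc_id, doc in enumerate(index)]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:candidates]]

def second_phase(index, query, doc_ids, top):
    # Stand-in for an expensive reranker: weight matched terms by length.
    def rerank_score(doc_id):
        return sum(len(t) for t in query if t in index[doc_id])
    return sorted(doc_ids, key=rerank_score, reverse=True)[:top]

index = [{"fast", "search"}, {"fast"}, {"search", "engine"}, {"engine"}]
query = {"fast", "search"}
candidates = first_phase(index, query, candidates=3)  # "2-3,000" in practice
hits = second_phase(index, query, candidates, top=1)  # e.g. the displayed page
print(candidates, hits)
```

The benchmark question in this thread is about the first phase only: how large its `candidates` value should be for the measurement to reflect real deployments.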

@rmuir
Collaborator

rmuir commented May 12, 2025

Maybe at least we could run just one or two queries with a small top-N to make sure that WAND-type optimizations still work?

Not sure how hard that would be, but that's the "interesting" piece that it would be sad to lose. I feel like top-N size might be boring for a ton of the queries run here (wildcards, etc.).

