Increase topN to 1,000. #357
base: main
Conversation
While applications often display only 20 to 100 hits, their candidate retrieval phase may retrieve many more hits before further reranking and selection happen. I would like to increase the topN in Lucene's nightly benchmarks to better reflect the sort of values that are actually used nowadays. This is a big change. With topN=100, disjunctive queries run as conjunctive queries in practice because the scorer quickly figures out that only documents that contain all terms may be competitive. This is no longer the case at topN=1,000 (see the sketch below).
This change is potentially controversial; I'm keen to get opinions.
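To make the pruning point concrete, here is a minimal sketch (not part of the benchmark code) of what the change boils down to on the Lucene API side; the index path, field name and terms are placeholders:

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class TopNDemo {
  public static void main(String[] args) throws Exception {
    // Placeholder path: point this at any existing Lucene index.
    try (DirectoryReader reader =
             DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
      IndexSearcher searcher = new IndexSearcher(reader);

      // A plain disjunction: a document matching any one of the terms is a candidate.
      Query disjunction = new BooleanQuery.Builder()
          .add(new TermQuery(new Term("body", "search")), BooleanClause.Occur.SHOULD)
          .add(new TermQuery(new Term("body", "engine")), BooleanClause.Occur.SHOULD)
          .build();

      // Small topN: the score threshold rises quickly, so dynamic pruning soon
      // skips documents that don't contain (nearly) all of the terms.
      TopDocs top100 = searcher.search(disjunction, 100);

      // Large topN: the threshold stays low for much longer, so many more
      // single-term matches remain competitive and get fully scored.
      TopDocs top1000 = searcher.search(disjunction, 1_000);

      // totalHits is a lower bound when pruning kicked in, which is one way
      // to observe how much work each run actually did.
      System.out.println(top100.totalHits + " vs " + top1000.totalHits);
    }
  }
}
```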
@rmuir @mikemccand I wonder if you have an opinion on this.
I don't know what impact it would have on benchmark times, but I know they are already substantial; I defer to Mike on that. Obviously a (potentially huge?) part of the time is just indexing, which is not impacted. Not sure if it would be possible to separate out "normal topN" from "big topN" either? It would be nice to track both use-cases: traditional search performance with a reasonable top-N, as used by e.g. embedded search cases, but also the more heavy-duty re-rank stuff. I don't know the logic well enough to know if it's possible or could be done in reasonable time. For the "big topN" maybe we wouldn't need to rerun all the search benchies, just some of them? I definitely don't think the increased topN would help for e.g. MultiTermQueries or similar.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
Oh no, sorry, I'm only now catching up on this. I'm not worried about the added runtime (I think lol). @jpountz how about we try the change (merge to `main`)?
I'm not worried about slowdowns either; time is dominated by slow queries like sloppy phrase queries or faceting queries. I guess that the main downside is losing coverage for the case when dynamic pruning is easy because k is small. For reference, this is motivated by the fact that Lucene does very well at TOP_10 but less well at TOP_1000 at https://tantivy-search.github.io/bench/.
I'll merge once apache/lucene#14630 is resolved.
In general, the "huge topN" may often be used in Elasticsearch's typical use case, but a real full-text search as used by most Lucene users won't pull a topN of 1,000. So, as Robert suggested: let's have 2 benchmarks for that.
@mikemccand I'm curious if you're allowed to share how many candidate hits are fetched from Lucene before being fed to rescorers on amazon.com?
It is indeed quite high, on the order of 2-3,000 hits pulled from the first phase (matching + simple relevance)... and then we (Amazon product search) do additional phases of more costly ranking + whittling down.

Running all the query tasks, for 20 JVM invocations, takes ~2.5 hours now. We could double-run every task for two different topN values.
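For readers unfamiliar with that two-phase setup, here is a rough sketch of its shape using Lucene's `QueryRescorer`; the queries, the 2,000 candidate count and the rescore weight are illustrative placeholders, not Amazon's actual ranking stack:

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryRescorer;
import org.apache.lucene.search.TopDocs;

public class TwoPhaseSearch {
  /**
   * Phase 1: cheap matching + simple relevance over the whole index,
   * pulling a wide candidate set (a few thousand hits).
   * Phase 2: rescore only those candidates with a more expensive query,
   * then keep the far smaller set that will actually be displayed.
   */
  static TopDocs searchThenRerank(IndexSearcher searcher,
                                  Query cheapFirstPass,
                                  Query expensiveRerank) throws IOException {
    TopDocs candidates = searcher.search(cheapFirstPass, 2_000);
    // Combined score = first-pass score + weight * rescore-query score for
    // candidates that match the rescore query; the weight of 2.0 is arbitrary.
    return QueryRescorer.rescore(searcher, candidates, expensiveRerank, 2.0, 100);
  }
}
```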
+1
Maybe at least we could run just one or two queries with small top-N to make sure that WAND-type optos still work? Not sure how hard it would be, but that's the "interesting" piece that it would be sad to lose. I feel like top-N size might be boring for a ton of the queries run here (Wildcards etc).
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!