-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Move HitQueue in TopScoreDocCollector to a LongHeap #14714
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea makes sense to me, I wouldn't have expected such a speedup when collecting the top 1000 hits, it's great. It's a bit ugly to pass null
as a HitQueue
in the constructor of TopScoreDocCollector
. Can we only keep method signatures on TopDocsCollector
and move the current impls to some other class?
Thanks for the suggestion!
FWIW passing a null PQ is mentioned in lucene/lucene/core/src/java/org/apache/lucene/search/TopDocsCollector.java Lines 25 to 28 in 6b3c3e4
I agree it is ugly to copy the large topDocs(int start, int howMany) so i was looking to extract PQ logics to a protected method so that it can be minimum override, but i'm not sure if we should touch (public methods signature not changed but extended classes could be affected) this public API class as this seems not to break the original intention of the design. In case you did not notice the java doc, i'd like to ask your suggestion again :)
|
I wasn't aware of this indeed. OK for passing null then, I agree that there may be sub classes that rely on this API in the wild. |
* Score is non-negative float so wo use floatToRawIntBits instead of {@link | ||
* NumericUtils#floatToSortableInt}. We do not assert score >= 0 here to allow pass negative float | ||
* to indicate totally non-competitive, e.g. {@link #LEAST_COMPETITIVE_CODE}. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit too subtle to my taste, could we either not have to deal with negative scores at all, or use NumericUtils#floatToSortableInt? FWIW, I believe that LEAST_COMPETITIVE_CODE
could use a score of 0 since Integer.MAX_VALUE is not an allowed doc ID?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I believe that LEAST_COMPETITIVE_CODE could use a score of 0 since Integer.MAX_VALUE is not an allowed doc ID?
The issue of using encode(Integer.MAX_VALUE , 0f)
is that topScore will be decoded as 0
topScore = DocScoreEncoder.toIntScore(topCode); |
Then score 0 will not be competitive as we are comparing score only.
if (score <= topScore) { |
I agree this contract is bit tricky, maybe we should just use NumericUtils#floatToSortableInt
.
final int docBase = context.docBase; | ||
final ScoreDoc after = this.after; | ||
final float afterScore; | ||
final int afterScore; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it actually help to track scores as sortable ints rather than floats? I had assumed we'd only encode them if they're between the after score and the top score?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made them int because all scores in collector are only used to compare and the only computation is Math.nextUp
which needs to convert float to int :)
I'll revert these change as this does not seem to make sense to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it this way, it looks simpler.
For reference, I was just looking at profiles of the Tantivy benchmark (https://tantivy-search.github.io/bench/) and it tends to run queries with fewer hits in total compared to Lucene's nightly benchmarks, so the overhead of reordering the heap is higher (since a higher percentage of hits trigger a reordering of the heap). So I expect this change to help, and not only term queries.
super(new HitQueue(numHits, true)); | ||
super(null); | ||
this.heap = new LongHeap(numHits); | ||
IntStream.range(0, numHits).forEach(_ -> heap.push(DocScoreEncoder.LEAST_COMPETITIVE_CODE)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: a for loop would be a bit more readable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe in a follow-up we can add a new ctor parameter to LongHeap
so that it accepts an initial value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old ctor used to create an empty heap so I added a new ctor.
Since the top-k heap appears to be a bottleneck for some queries, we could look into whether a radix heap would perform better than a binary heap in a follow-up. |
+1, that would be an interesting exploration! I rechecked benchmark against newest main (topN=1000), i plan to merge soon.
|
There seems to be some good speedups with topN=100 already: https://benchmarks.mikemccandless.com/Term.html. |
This tries to encode
ScoreDoc#score
andScoreDoc#doc
to a comparable long and use aLongHeap
instead ofHitQueue
. This seems to help apparently when i increasetopN = 1000
(mikemccand/luceneutil#357).Luceneutil (Baseline contains #14709.)
This branch is checkout from #14709 so there is some unrelated diff, i'll mark this draft until #14709 merged.