Adding 3-ary LongHeap to speed up collectors like TopDoc*Collectors #15140
Conversation
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.
Fascinating! Did you just confirm that 3 performs better than 2, or were you also able to confirm that 3 works better than 4 or 5?
Thanks for checking the PR, Adrian! So far I've only benchmarked 2-ary vs 3-ary. I haven't run comparisons against 4 or 5 yet, but I'll add those runs and share the results here.
Hmm I tried to reproduce with wikibigall on my machine (AMD Ryzen 9 3900X), but luceneutil reports no speedup (nor slowdown).
I ran additional benchmarks for arity=4 and arity=5 on an i3.8xlarge EC2 instance.
With arity 4:
With arity 5:
Re-ran benchmarks with arity 3.
Run 1:
Run 2:
Thank you for running the cross-check, Adrian. I'll re-run the same wikibigall suite on a second (graviton) instance I have access to.
Ran benchmarks on:
arity-3:
arity-4:
arity-5:
Thank you, these benchmark results are quite convincing. I'm slightly uneasy about introducing loops that are unlikely to be unrolled; I could see that causing performance degradations in cases that we are not benchmarking here. Maybe keep LongHeap binary for now and introduce a TernaryLongHeap class? We can share code via static void upHeap(long[] heap, int i, int arity) helpers, and passing constants as the arity should help the compiler unroll the loops when it gets inlined. Let's only move top-hits collection to TernaryLongHeap for now and look into whether it also makes sense for vector search in a follow-up? What do you think?
Thanks @jpountz for reviewing this PR. I agree that it makes sense to keep LongHeap binary and introduce a separate TernaryLongHeap for now. We can then revisit vector search in a follow-up once we have benchmarks for that workload to validate whether it benefits from the same change.
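For context, here is a minimal sketch of what such shared static helpers could look like. It assumes 1-based indexing and min-heap ordering (matching the existing LongHeap) and is only an illustration of the idea discussed above, not code from this PR:

```java
// Illustrative sketch only (not the code from this PR). Both a binary LongHeap
// (arity = 2) and a TernaryLongHeap (arity = 3) could call these helpers with a
// literal arity, so that once the JIT inlines the call, the child loop has a
// constant bound. Assumes a 1-based, min-ordered heap layout.
final class NaryHeapOps {

  private NaryHeapOps() {}

  /** Sifts the value at slot i up until its parent is no larger. */
  static void upHeap(long[] heap, int i, int arity) {
    final long value = heap[i]; // save the value being sifted up
    while (i > 1) {
      final int parent = (i - 2) / arity + 1; // parent of slot i in a 1-based arity-ary heap
      if (value >= heap[parent]) {
        break;
      }
      heap[i] = heap[parent]; // shift the larger parent down
      i = parent;
    }
    heap[i] = value; // install the saved value at its final slot
  }

  /** Sifts the value at slot i down below any smaller child; size is the heap size. */
  static void downHeap(long[] heap, int size, int i, int arity) {
    final long value = heap[i];
    while (true) {
      final int firstChild = arity * (i - 1) + 2; // first child of slot i
      if (firstChild > size) {
        break; // slot i is a leaf
      }
      int smallest = firstChild;
      final int lastChild = Math.min(firstChild + arity - 1, size);
      for (int c = firstChild + 1; c <= lastChild; c++) { // constant-bound loop after inlining
        if (heap[c] < heap[smallest]) {
          smallest = c;
        }
      }
      if (value <= heap[smallest]) {
        break;
      }
      heap[i] = heap[smallest]; // shift the smallest child up
      i = smallest;
    }
    heap[i] = value;
  }
}
```

Because each call site would pass a literal 2 or 3 as the arity, the inner loop's bound becomes a constant after inlining, which is what makes unrolling plausible.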
I have implemented the following changes:
I re-ran benchmarks on a c8g.8xlarge (graviton) instance. Please re-review, @jpountz.
I left minor comments but the change looks good to me overall!
Thanks for reviewing again, @jpountz. I've addressed all the minor comments. Please re-review the PR.
The latest nightly runs show improved performance for the Term query: https://benchmarks.mikemccandless.com/Term.html
This is great! Would you like to open a PR against luceneutil to add an annotation? I'm looking forward to seeing whether this can help with vector search as well. cc @benwtrent @msokolov |
Did we do any testing with smaller topN (say 20 or 100)? I suspect we wouldn't see any improvement there, and might even see some loss. If that's true, we might want to gate this optimization on the value of N. |
But I don't want to be a negative Nelly here: this is a cool idea, and I love that it helps! I do suspect, though, that it won't help much with KNN search, since the typical values for K are smaller.
With oversampling (3x-5x) and quantized scoring (so the dominating cost of floating-point ops goes away), I have seen other administrivia of HNSW searching become more and more the cause of latency, so it would be interesting to see the latency numbers there. Like @msokolov, I wouldn't expect anything on regular float32 (though you never know until you benchmark it!).
The nightly benchmark runs with topN = 100, and we do see an improvement in performance there (see ref). I will run the benchmark with topN = 20 to check whether we observe a regression.
I ran benchmarks with arity=3 and topN=20 on wikibigall on an i3.8xlarge instance. The results show no statistically significant regressions with topN = 20.
Are they? I would expect some users to do vector search with k=1000 and then re-rank them. |
For this benchmark, I switched the setup accordingly. Please find the results below.
Thanks for checking w/ smaller K/N - glad there was no regression there.
Description
This PR updates LongHeap from a fixed 2-ary (binary) heap to a 3-ary heap (the code is written generically for an n-ary heap). The change improves cache locality and reduces the number of heap operations for larger heaps, which improves overall query performance.
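As rough intuition for the "reduces heap operations" claim, here is a small, self-contained sketch; it is an illustration under the assumption of a complete 1-based heap layout, not code from this PR:

```java
// Back-of-the-envelope illustration (not code from this PR) of why a higher arity
// helps: the depth of a complete arity-ary heap shrinks from roughly log2(size) to
// log3(size), so each push/pop walks fewer levels, and the children of a node sit
// in adjacent array slots, which is friendlier to the cache for large heaps.
public class HeapDepthDemo {

  /** Edges from the root to the deepest leaf of a complete arity-ary heap holding size nodes. */
  static int depth(int size, int arity) {
    int depth = 0;
    for (long nodes = 1; nodes < size; nodes = nodes * arity + 1) {
      depth++; // one more complete level is needed to hold size nodes
    }
    return depth;
  }

  public static void main(String[] args) {
    // For topN = 1000 (the setting benchmarked below): depth 9 at arity 2 vs depth 6 at arity 3.
    System.out.println("arity 2: depth " + depth(1000, 2));
    System.out.println("arity 3: depth " + depth(1000, 3));
  }
}
```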
Benchmarks
Luceneutil has been run with topN = 1000 (see mikemccand/luceneutil#357). Benchmark machine: i3.8xlarge.