Fix pre-filter performance testing to truly indicate cost #451
Conversation
I wonder if you should override |
(Said otherwise, I agree that we're currently over-estimating the performance of pre-filtering by enabling a rare optimization that is extremely effective, but in my opinion your patch is making us under-estimate the performance of pre-filtering by disabling an important optimization that kicks in in the standard case?)
This makes me wonder if we could somehow benchmark pre-filtering against a |
So, I am doing a run now for an IntField, doing a very big |
Maybe the thing to do is to take the percentage provided to the indexer and just randomly set a "true/false" field.
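A minimal sketch of that idea (the field name "filter" and the selectivity parameter are illustrative, not part of luceneutil): tag each document at index time with a random "true"/"false" keyword so that a plain term filter matches roughly the requested fraction of documents.

```java
import java.util.Random;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;

final class RandomBooleanFilterField {
  /** Adds a "true"/"false" keyword whose "true" value appears with the given probability. */
  static void addFilterField(Document doc, Random random, double selectivity) {
    String value = random.nextDouble() < selectivity ? "true" : "false";
    doc.add(new StringField("filter", value, Field.Store.NO)); // "filter" is a made-up field name
  }
}
```

A TermQuery on filter:true then acts as a pre-filter with a known selectivity but no special structure for the kNN query to shortcut.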
LOL, doing a big termset query increases latency to I will try a simple term "true/false" that is randomly distributed to see how that does. |
OK, doing a |
I wonder if eager evaluation of very dense filters scales logarithmically like HNSW search does. I would expect not? It seems like even if we grab chunks of docs at a time, anything that costs a non-trivial amount will end up harming throughput. I am not 100% sure how to best reflect this easily in Lucene Util...
It's still linear unfortunately. But performance of loading filters based on postings lists into bit sets should be ~3x better since Lucene 10.2 (cf. annotations HS, HX and HY at https://benchmarks.mikemccandless.com/CountOrHighHigh.html). I agree it's not easy to reflect. If someone uses a slow filter like a PhraseQuery, performance would be disastrous. But wouldn't we expect filters to be term queries most of the time? E.g. a filter on a |
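To make the linear cost concrete, here is a rough sketch of what eagerly realizing a pre-filter for one segment amounts to (my reading of the general shape, not Lucene's exact code): every matching doc id is pulled from the filter's iterator and recorded in a FixedBitSet, so the work grows with the number of filter matches rather than with the few thousand vector comparisons HNSW actually needs.

```java
import java.io.IOException;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;
import org.apache.lucene.util.FixedBitSet;

final class EagerFilterRealization {
  /** Materialize the filter's matches for one segment into a bit set (linear in match count). */
  static FixedBitSet realize(IndexSearcher searcher, Query filter, LeafReaderContext leaf)
      throws IOException {
    Weight weight =
        searcher.createWeight(searcher.rewrite(filter), ScoreMode.COMPLETE_NO_SCORES, 1f);
    FixedBitSet acceptedDocs = new FixedBitSet(leaf.reader().maxDoc());
    Scorer scorer = weight.scorer(leaf);
    if (scorer != null) {
      DocIdSetIterator matches = scorer.iterator();
      acceptedDocs.or(matches); // walks every matching doc id in the segment
    }
    return acceptedDocs;
  }
}
```

The 10.2 speedups mentioned above improve the constant factor of that loop, but the loop itself stays proportional to the number of matching docs.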
I actually think it is helpful to have a benchmark that is focused on the performance of HNSW with a filter, one that does not include the creation of the filter, since I would expect to see use cases where the filter is a frequently applied one that can be cached.
For sure. In fact, I want to test a range filter over IDs to see how it behaves... let me do this test actually. My concern is that we are leaving significant performance on the table by eagerly evaluating filters that end up matching many millions of vectors when we only need to do thousands of vector ops :/. I have seen this dramatically impact performance in real datasets and use cases, where switching to a post-filter improved throughput 5-10x. Requiring the user to know whether a filter should be pre or post is sort of silly.
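For reference, the kind of run meant here looks roughly like the sketch below (field names, bounds, and k are made up; IntField.newRangeQuery assumes a reasonably recent Lucene): a range query over an int ID field passed as the pre-filter of a KnnFloatVectorQuery.

```java
import java.io.IOException;

import org.apache.lucene.document.IntField;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

final class RangePreFilterExample {
  /** Run a kNN query pre-filtered by a range over an int "id" field. */
  static TopDocs search(IndexSearcher searcher, float[] queryVector) throws IOException {
    Query idRangeFilter = IntField.newRangeQuery("id", 0, 950_000); // field name and bounds are illustrative
    Query knn = new KnnFloatVectorQuery("vector", queryVector, 100, idRangeFilter);
    return searcher.search(knn, 100);
  }
}
```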
No doubt. Maybe just caching is the answer... that would be simpler than shoehorning a bunch of logic into the vector query.
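If caching is the answer, the standard setup would look something like the sketch below, assuming the filter is an ordinary query clause that the query cache is allowed to cache (whether the kNN filter path actually benefits depends on how the filter weight is created, so treat this as an illustration rather than a fix):

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.LRUQueryCache;
import org.apache.lucene.search.UsageTrackingQueryCachingPolicy;

final class FilterCacheSetup {
  /** Configure an LRU query cache so frequently reused filters get materialized once and reused. */
  static void configure(IndexSearcher searcher) {
    // maxSize = 1000 cached queries, maxRamBytesUsed = 64 MB; both numbers are illustrative
    searcher.setQueryCache(new LRUQueryCache(1_000, 64L * 1024 * 1024));
    searcher.setQueryCachingPolicy(new UsageTrackingQueryCachingPolicy());
  }
}
```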
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
We haven't actually been measuring the true pre-filter performance for Lucene kNN search.
Utilizing a `BitSetIterator` trips a VERY important shortcut that bypasses the actual iteration of the scorer, so the cost of fully realizing a filter is never incurred. Here I am wrapping the iterator (a minimal sketch of such a wrapper is at the end of this description). Note the significant difference when filtering at 95%.
baseline
This PR (more accurate cost analysis)
In this particular run, the majority of the time was spent simply putting the filter into a bit set, which I suspect is more realistic, as most typical users' pre-filters aren't simply bitset iterators.
For the curious, here is the jfr of the more realistic run: baseline_and_candidate_pre_filter_test.zip
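For reference, the wrapping trick described above amounts to something like the sketch below (the class name is made up; the real change is in the luceneutil patch): delegate every DocIdSetIterator method unchanged, but because the wrapper is not a `BitSetIterator`, the shortcut that grabs the underlying bit set directly can no longer apply and the filter genuinely has to be iterated.

```java
import java.io.IOException;

import org.apache.lucene.search.DocIdSetIterator;

/**
 * Delegates to the real filter iterator but hides its concrete type,
 * so BitSetIterator-specific shortcuts cannot kick in.
 */
final class TypeHidingDocIdSetIterator extends DocIdSetIterator {
  private final DocIdSetIterator in;

  TypeHidingDocIdSetIterator(DocIdSetIterator in) {
    this.in = in;
  }

  @Override
  public int docID() {
    return in.docID();
  }

  @Override
  public int nextDoc() throws IOException {
    return in.nextDoc();
  }

  @Override
  public int advance(int target) throws IOException {
    return in.advance(target);
  }

  @Override
  public long cost() {
    return in.cost();
  }
}
```

Benchmarking through such a wrapper pays the full cost of realizing the filter, which is what the numbers above reflect.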