
Conversation

benwtrent
Collaborator

We haven't actually been measuring the true pre-filter performance for Lucene kNN search.

Utilizing a BitSetIterator trips a VERY important shortcut that bypasses actually iterating the scorer, so the cost of fully realizing a filter never gets paid.

Here I am wrapping the iterator. Note the significant difference when filtering at 95%.
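
For context, the wrapping amounts to hiding the iterator's concrete type so the shortcut can't fire. A minimal sketch of the idea (not the exact patch; the class name is made up):

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

// Delegates every call, but is not a BitSetIterator, so the
// "copy the underlying bits directly" shortcut can no longer kick in.
final class DelegatingFilterIterator extends DocIdSetIterator {
  private final DocIdSetIterator in;

  DelegatingFilterIterator(DocIdSetIterator in) {
    this.in = in;
  }

  @Override
  public int docID() {
    return in.docID();
  }

  @Override
  public int nextDoc() throws IOException {
    return in.nextDoc();
  }

  @Override
  public int advance(int target) throws IOException {
    return in.advance(target);
  }

  @Override
  public long cost() {
    return in.cost();
  }
}
```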

baseline

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.761        0.266   0.254        0.955  100000    10      20       16        100         no      0.00      Infinity            0.09             1          297.50       292.969      292.969       HNSW

This PR (more accurate cost analysis)

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.761        1.006   0.835        0.830  100000    10      20       16        100         no      0.00      Infinity            0.13             1          297.50       292.969      292.969       HNSW

In this particular run, the majority of the time was spent simply putting the filter into a bitset, which I suspect is more realistic, as most typical users' pre-filters aren't simply bitset iterators.

For the curious, here is the jfr of the more realistic run: baseline_and_candidate_pre_filter_test.zip

@jpountz
Collaborator

jpountz commented Aug 27, 2025

I wonder if you should override intoBitSet to delegate to the wrapped iterator. This would copy bits in batches instead of one by one. This is something that happens not only when the iterator is a BitSetIterator, but also when it is a PostingsEnum that stores dense blocks as bit sets.
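
Sketching that suggestion against the hypothetical wrapper above; the intoBitSet signature shown is assumed from recent Lucene (10.2+), so double-check it against the version being benchmarked:

```java
// Added to the delegating wrapper above (requires: import org.apache.lucene.util.FixedBitSet;).
// Keep forwarding intoBitSet so BitSetIterators and dense PostingsEnum blocks
// still copy bits in batches instead of falling back to doc-by-doc iteration.
@Override
public void intoBitSet(int upTo, FixedBitSet bitSet, int offset) throws IOException {
  in.intoBitSet(upTo, bitSet, offset);
}
```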

@jpountz
Collaborator

jpountz commented Aug 27, 2025

(Said otherwise, I agree that we're currently over-estimating the performance of pre-filtering by enabling a rare optimization that is extremely effective, but in my opinion your patch is making us under-estimate the performance of pre-filtering by disabling an important optimization that kicks in in the standard case?)

@jpountz
Collaborator

jpountz commented Aug 27, 2025

This makes me wonder if we could somehow benchmark pre-filtering against a TermQuery as a filter to make it more realistic.

@benwtrent
Collaborator Author

This makes me wonder if we could somehow benchmark pre-filtering against a TermQuery as a filter to make it more realistic.

So, I am doing a run now for an IntField, doing a very big newSetQuery over all the ids matching the filter (yeah, it's expensive... but it's a real query).
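
Roughly what that filter looks like (field name, selectivity, and doc count are illustrative, not the actual benchmark setup):

```java
import java.util.stream.IntStream;
import org.apache.lucene.document.IntField;
import org.apache.lucene.search.Query;

// Pick ~95% of 1M ids as the filter set, then build one giant set query.
// Realizing this query as a bit set is the expensive part being measured.
int[] matchingIds = IntStream.range(0, 1_000_000)
    .filter(id -> id % 100 < 95)
    .toArray();
Query preFilter = IntField.newSetQuery("id", matchingIds);
```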

@benwtrent
Collaborator Author

Maybe the thing to do is to make the indexer aware of the filter percentage and just randomly set a "true/false" field accordingly.
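
Something along these lines on the indexing side (field name, seed, and selectivity knob are made up):

```java
import java.util.Random;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;

// Given the filter selectivity the indexer already knows about, tag each
// document as matching ("true") or not ("false") at random.
Random random = new Random(42);
double filterSelectivity = 0.95; // assumed knob, e.g. "filter at 95%"
Document doc = new Document();
doc.add(new StringField("filter",
    random.nextDouble() < filterSelectivity ? "true" : "false", Field.Store.NO));
```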

@benwtrent
Collaborator Author

LOL, doing a big term-set query increases latency to 70.758 ms from 2.490 ms (I upped my test data to 1M float32 vectors).

I will try a simple term "true/false" that is randomly distributed to see how that does.
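
The filter then becomes a plain term query over that field (a sketch; the field and value match the hypothetical indexing snippet above):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Dense, cheap-to-iterate pre-filter: match every doc tagged "true".
Query preFilter = new TermQuery(new Term("filter", "true"));
```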

@benwtrent
Collaborator Author

OK, doing a dense true/false filter is much cheaper at creating the bit set (though this setup isn't very extensible to Lucene util :/).

2.796 ms vs 2.116 ms, over 1M docs. JFR shows about 20% of the time is spent building the bit set, even with this very cheap filter.

@benwtrent
Collaborator Author

I wonder if eager evaluation of very dense filters scales logarithmically like HNSW search does.

I would expect not? Even if we grab chunks of docs at a time, anything that costs a non-trivial amount will end up harming throughput.

I am not 100% sure how to best reflect this easily in Lucene Util...

@jpountz
Collaborator

jpountz commented Aug 28, 2025

It's still linear unfortunately. But performance of loading filters based on postings lists into bit sets should be ~3x better since Lucene 10.2 (cf. annotations HS, HX and HY at https://benchmarks.mikemccandless.com/CountOrHighHigh.html).

I agree it's not easy to reflect. If someone uses a slow filter like a PhraseQuery, performance would be disastrous. But wouldn't we expect filters to be term queries most of the time? E.g. a filter on a category field, a filter on a tenant_id field, or a query on something like in_stock:true (which could match most of the index). I could imagine filters on ranges as well (e.g. filtering recent data). I'd expect these two (term and range queries) to cover a vast majority of use-cases?
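
For concreteness, the two filter shapes being described might look like this (field names and the 7-day window are made up):

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Term filter, e.g. a tenant_id / category / in_stock:true style filter.
Query termFilter = new TermQuery(new Term("tenant_id", "tenant-42"));

// Range filter over recent data, e.g. the last 7 days of a millisecond timestamp.
long now = System.currentTimeMillis();
Query rangeFilter = LongPoint.newRangeQuery("timestamp", now - 7L * 24 * 60 * 60 * 1000, now);
```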

@msokolov
Collaborator

I actually think it is helpful to have a benchmark that focuses on the performance of HNSW with a filter but does not include the creation of the filter, since I would expect use cases where the filter is a frequently-applied one that can be cached.

@benwtrent
Collaborator Author

I could imagine filters on ranges as well (e.g. filtering recent data). I'd expect these two (term and range queries) to cover a vast majority of use-cases?

For sure. In fact, I want to test a range filter over IDs to see how it behaves... let me actually do that test...

My concern is that we are leaving significant performance on the table by eagerly evaluating filters that end up matching many millions of vectors when we only need to do thousands of vector ops :/.

I have seen this dramatically impact performance in real datasets and use cases, where switching to a post-filter improved throughput 5-10x. Requiring the user to know whether a filter should be pre or post is sort of silly.

since I would expect to see use cases where filtering is done in a context where the filter is a frequently-applied filter that can be cached

No doubt, maybe just caching is the answer...that would be simpler than shoehorning a bunch of logic into the vector query.
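
A sketch of what "just caching" could mean using Lucene's stock query cache (sizes are arbitrary, and whether this fits the benchmark harness is a separate question):

```java
import java.nio.file.Path;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.LRUQueryCache;
import org.apache.lucene.search.UsageTrackingQueryCachingPolicy;
import org.apache.lucene.store.FSDirectory;

// Cache frequently-applied filters so the bit set is built once per segment
// instead of on every kNN query.
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(FSDirectory.open(Path.of("index"))));
searcher.setQueryCache(new LRUQueryCache(1_000, 64 * 1024 * 1024)); // max cached queries, max RAM bytes
searcher.setQueryCachingPolicy(new UsageTrackingQueryCachingPolicy());
```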
