
Conversation

benwtrent
Collaborator

We haven't actually been measuring the true pre-filter performance for Lucene kNN search.

Utilizing a BitSetIterator trips a VERY important shortcut that bypasses actually iterating the scorer, so the cost of fully realizing a filter never gets paid.

Here I am wrapping the iterator. Note the significant difference when filtering at 95%.
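
For context, the wrapping amounts to hiding the iterator's concrete type so the shortcut can't fire. A minimal sketch of the idea (not the exact patch; the class name is made up):

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

// Delegates every call, but is not a BitSetIterator, so the
// "copy the underlying bits directly" shortcut can no longer kick in.
final class DelegatingFilterIterator extends DocIdSetIterator {
  private final DocIdSetIterator in;

  DelegatingFilterIterator(DocIdSetIterator in) {
    this.in = in;
  }

  @Override
  public int docID() {
    return in.docID();
  }

  @Override
  public int nextDoc() throws IOException {
    return in.nextDoc();
  }

  @Override
  public int advance(int target) throws IOException {
    return in.advance(target);
  }

  @Override
  public long cost() {
    return in.cost();
  }
}
```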

baseline

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.761        0.266   0.254        0.955  100000    10      20       16        100         no      0.00      Infinity            0.09             1          297.50       292.969      292.969       HNSW

This PR (more accurate cost analysis)

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.761        1.006   0.835        0.830  100000    10      20       16        100         no      0.00      Infinity            0.13             1          297.50       292.969      292.969       HNSW

In this particular run, the majority of the time was spent simply putting the filter into a bitset, which I suspect is more realistic, as most typical users' pre-filters aren't simply bitset iterators.

For the curious, here is the jfr of the more realistic run: baseline_and_candidate_pre_filter_test.zip

@jpountz
Collaborator

jpountz commented Aug 27, 2025

I wonder if you should override intoBitSet to delegate to the wrapped iterator. This would copy bits in batches instead of one by one. This is something that happens not only when the iterator is a BitSetIterator, but also when it is a PostingsEnum that stores dense blocks as bit sets.
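
Sketching that suggestion against the hypothetical wrapper above; the intoBitSet signature shown is assumed from recent Lucene (10.2+), so double-check it against the version being benchmarked:

```java
// Added to the delegating wrapper above (requires: import org.apache.lucene.util.FixedBitSet;).
// Keep forwarding intoBitSet so BitSetIterators and dense PostingsEnum blocks
// still copy bits in batches instead of falling back to doc-by-doc iteration.
@Override
public void intoBitSet(int upTo, FixedBitSet bitSet, int offset) throws IOException {
  in.intoBitSet(upTo, bitSet, offset);
}
```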

@jpountz
Collaborator

jpountz commented Aug 27, 2025

(Said otherwise, I agree that we're currently over-estimating the performance of pre-filtering by enabling a rare optimization that is extremely effective, but in my opinion your patch is making us under-estimate the performance of pre-filtering by disabling an important optimization that kicks in in the standard case?)

@jpountz
Collaborator

jpountz commented Aug 27, 2025

This makes me wonder if we could somehow benchmark pre-filtering against a TermQuery as a filter to make it more realistic.

@benwtrent
Collaborator Author

This makes me wonder if we could somehow benchmark pre-filtering against a TermQuery as a filter to make it more realistic.

So, I am doing a run now for an IntField, doing a very big newSetQuery over all the ids matching the filter (yeah, it's expensive... but it's a real query).
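
Roughly what that filter looks like (field name, selectivity, and doc count are illustrative, not the actual benchmark setup):

```java
import java.util.stream.IntStream;
import org.apache.lucene.document.IntField;
import org.apache.lucene.search.Query;

// Pick ~95% of 1M ids as the filter set, then build one giant set query.
// Realizing this query as a bit set is the expensive part being measured.
int[] matchingIds = IntStream.range(0, 1_000_000)
    .filter(id -> id % 100 < 95)
    .toArray();
Query preFilter = IntField.newSetQuery("id", matchingIds);
```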

@benwtrent
Collaborator Author

Maybe the thing to do is to make the indexer aware of the filter percentage and just randomly set a "true/false" field accordingly.
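
Something along these lines on the indexing side (field name, seed, and selectivity knob are made up):

```java
import java.util.Random;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;

// Given the filter selectivity the indexer already knows about, tag each
// document as matching ("true") or not ("false") at random.
Random random = new Random(42);
double filterSelectivity = 0.95; // assumed knob, e.g. "filter at 95%"
Document doc = new Document();
doc.add(new StringField("filter",
    random.nextDouble() < filterSelectivity ? "true" : "false", Field.Store.NO));
```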

@benwtrent
Collaborator Author

LOL, doing a big term-set query increases latency to 70.758 ms from 2.490 ms (I upped my test data to 1M float32 vectors).

I will try a simple term "true/false" that is randomly distributed to see how that does.
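
The filter then becomes a plain term query over that field (a sketch; the field and value match the hypothetical indexing snippet above):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Dense, cheap-to-iterate pre-filter: match every doc tagged "true".
Query preFilter = new TermQuery(new Term("filter", "true"));
```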

@benwtrent
Collaborator Author

OK, doing a dense true/false filter is much cheaper at creating the bit set (though this setup isn't very extensible to Lucene util :/).

2.796 ms vs 2.116 ms, over 1M docs. JFR shows about 20% of the time is spent building the bit set, even with this very cheap filter.

@benwtrent
Collaborator Author

I wonder if eager evaluation of very dense filters scales logarithmically like HNSW search does.

I would expect not? Even if we grab chunks of docs at a time, anything that costs a non-trivial amount will end up harming throughput.

I am not 100% sure how to best reflect this easily in Lucene Util...

@jpountz
Collaborator

jpountz commented Aug 28, 2025

It's still linear unfortunately. But performance of loading filters based on postings lists into bit sets should be ~3x better since Lucene 10.2 (cf. annotations HS, HX and HY at https://benchmarks.mikemccandless.com/CountOrHighHigh.html).

I agree it's not easy to reflect. If someone uses a slow filter like a PhraseQuery, performance would be disastrous. But wouldn't we expect filters to be term queries most of the time? E.g. a filter on a category field, a filter on a tenant_id field, or a query on something like in_stock:true (which could match most of the index). I could imagine filters on ranges as well (e.g. filtering recent data). I'd expect these two (term and range queries) to cover a vast majority of use-cases?
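
For concreteness, the two filter shapes being described might look like this (field names and the 7-day window are made up):

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Term filter, e.g. a tenant_id / category / in_stock:true style filter.
Query termFilter = new TermQuery(new Term("tenant_id", "tenant-42"));

// Range filter over recent data, e.g. the last 7 days of a millisecond timestamp.
long now = System.currentTimeMillis();
Query rangeFilter = LongPoint.newRangeQuery("timestamp", now - 7L * 24 * 60 * 60 * 1000, now);
```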

@msokolov
Collaborator

I actually think it is helpful to have a benchmark that focuses on the performance of HNSW with a filter but does not include the creation of the filter, since I would expect use cases where the filter is a frequently-applied one that can be cached.

@benwtrent
Collaborator Author

I could imagine filters on ranges as well (e.g. filtering recent data). I'd expect these two (term and range queries) to cover a vast majority of use-cases?

For sure. In fact, I want to test a range filter over IDs to see how it behaves... let me actually do that test...

My concern is that we are leaving significant performance on the table by eagerly evaluating filters that end up matching many millions of vectors when we only need to do thousands of vector ops :/.

I have seen this dramatically impact performance in real datasets and use cases, where switching to a post-filter improved throughput 5-10x. Requiring the user to know whether a filter should be pre or post is sort of silly.

since I would expect to see use cases where filtering is done in a context where the filter is a frequently-applied filter that can be cached

No doubt, maybe just caching is the answer...that would be simpler than shoehorning a bunch of logic into the vector query.
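
A sketch of what "just caching" could mean using Lucene's stock query cache (sizes are arbitrary, and whether this fits the benchmark harness is a separate question):

```java
import java.nio.file.Path;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.LRUQueryCache;
import org.apache.lucene.search.UsageTrackingQueryCachingPolicy;
import org.apache.lucene.store.FSDirectory;

// Cache frequently-applied filters so the bit set is built once per segment
// instead of on every kNN query.
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(FSDirectory.open(Path.of("index"))));
searcher.setQueryCache(new LRUQueryCache(1_000, 64 * 1024 * 1024)); // max cached queries, max RAM bytes
searcher.setQueryCachingPolicy(new UsageTrackingQueryCachingPolicy());
```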
