Implement BlockedBloomFilter #127

k-jingyang · 2025-04-19T02:40:33Z

Solves #78

It's still WIP, but I wanted to raise the PR early for feedback, especially about organising src/bloom into src/bloom/blocked and src/bloom/standard.

k-jingyang · 2025-04-19T03:01:10Z

Sharing the benchmark on my PC

$ cargo bench -- "bloom filter"

     Running benches/bloom.rs (target/release/deps/bloom-2e83c48a73000131)
bloom filter add key    time:   [411.51 ns 412.41 ns 413.53 ns]
                        change: [-6.7615% -5.9719% -5.1901%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  5 (5.00%) high mild
  9 (9.00%) high severe

bloom filter contains key, true positive (1%)
                        time:   [41.095 ns 41.566 ns 42.132 ns]
                        change: [-15.853% -12.870% -9.9857%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) high mild
  13 (13.00%) high severe

bloom filter contains key, true positive (0.1%)
                        time:   [48.389 ns 49.241 ns 50.302 ns]
                        change: [-15.548% -11.948% -7.9643%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

bloom filter contains key, true positive (0.01%)
                        time:   [56.502 ns 57.368 ns 58.473 ns]
                        change: [-17.297% -13.599% -9.7068%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe

bloom filter contains key, true positive (0.0009999999%)
                        time:   [62.768 ns 63.557 ns 64.553 ns]
                        change: [-17.250% -14.014% -10.462%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

bloom filter add key - blocked bloom filter
                        time:   [408.43 ns 411.52 ns 415.83 ns]
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) high mild
  11 (11.00%) high severe

bloom filter contains key, true positive (1%) - blocked bloom filter
                        time:   [35.358 ns 35.803 ns 36.400 ns]
                        change: [-19.207% -15.541% -11.769%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  7 (7.00%) high mild
  11 (11.00%) high severe

bloom filter contains key, true positive (0.1%) - blocked bloom filter
                        time:   [43.504 ns 43.904 ns 44.432 ns]
                        change: [-16.757% -13.042% -8.3987%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) high mild
  10 (10.00%) high severe

bloom filter contains key, true positive (0.01%) - blocked bloom filter
                        time:   [47.467 ns 48.095 ns 48.989 ns]
                        change: [-16.809% -13.030% -9.5106%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  4 (4.00%) high mild
  14 (14.00%) high severe

bloom filter contains key, true positive (0.0009999999%) - blocked bloom filter
                        time:   [50.597 ns 50.977 ns 51.449 ns]
                        change: [-17.290% -13.593% -9.2939%] (p = 0.00 < 0.05)
                        Performance has improved.

marvin-j97 · 2025-04-19T11:53:44Z

Would you mind basing this on the 3.0.0 branch? I don't plan on adding blocked bloom filters in V2 anyway, and I already refactored the module tree to allow multiple filter types (src/segment/filter) in there.

k-jingyang · 2025-04-19T14:28:39Z

> Would you mind basing this on the 3.0.0 branch? I don't plan on adding blocked bloom filters in V2 anyway, and I already refactored the module tree to allow multiple filter types (src/segment/filter) in there.

Sure, will do. Thanks!

k-jingyang · 2025-04-20T04:51:19Z

I've updated the PR to be based on the 3.0.0 branch. Here are the benchmarks. Note that the benchmarking parameters changed compared to my benchmark above, hence the difference in the standard bloom filter numbers.

$ cargo bench -- ".+ bloom filter"

     Running benches/bloom.rs (target/release/deps/bloom-9b9f9b6ef85c0cd2)
standard bloom filter add key
                        time:   [533.32 ns 535.43 ns 538.80 ns]
                        change: [-5.1674% -3.9522% -2.7998%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  3 (3.00%) high mild
  14 (14.00%) high severe

standard bloom filter contains key, true positive (1%)
                        time:   [145.00 ns 145.68 ns 146.52 ns]
                        change: [-2.8564% -0.8346% +1.2524%] (p = 0.43 > 0.05)
                        No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
  2 (2.00%) high mild
  15 (15.00%) high severe

standard bloom filter contains key, true positive (0.1%)
                        time:   [187.06 ns 192.66 ns 199.79 ns]
                        change: [-5.0150% -3.0765% -0.9774%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

standard bloom filter contains key, true positive (0.01%)
                        time:   [249.35 ns 250.61 ns 252.20 ns]
                        change: [-2.2533% -1.1636% -0.0419%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) high mild
  14 (14.00%) high severe

standard bloom filter contains key, true positive (0.0009999999%)
                        time:   [275.90 ns 276.61 ns 277.49 ns]
                        change: [-8.7734% -5.6394% -2.8336%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe

blocked bloom filter add key
                        time:   [498.51 ns 502.83 ns 508.08 ns]
                        change: [-1.3098% -0.2746% +0.7744%] (p = 0.62 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe

blocked bloom filter contains key, true positive (1%)
                        time:   [77.661 ns 78.615 ns 79.697 ns]
                        change: [-6.7851% -3.2400% +0.3533%] (p = 0.08 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

blocked bloom filter contains key, true positive (0.1%)
                        time:   [101.62 ns 102.36 ns 103.20 ns]
                        change: [-6.3367% -3.3385% -0.3909%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  6 (6.00%) high mild
  8 (8.00%) high severe

blocked bloom filter contains key, true positive (0.01%)
                        time:   [120.55 ns 124.92 ns 130.13 ns]
                        change: [+2.7154% +6.9422% +11.441%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

blocked bloom filter contains key, true positive (0.0009999999%)
                        time:   [125.38 ns 130.16 ns 135.27 ns]
                        change: [-0.6506% +2.7169% +5.9467%] (p = 0.10 > 0.05)
                        No change in performance detected.

k-jingyang · 2025-04-20T05:28:15Z

The next steps in the PR will be:

- An AMQFilter trait for use in the segment (src/segment/inner.rs)
- A generic builder/factory to decode the appropriate filter based on Reader
  - I'm assuming that each filter type can encode different variables, thus the decode implementation will have to be provided by each filter type

Please correct me if I'm wrong, or if you have other ideas.
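
As a rough sketch of the plan above (not the code in this PR): a trait implemented by each filter type, plus a single entry point that reads a type tag and hands off to the matching per-type decoder. The trait methods, the one-byte tag values, the length-prefixed payload, and the 64-byte block size are all hypothetical choices made for illustration; the real encoding may look different.

```rust
use std::io::{self, Read};

/// Hypothetical trait; the actual method set of AMQFilter is not shown in this thread.
pub trait AmqFilter {
    fn kind(&self) -> &'static str;
    fn byte_len(&self) -> usize;
    fn hash_count(&self) -> u8;
}

pub struct StandardFilter { bits: Vec<u8>, k: u8 }
pub struct BlockedFilter { blocks: Vec<u8>, k: u8 }

fn read_u8<R: Read>(r: &mut R) -> io::Result<u8> {
    let mut b = [0u8; 1];
    r.read_exact(&mut b)?;
    Ok(b[0])
}

/// Reads an assumed little-endian u32 length followed by that many bytes.
fn read_payload<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut len = [0u8; 4];
    r.read_exact(&mut len)?;
    let mut buf = vec![0u8; u32::from_le_bytes(len) as usize];
    r.read_exact(&mut buf)?;
    Ok(buf)
}

impl StandardFilter {
    /// Called after the type tag has been consumed.
    fn decode_from<R: Read>(r: &mut R) -> io::Result<Self> {
        let k = read_u8(r)?;
        let bits = read_payload(r)?;
        Ok(Self { bits, k })
    }
}

impl BlockedFilter {
    /// Same header as the standard filter here, but with its own validation:
    /// the payload must be a whole number of (assumed) 64-byte blocks.
    fn decode_from<R: Read>(r: &mut R) -> io::Result<Self> {
        let k = read_u8(r)?;
        let blocks = read_payload(r)?;
        if blocks.len() % 64 != 0 {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "partial block"));
        }
        Ok(Self { blocks, k })
    }
}

impl AmqFilter for StandardFilter {
    fn kind(&self) -> &'static str { "standard" }
    fn byte_len(&self) -> usize { self.bits.len() }
    fn hash_count(&self) -> u8 { self.k }
}

impl AmqFilter for BlockedFilter {
    fn kind(&self) -> &'static str { "blocked" }
    fn byte_len(&self) -> usize { self.blocks.len() }
    fn hash_count(&self) -> u8 { self.k }
}

/// Reads the (hypothetical) type tag and dispatches to the matching decoder.
pub fn decode_filter<R: Read>(r: &mut R) -> io::Result<Box<dyn AmqFilter + Send + Sync>> {
    match read_u8(r)? {
        0 => Ok(Box::new(StandardFilter::decode_from(r)?)),
        1 => Ok(Box::new(BlockedFilter::decode_from(r)?)),
        tag => Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("unknown filter type {tag}"),
        )),
    }
}
```

Keeping each decode_from next to its filter type lets every type validate its own layout, while callers only ever see the boxed trait object.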

marvin-j97 · 2025-04-20T10:12:19Z

Sounds good to me

One more thing is comparing the FPR of the bloom filters. Blocked should have slightly higher FPR.
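
One possible way to compare the two empirically is sketched below. It is a toy model, not the crate's implementation: the hashing (std's DefaultHasher), the 512-bit block size, and the bit layout are assumptions made only to keep the example self-contained. Both variants get the same memory budget and the same number of probes; only the placement of the probed bits differs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Assumed block size: one 64-byte cache line.
const BLOCK_BITS: u64 = 512;

struct Filter {
    bits: Vec<bool>,
    k: u64,
    blocked: bool,
}

/// Two hashes per key for double hashing; h2 is forced odd.
fn hashes(key: u64) -> (u64, u64) {
    let mut a = DefaultHasher::new();
    key.hash(&mut a);
    let mut b = DefaultHasher::new();
    (key, 0x9E37_79B9u64).hash(&mut b);
    (a.finish(), b.finish() | 1)
}

impl Filter {
    fn new(m_bits: u64, k: u64, blocked: bool) -> Self {
        // Round up to whole blocks so both variants use the same amount of memory.
        let m = (m_bits + BLOCK_BITS - 1) / BLOCK_BITS * BLOCK_BITS;
        Self { bits: vec![false; m as usize], k, blocked }
    }

    fn positions(&self, key: u64) -> Vec<usize> {
        let (h1, h2) = hashes(key);
        let m = self.bits.len() as u64;
        (0..self.k)
            .map(|i| {
                let h = h1.wrapping_add(i.wrapping_mul(h2));
                let pos = if self.blocked {
                    // The block is picked once per key; every probe stays inside it.
                    let block = (h1 >> 32) % (m / BLOCK_BITS);
                    block * BLOCK_BITS + h % BLOCK_BITS
                } else {
                    // Probes are spread over the whole bit array.
                    h % m
                };
                pos as usize
            })
            .collect()
    }

    fn insert(&mut self, key: u64) {
        for p in self.positions(key) {
            self.bits[p] = true;
        }
    }

    fn contains(&self, key: u64) -> bool {
        self.positions(key).into_iter().all(|p| self.bits[p])
    }
}

/// Inserts keys 0..n, then returns the fraction of never-inserted keys reported as present.
fn measure_fpr(blocked: bool, n: u64, bits_per_key: u64, k: u64) -> f64 {
    let mut f = Filter::new(n * bits_per_key, k, blocked);
    for key in 0..n {
        f.insert(key);
    }
    let probes = 10 * n;
    let false_positives = (n..n + probes).filter(|&key| f.contains(key)).count();
    false_positives as f64 / probes as f64
}
```

Calling measure_fpr(false, n, bits_per_key, k) and measure_fpr(true, n, bits_per_key, k) with identical arguments gives two directly comparable rates. The exact numbers depend on the hash function and parameters, so they say nothing about the filters in this PR.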

k-jingyang · 2025-04-24T04:06:29Z

src/segment/filter/mod.rs

+pub struct AMQFilterBuilder {}
+
+impl AMQFilterBuilder {
+    pub fn decode_from<R: Read>(reader: &mut R) -> Result<Box<dyn AMQFilter + Sync>, DecodeError> {


I couldn't use impl Decode, because we're not returning an AMQFilterBuilder here.

Otherwise, the method signature is similar.

k-jingyang · 2025-04-24T04:07:37Z

src/segment/filter/standard_bloom/mod.rs

+#[allow(clippy::len_without_is_empty)]
+impl StandardBloomFilter {
+    // To be used by AMQFilter after magic bytes and filter type have been read and parsed
+    pub(super) fn decode_from<R: Read>(reader: &mut R) -> Result<Self, DecodeError> {


The method signature is similar to decode_from in Decode. I changed it to pub(super) visibility because it's only meant to be used by AMQFilterBuilder.

It feels slightly weird to reuse DecodeError here, though. I'm not sure if we should use another type of error instead.

k-jingyang · 2025-04-24T04:12:07Z

src/segment/filter/mod.rs

+    }
+}
+
+pub trait AMQFilter: Sync + Send {


Adding Sync + Send here, otherwise we would get errors from blob_drop_after_flush
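
To illustrate why the bound matters, here is a minimal sketch that assumes the filter ends up behind a shared pointer used from several threads (an assumption; this thread does not show how the segment stores it). The trait below is a hypothetical stand-in, and LenFilter is a toy type that only exists to make the example compile.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the PR's trait; only the supertrait bound matters here.
trait AmqFilter: Send + Sync {
    fn contains(&self, key: &[u8]) -> bool;
}

/// Toy filter that "contains" every key of one specific length.
struct LenFilter(usize);

impl AmqFilter for LenFilter {
    fn contains(&self, key: &[u8]) -> bool {
        key.len() == self.0
    }
}

/// Queries the same filter from one thread per key.
fn query_in_parallel(filter: Arc<dyn AmqFilter>, keys: Vec<Vec<u8>>) -> Vec<bool> {
    let handles: Vec<_> = keys
        .into_iter()
        .map(|key| {
            let f = Arc::clone(&filter);
            // thread::spawn needs a Send closure. Arc<dyn AmqFilter> is Send only
            // if dyn AmqFilter is Send + Sync, which the supertrait bound provides.
            thread::spawn(move || f.contains(&key))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Without `: Send + Sync` on the trait, the `thread::spawn` call is rejected at compile time, because a bare `dyn AmqFilter` carries no thread-safety guarantees.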

marvin-j97 · 2025-04-25T00:28:51Z

src/segment/filter/blocked_bloom/builder.rs

+            h1 = h1.wrapping_add(h2);
+            h2 = h2.wrapping_add(i);


These lines should probably be moved to the end of the loop iteration.

Same for filter (reader)

k-jingyang · 2025-04-27

This was deliberate, to add variance between the choice of block and the first bit set in the block.

An edge case would be if num_of_blocks == cache_line_bytes. This would cause us to always set bit X of block X as the first bit.

marvin-j97 · 2025-04-28

That is probably not an issue, though, because blocks are always logically isolated. It's probably also quite an edge case.
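
The effect of where the update sits can be sketched as follows. This is a simplified model, not the PR's builder: the loop bounds and the way the block index and bit index are derived from h1 are assumptions, and only the two update lines come from the diff above. In this model the collision shows up when the number of blocks equals the number of bits per block; the exact condition in the PR depends on how it derives the bit index.

```rust
/// Returns the chosen block and the bit indexes probed inside it for one key.
/// `update_first` selects whether the double-hashing step runs at the start
/// or at the end of each loop iteration.
fn probe(
    mut h1: u64,
    mut h2: u64,
    k: u64,
    num_blocks: u64,
    block_bits: u64,
    update_first: bool,
) -> (u64, Vec<u64>) {
    // Assumed: the block is chosen from the untouched h1.
    let block = h1 % num_blocks;
    let mut bits = Vec::new();
    for i in 1..=k {
        if update_first {
            h1 = h1.wrapping_add(h2);
            h2 = h2.wrapping_add(i);
        }
        // Assumed: the bit inside the block is taken from the current h1.
        bits.push(h1 % block_bits);
        if !update_first {
            h1 = h1.wrapping_add(h2);
            h2 = h2.wrapping_add(i);
        }
    }
    (block, bits)
}
```

With the update at the end of the iteration, the first probe reuses the same h1 that picked the block, so when num_blocks == block_bits the first bit index always equals the block index. Updating first mixes h2 into h1 before any bit is chosen, which breaks that correlation.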

k-jingyang · 2025-04-27T14:26:57Z

Still pending measuring FPR for BlockedBloomFilter

k-jingyang force-pushed the main branch from beab201 to 60bbfe7 Compare April 19, 2025 15:21

marvin-j97 changed the base branch from main to 3.0.0 April 19, 2025 20:06

k-jingyang force-pushed the main branch from 60bbfe7 to 9b8be97 Compare April 20, 2025 04:52

marvin-j97 marked this pull request as ready for review April 20, 2025 13:25

k-jingyang added 2 commits April 24, 2025 12:00

feat: implement blocked bloom

38b2322

chore: use AMQFilter trait and decode StandardBloomFilter from AMQFilterBuilder

00b1d6e

k-jingyang force-pushed the main branch from 825b084 to 00b1d6e Compare April 24, 2025 04:04

k-jingyang commented Apr 24, 2025

src/segment/filter/mod.rs (outdated, resolved)

feat: encode and decode BlockedBloomFilter

7e2bad7

k-jingyang commented Apr 24, 2025

src/segment/filter/blocked_bloom/mod.rs (resolved)

marvin-j97 reviewed Apr 25, 2025

src/segment/filter/blocked_bloom/mod.rs (outdated, resolved)

fix: use wrapping_mul

f16abbe

marvin-j97 requested changes Apr 26, 2025

marvin-j97 reviewed Apr 26, 2025

src/segment/filter/blocked_bloom/builder.rs (outdated, resolved)

marvin-j97 reviewed Apr 26, 2025

src/segment/filter/blocked_bloom/builder.rs (outdated, resolved)

k-jingyang added 2 commits April 27, 2025 17:52

fix: fix bits to bytes calculation from m

e161dd3

fix: fix bits calculation

b9e7d1c

k-jingyang force-pushed the main branch from 9276b1c to b9e7d1c Compare April 27, 2025 10:08

feat: use enum dispatch for filter type

dce7d01

k-jingyang force-pushed the main branch from 815953c to 77b6e7c Compare April 27, 2025 14:22

k-jingyang commented Apr 27, 2025

src/segment/filter/mod.rs (resolved)

chore: formatting

610a090

k-jingyang force-pushed the main branch from 77b6e7c to 610a090 Compare April 27, 2025 14:31

chore: rename

3e051f5
