Implement BlockedBloomFilter #127

Open
wants to merge 9 commits into base: 3.0.0

@k-jingyang k-jingyang commented Apr 19, 2025

Solves #78

It's still WIP, but I wanted to raise the PR early for feedback, if any

  • especially about organising src/bloom into src/bloom/blocked and src/bloom/standard

@k-jingyang
Author

Sharing the benchmark on my PC

$ cargo bench -- "bloom filter"

     Running benches/bloom.rs (target/release/deps/bloom-2e83c48a73000131)
bloom filter add key    time:   [411.51 ns 412.41 ns 413.53 ns]
                        change: [-6.7615% -5.9719% -5.1901%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  5 (5.00%) high mild
  9 (9.00%) high severe

bloom filter contains key, true positive (1%)
                        time:   [41.095 ns 41.566 ns 42.132 ns]
                        change: [-15.853% -12.870% -9.9857%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) high mild
  13 (13.00%) high severe

bloom filter contains key, true positive (0.1%)
                        time:   [48.389 ns 49.241 ns 50.302 ns]
                        change: [-15.548% -11.948% -7.9643%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

bloom filter contains key, true positive (0.01%)
                        time:   [56.502 ns 57.368 ns 58.473 ns]
                        change: [-17.297% -13.599% -9.7068%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe

bloom filter contains key, true positive (0.0009999999%)
                        time:   [62.768 ns 63.557 ns 64.553 ns]
                        change: [-17.250% -14.014% -10.462%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

bloom filter add key - blocked bloom filter
                        time:   [408.43 ns 411.52 ns 415.83 ns]
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) high mild
  11 (11.00%) high severe

bloom filter contains key, true positive (1%) - blocked bloom filter
                        time:   [35.358 ns 35.803 ns 36.400 ns]
                        change: [-19.207% -15.541% -11.769%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  7 (7.00%) high mild
  11 (11.00%) high severe

bloom filter contains key, true positive (0.1%) - blocked bloom filter
                        time:   [43.504 ns 43.904 ns 44.432 ns]
                        change: [-16.757% -13.042% -8.3987%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) high mild
  10 (10.00%) high severe

bloom filter contains key, true positive (0.01%) - blocked bloom filter
                        time:   [47.467 ns 48.095 ns 48.989 ns]
                        change: [-16.809% -13.030% -9.5106%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  4 (4.00%) high mild
  14 (14.00%) high severe

bloom filter contains key, true positive (0.0009999999%) - blocked bloom filter
                        time:   [50.597 ns 50.977 ns 51.449 ns]
                        change: [-17.290% -13.593% -9.2939%] (p = 0.00 < 0.05)
                        Performance has improved.

@marvin-j97
Contributor

Would you mind basing this on the 3.0.0 branch? I don't plan on adding blocked bloom filters in V2 anyway, and I already refactored the module tree to allow multiple filter types (src/segment/filter) in there.

@k-jingyang
Author

k-jingyang commented Apr 19, 2025

> Would you mind basing this on the 3.0.0 branch? I don't plan on adding blocked bloom filters in V2 anyway, and I already refactored the module tree to allow multiple filter types (src/segment/filter) in there.

Sure, will do. Thanks!

@marvin-j97 marvin-j97 changed the base branch from main to 3.0.0 April 19, 2025 20:06
@k-jingyang
Author

k-jingyang commented Apr 20, 2025

I've updated the PR to be based on the 3.0.0 branch. Here are the benchmarks. Note that the benchmark parameters changed compared to my benchmark above, hence the difference in the standard bloom filter numbers.

$ cargo bench -- ".+ bloom filter"

     Running benches/bloom.rs (target/release/deps/bloom-9b9f9b6ef85c0cd2)
standard bloom filter add key
                        time:   [533.32 ns 535.43 ns 538.80 ns]
                        change: [-5.1674% -3.9522% -2.7998%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  3 (3.00%) high mild
  14 (14.00%) high severe

standard bloom filter contains key, true positive (1%)
                        time:   [145.00 ns 145.68 ns 146.52 ns]
                        change: [-2.8564% -0.8346% +1.2524%] (p = 0.43 > 0.05)
                        No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
  2 (2.00%) high mild
  15 (15.00%) high severe

standard bloom filter contains key, true positive (0.1%)
                        time:   [187.06 ns 192.66 ns 199.79 ns]
                        change: [-5.0150% -3.0765% -0.9774%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

standard bloom filter contains key, true positive (0.01%)
                        time:   [249.35 ns 250.61 ns 252.20 ns]
                        change: [-2.2533% -1.1636% -0.0419%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) high mild
  14 (14.00%) high severe

standard bloom filter contains key, true positive (0.0009999999%)
                        time:   [275.90 ns 276.61 ns 277.49 ns]
                        change: [-8.7734% -5.6394% -2.8336%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe

blocked bloom filter add key
                        time:   [498.51 ns 502.83 ns 508.08 ns]
                        change: [-1.3098% -0.2746% +0.7744%] (p = 0.62 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe

blocked bloom filter contains key, true positive (1%)
                        time:   [77.661 ns 78.615 ns 79.697 ns]
                        change: [-6.7851% -3.2400% +0.3533%] (p = 0.08 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

blocked bloom filter contains key, true positive (0.1%)
                        time:   [101.62 ns 102.36 ns 103.20 ns]
                        change: [-6.3367% -3.3385% -0.3909%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  6 (6.00%) high mild
  8 (8.00%) high severe

blocked bloom filter contains key, true positive (0.01%)
                        time:   [120.55 ns 124.92 ns 130.13 ns]
                        change: [+2.7154% +6.9422% +11.441%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

blocked bloom filter contains key, true positive (0.0009999999%)
                        time:   [125.38 ns 130.16 ns 135.27 ns]
                        change: [-0.6506% +2.7169% +5.9467%] (p = 0.10 > 0.05)
                        No change in performance detected.

@k-jingyang
Author

k-jingyang commented Apr 20, 2025

The next steps in the PR will be:

  1. An AMQFilter trait for use in the segment src/segment/inner.rs
  2. A generic builder/factory to decode the appropriate filter based on Reader
    • I'm assuming that each filter type can encode different fields, so each filter type will have to provide its own decode implementation

Please correct me if I'm wrong, or if you have other ideas.
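
The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the PR's implementation: the one-byte filter type tag (0 = standard, 1 = blocked), the length-prefixed payload, the `describe` method, the `block_bytes` field and the `DecodeError` variants are all hypothetical, and the magic bytes are left out.

```rust
use std::io::{self, Read};

/// Stand-in for the crate's `DecodeError` (these variants are assumptions).
#[derive(Debug)]
pub enum DecodeError {
    Io(io::Error),
    InvalidFilterType(u8),
}

impl From<io::Error> for DecodeError {
    fn from(e: io::Error) -> Self {
        Self::Io(e)
    }
}

/// Common interface the segment would hold, whatever the concrete filter is.
pub trait AMQFilter: Sync + Send {
    fn len(&self) -> usize;
    fn describe(&self) -> String;
}

pub struct StandardBloomFilter {
    bytes: Vec<u8>,
}

pub struct BlockedBloomFilter {
    block_bytes: u8, // extra field that only this filter type encodes (assumed)
    bytes: Vec<u8>,
}

/// Reads a little-endian u32 length followed by that many bytes.
fn read_payload<R: Read>(reader: &mut R) -> Result<Vec<u8>, DecodeError> {
    let mut len = [0u8; 4];
    reader.read_exact(&mut len)?;
    let mut bytes = vec![0u8; u32::from_le_bytes(len) as usize];
    reader.read_exact(&mut bytes)?;
    Ok(bytes)
}

impl StandardBloomFilter {
    // Called by the builder once the filter type tag has been consumed.
    fn decode_from<R: Read>(reader: &mut R) -> Result<Self, DecodeError> {
        Ok(Self { bytes: read_payload(reader)? })
    }
}

impl BlockedBloomFilter {
    // Reads its own extra field before the shared payload layout.
    fn decode_from<R: Read>(reader: &mut R) -> Result<Self, DecodeError> {
        let mut block_bytes = [0u8; 1];
        reader.read_exact(&mut block_bytes)?;
        Ok(Self { block_bytes: block_bytes[0], bytes: read_payload(reader)? })
    }
}

impl AMQFilter for StandardBloomFilter {
    fn len(&self) -> usize {
        self.bytes.len()
    }
    fn describe(&self) -> String {
        "standard".to_string()
    }
}

impl AMQFilter for BlockedBloomFilter {
    fn len(&self) -> usize {
        self.bytes.len()
    }
    fn describe(&self) -> String {
        format!("blocked ({}-byte blocks)", self.block_bytes)
    }
}

pub struct AMQFilterBuilder;

impl AMQFilterBuilder {
    /// Reads the type tag, then hands the reader to the matching filter.
    pub fn decode_from<R: Read>(reader: &mut R) -> Result<Box<dyn AMQFilter>, DecodeError> {
        let mut tag = [0u8; 1];
        reader.read_exact(&mut tag)?;
        match tag[0] {
            0 => Ok(Box::new(StandardBloomFilter::decode_from(reader)?)),
            1 => Ok(Box::new(BlockedBloomFilter::decode_from(reader)?)),
            other => Err(DecodeError::InvalidFilterType(other)),
        }
    }
}
```

Since `&[u8]` implements `Read`, the builder can be exercised on an in-memory buffer, e.g. `AMQFilterBuilder::decode_from(&mut &buffer[..])`.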

@marvin-j97
Contributor

Sounds good to me

One more thing: we should compare the false positive rate (FPR) of the two bloom filters. The blocked variant should have a slightly higher FPR.
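
One way such a comparison could be set up is sketched below. It is a toy under stated assumptions, not the crate's filters: `ToyBloom`, `hash` and `measure_fpr` are hypothetical names, `DefaultHasher` is only a stand-in for the real hash function, and a "standard" filter is modelled as a single block that spans all bits.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Seeded hash; `DefaultHasher` is only a stand-in for the crate's real hash.
fn hash(key: u64, seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    (seed, key).hash(&mut hasher);
    hasher.finish()
}

/// Toy filter: one block covering all bits behaves like a standard bloom
/// filter, many small blocks behave like a blocked one.
pub struct ToyBloom {
    bits: Vec<u64>,
    num_blocks: u64,
    block_bits: u64,
    k: u64,
}

impl ToyBloom {
    pub fn new(num_blocks: u64, block_bits: u64, k: u64) -> Self {
        let words = ((num_blocks * block_bits + 63) / 64) as usize;
        Self { bits: vec![0; words], num_blocks, block_bits, k }
    }

    /// Bit positions probed for `key`: pick a block, then `k` bits inside it.
    fn positions(&self, key: u64) -> Vec<usize> {
        let base = (hash(key, 0) % self.num_blocks) * self.block_bits;
        // Plain double hashing; an odd step avoids a zero stride.
        let (h1, h2) = (hash(key, 1), hash(key, 2) | 1);
        (0..self.k)
            .map(|i| (base + h1.wrapping_add(i.wrapping_mul(h2)) % self.block_bits) as usize)
            .collect()
    }

    pub fn insert(&mut self, key: u64) {
        for p in self.positions(key) {
            self.bits[p / 64] |= 1u64 << (p % 64);
        }
    }

    pub fn contains(&self, key: u64) -> bool {
        self.positions(key)
            .into_iter()
            .all(|p| (self.bits[p / 64] & (1u64 << (p % 64))) != 0)
    }
}

/// Inserts keys `0..n`, then probes `n..n + probes`. None of the probed keys
/// was inserted, so every hit is a false positive.
pub fn measure_fpr(filter: &mut ToyBloom, n: u64, probes: u64) -> f64 {
    for key in 0..n {
        filter.insert(key);
    }
    let hits = (n..n + probes).filter(|&key| filter.contains(key)).count();
    hits as f64 / probes as f64
}
```

For a like-for-like comparison, both filters should get the same bit budget and the same number of hash functions, e.g. `ToyBloom::new(1, 102_400, 7)` against `ToyBloom::new(200, 512, 7)` with the same `n`.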

@marvin-j97 marvin-j97 marked this pull request as ready for review April 20, 2025 13:25
pub struct AMQFilterBuilder {}

impl AMQFilterBuilder {
pub fn decode_from<R: Read>(reader: &mut R) -> Result<Box<dyn AMQFilter + Sync>, DecodeError> {
Author

@k-jingyang k-jingyang Apr 24, 2025

I couldn't use impl Decode, because we're not returning an AMQFilterBuilder here.

Otherwise, the method signature is similar

#[allow(clippy::len_without_is_empty)]
impl StandardBloomFilter {
// To be used by AMQFilter after magic bytes and filter type have been read and parsed
pub(super) fn decode_from<R: Read>(reader: &mut R) -> Result<Self, DecodeError> {
Author

The method signature is similar to decode_from in Decode. I changed it to pub(super) visibility because it's only meant to be used by AMQFilterBuilder.

Author

That said, it feels slightly odd to reuse DecodeError here. I'm not sure whether we should use a different error type instead.

}
}

pub trait AMQFilter: Sync + Send {
Author

Adding Sync + Send here, otherwise we would get errors from blob_drop_after_flush
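
A minimal sketch of what the supertraits buy, assuming the filter ends up behind a trait object that is shared across threads (the comment above only says that blob_drop_after_flush failed without them). `EvenLen` and `query_from_threads` are hypothetical names used for illustration.

```rust
use std::sync::Arc;
use std::thread;

/// With `Sync + Send` as supertraits, `dyn AMQFilter` is itself `Sync + Send`,
/// so `Arc<dyn AMQFilter>` can be moved into spawned threads.
pub trait AMQFilter: Sync + Send {
    fn contains(&self, key: &[u8]) -> bool;
}

/// Hypothetical stand-in filter: "contains" exactly the even-length keys.
pub struct EvenLen;

impl AMQFilter for EvenLen {
    fn contains(&self, key: &[u8]) -> bool {
        key.len() % 2 == 0
    }
}

/// Queries the shared filter from one thread per key, preserving key order.
pub fn query_from_threads(filter: Arc<dyn AMQFilter>, keys: Vec<Vec<u8>>) -> Vec<bool> {
    let handles: Vec<_> = keys
        .into_iter()
        .map(|key| {
            let filter = Arc::clone(&filter);
            // This closure must be `Send + 'static`; without the supertraits
            // the trait object would not satisfy that bound.
            thread::spawn(move || filter.contains(&key))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Dropping `Sync + Send` from the trait definition makes the `thread::spawn` call fail to compile, unless the bounds are repeated on every `dyn AMQFilter` use instead.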

Comment on lines 103 to 104
h1 = h1.wrapping_add(h2);
h2 = h2.wrapping_add(i);
Contributor

These lines should probably be moved to the end of the loop iteration.

Same for filter (reader)

Author

This was deliberate, to add variance between the choice of block and the first bit set in the block.

An edge case would be if num_of_blocks == cache_line_bytes. This would cause us to always set bit X of block X as the first bit.

Contributor

That is probably not an issue though, because blocks are always logically isolated. It's probably also quite an edge case.
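
To make the two orderings discussed in this thread concrete, here is a small sketch. It assumes the block is picked as `h1 % num_blocks`, each bit offset as `h1 % block_bits`, and that `i` runs from 1 to `k`; none of that is confirmed by the diff excerpt, and `probe` and `update_first` are hypothetical names.

```rust
/// Returns the chosen block and the `k` bit offsets inside it.
/// `update_first = true` mirrors the PR (hashes advance before each probe),
/// `update_first = false` is the suggested variant (advance at the end).
pub fn probe(
    mut h1: u64,
    mut h2: u64,
    num_blocks: u64,
    block_bits: u64,
    k: u64,
    update_first: bool,
) -> (u64, Vec<u64>) {
    let block = h1 % num_blocks; // assumed block selection
    let mut offsets = Vec::with_capacity(k as usize);
    for i in 1..=k {
        if update_first {
            h1 = h1.wrapping_add(h2);
            h2 = h2.wrapping_add(i);
        }
        offsets.push(h1 % block_bits);
        if !update_first {
            h1 = h1.wrapping_add(h2);
            h2 = h2.wrapping_add(i);
        }
    }
    (block, offsets)
}
```

With `num_blocks == block_bits`, updating at the end makes the first offset equal to the block index, which is the edge case described above. Updating first decouples the two, at the cost of never probing the offset derived from the initial `h1`.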

@k-jingyang
Author

k-jingyang commented Apr 27, 2025

Measuring the FPR for BlockedBloomFilter is still pending.
