
Commit 2bdc9c1

hiltontj and alamb authored
docs: add sizing explanation to bloom filter docs in parquet (#5705)
* docs: add sizing explanation to bloom filter docs in parquet

  Added documentation detailing the sizing of bloom filters in the parquet crate.

* docs: fix doc comment typo

* docs: clarify doc comment on bloom filters in metadata

  Updated the bloom filter module doc comment to clarify that the metadata stores the offset/length of the bloom filter, and not the bloom filter in its entirety.

  Co-authored-by: Andrew Lamb <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>
1 parent b3f06f6 commit 2bdc9c1

1 file changed, +55 -1 lines changed
  • parquet/src/bloom_filter

parquet/src/bloom_filter/mod.rs

Lines changed: 55 additions & 1 deletion
@@ -16,7 +16,61 @@
 // under the License.

 //! Bloom filter implementation specific to Parquet, as described
-//! in the [spec](https://github.com/apache/parquet-format/blob/master/BloomFilter.md).
+//! in the [spec][parquet-bf-spec].
+//!
+//! # Bloom Filter Size
+//!
+//! Parquet uses the [Split Block Bloom Filter][sbbf-paper] (SBBF) as its bloom filter
+//! implementation. For each column upon which bloom filters are enabled, the offset and length of an SBBF
+//! is stored in the metadata for each row group in the parquet file. The size of each filter is
+//! initialized using a calculation based on the desired number of distinct values (NDV) and false
+//! positive probability (FPP). The FPP for an SBBF can be approximated as<sup>[1][bf-formulae]</sup>:
+//!
+//! ```text
+//! f = (1 - e^(-k * n / m))^k
+//! ```
+//!
+//! Where, `f` is the FPP, `k` the number of hash functions, `n` the NDV, and `m` the total number
+//! of bits in the bloom filter. This can be re-arranged to determine the total number of bits
+//! required to achieve a given FPP and NDV:
+//!
+//! ```text
+//! m = -k * n / ln(1 - f^(1/k))
+//! ```
+//!
+//! SBBFs use eight hash functions to cleanly fit in SIMD lanes<sup>[2][sbbf-paper]</sup>, therefore
+//! `k` is set to 8. The SBBF will spread those `m` bits across a set of `b` blocks that
+//! are each 256 bits, i.e., 32 bytes, in size. The number of blocks is chosen as:
+//!
+//! ```text
+//! b = NP2(m/8) / 32
+//! ```
+//!
+//! Where, `NP2` denotes *the next power of two*, and `m` is divided by 8 to be represented as bytes.
+//!
+//! Here is a table of calculated sizes for various FPP and NDV:
+//!
+//! | NDV | FPP | b | Size (KB) |
+//! |-----------|-----------|---------|-----------|
+//! | 10,000 | 0.1 | 256 | 8 |
+//! | 10,000 | 0.01 | 512 | 16 |
+//! | 10,000 | 0.001 | 1,024 | 32 |
+//! | 10,000 | 0.0001 | 1,024 | 32 |
+//! | 100,000 | 0.1 | 4,096 | 128 |
+//! | 100,000 | 0.01 | 4,096 | 128 |
+//! | 100,000 | 0.001 | 8,192 | 256 |
+//! | 100,000 | 0.0001 | 16,384 | 512 |
+//! | 100,000 | 0.00001 | 16,384 | 512 |
+//! | 1,000,000 | 0.1 | 32,768 | 1,024 |
+//! | 1,000,000 | 0.01 | 65,536 | 2,048 |
+//! | 1,000,000 | 0.001 | 65,536 | 2,048 |
+//! | 1,000,000 | 0.0001 | 131,072 | 4,096 |
+//! | 1,000,000 | 0.00001 | 131,072 | 4,096 |
+//! | 1,000,000 | 0.000001 | 262,144 | 8,192 |
+//!
+//! [parquet-bf-spec]: https://github.com/apache/parquet-format/blob/master/BloomFilter.md
+//! [sbbf-paper]: https://arxiv.org/pdf/2101.01719
+//! [bf-formulae]: http://tfk.mit.edu/pdf/bloom.pdf

 use crate::data_type::AsBytes;
 use crate::errors::ParquetError;
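
The sizing rule described in the doc comment above can be checked with a small standalone sketch. The function names `num_bits`, `num_blocks`, and `approx_fpp` are hypothetical and are not taken from the parquet crate; the lower bound of one block is likewise an assumption made so the helper never returns zero. The sketch only mirrors the three formulas in the comment (`f`, `m`, and `b`) with `k = 8`.

```rust
/// Number of hash functions used by an SBBF (`k` in the formulas).
const K: f64 = 8.0;
/// Size of one SBBF block in bytes (256 bits).
const BLOCK_BYTES: u64 = 32;

/// Bits needed for `ndv` distinct values at false positive probability `fpp`:
/// m = -k * n / ln(1 - f^(1/k))
fn num_bits(ndv: u64, fpp: f64) -> f64 {
    -K * ndv as f64 / (1.0 - fpp.powf(1.0 / K)).ln()
}

/// Number of 32-byte blocks: b = NP2(m / 8) / 32
fn num_blocks(ndv: u64, fpp: f64) -> u64 {
    // Convert bits to whole bytes before rounding up to a power of two.
    let bytes = (num_bits(ndv, fpp) / 8.0).ceil() as u64;
    // Assumption of this sketch: never return fewer than one block.
    bytes.next_power_of_two().max(BLOCK_BYTES) / BLOCK_BYTES
}

/// Approximate FPP of a filter with `blocks` blocks holding `ndv` values:
/// f = (1 - e^(-k * n / m))^k
fn approx_fpp(ndv: u64, blocks: u64) -> f64 {
    let m = (blocks * BLOCK_BYTES * 8) as f64;
    (1.0 - (-K * ndv as f64 / m).exp()).powf(K)
}
```

Calling `num_blocks(10_000, 0.1)` gives 256 blocks (8 KB), which agrees with the first row of the table. Because the byte count is rounded up to a power of two, the FPP estimated by `approx_fpp` for the chosen block count is at or below the requested value.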
