|
16 | 16 | // under the License.
|
17 | 17 |
|
18 | 18 | //! Bloom filter implementation specific to Parquet, as described
|
19 |
| -//! in the [spec](https://github.com/apache/parquet-format/blob/master/BloomFilter.md). |
| 19 | +//! in the [spec][parquet-bf-spec]. |
| 20 | +//! |
| 21 | +//! # Bloom Filter Size |
| 22 | +//! |
| 23 | +//! Parquet uses the [Split Block Bloom Filter][sbbf-paper] (SBBF) as its bloom filter |
| 24 | +//! implementation. For each column upon which bloom filters are enabled, the offset and length of an SBBF |
| 25 | +//! are stored in the metadata for each row group in the Parquet file. The size of each filter is
| 26 | +//! initialized using a calculation based on the desired number of distinct values (NDV) and false |
| 27 | +//! positive probability (FPP). The FPP for an SBBF can be approximated as<sup>[1][bf-formulae]</sup>:
| 28 | +//! |
| 29 | +//! ```text |
| 30 | +//! f = (1 - e^(-k * n / m))^k |
| 31 | +//! ``` |
| 32 | +//! |
| 33 | +//! Where `f` is the FPP, `k` is the number of hash functions, `n` is the NDV, and `m` is the total number
| 34 | +//! of bits in the bloom filter. This can be re-arranged to determine the total number of bits |
| 35 | +//! required to achieve a given FPP and NDV: |
| 36 | +//! |
| 37 | +//! ```text |
| 38 | +//! m = -k * n / ln(1 - f^(1/k)) |
| 39 | +//! ``` |
| 40 | +//! |
| 41 | +//! SBBFs use eight hash functions to cleanly fit in SIMD lanes<sup>[2][sbbf-paper]</sup>, so
| 42 | +//! `k` is set to 8. The SBBF spreads those `m` bits across a set of `b` blocks that
| 43 | +//! are each 256 bits, i.e., 32 bytes, in size. The number of blocks is chosen as: |
| 44 | +//! |
| 45 | +//! ```text |
| 46 | +//! b = NP2(m/8) / 32 |
| 47 | +//! ``` |
| 48 | +//! |
| 49 | +//! Where `NP2` denotes *the next power of two*, and `m` is divided by 8 to convert bits to bytes.
| 50 | +//! |
| 51 | +//! Here is a table of calculated sizes for various FPP and NDV: |
| 52 | +//! |
| 53 | +//! | NDV | FPP | b | Size (KB) | |
| 54 | +//! |-----------|-----------|---------|-----------| |
| 55 | +//! | 10,000 | 0.1 | 256 | 8 | |
| 56 | +//! | 10,000 | 0.01 | 512 | 16 | |
| 57 | +//! | 10,000 | 0.001 | 1,024 | 32 | |
| 58 | +//! | 10,000 | 0.0001 | 1,024 | 32 | |
| 59 | +//! | 100,000 | 0.1 | 4,096 | 128 | |
| 60 | +//! | 100,000 | 0.01 | 4,096 | 128 | |
| 61 | +//! | 100,000 | 0.001 | 8,192 | 256 | |
| 62 | +//! | 100,000 | 0.0001 | 16,384 | 512 | |
| 63 | +//! | 100,000 | 0.00001 | 16,384 | 512 | |
| 64 | +//! | 1,000,000 | 0.1 | 32,768 | 1,024 | |
| 65 | +//! | 1,000,000 | 0.01 | 65,536 | 2,048 | |
| 66 | +//! | 1,000,000 | 0.001 | 65,536 | 2,048 | |
| 67 | +//! | 1,000,000 | 0.0001 | 131,072 | 4,096 | |
| 68 | +//! | 1,000,000 | 0.00001 | 131,072 | 4,096 | |
| 69 | +//! | 1,000,000 | 0.000001 | 262,144 | 8,192 | |
| 70 | +//! |
| 71 | +//! [parquet-bf-spec]: https://github.com/apache/parquet-format/blob/master/BloomFilter.md |
| 72 | +//! [sbbf-paper]: https://arxiv.org/pdf/2101.01719 |
| 73 | +//! [bf-formulae]: http://tfk.mit.edu/pdf/bloom.pdf |
20 | 74 |
|
21 | 75 | use crate::data_type::AsBytes;
|
22 | 76 | use crate::errors::ParquetError;
|
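The sizing formulas in the new module docs can be checked against the table with a short standalone sketch. This is only an illustration of the documented arithmetic, not the crate's actual implementation: the names `num_bits`, `num_blocks`, and `approx_fpp` are hypothetical, the byte count is rounded up before taking the next power of two (an assumption, since the docs do not state the rounding), and any minimum or maximum filter size the crate may enforce is ignored.

```rust
/// Number of hash functions used by an SBBF (`k` in the documented formulas).
const K: f64 = 8.0;
/// Size of one SBBF block in bytes (256 bits).
const BLOCK_BYTES: u64 = 32;

/// Total bits `m` for `ndv` distinct values at false positive probability `fpp`:
/// m = -k * n / ln(1 - f^(1/k))
fn num_bits(ndv: u64, fpp: f64) -> f64 {
    -K * ndv as f64 / (1.0 - fpp.powf(1.0 / K)).ln()
}

/// Number of 32-byte blocks: b = NP2(m / 8) / 32.
/// Assumption: the byte count is rounded up before taking the next power of two.
fn num_blocks(ndv: u64, fpp: f64) -> u64 {
    let bytes = (num_bits(ndv, fpp) / 8.0).ceil() as u64;
    bytes.next_power_of_two() / BLOCK_BYTES
}

/// Approximate FPP for a filter of `bits` bits holding `ndv` distinct values:
/// f = (1 - e^(-k * n / m))^k
fn approx_fpp(ndv: u64, bits: f64) -> f64 {
    (1.0 - (-K * ndv as f64 / bits).exp()).powf(K)
}
```

For example, `num_blocks(10_000, 0.01)` evaluates to 512 blocks (16 KB), which matches the second row of the table, and `approx_fpp(n, num_bits(n, f))` recovers `f` up to floating point error because the two formulas are inverses of each other.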