Optimize ArrowBytesViewMap: fuse hash+insert, shrink entries, remove generic #21628
Dandandan wants to merge 1 commit into apache:main
…generic

Three optimizations to ArrowBytesViewMap used by GROUP BY and COUNT DISTINCT on StringView/BinaryView columns:

1. Fuse hash computation with hash table probe: instead of a two-pass approach (batch create_hashes then per-element probe), compute the hash and fetch non-inline bytes once per element. This keeps the input string data cache-hot for the immediately-following equality comparison, avoiding a redundant pointer chase into the input array's data buffers.
2. Shrink hash table entries from 32 bytes (Entry<V> with u128 view + u64 hash + V payload) to 16 bytes (usize index + u64 hash). Views are looked up via the index into the existing views vec on demand.
3. Remove the V generic parameter entirely. The only two usages were ArrowBytesViewMap<()> (set) and ArrowBytesViewMap<usize> (group-by), where the usize payload was always the insertion-order index -- which is already implicit in the views vec position. Simplify the two-callback API (make_payload_fn + observe_payload_fn) to a single observe_fn(usize).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
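The size change in item 2 can be checked with a small sketch. `OldEntry` and `NewEntry` below are hypothetical stand-ins built only from the field list in the commit message, not the actual DataFusion types, and the 64-byte cache line is an assumption that holds on most current x86-64 and many AArch64 CPUs.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for the old entry: view + hash + payload.
#[allow(dead_code)]
struct OldEntry<V> {
    view: u128,
    hash: u64,
    payload: V,
}

/// Hypothetical stand-in for the new entry: (index into the views vec, hash).
type NewEntry = (usize, u64);

/// Assumed cache-line size in bytes; not taken from the PR.
const ASSUMED_CACHE_LINE: usize = 64;

/// How many values of type `T` fit into one assumed cache line.
fn entries_per_cache_line<T>() -> usize {
    ASSUMED_CACHE_LINE / size_of::<T>()
}

/// Returns (old entry size, new entry size) in bytes for the group-by case,
/// where the payload is a usize. On a 64-bit target this is (32, 16).
fn entry_sizes() -> (usize, usize) {
    (size_of::<OldEntry<usize>>(), size_of::<NewEntry>())
}
```

Under these assumptions, four of the smaller entries fit in a cache line instead of two. The real table also stores per-slot metadata, which this sketch ignores.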
Which issue does this PR close?
Performance optimization, no issue.
Rationale for this change
ArrowBytesViewMap is on the hot path for GROUP BY and COUNT DISTINCT on StringView/BinaryView columns. This PR applies three optimizations to improve cache efficiency and reduce memory overhead.
What changes are included in this PR?
1. Fuse hash computation with hash table probe: Instead of a two-pass approach (batch create_hashes then per-element probe), compute the hash and fetch non-inline bytes once per element. For non-inlined views (>12 bytes), the string data is kept cache-hot for the immediately-following equality comparison, avoiding a redundant pointer chase into the input array's data buffers.
2. Shrink hash table entries from 32 to 16 bytes: Entry<V> (u128 view + u64 hash + V payload) → (usize, u64) (index + hash). Views are looked up via the index into the existing views vec on demand. This means more entries fit per cache line during hash table probing.
3. Remove the V generic parameter: The only two usages were ArrowBytesViewMap<()> (set) and ArrowBytesViewMap<usize> (group-by), where the usize payload was always the insertion-order index, which is already implicit in the views vec position. The two-callback API (make_payload_fn + observe_payload_fn) is simplified to a single observe_fn(usize).
Are these changes tested?
Covered by existing tests (8 unit tests in binary_view_map + 26 group_values tests all pass).
Are there any user-facing changes?
No user-facing changes. The ArrowBytesViewMap API is simplified but it is not part of the public API.
🤖 Generated with Claude Code
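To illustrate the fused hash-and-probe loop and the single observe_fn(usize) callback, here is a minimal toy sketch using only the standard library. It is not the DataFusion implementation: `ToyBytesSet`, `insert_batch`, `insert_one`, the linear-probing table, and the use of `DefaultHasher` are all assumptions made for illustration, and owned `Vec<u8>` values stand in for the views vec.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy interning set (hypothetical). Each slot holds only
/// (index into `values`, hash), mirroring the 16-byte entry idea.
struct ToyBytesSet {
    slots: Vec<Option<(usize, u64)>>,
    /// Distinct values in insertion order; a stand-in for the views vec.
    values: Vec<Vec<u8>>,
}

fn hash_bytes(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

impl ToyBytesSet {
    fn new() -> Self {
        Self { slots: vec![None; 16], values: Vec::new() }
    }

    /// Fused loop: each element is hashed and probed in the same iteration,
    /// instead of hashing the whole batch first and probing in a second pass.
    fn insert_batch(&mut self, batch: &[&[u8]], mut observe_fn: impl FnMut(usize)) {
        for &bytes in batch {
            self.insert_one(bytes, &mut observe_fn);
        }
    }

    /// Calls `observe_fn` with the insertion-order index of `bytes`.
    fn insert_one(&mut self, bytes: &[u8], mut observe_fn: impl FnMut(usize)) {
        // Keep the load factor at or below 50% so probing always terminates.
        if (self.values.len() + 1) * 2 > self.slots.len() {
            self.grow();
        }
        let hash = hash_bytes(bytes);
        let mask = self.slots.len() - 1;
        let mut pos = hash as usize & mask;
        loop {
            let slot = self.slots[pos];
            match slot {
                // Compare hashes first, then the bytes that were just hashed.
                Some((idx, h)) if h == hash && self.values[idx].as_slice() == bytes => {
                    observe_fn(idx);
                    return;
                }
                Some(_) => pos = (pos + 1) & mask,
                None => {
                    // The payload is implicit: the position in `values`.
                    let idx = self.values.len();
                    self.values.push(bytes.to_vec());
                    self.slots[pos] = Some((idx, hash));
                    observe_fn(idx);
                    return;
                }
            }
        }
    }

    /// Doubles the table, reusing stored hashes so no value is rehashed.
    fn grow(&mut self) {
        let new_len = self.slots.len() * 2;
        let mut new_slots: Vec<Option<(usize, u64)>> = vec![None; new_len];
        let mask = new_len - 1;
        for &(idx, hash) in self.slots.iter().flatten() {
            let mut pos = hash as usize & mask;
            while new_slots[pos].is_some() {
                pos = (pos + 1) & mask;
            }
            new_slots[pos] = Some((idx, hash));
        }
        self.slots = new_slots;
    }
}
```

In this sketch a group-by style caller collects the indices passed to observe_fn as group ids, while a set style caller can pass a no-op closure, so one non-generic type serves both uses.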