Hyperloglog ARM NEON SIMD optimization #1859

xbasel · 2025-03-18T11:50:31Z

Add ARM NEON optimization for HyperLogLog

- Implement two NEON-optimized functions for converting between raw and
  dense representations in HyperLogLog:
  1. hllMergeDenseNEON
  2. hllDenseCompressNEON
  These functions process 16 registers in each iteration.
- Utilize the existing SIMD test in hyperloglog.tcl (previously added for
  the AVX2 optimization) to validate the NEON implementation.
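
For readers unfamiliar with the layout, here is a minimal sketch of what "16 registers per iteration" means in the merge direction. It assumes the dense encoding packs 6-bit registers back to back, so 16 registers occupy 12 bytes. The names `unpack16` and `merge_dense_sketch` are hypothetical and this is not the code in this PR: the unpacking here is plain scalar C, and only the max step uses NEON (`vmaxq_u8`) when the compiler targets it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Unpack 16 six-bit registers (12 packed bytes) into 16 separate bytes.
 * Every 3 packed bytes hold 4 registers, least significant bits first. */
static void unpack16(const uint8_t *dense, uint8_t out[16]) {
    for (int g = 0; g < 4; g++) {
        const uint8_t *b = dense + 3 * g;
        out[4 * g + 0] = (uint8_t)(b[0] & 0x3f);
        out[4 * g + 1] = (uint8_t)(((b[0] >> 6) | (b[1] << 2)) & 0x3f);
        out[4 * g + 2] = (uint8_t)(((b[1] >> 4) | (b[2] << 4)) & 0x3f);
        out[4 * g + 3] = (uint8_t)(b[2] >> 2);
    }
}

/* Hypothetical helper: raw[i] = max(raw[i], dense register i).
 * nregs must be a multiple of 16. */
static void merge_dense_sketch(uint8_t *raw, const uint8_t *dense, size_t nregs) {
    assert(nregs % 16 == 0);
    for (size_t i = 0; i < nregs; i += 16, raw += 16, dense += 12) {
        uint8_t regs[16];
        unpack16(dense, regs);
#if defined(__ARM_NEON)
        /* One 128-bit vector holds 16 one-byte registers. */
        vst1q_u8(raw, vmaxq_u8(vld1q_u8(raw), vld1q_u8(regs)));
#else
        /* Portable fallback so the sketch also builds on non-ARM hosts. */
        for (int j = 0; j < 16; j++)
            if (regs[j] > raw[j]) raw[j] = regs[j];
#endif
    }
}
```

The 16-register step fits naturally because a 128-bit NEON vector holds exactly 16 bytes, one per unpacked register.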

Test:
valkey-benchmark -n 1000000 --dbnum 9 -p 21111 PFMERGE z hll1{t} hll2{t}

+-------------------+-----------+-----------+---------------+
|      Metric       |  Before   |   After   | Improvement % |
+-------------------+-----------+-----------+---------------+
| Throughput (k rps)|    7.42   |   76.98   |    937.47%    |
+-------------------+-----------+-----------+---------------+
| Latency (msec)    |           |           |               |
|   avg             |   6.686   |   0.595   |     91.10%    |
|   min             |   0.520   |   0.152   |     70.77%    |
|   p50             |   7.799   |   0.599   |     92.32%    |
|   p95             |   8.039   |   0.767   |     90.46%    |
|   p99             |   8.111   |   0.807   |     90.05%    |
|   max             |   9.263   |   1.463   |     84.21%    |
+-------------------+-----------+-----------+---------------+

Hardware:

CPU: Graviton 3
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 64
  On-line CPU(s) list:  0-63
NUMA:
  NUMA node(s):         1
  NUMA node0 CPU(s):    0-63
Memory: 256 GB

Command stats:
Before:

cmdstat_pfmerge:calls=1000002,usec=126327984,**usec_per_call=126.33**,rejected_calls=0,failed_calls=0

After:

cmdstat_pfmerge:calls=1000002,usec=8588205,**usec_per_call=8.59**,rejected_calls=0,failed_calls=0

Improved by ~14.7x.

Functional testing command:

./runtest --single unit/hyperloglog --only "PFMERGE results with simd"  --loops 10000  --fastfail

The SIMD test randomizes input and compares scalar vs. SIMD results.
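
The same idea can be illustrated in C for the compress direction. This is only an analogy to the Tcl test, not a copy of it: all function names below are hypothetical, the "fast" path is a grouped scalar routine standing in for a SIMD one, and the sizes assume 16384 registers of 6 bits each.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NREGS 16384                 /* assumed register count */
#define DENSE_BYTES (NREGS * 6 / 8) /* 6 bits per register */

/* Reference path: place one 6-bit register at a time. */
static void compress_scalar(uint8_t *dense, const uint8_t *raw) {
    memset(dense, 0, DENSE_BYTES);
    for (size_t i = 0; i < NREGS; i++) {
        size_t bit = i * 6, byte = bit / 8, shift = bit % 8;
        unsigned v = (raw[i] & 0x3fu) << shift;
        dense[byte] |= (uint8_t)v;
        if (shift > 2) dense[byte + 1] |= (uint8_t)(v >> 8); /* spill */
    }
}

/* Stand-in for the optimized path: 4 registers -> 3 bytes per step. */
static void compress_grouped(uint8_t *dense, const uint8_t *raw) {
    for (size_t i = 0; i < NREGS; i += 4, dense += 3) {
        uint8_t r0 = raw[i] & 0x3f, r1 = raw[i + 1] & 0x3f;
        uint8_t r2 = raw[i + 2] & 0x3f, r3 = raw[i + 3] & 0x3f;
        dense[0] = (uint8_t)(r0 | (r1 << 6));
        dense[1] = (uint8_t)((r1 >> 2) | (r2 << 4));
        dense[2] = (uint8_t)((r2 >> 4) | (r3 << 2));
    }
}

/* Returns 1 if both paths produce identical bytes on `rounds` random inputs. */
static int paths_agree(unsigned seed, int rounds) {
    static uint8_t raw[NREGS], a[DENSE_BYTES], b[DENSE_BYTES];
    assert(rounds > 0);
    srand(seed);
    for (int r = 0; r < rounds; r++) {
        for (size_t i = 0; i < NREGS; i++) raw[i] = (uint8_t)(rand() % 64);
        compress_scalar(a, raw);
        compress_grouped(b, raw);
        if (memcmp(a, b, DENSE_BYTES) != 0) return 0;
    }
    return 1;
}
```

Comparing whole output buffers byte for byte catches packing errors at register boundaries that a cardinality check alone could hide.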

codecov · 2025-03-18T12:06:25Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.03%. Comparing base (aa88453) to head (4c45315).

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1859      +/-   ##
============================================
- Coverage     71.09%   71.03%   -0.07%     
============================================
  Files           123      123              
  Lines         65671    65671              
============================================
- Hits          46692    46649      -43     
- Misses        18979    19022      +43

Files with missing lines	Coverage Δ
src/hyperloglog.c	`92.23% <100.00%> (ø)`

... and 13 files with indirect coverage changes



xbasel marked this pull request as draft March 18, 2025 11:50

xbasel mentioned this pull request Mar 18, 2025

[NEW] Implement ARM NEON and SVE2 optimizations for Hyperloglog #1860

Open

xbasel force-pushed the hll_neon branch 5 times, most recently from d5cc649 to b2c857e Compare March 18, 2025 16:41

xbasel self-assigned this Mar 18, 2025

xbasel force-pushed the hll_neon branch from b2c857e to 4c45315 Compare March 18, 2025 16:56

xbasel marked this pull request as ready for review March 18, 2025 16:56
