perf: extend multiple accumulator optimization to all fsum paths (#824) #827
SebKrantz wants to merge 1 commit into
Apply loop unrolling with FSUM_N_ACC=4 independent accumulators to all scalar (non-grouped) sum functions, breaking the serial dependency chain and enabling SIMD vectorization across all na.rm and type combinations.

Functions updated:
- fsum_double_impl: na.rm=TRUE (narm==1 and narm==2) and na.rm=FALSE paths
- fsum_double_omp_impl: all paths, using OpenMP array reduction
- fsum_weights_impl: all paths (narm==1, narm==2, narm==0)
- fsum_weights_omp_impl: all paths, using OpenMP array reduction
- fsum_int_impl: na.rm=TRUE path (na.rm=FALSE has early-return on NA, not vectorizable)
- fsum_int_omp_impl: both paths, using OpenMP array reduction on long long array

The OMP versions switch from reduction(+:sum) on a single scalar to reduction(+:partial_sums[:FSUM_N_ACC]) on an array of accumulators, which combines OpenMP parallelism with SIMD within each thread's chunk.

Co-authored-by: Sebastian Krantz <SebKrantz@users.noreply.github.com>
@SebKrantz The code generated by Claude is a bit inconsistent. I think this is because it tries to incorporate the code from all of the iterations that I posted: sometimes it uses
Could I implement the changes to the sum functions myself and submit a PR?
@TylerSagendorf Thanks! And you are certainly very welcome to implement it!
@SebKrantz no problem! I will work on that PR tonight.
@SebKrantz I finished implementing the changes to the sum functions, and all tests are passing. I am going to review the code a bit more to verify that the changes are consistent/minimal and check some benchmarks before I open a PR.

Extends the loop unrolling optimization proposed in #824 by @TylerSagendorf to cover all scalar sum paths, not just na.rm = FALSE.

Functions updated:
- fsum_double_impl: all paths (na.rm=TRUE default, fill, na.rm=FALSE)
- fsum_double_omp_impl: all paths using OpenMP array reduction
- fsum_weights_impl: all paths
- fsum_weights_omp_impl: all paths using OpenMP array reduction
- fsum_int_impl: na.rm=TRUE path with long long accumulators
- fsum_int_omp_impl: both paths using OpenMP array reduction

Grouped functions are unchanged: the scatter pattern (pout[pg[i]] += ...) prevents vectorization regardless.

Closes #824

Generated with Claude Code
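To illustrate the remark about grouped functions, the sketch below shows the general shape of a scatter-style grouped sum. The function name `grouped_sum` and the 0-based group index are assumptions made for this example and are not taken from the package source. Each iteration updates an output slot chosen by the data, so two neighbouring iterations may hit the same slot, and splitting the loop into independent accumulator lanes is not straightforward.

```c
#include <stddef.h>

/* Hypothetical grouped sum. Assumes pg[i] is a 0-based group index in
   [0, ng) and pout has room for ng doubles. */
void grouped_sum(const double *px, const int *pg, size_t n,
                 double *pout, int ng)
{
    for (int g = 0; g < ng; ++g) pout[g] = 0.0;

    /* Scatter: the write target pout[pg[i]] is only known at run time.
       If pg[i] == pg[i + 1], iteration i + 1 must see the value written
       by iteration i, so the iterations cannot be treated as independent. */
    for (size_t i = 0; i < n; ++i)
        pout[pg[i]] += px[i];
}
```

For example, with values {1, 2, 3, 4, 5} and group indices {0, 1, 0, 1, 0}, the function writes 9 to the first output slot and 6 to the second.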