perf: extend multiple accumulator optimization to all fsum paths (#824) by SebKrantz · Pull Request #827 · fastverse/collapse

SebKrantz · 2026-05-04T04:51:15Z

Extends the loop unrolling optimization proposed in #824 by @TylerSagendorf to cover all scalar sum paths, not just na.rm = FALSE.

Functions updated

- fsum_double_impl: all paths (na.rm=TRUE default, fill, na.rm=FALSE)
- fsum_double_omp_impl: all paths, using OpenMP array reduction
- fsum_weights_impl: all paths
- fsum_weights_omp_impl: all paths, using OpenMP array reduction
- fsum_int_impl: na.rm=TRUE path, with long long accumulators
- fsum_int_omp_impl: both paths, using OpenMP array reduction

Grouped functions are unchanged — the scatter pattern (pout[pg[i]] += ...) prevents vectorization regardless.
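
As a rough illustration of the idea described above (not the collapse source), the sketch below sums a double vector with four independent accumulators and contrasts it with the grouped scatter pattern. The names `N_ACC`, `sum_multi_acc`, and `sum_grouped` are hypothetical, group ids are assumed to be 0-based, and the NaN handling is a simplification that does not model the difference between the narm==1 and narm==2 paths.

```c
#include <math.h>
#include <stddef.h>

#define N_ACC 4 /* illustrative stand-in for FSUM_N_ACC=4 */

/* Hypothetical helper: sum x[0..n), skipping NaN values when narm != 0. */
double sum_multi_acc(const double *x, size_t n, int narm)
{
    double acc[N_ACC] = {0.0, 0.0, 0.0, 0.0};
    size_t i = 0;

    /* Main loop: each acc[j] forms its own dependency chain, so the
       additions do not have to wait on one another. */
    for (; i + N_ACC <= n; i += N_ACC) {
        for (int j = 0; j < N_ACC; ++j) {
            double v = x[i + j];
            acc[j] += (narm && isnan(v)) ? 0.0 : v;
        }
    }

    /* Tail: the remaining (n % N_ACC) elements go into acc[0]. */
    for (; i < n; ++i) {
        double v = x[i];
        acc[0] += (narm && isnan(v)) ? 0.0 : v;
    }

    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}

/* Grouped scatter pattern for contrast (0-based group ids assumed):
   the output index depends on g[i], so consecutive iterations may hit
   the same slot and cannot simply be split across accumulators. */
void sum_grouped(const double *x, const int *g, size_t n, double *out)
{
    for (size_t i = 0; i < n; ++i)
        out[g[i]] += x[i];
}
```

Because the additions are reassociated, the result can differ from a strictly sequential sum in the last bits for general floating-point input.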

Closes #824

Generated with Claude Code

#824)

Apply loop unrolling with FSUM_N_ACC=4 independent accumulators to all scalar (non-grouped) sum functions, breaking the serial dependency chain and enabling SIMD vectorization across all na.rm and type combinations.

Functions updated:
- fsum_double_impl: na.rm=TRUE (narm==1 and narm==2) and na.rm=FALSE paths
- fsum_double_omp_impl: all paths, using OpenMP array reduction
- fsum_weights_impl: all paths (narm==1, narm==2, narm==0)
- fsum_weights_omp_impl: all paths, using OpenMP array reduction
- fsum_int_impl: na.rm=TRUE path (na.rm=FALSE has early-return on NA, not vectorizable)
- fsum_int_omp_impl: both paths, using OpenMP array reduction on long long array

The OMP versions switch from reduction(+:sum) on a single scalar to reduction(+:partial_sums[:FSUM_N_ACC]) on an array of accumulators, which combines OpenMP parallelism with SIMD within each thread's chunk.

Co-authored-by: Sebastian Krantz <SebKrantz@users.noreply.github.com>
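
The switch from a scalar reduction to an array reduction might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the code in this commit: `N_ACC`, `NA_INT`, `sum_double_omp`, and `sum_int_omp_narm` are hypothetical names, and array-section reductions need a compiler supporting OpenMP 4.5 or later (without OpenMP enabled the pragma is ignored and the loops run serially).

```c
#include <limits.h>

#define N_ACC 4        /* illustrative stand-in for FSUM_N_ACC */
#define NA_INT INT_MIN /* R represents NA_integer_ as INT_MIN */

/* Hypothetical sketch: each thread gets a private copy of partial_sums,
   and OpenMP adds the copies element-wise when the loop finishes. */
double sum_double_omp(const double *x, long n)
{
    double partial_sums[N_ACC] = {0.0, 0.0, 0.0, 0.0};
    const long nfull = n - (n % N_ACC); /* largest multiple of N_ACC <= n */

    #pragma omp parallel for reduction(+:partial_sums[:N_ACC])
    for (long i = 0; i < nfull; i += N_ACC) {
        for (int j = 0; j < N_ACC; ++j)
            partial_sums[j] += x[i + j];
    }

    /* Serial tail for the last n % N_ACC elements. */
    for (long i = nfull; i < n; ++i)
        partial_sums[0] += x[i];

    return (partial_sums[0] + partial_sums[1]) +
           (partial_sums[2] + partial_sums[3]);
}

/* Integer variant with NA values skipped; long long accumulators keep
   the running totals from overflowing the int range. */
long long sum_int_omp_narm(const int *x, long n)
{
    long long partial_sums[N_ACC] = {0, 0, 0, 0};
    const long nfull = n - (n % N_ACC);

    #pragma omp parallel for reduction(+:partial_sums[:N_ACC])
    for (long i = 0; i < nfull; i += N_ACC) {
        for (int j = 0; j < N_ACC; ++j)
            partial_sums[j] += (x[i + j] == NA_INT) ? 0 : x[i + j];
    }

    for (long i = nfull; i < n; ++i)
        partial_sums[0] += (x[i] == NA_INT) ? 0 : x[i];

    return (partial_sums[0] + partial_sums[1]) +
           (partial_sums[2] + partial_sums[3]);
}
```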

TylerSagendorf · 2026-05-04T10:14:15Z

@SebKrantz The code generated by Claude is a bit inconsistent. I think this is because it tries to incorporate the code from all of the iterations that I posted: sometimes it names the accumulators acc, other times partial_sums. Additionally, some of the code it generated may be slower because it uses a chunk approach where it has to perform an additional multiplication every FSUM_N_ACC (4) elements.
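
For readers following along, here is a hedged sketch of the two loop shapes being contrasted; it is not code from the PR or from the comment. The names `N_ACC`, `sum_chunk_indexed`, and `sum_strided` are hypothetical. An optimizing compiler may strength-reduce the multiplication away, so any speed difference would have to be measured.

```c
#include <stddef.h>

#define N_ACC 4 /* illustrative stand-in for FSUM_N_ACC */

/* Chunk approach: iterate over chunk numbers and recompute the base
   offset with a multiplication once per chunk of N_ACC elements. */
double sum_chunk_indexed(const double *x, size_t n)
{
    double acc[N_ACC] = {0.0, 0.0, 0.0, 0.0};
    const size_t nchunks = n / N_ACC;

    for (size_t c = 0; c < nchunks; ++c) {
        const size_t base = c * N_ACC; /* the extra multiplication */
        for (int j = 0; j < N_ACC; ++j)
            acc[j] += x[base + j];
    }
    for (size_t i = nchunks * N_ACC; i < n; ++i)
        acc[0] += x[i];

    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}

/* Strided approach: advance the element index by N_ACC directly, so no
   offset has to be recomputed inside the loop. */
double sum_strided(const double *x, size_t n)
{
    double acc[N_ACC] = {0.0, 0.0, 0.0, 0.0};
    size_t i = 0;

    for (; i + N_ACC <= n; i += N_ACC) {
        for (int j = 0; j < N_ACC; ++j)
            acc[j] += x[i + j];
    }
    for (; i < n; ++i)
        acc[0] += x[i];

    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
```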

Could I implement the changes to the sum functions myself and submit a PR?

SebKrantz · 2026-05-04T11:34:11Z

@TylerSagendorf Thanks! And you are certainly very welcome to implement it!

TylerSagendorf · 2026-05-04T12:07:46Z

@SebKrantz no problem! I will work on that PR tonight.

TylerSagendorf · 2026-05-05T05:25:59Z

@SebKrantz I finished implementing the changes to the sum functions, and all tests are passing. I am going to review the code a bit more to verify that the changes are consistent/minimal and check some benchmarks before I open a PR.

TylerSagendorf mentioned this pull request May 6, 2026

Improve speed of fsum #828

Open

3 tasks

SebKrantz wants to merge 1 commit into master from claude/issue-824-20260504-0437
