
Investigate pairwise sum/product flops for reductions/horizontal ops #235

Open
@workingjubilee

Currently, the horizontal ops use a sequential ordering. This ordering shouldn't matter at all for integers or operations like min/max, because those should be associative, last I checked.

But it matters a lot for floats and their sums/products. A strictly sequential order is not very fast, because it introduces data dependencies, and it also produces close to the maximum possible rounding error. IEEE-754 states the order of the sum function on a vector may be reassociated, almost certainly for precisely this reason, and most of the hardware we are targeting supports pairwise summation or product operations. These are also called "tree reductions" for reasons that are obvious when you tilt your head and actually stare at what happens. They should be reasonable to emulate in software even where hardware support is absent.
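To make the difference concrete, here is a minimal scalar sketch in plain stable Rust. It is not how the horizontal ops are implemented; `sequential_sum` and `pairwise_sum` are hypothetical helper names, and splitting the slice in half is just one possible pairing order (hardware pairwise ops may pair lanes differently).

```rust
/// Sequential (left-to-right) sum: every add depends on the previous one,
/// so the adds form one long dependency chain.
fn sequential_sum(v: &[f32]) -> f32 {
    v.iter().fold(0.0, |acc, &x| acc + x)
}

/// Pairwise ("tree") sum: split the slice in half, sum each half
/// independently, then add the two partial sums.
fn pairwise_sum(v: &[f32]) -> f32 {
    match v.len() {
        0 => 0.0,
        1 => v[0],
        n => {
            let (lo, hi) = v.split_at(n / 2);
            pairwise_sum(lo) + pairwise_sum(hi)
        }
    }
}
```

For the input `[16_777_216.0, 1.0, 1.0, 1.0]` (that is, 2^24 followed by three ones) the exact sum is 16777219. With IEEE-754 round-to-nearest-even `f32` arithmetic, the sequential order absorbs each `1.0` into 2^24 and returns 16777216, while the pairwise order adds `1.0 + 1.0` first and returns 16777218.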

I know @gnzlbg was intending to use tree reductions for packed_simd. We could just add the "allow reassociation" flags to the ops, which is what the intrinsics originally did, but unfortunately that makes the order totally unspecified. Perhaps LLVM always uses tree reductions to implement this, but it doesn't promise to do so. It does seem to make things go faster, though:
https://llvm.godbolt.org/z/dhbdoeq4d

We would still like to have a consistent ordering, as that would be most portable (the algorithm would produce the same results on all machines). We should explore whether it's possible to make LLVM consistently emit pairwise float sums and, if so, what the performance implications are.

This got discussed extensively in the original Rust Zulip thread.
There's a Julia forum thread on summation algorithms that probably contains relevant content.

We should adopt a resolution to commit to a horizontal_{sum,product} ordering (or lack thereof) before we stabilize, so marking this as blocking stable. Some actionable steps to take first:

  • ask other stakeholders (read: backend maintainers) like @bjorn3 and @antoyo what they think would actually be best.
  • try to emulate tree reductions using swizzles and see if LLVM is earning its paycheck?
  • produce a few examples we can actually benchmark that do piles of summations and products and try out various reduction algorithms (sequential, pairwise, etc.) with them.
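As a starting point for the swizzle experiment, the reduction can be sketched on plain arrays standing in for SIMD vectors. This is an assumption-laden scalar emulation, not the `std::simd` API: `tree_reduce_sum` is a hypothetical name, and each loop iteration models one "swizzle the upper half down, then add lane-wise" step.

```rust
/// Tree reduction over a power-of-two lane count, written as log2(N) steps.
/// Each step adds the upper half of the active lanes onto the lower half,
/// which is what a swizzle followed by a lane-wise add would do.
fn tree_reduce_sum<const N: usize>(mut lanes: [f32; N]) -> f32 {
    // Assumption for this sketch: only power-of-two lane counts are handled.
    assert!(N.is_power_of_two(), "lane count must be a power of two");
    let mut width = N;
    while width > 1 {
        width /= 2;
        for i in 0..width {
            // Lane i receives lane i + width; these adds are independent
            // of each other within a single step.
            lanes[i] += lanes[i + width];
        }
    }
    lanes[0]
}
```

The order of operations here is fixed by the code rather than left to the backend, so for 4 lanes the result is always `(a0 + a2) + (a1 + a3)`. Whether LLVM turns a real swizzle-based version of this into good code on each target is exactly what the benchmarks would need to show.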

    Labels

    A-LLVM (Area: LLVM), A-floating-point (Area: Floating point numbers and arithmetic), C-feature-request (Category: a feature request, i.e. not implemented / a PR), blocks-stable
