Description
Currently, the horizontal ops use a sequential ordering. This ordering shouldn't matter at all for integers or for operations like min/max, because those should be associative, last I checked.
But it matters a lot for floats and their sums/products: a strictly sequential order is slow because it introduces data dependencies, and it also produces close to the maximum possible error. IEEE-754 states that the order of the sum function on a vector may be reassociated, almost certainly for precisely this reason, and most of the hardware we are targeting supports pairwise summation or product operations. These are also called "tree reductions", for reasons that are obvious when you tilt your head and actually stare at what happens. They should be reasonable to emulate in software even where hardware support is absent.
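To make the difference concrete, here is a minimal sketch in plain Rust (no SIMD types) comparing a strictly sequential sum with one possible pairwise ordering. The function names `sequential_sum` and `pairwise_sum` are hypothetical and exist only for this illustration; the tree shape a given piece of hardware or LLVM actually uses may differ.

```rust
/// Strictly sequential order: (((0 + a0) + a1) + a2) + ...
/// Every addition depends on the previous one.
fn sequential_sum(v: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for &x in v {
        acc += x;
    }
    acc
}

/// One possible pairwise ("tree") order: split the slice in half,
/// sum each half recursively, then add the two partial sums.
fn pairwise_sum(v: &[f32]) -> f32 {
    match v.len() {
        0 => 0.0,
        1 => v[0],
        n => {
            let (lo, hi) = v.split_at(n / 2);
            pairwise_sum(lo) + pairwise_sum(hi)
        }
    }
}
```

For the input `[16777216.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]` the exact sum is 16777223. The sequential order returns 16777216.0, because each `+ 1.0` is rounded away once the accumulator reaches 2^24, while this pairwise order returns 16777222.0. Pairwise ordering is not always the more accurate one for every input, but its worst-case error grows much more slowly with the number of elements.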
I know @gnzlbg was intending to use tree reductions for packed_simd. We could just add the "allow reassociation" flags to the ops, which is what the intrinsics originally did, but unfortunately that makes the order totally unspecified. Perhaps LLVM always uses tree reductions to implement this, but it doesn't promise to do so. It does seem to make things go faster, though:
https://llvm.godbolt.org/z/dhbdoeq4d
We would still like to have a consistent ordering, as that would be the most portable: the algorithm would produce the same results on all machines. We should explore whether it's possible to make LLVM consistently emit pairwise float sums and, if so, what the performance implications are.
This got discussed extensively in the original Rust Zulip thread.
There's a Julia forum thread on summation algorithms that probably contains relevant content.
We should adopt a resolution to commit to a horizontal_{sum,product} ordering (or lack thereof) before we stabilize, so I'm marking this as blocking stable. Some actionable steps to take first:
- ask other stakeholders (read: backend maintainers) like @bjorn3 and @antoyo what they think would actually be best.
- try to emulate tree reductions using swizzles and see if LLVM is earning its paycheck?
- produce a few examples we can actually benchmark that do piles of summations and products and try out various reduction algorithms (sequential, pairwise, etc.) with them.
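As a starting point for the emulation experiment, here is a rough sketch of a fixed-order tree reduction over 8 lanes, written against a plain `[f32; 8]` rather than a real SIMD vector type. Each pass of the loop stands in for "swizzle the upper half down, then apply the op lane-wise". The name `tree_reduce_8` is hypothetical, and whether LLVM lowers something like this to the shuffles and adds we want is exactly what would need checking.

```rust
/// Hypothetical helper: reduce 8 lanes in a fixed tree order by repeatedly
/// combining the upper half of the active lanes into the lower half.
/// Pass 1 combines lanes (0,4) (1,5) (2,6) (3,7), pass 2 combines (0,2) (1,3),
/// and pass 3 combines (0,1).
fn tree_reduce_8(mut v: [f32; 8], op: impl Fn(f32, f32) -> f32) -> f32 {
    let mut width = 8;
    while width > 1 {
        width /= 2;
        for i in 0..width {
            // With real vectors this inner loop would be one swizzle plus
            // one lane-wise operation.
            v[i] = op(v[i], v[i + width]);
        }
    }
    v[0]
}
```

Calling `tree_reduce_8(v, |a, b| a + b)` gives a sum and `tree_reduce_8(v, |a, b| a * b)` gives a product. Because the combination order is written out explicitly, the result should be the same on every target, which is the property we'd want to benchmark against the "allow reassociation" version.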