Expose algebraic floating point intrinsics #136457
base: master
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @thomcc (or someone else) some time within the next two weeks. Please see the contribution instructions for more information. Namely, in order to ensure the minimum review times lag, PR authors and assigned reviewers should ensure that the review label (…)
Some changes occurred to the intrinsics. Make sure the CTFE / Miri interpreter is adapted for the changes, if necessary. cc @rust-lang/miri, @rust-lang/wg-const-eval
Thanks for the speedy review @saethlin! I had a couple questions:
Thanks for the PR!
Going through the usual process, the first step would be a t-libs-api ACP to gauge the team's thinking on whether and how this should be exposed. This PR cannot land before a corresponding FCP has been accepted.
In terms of tests, usually we have doc tests. Also, given that the semantics are far from obvious (these operations are nondeterministic!), they need to be documented more carefully - probably in some central location, which is then referenced from everywhere.
@RalfJung: Thanks for the quick response!
https://doc.rust-lang.org/nightly/std/primitive.f32.html seems like a good place, that's where we already document everything special around NaNs. So a new section on algebraic operations there would probably be a good fit.
I don't see an existing codegen test for the intrinsics so these should probably get one. https://github.com/rust-lang/rust/tree/3f33b30e19b7597a3acbca19e46d9e308865a0fe/tests/codegen/float would be a reasonable home. For reference, this would just be a file containing functions like this for each of the new methods, in order to verify the flags that we expect are getting set:

```
// CHECK-LABEL: float @f32_algebraic_add(
#[no_mangle]
pub fn f32_algebraic_add(a: f32, b: f32) -> f32 {
    // CHECK: fadd reassoc nsz arcp contract float %a, %b
    a.algebraic_add(b)
}
```
The Miri subtree was changed cc @rust-lang/miri
```
// CHECK-LABEL: fp128 @f128_algebraic_add(
#[no_mangle]
pub fn f128_algebraic_add(a: f128, b: f128) -> f128 {
    // CHECK: fadd reassoc nsz arcp contract fp128 {{(%a, %b)|(%b, %a)}}
```
The addition and multiplication cases both end up as `%b, %a` rather than `%a, %b`, which surprised me but isn't incorrect. I opted to allow either in case behavior changes in the future.
This looks pretty reasonable to me but all the public functions should get some examples. I think it may also be good to give a small demo of how this may work at the end of the new "Algebraic operators" section. For example, the below:
```
x = x.algebraic_add(a.algebraic_mul(b));
```
May be rewritten as either of the following:
```
x = x + (a * b); // As written
x = (a * b) + x; // Reordered to allow using a single `fma`
```
**Per function examples**

Did you have specific examples in mind? I'm struggling to think of ones that aren't repetitive / low signal-to-noise (i.e. assert this algebraic add is approximately equal to a normal add). Should we instead link to the central documentation and test approximate equality of each of the ops in normal tests like this so they don't clutter the documentation?

**Central example**

Made a couple small edits for brevity. How's this look?
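Something like the following is about all I can come up with (a minimal sketch, assuming the `algebraic_*` method names from this PR), which is why it feels low signal-to-noise:

```
#![feature(float_algebraic)]

let x = 1.0f32;
let y = 2.0f32;
// Algebraic ops only promise a result close to the exact one, so a doc
// example can really only assert approximate equality.
assert!((x.algebraic_add(y) - (x + y)).abs() <= 1e-4);
```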
@bors try

I think you should be able to run …
💔 Test failed - checks-actions
@bors delegate+ for …
@bors try

It's a bit better to use …
library/std/tests/floats/f32.rs (outdated)

```
#[test]
fn test_algebraic() {
    let a: f32 = 123.0;
    let b: f32 = 456.0;

    assert_approx_eq!(a.algebraic_add(b), a + b, 1e-2);
    assert_approx_eq!(a.algebraic_sub(b), a - b, 1e-2);
    assert_approx_eq!(a.algebraic_mul(b), a * b, 1e-1);
    assert_approx_eq!(a.algebraic_div(b), a / b, 1e-5);
    assert_approx_eq!(a.algebraic_rem(b), a % b, 1e-2);
}
```
What is actually going on that causes these tests to fail? `a + b` should be 579.0, but the result is 579.0011. Afaict none of the algebraic effects can come into play here, so shouldn't the result be exact?
@LorrensP-2158466 @RalfJung can confirm, but I assume this is from rust-lang/miri@f5330d0, since this works under `./x test` but fails under `./x miri`:

```
assert_eq!(a.algebraic_add(b), 579.0);
assert_eq!(b.algebraic_add(a), 579.0);
```

Would it make sense to specialize this test for MIRI vs. non-MIRI and assert exact values in the non-MIRI variant?
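For concreteness, a minimal sketch of what that specialization could look like (keeping the existing `assert_approx_eq!` helper; only the `add` case shown):

```
#[test]
fn test_algebraic() {
    let a: f32 = 123.0;
    let b: f32 = 456.0;

    if cfg!(miri) {
        // Miri injects extra non-determinism into algebraic ops,
        // so only approximate equality can be checked there.
        assert_approx_eq!(a.algebraic_add(b), a + b, 1e-2);
    } else {
        // Without that injected error the result is exact in practice.
        assert_eq!(a.algebraic_add(b), a + b);
    }
}
```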
Here's what that could look like: 6ab3fec
Typing on my phone, so I'm sorry in advance.
In a PR a while ago I introduced some extra non-determinism on floating point operations with no specified precision. Some time later Ralf disabled these because they failed tests in the std lib and elsewhere, but I think not the algebraic ones. These get a 16ULP error on top of the host error.
There is a new PR where we are working on lowering these errors and accounting for them in the Rust tests.
I think the best thing to do is to just disable this ULP error everywhere and take this on in the current PR or a follow-up.
But Ralf knows best ofc ;)
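(Rough back-of-the-envelope check, mine rather than anything from the thread: a 16ULP relative error is the right order of magnitude to explain the 579.0011 seen above.)

```
fn main() {
    let exact = 123.0_f32 + 456.0_f32; // 579.0
    // One ULP of an f32 near 579 is 2^(9 - 23) = 2^-14 ≈ 6.1e-5, so an error
    // of roughly 16 ULP corresponds to a deviation on the order of 1e-3,
    // which matches observing 579.0011 instead of 579.0.
    let one_ulp = f32::from_bits(exact.to_bits() + 1) - exact;
    println!("1 ULP ≈ {one_ulp}, 16 ULP ≈ {}", 16.0 * one_ulp);
}
```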
@LorrensP-2158466: Thanks! Re:

> I think the best thing to do is just disabling this ULP error everywhere and take this on in the current pr or a follow up.

Do you mean revert 622e8f4 in its entirety, or temporarily make `apply_random_float_error_ulp()` a noop?
Updated. How's 8ac2e73 look? Should we also update these for consistency or leave them as is?

* rust/src/tools/miri/src/intrinsics/mod.rs, line 252 in 716dd22: `// Apply a relative error of 16ULP to introduce some non-determinism`
* rust/src/tools/miri/src/intrinsics/mod.rs, line 288 in 716dd22: `// Apply a relative error of 16ULP to introduce some non-determinism`
* rust/src/tools/miri/src/intrinsics/mod.rs, line 503 in 716dd22: `/// Applies a random 16ULP floating point error to val and returns the new value.`
Not sure what you are asking about the error terms?
Sorry, I was asking if there's a constant or helper I should use instead of hardcoding absolute error terms in these `assert_approx_eq()` calls. I don't see any that make sense though.
> Should we also update these for consistency or leave them as is?

There's another PR touching those already (#138062), so please don't touch them here to avoid conflicts.

> Sorry, I was asking if there's a constant or helper I should use instead of hardcoding absolute error terms in these assert_approx_eq() calls. I don't see any that make sense though.

There's a default of 1e-5 or so, but for f32 values on the order of 100000, that's like less than 1 ULP or so, hence it doesn't work for these tests.
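(Quick sanity check of that, my own arithmetic rather than anything from the thread: for the products in this test the default tolerance is indeed well below one ULP.)

```
fn main() {
    let product = 123.0_f32 * 456.0_f32; // 56088.0
    let one_ulp = f32::from_bits(product.to_bits() + 1) - product; // 2^-8 ≈ 0.0039
    // The default absolute tolerance of ~1e-5 is a few hundred times smaller than
    // one ULP at this magnitude, so it leaves no room for any injected error.
    assert!(1e-5_f32 < one_ulp);
}
```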
Copy, thanks! All good as is then. I'll squash so we can get this merged.
> this test can legitimately fail without Miri.

Do you mean within the limits of what we document, or in practice with what we currently have? I don't think any of `reassoc nsz arcp contract` allow imprecise results for only a single float op, with the possible exception of `arcp`.
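To spell out why `arcp` is the possible exception (illustrative only, not something this PR changes): it allows a division to be replaced with a multiplication by a reciprocal, which adds an extra rounding step even for a single op.

```
fn scale(x: f32, y: f32) -> f32 {
    // Written as one correctly rounded division:
    x / y
    // With `arcp`, LLVM may instead compute `x * (1.0 / y)`, which rounds twice
    // and can therefore differ from the exact quotient by a ULP or so.
}
```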
Fair, I should have been more clear -- I mean within the limits of the spec.
☀️ Try build successful - checks-actions
By the way, in my codebase I use something minimal like this:

```
// (Crate root also needs #![feature(core_intrinsics)] on nightly.)
use std::intrinsics::{fadd_algebraic, fdiv_algebraic, fmul_algebraic, frem_algebraic, fsub_algebraic};
use std::ops::{Add, Div, Mul, Rem, Sub};

#[derive(Copy, Clone, Default)]
#[repr(transparent)]
struct Af64(pub f64);

impl Af64 {
    const ZERO: Af64 = Af64(0.0);
    const ONE: Af64 = Af64(1.0);

    fn from(x: i32) -> Self { Self(f64::from(x)) }
}

impl Add for Af64 {
    type Output = Self;
    #[inline]
    fn add(self, other: Self) -> Self {
        Self(fadd_algebraic(self.0, other.0))
    }
}

impl Sub for Af64 {
    type Output = Self;
    #[inline]
    fn sub(self, other: Self) -> Self {
        Self(fsub_algebraic(self.0, other.0))
    }
}

impl Mul for Af64 {
    type Output = Self;
    #[inline]
    fn mul(self, other: Self) -> Self {
        Self(fmul_algebraic(self.0, other.0))
    }
}

impl Div for Af64 {
    type Output = Self;
    #[inline]
    fn div(self, other: Self) -> Self {
        Self(fdiv_algebraic(self.0, other.0))
    }
}

impl Rem for Af64 {
    type Output = Self;
    #[inline]
    fn rem(self, other: Self) -> Self {
        Self(frem_algebraic(self.0, other.0))
    }
}
```
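For illustration (not from the comment above): with a wrapper like that, the dot product from the PR description can be written with ordinary operators.

```
fn dot(a: &[Af64], b: &[Af64]) -> Af64 {
    // The algebraic Add/Mul impls let LLVM reassociate and vectorize this loop.
    a.iter().zip(b.iter()).fold(Af64::ZERO, |sum, (&x, &y)| sum + x * y)
}
```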
Squashed with …
We explicitly do not guarantee an exact value so it seems odd to have a test that checks for an exact value...
@tgross35: #136457 (comment) resolved, ready to merge whenever you are!
@bors try
💔 Test failed - checks-actions
If you prefer you can remove the non-determinism from the algebraic operations in Miri and we'll figure out the best way to deal with that in a follow-up PR. We should figure this out before stabilization to ensure we have a coherent semantics, but it doesn't have to all be fully done in the first PR.
Problem
A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.
See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on an i7-10875H.
C++: 10us ✅

With Clang 18.1.3 and `-O2 -march=haswell`:

```
float dot(float *a, float *b, size_t len) {
#pragma clang fp reassociate(on)
    float sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Nightly Rust: 10us ✅

With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`:

```
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}
```

Stable Rust: 84us ❌

With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:

```
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
```
Proposed Change

Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature.
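For example, with the new methods the dot product from the Problem section could be written as follows (a sketch; it assumes the `algebraic_*` method names used elsewhere in this PR):

```
#![feature(float_algebraic)]

fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0_f32;
    for i in 0..a.len() {
        // Algebraic ops let the compiler reassociate and contract into FMAs.
        sum = sum.algebraic_add(a[i].algebraic_mul(b[i]));
    }
    sum
}
```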
Alternatives Considered
#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles.
In the meantime, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.
References

* rust-lang#21690
* rust-lang/libs-team#532
* rust-lang#136469
* https://github.com/calder/dot-bench
* https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps
try-job: x86_64-gnu-nopt
try-job: x86_64-gnu-aux