Bad codegen for comparing struct of two 16bit ints #140167
Using …
Trying this in C shows that it is also unable to optimize (the equivalent of …): https://godbolt.org/z/cqhM9hThK

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

typedef struct {
    uint16_t x;
    uint16_t y;
} Foo;

bool eq(const Foo* a, const Foo* b) {
    return memcmp(a, b, sizeof(Foo)) == 0;
}

bool partial_eq(const Foo* a, const Foo* b) {
    return a->x == b->x && a->y == b->y;
}
```

```asm
eq:
        mov     eax, dword ptr [rdi]
        cmp     eax, dword ptr [rsi]
        sete    al
        ret
partial_eq:
        movzx   eax, word ptr [rdi]
        cmp     ax, word ptr [rsi]
        jne     .LBB1_1
        movzx   eax, word ptr [rdi + 2]
        cmp     ax, word ptr [rsi + 2]
        sete    al
        ret
.LBB1_1:
        xor     eax, eax
        ret
```

That suggests that codegen for …
At the time derives are expanded, field types aren’t known at all. The derive just gets tokens, and it might not even see the same tokens as later stages of the compiler. So the derive can’t do the necessary checks (notably including: is bitwise equality even correct for all field types?). In theory the built-in derives could expand to some magic intrinsic that’s lowered much later when types are known, but that’s a big hammer. I don’t see any fundamental reason why LLVM shouldn’t be able to do this optimization (at least for Rust), and that would help non-derived code as well. But I haven’t checked Alive2.
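To make the first point concrete, here is a minimal sketch (the type alias is invented purely for illustration) of why token-level expansion can't answer the "is bitwise equality even correct?" question:

```rust
// Illustrative sketch only. At expansion time the derive sees the token
// `Meters`, not the underlying f32, so it cannot tell that a bitwise
// comparison would be wrong for this type (NaN != NaN, -0.0 == 0.0).
type Meters = f32;

#[derive(PartialEq)]
struct Point {
    x: Meters,
    y: Meters,
}
```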
Note that a manual …

Looks like LLVM in all three cases does 2 vectorized loads of `<2 x i16>`. (Note the OP version's LLVM IR is different from the & version below.)

```llvm
; common prefix
%0 = load <2 x i16>, ptr %a, align 2
%1 = load <2 x i16>, ptr %b, align 2
%2 = icmp eq <2 x i16> %0, %1

; OP and && version
%3 = extractelement <2 x i1> %2, i64 0
%4 = extractelement <2 x i1> %2, i64 1
%_0.sroa.0.0 = select i1 %3, i1 %4, i1 false
ret i1 %_0.sroa.0.0

; & version
%shift = shufflevector <2 x i1> %2, <2 x i1> poison, <2 x i32> <i32 1, i32 poison>
%3 = and <2 x i1> %2, %shift
%_0 = extractelement <2 x i1> %3, i64 0
ret i1 %_0
```

Godbolt link with asm and LLVM IR: …

(Maybe LLVM is assuming that the second fields could be …
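For reference, a sketch of what the three variants presumably look like at the Rust level (the exact code isn't preserved in this thread excerpt, so `Foo` and the function names here are assumptions):

```rust
// Assumed shape of the three variants labeled above (illustrative only).
#[derive(PartialEq)] // "OP version": the derived impl
pub struct Foo {
    pub x: u16,
    pub y: u16,
}

// "&& version": short-circuiting field-by-field comparison.
pub fn eq_and_and(a: &Foo, b: &Foo) -> bool {
    a.x == b.x && a.y == b.y
}

// "& version": non-short-circuiting; both fields are always compared.
pub fn eq_bitand(a: &Foo, b: &Foo) -> bool {
    (a.x == b.x) & (a.y == b.y)
}
```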
Unfortunately not in the code I minimized this from, which does roughly:

```rust
#[unsafe(no_mangle)]
pub fn foo(a: &Foo) -> bool {
    foo_eq_3(a, &Foo { x: 1, y: 1 })
}
```

That still results in two separate loads, even with the non-short-circuiting version: https://rust.godbolt.org/z/1e47McvW1

```asm
foo:
        movzx   eax, word ptr [rdi]
        xor     eax, 1
        movzx   ecx, word ptr [rdi + 2]
        xor     ecx, 1
        or      cx, ax
        sete    al
        ret
```
memorysafety/rav1d#1400 (comment) suggests that adding …
I assume they meant …

That's not simply a missed optimization opportunity. While the load of the second field can't produce poison/undef, that property is control-dependent. For example, when matching on …

Solving this seems hard: I don't think LLVM has a way to express "loading through this pointer always reads initialized bytes" (like how …
This might be a tangent, but it feels related and gives us some very useful building blocks: if we had something like …

Fuller example at: https://play.rust-lang.org/?version=stable&mode=release&edition=2024&gist=bc3313d6c653a1c9844c49eaf735528a

This approach wouldn’t directly address the … Comparing the assembly when …
It's not a full solution. It would certainly require some nuance around types like …
I don’t think “all bit patterns valid” is relevant for …
@hanna-kruppe, that makes sense. After typing it up, but unfortunately only after posting, I started to see where it might fall apart. The traits required here might be too specific to PartialEq to justify inclusion. This might be somewhere where the …
I don’t think it’s out of scope for the standard derives to do that sort of thing, if it can be done correctly while keeping the benefits. Someone just has to find a workable way to do it (the ingredients don’t even have to be user-facing or stable, but “all bit patterns valid” doesn’t seem like the right tool for the job). This is nontrivial because (1) the derive doesn’t have type information at expansion time, (2) expanding to multiple alternative implementations to work around the first challenge has some costs, and (3) if it’s applied to all …
Rather than trying to optimize this in the PartialEq derive, could we instead teach Rust to do a peephole optimization in MIR that coalesces comparing two u16 tuples into a single u32 comparison? That might be more generally applicable, since I’m guessing it’d work for u16 variables that happen to be next to each other on the stack or in arguments, or if the u16s are in a larger struct.
Trying to do this as an optimization in MIR faces essentially the same problem as asking LLVM to do it. We'd need to find a way to make programs like the following UB, otherwise the "optimization" would introduce UB in such programs:

```rust
use std::mem::MaybeUninit;

#[repr(C)]
struct Foo(u16, u16);

fn foo_eq(lhs: &Foo, rhs: &Foo) -> bool {
    lhs.0 == rhs.0 && lhs.1 == rhs.1
}

fn main() {
    let mut a: MaybeUninit<Foo> = MaybeUninit::uninit();
    let mut b: MaybeUninit<Foo> = MaybeUninit::uninit();
    unsafe {
        a.as_mut_ptr().cast::<u16>().write(1u16);
        b.as_mut_ptr().cast::<u16>().write(2u16);
    }
    // note: second field of a, b remains uninit
    let a_ref = unsafe { &*a.as_ptr() };
    let b_ref = unsafe { &*b.as_ptr() };
    // always prints false because the first field differs and the second one isn't compared
    println!("{}", foo_eq(a_ref, b_ref));
}
```

My understanding of the current thinking on opsem is that a program like this probably won't end up as UB. For example, Miri doesn't diagnose any UB in the above program, but does diagnose UB if the short-circuiting in …
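For contrast, here is a sketch of what the merged comparison would effectively do (a hypothetical hand-written version, not compiler output): it reads all four bytes of each struct, including the uninitialized second fields above, which is exactly why the transformation can't simply be applied.

```rust
// Hypothetical merged form (illustrative only): one 4-byte load per struct.
// For the program above this reads uninitialized bytes, so a compiler may not
// introduce it unless those reads are made defined (e.g. via a freeze-like
// mechanism) or the source program is declared UB.
fn foo_eq_merged(lhs: &Foo, rhs: &Foo) -> bool {
    let l = unsafe { (lhs as *const Foo as *const u32).read_unaligned() };
    let r = unsafe { (rhs as *const Foo as *const u32).read_unaligned() };
    l == r
}
```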
I think we'd need a rule along the lines of "dereferencing a reference asserts validity of the entire value behind the reference"?
One interesting aspect of this is that even if y is undef, in the actual machine code the result of the equality comparison will be the same (since real CPU architectures don't have undef). So it feels like LLVM should in this case be able to optimise the comparison regardless of undef. I'm not an opsem guru or anything, so I don't know if there is a way to make LLVM take advantage of the real hardware here.
There has been extensive discussion about how it could be made UB and why it might be preferable not to, e.g. in rust-lang/unsafe-code-guidelines#346 -- let's not re-tread that ground here. That discussion also points to #117800 and has several people point out that …

The problem with crossing page boundaries is actually prevented by the uncontroversial parts of what's UB in Rust: …
Thinking a bit more about this, alignment could still play a role in optimising this for some architectures, but at least for x86/x86-64 unaligned reads are fine (for most non-SIMD instructions). So I would expect it to get optimised to a u32 comparison in that case (assuming the undef issue is solved). Some other architectures might have to be conservative though.
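As a quick check on the alignment point, a tiny standalone sketch (assuming the same two-u16 layout discussed above):

```rust
// Foo is only 2-aligned, so the hoped-for single 4-byte load is not guaranteed
// to be 4-aligned; on x86-64 that is fine, but stricter architectures would
// need an unaligned load.
#[repr(C)]
struct Foo {
    x: u16,
    y: u16,
}

fn main() {
    assert_eq!(std::mem::align_of::<Foo>(), 2);
    assert_eq!(std::mem::size_of::<Foo>(), 4);
}
```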
Ah, one complication with “just …
For the record, LLVM already has this; see the docs for the MergeIcmps pass:

```cpp
// Example:
//
//   struct S {
//     int a;
//     char b;
//     char c;
//     uint16_t d;
//     bool operator==(const S& o) const {
//       return a == o.a && b == o.b && c == o.c && d == o.d;
//     }
//   };
//
// Is optimized as :
//
//   bool S::operator==(const S& o) const {
//     return memcmp(this, &o, 8) == 0;
//   }
//
// Which will later be expanded (ExpandMemCmp) as a single 8-bytes icmp.
```

But again, I don't know why this doesn't trigger here. Could it by any chance be related to #83623?
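For comparison, here is a sketch of a Rust analogue of that struct (my own illustration, not from the thread); the derived PartialEq produces the same kind of chained field comparisons that the pass is designed to merge.

```rust
// Rust counterpart of the struct from the MergeICmps doc excerpt above
// (illustrative sketch). With repr(C) there is no padding (4 + 1 + 1 + 2 = 8
// bytes), so a bytewise comparison of the whole struct would be equivalent to
// the derived field-by-field comparison.
#[derive(PartialEq)]
#[repr(C)]
struct S {
    a: i32,
    b: i8,
    c: i8,
    d: u16,
}
```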
Even if true, this could only be done late, during codegen passes (as optimizing the IR into a more-UB form could cause the modified IR to be inlined and start actually breaking stuff). Meanwhile, in this specific case the vectorizer happened to optimize the comparisons first, just into a different, slower form (hence …
Yes, here is a repro: https://godbolt.org/z/8j71anzTW
I haven't properly thought through all this, but it would appear that it'd be enough to optimize …
No, …
If you look at the opt pipeline for these functions, you'll see that these examples don't trigger MergeICmps either. The transformations are mostly done by InstCombine and are only valid because the structs are passed to the function as an i32/i64 value, so if any of the struct fields are poison to begin with, the whole function returns poison even before optimizations.

MergeICmps is indeed aimed at exactly this pattern, but there's a bunch of open issues for it (and also for the memcmp expansion pass that it relies on). Its pattern matching seems to be pretty narrow currently, and there are correctness issues as well (also related to having to freeze the loaded bytes before comparing!). But if those problems get fixed, normalizing such comparisons to …
I don't see any reason why this couldn't be optimized at the LLVM level. From rust-lang/unsafe-code-guidelines#346:
I don't think explicit freezing is even needed on the Rust side: the short-circuit behaviour of && means there are two possibilities: …
After reviewing the LLVM issues I linked in my last comment more carefully: …

So yeah, I was wrong in my earlier comments in this issue: LLVM could totally do this without any change to rustc's output, or even to Rust's UB rules. But current LLVM can't do it soundly, and fixing that would probably require a new LLVM IR construct that does not exist today; the existing "…
Part of the problem here is that de facto memory cannot contain poison (making these transforms legal), but de jure it can (because people would like poison in memory in the future). This gives us the awkward situation where you can't actually put poison in memory without miscompiles, while at the same time we also can't exploit the absence of poison for optimization purposes :)
I'm curious as to how that would work: would each byte have 257 possible bit patterns? Presumably you would store this separately as a validity bitmask (similar to how sanitizers, Valgrind, etc. work). But this would also break common optimised assembly implementations of strlen, memcmp, etc., where you do full aligned loads into SIMD registers and deal with "read past the end" in clever ways. So I'm sceptical of people wanting to break backwards compatibility like that, at least on x86-64. And LLVM should strive for generating the best possible code for each ISA. But following the specifics of the architecture manual trumps "there is no implementation that does that", fair enough. (I assume that when you said "de jure" you have such a specific thing to point to? Because otherwise it is not really "de jure".) But thinking about this from another angle: the Rust type system should be rich enough to allow optimising this anyway: an …
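To make the "257 possible bit patterns" idea concrete, here is a toy model of how tools such as Miri conceptually track it (purely illustrative; this is not how LLVM or rustc actually represent memory):

```rust
// A toy "abstract byte": either one of the 256 concrete values or Uninit,
// giving 257 possible states per byte. Sanitizers and Miri track roughly this
// via shadow memory / bitmasks rather than by widening the byte itself.
#[derive(Clone, Copy, PartialEq, Debug)]
enum AbstractByte {
    Init(u8),
    Uninit,
}

fn main() {
    let memory = [AbstractByte::Init(1), AbstractByte::Uninit];
    // Comparing against a concrete byte is only meaningful if it's initialized.
    assert_eq!(memory[0], AbstractByte::Init(1));
    assert!(matches!(memory[1], AbstractByte::Uninit));
}
```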
This LLVM issue also seems very relevant: llvm/llvm-project#52930
I tried this code: …

I expected to see this happen: `a` and `b` are loaded into a single register each and then the registers are compared against each other.

Instead, this happened:

For `-Copt-level=2`: …

For `-Copt-level=3`: …

Meta

`rustc --version --verbose`: Both 1.86 and nightly.