
Bad codegen for comparing struct of two 16bit ints #140167


Open
bjorn3 opened this issue Apr 22, 2025 · 28 comments
Labels
A-codegen (Area: Code generation), A-LLVM (Area: Code generation parts specific to LLVM; both correctness bugs and optimization-related issues), C-bug (Category: This is a bug), I-slow (Issue: Problems and improvements with respect to performance of generated code)

Comments

@bjorn3
Member

bjorn3 commented Apr 22, 2025

I tried this code:

#[derive(Clone, Copy, PartialEq)]
pub struct Foo {
    pub x: u16,
    pub y: u16,
}

#[no_mangle]
fn eq(a: &Foo, b: &Foo) -> bool {
    a == b
}

I expected to see this happen: a and b are loaded into a single register each and then the registers are compared against each other.

Instead, this happened:

For -Copt-level=2:

eq:
        movzx   eax, word ptr [rdi]
        movzx   ecx, word ptr [rdi + 2]
        xor     ax, word ptr [rsi]
        movzx   edx, word ptr [rsi + 2]
        xor     edx, ecx
        or      dx, ax
        sete    al
        ret

For -Copt-level=3:

eq:
        movd    xmm0, dword ptr [rdi]
        movd    xmm1, dword ptr [rsi]
        pcmpeqw xmm1, xmm0
        pshuflw xmm0, xmm1, 80
        pshufd  xmm0, xmm0, 80
        movmskpd        eax, xmm0
        cmp     eax, 3
        sete    al
        ret

Meta

rustc --version --verbose:
Both 1.86 and nightly.

@bjorn3 added the C-bug and I-slow labels on Apr 22, 2025
@rustbot added the needs-triage label on Apr 22, 2025
@bjorn3 changed the title from "Bad codegen for comparing struct of 16bit ints" to "Bad codegen for comparing struct of two 16bit ints" on Apr 22, 2025
@bjorn3
Member Author

bjorn3 commented Apr 22, 2025

Using transmute::<_, u32>(read_unaligned(self)) and transmute::<_, u32>(read_unaligned(other)) inside of the PartialEq impl produces the expected codegen.
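
For illustration, a minimal sketch of a manual PartialEq impl along those lines (a guess at what was meant; it relies on Foo being 4 bytes with no padding):

use std::mem::transmute;
use std::ptr::read_unaligned;

#[derive(Clone, Copy)]
pub struct Foo {
    pub x: u16,
    pub y: u16,
}

impl PartialEq for Foo {
    fn eq(&self, other: &Foo) -> bool {
        // SAFETY: Foo is 4 bytes with no padding, so every byte is
        // initialized and the transmute to u32 is size-compatible.
        unsafe {
            transmute::<Foo, u32>(read_unaligned(self))
                == transmute::<Foo, u32>(read_unaligned(other))
        }
    }
}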

@folkertdev
Contributor

Trying this in C shows that the C compiler is also unable to optimize (the equivalent of) Foo::eq, but it does generate good code when memcmp is used:

https://godbolt.org/z/cqhM9hThK

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

typedef struct {
    uint16_t x;
    uint16_t y;
} Foo;

bool eq(const Foo* a, const Foo* b) {
    return memcmp(a, b, sizeof(Foo)) == 0;
}

bool partial_eq(const Foo* a, const Foo* b) {
    return a->x == b->x && a->y == b->y;
}
eq:
        mov     eax, dword ptr [rdi]
        cmp     eax, dword ptr [rsi]
        sete    al
        ret

partial_eq:
        movzx   eax, word ptr [rdi]
        cmp     ax, word ptr [rsi]
        jne     .LBB1_1
        movzx   eax, word ptr [rdi + 2]
        cmp     ax, word ptr [rsi + 2]
        sete    al
        ret
.LBB1_1:
        xor     eax, eax
        ret

That suggests that codegen for Eq could be improved by using memcmp in some cases (the fields must be Copy, there must be no padding, and probably other conditions that I'm missing).
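
For illustration, a sketch of what such a byte-wise comparison could look like when written by hand (this is not what the derive expands to today, and it is only valid because Foo has no padding):

use std::mem::size_of;
use std::slice;

#[derive(Clone, Copy)]
#[repr(C)]
pub struct Foo {
    pub x: u16,
    pub y: u16,
}

pub fn eq_bytes(a: &Foo, b: &Foo) -> bool {
    // SAFETY: Foo is Copy, has no padding, and both references point to
    // fully initialized values, so viewing them as byte slices is sound.
    let a_bytes = unsafe { slice::from_raw_parts(a as *const Foo as *const u8, size_of::<Foo>()) };
    let b_bytes = unsafe { slice::from_raw_parts(b as *const Foo as *const u8, size_of::<Foo>()) };
    // Equality on [u8] slices typically lowers to a memcmp/bcmp call.
    a_bytes == b_bytes
}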

@hanna-kruppe
Contributor

hanna-kruppe commented Apr 22, 2025

At the time derives are expanded, field types aren’t known at all. The derive just gets tokens and it might not even see the same tokens as later stages of the compiler. So the derive can’t do the necessary checks (notably including: is bitwise equality even correct for all field types). In theory the built-in derives could expand to some magic intrinsic that’s lowered much later when types are known, but that’s a big hammer. I don’t see any fundamental reason why LLVM shouldn’t be able to do this optimization (at least for Rust) and that would help non-derived code as well. But I haven’t checked Alive2.

@Urgau added the A-LLVM and A-codegen labels and removed the needs-triage label on Apr 22, 2025
@zachs18
Contributor

zachs18 commented Apr 24, 2025

Note that a manual (a.x == b.x) && (a.y == b.y) compiles to the same suboptimal codegen (which makes sense, since that's what the derive expands to), but (a.x == b.x) & (a.y == b.y) (non-short-circuiting &) compiles to 32-bit load and compare.
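
For concreteness, the two variants being compared (a sketch using the Foo from the issue; the comments restate the codegen observed above):

#[derive(Clone, Copy)]
pub struct Foo {
    pub x: u16,
    pub y: u16,
}

// Short-circuiting &&: same suboptimal codegen as the derived PartialEq.
#[no_mangle]
pub fn eq_short_circuit(a: &Foo, b: &Foo) -> bool {
    (a.x == b.x) && (a.y == b.y)
}

// Non-short-circuiting &: optimizes to a single 32-bit load and compare.
#[no_mangle]
pub fn eq_bitand(a: &Foo, b: &Foo) -> bool {
    (a.x == b.x) & (a.y == b.y)
}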

Looks like LLVM in all three cases does two vectorized loads of <2 x i16> and then a compare of the vectors, but how it ANDs the resulting <2 x i1> differs: the OP and && versions do something like if cmp[0] { cmp[1] } else { false }, while & does something else.

(Note the OP version's LLVM IR is different from the && version, but IIUC only in var names and debug info; they get deduplicated anyway if both are present)

// common prefix
  %0 = load <2 x i16>, ptr %a, align 2
  %1 = load <2 x i16>, ptr %b, align 2
  %2 = icmp eq <2 x i16> %0, %1

// OP and && version
  %3 = extractelement <2 x i1> %2, i64 0
  %4 = extractelement <2 x i1> %2, i64 1
  %_0.sroa.0.0 = select i1 %3, i1 %4, i1 false
  ret i1 %_0.sroa.0.0

// & version
  %shift = shufflevector <2 x i1> %2, <2 x i1> poison, <2 x i32> <i32 1, i32 poison>
  %3 = and <2 x i1> %2, %shift
  %_0 = extractelement <2 x i1> %3, i64 0
  ret i1 %_0

godbolt link with asm and llvm ir

(Maybe LLVM is assuming that the second fields could be poison in the && case, so it avoids using their comparison when the first comparison is false?)

@bjorn3
Member Author

bjorn3 commented Apr 24, 2025

but (a.x == b.x) & (a.y == b.y) (non-short-circuiting &) compiles to 32-bit load and compare.

Unfortunately not in the code I minimized this from, which does roughly:

#[unsafe(no_mangle)]
pub fn foo(a: &Foo) -> bool {
    foo_eq_3(a, &Foo { x: 1, y: 1 })
}

That still results in two separate loads even with non-short-circuiting &:

https://rust.godbolt.org/z/1e47McvW1

foo:
        movzx   eax, word ptr [rdi]
        xor     eax, 1
        movzx   ecx, word ptr [rdi + 2]
        xor     ecx, 1
        or      cx, ax
        sete    al
        ret

@jrmuizel
Contributor

memorysafety/rav1d#1400 (comment) suggests that adding !undef to the load would fix this.

@hanna-kruppe
Contributor

hanna-kruppe commented May 20, 2025

I assume they meant !noundef. As noted in a later comment, rustc already adds that metadata when it emits LLVM IR, but SimplifyCFG drops it when turning the two branches on icmps into a select.

That's not simply a missed optimization opportunity. While the load of the second field can't load poison/undef, that property is control-dependent. For example, when matching on Option<u32> behind a reference, loading the u32 field in the Some case is also !noundef, but that part of memory will generally be undef/poison on the None branch. Thus, hoisting !noundef loads generally has to drop the !noundef metadata.
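
A sketch of that Option<u32> situation (illustrative only; the function is made up):

fn payload_or_zero(opt: &Option<u32>) -> u32 {
    match opt {
        // On this branch the load of the u32 payload reads initialized
        // memory, so it can carry !noundef.
        Some(x) => *x,
        // On this branch the payload bytes may be uninitialized, so a load
        // hoisted above the discriminant check could not keep !noundef.
        None => 0,
    }
}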

Solving this seems hard: I don't think LLVM has a way to express "loading through this pointer always reads initialized bytes" (like how dereferenceable says "loading through this pointer is always okay but may yield poison/undef"). I'm also not sure if we could apply such an attribute to Rust references -- it may break common unsafe patterns that form references to uninitialized memory. The reference conservatively calls that UB for now, but even there it's noted that this is controversial and may end up being non-UB. Miri doesn't enforce any such rule, I believe. So Rust may be unable to justify the desired optimization for the same reason why C compilers can't do it.

@mmastrac

mmastrac commented May 22, 2025

This might be a tangent, but it feels related and gives us some very useful building blocks:

If we had something like Plain (previously discussed in the forum) named AllBitPatternsValid<Repr=...> that was derived automatically for structs, plus const stdlib utility functions std::mem::all_bit_patterns_valid<T> and std::mem::bitwise_compare<T>, could PartialEq generate code like so?

Fuller example at: https://play.rust-lang.org/?version=stable&mode=release&edition=2024&gist=bc3313d6c653a1c9844c49eaf735528a

unsafe trait AllBitPatternsValid {
    type Repr: AllBitPatternsValid;
}

pub const fn bitwise_compare<T: AllBitPatternsValid>(a: &T, b: &T) -> bool { ... }
pub const fn all_bit_patterns_valid<T: Sized>() -> bool { ... }

impl PartialEq for MyStruct {
    #[inline(always)]
    fn eq(self: &MyStruct, other: &MyStruct) -> bool {
        if all_bit_patterns_valid::<Self>() {
            bitwise_compare(self, other)
        } else {
            self.x == other.x && self.y == other.y
        }
    }
}

This approach wouldn’t directly address the !noundef metadata issue discussed earlier, but it could enable the compiler to perform safe, low-level comparisons for types where all bit patterns are valid, potentially improving performance.

Comparing the assembly when all_bit_patterns_valid returns false vs when all_bit_patterns_valid returns true, we can see that the compiler happily generates the wider comparisons:

-	movq	(%rsp), %rax
-	movq	8(%rsp), %rdi
-	movzwl	(%rax), %ecx
-	cmpw	(%rdi), %cx
-	jne	.LBB4_6
-	movzwl	2(%rax), %eax
-	cmpw	2(%rdi), %ax
-	jne	.LBB4_6
+	movq	8(%rsp), %rax
+	movq	16(%rsp), %rdi
+	movl	(%rax), %ebx
+	movl	(%rdi), %ebp

It's not a full solution. It would certainly require some nuance around types like Box that are technically AllBitPatternsValid but require pointer chasing to compute equality -- perhaps something like Plain or Pod might be a better approach -- but these building blocks do seem like they could be very useful.

@hanna-kruppe
Contributor

hanna-kruppe commented May 22, 2025

I don’t think “all bit patterns valid” is relevant for PartialEq since you’re not transmuting anything to the type. The crucial part is knowing that the field type’s definition of comparison is equivalent to shallow bitwise comparison (and which comparisons, if you also want to cover ordering). For example, 0 is invalid for NonZero<T> but you can compare that type bitwise just fine, while Option<&()> has the “valid for all bit patterns” property but comparing it bitwise compares the address bits which doesn’t match the relevant PartialEq impls.
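
A small illustration of that distinction (a sketch; it only demonstrates the existing PartialEq behaviour):

use std::num::NonZeroU32;

fn main() {
    // NonZeroU32: the all-zero bit pattern is invalid, yet PartialEq is
    // exactly a bitwise comparison of the 32 bits.
    let a = NonZeroU32::new(7).unwrap();
    let b = NonZeroU32::new(7).unwrap();
    assert_eq!(a, b);

    // Option<&()>: PartialEq compares the referenced values, so two Some
    // references compare equal even when they hold different addresses;
    // a bitwise comparison of the pointer bits would not agree.
    let x = ();
    let y = ();
    let p: Option<&()> = Some(&x);
    let q: Option<&()> = Some(&y);
    assert_eq!(p, q);
}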

@mmastrac

@hanna-kruppe, that makes sense. After typing it up, and unfortunately only after posting, I started to see where it might fall apart. The traits required here might be too specific to PartialEq to justify inclusion.

This might be somewhere where the bytemuck crate could offer some improvements for code that needs that extra 1% rather than a stdlib thing.

@hanna-kruppe
Contributor

hanna-kruppe commented May 22, 2025

I don’t think it’s out of scope for the standard derives to do that sort of thing, if it can be done correctly and while keeping the benefits. Someone just has to find a workable way to do it (the ingredients don’t even have to be user-facing or stable, but “all bit patterns valid” doesn’t seem like the right tool for the job). This is nontrivial because (1) the derive doesn’t have type information at expansion time, (2) expanding to multiple alternative implementations to work around the first challenge has some costs, and (3) if it’s applied to all derive(PartialEq) without opt-in or opt-out, possible performance regressions from the supposed optimization have to be investigated and weighed against the benefits.

@erickt
Contributor

erickt commented May 22, 2025

Rather than trying to optimize this in the PartialEq derive, could we instead teach rustc to do a peephole optimization in MIR that coalesces the comparison of two adjacent u16s into a single u32 comparison? That might be more generally applicable, since I'm guessing it'd also work for u16 variables that happen to sit next to each other on the stack or in arguments, or for u16s inside a larger struct.

@hanna-kruppe
Contributor

hanna-kruppe commented May 22, 2025

Trying to do this as an optimization in MIR faces essentially the same problem as asking LLVM to do it. We'd need to find a way to make programs like the following UB; otherwise the "optimization" would introduce UB¹ in such programs:

use std::mem::MaybeUninit;

#[repr(C)]
struct Foo(u16, u16);

fn foo_eq(lhs: &Foo, rhs: &Foo) -> bool {
    lhs.0 == rhs.0 && lhs.1 == rhs.1
}

fn main() {
    let mut a: MaybeUninit<Foo> = MaybeUninit::uninit();
    let mut b: MaybeUninit<Foo> = MaybeUninit::uninit();
    
    unsafe {
        a.as_mut_ptr().cast::<u16>().write(1u16);
        b.as_mut_ptr().cast::<u16>().write(2u16);
    }
    // note: second field of a, b remains uninit
    
    let a_ref = unsafe { &*a.as_ptr() };
    let b_ref = unsafe { &*b.as_ptr() };
    // always prints false because the first field differs and the second one isn't compared
    println!("{}", foo_eq(a_ref, b_ref));
}

My understanding of the current thinking on opsem is that a program like this probably won't end up as UB. For example, Miri doesn't diagnose any UB in the above program, but does diagnose UB if the short-circuiting in foo_eq is removed or if the first fields of a and b are equal. As I wrote in an earlier comment:

The reference conservatively calls that UB for now, but even there it's noted that this is controversial and may end up being non-UB. Miri doesn't enforce any such rule, I believe. So Rust may be unable to justify the desired optimization for the same reason why C compilers can't do it.

Footnotes

  1. Sometimes such problems can be circumvented by considering more things well-defined in the IR than in the source language or using a more nuanced notion of UB such as LLVM's poison. But since we do want uses of uninit data to be UB in MIR, and also want this for the LLVM IR generated from MIR, I don't know how to salvage the optimization that way.

@meithecatte
Contributor

I think we'd need a rule along the lines of "dereferencing a reference asserts validity of the entire value behind the reference"?

@VorpalBlade

VorpalBlade commented May 22, 2025

One interesting aspect of this is that even if y is undef, in the actual machine code the result of the equality comparison will be the same (since real CPU architectures don't have undef)¹.

So it feels like LLVM should, in this case, be able to optimise the comparison regardless of undef. I'm not an opsem guru or anything, so I don't know if there is a way to make LLVM take advantage of the real hardware here.

Footnotes

  1. Assuming the struct doesn't cross a page boundary with the latter half unmapped -- but I don't think the AM allows for such a "half deallocation" anyway? This could also be solved by overaligning the struct so that it cannot cross a page boundary.

@hanna-kruppe
Contributor

I think we'd need a rule along the lines of "dereferencing a reference asserts validity of the entire value behind the reference"?

There has been extensive discussion about how it could be made UB and why it might be preferable not to, e.g. in rust-lang/unsafe-code-guidelines#346 -- let's not re-tread that ground here.

That discussion also points to #117800 and has several people point out that freeze could fix this optimization without making more Rust programs UB. That's essentially the "it actually doesn't matter what the second comparison returns" angle @VorpalBlade brought up.

The problem with crossing page boundaries is actually prevented by the uncontroversial parts of what's UB in Rust: &T function arguments are required to be dereferenceable for the full size of T, at least for T: Sized. So maybe there's a way to fix this in LLVM after all (for Rust using references, not for raw pointers in Rust or C). I don't know off-hand if there's some downside or tradeoff that explains why this isn't done (yet).

@VorpalBlade

Thinking a bit more about this, alignment could still play a role in optimising this for some architectures, but at least on x86/x86-64 unaligned reads are fine (for most non-SIMD instructions). So I would expect it to get optimised to a u32 comparison in that case (assuming the undef issue is solved). Some other architectures might have to be conservative, though.

@hanna-kruppe
Contributor

hanna-kruppe commented May 22, 2025

Ah, one complication with "just freeze the bits loaded from memory" is that LLVM's poison is all-or-nothing for scalar types (for good reasons). There's no such thing as "i32 where only the lower bits are poison", so freeze doesn't let you rewrite the comparison of two i16s as "load i32, freeze, compare i32". You'd need to either express it in terms of <2 x i16> (each lane can be poison independently) or load and freeze the two i16s individually before comparing them separately or combining them into an i32 and comparing that. Neither option seems great for the rest of the LLVM IR pipeline, nor would it automatically lead to "load 32 bits and compare" codegen (maybe there's some peephole optimization very late in the backend that handles it well, but it's not obviously easier than matching the current freeze-free LLVM IR).

@adrian17

adrian17 commented May 22, 2025

could we instead teach rust to do a peephole optimization in MIR to coalesce comparing two u16 tuples into the u32?

For the record, LLVM already has this; see the docs for the MergeICmps pass:

// Example:
//
//  struct S {
//    int a;
//    char b;
//    char c;
//    uint16_t d;
//    bool operator==(const S& o) const {
//      return a == o.a && b == o.b && c == o.c && d == o.d;
//    }
//  };
//
//  Is optimized as :
//
//    bool S::operator==(const S& o) const {
//      return memcmp(this, &o, 8) == 0;
//    }
//
//  Which will later be expanded (ExpandMemCmp) as a single 8-bytes icmp.

But again, I don't know why this doesn't trigger here. Could it by any chance be related to #83623?

One interesting aspect to this, is that even if y is undef, in the actual machine code the result of the equality comparison will be the same (since real CPU architectures don't have undef)

Even if true, this could only be done late, during codegen passes (as optimizing the IR to a more-UB form could cause the modified IR to be inlined and start actually breaking stuff). Meanwhile, in this specific case the vectorizer happened to optimize the comparisons first, just into a different, slower form (hence pcmpeqw), and I guess later passes couldn't recover it back into a simpler, optimizable form?
But again, even when I passed pre-vectorizer IR to the MergeICmps pass, it couldn't do it; I don't know why :(

@appujee

appujee commented May 22, 2025

could we instead teach rust to do a peephole optimization in MIR to coalesce comparing two u16 tuples into the u32?

For the record, LLVM already has this, see the docs for MergeIcmps pass:

Yes, here is a repro: https://godbolt.org/z/8j71anzTW

@meithecatte
Contributor

One interesting aspect to this, is that even if y is undef, in the actual machine code the result of the equality comparison will be the same (since real CPU architectures don't have undef)

Even if true, this could only be done late during codegen passes (as optimizing the function too early could cause the modified IR to be inlined and start actually breaking stuff).

I haven't properly thought through all this but it would appear that it'd be enough to optimize a == b && c == d into freeze(ac == bd)?

@hanna-kruppe
Contributor

No, freeze(ac == bd) is freeze(poison) if any of a, b, c, or d is poison.

@hanna-kruppe
Contributor

For the record, LLVM already has this, see the docs for MergeIcmps pass:

Yes, here is a repro: https://godbolt.org/z/8j71anzTW

If you look at the opt pipeline for these functions, you'll see that these examples don't trigger MergeICmps either. The transformations are mostly done by InstCombine and are only valid because the structs are passed to the function as i32/i64 values, so if any of the struct fields are poison to begin with, the whole function returns poison even before optimizations. (undef is bitwise, so it's not a concern either.)

MergeICmps is indeed aimed at exactly this pattern, but there's a bunch of open issues for it (and also for the memcmp expansion pass that it relies on). Its pattern matching seems to be pretty narrow currently, and there are correctness issues as well (also related to having to freeze the loaded bytes before comparing!). But if those problems get fixed, normalizing such comparisons to memcmp is a neat canonicalization that's easier to pattern-match into integer loads and compares later.

@Diggsey
Contributor

Diggsey commented May 22, 2025

I don't see any reason why this couldn't be optimized at the LLVM level. From rust-lang/unsafe-code-guidelines#346:

So the only potential optimization lost is speculative reads. However, we already justify that references must be dereferencable by the borrow validity rules, so it is perfectly fine to speculatively read memory from behind a reference. It is even valid to make decisions based on the value before it is semantically guaranteed to be read by the source program, so long as the speculative execution can deal with speculation being driven by uninit (e.g. by freezeing it to an arbitrary-but-consistent noundef byte value).

I don't think explicit freezing is even needed on the Rust side: the short-circuit behaviour of && means there are two possibilities:

  1. The first fields of each value are equal. In this case, the second fields are read via a typed copy which is UB if the second fields are uninit, so we can assume the second fields are not uninit, and read the entire value as one u32.

  2. The first fields of each value are not equal. Logically the second fields are never read, but nothing is stopping LLVM from speculatively loading these fields as part of a u32 even if they are uninit since we already guarantee that the reference to the entire value is at least dereferenceable.

@hanna-kruppe
Contributor

After reviewing the LLVM issues I linked in my last comment more carefully:

  1. The right combination of existing LLVM passes can optimize this kind of pattern, at least in principle. I haven't managed to trigger MergeICmps on the IR that rustc generates, but I believe that's just due to the very specific pattern matching that pass currently does, not a fundamental issue.
  2. However, that's actually a miscompilation, as alive2 shows (see the discussions in "MergeICmps reorders comparisons and introduces UB" (llvm/llvm-project#51187) and "[MergeICmps] Correctness related to eq-chain comparisons" (llvm/llvm-project#62459)). The problem is indeed, as I rediscovered during my flailing above, that poison infects the wider load we'd like to get. Unsurprisingly, @nikic described this problem and the possible solutions many years ago:

we can't actually fix this right now, because this needs the notion of a frozen load. Unless we want to expand into vector loads and comparisons (which is known to have very bad codegen)

So yeah, I was wrong in my earlier comments in this issue: LLVM could totally do this without any change to rustc's output, or even to Rust's UB rules. But current LLVM can't do it soundly, and fixing that would probably require a new LLVM IR construct that does not exist today; the existing "freeze on SSA values" is not enough. (Or I guess someone could "just" fix all the cases of vector loads and comparisons having bad codegen for this use case 🙃)

@nikic
Contributor

nikic commented May 22, 2025

Part of the problem here is that de facto memory cannot contain poison (making these transforms legal), but de jure it can (because people would like poison in memory in the future). Which gives us the awkward situation where you can't actually put poison in memory without miscompiles, while at the same time we also can't exploit the absence of poison for optimization purposes :)

@VorpalBlade

Part of the problem here is that de facto memory cannot contain poison (making these transforms legal), but de jure it can (because people would like poison in memory in the future)

I'm curious as to how that would work: would each byte have 257 possible bit patterns? Presumably you would store this separately as a validity bitmask (similar to how sanitizers, valgrind, etc. work).

But this would also break common optimised assembly implementations of strlen, memcmp, etc., where you do full aligned loads into SIMD registers and deal with "read past the end" in clever ways. So I'm sceptical of people wanting to break backwards compatibility like that, at least on x86-64¹. And LLVM should strive to generate the best possible code for each ISA.

But following the specifics of the architecture manual trumps "there is no implementation that does that", fair enough. (I assume that when you said "de jure" you have such a specific thing to point to? Because otherwise it is not really "de jure".)

But thinking about this from another angle: the Rust type system should be rich enough to allow optimising this anyway: a &T where T contains no MaybeUninit and is valid for all bit patterns should allow for this. Are there some missing annotations for telling LLVM about this?

Footnotes

  1. Apple or Google on ARM? Anything could happen, they care much less about breaking backward compatibility.

@Diggsey
Contributor

Diggsey commented May 22, 2025

This LLVM issue also seems very relevant: llvm/llvm-project#52930
