Conversation

@usamoi (Contributor) commented Jan 1, 2026

@folkertdev (Contributor)

Can you add something like this:

    /// This is a trick used in the adler32 algorithm to get a widening addition. The
    /// multiplication by 1 is trivial, but must not be optimized out because then the vpmaddwd
    /// instruction is no longer selected.
    #[target_feature(enable = "avx2")]
    #[cfg_attr(test, assert_instr(vpmaddwd))]
    unsafe fn _mm256_madd_epi16_mul_one(mad: __m256i) -> __m256i {
        let one_v = _mm256_set1_epi16(1);
        _mm256_madd_epi16(mad, one_v)
    }

So that we remember in the future why we can't just use that generic implementation?

@usamoi (Contributor Author) commented Jan 1, 2026

I have added the following comments:

// It's a trick used in the Adler-32 algorithm to perform a widening addition.
//
// ```rust
// #[target_feature(enable = "sse2")]
// unsafe fn widening_add(mad: __m128i) -> __m128i {
//     _mm_madd_epi16(mad, _mm_set1_epi16(1))
// }
// ```
//
// If we implement this using generic vector intrinsics, the optimizer
// will eliminate this pattern, and `pmaddwd` will no longer be emitted.
// For this reason, we use x86 intrinsics.

Comment on lines -213 to 227

Removed:

    #[rustc_const_unstable(feature = "stdarch_const_x86", issue = "149298")]
    pub const fn _mm_madd_epi16(a: __m128i, b: __m128i) -> __m128i {
        unsafe {
            let r: i32x8 = simd_mul(simd_cast(a.as_i16x8()), simd_cast(b.as_i16x8()));
            let even: i32x4 = simd_shuffle!(r, r, [0, 2, 4, 6]);
            let odd: i32x4 = simd_shuffle!(r, r, [1, 3, 5, 7]);
            simd_add(even, odd).as_m128i()
        }
    }

Added:

    pub fn _mm_madd_epi16(a: __m128i, b: __m128i) -> __m128i {
        // It's a trick used in the Adler-32 algorithm to perform a widening addition.
        //
        // ```rust
        // #[target_feature(enable = "sse2")]
        // unsafe fn widening_add(mad: __m128i) -> __m128i {
        //     _mm_madd_epi16(mad, _mm_set1_epi16(1))
        // }
        // ```
        //
        // If we implement this using generic vector intrinsics, the optimizer
        // will eliminate this pattern, and `pmaddwd` will no longer be emitted.
        // For this reason, we use x86 intrinsics.
        unsafe { transmute(pmaddwd(a.as_i16x8(), b.as_i16x8())) }
    }
(Contributor)

We could consider using https://doc.rust-lang.org/std/intrinsics/fn.const_eval_select.html so we don't lose all of the const stuff. Up to @sayantn though, I don't have full context on what we'd like to be const fn currently.

(Contributor Author)

#[target_feature] functions do not implement the Fn traits, while const_eval_select requires its function arguments to be FnOnce, so this does not seem feasible.

(Contributor)

We can still do it: the inner function that will be invoked doesn't need to have the correct target features if it is marked #[inline] (#[inline(always)] will probably be better); see Godbolt. But I don't think it is that important to make this const right now. A better approach would be to fix the LLVM bug and make this truly const in the future, but this should be enough for the patch.
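For concreteness, a rough sketch of that shape (assuming the items from the surrounding stdarch source, i.e. the `pmaddwd` extern declaration, the `simd_*` intrinsics, and `transmute`, are in scope; `madd_const` and `madd_rt` are illustrative names, and the usual `#[inline]` / `#[target_feature]` / stability attributes on the outer function are omitted):

    pub const fn _mm_madd_epi16(a: __m128i, b: __m128i) -> __m128i {
        // Compile-time path: the generic-vector implementation; it is only ever
        // interpreted by the const evaluator, so instruction selection is irrelevant.
        const fn madd_const(a: __m128i, b: __m128i) -> __m128i {
            unsafe {
                let r: i32x8 = simd_mul(simd_cast(a.as_i16x8()), simd_cast(b.as_i16x8()));
                let even: i32x4 = simd_shuffle!(r, r, [0, 2, 4, 6]);
                let odd: i32x4 = simd_shuffle!(r, r, [1, 3, 5, 7]);
                simd_add(even, odd).as_m128i()
            }
        }

        // Runtime path: no #[target_feature] here, so the fn item implements FnOnce;
        // #[inline(always)] inlines it into the sse2-enabled caller.
        #[inline(always)]
        fn madd_rt(a: __m128i, b: __m128i) -> __m128i {
            unsafe { transmute(pmaddwd(a.as_i16x8(), b.as_i16x8())) }
        }

        core::intrinsics::const_eval_select((a, b), madd_const, madd_rt)
    }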

@sayantn added this pull request to the merge queue Jan 2, 2026
@sayantn (Contributor) commented Jan 2, 2026

The Miri test suite might start failing now, since the implementations of these intrinsics were removed. cc @RalfJung

Merged via the queue into rust-lang:main with commit 7de6f04 Jan 2, 2026
75 checks passed
RKSimon pushed a commit to llvm/llvm-project that referenced this pull request Jan 6, 2026
The `_mm256_madd_epi16` intrinsic first performs a pointwise widening
multiplication and then adds adjacent elements. In SIMD versions of the
adler32 checksum algorithm, a trivial multiplication by an all-ones
vector is used to get just the widening-and-addition behavior.

In the Rust standard library, we like to implement intrinsics in terms
of simpler building blocks, so that all backends can implement a small
set of primitives instead of supporting all of LLVM's intrinsics. When
we try that for `_mm256_madd_epi16` in isolation it works, but when one
of the arguments is an all-ones vector, the multiplication is optimized
out long before the `vpmaddwd` instruction can be selected.

This PR recognizes the widening adjacent-addition pattern that adler32
uses directly and manually inserts a trivial multiplication by an
all-ones vector. Experimentally, this optimization increases adler32
throughput from 41 GB/s to 67 GB/s
(rust-lang/rust#150560 (comment))

cc rust-lang/stdarch#1985
cc rust-lang/rust#150560
higher-performance pushed a commit to higher-performance/llvm-project that referenced this pull request Jan 6, 2026
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Jan 6, 2026
@folkertdev (Contributor)

> The Miri test suite might start failing now, since the implementations of these intrinsics were removed.

Do you happen to know where that was changed? It turns out zlib-rs uses

    #[link_name = "llvm.x86.avx512.pmaddw.d.512"]
    fn vpmaddwd(a: i16x32, b: i16x32) -> i32x16;

So our CI started failing now that Miri is no longer able to execute that intrinsic.

@usamoi (Contributor Author) commented Jan 7, 2026

> The Miri test suite might start failing now, since the implementations of these intrinsics were removed.
>
> Do you happen to know where that was changed? It turns out zlib-rs uses
>
>     #[link_name = "llvm.x86.avx512.pmaddw.d.512"]
>     fn vpmaddwd(a: i16x32, b: i16x32) -> i32x16;
>
> So our CI started failing now that Miri is no longer able to execute that intrinsic.

It's rust-lang/rust@8d597aa; see rust-lang/rust#150639 (comment). It was reverted in rust-lang/rust@5e4168b, but the AVX512 variant was never implemented.

@usamoi deleted the vpmaddwd branch January 7, 2026 15:37
@RalfJung (Member) commented Jan 7, 2026

Yeah, rust-lang/rust@5e4168b fixed this for every case that was ever supported by Miri. Maybe the avx512 variant was accidentally supported and we didn't even realize.^^

@folkertdev (Contributor)

Right, Miri would have been able to execute the implementation that was removed by this PR, and we happened to rely on that. I'll see if I can add support for the avx512 variant to Miri to tide us over.

navaneethshan pushed a commit to qualcomm/cpullvm-toolchain that referenced this pull request Jan 8, 2026