Description
(Issue loosely owned by @wesleywiser and @pnkfelix, monitoring llvm/llvm-project#57476.)
Original description below.
pub fn f(arr: [u64; 2]) -> u32 {
    arr.into_iter().map(u64::count_ones).sum()
}
Before 1.62.0, this code correctly compiled to two popcounts and an addition on a modern x86-64 target.
example::f:
popcnt rcx, qword ptr [rdi]
popcnt rax, qword ptr [rdi + 8]
add eax, ecx
ret
Since 1.62.0 (up to the latest nightly), the codegen is... baffling at best.
.LCPI0_0:
.zero 16,15
.LCPI0_1:
.byte 0
.byte 1
.byte 1
.byte 2
.byte 1
.byte 2
.byte 2
.byte 3
.byte 1
.byte 2
.byte 2
.byte 3
.byte 2
.byte 3
.byte 3
.byte 4
example::f:
sub rsp, 40
vmovups xmm0, xmmword ptr [rdi]
vmovdqa xmm1, xmmword ptr [rip + .LCPI0_0]
vmovdqa xmm3, xmmword ptr [rip + .LCPI0_1]
vmovaps xmmword ptr [rsp], xmm0
vmovdqa xmm0, xmmword ptr [rsp]
vpand xmm2, xmm0, xmm1
vpsrlw xmm0, xmm0, 4
vpand xmm0, xmm0, xmm1
vpshufb xmm2, xmm3, xmm2
vpxor xmm1, xmm1, xmm1
vpshufb xmm0, xmm3, xmm0
vpaddb xmm0, xmm0, xmm2
vpsadbw xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 170
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
add rsp, 40
ret
The assembly for the original function is now a terribly misguided autovectorization. Just to make sure (even though it's pretty obvious), I ran a benchmark: the autovectorized function is ~8x slower on my Zen 2 system.
Calling that function from a different function brings back normal assembly. -Cno-vectorize-slp does nothing. I don't know exactly what -Cno-vectorize-loops does, but it's not good.
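For illustration, a minimal sketch of the passthrough setup described above; the wrapper name g is an assumption, not something from the original report.

pub fn f(arr: [u64; 2]) -> u32 {
    arr.into_iter().map(u64::count_ones).sum()
}

// Hypothetical wrapper: calling f through another function was reported
// to bring back the expected scalar popcnt codegen.
pub fn g(arr: [u64; 2]) -> u32 {
    f(arr)
}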
If you change the length of the array to 4, both functions get autovectorized. -Cno-vectorize-slp now fixes the second function. Adding -Cno-vectorize-loops causes the passthrough function to generate the worst assembly.
Changing into_iter to iter fixes length 2, but doesn't fix length 4.
I could go on, but in short it's a whole mess.
I found a workaround that consistently works for all lengths: iter and -Cno-vectorize-slp.
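A minimal sketch of that workaround, assuming the flag is passed on the rustc command line (the exact invocation is not from the original report):

// Workaround: iterate by reference instead of by value.
pub fn f(arr: [u64; 2]) -> u32 {
    arr.iter().map(|&x| x.count_ones()).sum()
}

Compile with the SLP vectorizer disabled, e.g. rustc -O -C no-vectorize-slp lib.rs.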
@rustbot modify labels: +regression-from-stable-to-stable -regression-untriaged +A-array +A-codegen +A-iterators +A-LLVM +A-simd +I-slow +O-x86_64 +perf-regression
Activity
apiraino commented on Aug 31, 2022
WG-prioritization assigning priority (Zulip discussion). IIUC this and the related issues seem to be caused by an LLVM regression (see comment).
@rustbot label -I-prioritize +P-high t-compiler
nikic commented on Aug 31, 2022
Reduced example:
The cost model for znver2 says that ctpop.i64 costs 1 and ctpop.v2i64 costs 3, which is why the vectorization is considered profitable.
nikic commented on Aug 31, 2022
Upstream issue: llvm/llvm-project#57476
Alternatively, this would also be fixed if we managed to unroll the loop early (during full unroll rather than runtime unroll). That's probably where the into_iter distinction comes from.
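For illustration, a hand-unrolled equivalent of the original function, roughly what early full unrolling would leave behind (this exact form is not from the thread):

// Manually unrolled: no loop remains for the loop vectorizer to transform.
pub fn f_unrolled(arr: [u64; 2]) -> u32 {
    arr[0].count_ones() + arr[1].count_ones()
}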
nikic commented on Apr 3, 2023
Still an issue with LLVM 16.
nikic commented on Aug 14, 2023
Godbolt: https://godbolt.org/z/MoYTvb9qW
Still an issue with LLVM 17.
qarmin commented on Dec 27, 2023
The nightly version (but not stable or beta) now produces the following output: