## Description
I'm investigating performance differences between `packed_simd` and this crate. My assumption was that the separate bitmask representation used for AVX-512 could lead to better performance, but instead I see some rather odd assembly for mask handling. This seems to be specific to masks with more than 8 lanes. The following is a reduced example; the same behavior can also be seen with other comparison operations, such as `simd_eq` instead of `is_nan`.
```rust
use std::simd::{f32x8, f32x16, SimdFloat, ToBitMask};

#[inline(never)]
fn nan_bitmask_16(data: &[f32; 16]) -> u16 {
    let chunk = f32x16::from_slice(data);
    let is_nan = chunk.is_nan();
    is_nan.to_bitmask()
}

#[inline(never)]
fn nan_bitmask_8(data: &[f32; 8]) -> u8 {
    let chunk = f32x8::from_slice(data);
    let is_nan = chunk.is_nan();
    is_nan.to_bitmask()
}
```
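For reference, the expected bitmask semantics can be sketched in plain stable Rust without `std::simd`; this is a hypothetical scalar equivalent (the helper name `nan_bitmask_scalar` is my own), useful for checking what either codegen should compute:

```rust
/// Scalar reference: bit i of the result is set iff data[i] is NaN.
fn nan_bitmask_scalar(data: &[f32; 16]) -> u16 {
    data.iter()
        .enumerate()
        .fold(0u16, |mask, (i, x)| mask | ((x.is_nan() as u16) << i))
}

fn main() {
    let mut data = [0.0f32; 16];
    data[3] = f32::NAN;
    data[9] = f32::NAN;
    // Bits 3 and 9 set, all others clear.
    assert_eq!(nan_bitmask_scalar(&data), (1 << 3) | (1 << 9));
}
```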
The generated code for 8 lanes looks very good:
```asm
vxorps xmm0,xmm0,xmm0
vcmpunordps k0,ymm0,YMMWORD PTR [rdi]
kmovd eax,k0
vzeroupper
ret
```
But for 16 lanes something strange is happening:
```asm
push rax
vxorps xmm0,xmm0,xmm0
vcmpunordps k0,zmm0,ZMMWORD PTR [rdi]
kmovw WORD PTR [rsp],k0
kmovd eax,k0
movzx ecx,BYTE PTR [rsp+0x1]
shl ecx,0x8
movzx eax,al
or eax,ecx
pop rcx
vzeroupper
ret
```
The mask appears to be spilled to the stack, the high byte is reloaded, and then both halves are recombined into a 16-bit word, even though the `kmovd` already placed the full mask in `eax`.
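In other words, the extra instructions re-derive the 16-bit result from its two bytes. The recombination the assembly performs is equivalent to this sketch (the helper name `combine_halves` is hypothetical, purely for illustration):

```rust
/// What the movzx/shl/or sequence computes: low byte in bits 0..8,
/// high byte (reloaded from the stack spill) in bits 8..16.
fn combine_halves(lo: u8, hi: u8) -> u16 {
    (lo as u16) | ((hi as u16) << 8)
}

fn main() {
    assert_eq!(combine_halves(0xCD, 0xAB), 0xABCD);
}
```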
This was observed using `portable_simd` as a crate (actually as an example binary inside a repo checkout), since that made it easier to reduce the example. I observed similar code in a larger example using the version from the standard library with `-Zbuild-std`.
## Meta
```
$ rustc +nightly --version
rustc 1.66.0-nightly (81f391930 2022-10-09)
```

`portable_simd` commit aad8f0a