Weird codegen with avx512 feature and masks > 8 lanes #312

Open
@jhorstmann

I'm investigating performance differences between packed_simd and this crate. My assumption was that the separate bitmask representation used for avx512 could lead to better performance, but instead I see some rather weird assembly for mask handling. This seems to be specific to masks with more than 8 lanes. The following is a reduced example. The same behavior can also be seen with other comparison operations, such as simd_eq instead of is_nan.

#[inline(never)]
fn nan_bitmask_16(data: &[f32; 16]) -> u16 {
    let chunk = f32x16::from_slice(data);

    let is_nan = chunk.is_nan();
    is_nan.to_bitmask()
}

#[inline(never)]
fn nan_bitmask_8(data: &[f32; 8]) -> u8 {
    let chunk = f32x8::from_slice(data);

    let is_nan = chunk.is_nan();
    is_nan.to_bitmask()
}
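
For checking the results independently of the SIMD codegen, a scalar reference can be handy. The following is only a sketch: the name `nan_bitmask_scalar` is hypothetical and not part of portable_simd, and it assumes that bit `i` of the bitmask corresponds to lane `i`, which is how I understand `to_bitmask` to behave.

```rust
/// Hypothetical scalar reference, not part of portable_simd.
/// Assumption: bit `i` of the result corresponds to lane `i`.
fn nan_bitmask_scalar(data: &[f32; 16]) -> u16 {
    let mut mask = 0u16;
    for (i, x) in data.iter().enumerate() {
        if x.is_nan() {
            // Set the bit for this lane.
            mask |= 1 << i;
        }
    }
    mask
}
```

Comparing its result against `nan_bitmask_16` on the same input should show whether the odd codegen affects only performance or also the result.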

The generated code for 8 lanes looks very good:

vxorps xmm0,xmm0,xmm0
vcmpunordps k0,ymm0,YMMWORD PTR [rdi]
kmovd  eax,k0
vzeroupper 
ret

But for 16 lanes something strange is happening:

push   rax
vxorps xmm0,xmm0,xmm0
vcmpunordps k0,zmm0,ZMMWORD PTR [rdi]
kmovw  WORD PTR [rsp],k0
kmovd  eax,k0
movzx  ecx,BYTE PTR [rsp+0x1]
shl    ecx,0x8
movzx  eax,al
or     eax,ecx
pop    rcx
vzeroupper 
ret 

The mask seems to get spilled to the stack, the high byte gets reloaded, and then both halves get combined back into a 16-bit word.
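
To make it clearer why this sequence looks redundant, here is a rough scalar transcription of the instructions after the compare. This is my reading of the listing, not anything taken from the compiler, and the function name `recombine` is made up for illustration.

```rust
/// Hypothetical transcription of the scalar tail of the 16-lane listing.
/// `mask` stands in for the contents of k0 after vcmpunordps.
fn recombine(mask: u16) -> u16 {
    let spilled = mask.to_le_bytes();   // kmovw WORD PTR [rsp],k0
    let high = spilled[1] as u32;       // movzx ecx,BYTE PTR [rsp+0x1]
    let low = (mask as u32) & 0xff;     // kmovd eax,k0 ; movzx eax,al
    ((high << 8) | low) as u16          // shl ecx,0x8 ; or eax,ecx
}
```

If this transcription is right, the function returns its input unchanged for every possible mask, so the spill, reload and recombination do no useful work compared to the single kmovd in the 8-lane version.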

This was observed using portable_simd as a crate (actually as an example binary inside a repo checkout), as that made it easier to simplify the example. I observed similar code in a bigger example using the version from the standard library with -Zbuild-std.

Meta

$ rustc +nightly --version
rustc 1.66.0-nightly (81f391930 2022-10-09)

portable_simd commit aad8f0a


Labels: C-bug (Category: Bug), I-heavy (Impact: bad code size)
