BitStream Buffer implementation discussion (u128 or not) #7

zommiommy · 2023-04-03T16:40:26Z

zommiommy
Apr 3, 2023
Collaborator

In a preliminary experiment (https://godbolt.org/z/nMeMEzszK), u128 is not optimized as well as it could be by Rust / LLVM, even when compiling with the arguments -C opt-level=3 -C target-cpu=skylake.

Example

pub fn test(x: u128) -> u128 {
    x >> 3
}

is compiled to:

example::test:
        mov     rdx, rsi
        mov     rax, rdi
        shrd    rax, rsi, 3 ; Lat: 3 RTP: 1.0 Port: 1*p1
        shr     rdx, 3      ; Lat: 1 RTP: 0.5 Port: 1*p06
        ret
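
As a reading aid, the shrd/shr pair in the listing above is the usual two-word shift: the low half receives the bits that fall out of the high half, and the high half is shifted on its own. A minimal sketch of that decomposition follows; the function names are hypothetical and only shift amounts between 1 and 63 are handled.

```rust
/// Shift a 128-bit value, held as two 64-bit halves, right by `n` bits.
/// Hypothetical helper for illustration; requires 0 < n < 64.
pub fn shr_two_words(lo: u64, hi: u64, n: u32) -> (u64, u64) {
    assert!(n > 0 && n < 64);
    // What `shrd` computes: shift the low word and fill its top bits
    // with the low bits of the high word.
    let new_lo = (lo >> n) | (hi << (64 - n));
    // What `shr` computes: a plain logical shift of the high word.
    let new_hi = hi >> n;
    (new_lo, new_hi)
}

/// Same operation expressed on a u128, to compare against `x >> n`.
pub fn shr_u128(x: u128, n: u32) -> u128 {
    let (lo, hi) = shr_two_words(x as u64, (x >> 64) as u64, n);
    ((hi as u128) << 64) | (lo as u128)
}
```

The two halves do not depend on each other's results, which is why the two instructions can issue in parallel.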

I would instead expect the compiler to use the vpsrldq instruction, as in:

use std::arch::x86_64::{
    __m128i,
    _mm_srli_si128
};

pub unsafe fn test2(x: __m128i) -> __m128i {
    _mm_srli_si128(x, 3)
}

to obtain:

example::test2:
        mov     rax, rdi
        vmovdqa xmm0, xmmword ptr [rsi]
        vpsrldq xmm0, xmm0, 3 ; Lat: 1 RTP: 1.0 Port: 1*p5
        vmovdqa xmmword ptr [rdi], xmm0
        ret
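
One detail worth checking before comparing the two listings: _mm_srli_si128 (psrldq / vpsrldq) shifts by whole bytes, so an immediate of 3 moves the value by 24 bits, whereas x >> 3 on a u128 moves it by 3 bits. The sketch below makes the difference observable. The helper name is hypothetical, the conversion through transmute assumes a little-endian x86_64 target, and the non-x86_64 branch is only a plain-shift stand-in so the code compiles elsewhere.

```rust
/// Apply `_mm_srli_si128` with an immediate of 3 to a u128.
/// Hypothetical helper for illustration only.
#[cfg(target_arch = "x86_64")]
pub fn byte_shift_right_3(x: u128) -> u128 {
    use std::arch::x86_64::{__m128i, _mm_srli_si128};
    // SAFETY: SSE2 is part of the x86_64 baseline, and u128 and __m128i
    // are both 16 bytes wide, so the transmutes are size-compatible.
    unsafe {
        let v: __m128i = std::mem::transmute(x);
        std::mem::transmute(_mm_srli_si128::<3>(v))
    }
}

/// Stand-in for other architectures (an assumption, not an intrinsic):
/// a shift by 3 bytes is a shift by 24 bits.
#[cfg(not(target_arch = "x86_64"))]
pub fn byte_shift_right_3(x: u128) -> u128 {
    x >> 24
}
```

If this holds, test and test2 do not compute the same function, so their cycle counts are not directly comparable.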

The latencies (Lat) and reciprocal throughputs (RTP) annotated in the assembly are taken from https://uops.info for the Skylake architecture.

This missed optimization costs 3 cycles instead of 1: shr and shrd are independent and use different ports, so they execute at the same time and the total is the latency of the slower one. The result is a 3x slowdown of this operation.

This is not due to the compilation target, since vpsrldq is the VEX encoding of an SSE2 instruction, which Skylake supports:

$ rustc --print cfg -C target-cpu=skylake
debug_assertions
panic="unwind"
target_abi=""
target_arch="x86_64"
target_endian="little"
target_env="gnu"
target_family="unix"
target_feature="adx"
target_feature="aes"
target_feature="avx"
target_feature="avx2"
target_feature="bmi1"
target_feature="bmi2"
target_feature="cmpxchg16b"
target_feature="ermsb"
target_feature="f16c"
target_feature="fma"
target_feature="fxsr"
target_feature="llvm14-builtins-abi"
target_feature="lzcnt"
target_feature="movbe"
target_feature="pclmulqdq"
target_feature="popcnt"
target_feature="rdrand"
target_feature="rdseed"
target_feature="sse"
target_feature="sse2" <--------------------
target_feature="sse3"
target_feature="sse4.1"
target_feature="sse4.2"
target_feature="ssse3"
target_feature="xsave"
target_feature="xsavec"
target_feature="xsaveopt"
target_feature="xsaves"
target_has_atomic="16"
target_has_atomic="32"
target_has_atomic="64"
target_has_atomic="8"
target_has_atomic="ptr"
target_has_atomic_equal_alignment="16"
target_has_atomic_equal_alignment="32"
target_has_atomic_equal_alignment="64"
target_has_atomic_equal_alignment="8"
target_has_atomic_equal_alignment="ptr"
target_has_atomic_load_store="16"
target_has_atomic_load_store="32"
target_has_atomic_load_store="64"
target_has_atomic_load_store="8"
target_has_atomic_load_store="ptr"
target_os="linux"
target_pointer_width="64"
target_thread_local
target_vendor="unknown"
unix

We should discuss how to proceed with the implementation.

vigna · 2023-04-03T17:16:37Z

vigna
Apr 3, 2023
Maintainer

I read the message 5 times and I'm pretty sure the first phrase says the opposite of the rest of the message. Please confirm (and then we can go on).

zommiommy · 2023-04-03T17:31:48Z

zommiommy
Apr 3, 2023
Collaborator Author

Yeah, my bad 😅 u128 is not optimized as it should be. I edited the comment.

vigna · 2023-04-03T17:35:14Z

vigna
Apr 3, 2023
Maintainer

OK, I think that:

This appears to be a compiler problem that we should not address in the implementation. At worst, we can file a bug report.
I'm not sure that your computation covers all costs, because moving data in and out of special registers has a cost, too.
In the very worst case we can compare the u32-word vs. the u64-word implementation (the former doesn't need special instructions to do the shifts) and see what we learn.
Your analysis in isolation is correct, but that shift will be in the middle of several other (sometimes independent) operations, and the extra cycles might shrink or vanish.
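
The last point can be probed with a rough timing harness. The sketch below is an illustration under stated assumptions rather than a proper benchmark: mixed_workload and time_workload are hypothetical names, the arithmetic around the shift is arbitrary filler standing in for "other independent operations", and wall-clock timing of a short loop is noisy.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical workload: a u128 shift surrounded by unrelated work.
pub fn mixed_workload(mut buf: u128, iters: u64) -> (u128, u64) {
    let mut acc: u64 = 0;
    for i in 0..iters {
        // The operation under discussion; black_box keeps the compiler
        // from folding the whole loop away.
        buf = black_box(buf) >> 3;
        // Filler: put something back into the upper bits of the buffer.
        buf |= (i as u128) << 100;
        // Filler: independent integer work that can overlap with the shift.
        acc = acc.wrapping_add(i.wrapping_mul(0x9E37_79B9_7F4A_7C15));
    }
    (buf, acc)
}

/// Time `iters` iterations of the workload with the standard clock.
pub fn time_workload(iters: u64) -> Duration {
    let start = Instant::now();
    black_box(mixed_workload(black_box(u128::MAX), iters));
    start.elapsed()
}
```

Comparing time_workload against a variant that keeps the buffer in a u64 would show whether the shift cost is visible at all once other work surrounds it.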

vigna · 2023-04-03T17:36:35Z

vigna
Apr 3, 2023
Maintainer

PS: Your link to ucode brings me to a domain on sale.

zommiommy · 2023-04-03T17:55:05Z

zommiommy
Apr 3, 2023
Collaborator Author

PS: Your link to ucode brings me to a domain on sale.

Sorry, it was uops.info.

zommiommy · 2023-04-03T18:15:22Z

zommiommy
Apr 3, 2023
Collaborator Author

Indeed, this might not be noticeable at all, and it depends on multiple factors.
My goal is to document the behavior I noticed so that we are aware of it and can study it further if needed.

zacchiro · 2023-04-03T18:28:06Z

zacchiro
Apr 3, 2023
Collaborator

So, bottom line: we can start with Rust u128 integers for the bit buffer, and we'll reassess later if/when it becomes a performance problem.
Meanwhile, we can report this as a compiler (suboptimality) issue and see if it gets improved.
(Although I suspect having an actual use case/benchmark later on in our project will help make the case that it should be optimized better.)
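
To make the starting point concrete, here is a minimal sketch of a bit reader that keeps its pending bits in a u128. It is not the project's implementation: the BitReader type, its field names, the LSB-first bit order, and the u64 backing words are all assumptions chosen for illustration.

```rust
/// Hypothetical LSB-first bit reader over u64 words with a u128 buffer.
pub struct BitReader<'a> {
    words: std::slice::Iter<'a, u64>,
    /// Pending bits, stored in the low `valid` bits.
    buffer: u128,
    /// Number of pending bits in `buffer` (always below 64 between refills).
    valid: u32,
}

impl<'a> BitReader<'a> {
    pub fn new(words: &'a [u64]) -> Self {
        Self { words: words.iter(), buffer: 0, valid: 0 }
    }

    /// Read the next `n` bits (n <= 64), or None when the input runs out.
    pub fn read_bits(&mut self, n: u32) -> Option<u64> {
        assert!(n <= 64);
        if self.valid < n {
            // Refill: a whole word always fits, since valid < 64 here.
            let word = *self.words.next()?;
            self.buffer |= (word as u128) << self.valid;
            self.valid += 64;
        }
        let mask = (1u128 << n) - 1;
        let out = (self.buffer & mask) as u64;
        // This is the u128 shift whose code generation is discussed above.
        self.buffer >>= n;
        self.valid -= n;
        Some(out)
    }
}
```

With a 128-bit buffer, a refill never has to split a 64-bit word, which is the main convenience of u128 here; the price is that every read performs a 128-bit shift.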

Shout if I'm getting it wrong!

Thanks @zommiommy for testing/documenting/discussing this here!
