Replies: 7 comments
-
I read the message 5 times and I'm pretty sure the first phrase says the opposite of the rest of the message. Please confirm (and then we can go on).
-
Yeah, my bad 😅 u128 is not optimized as it should be. I edited the comment.
-
OK, I think that:
-
PS: Your link to ucode brings me to a domain on sale.
-
Sorry, it was uops.info
-
Indeed, this might not be noticeable at all, and it depends on multiple factors.
-
So, bottom line: we can start with Rust u128 integers for the bit buffer, and we'll reassess later if/when it becomes a performance problem. Shout if I'm getting it wrong! Thanks @zommiommy for testing/documenting/discussing this here!
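To make the "start with u128" option concrete, here is a minimal sketch of what a u128-backed bit buffer could look like. The type `BitBuffer` and its methods `refill` and `read` are hypothetical names chosen for illustration; they are not taken from the project, and the bit order (lowest bits first) is an assumption.

```rust
/// Hypothetical bit buffer backed by a single u128 (illustrative only).
pub struct BitBuffer {
    bits: u128, // valid bits live in the low `len` positions
    len: u32,   // number of valid bits, at most 128
}

impl BitBuffer {
    pub fn new() -> Self {
        Self { bits: 0, len: 0 }
    }

    /// Append a 64-bit word above the bits already buffered.
    /// Returns false (and changes nothing) if fewer than 64 bits are free.
    pub fn refill(&mut self, word: u64) -> bool {
        if self.len > 64 {
            return false;
        }
        self.bits |= (word as u128) << self.len;
        self.len += 64;
        true
    }

    /// Remove and return the lowest `n` bits (n <= 64),
    /// or None if `n` is too large or too few bits are buffered.
    pub fn read(&mut self, n: u32) -> Option<u64> {
        if n > 64 || n > self.len {
            return None;
        }
        let mask = if n == 64 { u64::MAX } else { (1u64 << n) - 1 };
        let out = (self.bits as u64) & mask;
        self.bits >>= n; // the u128 shift whose codegen is discussed in this thread
        self.len -= n;
        Some(out)
    }
}
```

With this layout, every `read` performs one u128 right shift, so that is the operation to watch if the buffer ever shows up in a profile.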
-
On a preliminary experiment (https://godbolt.org/z/nMeMEzszK), u128 is not optimized correctly by Rust / LLVM, even when compiling with the arguments -C opt-level=3 -C target-cpu=skylake.
Example:
is compiled to:
While I'd expect the compiler to exploit the vpsrldq instruction like:
to obtain:
The latencies and reciprocal throughputs I annotated in the assembly are taken from https://uops.info for the Skylake architecture.
This missed optimization costs 3 cycles instead of 1 (shr and shrd are independent and use different ports, so they execute at the same time), resulting in a 3x slowdown of this operation.
This is not due to the compilation target, since vpsrldq (the VEX-encoded form of the SSE2 instruction psrldq) is supported by Skylake:
We should discuss how to proceed with the implementation.
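The original snippets are not reproduced above, so here is a minimal sketch of the kind of operation being discussed, assuming it is a right shift of a u128 by fewer than 64 bits. The function names `shr_u128` and `shr_halves` are made up for illustration. The second function spells out, over two u64 halves, roughly what the shrd + shr pair computes. Note that psrldq / vpsrldq shift by whole bytes, so they could only stand in for shift amounts that are a multiple of 8 bits.

```rust
/// Right shift on the native u128 type.
pub fn shr_u128(x: u128, n: u32) -> u128 {
    x >> n
}

/// The same shift over two u64 halves. Assumes 0 < n < 64.
/// Returns (new_hi, new_lo).
pub fn shr_halves(hi: u64, lo: u64, n: u32) -> (u64, u64) {
    debug_assert!(n > 0 && n < 64);
    // Low half: its own bits shifted down, plus the bits that
    // fall out of the high half (the job of shrd).
    let new_lo = (lo >> n) | (hi << (64 - n));
    // High half: a plain logical shift (the job of shr).
    let new_hi = hi >> n;
    (new_hi, new_lo)
}
```

Splitting a value into halves, calling `shr_halves`, and recombining the result should match `shr_u128` for any shift amount between 1 and 63.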