polyval: implement Karatsuba multiplication for arm64 #181
Improves performance by ~200 MB/s on a 2020 M1.

Signed-off-by: Eric Lagergren <[email protected]>
The code is taken from https://github.com/ericlagergren/polyval-rs/tree/dev, which also has "wide" implementations (8 blocks at a time) that perform significantly better (~0.17 cycles per byte instead of ~2).
I also have an x86 version I can submit if you'd like.
Parallel and x86 versions would be appreciated, although perhaps as separate PRs to ease review.
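For readers unfamiliar with why processing 8 blocks at a time helps: a serial Horner-style update has each multiplication depend on the previous one, whereas with precomputed powers of the key the eight products are independent and can be issued in parallel. The following is only a minimal sketch of that identity, not the code from this PR or from polyval-rs. All names (`gf_mul`, `horner`, `powers`, `aggregated`) are hypothetical, the multiplication is a slow bit-by-bit one that reduces after every product, and it omits the extra x^-128 factor that POLYVAL's field operation includes.

```rust
/// Low 128 bits of the reduction polynomial x^128 + x^127 + x^126 + x^121 + 1
/// (bit i holds the coefficient of x^i). Assumption: plain modular
/// multiplication, without POLYVAL's x^-128 factor.
const POLY_LOW: u128 = (1 << 127) | (1 << 126) | (1 << 121) | 1;

/// Slow, bit-by-bit multiplication in GF(2^128). Hypothetical helper.
fn gf_mul(a: u128, b: u128) -> u128 {
    let mut acc = 0u128;
    let mut a = a;
    for i in 0..128 {
        if (b >> i) & 1 == 1 {
            acc ^= a;
        }
        // Multiply `a` by x and reduce if a term of degree 128 appears.
        let carry = a >> 127;
        a <<= 1;
        if carry == 1 {
            a ^= POLY_LOW;
        }
    }
    acc
}

/// Serial update: every step waits for the previous multiplication.
fn horner(y: u128, h: u128, blocks: &[u128]) -> u128 {
    blocks.iter().fold(y, |acc, &b| gf_mul(acc ^ b, h))
}

/// Precompute h^1..=h^8, stored as pow[i] = h^(i+1).
fn powers(h: u128) -> [u128; 8] {
    let mut pow = [h; 8];
    for i in 1..8 {
        pow[i] = gf_mul(pow[i - 1], h);
    }
    pow
}

/// "Wide" update: eight independent products, XORed together.
fn aggregated(y: u128, pow: &[u128; 8], blocks: &[u128; 8]) -> u128 {
    let mut acc = gf_mul(y ^ blocks[0], pow[7]);
    for i in 1..8 {
        acc ^= gf_mul(blocks[i], pow[7 - i]);
    }
    acc
}
```

Both functions compute the same value for any 8 blocks. An optimized implementation would typically also defer the reduction until after the products are summed, which this sketch does not attempt.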
Signed-off-by: Eric Lagergren <[email protected]>
Tested locally on an M2 Max, where I observed the reported speedups.
Percentage-wise it's about a 17% speedup.
Actually, your x86 implementation only uses 3 clmul instructions, so I don't think the serial version can be improved much. I'll look at adding parallel implementations. Offhand, do you know if the current API supports it? The input probably needs to be in one contiguous buffer. (Maybe not?) But that's the common case, at least for stuff like non-interleaved AES-GCM-SIV or HCTR2.
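For context on the "3 clmul instructions" remark: a 128x128-bit carry-less product needs four 64x64-bit multiplications when done schoolbook-style, and Karatsuba brings that down to three at the cost of a few XORs. Below is a minimal, portable sketch of the idea. It is not the arm64 code in this PR: `clmul64` is a hypothetical software stand-in for a hardware instruction such as PMULL or PCLMULQDQ, and the other function names are likewise illustrative.

```rust
/// Carry-less 64x64 -> 128-bit multiplication, bit by bit.
/// Hypothetical stand-in for a hardware carry-less multiply.
fn clmul64(a: u64, b: u64) -> u128 {
    let mut acc = 0u128;
    for i in 0..64 {
        if (b >> i) & 1 == 1 {
            acc ^= (a as u128) << i;
        }
    }
    acc
}

/// Schoolbook 128x128 -> 256-bit product using four multiplications.
/// Returns (high 128 bits, low 128 bits).
fn clmul128_schoolbook(a: u128, b: u128) -> (u128, u128) {
    let (a0, a1) = (a as u64, (a >> 64) as u64);
    let (b0, b1) = (b as u64, (b >> 64) as u64);
    let lo = clmul64(a0, b0);
    let hi = clmul64(a1, b1);
    let mid = clmul64(a0, b1) ^ clmul64(a1, b0);
    (hi ^ (mid >> 64), lo ^ (mid << 64))
}

/// Karatsuba 128x128 -> 256-bit product using three multiplications.
/// In GF(2)[x], (a0 + a1)(b0 + b1) + a0*b0 + a1*b1 = a0*b1 + a1*b0,
/// where "+" is XOR, so the middle term costs a single multiplication.
fn clmul128_karatsuba(a: u128, b: u128) -> (u128, u128) {
    let (a0, a1) = (a as u64, (a >> 64) as u64);
    let (b0, b1) = (b as u64, (b >> 64) as u64);
    let lo = clmul64(a0, b0);
    let hi = clmul64(a1, b1);
    let mid = clmul64(a0 ^ a1, b0 ^ b1) ^ lo ^ hi;
    (hi ^ (mid >> 64), lo ^ (mid << 64))
}
```

Both functions return the same unreduced 256-bit product. The reduction modulo the field polynomial is a separate step that this sketch leaves out.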
Take a look at |
Added
- add `new_with_init_block` (RustCrypto#195)

Changed
- implement Karatsuba multiplication for arm64 (RustCrypto#181)