-
Notifications
You must be signed in to change notification settings - Fork 15
polyval, ghash: use powers of H to process multiple blocks at a time #225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Exciting work! The parallelism would be prospectively quite helpful for RustCrypto/AEADs#74 It'd be great if you could open a PR. Just looking over your code there are great code comments which are much better than what we currently have. |
Arti would benefit from this a lot. Would it be okay if I worked on making @ericlargergren's code into a PR? (It is under a 3BSD license.) (Alternatively I could try to do a clean-room implementation of the basic technique, which would take more time.) |
You're more than welcome to use my code. I am happy to relicense it as needed. I've been meaning to send a PR myself, but I just haven't had the time yet. |
This MR adapts code from @ericlagergren with permission (see #225) to speed up `polyval` and `ghash` using instruction level parallelism. I expect there are still some performance gains to be had, but this is already fairly well improved: I'm seeing ~~4.3x~~ 7.5x improvement for large inputs on my Alder Lake x86_64 CPU, and 10x improvement on my Apple M1 ARM CPU. Because this pipelining is not always appropriate (notably, when you know that you will only be handling small inputs), I've also imitated polyval-rs by making the number of blocks to parallelize a generic parameter. I experimented with manually unrolling the loops here, but I didn't see a performance gain. --------- Co-authored-by: Tony Arcieri <[email protected]>
Note: I'd still invite someone to adapt the soft backend from https://github.com/ericlagergren/polyval-rs/ There are possible refactoring improvements from that implementation (e.g. extracting a proper |
We can compute
h = [H^n, H^(n-1), ..., H]
and then process N blocks at a time. On a 2020 M1, a stride of 8 runs at about 0.17 cycles per byte whereas a stride of 1 runs at about 1.4 cycles per byte—an ~8x improvement.Note that sometimes this isn't desirable. For example, HCTR-2 computes POLYVAL over single blocks and the overhead of constructing plus cleaning up an N-wide POLYVAL hurt performance. So, we probably need to offer both a "wide" and a "lite" implementation.
I'm happy to donate my implementation (x86 and aarch64).
The text was updated successfully, but these errors were encountered: