Skip to content

polyval, ghash: use powers of H to process multiple blocks at a time #225

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ericlagergren opened this issue Feb 12, 2025 · 4 comments
Open

Comments

@ericlagergren
Copy link
Contributor

We can compute h = [H^n, H^(n-1), ..., H] and then process N blocks at a time. On a 2020 M1, a stride of 8 runs at about 0.17 cycles per byte whereas a stride of 1 runs at about 1.4 cycles per byte—an ~8x improvement.

Note that sometimes this isn't desirable. For example, HCTR-2 computes POLYVAL over single blocks and the overhead of constructing plus cleaning up an N-wide POLYVAL hurt performance. So, we probably need to offer both a "wide" and a "lite" implementation.

I'm happy to donate my implementation (x86 and aarch64).

@tarcieri
Copy link
Member

Exciting work! The parallelism would be prospectively quite helpful for RustCrypto/AEADs#74

It'd be great if you could open a PR. Just looking over your code there are great code comments which are much better than what we currently have.

@nmathewson
Copy link
Contributor

Arti would benefit from this a lot. Would it be okay if I worked on making @ericlargergren's code into a PR? (It is under a 3BSD license.)

(Alternatively I could try to do a clean-room implementation of the basic technique, which would take more time.)

@ericlagergren
Copy link
Contributor Author

You're more than welcome to use my code. I am happy to relicense it as needed. I've been meaning to send a PR myself, but I just haven't had the time yet.

tarcieri added a commit that referenced this issue May 16, 2025
This MR adapts code from @ericlagergren with permission (see #225) to
speed up `polyval` and `ghash` using instruction level parallelism.

I expect there are still some performance gains to be had, but this is
already fairly well improved: I'm seeing ~~4.3x~~ 7.5x improvement for
large inputs on my Alder Lake x86_64 CPU, and 10x improvement on my
Apple M1 ARM CPU.

Because this pipelining is not always appropriate (notably, when you
know that you will only be handling small inputs), I've also imitated
polyval-rs by making the number of blocks to parallelize a generic
parameter.

I experimented with manually unrolling the loops here, but I didn't see
a performance gain.

---------

Co-authored-by: Tony Arcieri <[email protected]>
@tarcieri
Copy link
Member

Note: I'd still invite someone to adapt the soft backend from https://github.com/ericlagergren/polyval-rs/

There are possible refactoring improvements from that implementation (e.g. extracting a proper FieldElement type) which would help for more exotic use cases that perform operations over e.g. GHASH's field

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants