polyval, ghash: use powers of H to process multiple blocks at a time #225

ericlagergren · 2025-02-12T18:25:08Z

We can compute h = [H^n, H^(n-1), ..., H] and then process N blocks at a time. On a 2020 M1, a stride of 8 runs at about 0.17 cycles per byte whereas a stride of 1 runs at about 1.4 cycles per byte—an ~8x improvement.

Note that sometimes this isn't desirable. For example, HCTR-2 computes POLYVAL over single blocks and the overhead of constructing plus cleaning up an N-wide POLYVAL hurt performance. So, we probably need to offer both a "wide" and a "lite" implementation.

I'm happy to donate my implementation (x86 and aarch64).

The text was updated successfully, but these errors were encountered:

tarcieri · 2025-02-14T02:23:08Z

Exciting work! The parallelism would be prospectively quite helpful for RustCrypto/AEADs#74

It'd be great if you could open a PR. Just looking over your code there are great code comments which are much better than what we currently have.

nmathewson · 2025-05-11T16:05:07Z

Arti would benefit from this a lot. Would it be okay if I worked on making @ericlargergren's code into a PR? (It is under a 3BSD license.)

(Alternatively I could try to do a clean-room implementation of the basic technique, which would take more time.)

ericlagergren · 2025-05-11T16:12:15Z

You're more than welcome to use my code. I am happy to relicense it as needed. I've been meaning to send a PR myself, but I just haven't had the time yet.

@ericlagergren

This MR adapts code from @ericlagergren with permission (see #225) to speed up `polyval` and `ghash` using instruction level parallelism. I expect there are still some performance gains to be had, but this is already fairly well improved: I'm seeing ~~4.3x~~ 7.5x improvement for large inputs on my Alder Lake x86_64 CPU, and 10x improvement on my Apple M1 ARM CPU. Because this pipelining is not always appropriate (notably, when you know that you will only be handling small inputs), I've also imitated polyval-rs by making the number of blocks to parallelize a generic parameter. I experimented with manually unrolling the loops here, but I didn't see a performance gain. --------- Co-authored-by: Tony Arcieri <[email protected]>

tarcieri · 2025-05-17T17:20:35Z

Note: I'd still invite someone to adapt the soft backend from https://github.com/ericlagergren/polyval-rs/

There are possible refactoring improvements from that implementation (e.g. extracting a proper FieldElement type) which would help for more exotic use cases that perform operations over e.g. GHASH's field

nmathewson mentioned this issue May 12, 2025

polyval+ghash: large performance improvements via better instruction-level parallelism #230

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

polyval, ghash: use powers of H to process multiple blocks at a time #225

polyval, ghash: use powers of H to process multiple blocks at a time #225

ericlagergren commented Feb 12, 2025

tarcieri commented Feb 14, 2025

Uh oh!

nmathewson commented May 11, 2025

Uh oh!

ericlagergren commented May 11, 2025

Uh oh!

tarcieri commented May 17, 2025

Uh oh!

polyval, ghash: use powers of H to process multiple blocks at a time #225

polyval, ghash: use powers of H to process multiple blocks at a time #225

Comments

ericlagergren commented Feb 12, 2025

tarcieri commented Feb 14, 2025

Uh oh!

nmathewson commented May 11, 2025

Uh oh!

ericlagergren commented May 11, 2025

Uh oh!

tarcieri commented May 17, 2025

Uh oh!