Initial riscv64 vector support (uses standard vector intrinsics for RVV 1.0. Presently VLEN=256 only.) #1037

Open: mjosaarinen wants to merge 1 commit into main from rv64v-dev

mjosaarinen

Summary:
RV64V support (RISC-V Vector Extension 1.0, which is available on newer application-class silicon).

Steps:
If your pull request consists of multiple sequential changes, please describe them here:

Performed local tests:

  • lint passing
  • tests all passing
  • tests bench passing
  • tests cbmc passing

Do you expect this change to impact performance: Yes/No
Yes (RISC-V only)

If yes, please provide local benchmarking results.
Roughly 2.5x speedup on silicon with vector hardware.

@mjosaarinen mjosaarinen requested a review from a team as a code owner May 19, 2025 14:57
@mjosaarinen mjosaarinen force-pushed the rv64v-dev branch 3 times, most recently from 0637941 to 6b4f845 Compare May 19, 2025 18:35
@hanno-becker
Contributor

@mjosaarinen If you have nix set up, running autogen should hopefully resolve the linting issues.

@mjosaarinen mjosaarinen force-pushed the rv64v-dev branch 2 times, most recently from dcab861 to 38e79bb Compare May 19, 2025 20:40
@rod-chapman
Contributor

Note that on the "fastntt3" branch, there are layer-merged implementations of the NTT and INTT that are highly amenable to auto-vectorization with compilers like GCC 14. Benchmarks of that code on an RV64v target were encouraging, so might provide some inspiration for a fully vectorized, hand-written back-end.

@mjosaarinen
Author

> Note that on the "fastntt3" branch, there are layer-merged implementations of the NTT and INTT that are highly amenable to auto-vectorization with compilers like GCC 14. Benchmarks of that code on an RV64v target were encouraging, so might provide some inspiration for a fully vectorized, hand-written back-end.

Yeah, you can easily double the speed with autovectorization alone, and some Google folks were of the opinion that they wanted to rely on that entirely in BoringSSL (RISC-V Android etc.), rather than maintain a hand-optimized version. The resulting code is pretty wild; I looked at that when considering RISC-V ISA extensions (see slide 17, for example, in https://mjos.fi/doc/20240325-rwc-riscv.pdf). It was almost "too good" -- I suspect that Google has used those NTTs as a microbenchmark when developing LLVM autovectorizers :)
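
As a rough illustration of the kind of loop being discussed here, the sketch below shows one reference-style NTT butterfly layer in C with a signed Montgomery multiplication. It is not the fastntt3 code and not the code in this PR; the function names and the constant 62209 (q^-1 mod 2^16, i.e. 2^16 - 3327) are assumptions chosen for the example. The inner loop has a fixed stride and no dependency between iterations, which is the general shape that auto-vectorizers tend to handle well.

```c
#include <stdint.h>

#define MLKEM_Q 3329
#define Q_INV_MOD_R 62209u /* assumption: q^-1 mod 2^16 = 2^16 - 3327 */

/* Signed Montgomery reduction: returns a * 2^-16 mod q in (-q, q),
 * valid for |a| < q * 2^15. Hypothetical name, illustration only. */
static int16_t mont_reduce(int32_t a)
{
  /* m = a * q^-1 mod 2^16, re-centred into [-2^15, 2^15) */
  int32_t m = (int32_t)(((uint32_t)a * Q_INV_MOD_R) & 0xFFFFu);
  if (m >= 32768)
  {
    m -= 65536;
  }
  /* a - m*q is an exact multiple of 2^16, so the division is exact */
  return (int16_t)((a - m * MLKEM_Q) / 65536);
}

/* Montgomery multiplication: a * b * 2^-16 mod q. */
static int16_t fqmul(int16_t a, int16_t b)
{
  return mont_reduce((int32_t)a * b);
}

/* One Cooley-Tukey layer over 256 coefficients with butterfly
 * distance len; zetas holds one Montgomery-form twiddle per block. */
static void ntt_layer(int16_t r[256], const int16_t *zetas, unsigned len)
{
  unsigned start, j, k = 0;
  for (start = 0; start < 256; start += 2 * len)
  {
    const int16_t zeta = zetas[k++];
    for (j = start; j < start + len; j++)
    {
      const int16_t t = fqmul(zeta, r[j + len]);
      r[j + len] = (int16_t)(r[j] - t);
      r[j] = (int16_t)(r[j] + t);
    }
  }
}
```

A layer-merged variant would keep several such layers in registers before storing back; the sketch above deliberately shows only the unmerged building block.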

@mjosaarinen
Author

> @mjosaarinen If you have nix set up, running autogen should hopefully resolve the linting issues.

Yeah, sorry for abusing your CI like that (I wasn't expecting it to be that extensive); I could have just read the documentation. I'll set up this nix thing.

@hanno-becker
Contributor

hanno-becker commented May 20, 2025

@mjosaarinen Sorry, we should have pointed that out earlier. With the nix environment, you should not need to waste any more time making the linter happy. Just run format && autogen before pushing.

Comment on lines +21 to +38
/* check-magic: off */

/* Montgomery reduction constants */
/* n = 256; q = 3329; r = 2^16 */
/* qi = lift(Mod(-q, r)^-1) */
#define MLKEM_QI 3327

/* r1 = lift(Mod(r, q)) */
#define MLK_MONT_R1 2285

/* r2 = lift(Mod(r, q)^2) */
#define MLK_MONT_R2 1353

/* in = lift(Mod(n / 2, q)^-1) */
/* nr = (in * r^2) % q */
#define MLK_MONT_NR 1441

/* check-magic: on */
Contributor


Suggested change
- /* check-magic: off */
- /* Montgomery reduction constants */
- /* n = 256; q = 3329; r = 2^16 */
- /* qi = lift(Mod(-q, r)^-1) */
- #define MLKEM_QI 3327
- /* r1 = lift(Mod(r, q)) */
- #define MLK_MONT_R1 2285
- /* r2 = lift(Mod(r, q)^2) */
- #define MLK_MONT_R2 1353
- /* in = lift(Mod(n / 2, q)^-1) */
- /* nr = (in * r^2) % q */
- #define MLK_MONT_NR 1441
- /* check-magic: on */
+ /* check-magic: 3327 == pow(-MLKEM_Q, -1, 2^16) */
+ #define MLKEM_QI 3327
+ /* check-magic: 2285 == unsigned_mod(2^16, MLKEM_Q) */
+ #define MLK_MONT_R1 2285
+ /* check-magic: 1353 == pow(MLK_MONT_R1, 2, MLKEM_Q) */
+ #define MLK_MONT_R2 1353
+ /* check-magic: 1441 == pow(2,32 - 7,MLKEM_Q) */
+ #define MLK_MONT_NR 1441

This auto-checks the magic number explanations in CI.
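
For readers who want to confirm the four values independently of CI, here is a minimal C sketch that recomputes them from the definitions in the comments above. It is not the project's check-magic tooling; `mod_pow` and `check_mont_constants` are hypothetical names used only for illustration.

```c
#include <stdint.h>

#define MLKEM_Q 3329 /* q from the comments above */

/* Hypothetical helper: base^exp mod m, for m < 2^32. */
static uint32_t mod_pow(uint32_t base, uint32_t exp, uint32_t m)
{
  uint64_t result = 1 % m;
  uint64_t b = base % m;
  while (exp > 0)
  {
    if (exp & 1u)
    {
      result = (result * b) % m;
    }
    b = (b * b) % m;
    exp >>= 1;
  }
  return (uint32_t)result;
}

/* Hypothetical check: returns 1 if all four constants match, else 0. */
static int check_mont_constants(void)
{
  const uint32_t r = 1u << 16; /* Montgomery radix 2^16 */

  /* qi = (-q)^-1 mod r, equivalently q * qi == -1 (mod r) */
  const int qi_ok = ((MLKEM_Q * 3327u) % r) == r - 1;

  /* r1 = r mod q */
  const int r1_ok = (r % MLKEM_Q) == 2285u;

  /* r2 = r^2 mod q */
  const int r2_ok = mod_pow(2285u, 2, MLKEM_Q) == 1353u;

  /* nr = (n/2)^-1 * r^2 mod q = 2^(32-7) mod q, computed two ways;
   * the inverse of 128 uses Fermat's little theorem since q is prime */
  const uint32_t inv128 = mod_pow(128u, MLKEM_Q - 2, MLKEM_Q);
  const int nr_ok = mod_pow(2u, 32 - 7, MLKEM_Q) == 1441u &&
                    (inv128 * 1353u) % MLKEM_Q == 1441u;

  return qi_ok && r1_ok && r2_ok && nr_ok;
}
```

Calling `check_mont_constants()` from a small test program and checking that it returns 1 covers all four definitions at once.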

{
/* zetas can be compiled into vector constants; don't pass as a pointer */
/* check-magic: off */
const int16_t zeta[0x80] = {
Contributor


Those should ultimately be autogenerated via autogen, similar to the other twiddle tables (e.g. zetas.inc). You will be able to copy, paste, and adjust most of it, I think -- from a cursory look, this differs from zetas.inc only in the order of the twiddles.

Do you have time to look into that, or shall Matthias or I do it as a follow-up?

Author


I will be returning to this code in a couple of weeks, and I don't mind if someone scripts them in the meantime. All three tables were generated with throw-away gp-pari statements but are easy to "reverse engineer", as you note; they are just some combination of various orderings of roots-of-unity powers and Montgomery constants. Anyway, I understand what you're after here.
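
As a sketch of what such a script computes, the C snippet below regenerates the standard ML-KEM twiddle table in Montgomery form: entry i is 17^bitrev7(i) * 2^16 mod q, stored as a signed representative. It follows the usual bit-reversed ordering from FIPS 203, not the orderings used by the three tables in this PR, and the function names are hypothetical.

```c
#include <stdint.h>

#define MLKEM_Q 3329
#define MLKEM_ROOT 17u /* primitive 256th root of unity mod q (FIPS 203) */

/* Reverse the low 7 bits of x. */
static unsigned bitrev7(unsigned x)
{
  unsigned r = 0, i;
  for (i = 0; i < 7; i++)
  {
    r = (r << 1) | ((x >> i) & 1u);
  }
  return r;
}

/* base^exp mod q; all intermediates stay below 2^24. */
static uint32_t pow_mod_q(uint32_t base, unsigned exp)
{
  uint32_t result = 1, b = base % MLKEM_Q;
  while (exp > 0)
  {
    if (exp & 1u)
    {
      result = (result * b) % MLKEM_Q;
    }
    b = (b * b) % MLKEM_Q;
    exp >>= 1;
  }
  return result;
}

/* Fill out[i] = 17^bitrev7(i) * 2^16 mod q, centred into (-q/2, q/2]. */
static void gen_zetas(int16_t out[128])
{
  unsigned i;
  for (i = 0; i < 128; i++)
  {
    const uint32_t z = pow_mod_q(MLKEM_ROOT, bitrev7(i));
    int32_t m = (int32_t)((z * 65536u) % MLKEM_Q); /* Montgomery form */
    if (m > MLKEM_Q / 2)
    {
      m -= MLKEM_Q;
    }
    out[i] = (int16_t)m;
  }
}
```

A different table ordering would only change the exponent passed to `pow_mod_q`; the Montgomery scaling and centring steps stay the same.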
