Initial riscv64 vector support (uses standard vector intrinsics for rvv 1.0. Presently VLEN=256 only.) #1037
0637941 to 6b4f845
@mjosaarinen If you have
dcab861 to 38e79bb
Note that on the "fastntt3" branch, there are layer-merged implementations of the NTT and INTT that are highly amenable to auto-vectorization with compilers like GCC 14. Benchmarks of that code on an RV64v target were encouraging, so it might provide some inspiration for a fully vectorized, hand-written back-end.
Yeah, you can easily double the speed with autovectorization alone, and some Google folks were of the opinion that they wanted to rely on that entirely in BoringSSL (RISC-V Android etc.), rather than maintain a hand-optimized version. The resulting code is pretty wild; I looked at that when considering RISC-V ISA extensions (see slide 17, for example, in https://mjos.fi/doc/20240325-rwc-riscv.pdf). It was almost "too good" -- I suspect that Google has used those NTTs as a microbenchmark when developing LLVM autovectorizers :)
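For readers wondering why NTT code vectorizes so readily, here is a minimal sketch of a single butterfly layer in portable C. It is not the code from the "fastntt3" branch or from this PR: the names `mont_reduce`, `fqmul`, and `ntt_layer` are hypothetical, and the reduction assumes the qi = 3327 convention, i.e. (-q)^-1 mod 2^16. The inner loop has no branches and no dependencies between iterations, which is the shape auto-vectorizers tend to handle well.

```c
#include <assert.h>
#include <stdint.h>

#define Q 3329
#define QI 3327u /* (-Q)^-1 mod 2^16 */

/* Montgomery reduction: returns a value congruent to a * 2^-16 mod Q.
 * Assumes |a| < 2^15 * Q and an arithmetic right shift for negative values. */
static int16_t mont_reduce(int32_t a)
{
  /* m = a * QI mod 2^16, reinterpreted as signed; a + m*Q is then a multiple of 2^16 */
  int16_t m = (int16_t)(uint16_t)((uint32_t)a * QI);
  return (int16_t)((a + (int32_t)m * Q) >> 16);
}

/* Multiply two coefficients, returning a * b * 2^-16 mod Q */
static int16_t fqmul(int16_t a, int16_t b)
{
  return mont_reduce((int32_t)a * b);
}

/* One Cooley-Tukey butterfly layer over 256 coefficients.
 * len is the butterfly distance; zetas holds one twiddle per block. */
void ntt_layer(int16_t r[256], unsigned len, const int16_t *zetas)
{
  unsigned k = 0;
  assert(len >= 1 && len <= 128);
  for (unsigned start = 0; start < 256; start += 2 * len) {
    int16_t zeta = zetas[k++];
    /* Branch-free, iteration-independent body: a candidate for vectorization */
    for (unsigned j = start; j < start + len; j++) {
      int16_t t = fqmul(zeta, r[j + len]);
      r[j + len] = (int16_t)(r[j] - t);
      r[j] = (int16_t)(r[j] + t);
    }
  }
}
```

Whether a given compiler actually vectorizes this loop depends on its version and flags, so inspecting the generated assembly is the only reliable check.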
Yeah, sorry for abusing your CI like that (I wasn't expecting it to be that extensive); I could have just read the documentation. I'll set up this Nix thing.
@mjosaarinen Sorry, we should have pointed that out earlier. With the
/* check-magic: off */

/* Montgomery reduction constants */
/* n = 256; q = 3329; r = 2^16 */
/* qi = lift(Mod(-q, r)^-1) */
#define MLKEM_QI 3327

/* r1 = lift(Mod(r, q)) */
#define MLK_MONT_R1 2285

/* r2 = lift(Mod(r, q)^2) */
#define MLK_MONT_R2 1353

/* in = lift(Mod(n / 2, q)^-1) */
/* nr = (in * r^2) % q */
#define MLK_MONT_NR 1441

/* check-magic: on */
- /* check-magic: off */
- /* Montgomery reduction constants */
- /* n = 256; q = 3329; r = 2^16 */
- /* qi = lift(Mod(-q, r)^-1) */
- #define MLKEM_QI 3327
- /* r1 = lift(Mod(r, q)) */
- #define MLK_MONT_R1 2285
- /* r2 = lift(Mod(r, q)^2) */
- #define MLK_MONT_R2 1353
- /* in = lift(Mod(n / 2, q)^-1) */
- /* nr = (in * r^2) % q */
- #define MLK_MONT_NR 1441
- /* check-magic: on */
+ /* check-magic: 3327 == pow(-MLKEM_Q, -1, 2^16) */
+ #define MLKEM_QI 3327
+ /* check-magic: 2285 == unsigned_mod(2^16, MLKEM_Q) */
+ #define MLK_MONT_R1 2285
+ /* check-magic: 1353 == pow(MLK_MONT_R1, 2, MLKEM_Q) */
+ #define MLK_MONT_R2 1353
+ /* check-magic: 1441 == pow(2, 32 - 7, MLKEM_Q) */
+ #define MLK_MONT_NR 1441
This auto-checks the magic number explanations in CI.
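To make the arithmetic behind these four constants concrete, here is a small standalone sketch that recomputes each value from n = 256, q = 3329, and r = 2^16. It is independent of the repository's check-magic tooling, whose implementation is not shown here, and the helper names (`pow_mod`, `mont_qi`, and so on) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define Q 3329u
#define R 65536u /* 2^16 */

/* Square-and-multiply; mod is at most 2^16, so products fit in 32 bits */
static uint32_t pow_mod(uint32_t base, uint32_t exp, uint32_t mod)
{
  uint32_t acc = 1;
  base %= mod;
  while (exp > 0) {
    if (exp & 1u) {
      acc = (acc * base) % mod;
    }
    base = (base * base) % mod;
    exp >>= 1;
  }
  return acc;
}

/* qi = (-q)^-1 mod r, found by searching the odd residues */
uint32_t mont_qi(void)
{
  for (uint32_t x = 1; x < R; x += 2) {
    if (((R - Q) * x) % R == 1u) {
      return x;
    }
  }
  return 0; /* not reached: q is odd, so the inverse exists */
}

/* r1 = r mod q */
uint32_t mont_r1(void) { return R % Q; }

/* r2 = r^2 mod q */
uint32_t mont_r2(void) { return (mont_r1() * mont_r1()) % Q; }

/* nr = (n/2)^-1 * r^2 mod q; the inverse of 128 uses Fermat's little theorem */
uint32_t mont_nr(void)
{
  uint32_t inv128 = pow_mod(128u, Q - 2u, Q);
  return (inv128 * mont_r2()) % Q;
}

/* Compare against the values defined in the header under review */
void check_mont_constants(void)
{
  assert(mont_qi() == 3327u);
  assert(mont_r1() == 2285u);
  assert(mont_r2() == 1353u);
  assert(mont_nr() == 1441u);
  /* Same value as the suggested form pow(2, 32 - 7, q) */
  assert(pow_mod(2u, 25u, Q) == 1441u);
}
```

Calling `check_mont_constants()` from a test harness aborts on the first mismatch, which is roughly what a CI check of the annotated comments would catch.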
{
/* zetas can be compiled into vector constants; don't pass as a pointer */
/* check-magic: off */
const int16_t zeta[0x80] = {
Those should ultimately be autogenerated via autogen, similar to the other twiddle tables (e.g. zetas.inc). You will be able to copy-paste and adjust most of it, I think -- from a cursory look, this is different from zetas.inc only in the order of the twiddles.
Do you have time to look into that, or shall Matthias or I do it as a follow-up?
I will be returning to this code in a couple of weeks, and I don't mind if someone scripts them in the meantime. All three tables were generated with throw-away gp-pari statements but are obvious to "reverse engineer" as you note; they are just some combination of various orderings of roots-of-unity powers and Montgomery constants. Anyway, I understand what you're after here.
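As an illustration of what such a generator involves, here is a minimal sketch that produces a twiddle table in the standard bit-reversed layout: powers of the 256th root of unity 17, scaled by the Montgomery factor 2285 and centered around zero. The ordering used by the tables in this PR differs and is not reproduced here, the repository's autogen script is not shown, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define Q 3329u
#define ROOT 17u      /* primitive 256th root of unity mod Q */
#define MONT_R1 2285u /* 2^16 mod Q */

/* Reverse the low 7 bits of x */
static unsigned bitrev7(unsigned x)
{
  unsigned r = 0;
  assert(x < 128u);
  for (int i = 0; i < 7; i++) {
    r = (r << 1) | ((x >> i) & 1u);
  }
  return r;
}

/* Square-and-multiply modulo Q; operands stay below Q, so products fit in 32 bits */
static uint32_t pow_mod_q(uint32_t base, unsigned exp)
{
  uint32_t acc = 1;
  base %= Q;
  while (exp > 0) {
    if (exp & 1u) {
      acc = (acc * base) % Q;
    }
    base = (base * base) % Q;
    exp >>= 1;
  }
  return acc;
}

/* Map a residue in [0, Q) to the centered range (-Q/2, Q/2] */
static int16_t center(uint32_t x)
{
  return (int16_t)(x > Q / 2u ? (int32_t)x - (int32_t)Q : (int32_t)x);
}

/* zetas[i] = 17^bitrev7(i) * 2^16 mod Q, centered */
void gen_zetas(int16_t zetas[128])
{
  for (unsigned i = 0; i < 128u; i++) {
    uint32_t z = pow_mod_q(ROOT, bitrev7(i));
    zetas[i] = center((z * MONT_R1) % Q);
  }
}
```

Producing a differently ordered table would then be a matter of replacing `bitrev7(i)` with whatever index permutation the vector code expects.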
…sics.) Signed-off-by: Markku-Juhani O. Saarinen <[email protected]>
Summary:
rv64v support (RISC-V Vector extension 1.0, which is available on newer application-class silicon).
Steps:
If your pull request consists of multiple sequential changes, please describe them here:
Performed local tests:
lint: passing
tests all: passing
tests bench: passing
tests cbmc: passing
Do you expect this change to impact performance: Yes/No
yes (RISC-V only)
If yes, please provide local benchmarking results.
Roughly 2.5x perf on silicon with vector hardware.