argon2: add parallelism #547

jonasmalacofilho · 2025-01-13T03:19:55Z

Adds a ~~default-enabled~~ parallel feature, with an ~~otherwise~~ optional dependency on rayon, and parallelizes the filling of blocks using the memory views described below.

Coordinated shared access in the memory blocks is implemented with a SegmentViewIter iterator, which implements either rayon::iter::ParallelIterator or core::iter::Iterator and returns SegmentView views into the Argon2 blocks memory that are safe to be used in parallel.

The views alias in the regions that are read-only, but are disjoint in the regions where mutation happens. Effectively, they implement, with a combination of mutable borrowing and runtime checking, the cooperative contract outlined in RFC 9106. This is similar to what was suggested in #380.

To avoid aliasing mutable references into the entire buffer of blocks (which would be UB), pointers are used up to the moment where a reference (shared or mutable) into a specific block is returned. At that point, aliasing is no longer possible.
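The contract described above can be illustrated with a minimal sketch. This is not the PR's code: `Block` is shrunk to four words, the `SegmentView` here is a simplified hypothetical with the same name, `segment_views` and `fill` are invented helpers, and `std::thread::scope` stands in for rayon. Each view keeps only a raw pointer, refuses at runtime to hand out a reference that another view could be mutating, and relies on `&mut self` for exclusivity inside its own segment.

```rust
use std::marker::PhantomData;
use std::ops::Range;
use std::thread;

/// Hypothetical stand-in for an Argon2 block (the real one is 1 KiB).
type Block = [u64; 4];

/// View handed to one lane while one slice is being filled.
struct SegmentView<'a> {
    base: *mut Block, // raw pointer: no reference to the whole buffer is kept
    lanes: usize,
    lane_len: usize,
    seg_len: usize,
    lane: usize,
    slice: usize,
    _memory: PhantomData<&'a mut [Block]>,
}

// SAFETY: views of one slice write to disjoint segments and never read a block
// that another view of that slice may write (checked in `get` and `get_mut`).
unsafe impl Send for SegmentView<'_> {}

impl SegmentView<'_> {
    /// Indices of the blocks this view is allowed to mutate.
    fn own_segment(&self) -> Range<usize> {
        let start = self.lane * self.lane_len + self.slice * self.seg_len;
        start..start + self.seg_len
    }

    /// Shared access; refused for blocks another view may be mutating.
    fn get(&self, index: usize) -> Option<&Block> {
        if index >= self.lanes * self.lane_len {
            return None;
        }
        let in_current_slice = (index % self.lane_len) / self.seg_len == self.slice;
        if in_current_slice && !self.own_segment().contains(&index) {
            return None;
        }
        // SAFETY: in bounds, and no other view can hand out `&mut` to this block.
        Some(unsafe { &*self.base.add(index) })
    }

    /// Exclusive access; only granted inside this view's own segment.
    fn get_mut(&mut self, index: usize) -> Option<&mut Block> {
        if !self.own_segment().contains(&index) {
            return None;
        }
        // SAFETY: segments are disjoint across views, and `&mut self` rules out
        // outstanding references obtained through this same view.
        Some(unsafe { &mut *self.base.add(index) })
    }
}

/// Splits `memory` into one view per lane for the given slice.
fn segment_views(memory: &mut [Block], lanes: usize, slices: usize, slice: usize) -> Vec<SegmentView<'_>> {
    assert!(lanes > 0 && slice < slices && !memory.is_empty());
    assert_eq!(memory.len() % (lanes * slices), 0);
    let (lane_len, base) = (memory.len() / lanes, memory.as_mut_ptr());
    let seg_len = lane_len / slices;
    (0..lanes)
        .map(|lane| SegmentView { base, lanes, lane_len, seg_len, lane, slice, _memory: PhantomData })
        .collect()
}

/// Toy fill: each block mixes its predecessor with a block taken from the
/// previous slice of the next lane (already complete, hence read-only).
fn fill(memory: &mut [Block], lanes: usize, slices: usize) {
    for slice in 0..slices {
        let views = segment_views(memory, lanes, slices, slice);
        thread::scope(|s| {
            for mut view in views {
                s.spawn(move || {
                    for index in view.own_segment() {
                        let prev = if index % view.lane_len == 0 { 1 } else { view.get(index - 1).expect("readable")[0] };
                        let cross = if slice == 0 { 0 } else {
                            let other = ((view.lane + 1) % view.lanes) * view.lane_len + (slice - 1) * view.seg_len;
                            view.get(other).expect("readable")[0]
                        };
                        *view.get_mut(index).expect("own segment") = [prev.wrapping_add(cross); 4];
                    }
                });
            }
        });
    }
}
```

Because cross-lane reads only touch slices that were completed earlier, the result does not depend on thread scheduling; asking a view for a block in another lane's current segment returns `None` instead of producing an aliased reference.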

The following tests have been run under Miri and pass (modulo unrelated warnings):

reference_argon2i_v0x13_2_8_2
reference_argon2id_v0x13_2_8_2

(Running these in Miri is quite slow, taking ~5 minutes each, so I only ran the most obviously relevant tests for now).

~~Finally, the alignment of Blocks increases to 128 bytes for better prevention of false sharing on modern platforms. The new value is based on notes on crossbeam-utils::CachePadded.~~

I also took some inspiration from an intermediate snapshot of #247, before the parallel implementation was removed, as well as from an implementation without any safe abstractions I just worked on for the rust-argon2 crate (sru-systems/rust-argon2#56).

newpavlov · 2025-01-13T11:34:04Z

Could you benchmark the parallel implementation and compare it against the single threaded one?

argon2/src/lib.rs

jonasmalacofilho · 2025-01-16T11:16:18Z

Some benchmarks:

Note: these are outdated since the removal of 018c3e9 due to #568 (comment).

Benchmarking master...HEAD with parallel feature

argon2i V0x10           time:   [21.324 ms 21.344 ms 21.371 ms]                          
                        change: [-0.3322% -0.1068% +0.0761%] (p = 0.34 > 0.05)
                        No change in performance detected.

argon2i V0x13           time:   [21.429 ms 21.447 ms 21.471 ms]                          
                        change: [+0.0329% +0.2197% +0.3896%] (p = 0.01 < 0.05)
                        Change within noise threshold.

argon2d V0x10           time:   [21.302 ms 21.322 ms 21.348 ms]                          
                        change: [+0.6139% +0.8010% +0.9679%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2d V0x13           time:   [21.367 ms 21.384 ms 21.408 ms]                          
                        change: [+1.8140% +1.9978% +2.1628%] (p = 0.00 < 0.05)
                        Performance has regressed.

argon2id V0x10          time:   [21.361 ms 21.379 ms 21.405 ms]                           
                        change: [+1.2980% +1.4700% +1.6321%] (p = 0.00 < 0.05)
                        Performance has regressed.

argon2id V0x13          time:   [21.303 ms 21.320 ms 21.342 ms]                           
                        change: [+0.9147% +1.1631% +1.3556%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2id V0x13 m=2048 t=4 p=4                                                                             
                        time:   [1.6939 ms 1.6979 ms 1.7026 ms]
                        change: [-58.795% -58.661% -58.490%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=16384 t=4 p=4                                                                            
                        time:   [11.230 ms 11.309 ms 11.391 ms]
                        change: [-67.907% -67.695% -67.447%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=65536 t=4 p=4                                                                            
                        time:   [44.778 ms 45.122 ms 45.489 ms]
                        change: [-71.067% -70.867% -70.621%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=262144 t=4 p=4                                                                            
                        time:   [172.61 ms 173.58 ms 174.61 ms]
                        change: [-72.478% -72.337% -72.127%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=2 p=4                                                                            
                        time:   [11.964 ms 12.047 ms 12.132 ms]
                        change: [-69.521% -69.311% -69.093%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=8 p=4                                                                            
                        time:   [45.011 ms 45.311 ms 45.623 ms]
                        change: [-69.838% -69.634% -69.434%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=16 p=4                                                                            
                        time:   [88.879 ms 89.461 ms 90.061 ms]
                        change: [-69.861% -69.687% -69.482%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=24 p=4                                                                            
                        time:   [133.26 ms 134.09 ms 134.93 ms]
                        change: [-69.816% -69.628% -69.446%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=1                                                                            
                        time:   [8.1242 ms 8.1254 ms 8.1268 ms]
                        change: [+1.4099% +1.4320% +1.4529%] (p = 0.00 < 0.05)
                        Performance has regressed.

argon2id V0x13 m=2048 t=8 p=2                                                                             
                        time:   [4.8775 ms 4.9057 ms 4.9336 ms]
                        change: [-39.640% -39.331% -38.984%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=4                                                                             
                        time:   [3.2967 ms 3.3045 ms 3.3137 ms]
                        change: [-59.213% -59.105% -58.995%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=6                                                                             
                        time:   [2.5706 ms 2.5757 ms 2.5827 ms]
                        change: [-68.446% -68.385% -68.301%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=8                                                                             
                        time:   [2.1205 ms 2.1339 ms 2.1500 ms]
                        change: [-73.975% -73.809% -73.631%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=12                                                                             
                        time:   [1.8220 ms 1.8515 ms 1.8819 ms]
                        change: [-77.377% -76.954% -76.482%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=16                                                                             
                        time:   [2.2035 ms 2.2221 ms 2.2437 ms]
                        change: [-73.287% -73.088% -72.841%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=64                                                                             
                        time:   [2.2370 ms 2.2553 ms 2.2788 ms]
                        change: [-74.567% -74.380% -74.087%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=1                                                                            
                        time:   [74.181 ms 74.228 ms 74.292 ms]
                        change: [-0.8519% -0.7318% -0.6115%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2id V0x13 m=32768 t=4 p=2                                                                            
                        time:   [39.565 ms 39.759 ms 39.980 ms]
                        change: [-47.750% -47.455% -47.143%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=4                                                                            
                        time:   [23.032 ms 23.199 ms 23.368 ms]
                        change: [-69.607% -69.389% -69.150%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=6                                                                            
                        time:   [18.127 ms 18.171 ms 18.214 ms]
                        change: [-75.369% -75.303% -75.239%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=8                                                                            
                        time:   [14.412 ms 14.439 ms 14.471 ms]
                        change: [-80.442% -80.403% -80.360%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=12                                                                            
                        time:   [11.878 ms 12.021 ms 12.200 ms]
                        change: [-83.827% -83.654% -83.390%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=16                                                                            
                        time:   [14.359 ms 14.388 ms 14.423 ms]
                        change: [-80.504% -80.462% -80.415%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=64                                                                            
                        time:   [12.239 ms 12.285 ms 12.343 ms]
                        change: [-83.542% -83.480% -83.391%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=1                                                                            
                        time:   [652.11 ms 652.26 ms 652.40 ms]
                        change: [-6.4332% -6.4049% -6.3769%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=2                                                                            
                        time:   [337.65 ms 338.01 ms 338.40 ms]
                        change: [-51.454% -51.401% -51.345%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=4                                                                            
                        time:   [178.52 ms 179.41 ms 180.40 ms]
                        change: [-74.218% -74.087% -73.947%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=6                                                                            
                        time:   [137.57 ms 139.27 ms 141.00 ms]
                        change: [-80.074% -79.832% -79.558%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=8                                                                            
                        time:   [136.21 ms 136.41 ms 136.64 ms]
                        change: [-80.298% -80.265% -80.231%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=12                                                                            
                        time:   [119.20 ms 120.03 ms 121.02 ms]
                        change: [-82.675% -82.535% -82.391%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=16                                                                            
                        time:   [146.64 ms 147.06 ms 147.47 ms]
                        change: [-78.611% -78.557% -78.499%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=64                                                                            
                        time:   [131.18 ms 131.41 ms 131.64 ms]
                        change: [-80.804% -80.771% -80.735%] (p = 0.00 < 0.05)
                        Performance has improved.

Note: 6-core CPU with SMT.
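For reading the table, a relative change reported by Criterion can be converted into a speedup factor with 1 / (1 + change). The sketch below applies that to the point estimates of the m=32768 t=4 rows above; the function and constant names are made up for illustration. It gives roughly 1.9x, 3.3x, 4.0x and 6.1x for p=2, 4, 6 and 12.

```rust
/// Converts a Criterion relative change (in percent, negative means faster)
/// into a speedup factor: old_time / new_time = 1 / (1 + change).
fn speedup_from_change(change_percent: f64) -> f64 {
    1.0 / (1.0 + change_percent / 100.0)
}

/// Point estimates copied from the m=32768 t=4 rows above: (p, change in %).
const M32768_T4: [(u32, f64); 4] = [(2, -47.455), (4, -69.389), (6, -75.303), (12, -83.654)];

/// Returns (p, speedup) pairs for the rows listed in `M32768_T4`.
fn speedups() -> Vec<(u32, f64)> {
    M32768_T4
        .iter()
        .map(|&(p, change)| (p, speedup_from_change(change)))
        .collect()
}
```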

Also:

Benchmarking master...HEAD without parallel feature, default param tests only

argon2i V0x10           time:   [21.365 ms 21.390 ms 21.419 ms]                          
                        change: [-0.9417% -0.7019% -0.4585%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2i V0x13           time:   [21.523 ms 21.548 ms 21.574 ms]                          
                        change: [+0.0241% +0.2325% +0.4389%] (p = 0.03 < 0.05)
                        Change within noise threshold.

argon2d V0x10           time:   [21.201 ms 21.220 ms 21.243 ms]                          
                        change: [-0.6101% -0.4179% -0.2436%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2d V0x13           time:   [21.403 ms 21.426 ms 21.453 ms]                          
                        change: [+0.3981% +0.6366% +0.8608%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2id V0x10          time:   [21.241 ms 21.258 ms 21.279 ms]                           
                        change: [-1.7410% -1.5319% -1.3262%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13          time:   [21.319 ms 21.335 ms 21.355 ms]                           
                        change: [-0.9682% -0.7757% -0.5904%] (p = 0.00 < 0.05)
                        Change within noise threshold.

tarcieri · 2025-01-21T18:21:23Z

@jonasmalacofilho if you can rebase I added cargo careful in #553 which should help spot issues in unsafe code

jonasmalacofilho · 2025-01-21T18:37:02Z

@tarcieri oh, I forgot about that one. Rebased, and thanks for pointing it out!

That said, we should probably also try to add the very cheapest of tests and have it run in Miri in CI:

> That said, there is a lot of Undefined Behavior that is not detected by cargo careful; check out Miri if you want to be more exhaustively covered. The advantage of cargo careful over Miri is that it works on all code, supports using arbitrary system and C FFI functions, and is much faster.

jonasmalacofilho · 2025-01-21T18:46:18Z

By the way, I think there are some things I can improve in the code, but I would really appreciate a review first. And so I've kept edits to a minimum for now, so that you can actually review it.

tarcieri · 2025-01-21T18:47:15Z

Yeah, Miri is tricky specifically because you can't do anything computationally expensive under it. I think we could potentially gate expensive tests under #[cfg(not(miri))] perhaps?
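A minimal sketch of that gating, assuming nothing about the crate's actual tests: `expensive_work` is a hypothetical stand-in for a costly hash, and the test names are invented. Miri sets `cfg(miri)`, so `#[cfg(not(miri))]` compiles the expensive test out under `cargo miri test`, while `cfg!(miri)` can shrink parameters for tests that remain.

```rust
/// Hypothetical stand-in for a costly routine; not the crate's API.
/// Sums 0..rounds so the expected result has a closed form.
fn expensive_work(rounds: u64) -> u64 {
    (0..rounds).fold(0u64, |acc, i| acc.wrapping_add(i))
}

#[cfg(test)]
mod tests {
    use super::*;

    /// Tiny under Miri, larger everywhere else.
    const ROUNDS: u64 = if cfg!(miri) { 4 } else { 100_000 };

    // Cheap: runs everywhere, including under Miri.
    #[test]
    fn cheap_case() {
        assert_eq!(expensive_work(ROUNDS), ROUNDS * (ROUNDS - 1) / 2);
    }

    // Expensive: not even compiled when Miri sets `cfg(miri)`.
    #[cfg(not(miri))]
    #[test]
    fn expensive_case() {
        let n = 10_000_000;
        assert_eq!(expensive_work(n), n * (n - 1) / 2);
    }
}
```

`#[cfg_attr(miri, ignore)]` is an alternative that keeps the test compiled but reports it as ignored under Miri.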

jonasmalacofilho · 2025-01-21T18:59:15Z

> Yeah, Miri is tricky specifically because you can't do anything computationally expensive under it. I think we could potentially gate expensive tests under #[cfg(not(miri))] perhaps?

I think the 2_8_2 (t=2,m=2,p=2) tests are the cheapest in the crate, and still quite expensive... I could try adding t=1,m=8,p=2 tests and see if they execute in acceptable time in CI.

Additionally, maybe a few unit tests ensuring that allowed borrows pass in Miri, and that some known invalid borrow patterns are either impossible at compile time or caught at runtime.

argon2/src/memory.rs

jonasmalacofilho · 2025-03-04T04:24:44Z

Quick update: I ended up getting stuck trying to remove the (apparently) unrelated warnings from Miri (a warning in crossbeam and a leak due to rayon), and then I couldn't get to this PR for a few weeks.

EDIT: (easily) running the tests in Miri is not currently possible due to crossbeam-rs/crossbeam#1181. Once that fix is released, it's possible that only Tree Borrows may work due to crossbeam-rs/crossbeam#545.

jonasmalacofilho · 2025-03-08T18:37:23Z

I removed the conflict, rebased the PR, fixed/updated the benchmarks and did some other minor cleanup.

Crossbeam-epoch doesn't currently work in Miri (see my edited comment above). Between that and the fact that even the most minimal Miri test would be super slow on GitHub free runners, I just don't think they are worth it for now. (It should still be possible to get an older toolchain and Miri and run some specific tests locally.)

Is there something else you would like me to add here?

tarcieri · 2025-03-09T21:01:53Z

@jonasmalacofilho still need to go through it in detail, but there's nothing I see that's an immediate blocker

Maksych · 2025-03-14T19:28:03Z

Just curious: is this PR done and waiting to be merged, or does it still need some work?

tarcieri · 2025-03-14T20:12:36Z

I still need to review it. Sorry it's been on the backburner since it's difficult to review due to the use of unsafe code and because we're currently working on more fundamental crates which are a blocker for another stable argon2 release.

In the meantime I appreciate any review others can offer.

tarcieri · 2025-05-30T16:59:18Z

Apologies for not getting to this yet. I hope to get to it soon, especially as we get closer to final releases.

@jonasmalacofilho in the meantime, can you fix the conflict? Thanks

Coordinated shared access in the memory blocks is implemented with `SegmentViewIter` and associated types, which provide views into Argon2 memory that can be processed in parallel. These views alias in the regions that are read-only, but are disjoint in the regions where mutation happens. Effectively, they implement, with a combination of mutable borrowing and runtime checking, the cooperative contract outlined in RFC 9106. To avoid aliasing mutable references into the entire buffer of blocks (which would be UB), pointers are used up to the moment where a reference (shared or mutable) into a specific block is returned. At that point, aliasing is no longer possible, as argued in SAFETY comments and/or checked at runtime. Finally, add a `parallel` feature and parallelize filling the blocks using the memory views mentioned above and rayon.

argon2/src/memory.rs

tarcieri

This seems plausible to me, and the benchmark improvements look impressive.

Would appreciate a second set of eyes from @newpavlov before merging.

newpavlov

I will try to take a closer look at the implementation during this weekend.

Meanwhile, maybe we should introduce a separate feature-gated type (or generic parameter) for the parallel implementation instead of switching to it under the hood by enabling the feature?

argon2/Cargo.toml

tarcieri · 2025-07-18T21:46:48Z

@newpavlov IMO it should "just work" when the feature is enabled

newpavlov · 2025-07-18T22:07:59Z

I don't have a strong opinion here, so I am fine with either, but my minor concern is that rayon-based multithreading may not always be an appropriate option. Imagine an async authentication microservice: it probably should not use the parallel feature, but one of its dependencies may enable the feature unconditionally (because "it's much faster!"). It would be less of a problem if we had exclusive/global features. With current Rust/Cargo capabilities, the other option would be to introduce cfg-based gating, but it would probably be too unergonomic for this feature.

tarcieri · 2025-07-18T22:13:23Z

Yeah, that's a reasonable concern. I guess my answer there would be to file an issue against those dependencies and have them propagate an off-by-default parallel feature rather than enabling it unconditionally

jonasmalacofilho · 2025-07-18T22:34:23Z

I simply followed what you guys did on balloon-hash. Still, I think it's a simple yet good enough solution here.

Parallelism is also gated by the number of lanes, which the defender usually controls. And parallelism ~~only~~ serves defenders ~~, specifically~~ by allowing them to use higher memory and/or time parameters with acceptable latency. So if the defender doesn't want to use parallelism, they can create hashes with p=1 (which is currently the default).
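To make the two gates concrete, here is a small sketch in which threads are used only when both a compile-time feature and more than one lane are present. It is an illustration under assumptions, not the crate's code: `fill_lane`, `fill_lanes_with` and `fill_lanes` are hypothetical names, `std::thread::scope` stands in for rayon, and the `parallel` feature name is only assumed to be declared in the enclosing crate's manifest.

```rust
use std::thread;

/// Toy per-lane work: writes a value derived from the lane and block index.
fn fill_lane(lane_index: usize, lane: &mut [u64]) {
    for (j, block) in lane.iter_mut().enumerate() {
        *block = (lane_index as u64) * 1_000 + j as u64;
    }
}

/// Fills every lane, on separate threads only when `parallel` is true and
/// there is more than one lane. Returns whether threads were actually used.
fn fill_lanes_with(lanes: &mut [Vec<u64>], parallel: bool) -> bool {
    let use_threads = parallel && lanes.len() > 1;
    if use_threads {
        thread::scope(|s| {
            for (i, lane) in lanes.iter_mut().enumerate() {
                s.spawn(move || fill_lane(i, lane));
            }
        });
    } else {
        for (i, lane) in lanes.iter_mut().enumerate() {
            fill_lane(i, lane);
        }
    }
    use_threads
}

/// Entry point: callers write the same code whether or not the (assumed)
/// `parallel` feature is enabled for the crate.
fn fill_lanes(lanes: &mut [Vec<u64>]) -> bool {
    fill_lanes_with(lanes, cfg!(feature = "parallel"))
}
```

With a single lane (p=1) the sequential path is taken even when the feature is on, which is the sense in which the defender keeps control through the parameters.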

jonasmalacofilho force-pushed the add-parallelism-to-argon2 branch from 264821d to 018c3e9 Compare January 21, 2025 18:32

tarcieri mentioned this pull request Mar 3, 2025

Argon2::hash_password_into should use fallible memory allocations #566

Closed

BlackHoleFox reviewed Mar 3, 2025

argon2/src/memory.rs (outdated, resolved)

tarcieri mentioned this pull request Mar 4, 2025

argon2: detect allocation failures in hash_password_into #568

Merged

jonasmalacofilho force-pushed the add-parallelism-to-argon2 branch 2 times, most recently from 31cecde to 0f5355a Compare March 8, 2025 17:48

jonasmalacofilho force-pushed the add-parallelism-to-argon2 branch from 0fd407c to dc29161 Compare June 3, 2025 07:08

jonasmalacofilho added 6 commits July 15, 2025 23:59

benches: bump criterion to prevent Profiler not implemented error

0e2d2e8

This was caused by having multiple different versions of criterion, and therefore of the trait, in use: we specified ^0.4, but pprof 0.14.0 already required ^0.5.

benches: expand argon2 benchmarks varying p_cost

cbc5b4b

Additionally, use a set instead of trying to avoid repeating a particular set of params by hand.

benches: patch password-hash due to new os-rng feature

28583a9

argon2: simplify memory view implementation and internal API

32db77c

benches: run benches in a predictable order

3a380c3

jonasmalacofilho force-pushed the add-parallelism-to-argon2 branch from dc29161 to 3a380c3 Compare July 16, 2025 03:00

tarcieri reviewed Jul 17, 2025

argon2/src/memory.rs (outdated, resolved)

tarcieri reviewed Jul 17, 2025

argon2/src/memory.rs (outdated, resolved)

tarcieri approved these changes Jul 17, 2025

newpavlov reviewed Jul 18, 2025

argon2/Cargo.toml (outdated, resolved)

jonasmalacofilho force-pushed the add-parallelism-to-argon2 branch from bcc9c03 to cdeac38 Compare July 18, 2025 22:52

jonasmalacofilho added 2 commits July 19, 2025 00:06

argon2: don't enable std by enabling parallel

267c753

argon2: configure lints at the toplevel

fb4b0f4

To pass the newly enabled clippy lints, add a missing safety comment to the compress_avx2 call.

jonasmalacofilho force-pushed the add-parallelism-to-argon2 branch from cdeac38 to fb4b0f4 Compare July 19, 2025 03:07

tarcieri merged commit e75b27d into RustCrypto:master Jul 21, 2025
62 checks passed

tarcieri mentioned this pull request Jul 21, 2025

argon2: parallel memory view abstraction #380

Closed
