argon2: add parallelism #547

Merged

jonasmalacofilho
Contributor

@jonasmalacofilho jonasmalacofilho commented Jan 13, 2025

Adds a default-enabled parallel feature, with an otherwise optional dependency on rayon, and parallelizes the filling of blocks using the memory views described below.

Coordinated shared access in the memory blocks is implemented with a SegmentViewIter iterator, which implements either rayon::iter::ParallelIterator or core::iter::Iterator and returns SegmentView views into the Argon2 blocks memory that are safe to be used in parallel.
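
As a rough illustration of this kind of feature-gated dispatch, the sketch below selects a sequential or a parallel code path at compile time. It is not the PR's `SegmentViewIter`: `fill_segment` and `fill_all` are hypothetical names, the "blocks" are plain `u64` words, and `std::thread::scope` stands in for rayon so that the example needs no external crates.

```rust
/// Hypothetical stand-in for filling one lane's segment.
fn fill_segment(lane: usize, segment: &mut [u64]) {
    for (i, word) in segment.iter_mut().enumerate() {
        *word = ((lane as u64) << 32) | (i as u64);
    }
}

/// Fills every segment. Runs in parallel only when the (assumed) `parallel`
/// feature is enabled; otherwise it is a plain sequential loop.
/// `segment_len` must be non-zero, or `chunks_mut` panics.
fn fill_all(memory: &mut [u64], segment_len: usize) {
    let segments = memory.chunks_mut(segment_len).enumerate();

    #[cfg(feature = "parallel")]
    {
        // Scoped threads as a stand-in for a rayon parallel iterator.
        std::thread::scope(|s| {
            for (lane, segment) in segments {
                s.spawn(move || fill_segment(lane, segment));
            }
        });
    }

    #[cfg(not(feature = "parallel"))]
    {
        for (lane, segment) in segments {
            fill_segment(lane, segment);
        }
    }
}
```

Both paths produce the same memory contents, so callers do not need to know which one was compiled in.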

The views alias in the regions that are read-only, but are disjoint in the regions where mutation happens. Effectively, they implement, with a combination of mutable borrowing and runtime checking, the cooperative contract outlined in RFC 9106. This is similar to what was suggested in #380.
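
To make the "aliased where read-only, disjoint where mutated" idea concrete, here is a toy model in safe Rust. It is only a sketch under simplifying assumptions: memory holds exactly two slices back to back, blocks are single `u64` words, the function name `fill_second_slice` is hypothetical, and scoped threads replace rayon. The PR itself needs pointers and runtime checks because real Argon2 access patterns are not this regular.

```rust
use std::thread;

/// Toy model: `memory` holds two "slices" back to back, each made of `lanes`
/// segments of `seg_len` words. The first slice is already filled and is
/// shared read-only; the second is split into one mutable segment per lane.
fn fill_second_slice(memory: &mut [u64], lanes: usize, seg_len: usize) {
    let (done, current) = memory.split_at_mut(lanes * seg_len);
    let done: &[u64] = done; // shared by every worker from here on

    thread::scope(|s| {
        for (lane, segment) in current.chunks_mut(seg_len).enumerate() {
            s.spawn(move || {
                for (i, word) in segment.iter_mut().enumerate() {
                    // Read from another lane of the finished slice...
                    let other_lane = (lane + 1) % lanes;
                    let reference = done[other_lane * seg_len + i];
                    // ...but write only to this lane's own segment.
                    *word = reference.wrapping_add(lane as u64);
                }
            });
        }
    });
}
```

Here the borrow checker alone proves that the writes are disjoint, which is the property the views in this PR have to uphold by construction.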

To avoid aliasing mutable references into the entire buffer of blocks (which would be UB), pointers are used up to the moment where a reference (shared or mutable) into a specific block is returned. At that point, aliasing is no longer possible.
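
The pointer-based approach can be sketched as follows. This is a simplified, single-threaded illustration and not the PR's actual types: `View`, `two_views`, and the region layout are hypothetical, and blocks are again plain `u64` words. The point is that only a raw pointer covers the whole buffer, and a reference is created for exactly one block after a runtime check.

```rust
use std::marker::PhantomData;
use std::ops::Range;

/// Hypothetical view into a shared buffer of "blocks" (plain `u64`s here).
struct View<'a> {
    base: *mut u64,
    len: usize,
    readable: Range<usize>,
    writable: Range<usize>,
    // Ties the view to the exclusive borrow of the underlying buffer.
    _buffer: PhantomData<&'a mut [u64]>,
}

impl View<'_> {
    /// Shared reference to one block in the read-only region.
    fn get(&self, idx: usize) -> &u64 {
        assert!(idx < self.len && self.readable.contains(&idx));
        // SAFETY: `idx` is in bounds, and no view may write to the read-only
        // region, so a shared reference there cannot alias a mutable one.
        unsafe { &*self.base.add(idx) }
    }

    /// Mutable reference to one block in this view's own region.
    fn get_mut(&mut self, idx: usize) -> &mut u64 {
        assert!(idx < self.len && self.writable.contains(&idx));
        // SAFETY: `idx` is in bounds, writable regions of distinct views are
        // disjoint, and `&mut self` rules out overlapping borrows through
        // the same view.
        unsafe { &mut *self.base.add(idx) }
    }
}

/// Splits `memory` into two views: the first `shared` blocks are readable by
/// both, and the remainder is divided into two disjoint writable halves.
fn two_views(memory: &mut [u64], shared: usize) -> (View<'_>, View<'_>) {
    let len = memory.len();
    assert!(shared <= len);
    let mid = shared + (len - shared) / 2;
    let base = memory.as_mut_ptr();
    let make = |writable: Range<usize>| View {
        base,
        len,
        readable: 0..shared,
        writable,
        _buffer: PhantomData,
    };
    (make(shared..mid), make(mid..len))
}
```

Sending such views to other threads would additionally require an `unsafe impl Send`, with its own safety argument; that part is left out of this sketch.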

The following tests have been run under Miri and pass (modulo unrelated warnings):

reference_argon2i_v0x13_2_8_2
reference_argon2id_v0x13_2_8_2

(Running these in Miri is quite slow, taking ~5 minutes each, so I only ran the most obviously relevant tests for now).

Finally, the alignment of Blocks increases to 128 bytes for better prevention of false sharing on modern platforms. The new value is based on notes on crossbeam-utils::CachePadded.
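
For reference, raising the alignment of a block type looks roughly like this. The `Block` definition below is a hypothetical stand-in (a 1 KiB array of 128 `u64` words), not the crate's actual type, and `all_blocks_aligned` is an illustrative helper.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical 1 KiB block, aligned to 128 bytes so that two adjacent
/// blocks never share a 128-byte region.
#[repr(align(128))]
#[derive(Clone, Copy)]
struct Block([u64; 128]);

// Compile-time check of the layout assumed in this sketch.
const _: () = assert!(align_of::<Block>() == 128 && size_of::<Block>() == 1024);

/// Returns true when every block in the slice starts on a 128-byte boundary.
fn all_blocks_aligned(blocks: &[Block]) -> bool {
    blocks
        .iter()
        .all(|block| (block as *const Block as usize) % 128 == 0)
}
```

Since the block size is already a multiple of 128 bytes, the larger alignment adds no padding between blocks in a contiguous buffer.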


I also took some inspiration from an intermediate snapshot of #247, before the parallel implementation was removed, as well as from an implementation without any safe abstractions I just worked on for the rust-argon2 crate (sru-systems/rust-argon2#56).

@newpavlov
Member

Could you benchmark the parallel implementation and compare it against the single threaded one?

@jonasmalacofilho

This comment was marked as outdated.

@jonasmalacofilho
Contributor Author

jonasmalacofilho commented Jan 16, 2025

Some benchmarks:

Note: these are outdated since the removal of 018c3e9 due to #568 (comment).

Benchmarking master...HEAD with parallel feature
argon2i V0x10           time:   [21.324 ms 21.344 ms 21.371 ms]                          
                        change: [-0.3322% -0.1068% +0.0761%] (p = 0.34 > 0.05)
                        No change in performance detected.

argon2i V0x13           time:   [21.429 ms 21.447 ms 21.471 ms]                          
                        change: [+0.0329% +0.2197% +0.3896%] (p = 0.01 < 0.05)
                        Change within noise threshold.

argon2d V0x10           time:   [21.302 ms 21.322 ms 21.348 ms]                          
                        change: [+0.6139% +0.8010% +0.9679%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2d V0x13           time:   [21.367 ms 21.384 ms 21.408 ms]                          
                        change: [+1.8140% +1.9978% +2.1628%] (p = 0.00 < 0.05)
                        Performance has regressed.

argon2id V0x10          time:   [21.361 ms 21.379 ms 21.405 ms]                           
                        change: [+1.2980% +1.4700% +1.6321%] (p = 0.00 < 0.05)
                        Performance has regressed.

argon2id V0x13          time:   [21.303 ms 21.320 ms 21.342 ms]                           
                        change: [+0.9147% +1.1631% +1.3556%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2id V0x13 m=2048 t=4 p=4                                                                             
                        time:   [1.6939 ms 1.6979 ms 1.7026 ms]
                        change: [-58.795% -58.661% -58.490%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=16384 t=4 p=4                                                                            
                        time:   [11.230 ms 11.309 ms 11.391 ms]
                        change: [-67.907% -67.695% -67.447%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=65536 t=4 p=4                                                                            
                        time:   [44.778 ms 45.122 ms 45.489 ms]
                        change: [-71.067% -70.867% -70.621%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=262144 t=4 p=4                                                                            
                        time:   [172.61 ms 173.58 ms 174.61 ms]
                        change: [-72.478% -72.337% -72.127%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=2 p=4                                                                            
                        time:   [11.964 ms 12.047 ms 12.132 ms]
                        change: [-69.521% -69.311% -69.093%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=8 p=4                                                                            
                        time:   [45.011 ms 45.311 ms 45.623 ms]
                        change: [-69.838% -69.634% -69.434%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=16 p=4                                                                            
                        time:   [88.879 ms 89.461 ms 90.061 ms]
                        change: [-69.861% -69.687% -69.482%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=24 p=4                                                                            
                        time:   [133.26 ms 134.09 ms 134.93 ms]
                        change: [-69.816% -69.628% -69.446%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=1                                                                            
                        time:   [8.1242 ms 8.1254 ms 8.1268 ms]
                        change: [+1.4099% +1.4320% +1.4529%] (p = 0.00 < 0.05)
                        Performance has regressed.

argon2id V0x13 m=2048 t=8 p=2                                                                             
                        time:   [4.8775 ms 4.9057 ms 4.9336 ms]
                        change: [-39.640% -39.331% -38.984%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=4                                                                             
                        time:   [3.2967 ms 3.3045 ms 3.3137 ms]
                        change: [-59.213% -59.105% -58.995%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=6                                                                             
                        time:   [2.5706 ms 2.5757 ms 2.5827 ms]
                        change: [-68.446% -68.385% -68.301%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=8                                                                             
                        time:   [2.1205 ms 2.1339 ms 2.1500 ms]
                        change: [-73.975% -73.809% -73.631%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=12                                                                             
                        time:   [1.8220 ms 1.8515 ms 1.8819 ms]
                        change: [-77.377% -76.954% -76.482%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=16                                                                             
                        time:   [2.2035 ms 2.2221 ms 2.2437 ms]
                        change: [-73.287% -73.088% -72.841%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=64                                                                             
                        time:   [2.2370 ms 2.2553 ms 2.2788 ms]
                        change: [-74.567% -74.380% -74.087%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=1                                                                            
                        time:   [74.181 ms 74.228 ms 74.292 ms]
                        change: [-0.8519% -0.7318% -0.6115%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2id V0x13 m=32768 t=4 p=2                                                                            
                        time:   [39.565 ms 39.759 ms 39.980 ms]
                        change: [-47.750% -47.455% -47.143%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=4                                                                            
                        time:   [23.032 ms 23.199 ms 23.368 ms]
                        change: [-69.607% -69.389% -69.150%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=6                                                                            
                        time:   [18.127 ms 18.171 ms 18.214 ms]
                        change: [-75.369% -75.303% -75.239%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=8                                                                            
                        time:   [14.412 ms 14.439 ms 14.471 ms]
                        change: [-80.442% -80.403% -80.360%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=12                                                                            
                        time:   [11.878 ms 12.021 ms 12.200 ms]
                        change: [-83.827% -83.654% -83.390%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=16                                                                            
                        time:   [14.359 ms 14.388 ms 14.423 ms]
                        change: [-80.504% -80.462% -80.415%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=64                                                                            
                        time:   [12.239 ms 12.285 ms 12.343 ms]
                        change: [-83.542% -83.480% -83.391%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=1                                                                            
                        time:   [652.11 ms 652.26 ms 652.40 ms]
                        change: [-6.4332% -6.4049% -6.3769%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=2                                                                            
                        time:   [337.65 ms 338.01 ms 338.40 ms]
                        change: [-51.454% -51.401% -51.345%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=4                                                                            
                        time:   [178.52 ms 179.41 ms 180.40 ms]
                        change: [-74.218% -74.087% -73.947%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=6                                                                            
                        time:   [137.57 ms 139.27 ms 141.00 ms]
                        change: [-80.074% -79.832% -79.558%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=8                                                                            
                        time:   [136.21 ms 136.41 ms 136.64 ms]
                        change: [-80.298% -80.265% -80.231%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=12                                                                            
                        time:   [119.20 ms 120.03 ms 121.02 ms]
                        change: [-82.675% -82.535% -82.391%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=16                                                                            
                        time:   [146.64 ms 147.06 ms 147.47 ms]
                        change: [-78.611% -78.557% -78.499%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=64                                                                            
                        time:   [131.18 ms 131.41 ms 131.64 ms]
                        change: [-80.804% -80.771% -80.735%] (p = 0.00 < 0.05)
                        Performance has improved.

Note: 6-core CPU with SMT.
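
To read the numbers above as speedups: assuming the reported change is the relative change in time against the baseline, a change of -83.654% (m=32768 t=4 p=12) means the new time is about 16.3% of the old one, or roughly a 6.1x speedup. A small helper for that conversion, with a hypothetical name:

```rust
/// Converts a relative change in time (in percent, negative when faster)
/// into a speedup factor: old_time / new_time.
fn speedup_from_change(change_percent: f64) -> f64 {
    1.0 / (1.0 + change_percent / 100.0)
}
```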

Also:

Benchmarking master...HEAD without parallel feature, default param tests only
argon2i V0x10           time:   [21.365 ms 21.390 ms 21.419 ms]                          
                        change: [-0.9417% -0.7019% -0.4585%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2i V0x13           time:   [21.523 ms 21.548 ms 21.574 ms]                          
                        change: [+0.0241% +0.2325% +0.4389%] (p = 0.03 < 0.05)
                        Change within noise threshold.

argon2d V0x10           time:   [21.201 ms 21.220 ms 21.243 ms]                          
                        change: [-0.6101% -0.4179% -0.2436%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2d V0x13           time:   [21.403 ms 21.426 ms 21.453 ms]                          
                        change: [+0.3981% +0.6366% +0.8608%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2id V0x10          time:   [21.241 ms 21.258 ms 21.279 ms]                           
                        change: [-1.7410% -1.5319% -1.3262%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13          time:   [21.319 ms 21.335 ms 21.355 ms]                           
                        change: [-0.9682% -0.7757% -0.5904%] (p = 0.00 < 0.05)
                        Change within noise threshold.

@tarcieri
Member

@jonasmalacofilho if you can rebase I added cargo careful in #553 which should help spot issues in unsafe code

@jonasmalacofilho jonasmalacofilho force-pushed the add-parallelism-to-argon2 branch from 264821d to 018c3e9 Compare January 21, 2025 18:32
@jonasmalacofilho
Contributor Author

@tarcieri oh, I forgot about that one. Rebased, and thanks for pointing it out!

That said, we should probably also try to add the very cheapest of tests and have it run in Miri in CI:

> That said, there is a lot of Undefined Behavior that is not detected by cargo careful; check out Miri if you want to be more exhaustively covered. The advantage of cargo careful over Miri is that it works on all code, supports using arbitrary system and C FFI functions, and is much faster.

@jonasmalacofilho
Contributor Author

By the way, I think there are some things I can improve in the code, but I would really appreciate a review first. And so I've kept edits to a minimum for now, so that you can actually review it.

@tarcieri
Member

Yeah, Miri is tricky specifically because you can't do anything computationally expensive under it. I think we could potentially gate expensive tests under #[cfg(not(miri))] perhaps?
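
A minimal sketch of what such gating could look like. `toy_hash` is a hypothetical stand-in for an expensive hashing call, and the parameter values are made up for illustration; only `cfg(miri)` itself is something Miri actually sets.

```rust
/// Hypothetical stand-in for an expensive hashing call.
fn toy_hash(memory_kib: u32) -> u64 {
    (0..memory_kib as u64 * 128).fold(0u64, |acc, i| acc.rotate_left(5) ^ i)
}

/// Picks a much cheaper parameter when running under Miri.
fn test_memory_kib() -> u32 {
    if cfg!(miri) {
        8
    } else {
        256
    }
}

// Compiled out entirely under Miri.
#[cfg(not(miri))]
#[test]
fn expensive_case() {
    assert_eq!(toy_hash(65_536), toy_hash(65_536));
}

// Still runs under Miri, with the reduced parameter.
#[test]
fn cheap_case() {
    let m = test_memory_kib();
    assert_eq!(toy_hash(m), toy_hash(m));
}
```

An alternative with the same effect is `#[cfg_attr(miri, ignore)]`, which keeps the test compiled but skips it by default.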

@jonasmalacofilho
Contributor Author

> Yeah, Miri is tricky specifically because you can't do anything computationally expensive under it. I think we could potentially gate expensive tests under #[cfg(not(miri))] perhaps?

I think the 2_8_2 (t=2,m=2,p=2) tests are the cheapest in the crate, and still quite expensive... I could try adding t=1,m=8,p=2 tests and see if they execute in acceptable time in CI.

Additionally, maybe a few unit tests ensuring that allowed borrows pass in Miri, and that some known invalid borrow patterns are either impossible at compile time or caught at runtime.

@jonasmalacofilho
Contributor Author

jonasmalacofilho commented Mar 4, 2025

Quick update: I ended up getting stuck trying to remove the (apparently) unrelated warnings from Miri (a warning in crossbeam and a leak due to rayon), and then I couldn't get to this PR for a few weeks.


EDIT: (easily) running the tests in Miri is not currently possible due to crossbeam-rs/crossbeam#1181. Once that fix is released, it's possible that only Tree Borrows may work due to crossbeam-rs/crossbeam#545.

@jonasmalacofilho jonasmalacofilho force-pushed the add-parallelism-to-argon2 branch 2 times, most recently from 31cecde to 0f5355a Compare March 8, 2025 17:48
@jonasmalacofilho
Contributor Author

I removed the conflict, rebased the PR, fixed/updated the benchmarks and did some other minor cleanup.

Crossbeam-epoch doesn't currently work in Miri (see my edited comment above). Between that and the fact that even the most minimal Miri test would be super slow on GitHub free runners, I just don't think Miri tests in CI are worth it for now. (It should still be possible to get an older toolchain and Miri and run some specific tests locally.)

Is there something else you would like me to add here?

@tarcieri
Member

tarcieri commented Mar 9, 2025

@jonasmalacofilho still need to go through it in detail, but there's nothing I see that's an immediate blocker

@Maksych

Maksych commented Mar 14, 2025

Just curious: is this PR done and waiting to be merged, or does it still need some work?

@tarcieri
Member

tarcieri commented Mar 14, 2025

I still need to review it. Sorry it's been on the backburner since it's difficult to review due to the use of unsafe code and because we're currently working on more fundamental crates which are a blocker for another stable argon2 release.

In the meantime I appreciate any review others can offer.

@tarcieri
Member

Apologies for not getting to this yet. I hope to get to it soon, especially as we get closer to final releases.

@jonasmalacofilho in the meantime, can you fix the conflict? Thanks

@jonasmalacofilho jonasmalacofilho force-pushed the add-parallelism-to-argon2 branch from 0fd407c to dc29161 Compare June 3, 2025 07:08
Coordinated shared access in the memory blocks is implemented with
`SegmentViewIter` and associated types, which provide views into Argon2
memory that can be processed in parallel.

These views alias in the regions that are read-only, but are disjoint in
the regions where mutation happens. Effectively, they implement, with a
combination of mutable borrowing and runtime checking, the cooperative
contract outlined in RFC 9106.

To avoid aliasing mutable references into the entire buffer of blocks
(which would be UB), pointers are used up to the moment where a
reference (shared or mutable) into a specific block is returned. At that
point, aliasing is no longer possible, as argued in SAFETY comments
and/or checked at runtime.

Finally, add a `parallel` feature and parallelize filling the blocks
using the memory views mentioned above and rayon.
This was caused by having multiple different versions of criterion, and
therefore of the trait, in use: we specified ^0.4, but pprof 0.14.0 already
required ^0.5.
Additionally, use a set instead of trying to avoid repeating a
particular set of params by hand.
@jonasmalacofilho jonasmalacofilho force-pushed the add-parallelism-to-argon2 branch from dc29161 to 3a380c3 Compare July 16, 2025 03:00
Member

@tarcieri tarcieri left a comment

This seems plausible to me, and the benchmark improvements look impressive.

Would appreciate a second set of eyes from @newpavlov before merging.

Member

@newpavlov newpavlov left a comment

I will try to take a closer look at the implementation during this weekend.

Meanwhile, maybe we should introduce a separate feature-gated type (or generic parameter) for the parallel implementation instead of switching to it under the hood by enabling the feature?

@tarcieri
Member

@newpavlov IMO it should "just work" when the feature is enabled

@newpavlov
Member

newpavlov commented Jul 18, 2025

I don't have a strong opinion here, so I am fine with either, but my minor concern is that rayon-based multithreading may not always be an appropriate option. Imagine an async authentication microservice: it probably should not use the parallel feature, but one of its dependencies may enable the feature unconditionally (because "it's much faster!"). It would be less of a problem if we had exclusive/global features. With current Rust/Cargo capabilities, the other option would be to introduce cfg-based gating, but that would probably be too unergonomic for this feature.

@tarcieri
Member

Yeah, that's a reasonable concern. I guess my answer there would be to file an issue against those dependencies and have them propagate an off-by-default parallel feature rather than enabling it unconditionally

@jonasmalacofilho
Contributor Author

jonasmalacofilho commented Jul 18, 2025

I simply followed what you guys did on balloon-hash. Still, I think it's a simple yet good enough solution here.

Parallelism is also gated by the number of lanes, which the defender usually controls. And parallelism only serves defenders, specifically by allowing them to use higher memory and/or time parameters with acceptable latency. So if the defender doesn't want to use parallelism, they can create hashes with p=1 (which is currently the default).

@jonasmalacofilho jonasmalacofilho force-pushed the add-parallelism-to-argon2 branch from bcc9c03 to cdeac38 Compare July 18, 2025 22:52
To pass the newly enabled clippy lints, add a missing safety comment to
the compress_avx2 call.
@jonasmalacofilho jonasmalacofilho force-pushed the add-parallelism-to-argon2 branch from cdeac38 to fb4b0f4 Compare July 19, 2025 03:07
@tarcieri tarcieri merged commit e75b27d into RustCrypto:master Jul 21, 2025
62 checks passed