Comparing the performance of these two similar functions, the one that iterates with `step_by(2)` is significantly slower than the one that uses a traditional `while` loop.
fn is_prime_iter(num: u32) -> bool {
    if num == 2 {
        return true;
    }
    if num % 2 == 0 {
        return false;
    }
    let mut is_prime = true;
    for i in (3..num).step_by(2) {
        if num % i == 0 {
            is_prime = false;
            break;
        }
    }
    is_prime
}

fn is_prime_loop(num: u32) -> bool {
    if num == 2 {
        return true;
    }
    if num % 2 == 0 {
        return false;
    }
    let mut is_prime = true;
    let mut i = 3;
    while i < num {
        if num % i == 0 {
            is_prime = false;
            break;
        }
        i += 2;
    }
    is_prime
}
Running the experimental `cargo bench --all-targets` produced the following results for me using the nightly-x86_64-unknown-gnu toolchain (1.35.0-nightly):
// benchmark value: 15485867
test tests::bench_is_prime_iter ... bench: 26,813,870 ns/iter (+/- 337,054)
test tests::bench_is_prime_loop ... bench: 21,607,357 ns/iter (+/- 242,028)
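For readers without a nightly toolchain, a rough comparison can be sketched on stable Rust with `std::time::Instant`. This is only an approximation of the `cargo bench` run above: `time_once` is a hypothetical helper (not part of the original report), it measures a single call rather than many iterations, and `std::hint::black_box` is assumed to be available (stable since Rust 1.66).

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

fn is_prime_iter(num: u32) -> bool {
    if num == 2 {
        return true;
    }
    if num % 2 == 0 {
        return false;
    }
    // Same trial division as the report, written with `all`-free explicit loop.
    for i in (3..num).step_by(2) {
        if num % i == 0 {
            return false;
        }
    }
    true
}

fn is_prime_loop(num: u32) -> bool {
    if num == 2 {
        return true;
    }
    if num % 2 == 0 {
        return false;
    }
    let mut i = 3;
    while i < num {
        if num % i == 0 {
            return false;
        }
        i += 2;
    }
    true
}

/// Hypothetical helper: runs `f` once on `input` and returns its result
/// together with the elapsed wall-clock time.
fn time_once(f: fn(u32) -> bool, input: u32) -> (bool, Duration) {
    let start = Instant::now();
    // black_box discourages the optimizer from precomputing the answer.
    let result = f(black_box(input));
    (black_box(result), start.elapsed())
}
```

Calling `time_once(is_prime_iter, 15_485_867)` and `time_once(is_prime_loop, 15_485_867)` in a release build gives two durations to compare; a single measurement is noisy, so repeated runs are advisable before drawing conclusions.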
hellow554 commented on Mar 19, 2019
Interestingly enough, if you use `all` instead of a manual if/break, it gets even slightly faster (Playground).
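The Playground link is not preserved here, so the following is only a guess at what the `all`-based variant might look like. The name `is_prime_all` is hypothetical; the early returns are copied from the functions in the issue description.

```rust
/// Hypothetical `all`-based variant of the functions from the issue.
/// Like the originals, it does not special-case 0 or 1.
fn is_prime_all(num: u32) -> bool {
    if num == 2 {
        return true;
    }
    if num % 2 == 0 {
        return false;
    }
    // `all` drives the iterator internally and stops at the first divisor.
    (3..num).step_by(2).all(|i| num % i != 0)
}
```

Because `all` short-circuits, this has the same early-exit behavior as the manual `break`.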
scottmcm commented on Mar 21, 2019
For tiny loop bodies like this it's not unusual for internal iteration to be faster; `chain` is the same.
nikic commented on Mar 25, 2019
Godbolt: https://rust.godbolt.org/z/LRTDre
Also related: #57517
steveklabnik commented on Aug 28, 2020
Triage: no change
jdahlstrom commented on Oct 24, 2021
I ran into this while implementing the inner loop of a rasterizer. Internal iteration or not, `step_by` generates quite inefficient assembly even when iterating over a simple integer range [1] or a slice. In the case of an inclusive range, the difference from a while loop is quite extreme! [2]
I find it a bit unfortunate that there's no obvious equivalent for the "classic" `FOR i = a TO b STEP c` (i.e. `for(int i = a; i < b; i += c)`) loop, given that `step_by` is not zero-cost and `while` is not as clear as to its intention.
Edit: apparently there was a fix proposed but it had to be reverted :( [3]
Footnotes
[1] https://rust.godbolt.org/z/5hYsr3rWf
[2] https://rust.godbolt.org/z/WqE7xW6aM
[3] https://internals.rust-lang.org/t/more-about-step-by/12436/14
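To make the two spellings of the "classic" stepped loop concrete, here is a minimal sketch. The function names and the summing body are illustrative only and do not come from the linked Godbolt examples; the `assert!` in the `while` version is an assumed precondition added so that `i += c` cannot overflow.

```rust
/// Sums a, a + c, a + 2c, ... for all values below b, using `step_by`.
fn sum_step_by(a: u32, b: u32, c: u32) -> u64 {
    // `step_by` takes a `usize` step and panics if the step is zero.
    (a..b).step_by(c as usize).map(u64::from).sum()
}

/// The same sum written as a `while` loop.
fn sum_while(a: u32, b: u32, c: u32) -> u64 {
    // Assumed preconditions (not from the thread): a nonzero step, and an
    // upper bound small enough that `i += c` can never overflow.
    assert!(c > 0 && b <= u32::MAX - c);
    let mut total = 0u64;
    let mut i = a;
    while i < b {
        total += u64::from(i);
        i += c;
    }
    total
}
```

The `step_by` form states the intent directly, while the `while` form leaves the stepping and the overflow question to the reader.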
scottmcm commented on Oct 24, 2021
@jdahlstrom Make sure you compare equivalent semantics. `foo_while` in your 2nd footnote is an infinite loop if passed `u32::MAX`, whereas the `RangeInclusive` + `step_by` version will properly terminate.
Of course, the "classic" loop of `for(int i = a; i < b; i += c)` is UB in those scenarios, which is part of why it's so fast 🙃
jdahlstrom commented on Oct 25, 2021
@scottmcm True, and I appreciate that Rust does the right thing there, but still it seems unfortunate that the penalty for doing the right thing is so great in the common case (where `u32::MAX` is often not an element of the function's "practical" domain, doubly so if we're talking `usize` instead!). And unfortunately LLVM isn't smart enough to generate better code even if we make the special case(s) impossible by adding an appropriate `assert!` or iterating over a range of a wider type (i.e. `0..max as u64` in the example) :(
(On the other hand, widening the induction variable of the `while` version in the specific case of the example actually makes LLVM ditch the loop altogether and utilize a variation of the `max*(max-1)/2` formula instead!)
scottmcm commented on Oct 25, 2021
This is a really interesting one. It looks like in https://rust.godbolt.org/z/15W97roTo it has trouble turning the `uadd.with.overflow` into just a normal addition when it should know that it's in-range because of the assert.
Might be worth experimenting with that a bit to see whether there's a reduced example that could be an A-llvm issue.
The "obvious" version it does just fine (<https://rust.godbolt.org/z/4jhj6WTdG>) so there's something confounding going on here.
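The contents of the Godbolt links are not reproduced in this thread, so the following is only a guess at the general shape of such an "obvious" version: an `assert!` that rules out overflow, followed by an overflow-reporting addition. The function name is hypothetical; the assert expression matches the one quoted later in the discussion.

```rust
/// Hypothetical reduced example: the assert makes overflow impossible,
/// so in principle the optimizer can treat the addition as a plain add
/// whose overflow flag is always false.
fn add_after_assert(x: u32, y: u32) -> (u32, bool) {
    assert!(x <= u32::MAX - y);
    x.overflowing_add(y)
}
```

Whether the overflow check is actually removed depends on the compiler version and flags, so the generated IR or assembly needs to be inspected to confirm it.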
jonasmalacofilho commented on Oct 25, 2021
Note: this comment has been edited to fix an incorrect assumption. The incorrect assumption was that it was rustc, not LLVM, that was optimizing the `optimized()` version.
It seems that the "obvious" version ~~is optimized by rustc, not LLVM, and that~~ depends on ~~it~~ the compiler also knowing (in the function body and before the assert) the value being added to `x`.
https://rust.godbolt.org/z/rWvrqjME1
(I used `checked_add` instead of `overflowing_add`, but I don't think that matters).
scottmcm commented on Oct 25, 2021
Is it? In https://rust.godbolt.org/z/Yov3GfTqc the MIR still contains the call to `core::num::<impl u32>::overflowing_add`.
jonasmalacofilho commented on Oct 25, 2021
My reasoning was based on my optimized example not calling `@llvm.uadd.with.overflow.i32` in the LLVM IR. But yes, both the optimized and non-optimized examples contain calls to `core::num::<impl u32>::checked_add()` in the MIR; and the same happens in your optimized example.
But isn't LLVM IR the boundary between rustc and LLVM?
scottmcm commented on Oct 25, 2021
rustc emits LLVM-IR, yes, but if optimizations are on then `--emit=llvm-ir` will show the IR after LLVM's optimization passes have run.
You can pass opt-level=0 to see the often-horrifying IR that rustc directly produces.
jonasmalacofilho commented on Oct 25, 2021
Oh, I didn't know that : \
A bit off topic (and please feel free to hide this comment), but wouldn't that also affect optimizations performed by rustc?
For example, I notice that with `opt-level=0` the subtraction in `assert!(x <= u32::MAX - y);` apparently gets compiled with overflow checking, like in debug builds.
Would `-C llvm-args=-opt-bisect-limit=0` be better to see the IR generated by rustc at the desired `opt-level`?
scottmcm commented on Oct 25, 2021
You can pass `-C debug-assertions=n` to change that one. I'm not sure what MIR opts are configured by optimization level; one can always pass `-Z mir-opt-level=1` to turn those back on. (Most of them are off even in optimized builds, though, IIRC.)
The zero-optimization-fuel approach is interesting. I don't know if all the LLVM optimizations actually hook into that properly. If they do it sounds like an interesting approach, though.
nikic commented on Oct 25, 2021
`-C no-prepopulate-passes` can be used for this purpose.
jrmuizel commented on Feb 7, 2022
Comparing `checked_add` in a loop vs `step_by` shows there's something special going on with `step_by` beyond the `checked_add`:
https://rust.godbolt.org/z/79aTac847
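The Godbolt example itself is not shown here; as a rough illustration of the comparison, the sketch below counts the elements of an inclusive stepped range both ways. The function names are hypothetical, and `u8` is used only so that the `MAX` edge case discussed earlier in the thread is cheap to exercise.

```rust
/// Counts 0, step, 2*step, ... up to and including `max`, using `step_by`.
fn count_step_by(max: u8, step: u8) -> u32 {
    // `step_by` panics on a zero step.
    (0..=max).step_by(step as usize).count() as u32
}

/// The same count with a manual loop that uses `checked_add`, so it
/// terminates even when `max` is `u8::MAX`.
fn count_checked(max: u8, step: u8) -> u32 {
    assert!(step > 0); // mirror the `step_by` requirement
    let mut n = 0;
    let mut i = 0u8;
    loop {
        if i > max {
            break;
        }
        n += 1;
        // Stop instead of wrapping when the next index would overflow.
        match i.checked_add(step) {
            Some(next) => i = next,
            None => break,
        }
    }
    n
}
```

Both versions terminate for `max == u8::MAX`, which is the semantic difference from a plain `i += step` loop that was pointed out earlier in the thread.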
the8472 commented on Sep 3, 2023
#111850 fixed this.
https://rust.godbolt.org/z/h1M73Tf34
The iter and loop assembly look a bit different but in both cases there are just 3 branches in the loop:
A big improvement compared to the rustc 1.33 version which had tons of branches
The benchmarks from #59281 (comment) look identical now:
So I think this can be closed.