Skip to content

Iter with step_by(2) performs slowly #59281

Closed
@wooleyra

Description

@wooleyra

Comparing the performance between these similar functions, the one that uses the iterator with the step_by(2) function performs significantly slower than the one using a traditional "while loop".

fn is_prime_iter(num: u32) -> bool {
    if num == 2 {
        return true;
    }
    if num % 2 == 0 {
        return false;
    }
    let mut is_prime = true;
    for i in (3..num).step_by(2) {
        if num % i == 0 {
            is_prime = false;
            break;
        }
    }
    is_prime
}

fn is_prime_loop(num: u32) -> bool {
    if num == 2 {
        return true;
    }
    if num % 2 == 0 {
        return false;
    }
    let mut is_prime = true;
    let mut i = 3;
    while i < num {
        if num % i == 0 {
            is_prime = false;
            break;
        }
        i += 2;
    }
    is_prime
}

Running the experimental cargo bench --all-targets produced the following results for me using the nightly-x86_64-unknown-gnu toolchain (1.35.0-nightly):

// benchmark value: 15485867
test tests::bench_is_prime_iter  ... bench:  26,813,870 ns/iter (+/- 337,054)
test tests::bench_is_prime_loop ... bench:  21,607,357 ns/iter (+/- 242,028)

Activity

added
I-slowIssue: Problems and improvements with respect to performance of generated code.
T-compilerRelevant to the compiler team, which will review and decide on the PR/issue.
on Mar 18, 2019
hellow554

hellow554 commented on Mar 19, 2019

@hellow554
Contributor

Interesting enough, if you are using all instead of manual if/break it gets even slightly faster

# cargo bench
    Finished release [optimized] target(s) in 0.01s
     Running target/release/deps/foo-8f4575d6c09c1e6c

running 3 tests
test allbar ... bench:      89,120 ns/iter (+/- 2,359)
test iter   ... bench:     109,650 ns/iter (+/- 3,717)
test loop   ... bench:      92,457 ns/iter (+/- 2,685)

(Playground)

scottmcm

scottmcm commented on Mar 21, 2019

@scottmcm
Member

For tiny loop bodies like this it's not unusual for internal iteration to be faster -- chain is the same.

nikic

nikic commented on Mar 25, 2019

@nikic
Contributor
steveklabnik

steveklabnik commented on Aug 28, 2020

@steveklabnik
Member

Triage: no change

jdahlstrom

jdahlstrom commented on Oct 24, 2021

@jdahlstrom

I ran into this while implementing the inner loop of a rasterizer. Internal iteration or not, step_by generates quite inefficient assembly even when iterating over a simple integer range1 or a slice. In the case of an inclusive range, the difference to a while loop is quite extreme! 2

I find it a bit unfortunate that there's no obvious equivalent for the "classic" FOR i = a TO b STEP c (ie. for(int i = a; i < b; i += c)) loop, given thatstep_by is not zero-cost and while is not as clear as to its intention.

Edit: apparently there was a fix proposed but it had to be reverted :( 3

Footnotes

  1. https://rust.godbolt.org/z/5hYsr3rWf

  2. https://rust.godbolt.org/z/WqE7xW6aM

  3. https://internals.rust-lang.org/t/more-about-step-by/12436/14

scottmcm

scottmcm commented on Oct 24, 2021

@scottmcm
Member

@jdahlstrom Make sure you compare equivalent semantics. foo_while in your 2nd footnote is an infinite loop if passed u32::MAX, whereas the RangeInclusive+step_by version will properly terminate.

Of course, the "classic" loop of for(int i = a; i < b; i += c) is UB in those scenarios, which is part of why it's so fast 🙃

jdahlstrom

jdahlstrom commented on Oct 25, 2021

@jdahlstrom

@scottmcm True, and I appreciate that Rust does the right thing there, but still it seems unfortunate that the penalty for doing the right thing is so great in the common case (where u32::MAX is often not an element of the function's "practical" domain – doubly so if we're talking usize instead!) And unfortunately LLVM isn't smart enough to generate better code even if we make the special case(s) impossible by adding an appropriate assert! or iterating over a range of a wider type (ie. 0..max as u64 in the example) :(

(On the other hand, widening the induction variable of the while version in the specific case of the example actually makes LLVM ditch the loop altogether and utilize a variation of the max*(max-1)/2 formula instead!)

scottmcm

scottmcm commented on Oct 25, 2021

@scottmcm
Member

And unfortunately LLVM isn't smart enough to generate better code even if we make the special case(s) impossible by adding an appropriate assert!

This is a really interesting one. It looks like https://rust.godbolt.org/z/15W97roTo it has trouble removing the uadd.with.overflow into just a normal addition when it should know that it's in-range because of the assert.

Might be worth experimenting with that a bit to see whether there's a reduced example that could be an A-llvm issue.

The "obvious" version it does just fine (<https://rust.godbolt.org/z/4jhj6WTdG>) so there's something confounding going on here.

jonasmalacofilho

jonasmalacofilho commented on Oct 25, 2021

@jonasmalacofilho

Note: this comment has been edited to fix an incorrect assumption. The incorrect assumption was that it was rustc, not LLVM, that was optimizing the optimized() version.


The "obvious" version it does just fine (https://rust.godbolt.org/z/4jhj6WTdG) so there's something confounding going on here.

It seems that the "obvious" version is optimized by rustc, not LLVM, and that depends on it the compiler also knowing (in the function body and before the assert) the value being added to x.

/// Optimized as long as the compiler knows:
/// - (a) the value of y;
/// - and (b) that x <= u32::MAX - y.
/// Note that the order of (a) and (b) also matters.
pub fn optimized(x: u32, y: u32) -> u32 {
    assert!(y == 42); // cannot be moved down
    assert!(x <= u32::MAX - y);

    x.checked_add(y).unwrap()
}

/// The check for overflow is not yet optimized away, even though
/// x <= u32::MAX - y.
pub fn not_optimized_yet(x: u32, y: u32) -> u32 {
    assert!(x <= u32::MAX - y);

    x.checked_add(y).unwrap()
}

https://rust.godbolt.org/z/rWvrqjME1

(I used checked_add instead of overflowing_add, but I don't think that matters).

scottmcm

scottmcm commented on Oct 25, 2021

@scottmcm
Member

It seems that the "obvious" version is optimized by rustc, not LLVM

Is it? In https://rust.godbolt.org/z/Yov3GfTqc the MIR still contains the call to core::num::<impl u32>::overflowing_add.

jonasmalacofilho

jonasmalacofilho commented on Oct 25, 2021

@jonasmalacofilho

Is it? In https://rust.godbolt.org/z/Yov3GfTqc the MIR still contains the call to core::num::<impl u32>::overflowing_add.

My reasoning was based on my example that is optimized not calling @llvm.uadd.with.overflow.i32 in the LLVM IR. But yes, both optimized and non optimized examples contain calls to core::num::<impl u32>::checked_add() in the MIR; and the same happens in your optimized example.

But isn't LLVM IR the boundary between rustc and LLVM?

scottmcm

scottmcm commented on Oct 25, 2021

@scottmcm
Member

But isn't LLVM IR the boundary between rustc and LLVM?

rustc emits LLVM-IR, yes, but if optimizations are on then the --emit=llvm-ir will show the IR after LLVM's optimization passes have run.

You can pass opt-level=0 to see the often-horrifying IR that rustc directly produces.

jonasmalacofilho

jonasmalacofilho commented on Oct 25, 2021

@jonasmalacofilho

rustc emits LLVM-IR, yes, but if optimizations are on then the --emit=llvm-ir will show the IR after LLVM's optimization passes have run.

Oh, I didn't know that : \

You can pass opt-level=0 to see the often-horrifying IR that rustc directly produces.

A bit off topic (and please feel free to hide this comment), but wouldn't that also affect optimizations performed by rustc?

For example, I notice that with opt-level=0 the subtraction in assert!(x <= u32::MAX - y); apparently gets compiled with overflow checking, like in debug builds.

Would -C llvm-args=-opt-bisect-limit=0 be better to see the IR generated by rustc at the desired opt-level?

scottmcm

scottmcm commented on Oct 25, 2021

@scottmcm
Member

apparently gets compiled with overflow checking, like in debug builds.

You can pass -C debug-assertions=n to change that one. I'm not sure what MIR opts are configured by optimization level; one can always pass -Z mir-opt-level=1 to turn those back on. (Most of them are off even in optimized builds, though, IIRC.)

The zero-optimization-fuel approach is interesting. I don't know if all the LLVM optimizations actually hook into that properly. If they do it sounds like an interesting approach, though.

nikic

nikic commented on Oct 25, 2021

@nikic
Contributor

rustc emits LLVM-IR, yes, but if optimizations are on then the --emit=llvm-ir will show the IR after LLVM's optimization passes have run.

Oh, I didn't know that : \

You can pass opt-level=0 to see the often-horrifying IR that rustc directly produces.

A bit off topic (and please feel free to hide this comment), but wouldn't that also affect optimizations performed by rustc?

For example, I notice that with opt-level=0 the subtraction in assert!(x <= u32::MAX - y); apparently gets compiled with overflow checking, like in debug builds.

Would -C llvm-args=-opt-bisect-limit=0 be better to see the IR generated by rustc at the desired opt-level?

-C no-prepopulate-passes can be used for this purpose.

jrmuizel

jrmuizel commented on Feb 7, 2022

@jrmuizel
Contributor

Comparing checked_add in a loop vs step_by shows there's something special going on in with step_by beyond the checked_add
https://rust.godbolt.org/z/79aTac847

the8472

the8472 commented on Sep 3, 2023

@the8472
Member

#111850 fixed this.

https://rust.godbolt.org/z/h1M73Tf34

The iter and loop assembly look a bit different but in both cases there are just 3 branches in the loop:

  • panic
  • loop exit
  • backedge

A big improvement compared to the rustc 1.33 version which had tons of branches

The benchmarks from #59281 (comment) look identical now:

test allbar ... bench:     133,624 ns/iter (+/- 1,487)
test iter   ... bench:     133,673 ns/iter (+/- 1,271)
test r#loop ... bench:     133,629 ns/iter (+/- 1,067)

So I think this can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    I-slowIssue: Problems and improvements with respect to performance of generated code.T-compilerRelevant to the compiler team, which will review and decide on the PR/issue.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @steveklabnik@nikic@jrmuizel@hellow554@the8472

        Issue actions

          Iter with step_by(2) performs slowly · Issue #59281 · rust-lang/rust