
Performance regression in mixed f32/i32/u16 code in 1.30 #55413


Closed
pedrocr opened this issue Oct 27, 2018 · 3 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments


pedrocr commented Oct 27, 2018

I tried my rawloader performance regression tests with 1.30:

http://chimper.org/rawloader-rustc-benchmarks/
http://chimper.org/rawloader-rustc-benchmarks/version-1.30.0.html

Besides the several formats that have regressed since 1.25 because of #53340, there is a new regression of ~17% in a single file. It corresponds to a specific section of code that decodes the YUV format to RGB. Here's a minimized version:

// clamp `val` to the range representable in `bits` bits
fn clampbits(val: i32, bits: u32) -> i32 {
  let temp = val >> bits;
  if temp != 0 {
    !temp >> (32-bits)
  } else {
    val
  }
}

fn decode_snef_compressed(buf: &[u8], width: usize, height: usize, coeffs: [f32;3]) -> Vec<u16> {
  let mut out: Vec<u16> = vec![0; width*height*3];

  let inv_wb_r = (1024.0 / coeffs[0]) as i32;
  let inv_wb_b = (1024.0 / coeffs[2]) as i32;

  for (o, i) in out.chunks_mut(6).zip(buf.chunks(6)) {
    let g1: u16 = i[0] as u16;
    let g2: u16 = i[1] as u16;
    let g3: u16 = i[2] as u16;
    let g4: u16 = i[3] as u16;
    let g5: u16 = i[4] as u16;
    let g6: u16 = i[5] as u16;

    let y1  = (g1 | ((g2 & 0x0f) << 8)) as f32;
    let y2  = ((g2 >> 4) | (g3 << 4)) as f32;
    let cb = (g4 | ((g5 & 0x0f) << 8)) as f32 - 2048.0;
    let cr = ((g5 >> 4) | (g6 << 4)) as f32 - 2048.0;

    let r = clampbits((y1 + 1.370705 * cr) as i32, 12) as u16;
    let g = clampbits((y1 - 0.337633 * cb - 0.698001 * cr) as i32, 12) as u16;
    let b = clampbits((y1 + 1.732446 * cb) as i32, 12) as u16;
    // invert the white balance
    o[0] = clampbits((inv_wb_r * r as i32 + (1<<9)) >> 10, 15) as u16;
    o[1] = g;
    o[2] = clampbits((inv_wb_b * b as i32 + (1<<9)) >> 10, 15) as u16;

    let r = clampbits((y2 + 1.370705 * cr) as i32, 12) as u16;
    let g = clampbits((y2 - 0.337633 * cb - 0.698001 * cr) as i32, 12) as u16;
    let b = clampbits((y2 + 1.732446 * cb) as i32, 12) as u16;
    // invert the white balance
    o[3] = clampbits((inv_wb_r * r as i32 + (1<<9)) >> 10, 15) as u16;
    o[4] = g;
    o[5] = clampbits((inv_wb_b * b as i32 + (1<<9)) >> 10, 15) as u16;
  }

  out
}

fn main() {
  let width = 5000;
  let height = 4000;

  let mut buffer: Vec<u8> = vec![0; width*height*3];
  // Make sure we don't get optimized out by writing some data into the buffer
  for (i, val) in buffer.iter_mut().enumerate() {
    *val = i as u8;
  }
  
  for _ in 0..30 {
    decode_snef_compressed(&buffer, width, height, [2.0, 1.0, 1.5]);
  }
}
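
As a side note, `clampbits` can be exercised in isolation (a standalone sketch, not part of the benchmark). In-range and negative inputs behave as expected, but because `>>` on `i32` is an arithmetic shift, a positive out-of-range input comes out as -1 rather than the `bits`-bit maximum; whether that matters for the benchmark's workload is a separate question.

```rust
fn clampbits(val: i32, bits: u32) -> i32 {
  let temp = val >> bits;
  if temp != 0 { !temp >> (32 - bits) } else { val }
}

fn main() {
  // in-range values pass through unchanged
  assert_eq!(clampbits(100, 12), 100);
  assert_eq!(clampbits(4095, 12), 4095);
  // negative values clamp to 0
  assert_eq!(clampbits(-1, 12), 0);
  assert_eq!(clampbits(-5000, 12), 0);
  // positive overflow: the arithmetic shift yields -1, not the 12-bit
  // maximum 4095, so a later `as u16` cast would turn it into 65535
  assert_eq!(clampbits(5000, 12), -1);
  println!("ok");
}
```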

I'm running this on my machine across compiler versions with opt-level=3, taking the median of 5 runs:

snef_compressed : 1.20.0  BASE 6.96
snef_compressed : 1.21.0  OK   6.95 (-0%)
snef_compressed : 1.22.1  OK   6.92 (-0%)
snef_compressed : 1.23.0  OK   6.93 (-0%)
snef_compressed : 1.24.1  OK   6.94 (-0%)
snef_compressed : 1.25.0  OK   7.14 (+2%)
snef_compressed : 1.26.2  OK   7.09 (+1%)
snef_compressed : 1.27.2  OK   7.07 (+1%)
snef_compressed : 1.28.0  OK   7.14 (+2%)
snef_compressed : 1.29.2  OK   7.01 (+0%)
snef_compressed : 1.30.0  FAIL 7.37 (+5%)
snef_compressed : beta    FAIL 7.32 (+5%)
snef_compressed : nightly FAIL 7.27 (+4%)

I'm not reproducing the full 17%, and there already seems to have been a smaller regression since 1.25. The difference in reproduction may be because the original code runs threaded with rayon across 4 threads on only 2 physical cores (hyperthreading), which incurs a higher penalty.
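
For reference, two pieces of the inner loop can be checked in isolation (a standalone sketch with made-up values, not from rawloader): the 12-bit unpacking that pulls two samples out of three bytes, and the fixed-point white-balance step, where 1024 represents 1.0 and the `1<<9` addend rounds to nearest before the `>>10`.

```rust
fn main() {
  // 12-bit unpacking: bytes 0xBC, 0xFA, 0xDE carry the samples 0xABC and
  // 0xDEF, low bits first
  let (g1, g2, g3) = (0xBCu16, 0xFAu16, 0xDEu16);
  let v1 = g1 | ((g2 & 0x0f) << 8);
  let v2 = (g2 >> 4) | (g3 << 4);
  assert_eq!((v1, v2), (0xABC, 0xDEF));

  // fixed-point white balance: a coefficient of 2.0 becomes the multiplier
  // 512 (0.5 in 10-bit fixed point), and 1000 * 0.5 = 500
  let inv_wb_r = (1024.0 / 2.0_f32) as i32;
  assert_eq!(inv_wb_r, 512);
  let r: u16 = 1000;
  assert_eq!((inv_wb_r * r as i32 + (1 << 9)) >> 10, 500);
  println!("ok");
}
```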

memoryruins added the I-slow label Oct 29, 2018

nikic commented Dec 15, 2018

Not seeing any relevant differences in the assembly output between 1.29.0 and 1.30.0: https://godbolt.org/z/B1PJwc


pedrocr commented Dec 15, 2018

Strange, the slowdown seems easy to reproduce and is consistent. This one has clear godbolt differences and is a much bigger regression, if you want to look at that instead:

#53340


pedrocr commented Dec 15, 2018

The bad news is that the same regression still seems to be visible somehow:

snef_compressed : 1.29.2  BASE 7.09
snef_compressed : 1.30.1  FAIL 7.31 (+3%)
snef_compressed : 1.31.0  FAIL 7.44 (+4%)
snef_compressed : beta    FAIL 7.34 (+3%)
snef_compressed : nightly FAIL 7.37 (+3%)

The good news is that I apparently screwed up minimizing the test case. I had wondered why I couldn't reproduce the full 17% regression, but assumed it was still exercising the same problem. As it turns out, 1.31 fixed that regression, and yet this test case still shows the same 3-4%. Here's the run for 1.31:

http://chimper.org/rawloader-rustc-benchmarks/version-1.31.0.html

As you can see in the "D810_Small.NEF" line, it got 14% faster than 1.30 and is now even slightly faster than the original 1.20. The large regression from #53340 is still there, though.

pedrocr closed this as completed Dec 15, 2018