
Performance regression in mixed f32/i32/u16 code in 1.30 #55413


Closed
pedrocr opened this issue Oct 27, 2018 · 3 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments


pedrocr commented Oct 27, 2018

I tried my rawloader performance regression tests with 1.30:

http://chimper.org/rawloader-rustc-benchmarks/
http://chimper.org/rawloader-rustc-benchmarks/version-1.30.0.html

Besides the several formats that have regressed since 1.25 because of #53340, there is a new regression of ~17% in a single file. It corresponds to a specific section of code that decodes the YUV format to RGB. Here's a minimized version:

// clamp `val` to the range representable in `bits` bits
fn clampbits(val: i32, bits: u32) -> i32 {
  let temp = val >> bits;
  if temp != 0 {
    !temp >> (32-bits)
  } else {
    val
  }
}

fn decode_snef_compressed(buf: &[u8], width: usize, height: usize, coeffs: [f32;3]) -> Vec<u16> {
  let mut out: Vec<u16> = vec![0; width*height*3];

  let inv_wb_r = (1024.0 / coeffs[0]) as i32;
  let inv_wb_b = (1024.0 / coeffs[2]) as i32;

  for (o, i) in out.chunks_mut(6).zip(buf.chunks(6)) {
    let g1: u16 = i[0] as u16;
    let g2: u16 = i[1] as u16;
    let g3: u16 = i[2] as u16;
    let g4: u16 = i[3] as u16;
    let g5: u16 = i[4] as u16;
    let g6: u16 = i[5] as u16;

    let y1  = (g1 | ((g2 & 0x0f) << 8)) as f32;
    let y2  = ((g2 >> 4) | (g3 << 4)) as f32;
    let cb = (g4 | ((g5 & 0x0f) << 8)) as f32 - 2048.0;
    let cr = ((g5 >> 4) | (g6 << 4)) as f32 - 2048.0;

    let r = clampbits((y1 + 1.370705 * cr) as i32, 12) as u16;
    let g = clampbits((y1 - 0.337633 * cb - 0.698001 * cr) as i32, 12) as u16;
    let b = clampbits((y1 + 1.732446 * cb) as i32, 12) as u16;
    // invert the white balance
    o[0] = clampbits((inv_wb_r * r as i32 + (1<<9)) >> 10, 15) as u16;
    o[1] = g;
    o[2] = clampbits((inv_wb_b * b as i32 + (1<<9)) >> 10, 15) as u16;

    let r = clampbits((y2 + 1.370705 * cr) as i32, 12) as u16;
    let g = clampbits((y2 - 0.337633 * cb - 0.698001 * cr) as i32, 12) as u16;
    let b = clampbits((y2 + 1.732446 * cb) as i32, 12) as u16;
    // invert the white balance
    o[3] = clampbits((inv_wb_r * r as i32 + (1<<9)) >> 10, 15) as u16;
    o[4] = g;
    o[5] = clampbits((inv_wb_b * b as i32 + (1<<9)) >> 10, 15) as u16;
  }

  out
}

fn main() {
  let width = 5000;
  let height = 4000;

  let mut buffer: Vec<u8> = vec![0; width*height*3];
  // Make sure we don't get optimized out by writing some data into the buffer
  for (i, val) in buffer.iter_mut().enumerate() {
    *val = i as u8;
  }
  
  for _ in 0..30 {
    decode_snef_compressed(&buffer, width, height, [2.0, 1.0, 1.5]);
  }
}
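
As a side note, `clampbits` can be exercised in isolation (a standalone sketch, not part of the benchmark). In-range and negative inputs behave as expected, but because `>>` on `i32` is an arithmetic shift, a positive out-of-range input comes out as -1 rather than the `bits`-bit maximum; whether that matters for the benchmark's workload is a separate question.

```rust
fn clampbits(val: i32, bits: u32) -> i32 {
  let temp = val >> bits;
  if temp != 0 { !temp >> (32 - bits) } else { val }
}

fn main() {
  // in-range values pass through unchanged
  assert_eq!(clampbits(100, 12), 100);
  assert_eq!(clampbits(4095, 12), 4095);
  // negative values clamp to 0
  assert_eq!(clampbits(-1, 12), 0);
  assert_eq!(clampbits(-5000, 12), 0);
  // positive overflow: the arithmetic shift yields -1, not the 12-bit
  // maximum 4095, so a later `as u16` cast would turn it into 65535
  assert_eq!(clampbits(5000, 12), -1);
  println!("ok");
}
```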

I'm running this on my machine across compiler versions with opt-level=3, taking the median of 5 runs:

snef_compressed : 1.20.0  BASE 6.96
snef_compressed : 1.21.0  OK   6.95 (-0%)
snef_compressed : 1.22.1  OK   6.92 (-0%)
snef_compressed : 1.23.0  OK   6.93 (-0%)
snef_compressed : 1.24.1  OK   6.94 (-0%)
snef_compressed : 1.25.0  OK   7.14 (+2%)
snef_compressed : 1.26.2  OK   7.09 (+1%)
snef_compressed : 1.27.2  OK   7.07 (+1%)
snef_compressed : 1.28.0  OK   7.14 (+2%)
snef_compressed : 1.29.2  OK   7.01 (+0%)
snef_compressed : 1.30.0  FAIL 7.37 (+5%)
snef_compressed : beta    FAIL 7.32 (+5%)
snef_compressed : nightly FAIL 7.27 (+4%)

I'm not reproducing the full 17%, and there already seems to have been a smaller regression since 1.25. The difference in reproduction may be because the original code runs threaded with rayon across 4 threads on only 2 physical cores (hyperthreading), which incurs a higher penalty.
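
For reference, two pieces of the inner loop can be checked in isolation (a standalone sketch with made-up values, not from rawloader): the 12-bit unpacking that pulls two samples out of three bytes, and the fixed-point white-balance step, where 1024 represents 1.0 and the `1<<9` addend rounds to nearest before the `>>10`.

```rust
fn main() {
  // 12-bit unpacking: bytes 0xBC, 0xFA, 0xDE carry the samples 0xABC and
  // 0xDEF, low bits first
  let (g1, g2, g3) = (0xBCu16, 0xFAu16, 0xDEu16);
  let v1 = g1 | ((g2 & 0x0f) << 8);
  let v2 = (g2 >> 4) | (g3 << 4);
  assert_eq!((v1, v2), (0xABC, 0xDEF));

  // fixed-point white balance: a coefficient of 2.0 becomes the multiplier
  // 512 (0.5 in 10-bit fixed point), and 1000 * 0.5 = 500
  let inv_wb_r = (1024.0 / 2.0_f32) as i32;
  assert_eq!(inv_wb_r, 512);
  let r: u16 = 1000;
  assert_eq!((inv_wb_r * r as i32 + (1 << 9)) >> 10, 500);
  println!("ok");
}
```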

memoryruins added the I-slow label Oct 29, 2018

nikic commented Dec 15, 2018

Not seeing any relevant differences in the assembly output between 1.29.0 and 1.30.0: https://godbolt.org/z/B1PJwc


pedrocr commented Dec 15, 2018

Strange, the slowdown seems easy to reproduce and is consistent. This one has clear godbolt differences and is a much bigger regression, if you want to look at that instead:

#53340


pedrocr commented Dec 15, 2018

The bad news is that the same regression still seems to be visible somehow:

snef_compressed : 1.29.2  BASE 7.09
snef_compressed : 1.30.1  FAIL 7.31 (+3%)
snef_compressed : 1.31.0  FAIL 7.44 (+4%)
snef_compressed : beta    FAIL 7.34 (+3%)
snef_compressed : nightly FAIL 7.37 (+3%)

The good news is that I apparently screwed up minimizing the test case. I had wondered why I couldn't reproduce the full 17% regression, but assumed it was still exercising the same problem. As it turns out, 1.31 fixed that regression, and yet this test case still shows the same 3-4%. Here's the run for 1.31:

http://chimper.org/rawloader-rustc-benchmarks/version-1.31.0.html

As you can see in the "D810_Small.NEF" line, it got 14% faster than 1.30 and is now even slightly faster than the original 1.20. The large regression from #53340 is still there, though.

pedrocr closed this as completed Dec 15, 2018