Description
I tried my rawloader performance regression tests with 1.30:
http://chimper.org/rawloader-rustc-benchmarks/
http://chimper.org/rawloader-rustc-benchmarks/version-1.30.0.html
Besides the several formats that have regressed since 1.25 because of #53340, there is a new regression of ~17% in a single file. It corresponds to a specific section of code that decodes the YUV format to RGB. Here's a minimized version:
fn clampbits(val: i32, bits: u32) -> i32 {
  // If val doesn't fit in `bits` bits, the high bits survive the shift and
  // temp is nonzero: negative inputs then come out as 0 and positive
  // overflows as -1 (all ones, i.e. 0xffff after the `as u16` in the callers)
  let temp = val >> bits;
  if temp != 0 {
    !temp >> (32 - bits)
  } else {
    val
  }
}
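As a sanity check on the bit trick (my addition, not part of the original report): values that fit in `bits` bits pass through untouched, negatives collapse to 0, and positive overflows come out as -1, i.e. all ones once the callers cast with `as u16`.

#[test]
fn clampbits_boundaries() {
  assert_eq!(clampbits(100, 12), 100); // in range: unchanged
  assert_eq!(clampbits(-5, 12), 0);    // negative: clamped to 0
  assert_eq!(clampbits(5000, 12), -1); // overflow: all ones
  assert_eq!(clampbits(5000, 12) as u16, 0xffff);
}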
fn decode_snef_compressed(buf: &[u8], width: usize, height: usize, coeffs: [f32; 3]) -> Vec<u16> {
  let mut out: Vec<u16> = vec![0; width*height*3];
  let inv_wb_r = (1024.0 / coeffs[0]) as i32;
  let inv_wb_b = (1024.0 / coeffs[2]) as i32;
  // Each 6-byte input group encodes two pixels: two 12-bit Y samples plus a
  // shared 12-bit Cb/Cr pair, expanded into two RGB triplets (6 output u16s)
  for (o, i) in out.chunks_mut(6).zip(buf.chunks(6)) {
    let g1: u16 = i[0] as u16;
    let g2: u16 = i[1] as u16;
    let g3: u16 = i[2] as u16;
    let g4: u16 = i[3] as u16;
    let g5: u16 = i[4] as u16;
    let g6: u16 = i[5] as u16;
    let y1 = (g1 | ((g2 & 0x0f) << 8)) as f32;
    let y2 = ((g2 >> 4) | (g3 << 4)) as f32;
    let cb = (g4 | ((g5 & 0x0f) << 8)) as f32 - 2048.0;
    let cr = ((g5 >> 4) | (g6 << 4)) as f32 - 2048.0;
    let r = clampbits((y1 + 1.370705 * cr) as i32, 12) as u16;
    let g = clampbits((y1 - 0.337633 * cb - 0.698001 * cr) as i32, 12) as u16;
    let b = clampbits((y1 + 1.732446 * cb) as i32, 12) as u16;
    // invert the white balance
    o[0] = clampbits((inv_wb_r * r as i32 + (1<<9)) >> 10, 15) as u16;
    o[1] = g;
    o[2] = clampbits((inv_wb_b * b as i32 + (1<<9)) >> 10, 15) as u16;
    // second pixel of the pair, sharing the same chroma
    let r = clampbits((y2 + 1.370705 * cr) as i32, 12) as u16;
    let g = clampbits((y2 - 0.337633 * cb - 0.698001 * cr) as i32, 12) as u16;
    let b = clampbits((y2 + 1.732446 * cb) as i32, 12) as u16;
    // invert the white balance
    o[3] = clampbits((inv_wb_r * r as i32 + (1<<9)) >> 10, 15) as u16;
    o[4] = g;
    o[5] = clampbits((inv_wb_b * b as i32 + (1<<9)) >> 10, 15) as u16;
  }
  out
}
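To make the packing concrete, here's my reading of the 6-byte layout (an illustrative check I added, with made-up bytes): Y1 takes byte g1 plus the low nibble of g2, Y2 takes the high nibble of g2 plus byte g3, and the shared Cb/Cr pair is packed the same way in g4..g6.

#[test]
fn packing_layout() {
  let i: [u16; 6] = [0xAB, 0xCD, 0xEF, 0x12, 0x34, 0x56];
  let y1 = i[0] | ((i[1] & 0x0f) << 8); // low byte from g1, high nibble from g2
  let y2 = (i[1] >> 4) | (i[2] << 4);   // low nibble from g2, high byte from g3
  let cb = i[3] | ((i[4] & 0x0f) << 8); // before the -2048.0 centering
  let cr = (i[4] >> 4) | (i[5] << 4);   // before the -2048.0 centering
  assert_eq!((y1, y2, cb, cr), (0xDAB, 0xEFC, 0x412, 0x563));
}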
fn main() {
  let width = 5000;
  let height = 4000;
  let mut buffer: Vec<u8> = vec![0; width*height*3];
  // Make sure we don't get optimized out by writing some data into the buffer
  for (i, val) in buffer.chunks_mut(1).enumerate() {
    val[0] = i as u8;
  }
  for _ in 0..30 {
    decode_snef_compressed(&buffer, width, height, [2.0, 1.0, 1.5]);
  }
}
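If you'd rather time just the hot loop in-process instead of timing the whole binary, a minimal sketch (my addition; std::hint::black_box is only available on newer toolchains and keeps the result from being optimized away):

use std::time::Instant;

fn time_decode(buffer: &[u8], width: usize, height: usize) {
  let start = Instant::now();
  for _ in 0..30 {
    std::hint::black_box(decode_snef_compressed(buffer, width, height, [2.0, 1.0, 1.5]));
  }
  println!("30 decodes took {:?}", start.elapsed());
}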
I'm running this on my machine across compiler versions with opt-level=3, taking the median of 5 runs:
snef_compressed : 1.20.0 BASE 6.96
snef_compressed : 1.21.0 OK 6.95 (-0%)
snef_compressed : 1.22.1 OK 6.92 (-0%)
snef_compressed : 1.23.0 OK 6.93 (-0%)
snef_compressed : 1.24.1 OK 6.94 (-0%)
snef_compressed : 1.25.0 OK 7.14 (+2%)
snef_compressed : 1.26.2 OK 7.09 (+1%)
snef_compressed : 1.27.2 OK 7.07 (+1%)
snef_compressed : 1.28.0 OK 7.14 (+2%)
snef_compressed : 1.29.2 OK 7.01 (+0%)
snef_compressed : 1.30.0 FAIL 7.37 (+5%)
snef_compressed : beta FAIL 7.32 (+5%)
snef_compressed : nightly FAIL 7.27 (+4%)
I'm not reproducing the full 17% here, and there already seems to have been a smaller regression since 1.25. The gap may be because the original code runs threaded with rayon across 4 threads on only 2 physical cores (hyperthreading), which amplifies the penalty.
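For context, the original decoder runs this loop through rayon; the shape is roughly the following sketch (my reconstruction, reusing decode_snef_compressed from above, not the actual rawloader code, which writes in place instead of allocating per row):

use rayon::prelude::*;

fn decode_parallel(buf: &[u8], width: usize, height: usize, coeffs: [f32; 3]) -> Vec<u16> {
  let mut out: Vec<u16> = vec![0; width * height * 3];
  // Pair up matching rows of output and input and decode the rows in parallel
  out.par_chunks_mut(width * 3)
    .zip(buf.par_chunks(width * 3))
    .for_each(|(orow, irow)| {
      // Decode one row by treating it as a 1-pixel-high image
      orow.copy_from_slice(&decode_snef_compressed(irow, width, 1, coeffs));
    });
  out
}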