[Perf] Linux/x64: Regressions in SIMD.ConsoleMandel and System.Numerics.Tests.Perf_BigInteger #105329

Closed
performanceautofiler bot opened this issue Jul 23, 2024 · 8 comments
Assignees
Labels
arch-x64, area-CodeGen-coreclr, os-linux, Priority:2, runtime-coreclr, tenet-performance-benchmarks
Milestone

Comments

@performanceautofiler

Run Information

Name Value
Architecture x64
OS ubuntu 22.04
Queue ViperUbuntu
Baseline 19f03850cafa68cf396ecadf86e19df714b0a280
Compare 223249fa87a5f84cc67e83699f64ca80180a1862
Diff Diff
Configs CompilationMode:tiered, RunKind:micro

Regressions in SIMD.ConsoleMandel

Benchmark Baseline Test Test/Base Test Quality Edge Detector Baseline IR Compare IR IR Ratio
649.42 ms 693.68 ms 1.07 0.00 True
656.66 ms 692.76 ms 1.05 0.00 True


Repro

General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net8.0 --filter 'SIMD.ConsoleMandel*'

SIMD.ConsoleMandel.ScalarDoubleSinglethreadADT

SIMD.ConsoleMandel.ScalarFloatSinglethreadADT

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository


Regressions in System.Collections.Tests.Perf_PriorityQueue<Int32, Int32>

Benchmark Baseline Test Test/Base Test Quality Edge Detector Baseline IR Compare IR IR Ratio
16.57 μs 18.02 μs 1.09 0.08 False


Repro

General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net8.0 --filter 'System.Collections.Tests.Perf_PriorityQueue<Int32, Int32>*'

System.Collections.Tests.Perf_PriorityQueue<Int32, Int32>.HeapSort(Size: 1000)

Regressions in System.Numerics.Tests.Perf_BigInteger

Benchmark Baseline Test Test/Base Test Quality Edge Detector Baseline IR Compare IR IR Ratio
4.37 μs 4.68 μs 1.07 0.00 True


Repro

General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net8.0 --filter 'System.Numerics.Tests.Perf_BigInteger*'

System.Numerics.Tests.Perf_BigInteger.GreatestCommonDivisor(arguments: 1024,1024 bits)

@performanceautofiler performanceautofiler bot added arch-x64 os-linux Linux OS (any supported distro) runtime-coreclr specific to the CoreCLR runtime untriaged New issue has not been triaged by the area owner labels Jul 23, 2024
@LoopedBard3 LoopedBard3 transferred this issue from dotnet/perf-autofiling-issues Jul 23, 2024
@ghost ghost added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jul 23, 2024
@LoopedBard3
Member

LoopedBard3 commented Jul 23, 2024

Potentially caused by: #104752, improvements listed in PR. @AndyAyersMS

@LoopedBard3 LoopedBard3 changed the title from "[Perf] Linux/x64: 4 Regressions on 7/17/2024 10:10:17 PM" to "[Perf] Linux/x64: Regressions in SIMD.ConsoleMandel and System.Numerics.Tests.Perf_BigInteger" Jul 23, 2024
@LoopedBard3
Member

LoopedBard3 commented Jul 23, 2024

Related Regressions:
Windows/x64: dotnet/perf-autofiling-issues#38719
Linux/x64: dotnet/perf-autofiling-issues#38695

@AndyAyersMS AndyAyersMS self-assigned this Jul 23, 2024
@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Jul 23, 2024
@AndyAyersMS AndyAyersMS added this to the 9.0.0 milestone Jul 23, 2024
@AndyAyersMS
Member

Will see if this is fixable in 9.0...

@jeffschwMSFT jeffschwMSFT added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 24, 2024
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS AndyAyersMS added Priority:2 Work that is important, but not critical for the release tenet-performance-benchmarks Issue from performance benchmark and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jul 25, 2024
@AndyAyersMS
Member

These all seem to be viper (zen4) specific, and linux specific as well. Since other arch/os combinations are ok, I'm going to defer this one.

@AndyAyersMS AndyAyersMS modified the milestones: 9.0.0, 10.0.0 Aug 7, 2024
@AndyAyersMS
Member

Related regressions are resolved already. But three of the regressions listed here persist:

Image

Image

Image

and one is largely resolved (very modal benchmark)

Image

@AndyAyersMS
Member

AndyAyersMS commented Apr 14, 2025

As noted earlier, the GreatestCommonDivisor regression is just on Zen4 Linux (light blue below)

Image

But the ConsoleMandel is more widespread:

Image

@AndyAyersMS
Member

On Intel Linux, for .NET 8, the inner loop is

G_M61269_IG05:        ; offs=0x000084, size=0x0033, bbWeight=1109935.07, PerfScore 36905341.03, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, BB05 [0003], byref, isz

IN001b: 000084 vmulsd   xmm9, xmm6, xmm6
IN001c: 000088 vmulsd   xmm10, xmm7, xmm7
IN001d: 00008C vsubsd   xmm9, xmm9, xmm10
IN001e: 000091 vaddsd   xmm6, xmm6, xmm6
IN001f: 000095 vmulsd   xmm7, xmm6, xmm7
IN0020: 000099 vaddsd   xmm6, xmm9, xmm3
IN0021: 00009D vaddsd   xmm7, xmm7, xmm5
IN0022: 0000A1 inc      ecx
IN0023: 0000A3 vmulsd   xmm9, xmm6, xmm6
IN0024: 0000A7 vmulsd   xmm10, xmm7, xmm7
IN0025: 0000AB vaddsd   xmm9, xmm9, xmm10
IN0026: 0000B0 vucomisd xmm8, xmm9
IN0027: 0000B5 jbe      SHORT G_M61269_IG07

G_M61269_IG06:        ; offs=0x0000B7, size=0x0008, bbWeight=1106928.83, PerfScore 1383661.04, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, loop=IG05, BB06 [0004], byref, isz

IN0028: 0000B7 cmp      ecx, 0x3E8
IN0029: 0000BD jl       SHORT G_M61269_IG05

whereas later on it becomes

G_M61269_IG10:        ; offs=0x0000E0, size=0x003C, bbWeight=60645.02, PerfScore 2137736.83, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, BB04 [0003], byref, isz

IN0034: 0000E0 vmulsd   xmm7, xmm3, xmm3
IN0035: 0000E4 vmulsd   xmm8, xmm5, xmm5
IN0036: 0000E8 vsubsd   xmm7, xmm7, xmm8
IN0037: 0000ED vaddsd   xmm3, xmm3, xmm3
IN0038: 0000F1 vmulsd   xmm5, xmm3, xmm5
IN0039: 0000F5 vmovsd   qword ptr [V14 rbp-0x48], xmm1
IN003a: 0000FA vaddsd   xmm3, xmm7, xmm1
IN003b: 0000FE vmovsd   qword ptr [V12 rbp-0x40], xmm4
IN003c: 000103 vaddsd   xmm5, xmm5, xmm4
IN003d: 000107 inc      ecx
IN003e: 000109 vmulsd   xmm7, xmm3, xmm3
IN003f: 00010D vmulsd   xmm8, xmm5, xmm5
IN0040: 000111 vaddsd   xmm7, xmm7, xmm8
IN0041: 000116 vucomisd xmm6, xmm7
IN0042: 00011A jbe      SHORT G_M61269_IG08

G_M61269_IG11:        ; offs=0x00011C, size=0x000C, bbWeight=60240.69, PerfScore 75300.87, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, BB05 [0004], byref

IN0043: 00011C cmp      ecx, 0x3E8
IN0044: 000122 jge      G_M61269_IG08

G_M61269_IG12:        ; offs=0x000128, size=0x000C, bbWeight=59932.78, PerfScore 479462.26, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, loop=IG10, BB16 [0033], byref, isz

IN0045: 000128 vmovsd   xmm1, qword ptr [V14 rbp-0x48]
IN0046: 00012D vmovsd   xmm4, qword ptr [V12 rbp-0x40]
IN0047: 000132 jmp      SHORT G_M61269_IG10

So there are two xmm spill/reloads in the inner loop now.
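
For anyone reading the listings cold: both versions compute the same scalar Mandelbrot escape loop, and only the register assignment differs. Below is a rough Python transcription of the instruction sequence, as a sketch only. The cap of 1000 comes from the `cmp ecx, 0x3E8`; the starting point z = 0 and the limit of 4.0 are assumptions, since neither is visible in the excerpt, and the function name is made up.

```python
def escape_iterations(cx, cy, limit=4.0, max_iters=1000):
    """Count iterations of z = z*z + c until |z|^2 reaches limit or the cap is hit."""
    x, y = 0.0, 0.0              # assumed starting point, not shown in the listing
    iters = 0
    while True:
        t = x * x - y * y        # vmulsd, vmulsd, vsubsd
        x2 = x + x               # vaddsd (2x)
        y = x2 * y + cy          # vmulsd, then vaddsd with the imaginary part of c
        x = t + cx               # vaddsd with the real part of c
        iters += 1               # inc ecx
        if limit <= x * x + y * y:   # vucomisd + jbe: leave once |z|^2 >= limit
            break
        if iters >= max_iters:       # cmp ecx, 0x3E8 + jl / jge
            break
    return iters


print(escape_iterations(0.0, 0.0))   # a point that never escapes runs to the cap
print(escape_iterations(2.0, 2.0))   # a point far outside the set escapes at once
```

In the second listing the two values being stored and reloaded (xmm1 and xmm4) are the ones added in the `vaddsd xmm3, xmm7, xmm1` and `vaddsd xmm5, xmm5, xmm4` steps, i.e. `cx` and `cy` in this sketch, which do not change inside the loop.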

The root cause is slightly more aggressive copy propagation, likely the result of phi refinement:

VN based copy assertion for [000258] V36 $241 by [000303] V14 $241.
N001 (  1,  2) [000258] -----+-----                         *  LCL_VAR   double V36 tmp17        u:2 $241
copy propagated to:
N001 (  1,  2) [000258] -----+-----                         *  LCL_VAR   double V14 loc8         u:3 $241

and this likely creates a conflict that LSRA is unable to resolve without a spill.

Not clear there is any good fix here. Seems like the previous behavior where there were temps in the loops gave LSRA natural split points that are now gone.
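
To make the live-range point concrete, here is a toy model. It is not RyuJIT's IR or LSRA, just straight-line (dest, sources) tuples with live ranges measured as first and last instruction index. V14 and V36 are the locals from the dump above; every other name and the instruction list itself are invented for illustration.

```python
def live_ranges(instrs):
    """Map each variable to (first index, last index) where it is defined or used."""
    ranges = {}
    for i, (dest, srcs) in enumerate(instrs):
        for var in (dest, *srcs):
            first, _ = ranges.get(var, (i, i))
            ranges[var] = (first, i)
    return ranges


def copy_propagate(instrs, copy_of):
    """Replace uses of each copy with its source and drop the now-dead copy."""
    out = []
    for dest, srcs in instrs:
        if dest in copy_of and srcs == (copy_of[dest],):
            continue  # the copy itself is no longer needed
        out.append((dest, tuple(copy_of.get(s, s) for s in srcs)))
    return out


before = [
    ("V14", ()),            # 0: V14 = ...
    ("V36", ("V14",)),      # 1: V36 = V14   (the temp)
    ("t0", ()),             # 2: unrelated work
    ("t1", ("t0",)),        # 3: unrelated work
    ("x", ("t1", "V36")),   # 4: the value is finally consumed via the temp
]
after = copy_propagate(before, {"V36": "V14"})

print(live_ranges(before)["V14"], live_ranges(before)["V36"])
print(live_ranges(after)["V14"])
```

With the temp in place, V14 is only live across instructions 0-1 and V36 carries the value from there, so an allocator has a copy at which it could switch registers. After copy propagation the same value is one V14 interval spanning the whole sequence, with no such point left.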

The regressions are all fairly small so I am just going to close this.

@github-actions github-actions bot locked and limited conversation to collaborators May 15, 2025