
Improve Math.BigMul on x64 by adding new internal Multiply hardware intrinsic to X86Base #115966


Open · wants to merge 12 commits into main

Conversation

Daniel-Svensson
Contributor

@Daniel-Svensson Daniel-Svensson commented May 24, 2025

The biggest improvements are for signed long and for platforms without BMI2.
A nice side effect is that ReadyToRun code can now emit a simple mul instead of having to fall back to the 32-bit code.

This pull request introduces internal Multiply hardware intrinsics (NI_X86Base_Multiply and NI_X86Base_X64_Multiply) for the x86 and x64 architectures in the JIT compiler and calls them from Math.BigMul.

This improves the machine code for signed BigMul, which should fix #75594 based on the API shape suggested in #58263.
It can also help with implementing IntPtr.BigMul (#114731).

NOTES:

  • The code is heavily based on the DivRem code introduced in "Implement DivRem intrinsic for X86" #66551 (I went through the current version of all the files touched there and tried to add similar code for multiply).

  • I did not implement Mono support; I used conditional compilation to exclude the intrinsic from Mono (it does not seem as straightforward, and I do not know how to test the various combinations). Also, Mono seems to already have special cases for BigMul.

  • I have not touched the JIT compiler before, so while the code executes and seems to work fine, I might have missed something.

  • Since it uses tuples, it has some of the downsides of DivRem (especially on Windows), where extra temp variables and stack spills appear, so there might be a few scenarios where performance is slightly worse or the same. (There was some discussion in "Consume DivRem intrinsics from Math.DivRem" #82194.)

  • There might be other, better solutions, including handling Math.BigMul as an intrinsic in itself (or changing it to a pair of MUL/HI_MUL nodes with some extra logic), but that would probably be too many new JIT changes for me to take on.

Examples of generated code

[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
static void TestBigMul2(ref ulong x, ref ulong y)
{
    x = Math.BigMul(x, y, out y);
}

[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
static void TestBigMul1(ref long x, ref long y)
{
    x = Math.BigMul(x, y, out y);
}

Produces the following (code from just before rebase)

; Method Program:<<Main>$>g__TestBigMul2|0_1(byref,byref) (FullOpts)
G_M19919_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M19919_IG02:  ;; offset=0x0000
       mov      rax, qword ptr [rcx]
       mov      bword ptr [rsp+0x10], rdx
       mul      rdx:rax, qword ptr [rdx]
       mov      r8, bword ptr [rsp+0x10]
       mov      qword ptr [r8], rax
       mov      qword ptr [rcx], rdx
						;; size=22 bbWeight=1 PerfScore 12.00

G_M19919_IG03:  ;; offset=0x0016
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 23


; Method Program:<<Main>$>g__TestBigMul1|0_2(byref,byref) (FullOpts)
G_M20175_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M20175_IG02:  ;; offset=0x0000
       mov      rax, qword ptr [rcx]
       mov      bword ptr [rsp+0x10], rdx
       imul     rdx:rax, qword ptr [rdx]
       mov      r8, bword ptr [rsp+0x10]
       mov      qword ptr [r8], rax
       mov      qword ptr [rcx], rdx
						;; size=22 bbWeight=1 PerfScore 12.00

G_M20175_IG03:  ;; offset=0x0016
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 23

Current code on main, according to Godbolt

Further code samples with array access
static long TestBigMulArr2(long[] x, ref long y)
{
    return Math.BigMul(y, x[1], out y);
}

static void TestBigMulArr12(long[] x, ref long y)
{
    x[1] = Math.BigMul(y, x[1], out y);
}
; Method Program:<<Main>$>g__TestBigMulArr2|0_5(long[],byref):long (FullOpts)
G_M60604_IG01:  ;; offset=0x0000
       sub      rsp, 40
						;; size=4 bbWeight=1 PerfScore 0.25

G_M60604_IG02:  ;; offset=0x0004
       mov      bword ptr [rsp+0x38], rdx
       mov      rax, qword ptr [rdx]
       cmp      dword ptr [rcx+0x08], 1
       jbe      SHORT G_M60604_IG04
       imul     rdx:rax, qword ptr [rcx+0x18]
       mov      rcx, bword ptr [rsp+0x38]
       mov      qword ptr [rcx], rax
       mov      rax, rdx
						;; size=29 bbWeight=1 PerfScore 15.25

G_M60604_IG03:  ;; offset=0x0021
       add      rsp, 40
       ret      
						;; size=5 bbWeight=1 PerfScore 1.25

; Method Program:<<Main>$>g__TestBigMulArr12|0_6(long[],byref) (FullOpts)
G_M53374_IG01:  ;; offset=0x0000
       sub      rsp, 40
						;; size=4 bbWeight=1 PerfScore 0.25

G_M53374_IG02:  ;; offset=0x0004
       mov      bword ptr [rsp+0x38], rdx
       mov      rax, qword ptr [rdx]
       mov      r8d, dword ptr [rcx+0x08]
       cmp      r8d, 1
       jbe      SHORT G_M53374_IG04
       imul     rdx:rax, qword ptr [rcx+0x18]
       mov      r8, bword ptr [rsp+0x38]
       mov      qword ptr [r8], rax
       mov      qword ptr [rcx+0x18], rdx
						;; size=34 bbWeight=1 PerfScore 15.25

G_M53374_IG03:  ;; offset=0x0026
       add      rsp, 40
       ret      
						;; size=5 bbWeight=1 PerfScore 1.25

G_M53374_IG04:  ;; offset=0x002B
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
						;; size=6 bbWeight=0 PerfScore 0.00
; Total bytes of code: 49

Benchmarks

The full benchmark code is found here.

The benchmarks are based on a benchmark suggested for MultiplyNoFlags; the core of it does the following:

        [Benchmark]
        public ulong BenchBigMulUnsigned()
        {
            ulong accLo = TestA;
            ulong accHi = TestB;
            MathBigMulAcc(accLo, accHi, ref accHi, ref accLo);
            MathBigMulAcc(accLo, accHi, ref accHi, ref accLo);
            MathBigMulAcc(accLo, accHi, ref accHi, ref accLo);
            MathBigMulAcc(accLo, accHi, ref accHi, ref accLo);
            return accLo + accHi;
        }
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private unsafe void MathBigMulAcc(ulong a, ulong b, ref ulong accHi, ref ulong accLo)
        {
            ulong lo;
            ulong hi = Math.BigMul(a, b, out lo);
            accHi += hi;
            accLo += lo;
        }

Note: The x64_bigmul toolchain variant is the code in #115182, which will be closed in favor of this PR.

Results for Math.BigMul
BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.4061)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 10.0.100-preview.3.25201.16
  [Host]     : .NET 10.0.0 (10.0.25.17105), X64 RyuJIT AVX2
  Job-SZTYJW : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-LFLJNQ : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-IIXCDT : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2

| Method | Job | Toolchain | TestA | TestB | Mean | Error | StdDev | Ratio |
|---|---|---|---|---|---|---|---|---|
| BenchBigMulUnsigned | Job-SZTYJW | \net10.0-windows-Release-x64_MathInstrinct\shared\Microsoft.NETCore.App\10.0.0\corerun.exe | 81985529216486895 | 16045690984833335023 | 1.277 ns | 0.0168 ns | 0.0157 ns | 0.86 |
| BenchBigMulUnsigned | Job-LFLJNQ | \net10.0-windows-Release-x64_bigmul\shared\Microsoft.NETCore.App\10.0.0\corerun.exe | 81985529216486895 | 16045690984833335023 | 1.397 ns | 0.0030 ns | 0.0028 ns | 0.94 |
| BenchBigMulUnsigned | Job-IIXCDT | \net10.0-windows-Release-x64_main\shared\Microsoft.NETCore.App\10.0.0\corerun.exe | 81985529216486895 | 16045690984833335023 | 1.490 ns | 0.0060 ns | 0.0057 ns | 1.00 |
| BenchBigMulSigned | Job-SZTYJW | \net10.0-windows-Release-x64_MathInstrinct\shared\Microsoft.NETCore.App\10.0.0\corerun.exe | 81985529216486895 | 16045690984833335023 | 1.259 ns | 0.0023 ns | 0.0020 ns | 0.43 |
| BenchBigMulSigned | Job-LFLJNQ | \net10.0-windows-Release-x64_bigmul\shared\Microsoft.NETCore.App\10.0.0\corerun.exe | 81985529216486895 | 16045690984833335023 | 3.382 ns | 0.0037 ns | 0.0033 ns | 1.15 |
| BenchBigMulSigned | Job-IIXCDT | \net10.0-windows-Release-x64_main\shared\Microsoft.NETCore.App\10.0.0\corerun.exe | 81985529216486895 | 16045690984833335023 | 2.943 ns | 0.0040 ns | 0.0033 ns | 1.00 |
| BenchMultiplyNoFlags3Ards | Job-SZTYJW | \net10.0-windows-Release-x64_MathInstrinct\shared\Microsoft.NETCore.App\10.0.0\corerun.exe | 81985529216486895 | 16045690984833335023 | 3.284 ns | 0.0049 ns | 0.0045 ns | 1.00 |
| BenchMultiplyNoFlags3Ards | Job-LFLJNQ | \net10.0-windows-Release-x64_bigmul\shared\Microsoft.NETCore.App\10.0.0\corerun.exe | 81985529216486895 | 16045690984833335023 | 3.282 ns | 0.0059 ns | 0.0055 ns | 1.00 |
| BenchMultiplyNoFlags3Ards | Job-IIXCDT | \net10.0-windows-Release-x64_main\shared\Microsoft.NETCore.App\10.0.0\corerun.exe | 81985529216486895 | 16045690984833335023 | 3.292 ns | 0.0141 ns | 0.0118 ns | 1.00 |
Hardware without BMI2, "~10 times faster"
BenchmarkDotNet v0.13.12, Windows 11 (10.0.26100.3915)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 10.0.100-preview.3.25201.16
  [Host]     : .NET 10.0.0 (10.0.25.17105), X64 RyuJIT AVX2
  Job-JCYSGS : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-YXAHPB : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-SJSTKO : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2

EnvironmentVariables=DOTNET_EnableBMI2=0  
| Method | Job | Toolchain | TestA | TestB | Mean | Error | StdDev | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|---|
| BenchBigMulUnsigned | Job-JCYSGS | \net10.0-windows-Release-x64_MathInstrinct\shared\Microsoft.NETCore.App\10.0.0\corerun.exe | 81985529216486895 | 16045690984833335023 | 1.283 ns | 0.0104 ns | 0.0092 ns | 0.10 | 0.00 |
| BenchBigMulUnsigned | Job-YXAHPB | \net10.0-windows-Release-x64_bigmul\shared\Microsoft.NETCore.App\10.0.0\corerun.exe | 81985529216486895 | 16045690984833335023 | 12.231 ns | 0.0136 ns | 0.0121 ns | 1.00 | 0.00 |
| BenchBigMulUnsigned | Job-SJSTKO | \net10.0-windows-Release-x64_main\shared\Microsoft.NETCore.App\10.0.0\corerun.exe | 81985529216486895 | 16045690984833335023 | 12.256 ns | 0.0133 ns | 0.0118 ns | 1.00 | 0.00 |
| BenchBigMulSigned | Job-JCYSGS | \net10.0-windows-Release-x64_MathInstrinct\shared\Microsoft.NETCore.App\10.0.0\corerun.exe | 81985529216486895 | 16045690984833335023 | 1.275 ns | 0.0051 ns | 0.0048 ns | 0.12 | 0.00 |
| BenchBigMulSigned | Job-YXAHPB | \net10.0-windows-Release-x64_bigmul\shared\Microsoft.NETCore.App\10.0.0\corerun.exe | 81985529216486895 | 16045690984833335023 | 10.753 ns | 0.0607 ns | 0.0538 ns | 1.00 | 0.01 |
| BenchBigMulSigned | Job-SJSTKO | \net10.0-windows-Release-x64_main\shared\Microsoft.NETCore.App\10.0.0\corerun.exe | 81985529216486895 | 16045690984833335023 | 10.783 ns | 0.0743 ns | 0.0620 ns | 1.00 | 0.00 |

Additional benchmarks results

Additional results can be found under https://github.com/Daniel-Svensson/ClrExperiments/tree/7acd61943336356fa363763914a5b963de962065/ClrDecimal/Benchmarks/BenchmarkDotNet.Artifacts/results . I mostly checked that there were no significant regressions in decimal performance, since Math.BigMul has several usages there. There were a few minor improvements, mostly in the composite "InterestBenchmarks", which contains a mix of operations similar to interest calculations.

Copilot Summary

JIT Compiler Enhancements

  • Added support for Multiply intrinsics in the JIT compiler, including updates to ContainCheckHWIntrinsic, BuildHWIntrinsic, and impSpecialIntrinsic to handle the new instructions and their constraints (src/coreclr/jit/lowerxarch.cpp, src/coreclr/jit/lsraxarch.cpp, src/coreclr/jit/hwintrinsicxarch.cpp). [1] [2] [3]
  • Updated HWIntrinsicInfo and GenTreeHWIntrinsic to include the Multiply intrinsics and their associated properties (src/coreclr/jit/hwintrinsic.h, src/coreclr/jit/gentree.cpp). [1] [2]
  • Extended hwintrinsiclistxarch.h to define the Multiply intrinsics and their characteristics, such as instruction mapping and flags (src/coreclr/jit/hwintrinsiclistxarch.h). [1] [2]

Runtime Library Updates

  • Introduced X86Base.Multiply methods for both signed and unsigned multiplication in the runtime intrinsics API, providing platform-specific implementations or fallback behavior (src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/X86Base.cs, src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/X86Base.PlatformNotSupported.cs). [1] [2]
  • Updated the Math class to use the new Multiply intrinsics for optimized BigMul operations, improving performance on supported platforms (src/libraries/System.Private.CoreLib/src/System/Math.cs). [1] [2]

Code Cleanup

  • Removed outdated and unused code paths related to older multiplication implementations in the Math class (src/libraries/System.Private.CoreLib/src/System/Math.cs). [1] [2]

These changes collectively enhance the performance and capabilities of multiplication operations in .NET, leveraging hardware acceleration where available.

@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 24, 2025
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label May 24, 2025
// mulEAX always have op1 in EAX
srcCount += BuildOperandUses(op1, SRBM_EAX);
srcCount += BuildOperandUses(op2);

Contributor Author

Does it need an if statement checking if (!op2->isContained()), similar to DivRem?
If so, what should it do differently? Should it call BuildDelayFreeUses or similar?

@@ -10642,6 +10642,8 @@ void Lowering::ContainCheckHWIntrinsic(GenTreeHWIntrinsic* node)

case NI_BMI2_MultiplyNoFlags:
case NI_BMI2_X64_MultiplyNoFlags:
case NI_X86Base_Multiply:
case NI_X86Base_X64_Multiply:
Contributor Author

I am mostly guessing that the "containment" check and the switching of operands should be the same; the resulting code looks OK, but I am not sure.

@@ -159,16 +159,6 @@ internal static void ThrowNegateTwosCompOverflow()
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe ulong BigMul(uint a, uint b)
{
#if false // TARGET_32BIT
// This generates slower code currently than the simple multiplication
Contributor

Nothing has changed here and this block/comment still holds, why remove it?

@@ -215,13 +202,21 @@ internal static ulong BigMul(uint a, ulong b, out ulong low)
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe ulong BigMul(ulong a, ulong b, out ulong low)
{
#if MONO // Multiply is not yet implemented in MONO
Contributor

This condition should be moved below (and reversed) so that the new intrinsic is used only if BMI2 isn't supported - MULX should be preferred over MUL since it has more flexible register allocation (currently wasted by the unnecessary memory spills though).


Successfully merging this pull request may close these issues.

Suboptimal x64 codegen for signed Math.BigMul
2 participants