Improve Math.BigMul on x64 by adding new internal Multiply hardware intrinsic to X86Base #115966
Conversation
- TODO: lsra and containment - containment in
// mulEAX always has op1 in EAX
srcCount += BuildOperandUses(op1, SRBM_EAX);
srcCount += BuildOperandUses(op2);
Does it need an if statement checking if (!op2->isContained()), similar to DivRem? If so, what should it do differently? Should it do BuildDelayFreeUses or similar?
@@ -10642,6 +10642,8 @@ void Lowering::ContainCheckHWIntrinsic(GenTreeHWIntrinsic* node)

            case NI_BMI2_MultiplyNoFlags:
            case NI_BMI2_X64_MultiplyNoFlags:
            case NI_X86Base_Multiply:
            case NI_X86Base_X64_Multiply:
I am mostly guessing that the containment check and the operand switching should be the same as for MultiplyNoFlags; the resulting code looks OK, but I am not sure.
@@ -159,16 +159,6 @@ internal static void ThrowNegateTwosCompOverflow()

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static unsafe ulong BigMul(uint a, uint b)
        {
#if false // TARGET_32BIT
            // This generates slower code currently than the simple multiplication
Nothing has changed here and this block/comment still holds; why remove it?
- Is it fine to just add back the link?
I initially called the new Multiply here and it was better than the BMI2 code, and it seems to generate as good code as the built-in JIT support for 32*32=>64-bit multiply, but I decided against adding it in case it produces worse code than the current JIT optimisations.
From my understanding of the MultiplyNoFlags issue, it will not be fixed for that overload, or at least not in the near future.
- Is there a specific reason not to emit mulx for GT_MUL_LONG instead?
It would then apply to a lot more places, including those that use the standard cast+mul pattern (ulong)a*(ulong)b instead, either because it is simpler or due to the bad perf this method had because of MultiplyNoFlags.
If I update Multiply with mulx support I can add that one back here, or have a look at handling it in GT_MUL_LONG instead.
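For reference, a minimal sketch of the cast+mul pattern mentioned above (the helper name is mine, purely illustrative):

```csharp
// The standard widening cast+mul pattern. The JIT can already treat this
// widening multiply specially (the GT_MUL_LONG node discussed above), so a
// mulx-capable GT_MUL_LONG would speed up call sites like this one without
// any API change. Sketch only; the helper name is illustrative.
static ulong WideningMul(uint a, uint b) => (ulong)a * (ulong)b;
```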
@@ -215,13 +202,21 @@ internal static ulong BigMul(uint a, ulong b, out ulong low)

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static unsafe ulong BigMul(ulong a, ulong b, out ulong low)
        {
#if MONO // Multiply is not yet implemented in MONO
This condition should be moved below (and reversed) so that the new intrinsic is used only if BMI2 isn't supported - MULX should be preferred over MUL since it has more flexible register allocation (currently wasted by the unnecessary memory spills though).
- I can reorder the condition, but I would rather not switch to calling the slow BMI2 method.
That would eliminate the performance improvements this PR was intended to deliver. Apart from the 0.86 execution-time ratio in the benchmark, it gives a significant improvement of nearly 2x for other cases such as FastMod, where previous workarounds for slow BigMul can now be removed and BigMul called directly (so #113352 can be "fixed" by just calling BigMul on both arm and x64).
- I forgot to mention that I want feedback on whether it would be an acceptable solution to add mulx support to the new Multiply intrinsic when supported by hardware. (I have a separate local branch, but it is not as well tested, and I want to merge this one first so that all tests are executed against the base x64 code first.)
If the answer is no, I can remove the following comment on the internal API and toss the BMI2 version (it performs on par with mul, so it has similar performance improvements as in this PR):
runtime/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/X86Base.cs
Line 77 in 4ed65b8
/// <para>In the future it might emit mulx on compatible hardware</para>
Otherwise I am happy to push those changes as a separate PR if that is fine. (A rough sketch of how the intrinsic pairs with BigMul follows below.)
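For context, a sketch of how the internal intrinsic pairs with Math.BigMul, modeled on the existing DivRem pattern; the exact signature, tuple element order, and the fallback helper name here are assumptions, not the committed code:

```csharp
// Sketch only: Multiply is internal to CoreLib, and its shape here is
// assumed by analogy with X86Base.X64.DivRem.
public static ulong BigMul(ulong a, ulong b, out ulong low)
{
    if (X86Base.X64.IsSupported)
    {
        // A single MUL instruction: RDX:RAX = a * b
        (ulong lower, ulong upper) = X86Base.X64.Multiply(a, b);
        low = lower;
        return upper;
    }

    return SoftwareFallback(a, b, out low); // portable path, name assumed
}
```

Callers that only need the high half (multiply-high, as in the FastMod-style workarounds mentioned above) can then simply write Math.BigMul(a, b, out _).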
// mulEAX always has op1 in EAX
srcCount += BuildOperandUses(op1, SRBM_EAX);
srcCount += BuildOperandUses(op2);
I see there are a number of APX-related changes to these files since the code was written. I will merge/rebase and try to update it once I have feedback on whether this solution is in the right direction (or whether it would be better to handle BigMul itself directly).
Will probably replace this with:
BuildOperandUses(op2, BuildApxIncompatibleGPRMask(op2))
The biggest improvements are for signed long and for platforms without BMI2.
A nice side effect is that the ready-to-run code can now emit a simple mul instead of having to fall back to the 32-bit code.
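The 32-bit fallback referred to here is the classic decomposition into 32-bit partial products; a sketch of that well-known scheme (not the exact CoreLib code):

```csharp
// Portable 64x64 -> 128-bit multiply via four 32-bit partial products.
static ulong BigMulFallback(ulong a, ulong b, out ulong low)
{
    ulong aLo = (uint)a, aHi = a >> 32;
    ulong bLo = (uint)b, bHi = b >> 32;

    ulong ll = aLo * bLo;
    ulong lh = aLo * bHi;
    ulong hl = aHi * bLo;
    ulong hh = aHi * bHi;

    // Fold the overlapping middle terms, carrying into the high half.
    ulong mid = (ll >> 32) + (uint)lh + (uint)hl;

    low = (mid << 32) | (uint)ll;
    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
```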
This pull request introduces internal Multiply hardware intrinsics (NI_X86Base_Multiply and NI_X86Base_X64_Multiply) for x86 and x64 architectures in the JIT compiler and calls them from Math.BigMul.
This improves the machine code for signed BigMul, which should fix #75594, based on the API shape suggested in #58263.
It can also help with implementing IntPtr.BigMul (#114731).
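As context for the signed case: signed BigMul is conventionally built on the unsigned multiply plus a two's-complement correction, which is what a single signed multiply instruction replaces; a sketch of that standard identity (not the exact CoreLib code):

```csharp
// Signed 64x64 -> 128-bit multiply built on the unsigned one.
static long BigMulSigned(long a, long b, out long low)
{
    ulong high = Math.BigMul((ulong)a, (ulong)b, out ulong ulow);
    low = (long)ulow;

    // If a < 0 then (ulong)a == a + 2^64, which adds b * 2^64 to the
    // product; subtract it back out of the high half (same for b).
    return (long)high - ((a >> 63) & b) - ((b >> 63) & a);
}
```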
NOTES:
- The code is heavily based on the DivRem code introduced in "Implement DivRem intrinsic for X86" #66551 (I went through the current version of all the files touched and tried to add similar code for multiply).
- I did not do Mono, so I used conditional compilation to exclude it from Mono (since it does not seem as straightforward and I do not know how to test the various combinations). Also, it seems like Mono might already have special cases for BigMul.
- I have not touched the JIT compiler before, so while the code executes and seems to work fine, I might have missed something.
- Since it uses tuples, it has some of the downsides of DivRem (especially on Windows), such as extra temp variables and stack spills, so there might be a few scenarios where performance is slightly worse or the same. (There was some discussion in "Consume DivRem intrinsics from Math.DivRem" #82194.)
- There might be other, better solutions, including handling Math.BigMul as an intrinsic in itself (or changing it to a pair of MUL/HI_MUL nodes with some extra logic), but that would probably be too many new changes to the JIT for me to take on.
Examples of generated code
Produces the following (code from just before the rebase):
Current code according to godbolt
Further code samples with array access
Benchmarks
The full benchmark code is found here.
The benchmarks are based on a benchmark suggested for MultiplyNoFlags and do the following:
Note: The x64_bigmul toolchain variant is the code in #115182, which will be closed in favor of this PR.
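The linked benchmark code is not reproduced here; a minimal BenchmarkDotNet sketch of the same shape (a dependent BigMul loop; names and constants are mine) would be:

```csharp
using System;
using BenchmarkDotNet.Attributes;

public class BigMulBenchmarks
{
    private ulong _a = 0x9E3779B97F4A7C15;
    private ulong _b = 0xD1B54A32D192ED03;

    [Benchmark]
    public ulong UnsignedBigMul()
    {
        ulong acc = 0;
        for (int i = 0; i < 1000; i++)
        {
            // Dependent chain so multiply latency is actually measured.
            acc += Math.BigMul(_a ^ acc, _b, out ulong low);
            acc ^= low;
        }
        return acc;
    }
}
```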
Results for Math.BigMul
Hardware without BMI2, "~10 times faster"
Additional benchmarks results
Additional results can be found under https://github.com/Daniel-Svensson/ClrExperiments/tree/7acd61943336356fa363763914a5b963de962065/ClrDecimal/Benchmarks/BenchmarkDotNet.Artifacts/results . I mostly checked that there were no significant regressions to decimal performance, since Math.BigMul has several usages there. There were a few minor improvements, mostly in the composite "InterestBenchmarks", which contains a mix of operations similar to interest calculations.
Copilot Summary

JIT Compiler Enhancements
- Introduced the new Multiply intrinsics in the JIT compiler, including updates to ContainCheckHWIntrinsic, BuildHWIntrinsic, and impSpecialIntrinsic to handle the new instructions and their constraints (src/coreclr/jit/lowerxarch.cpp, src/coreclr/jit/lsraxarch.cpp, src/coreclr/jit/hwintrinsicxarch.cpp). [1] [2] [3]
- Extended HWIntrinsicInfo and GenTreeHWIntrinsic to include the Multiply intrinsics and their associated properties (src/coreclr/jit/hwintrinsic.h, src/coreclr/jit/gentree.cpp). [1] [2]
- Updated hwintrinsiclistxarch.h to define the Multiply intrinsics and their characteristics, such as instruction mapping and flags (src/coreclr/jit/hwintrinsiclistxarch.h). [1] [2]

Runtime Library Updates
- Added X86Base.Multiply methods for both signed and unsigned multiplication in the runtime intrinsics API, providing platform-specific implementations or fallback behavior (src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/X86Base.cs, src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/X86Base.PlatformNotSupported.cs). [1] [2]
- Updated the Math class to use the new Multiply intrinsics for optimized BigMul operations, improving performance on supported platforms (src/libraries/System.Private.CoreLib/src/System/Math.cs). [1] [2]

Code Cleanup
- Removed a now-unneeded fallback path from the Math class (src/libraries/System.Private.CoreLib/src/System/Math.cs). [1] [2]

These changes collectively enhance the performance and capabilities of multiplication operations in .NET, leveraging hardware acceleration where available.