-
Notifications
You must be signed in to change notification settings - Fork 5k
Improve Math.BigMul on x64 by adding new internal Multiply
hardware intrinsic to X86Base
#115966
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- TODO: lsra and containment - containment in
// mulEAX always have op1 in EAX | ||
srcCount += BuildOperandUses(op1, SRBM_EAX); | ||
srcCount += BuildOperandUses(op2); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it need to add an if statement checking if (!op2->isContained())
similar to DivRem ?
If so what should it do different? Should it do BuildDelayFreeUses or similar
@@ -10642,6 +10642,8 @@ void Lowering::ContainCheckHWIntrinsic(GenTreeHWIntrinsic* node) | |||
|
|||
case NI_BMI2_MultiplyNoFlags: | |||
case NI_BMI2_X64_MultiplyNoFlags: | |||
case NI_X86Base_Multiply: | |||
case NI_X86Base_X64_Multiply: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am mostly guessing that the "containment" check and switching of operations should be the same, and the resulting code looks ok, but I am not sure
@@ -159,16 +159,6 @@ internal static void ThrowNegateTwosCompOverflow() | |||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | |||
public static unsafe ulong BigMul(uint a, uint b) | |||
{ | |||
#if false // TARGET_32BIT | |||
// This generates slower code currently than the simple multiplication |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing has changed here and this block/comment still holds, why remove it?
@@ -215,13 +202,21 @@ internal static ulong BigMul(uint a, ulong b, out ulong low) | |||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | |||
public static unsafe ulong BigMul(ulong a, ulong b, out ulong low) | |||
{ | |||
#if MONO // Multiply is not yet implemented in MONO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition should be moved below (and reversed) so that the new intrinsic is used only if BMI2 isn't supported - MULX should be preferred over MUL since it has more flexible register allocation (currently wasted by the unnecessary memory spills though).
The biggest improvements are signed long and for platforms without BMI2.
A nice side effect is that the ready2run code can now emit a simple mul instead of having to fallback to the 32bit code.
This pull request introduces a internal
Multiply
hardware intrinsics (NI_X86Base_Multiply
andNI_X86Base_X64_Multiply
) for x86 and x64 architectures in the JIT compiler and calls them fromMath.BigMul
This improves the machine code for signed BigMul which should fix #75594 based on the API shape suggested in #58263
It can also help with implementing IntPtr.BigMul #114731
NOTES:
The code is heavily based on the DivRem code introduced in Implement DivRem intrinsic for X86 #66551 (I went through the current version of all the files touched and tried to add similar code for multiply).
I did not do Mono so I did try to use conditional compilation to exclude it from Mono (since it does not seem as straightforward and I do not know how to test the various combinations). Also it seems like it might already has special cases for bigmul
I have not tuched the jit compiler before, so while the code executes and seems to work fine i might have missed something.
Since it uses tuples it has some of the downsides of DivRem (especially on windows) where extra temp variables and stackspill, so there might be a few scenarios where performance is slighly worse or the same. (There was some discussion in Consume DivRem intrinsics from Math.DivRem #82194 )
There might be other better solutions including handing Math.BigMul as an instrinct in itself (or change it to a pair of MUL/HI_MUL with some extra logic) but that would probably to many new changes to the JTI for me to take on
Exampels of generated code code
Produces the following (code from just before rebase)
Current code according to goodbolt goodbolt
Further code samples with array access
Benchmarks
The Full Benchmark code is found here
The benchmarks are based on a becnhmark suggested for MultplyNoFlags below does the following
Note: The x64_bigmul toolchain variant is the code in #115182 which will be closed favor of this PR.
Results for Math.Bigmul
Hardware without BMI2, "~10 times faster"
Additional benchmarks results
Additional resutls can be found under https://github.com/Daniel-Svensson/ClrExperiments/tree/7acd61943336356fa363763914a5b963de962065/ClrDecimal/Benchmarks/BenchmarkDotNet.Artifacts/results , I mostly checked that there was no significant regressions to decimal performance since Math.BigMul is has several usages there. There were a few minor improvements, mostly in the composite "InterestBenchmarks" which contains a mix of operations similar to interest calculation.
Copilot Summary
Summary
JIT Compiler Enhancements
Multiply
intrinsics in the JIT compiler, including updates toContainCheckHWIntrinsic
,BuildHWIntrinsic
, andimpSpecialIntrinsic
to handle the new instructions and their constraints (src/coreclr/jit/lowerxarch.cpp
,src/coreclr/jit/lsraxarch.cpp
,src/coreclr/jit/hwintrinsicxarch.cpp
). [1] [2] [3]HWIntrinsicInfo
andGenTreeHWIntrinsic
to include theMultiply
intrinsics and their associated properties (src/coreclr/jit/hwintrinsic.h
,src/coreclr/jit/gentree.cpp
). [1] [2]hwintrinsiclistxarch.h
to define theMultiply
intrinsics and their characteristics, such as instruction mapping and flags (src/coreclr/jit/hwintrinsiclistxarch.h
). [1] [2]Runtime Library Updates
X86Base.Multiply
methods for both signed and unsigned multiplication in the runtime intrinsics API, providing platform-specific implementations or fallback behavior (src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/X86Base.cs
,src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/X86Base.PlatformNotSupported.cs
). [1] [2]Math
class to use the newMultiply
intrinsics for optimizedBigMul
operations, improving performance on supported platforms (src/libraries/System.Private.CoreLib/src/System/Math.cs
). [1] [2]Code Cleanup
Math
class (src/libraries/System.Private.CoreLib/src/System/Math.cs
). [1] [2]These changes collectively enhance the performance and capabilities of multiplication operations in .NET, leveraging hardware acceleration where available.Summary: