Improve Math.BigMul on x64 by adding new internal Multiply hardware intrinsic to X86Base #115966
Conversation
- TODO: lsra and containment - containment in
// mulEAX always has op1 in EAX
srcCount += BuildOperandUses(op1, SRBM_EAX);
srcCount += BuildOperandUses(op2);
Does it need an if statement checking if (!op2->isContained()), similar to DivRem? If so, what should it do differently? Should it do BuildDelayFreeUses or similar?
@@ -10642,6 +10642,8 @@ void Lowering::ContainCheckHWIntrinsic(GenTreeHWIntrinsic* node)

            case NI_BMI2_MultiplyNoFlags:
            case NI_BMI2_X64_MultiplyNoFlags:
            case NI_X86Base_Multiply:
            case NI_X86Base_X64_Multiply:
I am mostly guessing that the containment check and the operand switching should be the same as for MultiplyNoFlags; the resulting code looks OK, but I am not sure.
@@ -159,16 +159,6 @@ internal static void ThrowNegateTwosCompOverflow()

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static unsafe ulong BigMul(uint a, uint b)
        {
#if false // TARGET_32BIT
            // This generates slower code currently than the simple multiplication
Nothing has changed here and this block/comment still holds; why remove it?
- Is it fine to just add back the link?
I initially called the new Multiply here and it was better than the BMI2 code, and it seems to generate as good code as the built-in JIT support for 32*32=>64-bit multiply, but I decided against adding it in case it produces worse code than the current JIT optimisations.
From my understanding of the MultiplyNoFlags issue, it will not be fixed for that overload, or at least not in the near future.
- Is there a specific reason not to emit mulx for GT_MUL_LONG instead?
It would then apply to a lot more places, including those that use the standard cast+mul pattern (ulong)a*(ulong)b instead, either because it is simpler or due to the bad perf this method had because of MultiplyNoFlags.
If I update Multiply with mulx support I can add that one back here, or have a look at handling it in GT_MUL_LONG instead.
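For reference, a minimal sketch of the cast+mul pattern mentioned above (the helper name is mine, purely illustrative):

```csharp
// The standard widening cast+mul pattern. The JIT can already treat this
// widening multiply specially (the GT_MUL_LONG node discussed above), so a
// mulx-capable GT_MUL_LONG would speed up call sites like this one without
// any API change. Sketch only; the helper name is illustrative.
static ulong WideningMul(uint a, uint b) => (ulong)a * (ulong)b;
```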
@@ -215,13 +202,21 @@ internal static ulong BigMul(uint a, ulong b, out ulong low)

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static unsafe ulong BigMul(ulong a, ulong b, out ulong low)
        {
#if MONO // Multiply is not yet implemented in MONO
This condition should be moved below (and reversed) so that the new intrinsic is used only if BMI2 isn't supported - MULX should be preferred over MUL since it has more flexible register allocation (currently wasted by the unnecessary memory spills though).
- I can reorder the condition, but I would rather not switch to calling the slow BMI2 method.
That would eliminate the performance improvements this PR was intended to deliver. Apart from the 0.86 execution-time ratio in the benchmark, it gives a significant improvement of nearly 2x for other cases such as FastMod, where previous workarounds for slow BigMul can now be removed and BigMul called directly (so #113352 can be "fixed" by just calling BigMul on both arm and x64).
- I forgot to mention that I want feedback on whether it would be an acceptable solution to add mulx support to the new Multiply intrinsic when supported by hardware. (I have a separate local branch, but it is not as well tested, and I want to merge this one first so that all tests are executed against the base x64 code first.)
If the answer is no, I can remove the following comment on the internal API and toss the BMI2 version (it performs on par with mul, so it has similar performance improvements as in this PR):
runtime/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/X86Base.cs
Line 77 in 4ed65b8
/// <para>In the future it might emit mulx on compatible hardware</para>
Otherwise I am happy to push those changes as a separate PR if that is fine. (A rough sketch of how the intrinsic pairs with BigMul follows below.)
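For context, a sketch of how the internal intrinsic pairs with Math.BigMul, modeled on the existing DivRem pattern; the exact signature, tuple element order, and the fallback helper name here are assumptions, not the committed code:

```csharp
// Sketch only: Multiply is internal to CoreLib, and its shape here is
// assumed by analogy with X86Base.X64.DivRem.
public static ulong BigMul(ulong a, ulong b, out ulong low)
{
    if (X86Base.X64.IsSupported)
    {
        // A single MUL instruction: RDX:RAX = a * b
        (ulong lower, ulong upper) = X86Base.X64.Multiply(a, b);
        low = lower;
        return upper;
    }

    return SoftwareFallback(a, b, out low); // portable path, name assumed
}
```

Callers that only need the high half (multiply-high, as in the FastMod-style workarounds mentioned above) can then simply write Math.BigMul(a, b, out _).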
// mulEAX always has op1 in EAX
srcCount += BuildOperandUses(op1, SRBM_EAX);
srcCount += BuildOperandUses(op2);
I see there are a number of APX-related changes to these files since the code was written. I will merge/rebase and try to update it once I have feedback on whether this solution is in the right direction (or whether it would be better to handle BigMul itself directly).
Will probably replace this with:
BuildOperandUses(op2, BuildApxIncompatibleGPRMask(op2))
The biggest improvements are for signed long and for platforms without BMI2.
A nice side effect is that the ready-to-run code can now emit a simple mul instead of having to fall back to the 32-bit code.
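The 32-bit fallback referred to here is the classic decomposition into 32-bit partial products; a sketch of that well-known scheme (not the exact CoreLib code):

```csharp
// Portable 64x64 -> 128-bit multiply via four 32-bit partial products.
static ulong BigMulFallback(ulong a, ulong b, out ulong low)
{
    ulong aLo = (uint)a, aHi = a >> 32;
    ulong bLo = (uint)b, bHi = b >> 32;

    ulong ll = aLo * bLo;
    ulong lh = aLo * bHi;
    ulong hl = aHi * bLo;
    ulong hh = aHi * bHi;

    // Fold the overlapping middle terms, carrying into the high half.
    ulong mid = (ll >> 32) + (uint)lh + (uint)hl;

    low = (mid << 32) | (uint)ll;
    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
```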
This pull request introduces internal Multiply hardware intrinsics (NI_X86Base_Multiply and NI_X86Base_X64_Multiply) for x86 and x64 architectures in the JIT compiler and calls them from Math.BigMul.
This improves the machine code for signed BigMul, which should fix #75594, based on the API shape suggested in #58263.
It can also help with implementing IntPtr.BigMul (#114731).
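As context for the signed case: signed BigMul is conventionally built on the unsigned multiply plus a two's-complement correction, which is what a single signed multiply instruction replaces; a sketch of that standard identity (not the exact CoreLib code):

```csharp
// Signed 64x64 -> 128-bit multiply built on the unsigned one.
static long BigMulSigned(long a, long b, out long low)
{
    ulong high = Math.BigMul((ulong)a, (ulong)b, out ulong ulow);
    low = (long)ulow;

    // If a < 0 then (ulong)a == a + 2^64, which adds b * 2^64 to the
    // product; subtract it back out of the high half (same for b).
    return (long)high - ((a >> 63) & b) - ((b >> 63) & a);
}
```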
NOTES:
- The code is heavily based on the DivRem code introduced in "Implement DivRem intrinsic for X86" #66551 (I went through the current version of all the files touched and tried to add similar code for multiply).
- I did not do Mono, so I used conditional compilation to exclude it from Mono (since it does not seem as straightforward and I do not know how to test the various combinations). Also, it seems like Mono might already have special cases for BigMul.
- I have not touched the JIT compiler before, so while the code executes and seems to work fine, I might have missed something.
- Since it uses tuples, it has some of the downsides of DivRem (especially on Windows), such as extra temp variables and stack spills, so there might be a few scenarios where performance is slightly worse or the same. (There was some discussion in "Consume DivRem intrinsics from Math.DivRem" #82194.)
- There might be other, better solutions, including handling Math.BigMul as an intrinsic in itself (or changing it to a pair of MUL/HI_MUL nodes with some extra logic), but that would probably be too many new changes to the JIT for me to take on.
Examples of generated code
Produces the following (code from just before the rebase):
Current code according to godbolt
Further code samples with array access
Benchmarks
The full benchmark code is found here.
The benchmarks are based on a benchmark suggested for MultiplyNoFlags and do the following:
Note: The x64_bigmul toolchain variant is the code in #115182, which will be closed in favor of this PR.
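The linked benchmark code is not reproduced here; a minimal BenchmarkDotNet sketch of the same shape (a dependent BigMul loop; names and constants are mine) would be:

```csharp
using System;
using BenchmarkDotNet.Attributes;

public class BigMulBenchmarks
{
    private ulong _a = 0x9E3779B97F4A7C15;
    private ulong _b = 0xD1B54A32D192ED03;

    [Benchmark]
    public ulong UnsignedBigMul()
    {
        ulong acc = 0;
        for (int i = 0; i < 1000; i++)
        {
            // Dependent chain so multiply latency is actually measured.
            acc += Math.BigMul(_a ^ acc, _b, out ulong low);
            acc ^= low;
        }
        return acc;
    }
}
```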
Results for Math.BigMul
Hardware without BMI2, "~10 times faster"
Additional benchmarks results
Additional results can be found under https://github.com/Daniel-Svensson/ClrExperiments/tree/7acd61943336356fa363763914a5b963de962065/ClrDecimal/Benchmarks/BenchmarkDotNet.Artifacts/results . I mostly checked that there were no significant regressions to decimal performance, since Math.BigMul has several usages there. There were a few minor improvements, mostly in the composite "InterestBenchmarks", which contains a mix of operations similar to interest calculations.
Copilot Summary

JIT Compiler Enhancements
- Introduced the new Multiply intrinsics in the JIT compiler, including updates to ContainCheckHWIntrinsic, BuildHWIntrinsic, and impSpecialIntrinsic to handle the new instructions and their constraints (src/coreclr/jit/lowerxarch.cpp, src/coreclr/jit/lsraxarch.cpp, src/coreclr/jit/hwintrinsicxarch.cpp). [1] [2] [3]
- Extended HWIntrinsicInfo and GenTreeHWIntrinsic to include the Multiply intrinsics and their associated properties (src/coreclr/jit/hwintrinsic.h, src/coreclr/jit/gentree.cpp). [1] [2]
- Updated hwintrinsiclistxarch.h to define the Multiply intrinsics and their characteristics, such as instruction mapping and flags (src/coreclr/jit/hwintrinsiclistxarch.h). [1] [2]

Runtime Library Updates
- Added X86Base.Multiply methods for both signed and unsigned multiplication in the runtime intrinsics API, providing platform-specific implementations or fallback behavior (src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/X86Base.cs, src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/X86Base.PlatformNotSupported.cs). [1] [2]
- Updated the Math class to use the new Multiply intrinsics for optimized BigMul operations, improving performance on supported platforms (src/libraries/System.Private.CoreLib/src/System/Math.cs). [1] [2]

Code Cleanup
- Removed a now-unneeded fallback path from the Math class (src/libraries/System.Private.CoreLib/src/System/Math.cs). [1] [2]

These changes collectively enhance the performance and capabilities of multiplication operations in .NET, leveraging hardware acceleration where available.