-
Notifications
You must be signed in to change notification settings - Fork 5k
Enable AVX-512 in Memmove unrolling #84348
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak Issue DetailsI hope I don't step on anyone's toes (@dotnet/avx512-contrib), just wanted to play with the recently added AVX-512 support in runtime and enabled it for ; Method BufferMemmoveUnrolling:Test(ubyte[],ubyte[]):this
sub rsp, 40
+ vzeroupper
test rdx, rdx
je SHORT G_M53589_IG07
cmp dword ptr [rdx+08H], 200
jb SHORT G_M53589_IG07
add rdx, 16
test r8, r8
jne SHORT G_M53589_IG04
- xor rax, rax
- xor ecx, ecx
+ xor rcx, rcx
+ xor eax, eax
jmp SHORT G_M53589_IG05
G_M53589_IG04:
lea rax, bword ptr [r8+10H]
mov ecx, dword ptr [r8+08H]
G_M53589_IG05:
- cmp eax, 200
+ cmp ecx, 200
jb SHORT G_M53589_IG08
- mov r8d, 200
- call [System.Buffer:Memmove(byref,byref,ulong)]
- nop
+ vmovdqu32 zmm0, zmmword ptr[rdx]
+ vmovdqu32 zmm1, zmmword ptr[rdx+40H]
+ vmovdqu32 zmm2, zmmword ptr[rdx+80H]
+ vmovdqu xmm3, xmmword ptr [rdx+B8H] ;; handle remainder via XMM
+ vmovdqu32 zmmword ptr[rax], zmm0
+ vmovdqu32 zmmword ptr[rax+40H], zmm1
+ vmovdqu32 zmmword ptr[rax+80H], zmm2
+ vmovdqu xmmword ptr [rax+B8H], xmm3 ;; handle remainder via XMM
add rsp, 40
ret
G_M53589_IG07:
call [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
int3
G_M53589_IG08:
call [System.ThrowHelper:ThrowArgumentException_DestinationTooShort()]
int3
-; Total bytes of code: 80
+; Total bytes of code: 127 so it allows us to unroll 128..256 range now (we couldn't handle it previously because the algorithm is limitted to 4 temp regs so it used to fallback to memmove call). Benchmarks: [Benchmark] public void Copy32() => Src.AsSpan(0, 32).CopyTo(Dst);
[Benchmark] public void Copy50() => Src.AsSpan(0, 50).CopyTo(Dst);
[Benchmark] public void Copy64() => Src.AsSpan(0, 64).CopyTo(Dst);
[Benchmark] public void Copy80() => Src.AsSpan(0, 80).CopyTo(Dst);
[Benchmark] public void Copy100() => Src.AsSpan(0, 100).CopyTo(Dst);
[Benchmark] public void Copy128() => Src.AsSpan(0, 128).CopyTo(Dst);
[Benchmark] public void Copy200() => Src.AsSpan(0, 200).CopyTo(Dst);
[Benchmark] public void Copy256() => Src.AsSpan(0, 256).CopyTo(Dst);
[Benchmark] public void Copy300() => Src.AsSpan(0, 300).CopyTo(Dst);
[Benchmark] public void Copy400() => Src.AsSpan(0, 400).CopyTo(Dst); I was benchmarking it on Ryzen 7950x (the only hw I have with AVX-512) where, as far as I understand, 512b width is implemented via dual 256b dispatch so the difference is expected to be better on some modern intel server CPU?
|
NOTE: this PR also improves non-avx512 case a bit, e.g.: void Test() => Src.AsSpan(0, 42).CopyTo(Dst); codegen diff: https://www.diffchecker.com/svPDo4c7/ (switches to xmm for the remainder - presumably, it reduces chances to hit cacheline/page boundary) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Would be good to have @tannergooding and possibly @anthonycanino @DeepakRajendrakumaran comment on the generated code.
Add asserts
Failure is #84536 |
@dotnet/avx512-contrib Apparently this PR regressed ASP.NET TechEmpower (https://aka.ms/aspnet/benchmarks), probably because the machines in the TE lab have old enough hardware that do throttling when implementing AVX-512 |
@EgorBo Can you please share the original code you used for benchmarking this on Ryzen 7950x? I'd like to run this locally if possible? |
Um, do you mean my local benchmarks or TE? class Prog
{
static void Main(string[] args)
{
BenchmarkSwitcher.FromAssembly(typeof(Prog).Assembly).Run(args);
}
// 8-byte aligned, probably better to use NativeMemory.AllocAligned here
byte[] Src = new byte[512];
byte[] Dst = new byte[512];
[Benchmark] public void Copy32() => Src.AsSpan(0, 32).CopyTo(Dst);
[Benchmark] public void Copy50() => Src.AsSpan(0, 50).CopyTo(Dst);
[Benchmark] public void Copy64() => Src.AsSpan(0, 64).CopyTo(Dst);
[Benchmark] public void Copy80() => Src.AsSpan(0, 80).CopyTo(Dst);
[Benchmark] public void Copy100() => Src.AsSpan(0, 100).CopyTo(Dst);
[Benchmark] public void Copy128() => Src.AsSpan(0, 128).CopyTo(Dst);
[Benchmark] public void Copy129() => Src.AsSpan(0, 129).CopyTo(Dst);
[Benchmark] public void Copy200() => Src.AsSpan(0, 200).CopyTo(Dst);
[Benchmark] public void Copy256() => Src.AsSpan(0, 256).CopyTo(Dst);
[Benchmark] public void Copy257() => Src.AsSpan(0, 257).CopyTo(Dst);
[Benchmark] public void Copy300() => Src.AsSpan(0, 300).CopyTo(Dst);
[Benchmark] public void Copy400() => Src.AsSpan(0, 400).CopyTo(Dst);
} |
Fyi, machines used for the TE runs are described here: https://github.com/aspnet/Benchmarks/blob/main/scenarios/README.md#profiles |
Yeah, I filed a PR #84577 to sort of fix the issue and listed all details I found |
Thank you. This is what I am looking for for starters. Will probably try TE later.
|
@DeepakRajendrakumaran
The runtime has to be built with: .\build.cmd Clr+Libs -c Release
.\src\tests\build.cmd Release generatelayoutonly or same commands in bash if for linux/mac |
It works. I'm going to get numbers with |
we don't set any variables at all usually, BDN is expected to tier up all hot methods |
I hope I don't step on anyone's toes (@dotnet/avx512-contrib), just wanted to play with the recently added AVX-512 support in runtime and enabled it for
Buffer.Memmove
unrolling, e.g.:so it allows us to unroll 128..256 range now (we couldn't handle it previously because the algorithm is limitted to 4 temp regs so it used to fallback to memmove call after 128b).
Benchmarks:
I was benchmarking it on Ryzen 7950x (the only hw I have with AVX-512) where, as far as I understand, 512b width is implemented via dual 256b dispatch so the difference is expected to be better on some modern intel server CPU?
Tests: https://github.com/dotnet/runtime/blob/main/src/tests/JIT/opt/Vectorization/BufferMemmove.cs
PS: seems like I need to align my test array to a specific boundary for less noise)