Skip to content

Enable AVX-512 in Memmove unrolling #84348

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 9, 2023
Merged

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Apr 5, 2023

I hope I don't step on anyone's toes (@dotnet/avx512-contrib), just wanted to play with the recently added AVX-512 support in runtime and enabled it for Buffer.Memmove unrolling, e.g.:

void Test(byte[] src, byte[] dst) => src.AsSpan(0, 200).CopyTo(dst);
; Method BufferMemmoveUnrolling:Test(ubyte[],ubyte[]):this
       sub      rsp, 40
+      vzeroupper 
       test     rdx, rdx
       je       SHORT G_M53589_IG07
       cmp      dword ptr [rdx+08H], 200
       jb       SHORT G_M53589_IG07
       add      rdx, 16
       test     r8, r8
       jne      SHORT G_M53589_IG04
-      xor      rax, rax
-      xor      ecx, ecx
+      xor      rcx, rcx
+      xor      eax, eax
       jmp      SHORT G_M53589_IG05
G_M53589_IG04:
       lea      rax, bword ptr [r8+10H]
       mov      ecx, dword ptr [r8+08H]
G_M53589_IG05:
-      cmp      eax, 200
+      cmp      ecx, 200
       jb       SHORT G_M53589_IG08
-      mov      r8d, 200	
-      call     [System.Buffer:Memmove(byref,byref,ulong)]
-      nop
+      vmovdqu32 zmm0, zmmword ptr [rdx]
+      vmovdqu32 zmm1, zmmword ptr [rdx+40H]
+      vmovdqu32 zmm2, zmmword ptr [rdx+80H]
+      vmovdqu  xmm3, xmmword ptr [rdx+B8H] ;; handle remainder via XMM
+      vmovdqu32 zmmword ptr [rax], zmm0
+      vmovdqu32 zmmword ptr [rax+40H], zmm1
+      vmovdqu32 zmmword ptr [rax+80H], zmm2
+      vmovdqu  xmmword ptr [rax+B8H], xmm3 ;; handle remainder via XMM
       add      rsp, 40
       ret      
G_M53589_IG07:
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     
G_M53589_IG08:
       call     [System.ThrowHelper:ThrowArgumentException_DestinationTooShort()]
       int3     
-; Total bytes of code: 80
+; Total bytes of code: 127

so it allows us to unroll 128..256 range now (we couldn't handle it previously because the algorithm is limitted to 4 temp regs so it used to fallback to memmove call after 128b).

Benchmarks:

[Benchmark] public void Copy32() => Src.AsSpan(0, 32).CopyTo(Dst);
[Benchmark] public void Copy50() => Src.AsSpan(0, 50).CopyTo(Dst);
[Benchmark] public void Copy64() => Src.AsSpan(0, 64).CopyTo(Dst);
[Benchmark] public void Copy80() => Src.AsSpan(0, 80).CopyTo(Dst);
[Benchmark] public void Copy100() => Src.AsSpan(0, 100).CopyTo(Dst);
[Benchmark] public void Copy128() => Src.AsSpan(0, 128).CopyTo(Dst);
[Benchmark] public void Copy129() => Src.AsSpan(0, 129).CopyTo(Dst);
[Benchmark] public void Copy200() => Src.AsSpan(0, 200).CopyTo(Dst);
[Benchmark] public void Copy256() => Src.AsSpan(0, 256).CopyTo(Dst);
[Benchmark] public void Copy257() => Src.AsSpan(0, 257).CopyTo(Dst);
[Benchmark] public void Copy300() => Src.AsSpan(0, 300).CopyTo(Dst);
[Benchmark] public void Copy400() => Src.AsSpan(0, 400).CopyTo(Dst);

image

I was benchmarking it on Ryzen 7950x (the only hw I have with AVX-512) where, as far as I understand, 512b width is implemented via dual 256b dispatch so the difference is expected to be better on some modern intel server CPU?

Tests: https://github.com/dotnet/runtime/blob/main/src/tests/JIT/opt/Vectorization/BufferMemmove.cs

PS: seems like I need to align my test array to a specific boundary for less noise)

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 5, 2023
@ghost ghost assigned EgorBo Apr 5, 2023
@ghost
Copy link

ghost commented Apr 5, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

I hope I don't step on anyone's toes (@dotnet/avx512-contrib), just wanted to play with the recently added AVX-512 support in runtime and enabled it for Buffer.Memmove unrolling, e.g.:

; Method BufferMemmoveUnrolling:Test(ubyte[],ubyte[]):this
       sub      rsp, 40
+      vzeroupper 
       test     rdx, rdx
       je       SHORT G_M53589_IG07
       cmp      dword ptr [rdx+08H], 200
       jb       SHORT G_M53589_IG07
       add      rdx, 16
       test     r8, r8
       jne      SHORT G_M53589_IG04
-      xor      rax, rax
-      xor      ecx, ecx
+      xor      rcx, rcx
+      xor      eax, eax
       jmp      SHORT G_M53589_IG05
G_M53589_IG04:
       lea      rax, bword ptr [r8+10H]
       mov      ecx, dword ptr [r8+08H]
G_M53589_IG05:
-      cmp      eax, 200
+      cmp      ecx, 200
       jb       SHORT G_M53589_IG08
-      mov      r8d, 200	
-      call     [System.Buffer:Memmove(byref,byref,ulong)]
-      nop
+      vmovdqu32 zmm0, zmmword ptr[rdx]
+      vmovdqu32 zmm1, zmmword ptr[rdx+40H]
+      vmovdqu32 zmm2, zmmword ptr[rdx+80H]
+      vmovdqu  xmm3, xmmword ptr [rdx+B8H] ;; handle remainder via XMM
+      vmovdqu32 zmmword ptr[rax], zmm0
+      vmovdqu32 zmmword ptr[rax+40H], zmm1
+      vmovdqu32 zmmword ptr[rax+80H], zmm2
+      vmovdqu  xmmword ptr [rax+B8H], xmm3 ;; handle remainder via XMM
       add      rsp, 40
       ret      
G_M53589_IG07:
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     
G_M53589_IG08:
       call     [System.ThrowHelper:ThrowArgumentException_DestinationTooShort()]
       int3     
-; Total bytes of code: 80
+; Total bytes of code: 127

so it allows us to unroll 128..256 range now (we couldn't handle it previously because the algorithm is limitted to 4 temp regs so it used to fallback to memmove call).

Benchmarks:

[Benchmark] public void Copy32() => Src.AsSpan(0, 32).CopyTo(Dst);
[Benchmark] public void Copy50() => Src.AsSpan(0, 50).CopyTo(Dst);
[Benchmark] public void Copy64() => Src.AsSpan(0, 64).CopyTo(Dst);
[Benchmark] public void Copy80() => Src.AsSpan(0, 80).CopyTo(Dst);
[Benchmark] public void Copy100() => Src.AsSpan(0, 100).CopyTo(Dst);
[Benchmark] public void Copy128() => Src.AsSpan(0, 128).CopyTo(Dst);
[Benchmark] public void Copy200() => Src.AsSpan(0, 200).CopyTo(Dst);
[Benchmark] public void Copy256() => Src.AsSpan(0, 256).CopyTo(Dst);
[Benchmark] public void Copy300() => Src.AsSpan(0, 300).CopyTo(Dst);
[Benchmark] public void Copy400() => Src.AsSpan(0, 400).CopyTo(Dst);

image

I was benchmarking it on Ryzen 7950x (the only hw I have with AVX-512) where, as far as I understand, 512b width is implemented via dual 256b dispatch so the difference is expected to be better on some modern intel server CPU?

Author: EgorBo
Assignees: EgorBo
Labels:

area-CodeGen-coreclr

Milestone: -

@EgorBo
Copy link
Member Author

EgorBo commented Apr 5, 2023

NOTE: this PR also improves non-avx512 case a bit, e.g.:

void Test() => Src.AsSpan(0, 42).CopyTo(Dst);

codegen diff: https://www.diffchecker.com/svPDo4c7/ (switches to xmm for the remainder - presumably, it reduces chances to hit cacheline/page boundary)

@EgorBo
Copy link
Member Author

EgorBo commented Apr 5, 2023

@BruceForstall @tannergooding PTAL

Copy link
Contributor

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Would be good to have @tannergooding and possibly @anthonycanino @DeepakRajendrakumaran comment on the generated code.

@EgorBo
Copy link
Member Author

EgorBo commented Apr 7, 2023

Diffs:
image

Regressions where we now expand 128..256 range (call memmove was smaller in terms of codegen size).
Improvements where we now need less instructions to unroll, e.g. just 2 instructions to handle len=64 (previously - 4 instructions).

@BruceForstall BruceForstall added the avx512 Related to the AVX-512 architecture label Apr 7, 2023
@EgorBo EgorBo merged commit 5147002 into dotnet:main Apr 9, 2023
@EgorBo EgorBo deleted the memmove-avx512 branch April 9, 2023 18:36
@EgorBo
Copy link
Member Author

EgorBo commented Apr 9, 2023

Failure is #84536

@BruceForstall BruceForstall mentioned this pull request Apr 10, 2023
56 tasks
@BruceForstall
Copy link
Contributor

@dotnet/avx512-contrib Apparently this PR regressed ASP.NET TechEmpower (https://aka.ms/aspnet/benchmarks), probably because the machines in the TE lab have old enough hardware that do throttling when implementing AVX-512

@DeepakRajendrakumaran
Copy link
Contributor

@EgorBo Can you please share the original code you used for benchmarking this on Ryzen 7950x? I'd like to run this locally if possible?

@EgorBo
Copy link
Member Author

EgorBo commented Apr 10, 2023

@EgorBo Can you please share the original code you used for benchmarking this on Ryzen 7950x? I'd like to run this locally if possible?

Um, do you mean my local benchmarks or TE?
Local benchmarks are:

class Prog
{
    static void Main(string[] args)
    {
        BenchmarkSwitcher.FromAssembly(typeof(Prog).Assembly).Run(args);
    }

    // 8-byte aligned, probably better to use NativeMemory.AllocAligned here
    byte[] Src = new byte[512];
    byte[] Dst = new byte[512];

    [Benchmark] public void Copy32() => Src.AsSpan(0, 32).CopyTo(Dst);
    [Benchmark] public void Copy50() => Src.AsSpan(0, 50).CopyTo(Dst);
    [Benchmark] public void Copy64() => Src.AsSpan(0, 64).CopyTo(Dst);
    [Benchmark] public void Copy80() => Src.AsSpan(0, 80).CopyTo(Dst);
    [Benchmark] public void Copy100() => Src.AsSpan(0, 100).CopyTo(Dst);
    [Benchmark] public void Copy128() => Src.AsSpan(0, 128).CopyTo(Dst);
    [Benchmark] public void Copy129() => Src.AsSpan(0, 129).CopyTo(Dst);
    [Benchmark] public void Copy200() => Src.AsSpan(0, 200).CopyTo(Dst);
    [Benchmark] public void Copy256() => Src.AsSpan(0, 256).CopyTo(Dst);
    [Benchmark] public void Copy257() => Src.AsSpan(0, 257).CopyTo(Dst);
    [Benchmark] public void Copy300() => Src.AsSpan(0, 300).CopyTo(Dst);
    [Benchmark] public void Copy400() => Src.AsSpan(0, 400).CopyTo(Dst);
}

@BruceForstall
Copy link
Contributor

Fyi, machines used for the TE runs are described here:

https://github.com/aspnet/Benchmarks/blob/main/scenarios/README.md#profiles

@EgorBo
Copy link
Member Author

EgorBo commented Apr 10, 2023

Yeah, I filed a PR #84577 to sort of fix the issue and listed all details I found

@DeepakRajendrakumaran
Copy link
Contributor

 BenchmarkSwitcher

Thank you. This is what I am looking for for starters. Will probably try TE later.
Note - Have not used BenchmarkDotNet a lot. Like how easy it is to write a benchmark. The below seem to work

using System;
using System.Threading;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

namespace MyBench
{
    public class MMOve
    {
        // 8-byte aligned, probably better to use NativeMemory.AllocAligned here
        byte[] Src = new byte[512];
        byte[] Dst = new byte[512];

        [Benchmark] public void Copy32() => Src.AsSpan(0, 32).CopyTo(Dst);
        [Benchmark] public void Copy50() => Src.AsSpan(0, 50).CopyTo(Dst);
        [Benchmark] public void Copy64() => Src.AsSpan(0, 64).CopyTo(Dst);
        [Benchmark] public void Copy80() => Src.AsSpan(0, 80).CopyTo(Dst);
        [Benchmark] public void Copy100() => Src.AsSpan(0, 100).CopyTo(Dst);
        [Benchmark] public void Copy128() => Src.AsSpan(0, 128).CopyTo(Dst);
        [Benchmark] public void Copy129() => Src.AsSpan(0, 129).CopyTo(Dst);
        [Benchmark] public void Copy200() => Src.AsSpan(0, 200).CopyTo(Dst);
        [Benchmark] public void Copy256() => Src.AsSpan(0, 256).CopyTo(Dst);
        [Benchmark] public void Copy257() => Src.AsSpan(0, 257).CopyTo(Dst);
        [Benchmark] public void Copy300() => Src.AsSpan(0, 300).CopyTo(Dst);
        [Benchmark] public void Copy400() => Src.AsSpan(0, 400).CopyTo(Dst);
    }

    class Program
    {
        static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<MMOve>();
            //BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
        }
    }
}

@EgorBo
Copy link
Member Author

EgorBo commented Apr 10, 2023

@DeepakRajendrakumaran
BenchmarkSwitcher.FromAssembly(typeof(MMove).Assembly).Run(args); should work in your case.
It allows to use a locally built runtime to run benchmarks via

dotnet run -c Release -- --coreRun C:\prj\runtime\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\corerun.exe
# or similar path on other OSes

The runtime has to be built with:

.\build.cmd Clr+Libs -c Release
.\src\tests\build.cmd Release generatelayoutonly 

or same commands in bash if for linux/mac

@DeepakRajendrakumaran
Copy link
Contributor

It works. I'm going to get numbers with TieredCompilation=0. I assume that's the only env variable you set when running experiments?

@EgorBo
Copy link
Member Author

EgorBo commented Apr 10, 2023

It works. I'm going to get numbers with TieredCompilation=0. I assume that's the only env variable you set when running experiments?

we don't set any variables at all usually, BDN is expected to tier up all hot methods

@EgorBo EgorBo restored the memmove-avx512 branch April 14, 2023 14:44
@ghost ghost locked as resolved and limited conversation to collaborators May 14, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI avx512 Related to the AVX-512 architecture
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants