Enable AVX-512 in Memmove unrolling #84348

EgorBo · 2023-04-05T12:44:32Z

I hope I don't step on anyone's toes (@dotnet/avx512-contrib). I just wanted to play with the recently added AVX-512 support in the runtime, so I enabled it for Buffer.Memmove unrolling, e.g.:

void Test(byte[] src, byte[] dst) => src.AsSpan(0, 200).CopyTo(dst);

; Method BufferMemmoveUnrolling:Test(ubyte[],ubyte[]):this
       sub      rsp, 40
+      vzeroupper 
       test     rdx, rdx
       je       SHORT G_M53589_IG07
       cmp      dword ptr [rdx+08H], 200
       jb       SHORT G_M53589_IG07
       add      rdx, 16
       test     r8, r8
       jne      SHORT G_M53589_IG04
-      xor      rax, rax
-      xor      ecx, ecx
+      xor      rcx, rcx
+      xor      eax, eax
       jmp      SHORT G_M53589_IG05
G_M53589_IG04:
       lea      rax, bword ptr [r8+10H]
       mov      ecx, dword ptr [r8+08H]
G_M53589_IG05:
-      cmp      eax, 200
+      cmp      ecx, 200
       jb       SHORT G_M53589_IG08
-      mov      r8d, 200	
-      call     [System.Buffer:Memmove(byref,byref,ulong)]
-      nop
+      vmovdqu32 zmm0, zmmword ptr [rdx]
+      vmovdqu32 zmm1, zmmword ptr [rdx+40H]
+      vmovdqu32 zmm2, zmmword ptr [rdx+80H]
+      vmovdqu  xmm3, xmmword ptr [rdx+B8H] ;; handle remainder via XMM
+      vmovdqu32 zmmword ptr [rax], zmm0
+      vmovdqu32 zmmword ptr [rax+40H], zmm1
+      vmovdqu32 zmmword ptr [rax+80H], zmm2
+      vmovdqu  xmmword ptr [rax+B8H], xmm3 ;; handle remainder via XMM
       add      rsp, 40
       ret      
G_M53589_IG07:
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     
G_M53589_IG08:
       call     [System.ThrowHelper:ThrowArgumentException_DestinationTooShort()]
       int3     
-; Total bytes of code: 80
+; Total bytes of code: 127

So it allows us to unroll the 128..256 byte range now (we couldn't handle it previously because the algorithm is limited to 4 temp regs, so it used to fall back to a memmove call above 128 bytes).
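
To make the register budget concrete, here is a minimal sketch of how a fixed-size copy could be planned with at most 4 temporaries and an overlapping tail. It is not the JIT's actual code (that is C++ inside the JIT); `UnrollPlan` and `TryBuild` are hypothetical names, and the rule for picking the remainder width is an assumption inferred from the disassembly above.

```csharp
using System;
using System.Collections.Generic;

public static class UnrollPlan
{
    // Hypothetical helper: plans a copy of `size` bytes using vectors of at most
    // `maxWidth` bytes (32 for ymm, 64 for zmm). Returns false when the plan
    // would need more than `maxRegs` temporaries.
    public static bool TryBuild(int size, int maxWidth,
        out List<(int Offset, int Width)> moves, int maxRegs = 4)
    {
        moves = new List<(int Offset, int Width)>();
        if (size <= 0 || maxWidth < 16)
            return false;

        // Full-width chunks from the start of the buffer.
        int offset = 0;
        for (; size - offset >= maxWidth; offset += maxWidth)
            moves.Add((offset, maxWidth));

        int remainder = size - offset;
        if (remainder > 0)
        {
            // Assumption: use the narrowest vector (16, 32 or 64 bytes) that
            // covers the remainder and let it overlap the previous chunk so
            // that it ends exactly at `size`.
            int width = 16;
            while (width < remainder)
                width *= 2;
            if (width > size)
                return false; // nothing to overlap with
            moves.Add((size - width, width));
        }

        return moves.Count <= maxRegs;
    }
}
```

For `size = 200` and `maxWidth = 64` the sketch yields three 64-byte moves at offsets 0, 64 and 128 plus one 16-byte move at offset 184 (0xB8), which matches the listing above. With `maxWidth = 32` the same copy would need 7 temporaries, so it is rejected.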

Benchmarks:

[Benchmark] public void Copy32() => Src.AsSpan(0, 32).CopyTo(Dst);
[Benchmark] public void Copy50() => Src.AsSpan(0, 50).CopyTo(Dst);
[Benchmark] public void Copy64() => Src.AsSpan(0, 64).CopyTo(Dst);
[Benchmark] public void Copy80() => Src.AsSpan(0, 80).CopyTo(Dst);
[Benchmark] public void Copy100() => Src.AsSpan(0, 100).CopyTo(Dst);
[Benchmark] public void Copy128() => Src.AsSpan(0, 128).CopyTo(Dst);
[Benchmark] public void Copy129() => Src.AsSpan(0, 129).CopyTo(Dst);
[Benchmark] public void Copy200() => Src.AsSpan(0, 200).CopyTo(Dst);
[Benchmark] public void Copy256() => Src.AsSpan(0, 256).CopyTo(Dst);
[Benchmark] public void Copy257() => Src.AsSpan(0, 257).CopyTo(Dst);
[Benchmark] public void Copy300() => Src.AsSpan(0, 300).CopyTo(Dst);
[Benchmark] public void Copy400() => Src.AsSpan(0, 400).CopyTo(Dst);

I was benchmarking it on a Ryzen 7950X (the only hardware I have with AVX-512) where, as far as I understand, the 512-bit width is implemented via dual 256-bit dispatch, so the difference is expected to be bigger on a modern Intel server CPU?

Tests: https://github.com/dotnet/runtime/blob/main/src/tests/JIT/opt/Vectorization/BufferMemmove.cs

PS: it seems I need to align my test arrays to a specific boundary for less noise.
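
One possible way to get aligned test buffers without unsafe code is sketched below. It is an illustration only, not what the benchmarks in this PR used: `AlignedBuffer` and `Allocate` are hypothetical names, the 64-byte default is an assumption (one cache line, and one zmm register), and it relies on `GC.AllocateArray` with `pinned: true` (.NET 5+) so that the computed address stays valid.

```csharp
using System;
using System.Runtime.InteropServices;

public static class AlignedBuffer
{
    // Hypothetical helper: over-allocates a pinned array and returns the index
    // of the first element whose address is a multiple of `alignment`.
    public static (byte[] Array, int Offset) Allocate(int length, int alignment = 64)
    {
        // Pinned arrays are never moved by the GC, so the address is stable.
        byte[] array = GC.AllocateArray<byte>(length + alignment, pinned: true);
        long address = Marshal.UnsafeAddrOfPinnedArrayElement(array, 0).ToInt64();
        int offset = (int)((alignment - address % alignment) % alignment);
        return (array, offset);
    }
}
```

A benchmark could then slice from the returned offset, e.g. `array.AsSpan(offset, 200)`, so that every run starts on the same boundary.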

ghost · 2023-04-05T12:44:43Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

EgorBo · 2023-04-05T13:28:29Z

NOTE: this PR also improves the non-AVX-512 case a bit, e.g.:

void Test() => Src.AsSpan(0, 42).CopyTo(Dst);

codegen diff: https://www.diffchecker.com/svPDo4c7/ (it switches to xmm for the remainder; presumably this reduces the chance of crossing a cache-line or page boundary)
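
The boundary argument can be checked with a little arithmetic. The sketch below is an illustration under assumptions: `BoundaryCrossing` is a hypothetical helper, a 64-byte cache line is assumed, and the "before" plan for 42 bytes (two overlapping 32-byte accesses) is inferred from the remark above rather than taken from the diff.

```csharp
using System;

public static class BoundaryCrossing
{
    // True if the access [start, start + width) touches two different
    // `line`-byte blocks.
    public static bool Crosses(long start, int width, int line = 64)
        => start / line != (start + width - 1) / line;

    // Sums, over every possible base alignment within one line, how many of
    // the given accesses cross a line boundary.
    public static int CountCrossings((int Offset, int Width)[] accesses, int line = 64)
    {
        int count = 0;
        for (int b = 0; b < line; b++)
            foreach (var (offset, width) in accesses)
                if (Crosses(b + offset, width, line))
                    count++;
        return count;
    }
}
```

For a 42-byte copy, `CountCrossings(new[] { (0, 32), (10, 32) })` gives 62 while `CountCrossings(new[] { (0, 32), (26, 16) })` gives 46, so under these assumptions the narrower remainder access crosses a line for fewer base alignments.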

EgorBo · 2023-04-05T17:06:53Z

@BruceForstall @tannergooding PTAL

BruceForstall

LGTM. Would be good to have @tannergooding and possibly @anthonycanino @DeepakRajendrakumaran comment on the generated code.

EgorBo · 2023-04-07T13:49:53Z

Diffs:

Regressions are where we now expand the 128..256 range (the memmove call was smaller in terms of code size).
Improvements are where we now need fewer instructions to unroll, e.g. just 2 instructions to handle len=64 (previously 4).

EgorBo · 2023-04-09T18:37:11Z

Failure is #84536

BruceForstall · 2023-04-10T17:34:56Z

@dotnet/avx512-contrib Apparently this PR regressed ASP.NET TechEmpower (https://aka.ms/aspnet/benchmarks), probably because the machines in the TE lab have hardware old enough that it throttles when executing AVX-512 instructions.

DeepakRajendrakumaran · 2023-04-10T18:07:40Z

@EgorBo Can you please share the original code you used for benchmarking this on Ryzen 7950x? I'd like to run this locally if possible?

EgorBo · 2023-04-10T18:30:59Z

> @EgorBo Can you please share the original code you used for benchmarking this on Ryzen 7950x? I'd like to run this locally if possible?

Um, do you mean my local benchmarks or TE?
Local benchmarks are:

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// BenchmarkDotNet requires the benchmark class to be public.
public class Prog
{
    static void Main(string[] args)
    {
        BenchmarkSwitcher.FromAssembly(typeof(Prog).Assembly).Run(args);
    }

    // 8-byte aligned, probably better to use NativeMemory.AlignedAlloc here
    byte[] Src = new byte[512];
    byte[] Dst = new byte[512];

    [Benchmark] public void Copy32() => Src.AsSpan(0, 32).CopyTo(Dst);
    [Benchmark] public void Copy50() => Src.AsSpan(0, 50).CopyTo(Dst);
    [Benchmark] public void Copy64() => Src.AsSpan(0, 64).CopyTo(Dst);
    [Benchmark] public void Copy80() => Src.AsSpan(0, 80).CopyTo(Dst);
    [Benchmark] public void Copy100() => Src.AsSpan(0, 100).CopyTo(Dst);
    [Benchmark] public void Copy128() => Src.AsSpan(0, 128).CopyTo(Dst);
    [Benchmark] public void Copy129() => Src.AsSpan(0, 129).CopyTo(Dst);
    [Benchmark] public void Copy200() => Src.AsSpan(0, 200).CopyTo(Dst);
    [Benchmark] public void Copy256() => Src.AsSpan(0, 256).CopyTo(Dst);
    [Benchmark] public void Copy257() => Src.AsSpan(0, 257).CopyTo(Dst);
    [Benchmark] public void Copy300() => Src.AsSpan(0, 300).CopyTo(Dst);
    [Benchmark] public void Copy400() => Src.AsSpan(0, 400).CopyTo(Dst);
}

BruceForstall · 2023-04-10T19:00:43Z

Fyi, machines used for the TE runs are described here:

https://github.com/aspnet/Benchmarks/blob/main/scenarios/README.md#profiles

EgorBo · 2023-04-10T19:06:17Z

Yeah, I filed a PR #84577 to sort of fix the issue and listed all details I found

DeepakRajendrakumaran · 2023-04-10T19:09:10Z

> BenchmarkSwitcher

Thank you. This is what I was looking for, for starters. I will probably try TE later.
Note: I have not used BenchmarkDotNet a lot. I like how easy it is to write a benchmark. The code below seems to work:

using System;
using System.Threading;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

namespace MyBench
{
    public class MMOve
    {
        // 8-byte aligned, probably better to use NativeMemory.AlignedAlloc here
        byte[] Src = new byte[512];
        byte[] Dst = new byte[512];

        [Benchmark] public void Copy32() => Src.AsSpan(0, 32).CopyTo(Dst);
        [Benchmark] public void Copy50() => Src.AsSpan(0, 50).CopyTo(Dst);
        [Benchmark] public void Copy64() => Src.AsSpan(0, 64).CopyTo(Dst);
        [Benchmark] public void Copy80() => Src.AsSpan(0, 80).CopyTo(Dst);
        [Benchmark] public void Copy100() => Src.AsSpan(0, 100).CopyTo(Dst);
        [Benchmark] public void Copy128() => Src.AsSpan(0, 128).CopyTo(Dst);
        [Benchmark] public void Copy129() => Src.AsSpan(0, 129).CopyTo(Dst);
        [Benchmark] public void Copy200() => Src.AsSpan(0, 200).CopyTo(Dst);
        [Benchmark] public void Copy256() => Src.AsSpan(0, 256).CopyTo(Dst);
        [Benchmark] public void Copy257() => Src.AsSpan(0, 257).CopyTo(Dst);
        [Benchmark] public void Copy300() => Src.AsSpan(0, 300).CopyTo(Dst);
        [Benchmark] public void Copy400() => Src.AsSpan(0, 400).CopyTo(Dst);
    }

    class Program
    {
        static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<MMOve>();
            //BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
        }
    }
}

EgorBo · 2023-04-10T19:16:17Z

@DeepakRajendrakumaran
BenchmarkSwitcher.FromAssembly(typeof(MMOve).Assembly).Run(args); should work in your case.
It lets you use a locally built runtime to run the benchmarks via

dotnet run -c Release -- --coreRun C:\prj\runtime\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\corerun.exe
# or similar path on other OSes

The runtime has to be built with:

.\build.cmd Clr+Libs -c Release
.\src\tests\build.cmd Release generatelayoutonly

or the same commands in bash on Linux/macOS

DeepakRajendrakumaran · 2023-04-10T21:39:02Z

It works. I'm going to get numbers with TieredCompilation=0. I assume that's the only env variable you set when running experiments?

EgorBo · 2023-04-10T23:51:29Z

> It works. I'm going to get numbers with TieredCompilation=0. I assume that's the only env variable you set when running experiments?

We usually don't set any variables at all; BDN is expected to tier up all hot methods.

EgorBo added 2 commits April 5, 2023 13:38

Enable AVX-512 for Memmove unrolling

5f7f0c1

Merge branch 'main' of github.com:dotnet/runtime into memmove-avx512

127cae4

ghost added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Apr 5, 2023

ghost assigned EgorBo Apr 5, 2023

EgorBo added 2 commits April 5, 2023 15:00

Fix build

2f3c6f1

Fix build

e73cd3f

BruceForstall reviewed Apr 5, 2023

src/coreclr/jit/compiler.h (outdated, resolved)

Address feedback

9256076

BruceForstall added the avx512 label (Related to the AVX-512 architecture) Apr 7, 2023

tannergooding reviewed Apr 9, 2023

src/coreclr/jit/compiler.h (resolved)

tannergooding approved these changes Apr 9, 2023

EgorBo added 2 commits April 9, 2023 17:22

Update compiler.h

3bf9271

Add asserts

Update compiler.h

609b28f

EgorBo merged commit 5147002 into dotnet:main Apr 9, 2023

EgorBo deleted the memmove-avx512 branch April 9, 2023 18:36

BruceForstall mentioned this pull request Apr 10, 2023

Implement AVX-512 support #77034

Closed

56 tasks

EgorBo mentioned this pull request Apr 10, 2023

Don't use AVX-512 for small inputs in Memmove due to throttle issues #84577

Merged

EgorBo restored the memmove-avx512 branch April 14, 2023 14:44

ghost locked as resolved and limited conversation to collaborators May 14, 2023
