
[GC] Could SVR::memcopy be more efficient? #110571

Open
sfiruch opened this issue Dec 10, 2024 · 14 comments
Labels
area-GC-coreclr, untriaged (New issue has not been triaged by the area owner)

Comments

@sfiruch
Contributor

sfiruch commented Dec 10, 2024

The ServerGC threads in my application spend approx. 50% of their time in SVR::memcopy

https://github.com/dotnet/runtime/blob/731a96b0018cda0c67bb3fab7a4d06538fe2ead5/src/coreclr/gc/gc.cpp#L1746-L1755

In x64 the relevant loop looks like this:

@loop:
mov rax, qword ptr [r10+r9*1]
mov qword ptr [r9], rax
lea r9, ptr [r9+0x8]
sub r11, 0x1
jnz 0x18016b901 <loop>

If I'm not mistaken, this is a plain qword-at-a-time copy loop, without vectorization or other optimizations. Could SVR::memcopy be implemented in terms of memcpy instead, which is far more heavily optimized?

The dotnet-policy-service bot added the untriaged label on Dec 10, 2024
Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@KalleOlaviNiemitalo

KalleOlaviNiemitalo commented Dec 10, 2024

The loop in the SVR::memcopy C++ code of .NET 9.0.0 is unrolled to copy four pointers in each iteration, thus 4*64 bits on amd64; but disassembling the binary for Windows shows the compiler has replaced these four assignments with a nested loop that runs four iterations.

0:000> lmvm coreclr
start             end                 module name
00007ffe`77200000 00007ffe`776b3000   coreclr    (private pdb symbols)  C:\ProgramData\dbg\sym\coreclr.pdb\D5F6BC0E477548FEAA685A1872DEF3591\coreclr.pdb
    Loaded symbol image file: C:\Program Files\dotnet\shared\Microsoft.NETCore.App\9.0.0\coreclr.dll
    Image path: C:\Program Files\dotnet\shared\Microsoft.NETCore.App\9.0.0\coreclr.dll
    Image name: coreclr.dll
    Timestamp:        Tue Oct 29 04:35:40 2024 (672049FC)
    CheckSum:         004A8D7D
    ImageSize:        004B3000
    File version:     9.0.24.52809
    Product version:  9.0.24.52809
    File flags:       0 (Mask 3F)
    File OS:          4 Unknown Win32
    File type:        0.0 Unknown
    File date:        00000000.00000000
    Translations:     0409.04b0
    Information from resource tables:
        CompanyName:      Microsoft Corporation
        ProductName:      Microsoft® .NET
        InternalName:     CoreCLR.dll
        OriginalFilename: CoreCLR.dll
        ProductVersion:   9,0,24,52809 @Commit: 9d5a6a9aa463d6d10b0b0ba6d5982cc82f363dc3
        FileVersion:      9,0,24,52809 @Commit: 9d5a6a9aa463d6d10b0b0ba6d5982cc82f363dc3
        FileDescription:  .NET Runtime
        LegalCopyright:   © Microsoft Corporation. All rights reserved.
        Comments:         Flavor=Retail
0:000> x coreclr!*::memcopy
00007ffe`77363ff8 coreclr!SVR::memcopy (unsigned char *, unsigned char *, unsigned int64)
00007ffe`77363ff8 coreclr!WKS::memcopy (unsigned char *, unsigned char *, unsigned int64)
0:000> uf coreclr!SVR::memcopy
coreclr!SVR::memcopy [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 1698]:
 1698 00007ffe`77363ff8 eb2c            jmp     coreclr!SVR::memcopy+0x2e (00007ffe`77364026)

coreclr!SVR::memcopy+0x2 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 1711]:
 1711 00007ffe`77363ffa 4c8bd2          mov     r10,rdx
 1711 00007ffe`77363ffd 4c8bc9          mov     r9,rcx
 1711 00007ffe`77364000 4c2bd1          sub     r10,rcx
 1711 00007ffe`77364003 41bb04000000    mov     r11d,4

coreclr!SVR::memcopy+0x11 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 1711]:
 1711 00007ffe`77364009 4b8b040a        mov     rax,qword ptr [r10+r9]
 1711 00007ffe`7736400d 498901          mov     qword ptr [r9],rax
 1711 00007ffe`77364010 4d8d4908        lea     r9,[r9+8]
 1711 00007ffe`77364014 4983eb01        sub     r11,1
 1711 00007ffe`77364018 75ef            jne     coreclr!SVR::memcopy+0x11 (00007ffe`77364009)

coreclr!SVR::memcopy+0x22 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 1715]:
 1715 00007ffe`7736401a 4883c120        add     rcx,20h
 1716 00007ffe`7736401e 4883c220        add     rdx,20h
 1718 00007ffe`77364022 4983e820        sub     r8,20h

coreclr!SVR::memcopy+0x2e [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 1718]:
 1718 00007ffe`77364026 4983f820        cmp     r8,20h
 1718 00007ffe`7736402a 73ce            jae     coreclr!SVR::memcopy+0x2 (00007ffe`77363ffa)

coreclr!SVR::memcopy+0x34 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 1722]:
 1722 00007ffe`7736402c 41f6c010        test    r8b,10h
 1722 00007ffe`77364030 7428            je      coreclr!SVR::memcopy+0x62 (00007ffe`7736405a)

coreclr!SVR::memcopy+0x3a [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 1724]:
 1724 00007ffe`77364032 4c8bd2          mov     r10,rdx
 1724 00007ffe`77364035 4c8bc9          mov     r9,rcx
 1724 00007ffe`77364038 4c2bd1          sub     r10,rcx
 1724 00007ffe`7736403b 41bb02000000    mov     r11d,2

coreclr!SVR::memcopy+0x49 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 1724]:
 1724 00007ffe`77364041 4b8b0411        mov     rax,qword ptr [r9+r10]
 1724 00007ffe`77364045 498901          mov     qword ptr [r9],rax
 1724 00007ffe`77364048 4d8d4908        lea     r9,[r9+8]
 1724 00007ffe`7736404c 4983eb01        sub     r11,1
 1724 00007ffe`77364050 75ef            jne     coreclr!SVR::memcopy+0x49 (00007ffe`77364041)

coreclr!SVR::memcopy+0x5a [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 1726]:
 1726 00007ffe`77364052 4883c110        add     rcx,10h
 1727 00007ffe`77364056 4883c210        add     rdx,10h

coreclr!SVR::memcopy+0x62 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 1731]:
 1731 00007ffe`7736405a 41f6c008        test    r8b,8
 1731 00007ffe`7736405e 7406            je      coreclr!SVR::memcopy+0x6e (00007ffe`77364066)

coreclr!SVR::memcopy+0x68 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 1733]:
 1733 00007ffe`77364060 488b02          mov     rax,qword ptr [rdx]
 1733 00007ffe`77364063 488901          mov     qword ptr [rcx],rax

coreclr!SVR::memcopy+0x6e [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 1735]:
 1735 00007ffe`77364066 c3              ret

Compiler Explorer shows x64 msvc v19.30 VS17.0 -O1 doing this kind of conversion, but not -O2.

@KalleOlaviNiemitalo

coreclr!SVR::memcopy in .NET 6.0.2 for Windows AMD64 has likewise been de-unrolled (rerolled?) by the compiler.

coreclr!SVR::memcopy in .NET Core 2.1.30 for Windows AMD64 however preserves the unrolling.

@tannergooding
Member

O1 is shorthand for Minimize Size, so it makes sense that optimizations include finding duplicate patterns and merging them more aggressively.

O2 is shorthand for Maximize Speed and would be the thing to use if we want the best performance (albeit at the cost of larger codegen in many cases, naturally).

(https://learn.microsoft.com/en-us/cpp/build/reference/o1-o2-minimize-size-maximize-speed?view=msvc-170)

@KalleOlaviNiemitalo

I guess the next step should be to add some #pragma that prevents this specific function from being optimised for size, and measure how that change affects the throughput of an application.

@sfiruch
Contributor Author

sfiruch commented Dec 10, 2024

Calling memcpy might be easier than optimizing SVR::memcopy. Typical memcpy implementations use SIMD registers. I don't know enough about the GC-specific requirements, but the compiler-optimized version makes it look as if using SIMD registers should be possible.

@tannergooding
Member

SIMD may not be possible to use due to atomicity requirements.

For example, on x64 only 128-bit aligned vectors are guaranteed atomic; there is no such guarantee for 256-bit aligned vectors or unaligned vectors. This could theoretically result in tears of ptr-width values in various edge cases.

@sfiruch
Contributor Author

sfiruch commented Dec 10, 2024

> SIMD may not be possible to use due to atomicity requirements.
>
> For example, on x64 only 128-bit aligned vectors are guaranteed atomic; there is no such guarantee for 256-bit aligned vectors or unaligned vectors. This could theoretically result in tears of ptr-width values in various edge cases.

That's a good point! Even in that case 128-bit registers and unrolling might be applicable.

@tannergooding
Member

Sorry, missed a comment in there; they are only atomic if AVX is also implemented. They are not guaranteed atomic on downlevel hardware. So you effectively need to manually code it with a dynamic (cached) check for AVX support.

@mangod9
Member

mangod9 commented Dec 14, 2024

Hi @sfiruch, do you have a repro and/or traces you can share which show 50% time being spent in memcpy?

@sfiruch
Contributor Author

sfiruch commented Dec 14, 2024

> Hi @sfiruch, do you have a repro and/or traces you can share which show 50% time being spent in memcpy?

Sure thing! Here's a trace, with a ServerGC pre-selected: https://share.firefox.dev/4iAnIzg. It's from a data-processing pipeline that allocates a lot.

(I can also share the .etl if you prefer. Just let me know how to share it privately)

@mrsharm
Member

mrsharm commented Jan 9, 2025

Hi @sfiruch, just took a look at the link you sent and it seems like the % of samples from SVR::memcopy is 13% - it's probably best if you share the trace to [email protected]? We can take a look from there. Thanks!

@sfiruch
Contributor Author

sfiruch commented Jan 10, 2025

> For example, on x64 only 128-bit aligned vectors are guaranteed atomic; there is no such guarantee for 256-bit aligned vectors or unaligned vectors. This could theoretically result in tears of ptr-width values in various edge cases.

For context, according to the "Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A", a qword access is not guaranteed atomic if it spans a cache-line boundary. Does anyone know whether the memory blocks passed to SVR::memcopy can ever contain pointers spanning a cache-line boundary? If they can, even the current implementation might be broken.

Update: Just tried various things, and couldn't get the runtime to use unaligned pointers. It appears to be a non-issue.

TL;DR - The copy on x64 could use qword copies for the unaligned first/last elements and 128-bit SSE copies otherwise, ideally unrolled a bit.

@sfiruch
Contributor Author

sfiruch commented Jan 10, 2025

> Hi @sfiruch, just took a look at the link you sent and it seems like the % of samples from SVR::memcopy is 13% - it's probably best if you share the trace to [email protected]? We can take a look from there. Thanks!

Sure thing, I sent you the ETL via mail.

Also, did you notice that SVR::memcopy occurs three times in the flamegraph? The first one is only 13%, but including the other two occurrences it represents approx. 50%.
