[GC] Could SVR::memcopy be more efficient? #110571
Comments
Tagging subscribers to this area: @dotnet/gc
The loop in the SVR::memcopy C++ code of .NET 9.0.0 is unrolled to copy four pointers per iteration, i.e. 4×64 bits on amd64; but disassembling the Windows binary shows the compiler has replaced those four assignments with a nested loop that runs four iterations.
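For reference, the loop in question has roughly this shape (a simplified sketch, not the exact runtime source; the parameter types and the tail handling are approximations, and the real code in gc.cpp is linked further down this thread):

```cpp
#include <cstdint>
#include <cstddef>

// Simplified sketch of the unrolled pointer-wise copy in SVR::memcopy.
// The issue reports that MSVC re-rolls the four assignments below into an
// inner loop of four iterations, losing the hand-unrolling.
inline void memcopy_sketch(uint8_t* dmem, uint8_t* smem, size_t size)
{
    const size_t sz4ptr = sizeof(uintptr_t) * 4;

    while (size >= sz4ptr)
    {
        ((uintptr_t*)dmem)[0] = ((uintptr_t*)smem)[0];
        ((uintptr_t*)dmem)[1] = ((uintptr_t*)smem)[1];
        ((uintptr_t*)dmem)[2] = ((uintptr_t*)smem)[2];
        ((uintptr_t*)dmem)[3] = ((uintptr_t*)smem)[3];
        dmem += sz4ptr;
        smem += sz4ptr;
        size -= sz4ptr;
    }
    // (the real code also handles 2-pointer and 1-pointer tails)
}
```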
Compiler Explorer with x64 MSVC v19.30 (VS 17.0) shows the same re-rolling.
coreclr!SVR::memcopy in .NET 6.0.2 for Windows AMD64 has likewise been de-unrolled (re-rolled?) by the compiler. coreclr!SVR::memcopy in .NET Core 2.1.30 for Windows AMD64, however, preserves the unrolling.
I guess the next step should be to add some …
Calling …
It may not be possible to use SIMD due to atomicity requirements. For example, on x64 only 128-bit aligned vector accesses are guaranteed atomic; there is no such guarantee for 256-bit aligned vectors or for unaligned vectors. This could theoretically result in tearing of the copied object references.
That's a good point! Even in that case, 128-bit registers and unrolling might be applicable.
Sorry, missed a comment in there; 128-bit accesses are only guaranteed atomic if AVX is also implemented. They are not guaranteed atomic on downlevel hardware. So you effectively need to code it manually, with a dynamic (cached) check for AVX support.
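A cached AVX check along those lines could look roughly like this (a sketch using MSVC intrinsics, since the issue concerns the Windows build; the function name is made up and this is not runtime code):

```cpp
#include <intrin.h>

// Hypothetical cached check for AVX support, as suggested above.
// The static local is initialized once (thread-safe since C++11),
// so the CPUID/XGETBV cost is paid only on the first call.
static bool CpuSupportsAvx()
{
    static const bool s_hasAvx = []
    {
        int regs[4];
        __cpuid(regs, 1);
        const bool osxsave = (regs[2] & (1 << 27)) != 0; // OS uses XSAVE/XRSTOR
        const bool avx     = (regs[2] & (1 << 28)) != 0; // AVX instructions present
        if (!osxsave || !avx)
            return false;
        // Verify the OS actually saves/restores XMM and YMM state (XCR0 bits 1 and 2).
        const unsigned long long xcr0 = _xgetbv(0);
        return (xcr0 & 0x6) == 0x6;
    }();
    return s_hasAvx;
}
```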
Hi @sfiruch, do you have a repro and/or traces you can share which show 50% of the time being spent in memcpy?
Sure thing! Here's a trace, with a ServerGC thread pre-selected: https://share.firefox.dev/4iAnIzg. It's from a data-processing pipeline that allocates a lot. (I can also share the ETL directly, if that helps.)
Hi @sfiruch, I just took a look at the link you sent, and it seems like the % of samples from SVR::memcopy is 13% - it's probably best if you share the trace with [email protected]? We can take a look from there. Thanks!
For context, according to the "Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A", a qword access is not guaranteed atomic if it spans a cache-line boundary. Does anyone know whether the runtime ever passes memory blocks to SVR::memcopy in which a pointer spans a cache-line boundary? If that can happen, even the current implementation might be broken.

Update: I just tried various things and couldn't get the runtime to use unaligned pointers. It appears to be a non-issue.

TL;DR: The copy on x64 could use qword copies for the unaligned first/last elements and copy the rest with 128-bit SSE, ideally unrolled a bit.
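A rough sketch of that scheme (not runtime code; the function name is invented, and it assumes size is a multiple of the pointer size, as in the GC's usage - the AVX atomicity caveat from the earlier comments still applies to the 128-bit part):

```cpp
#include <cstdint>
#include <cstddef>
#include <emmintrin.h> // SSE2 intrinsics

static void memcopy_sse_sketch(uint8_t* dst, const uint8_t* src, size_t size)
{
    // Head: pointer-sized copies until the destination is 16-byte aligned.
    while (size >= sizeof(uintptr_t) && ((uintptr_t)dst & 0xF) != 0)
    {
        *(uintptr_t*)dst = *(const uintptr_t*)src;
        dst += sizeof(uintptr_t); src += sizeof(uintptr_t); size -= sizeof(uintptr_t);
    }

    // Body: 128-bit copies, two pointers per iteration (unaligned load,
    // aligned store, since only dst alignment was established above).
    while (size >= 16)
    {
        _mm_store_si128((__m128i*)dst, _mm_loadu_si128((const __m128i*)src));
        dst += 16; src += 16; size -= 16;
    }

    // Tail: remaining pointer-sized words.
    while (size >= sizeof(uintptr_t))
    {
        *(uintptr_t*)dst = *(const uintptr_t*)src;
        dst += sizeof(uintptr_t); src += sizeof(uintptr_t); size -= sizeof(uintptr_t);
    }
}
```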
Sure thing, I sent you the ETL via mail. Also, did you notice that SVR::memcopy occurs three times in the flamegraph? The first one is only 13%, but including the other two occurrences it represents approx. 50%. |
The ServerGC threads in my application spend approx. 50% of their time in SVR::memcopy:
https://github.com/dotnet/runtime/blob/731a96b0018cda0c67bb3fab7a4d06538fe2ead5/src/coreclr/gc/gc.cpp#L1746-L1755
In x64, the relevant loop compiles down to the re-rolled, word-by-word copy described above. If I'm not mistaken, this is a regular memcpy, just without vectorization or other optimizations? Perhaps SVR::memcopy could be implemented with memcpy instead, which is more optimized?