[python] Python is 2x~3x slower than official binary on simple benchmarks (gcc emutls) #22917
Comments
you are not using the native python, try:
|
The python at I installed the UCRT64-specific one according to your recommendation and it has the same problem, just slightly less bad:
|
Ok, thanks |
It seems to be a gcc vs clang thing, no idea why:
|
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80881 (Implement Windows native TLS) edit: I've heard that hopefully we'll see some progress there in the near future. |
Anyone who cares about performance should switch to clang64, instead of continuing to use mingw64/ucrt64 due to some outdated information. MSYS2 should make clang64 the new default environment. |
That is a very premature conclusion to draw from gcc not natively supporting thread local storage. Put that enhancement in and it'll do just fine. |
TLS is just a drop in the ocean of features not supported by GCC/BFD/libstdc++:
- GCC doesn't support AVX512 correctly on Windows
- GCC/BFD LTO on Windows may ICE at any time
- BFD --gc-sections doesn't work on Windows and ICF is not supported
- GCC doesn't support Windows ControlFlowGuard
- GCC has no sanitizers that work on Windows
- Win32 thread model for libstdc++ still not enabled in MSYS2
|
gcc also has the MCF thread model. The only reason neither it nor win32 is enabled by default is that everything would probably break if the thread model were suddenly swapped out like that. If you want the win32 thread model, you can just configure with --enable-threads=win32 and compile gcc, like I do. The main point is that this isn't the fault of gcc; it's a choice made by MSYS2, for good reason. Besides, doesn't MSYS2 clang also use POSIX threads, making that point moot? In addition:
The only thing that can get annoying is the lack of sanitizer support with gcc, which I plan to rectify once thread local storage support goes in |
Don't you lose threading support in libstdc++ if you use Win32 threads? At least in the past, this was why everyone used winpthreads.
As soon as you enable function sections in Rust, you will see a huge number of crashes in the test suite. Clang works fine: https://github.com/rust-lang/rust/blob/a8953d83cfcb7caacc8d68951a32455f28265467/compiler/rustc_target/src/spec/base/windows_gnu.rs#L77
LTO crashes are certainly a thing. Just search in this repository: https://github.com/msys2/MINGW-packages/issues?q=is%3Aissue+is%3Aopen+lto
That said, this discussion is not related to the issue. Maybe we can move this topic somewhere else so as not to pollute this bug report? |
clang64 libc++ uses win32 threads directly, and the “Thread model: posix” shown by clang itself is just a legacy.
Simple small projects can usually enable LTO, but large projects with complex nested dependencies will always ICE, you can verify this with btbn's FFmpeg build tool. https://github.com/BtbN/FFmpeg-Builds
https://sourceware.org/bugzilla/show_bug.cgi?id=11539 |
Alright, let's move it off this issue, as mati says. Although, just wanted to point out that win32 threading was fixed in gcc 13 and it now fully supports libstdc++ threads. At least with the Windows JDK, LTO hasn't really been an issue, I'm not sure how big of a project the JDK counts as when compared to others |
GCC with native TLS has been available since December: https://gcc-mcf.lhmouse.com/ |
|
That's an unexpectedly big difference between new GCC and Clang. Can you verify it with some kind of benchmarking tool that takes care of caching?
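One way to separate cold-start cost from steady-state cost is to discard a few untimed warm-up runs before measuring; hyperfine offers this through its `--warmup` option. The sketch below does the same thing by hand. The helper name `time_command` and the default counts are illustrative assumptions, not something taken from this thread.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, warmup=3, runs=10):
    """Run cmd `warmup` times untimed, then return wall-clock seconds for each of `runs` timed runs."""
    # Warm-up runs let the OS cache the executable and its libraries.
    for _ in range(warmup):
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return samples


if __name__ == "__main__":
    # sys.executable is the interpreter running this script;
    # substitute the path of another Python build to compare the two.
    samples = time_command([sys.executable, "-c", "pass"])
    print(f"mean {statistics.mean(samples) * 1000:.1f} ms, "
          f"min {min(samples) * 1000:.1f} ms")
```

Comparing the first untimed run against the later samples gives a rough idea of how much of the gap is cold-start overhead.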
|
Using a toy script from msys2#22917, this reduces the time it takes from 723.7 ms to 670.4 ms on my PC.
@mati865 Yes it seems mostly the cost of a cold start. However after warming up there's still a minor difference:
|
Description / Steps to reproduce the issue
Using a numeric pi calculation microbenchmark, timed using hyperfine (but time works ok too):
Benchmark program:
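The original benchmark listing is not reproduced here. As a stand-in, here is a minimal sketch of a numeric pi microbenchmark, assuming a plain Leibniz-series loop. The function name `leibniz_pi` and the iteration count are illustrative and not taken from the issue.

```python
import time


def leibniz_pi(terms: int) -> float:
    """Approximate pi with the Leibniz series: 4 * sum((-1)^k / (2k + 1))."""
    total = 0.0
    sign = 1.0
    for k in range(terms):
        total += sign / (2 * k + 1)
        sign = -sign
    return 4.0 * total


if __name__ == "__main__":
    # Hypothetical workload size; tune it so one run takes a measurable time.
    start = time.perf_counter()
    approx = leibniz_pi(1_000_000)
    elapsed = time.perf_counter() - start
    print(f"pi ~= {approx:.6f} in {elapsed * 1000:.1f} ms")
```

A tight pure-Python loop like this spends nearly all its time in the interpreter's evaluation loop, so it is the kind of workload where per-access interpreter overhead would be visible.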
Expected behavior
Roughly same performance.
Actual behavior
Wildly different performance.
Verification
Windows Version
MINGW64_NT-10.0-19045
Are you willing to submit a PR?
No response