[python] Python is 2x~3x slower than official binary on simple benchmarks (gcc emutls) #22917

Open
1 task done
wareya opened this issue Dec 24, 2024 · 18 comments

@wareya

wareya commented Dec 24, 2024

Description / Steps to reproduce the issue

Using a numeric pi calculation microbenchmark, timed using hyperfine (but time works ok too):

wareya@Toriaezu UCRT64 ~/dev/flinch
$ hyperfine.exe '"/c/Program Files/Python310/python.exe" etc/too_simple.py' "python etc/too_simple.py" --warmup 3
Benchmark 1: "C:/Program Files/Python310/python.exe" etc/too_simple.py
  Time (mean ± σ):     587.7 ms ±   5.0 ms    [User: 580.9 ms, System: 8.1 ms]
  Range (min … max):   581.3 ms … 599.0 ms    10 runs

Benchmark 2: python etc/too_simple.py
  Time (mean ± σ):      1.673 s ±  0.029 s    [User: 1.659 s, System: 0.010 s]
  Range (min … max):    1.649 s …  1.730 s    10 runs

Summary
  "C:/Program Files/Python310/python.exe" etc/too_simple.py ran
    2.85 ± 0.06 times faster than python etc/too_simple.py

wareya@Toriaezu UCRT64 ~/dev/flinch
$ pacman -Q|grep python
python 3.12.8-1

wareya@Toriaezu UCRT64 ~/dev/flinch
$ which python
/usr/bin/python

wareya@Toriaezu UCRT64 ~/dev/flinch
$ python -c "import sysconfig; print(sysconfig.get_config_var('CFLAGS'))"
-fno-strict-overflow -Wsign-compare -DNDEBUG -g -O3 -Wall -march=nocona -msahf -mtune=generic -O2 -pipe -march=nocona -msahf -mtune=generic -O2 -pipe

Benchmark program:

#!/usr/bin/env python

def main():
    sumval = 0.0
    flip = -1.0
    for i in range(1, 10000001):
        flip = -flip
        sumval += flip / ((i << 1) - 1)
    print(f"{sumval * 4.0:.16f}")

if __name__ == "__main__":
    main()
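
The hyperfine numbers above include interpreter startup. A minimal sketch for timing only the loop inside the process, reusing the same arithmetic as the benchmark program; the helper names `leibniz_pi` and `time_loop` are made up for this illustration and are not part of the original script:

```python
import time


def leibniz_pi(terms):
    """Same Leibniz series as the benchmark program, with a configurable length."""
    sumval = 0.0
    flip = -1.0
    for i in range(1, terms + 1):
        flip = -flip
        sumval += flip / ((i << 1) - 1)
    return sumval * 4.0


def time_loop(terms=10_000_000, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs of the loop."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        leibniz_pi(terms)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    print(f"best of 3: {time_loop() * 1000:.1f} ms")
```

Running this file with each interpreter should show whether the gap is in the bytecode loop itself rather than in process startup.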

Expected behavior

Roughly same performance.

Actual behavior

Wildly different performance.

Verification

Windows Version

MINGW64_NT-10.0-19045

Are you willing to submit a PR?

No response

@wareya wareya added the bug label Dec 24, 2024
@lazka
Member

lazka commented Dec 24, 2024

you are not using the native python, try: pacman -S mingw-w64-ucrt-x86_64-python

$ which python
/ucrt64/bin/python

@wareya
Author

wareya commented Dec 24, 2024

The python at /usr/bin/python is msys2's python; that's where the python from the root-level python package goes.

I installed the UCRT64-specific one according to your recommendation and it has the same problem, just slightly less bad:

wareya@Toriaezu UCRT64 ~/dev/flinch
$ which python
/ucrt64/bin/python

wareya@Toriaezu UCRT64 ~/dev/flinch
$ hyperfine.exe '"/c/Program Files/Python310/python.exe" etc/too_simple.py' "/ucrt64/bin/python etc/too_simple.py" "/usr/bin/python etc/too_simple.py" --warmup 3
Benchmark 1: "C:/Program Files/Python310/python.exe" etc/too_simple.py
  Time (mean ± σ):     641.3 ms ±  25.1 ms    [User: 624.7 ms, System: 12.8 ms]
  Range (min … max):   623.7 ms … 710.0 ms    10 runs

Benchmark 2: C:/msys64/ucrt64/bin/python etc/too_simple.py
  Time (mean ± σ):      1.401 s ±  0.009 s    [User: 1.382 s, System: 0.007 s]
  Range (min … max):    1.390 s …  1.421 s    10 runs

Benchmark 3: C:/msys64/usr/bin/python etc/too_simple.py
  Time (mean ± σ):      1.815 s ±  0.024 s    [User: 1.789 s, System: 0.013 s]
  Range (min … max):    1.789 s …  1.855 s    10 runs

Summary
  "C:/Program Files/Python310/python.exe" etc/too_simple.py ran
    2.18 ± 0.09 times faster than C:/msys64/ucrt64/bin/python etc/too_simple.py
    2.83 ± 0.12 times faster than C:/msys64/usr/bin/python etc/too_simple.py

wareya@Toriaezu UCRT64 ~/dev/flinch
$ time python etc/too_simple.py
3.1415925535897915

real    0m1.852s
user    0m1.781s
sys     0m0.000s

/usr/bin/python being the root/msys2-level python package:

wareya@Toriaezu UCRT64 ~/dev/flinch
$ pacman -R python
checking dependencies...
:: git optionally requires python: various helper scripts
:: subversion optionally requires python: for some hook scripts

Packages (1) python-3.12.8-1

Total Removed Size:  182.76 MiB

:: Do you want to remove these packages? [Y/n]
:: Processing package changes...
(1/1) removing python                                                                                [###########################################################] 100%

wareya@Toriaezu UCRT64 ~/dev/flinch
$ ls /usr/bin/python
ls: cannot access '/usr/bin/python': No such file or directory

wareya@Toriaezu UCRT64 ~/dev/flinch
$ pacman -S python
resolving dependencies...
looking for conflicting packages...

Packages (1) python-3.12.8-1

Total Installed Size:  182.76 MiB

:: Proceed with installation? [Y/n]
(1/1) checking keys in keyring                                                                       [###########################################################] 100%
(1/1) checking package integrity                                                                     [###########################################################] 100%
(1/1) loading package files                                                                          [###########################################################] 100%
(1/1) checking for file conflicts                                                                    [###########################################################] 100%
(1/1) checking available disk space                                                                  [###########################################################] 100%
:: Processing package changes...
(1/1) installing python                                                                              [###########################################################] 100%

wareya@Toriaezu UCRT64 ~/dev/flinch
$ ls /usr/bin/python
/usr/bin/python

@lazka
Member

lazka commented Dec 24, 2024

Ok, thanks

@lazka
Member

lazka commented Dec 24, 2024

It seems to be a gcc vs clang thing, no idea why:

Benchmark 1: C:/Python312/python.exe too_simple.py
  Time (mean ± σ):      1.203 s ±  0.010 s    [User: 1.182 s, System: 0.016 s]
  Range (min … max):    1.194 s …  1.221 s    10 runs

Benchmark 2: C:/msys64/ucrt64/bin/python too_simple.py
  Time (mean ± σ):      2.462 s ±  0.025 s    [User: 2.428 s, System: 0.021 s]
  Range (min … max):    2.408 s …  2.479 s    10 runs

Benchmark 3: C:/msys64/clang64/bin/python too_simple.py
  Time (mean ± σ):      1.123 s ±  0.006 s    [User: 1.094 s, System: 0.021 s]
  Range (min … max):    1.115 s …  1.136 s    10 runs

Benchmark 4: C:/msys64/mingw64/bin/python too_simple.py
  Time (mean ± σ):      2.471 s ±  0.014 s    [User: 2.430 s, System: 0.028 s]
  Range (min … max):    2.457 s …  2.493 s    10 runs
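
To confirm which compiler produced a given interpreter, a small sketch using only the standard library can be run with each `python` binary. The helper name `build_info` is an assumption for illustration; note that `sysconfig.get_config_var("CC")` may return `None` on builds that do not record it (for example the official Windows binaries):

```python
import platform
import sys
import sysconfig


def build_info():
    """Collect the interpreter path and the compiler it reports being built with."""
    return {
        "executable": sys.executable,
        "version": platform.python_version(),
        # e.g. a string mentioning GCC, Clang or MSC, depending on the build
        "compiler": platform.python_compiler(),
        # May be None when the build does not record a CC config variable
        "CC": sysconfig.get_config_var("CC"),
    }


if __name__ == "__main__":
    for key, value in build_info().items():
        print(f"{key}: {value}")
```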

@Morilli

Morilli commented Dec 27, 2024

I did some profiling using the Intel VTune profiler, and it looks like the main issue is related to thread local storage:
[image: VTune profiler screenshot]
A lot of functions are calling __emutls_get_address, which looks relatively expensive. I sadly don't know much about how TLS works in general or what might differ between compilers, but I cannot see such calls in either the official python executable or the clang variant.
[image: VTune profiler screenshot]

@lazka
Member

lazka commented Dec 27, 2024

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80881 (Implement Windows native TLS)

edit: I've heard that hopefully we'll see some progress there in the near future.

@lazka lazka transferred this issue from msys2/MSYS2-packages Dec 27, 2024
@lazka lazka changed the title from "Python is 2x~3x slower than official binary on simple benchmarks" to "[python] Python is 2x~3x slower than official binary on simple benchmarks (gcc emutls)" Dec 27, 2024
@TheShermanTanker

I did some profiling using the Intel VTune profiler, and it looks like the main issue is related to thread local storage: A lot of functions are calling __emutls_get_address which looks relatively expensive. I sadly don't know much about how tls works in general or what might differ between compilers, but I cannot see such calls in either the official python executable or the clang variant.

The difference is that gcc uses emulated TLS, while clang and VC are able to use the following assembly sequence to load TLS variables directly:

mov eax, DWORD PTR _tls_index[rip]
mov rcx, QWORD PTR gs:88
mov rax, QWORD PTR [rcx+rax*8]
mov eax, DWORD PTR local@secrel32[rax]

(The assembly above assumes we're loading a variable of type int named local)

I've been working on enabling native TLS, as it's called, for gcc, but it's a very breaking change, and some work needs to be done so that everything compiled by gcc doesn't suddenly break once it's enabled for MinGW.
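
As a rough illustration of what those four instructions compute, here is a toy model in Python. It is only a sketch: `ToyThread`, `native_tls_store_int` and `native_tls_load_int` are invented names, and the reading of `gs:88` as the per-thread array of TLS blocks (one block per module) is my interpretation of the sequence, not something stated in this thread. With emulated TLS, each such access instead goes through a call to `__emutls_get_address`.

```python
import struct


class ToyThread:
    """Stands in for one thread: holds an array of per-module TLS blocks."""

    def __init__(self, module_count, block_size):
        # Models the pointer array read via `mov rcx, QWORD PTR gs:88`
        self.tls_blocks = [bytearray(block_size) for _ in range(module_count)]


def native_tls_store_int(thread, tls_index, secrel_offset, value):
    """Write a 32-bit int into this thread's block for the given module."""
    struct.pack_into("<i", thread.tls_blocks[tls_index], secrel_offset, value)


def native_tls_load_int(thread, tls_index, secrel_offset):
    """Mirror the load: module index -> block array -> block -> value at offset."""
    block = thread.tls_blocks[tls_index]  # mov rax, QWORD PTR [rcx+rax*8]
    (value,) = struct.unpack_from("<i", block, secrel_offset)  # mov eax, local@secrel32[rax]
    return value


if __name__ == "__main__":
    t1 = ToyThread(module_count=2, block_size=16)
    t2 = ToyThread(module_count=2, block_size=16)
    native_tls_store_int(t1, tls_index=1, secrel_offset=4, value=42)
    # Same index and offset, different thread: each thread sees its own copy.
    print(native_tls_load_int(t1, 1, 4), native_tls_load_int(t2, 1, 4))
```

The point of the model is that the native path is just two indexed loads and no function call, which is why it is cheap on a hot path.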

@Andarwinux
Contributor

Anyone who cares about performance should switch to clang64, instead of continuing to use mingw64/ucrt64 due to some outdated information. MSYS2 should make clang64 the new default environment.

@TheShermanTanker

Anyone who cares about performance should switch to clang64, instead of continuing to use mingw64/ucrt64 due to some outdated information. MSYS2 should make clang64 the new default environment.

That is a very premature conclusion to draw from gcc not natively supporting thread local storage. Put that enhancement in and it'll do just fine.

@Andarwinux
Contributor

TLS is just a drop in the ocean of features not supported by GCC/BFD/libstdc++.

GCC doesn't support AVX512 correctly on Windows

GCC/BFD LTO on Windows may ICE at any time

BFD --gc-sections doesn't work on Windows and ICF is not supported

GCC doesn't support Windows ControlFlowGuard

GCC has no sanitizers that work on Windows

Win32 thread model for libstdc++ still not enabled in MSYS2

@TheShermanTanker

gcc also has the MCF thread model. The only reason it or win32 aren't enabled by default is that everything would probably break if the thread model were suddenly swapped out like that. If you want the win32 thread model, you can just compile gcc with --enable-threads=win32, like I do. The main point is that this isn't the fault of gcc; it's a choice made by MSYS2, for good reason. Besides, doesn't MSYS2 clang also use POSIX threads, making that point moot? In addition:

  • I use gcc LTO all the time. It isn't as unstable as it may seem, minus like 1 bug I was informed of once and then never heard about again (I still want to hunt that one down and fix it later anyway)
  • This is the first I've heard that --gc-sections doesn't work either

The only thing that can get annoying is the lack of sanitizer support with gcc, which I plan to rectify once thread local storage support goes in.

@mati865
Collaborator

mati865 commented Jan 1, 2025

Don't you lose threading support in libstdc++ if you use Win32 threads? At least in the past, this was why everyone used winpthreads.

As soon as you enable function sections in Rust, you will see a huge number of crashes in the test suite. Clang works fine: https://github.com/rust-lang/rust/blob/a8953d83cfcb7caacc8d68951a32455f28265467/compiler/rustc_target/src/spec/base/windows_gnu.rs#L77

LTO crashes are certainly a thing. Just search in this repository: https://github.com/msys2/MINGW-packages/issues?q=is%3Aissue+is%3Aopen+lto
Some of the issues, like #11726, are plain ICE.

That said, this discussion is not related to the issue. Maybe we can move this topic somewhere else not to pollute this bug report?

@Andarwinux
Contributor

gcc also has the MCF thread model- The only reason it or win32 aren't enabled by default is due to how everything would probably break if it were just suddenly swapped out like that. If you want the win32 thread model, you can just --enable-threads=win32 and compile gcc, like I do. Main point is that this isn't the fault of gcc, it's a choice made by MSYS2, for good reason

#20830 (comment)

doesn't MSYS2 clang also use POSIX threads, making that point moot? In addition

clang64 libc++ uses win32 threads directly, and the “Thread model: posix” shown by clang itself is just a legacy.

  • I use gcc LTO all the time. It isn't as unstable as it may seem, minus like 1 bug I was informed of once and then never heard about again (I still want to hunt that one down the fix it anyway later)

Simple small projects can usually enable LTO, but large projects with complex nested dependencies will always ICE; you can verify this with BtbN's FFmpeg build tool.

https://github.com/BtbN/FFmpeg-Builds
BtbN/FFmpeg-Builds@fcc6136

  • This is the first I've heard that --gc-sections doesn't work either

https://sourceware.org/bugzilla/show_bug.cgi?id=11539
https://reviews.llvm.org/D101568
The most common phenomenon is that when you pass -ffunction-sections -fdata-sections -Wl,--gc-sections the size of the final exe/dll increases.

@TheShermanTanker

Alright, let's move it off this issue, as mati says. I just wanted to point out, though, that win32 threading was fixed in gcc 13 and it now fully supports libstdc++ threads. At least with the Windows JDK, LTO hasn't really been an issue, though I'm not sure how big of a project the JDK counts as compared to others.

@lhmouse
Contributor

lhmouse commented Jan 8, 2025

GCC with native TLS has been available since December: https://gcc-mcf.lhmouse.com/
I rebuilt Python for the latest two builds, which you may want to have a look at.

@lhmouse
Contributor

lhmouse commented Jan 8, 2025

  • MSYS ~/Desktop $ time -p /ucrt64/bin/python test.py
    3.1415925535897915
    real 0.52
    user 0.01
    sys 0.00

  • MSYS ~/Desktop $ time -p /clang64/bin/python test.py
    3.1415925535897915
    real 0.87
    user 0.00
    sys 0.03

  • MSYS ~/Desktop $ time -p /usr/bin/python test.py
    3.1415925535897915
    real 2.78
    user 2.62
    sys 0.07

@mati865
Collaborator

mati865 commented Jan 8, 2025

That's an unexpectedly big difference between new GCC and Clang. Can you verify it with some kind of benchmarking tool that takes care of caching?
This is on a 5950X box:

$ hyperfine '/h/ucrt64/bin/python3.12 too_simple.py' '/h/msys64/clang64/bin/python3.12 too_simple.py' '/h/msys64/ucrt64/bin/python3.12 too_simple.py' '/c/Users/mateusz/AppData/Local/Microsoft/WindowsApps/python3.12 too_simple.py' --warmup 3
Benchmark 1: H:/ucrt64/bin/python3.12 too_simple.py
  Time (mean ± σ):     691.7 ms ±  32.9 ms    [User: 657.8 ms, System: 3.1 ms]
  Range (min … max):   671.8 ms … 780.4 ms    10 runs

Benchmark 2: H:/msys64/clang64/bin/python3.12 too_simple.py
  Time (mean ± σ):     723.7 ms ±  14.9 ms    [User: 693.8 ms, System: 4.6 ms]
  Range (min … max):   709.2 ms … 760.5 ms    10 runs

Benchmark 3: H:/msys64/ucrt64/bin/python3.12 too_simple.py
  Time (mean ± σ):      1.560 s ±  0.015 s    [User: 1.514 s, System: 0.009 s]
  Range (min … max):    1.538 s …  1.597 s    10 runs

Benchmark 4: C:/Users/mateusz/AppData/Local/Microsoft/WindowsApps/python3.12 too_simple.py
  Time (mean ± σ):     929.7 ms ±   9.7 ms    [User: 0.0 ms, System: 6.1 ms]
  Range (min … max):   916.0 ms … 945.0 ms    10 runs

Summary
  H:/ucrt64/bin/python3.12 too_simple.py ran
    1.05 ± 0.05 times faster than H:/msys64/clang64/bin/python3.12 too_simple.py
    1.34 ± 0.07 times faster than C:/Users/mateusz/AppData/Local/Microsoft/WindowsApps/python3.12 too_simple.py
    2.26 ± 0.11 times faster than H:/msys64/ucrt64/bin/python3.12 too_simple.py
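
The "N ± e times faster" figures in the summary are consistent with a ratio of means plus first-order error propagation for a quotient. That is an assumption on my part about how the summary is derived, not something checked against hyperfine's source. A sketch with a hypothetical helper `relative_speed`, fed the mean and σ values printed above:

```python
import math


def relative_speed(mean, stddev, ref_mean, ref_stddev):
    """Ratio of two means, with the uncertainty combined in quadrature."""
    ratio = mean / ref_mean
    err = ratio * math.sqrt((stddev / mean) ** 2 + (ref_stddev / ref_mean) ** 2)
    return ratio, err


if __name__ == "__main__":
    # Benchmark 3 (MSYS2 ucrt64, 1.560 s ± 0.015 s) against
    # Benchmark 1 (691.7 ms ± 32.9 ms), both in milliseconds.
    ratio, err = relative_speed(1560.0, 15.0, 691.7, 32.9)
    print(f"{ratio:.2f} ± {err:.2f}")  # prints "2.26 ± 0.11"
```

Most of the ±0.11 comes from the 32.9 ms spread of the faster run, so a longer benchmark or more runs would tighten the comparison.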

mati865 added a commit to mati865/MINGW-packages that referenced this issue Jan 9, 2025
Using toy script from msys2#22917
this reduces the time it takes from 723.7 ms to 670.4 ms on my PC.
@lhmouse
Contributor

lhmouse commented Jan 9, 2025

@mati865 Yes, it seems to be mostly the cost of a cold start. However, after warming up there's still a minor difference:

  • CLANG64 ~/Desktop $ hyperfine '/mingw64/bin/python too_simple.py' '/clang64/bin/python too_simple.py' '/ucrt64/bin/python too_simple.py' --warmup 3
    Benchmark 1: C:/MSYS64/mingw64/bin/python too_simple.py
    Time (mean ± σ): 526.7 ms ± 5.4 ms [User: 455.3 ms, System: 29.3 ms]
    Range (min … max): 516.9 ms … 531.5 ms 10 runs

    Benchmark 2: C:/MSYS64/clang64/bin/python too_simple.py
    Time (mean ± σ): 558.6 ms ± 4.2 ms [User: 489.7 ms, System: 27.7 ms]
    Range (min … max): 552.8 ms … 568.1 ms 10 runs

    Benchmark 3: C:/MSYS64/ucrt64/bin/python too_simple.py
    Time (mean ± σ): 572.9 ms ± 5.5 ms [User: 505.3 ms, System: 27.7 ms]
    Range (min … max): 563.5 ms … 577.7 ms 10 runs

    Summary
    C:/MSYS64/mingw64/bin/python too_simple.py ran
    1.06 ± 0.01 times faster than C:/MSYS64/clang64/bin/python too_simple.py
    1.09 ± 0.02 times faster than C:/MSYS64/ucrt64/bin/python too_simple.py
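
To see how much of a single `time` measurement is startup rather than the loop, one option is to time an interpreter that runs nothing at all. This is only a sketch: `startup_times` is a made-up helper, and treating the first sample as the "colder" start is an assumption, since the OS file cache may already be warm.

```python
import statistics
import subprocess
import sys
import time


def startup_times(executable, runs=5):
    """Time `executable -c pass` several times; returns seconds for each run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([executable, "-c", "pass"], check=True)
        samples.append(time.perf_counter() - start)
    return samples


if __name__ == "__main__":
    # Pass another interpreter path here to compare builds.
    samples = startup_times(sys.executable)
    print(f"first run: {samples[0] * 1000:.1f} ms")
    print(f"median of the rest: {statistics.median(samples[1:]) * 1000:.1f} ms")
```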

lazka pushed a commit that referenced this issue Jan 11, 2025
Using a toy script from #22917
this reduces the time it takes from 723.7 ms to 670.4 ms on my PC.