Threadpool: poor performance scaling on MacOS / Apple Silicon #515

Open
mratsim opened this issue Jan 8, 2025 · 0 comments

mratsim commented Jan 8, 2025

On an M4 Max (12 performance cores and 4 efficiency cores), MSM and batch BLS verification scale poorly: the parallel speedup does not exceed 8x.

In fact, spawning 16 threads is slower than spawning 12 threads.

Apple documents a QoS API, but it doesn't help: https://developer.apple.com/library/archive/documentation/Performance/Conceptual/power_efficiency_guidelines_osx/PrioritizeWorkAtTheTaskLevel.html

This also seems to affect Go, as Gnark suffers from similar scaling woes:

  • 32768: speedup parallel vs serial is 9.53x
  • 131072: speedup is 8.74x
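
To put these speedups in perspective, parallel efficiency is speedup divided by thread count. The sketch below assumes the Gnark numbers were measured on the same 12P + 4E machine with either 12 or 16 threads; the thread count is not stated above, so both are computed. `parallelEfficiency` is a hypothetical helper for illustration only.

```nim
import std/strformat

proc parallelEfficiency(speedup: float, threads: int): float =
  ## Fraction of ideal linear scaling achieved (1.0 would be perfect scaling).
  speedup / float(threads)

when isMainModule:
  # Speedups quoted above for input sizes 32768 and 131072.
  for (size, speedup) in [(32768, 9.53), (131072, 8.74)]:
    # Assumed thread counts: P-cores only (12) or all cores (16).
    for threads in [12, 16]:
      let eff = parallelEfficiency(speedup, threads)
      echo &"size={size} threads={threads} efficiency={eff:.2f}"
```

Under either assumption the efficiency stays well below 1.0, which matches the scaling problem described here.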

Users of Microsoft .NET report issues as well (dotnet/runtime#59866), as do OpenSCAD users when focus changes (openscad/openscad#4850).

Raising the QoS class of the threads doesn't seem to help either. Code:

when defined(ios) or defined(macosx) or defined(macos):
  # On Apple Silicon, the threadpool performance is bad.
  # In fact, on an M4 Max with 12P + 4E, spawning only 12 threads improves performance.
  # This seems to be a recurring problem: https://github.com/dotnet/runtime/issues/59866
  #
  # We increase QoS of the threadpool from default to user-initiated.
  # https://developer.apple.com/library/archive/documentation/Performance/Conceptual/power_efficiency_guidelines_osx/PrioritizeWorkAtTheTaskLevel.html
  # https://github.com/apple-oss-distributions/libpthread/blob/libpthread-535/include/sys/qos.h#L130-L143

  type MacOS_QOS_Class {.size: sizeof(cint).} = enum
    qosUnspecified = 0x00
    qosBackground = 0x09
    qosUtility = 0x11
    qosDefault = 0x15
    qosUserInitiated = 0x19
    qosUserInteractive = 0x21

  proc pthread_set_qos_class_self_np(qosClass: MacOS_QOS_Class, relativePrio: cint = 0): cint {.importc, header: "<pthread.h>".}
    ## Sets the requested QOS class and relative priority of the current thread.
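
For reference, here is a minimal sketch of how the binding above might be called. The snippet repeats the declarations so it compiles on its own; `setWorkerQoS` is a hypothetical helper name, not something that exists in the threadpool, and the no-op fallback on other platforms is an assumption made so the sketch builds everywhere. The call only affects the calling thread, so it would have to run at the start of each worker thread.

```nim
when defined(ios) or defined(macosx) or defined(macos):
  type MacOS_QOS_Class {.size: sizeof(cint).} = enum
    qosUnspecified = 0x00
    qosBackground = 0x09
    qosUtility = 0x11
    qosDefault = 0x15
    qosUserInitiated = 0x19
    qosUserInteractive = 0x21

  proc pthread_set_qos_class_self_np(qosClass: MacOS_QOS_Class, relativePrio: cint = 0): cint {.importc, header: "<pthread.h>".}

  proc setWorkerQoS(): cint =
    # Hypothetical helper: raise the calling thread from default to user-initiated.
    # Returns 0 on success, otherwise an errno-style error code.
    pthread_set_qos_class_self_np(qosUserInitiated, 0)
else:
  proc setWorkerQoS(): cint =
    # Assumed fallback: no-op on platforms without the Apple QoS API.
    0

when isMainModule:
  # In a threadpool this call would sit at the top of each worker's entry proc.
  let err = setWorkerQoS()
  if err != 0:
    echo "could not raise QoS, error code: ", err
```

Checking the return value at least rules out the call silently failing as the reason QoS makes no difference.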

Unfortunately, Asahi Linux only supports M1/M2, so I can't test on Linux to rule out the OS.
