Madvise Transparent Huge Pages for large allocations #59858
Conversation
What kernel are you on? THP is usually automatic.
IIUC it requires alignment, so you don't get it unless you ask for it. The transparent part is that they aren't specially segmented memory.
Huge pages always need to be aligned.
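(Context, for readers not familiar with the kernel side: whether anonymous memory gets transparent huge pages automatically depends on the system-wide THP mode. In "always" mode the kernel tries on its own; in "madvise" mode only regions explicitly advised with MADV_HUGEPAGE are eligible, and they still need to be large enough and suitably aligned. A minimal sketch for checking the mode and the PMD huge-page size on a given machine, assuming the standard Linux sysfs paths are present:)

```c
/* Sketch: print the current THP mode and the PMD huge-page size.
 * Linux-only; assumes the usual sysfs files exist. */
#include <stdio.h>

static void print_file(const char *path)
{
    char buf[256];
    FILE *f = fopen(path, "r");
    if (!f) {
        perror(path);
        return;
    }
    if (fgets(buf, sizeof(buf), f))
        printf("%s: %s", path, buf);
    fclose(f);
}

int main(void)
{
    /* "[always] madvise never" means THP is applied automatically;
     * "always [madvise] never" means only madvise'd regions get it. */
    print_file("/sys/kernel/mm/transparent_hugepage/enabled");
    /* Size of a PMD-backed huge page (typically 2 MiB on x86-64). */
    print_file("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size");
    return 0;
}
```

If the mode is "madvise", an explicit madvise(MADV_HUGEPAGE), as in the test below, is what opts an allocation in.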
@Keno, @oscardssmith thanks for looking into this! I am testing on a dedicated server running Ubuntu 25.04, kernel 6.14.
From my experiments the performance jump happens only when madvise is called explicitly. Here is the test I used:
import Mmap: MADV_HUGEPAGE
function memory_from_mmap(n)
capacity = n*8
# PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS
ptr = @ccall mmap(C_NULL::Ptr{Cvoid}, capacity::Csize_t, 3::Cint, 34::Cint, (-1)::Cint, 0::Csize_t)::Ptr{Cvoid}
retcode = @ccall madvise(ptr::Ptr{Cvoid}, capacity::Csize_t, MADV_HUGEPAGE::Cint)::Cint
iszero(retcode) || @warn "Madvise HUGEPAGE failed"
ptr_int = convert(Ptr{Int}, ptr)
mem = unsafe_wrap(Memory{Int}, ptr_int, n; own=false)
end
function f(N; with_mmap=false)
if with_mmap
mem = memory_from_mmap(N)
else
mem = Memory{Int}(undef, N)
end
mem .= 0
mem[end]
end
f(1; with_mmap=true)
f(1; with_mmap=false)
N = 10_000_000
GC.enable(false)
@time f(N; with_mmap=true) # 0.015535 seconds (1 allocation: 32 bytes)
@time f(N; with_mmap=false) # 0.043966 seconds (2 allocations: 76.297 MiB)

I've also tried commenting out the madvise call. As for whether this is done in Python: numpy does it for large allocations (see the link in the PR description below).
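(Aside: rather than inferring huge-page use from timings alone, one can check AnonHugePages in /proc/self/smaps_rollup after touching the memory. A rough, self-contained C sketch of that check, assuming a Linux kernel recent enough to expose smaps_rollup; the 64 MiB size is illustrative:)

```c
/* Sketch: madvise a large anonymous mapping for huge pages, fault it in,
 * then report how much of the process is backed by anonymous THP. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64UL << 20; /* 64 MiB, well above a 2 MiB huge page */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        perror("madvise"); /* e.g. kernel built without THP support */
    memset(p, 0, len);     /* fault the pages in */

    /* AnonHugePages should be non-zero if THP was actually applied. */
    FILE *f = fopen("/proc/self/smaps_rollup", "r");
    char line[256];
    while (f && fgets(line, sizeof(line), f))
        if (strncmp(line, "AnonHugePages:", 14) == 0)
            fputs(line, stdout);
    if (f) fclose(f);
    munmap(p, len);
    return 0;
}
```

With the madvise call removed (and THP not set to "always"), AnonHugePages would typically stay at 0 kB.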
Seems like it's almost a bug that this doesn't just work by default, but 2x perf is 2x perf, so I say we merge this with a note that when linux starts doing the not dumb thing by default we can delete it.
I'm confused why glibc isn't doing this, but then again their allocator is middling at best.
To get things going with #56521 I've made a minimal implementation that mirrors one from numpy (https://github.com/numpy/numpy/blob/7c0e2e4224c6feb04a2ac4aa851f49a2c2f6189f/numpy/_core/src/multiarray/alloc.c#L113).
What this does: changes jl_gc_managed_malloc(size_t sz) to check whether the requested allocation is big enough to benefit from huge pages. If so, we ensure the allocation is page aligned and then call madvise appropriately on the memory pointer. For a simple "fill memory" test I see around a 2x timing improvement.
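(For readers who don't want to chase the numpy link, the mirrored logic amounts to something like the rough sketch below. This is not the actual diff to jl_gc_managed_malloc; the constants are the illustrative ones numpy hardcodes, a 4 MiB cutoff with the advised range rounded up to a page boundary, and the function name is made up for the example.)

```c
/* Sketch of the numpy-style approach: for sufficiently large allocations,
 * advise the kernel to back the page-aligned part of the block with
 * transparent huge pages. Best-effort only. */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

#define PAGE_SIZE_BYTES    4096u
#define HUGEPAGE_THRESHOLD (1u << 22) /* 4 MiB, the constant numpy uses */

static void *large_malloc_sketch(size_t sz)
{
    void *p = malloc(sz);
    if (p != NULL && sz >= HUGEPAGE_THRESHOLD) {
        /* Advise only the page-aligned part of the block; THP applies
         * to whole, properly aligned pages. */
        uintptr_t offset = PAGE_SIZE_BYTES - ((uintptr_t)p % PAGE_SIZE_BYTES);
        /* Ignore errors (old kernel, THP disabled): the allocation then
         * simply stays on regular pages. */
        madvise((void *)((uintptr_t)p + offset), sz - offset, MADV_HUGEPAGE);
    }
    return p;
}

int main(void)
{
    /* Example: a 64 MiB buffer crosses the threshold and gets advised. */
    void *buf = large_malloc_sketch(64u << 20);
    free(buf);
    return 0;
}
```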
I would appreciate help with this PR as I have no experience writing C code and little knowledge of Julia internals. In particular, I think it would make sense to have a startup option controlling the minimal eligible allocation size, defaulting to the system's hugepage size; for this initial implementation the same constant as in numpy is hardcoded.