Conversation

@artemsolod (Author)

To get things going with #56521 I've made a minimal implementation that mirrors one from numpy (https://github.com/numpy/numpy/blob/7c0e2e4224c6feb04a2ac4aa851f49a2c2f6189f/numpy/_core/src/multiarray/alloc.c#L113).

What this does: changes jl_gc_managed_malloc(size_t sz) to check whether the requested allocation is big enough to benefit from huge pages. If so, the allocation is page-aligned and the appropriate madvise (MADV_HUGEPAGE) is called on the memory pointer.
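
A minimal C sketch of that idea, modeled on the numpy code linked above; the helper name, the 4 MiB threshold, and the page-rounding details are illustrative assumptions rather than the PR's exact code:

#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

#define HUGEPAGE_THRESHOLD (1u << 22)  /* 4 MiB, the constant numpy hardcodes */

static void *managed_malloc_sketch(size_t sz)
{
    void *p = malloc(sz);
#ifdef MADV_HUGEPAGE
    if (p != NULL && sz >= HUGEPAGE_THRESHOLD) {
        /* madvise wants a page-aligned start, so round up to the next 4 KiB boundary */
        uintptr_t misalign = (uintptr_t)p % 4096u;
        uintptr_t offset = misalign ? 4096u - misalign : 0;
        if (sz > offset)
            madvise((char *)p + offset, sz - offset, MADV_HUGEPAGE); /* advisory; failure is harmless */
    }
#endif
    return p;
}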

For a simple "fill memory" test I see roughly a 2x timing improvement.

function f(N)
    mem = Memory{Int}(undef, N)
    mem .= 0
    mem[end]
end

f(1)
@time f(1_000_000)
0.001464 seconds (2 allocations: 7.633 MiB) # this branch
0.003431 seconds (2 allocations: 7.633 MiB) # master

I would appreciate help with this PR as I have no experience writing C code and little knowledge of Julia internals. In particular, I think it would make sense to have a startup option controlling the minimal eligible allocation size, which should default to the system's hugepage size; for this initial implementation the same constant as in numpy is hardcoded.
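
One possible way to pick that default at startup, assuming a Linux kernel that exposes hpage_pmd_size; the helper name and the 2 MiB fallback are mine, not part of the PR:

#include <stddef.h>
#include <stdio.h>

/* Read the transparent hugepage size reported by the kernel, falling back to 2 MiB. */
static size_t thp_page_size_sketch(void)
{
    size_t sz = (size_t)2 * 1024 * 1024;
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");
    if (f != NULL) {
        unsigned long long v = 0;
        if (fscanf(f, "%llu", &v) == 1 && v > 0)
            sz = (size_t)v;
        fclose(f);
    }
    return sz;
}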

@oscardssmith added the performance (Must go faster) and arrays labels on Oct 15, 2025
@Keno (Member) commented Oct 15, 2025

What kernel are you on? THP is usually automatic.

@oscardssmith (Member) commented

IIUC it requires alignment, so you don't get huge pages unless you ask for them. The transparent part is that they aren't specially segmented memory.

@Keno (Member) commented Oct 16, 2025

Huge pages always need to be aligned. Transparent means they're ordinary pages, rather than being mmap'd from hugetlb, which is the (very) old way to get huge pages. But regardless, a modern kernel should automatically assign huge pages to sufficiently large mappings that it thinks are used. My suspicion here is that the reported perf difference isn't actually due to huge pages, but rather that, for the initial allocation, the hugepage advice overrides the fault granularity. We might see even better performance by prefaulting the pages. However, if that's the case, then it's a more general concern and in particular is workload dependent. Does Python actually do this madvise by default?
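
For reference, one way to prefault a large anonymous mapping up front is MAP_POPULATE, which asks the kernel to fault the pages in at mmap time; this is a sketch of the idea mentioned above (assuming Linux), not code from the PR:

#include <stddef.h>
#include <sys/mman.h>

/* Allocate sz bytes and have the kernel prefault them, so the first write
 * doesn't pay the page-fault cost one page at a time. */
static void *alloc_prefaulted(size_t sz)
{
    void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}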

@artemsolod (Author) commented

@Keno, @oscardssmith thanks for looking into this!

I am testing on a dedicated server running Ubuntu 25.04, kernel 6.14:

uname -a
Linux ubuntu-c-8-intel-ams3-01 6.14.0-32-generic #32-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 29 14:21:26 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

From my experiments, the performance jump happens only when either an explicit madvise is called or /sys/kernel/mm/transparent_hugepage/enabled is set to always (by default it's set to madvise). I first suspected that using mmap to allocate could be sufficient, but this does not seem to work. Here is a test script comparing a manual madvise against the usual Julia memory allocation; it can be run on 1.12 or master.

import Mmap: MADV_HUGEPAGE

function memory_from_mmap(n)
    capacity = n*8
    # PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS  
    ptr = @ccall mmap(C_NULL::Ptr{Cvoid}, capacity::Csize_t, 3::Cint, 34::Cint, (-1)::Cint, 0::Csize_t)::Ptr{Cvoid}
    retcode = @ccall madvise(ptr::Ptr{Cvoid}, capacity::Csize_t, MADV_HUGEPAGE::Cint)::Cint
    iszero(retcode) || @warn "Madvise HUGEPAGE failed"

    ptr_int = convert(Ptr{Int}, ptr)
    mem = unsafe_wrap(Memory{Int}, ptr_int, n; own=false)
end

function f(N; with_mmap=false)
    if with_mmap
        mem = memory_from_mmap(N)
    else
        mem = Memory{Int}(undef, N)
    end
    mem .= 0
    mem[end]
end

f(1; with_mmap=true)
f(1; with_mmap=false)
N = 10_000_000
GC.enable(false)
@time f(N; with_mmap=true)  # 0.015535 seconds (1 allocation: 32 bytes)
@time f(N; with_mmap=false) # 0.043966 seconds (2 allocations: 76.297 MiB)

With echo always > /sys/kernel/mm/transparent_hugepage/enabled both versions are fast; with echo never > /sys/kernel/mm/transparent_hugepage/enabled both are slow. With the default, echo madvise > /sys/kernel/mm/transparent_hugepage/enabled, performance differs sharply depending on whether with_mmap=true.
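
For anyone reproducing this, the active mode is the bracketed word in that sysfs file; a small C sketch to print it (hypothetical helper, not related to the PR):

#include <stdio.h>
#include <string.h>

/* Print the active THP mode, i.e. the bracketed entry in
   /sys/kernel/mm/transparent_hugepage/enabled ("always", "madvise", or "never"). */
static void print_thp_mode(void)
{
    char buf[128] = {0};
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (f == NULL)
        return;
    if (fgets(buf, sizeof buf, f) != NULL) {
        char *lb = strchr(buf, '[');
        char *rb = lb ? strchr(lb, ']') : NULL;
        if (lb && rb) {
            *rb = '\0';
            printf("THP mode: %s\n", lb + 1);
        }
    }
    fclose(f);
}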

I've also tried commenting out the madvise call in this PR branch; that gives the same performance as master, i.e. it's slower again.

As for whether this is done in Python:

  • numpy definitely does it and relies on madvise being enabled in the system; the threshold is hardcoded (source and documentation).
  • CPython also has an explicit madvise (or rather, I see it in their mimalloc code, source). However, the mechanism is more sophisticated, and they mention in comments that they expect it not to be necessary:
      // Many Linux systems don't allow MAP_HUGETLB but they support instead
      // transparent huge pages (THP). Generally, it is not required to call `madvise` with MADV_HUGE
      // though since properly aligned allocations will already use large pages if available
      // in that case -- in particular for our large regions (in `memory.c`).
      // However, some systems only allow THP if called with explicit `madvise`, so
      // when large OS pages are enabled for mimalloc, we call `madvise` anyways.

@oscardssmith (Member) commented

Seems like it's almost a bug that this doesn't just work by default, but 2x perf is 2x perf, so I say we merge this with a note that once Linux starts doing the not-dumb thing by default we can delete it.

@gbaraldi (Member) commented

I'm confused why glibc isn't doing this, but then again their allocator is middling at best.
