
Conversation

@ggerganov (Member) commented Sep 6, 2025

Implement async support for the Metal backend. This benefits very large contexts, since we can now prepare the KQ mask while the previous batch is being computed on the GPU.

| Model | Test | t/s master | t/s gg/metal-async | Speedup |
| --- | --- | --- | --- | --- |
| gemma3 4B Q8_0 | pp65536 | 1837.95 | 2115.10 | 1.15 |
| gpt-oss 20B MXFP4 MoE | pp65536 | 1095.32 | 1178.38 | 1.08 |
| qwen3 4B Q8_0 | pp65536 | 559.52 | 581.84 | 1.04 |

@github-actions bot added the ggml and Apple Metal labels on Sep 6, 2025
@slaren (Member) commented Sep 6, 2025

Making graph_compute async and implementing set_async and get_async could help, but cpy_tensor_async and events are only useful for multi-GPU scenarios.
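
For reference, the relevant entries of the backend interface look roughly like this (paraphrased from ggml-backend-impl.h from memory; treat the exact signatures as approximate):

```c
// approximate excerpt of struct ggml_backend_i (ggml-backend-impl.h) - only the
// entries discussed here, signatures reproduced from memory
struct ggml_backend_i {
    // async tensor I/O - useful even with a single GPU
    void (*set_tensor_async)(ggml_backend_t backend, struct ggml_tensor * tensor,
                             const void * data, size_t offset, size_t size);
    void (*get_tensor_async)(ggml_backend_t backend, const struct ggml_tensor * tensor,
                             void * data, size_t offset, size_t size);

    // direct copy between two backends - only truly async between GPUs
    bool (*cpy_tensor_async)(ggml_backend_t backend_src, ggml_backend_t backend_dst,
                             const struct ggml_tensor * src, struct ggml_tensor * dst);

    // wait for all queued async work on this backend to finish
    void (*synchronize)(ggml_backend_t backend);

    // returning before the GPU finishes is what makes the backend asynchronous
    enum ggml_status (*graph_compute)(ggml_backend_t backend, struct ggml_cgraph * cgraph);

    // event_record / event_wait omitted - only needed to order work across devices
};
```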

@ggerganov (Member Author) commented:

Thanks, that's good to know. I think I got confused by seeing cpy_tensor_async being called, but that is probably because the Metal backend does not register a host buffer type?

@slaren (Member) commented Sep 6, 2025

You can see a cpy_tensor_async call when copying the inputs from the CPU backend; however, that's not something that can really be handled by the backend, and you must return false in this case to indicate that it is not supported. This function is only truly asynchronous when copying tensors between GPUs.
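
Concretely, a backend that cannot perform the copy asynchronously just reports it as unsupported; a minimal illustrative sketch (not the actual ggml-metal code):

```c
// illustrative sketch, not the actual ggml-metal implementation
static bool ggml_backend_metal_cpy_tensor_async(ggml_backend_t backend_src,
                                                ggml_backend_t backend_dst,
                                                const struct ggml_tensor * src,
                                                struct ggml_tensor * dst) {
    (void) backend_src; (void) backend_dst; (void) src; (void) dst;
    // inputs coming from the CPU backend cannot be copied asynchronously here;
    // returning false makes the scheduler fall back to the synchronous copy path
    return false;
}
```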

@ggerganov (Member Author) commented:

Got it, thanks. The main motivation is to allow overlapping the llm_graph_result::set_inputs() for the next ubatch while the current ubatch is being processed. From testing with this branch, it looks like implementing set_async, get_async and synchronize should do the job.
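
Roughly, the pipelining being targeted looks like this (hedged pseudocode; prepare_inputs and the loop structure are placeholders, not actual llama.cpp code):

```c
// hedged pseudocode of the intended overlap - names are placeholders
for (int i = 0; i < n_ubatch; ++i) {
    // queue the graph for ubatch i; with an async backend this returns immediately
    ggml_backend_sched_graph_compute_async(sched, graph[i]);

    // while the GPU runs ubatch i, the CPU builds the KQ mask / inputs for ubatch i+1
    if (i + 1 < n_ubatch) {
        prepare_inputs(i + 1);   // ~ llm_graph_result::set_inputs() for the next ubatch
    }

    // block only when the results of ubatch i are actually needed
    ggml_backend_sched_synchronize(sched);
}
```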

@ggerganov (Member Author) commented:

@slaren Should I expect set_tensor_async to ever get called on a Mac? I got things working with get_tensor_async and synchronize, but I can't figure out a way to test that set_tensor_async works.

@slaren (Member) commented Sep 8, 2025

I don't think there are any situations currently in llama.cpp in which set_tensor_async would be called when using the Metal backend. Maybe enable offload_op (implement it and return true) and keep some layers on the CPU with a MoE model; ggml_backend_tensor_set_async would then be called when copying the experts.
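
For testing, a hedged sketch of what such an offload_op could look like on the Metal device entry (the real heuristic would be more selective):

```c
// hedged sketch - the actual offload heuristic would be more selective
static bool ggml_backend_metal_device_offload_op(ggml_backend_dev_t dev, const struct ggml_tensor * op) {
    GGML_UNUSED(dev);
    // offloading matrix multiplications whose weights live in a CPU buffer makes the
    // scheduler upload them with ggml_backend_tensor_set_async before computing the split
    return op->op == GGML_OP_MUL_MAT || op->op == GGML_OP_MUL_MAT_ID;
}
```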

@ggerganov marked this pull request as ready for review on September 8, 2025 17:44
Comment on lines 6094 to 6102
```c
static bool ggml_backend_metal_buffer_type_is_host(ggml_backend_buffer_type_t buft) {
    return true;
    // TODO: not sure why, but without setting this to `false`, op offloading does not work correctly
    // to reproduce, do the following:
    //
    //   build with: cmake -DGGML_BLAS=OFF -DGGML_METAL=ON
    //   run: ./bin/llama-cli -m ggml-model-mxfp4.gguf -p "$(printf 'hello %.0s' {1..100})" --n-cpu-moe 10
    //
    return false;
```

@ggerganov (Member Author):

I'm not completely sure I understand why we cannot return true here. If set to true it seems to result in a data race at some point.

@slaren Is setting false here a good solution, or is there a better approach?

@slaren (Member):

That's probably a bug in the way the scheduler handles this. By returning true, the CPU backend will use the results from the Metal backend without copies, but there is no synchronization preventing a race when the backend is async.

@slaren (Member):

Having a full solution for this will be tricky; I think it will require tracking the source of every tensor used in a split.

One immediate solution would be to stop using newBufferWithBytesNoCopy in Metal, create device-only buffers, and accept the cost of copying to the CPU backend. This will also likely improve the performance of Metal with discrete GPUs.

If that's not acceptable and these copies have a significant cost, this temporary patch should allow it to work without negatively impacting other cases:

```diff
diff --git a/ggml/src/ggml-backend.cpp b/ggml/src/ggml-backend.cpp
index 7646f3f13..e88974121 100644
--- a/ggml/src/ggml-backend.cpp
+++ b/ggml/src/ggml-backend.cpp
@@ -1426,6 +1426,21 @@ static enum ggml_status ggml_backend_sched_compute_splits(ggml_backend_sched_t s
         int split_backend_id = split->backend_id;
         ggml_backend_t split_backend = sched->backends[split_backend_id];

+        // HACK: if the split backend does not support async, synchronize the previous backend to avoid data races
+        // a full solution would require keeping track of the dependencies between splits, and synchronizing only when needed
+        // normally this is not a problem because backends have incompatible buffer types, and thus cannot share tensors
+        // however, in some cases such as Metal, the CPU backend can use the same buffer type as the GPU backend (i.e. host memory)
+        if (split_id > 0) {
+            ggml_backend_dev_props props;
+            auto * dev = ggml_backend_get_device(split_backend);
+            ggml_backend_dev_get_props(dev, &props);
+            if (!props.caps.async) {
+                int prev_backend_id = splits[split_id - 1].backend_id;
+                ggml_backend_t prev_backend = sched->backends[prev_backend_id];
+                ggml_backend_synchronize(prev_backend);
+            }
+        }
+
         // copy the input tensors to the split backend
         for (int input_id = 0; input_id < split->n_inputs; input_id++) {
             ggml_backend_t input_backend = ggml_backend_sched_get_tensor_backend(sched, split->inputs[input_id]);
```

@ggerganov (Member Author) commented Sep 9, 2025

In 85aaf52 I removed the newBufferWithBytesNoCopy usage and now buffers are created with the Private Metal memory type. Setting is_host == true now works correctly.

I will be testing the performance now to see if there is any significant impact from this change.
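
For reference, the allocation change amounts to something like the following (hedged sketch; the before/after shapes are reconstructed for illustration, not copied from 85aaf52):

```objc
// hedged sketch of the buffer allocation change, not copied from the actual commit
#import <Metal/Metal.h>

static id<MTLBuffer> metal_alloc_buffer(id<MTLDevice> device, void * host_ptr, size_t size_aligned) {
    // before: wrap pre-allocated host memory (shared storage, zero-copy with the CPU)
    if (host_ptr != NULL) {
        return [device newBufferWithBytesNoCopy:host_ptr
                                         length:size_aligned
                                        options:MTLResourceStorageModeShared
                                    deallocator:nil];
    }

    // after: device-only (private) memory; the CPU never maps the buffer, so sharing
    // data with the CPU backend now requires explicit, synchronized copies
    return [device newBufferWithLength:size_aligned
                               options:MTLResourceStorageModePrivate];
}
```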

@ggerganov force-pushed the gg/metal-async branch 2 times, most recently from f369bdb to 8d4835d on September 9, 2025 10:58
Comment on lines 6153 to 6162
```c
#if 1
    // TODO: tmp hack
    static void * p_base = (void *) 0x000000400ULL;

    ctx->all_data = p_base;

    p_base = (void *) ((uintptr_t) p_base + size_aligned);
#else
    ctx->all_data = ggml_metal_host_malloc(size_aligned);
#endif
```

@ggerganov (Member Author) commented Sep 9, 2025

What would be the proper way to handle this?

Previously, we used the allocated host-memory buffer as the base address. What should I use now that we no longer allocate host memory?

@slaren (Member):

If Metal does not give a pointer to the buffers, that's fine; just use a non-zero aligned pointer as the base address.

Comment on lines 6246 to 6250
```diff
 static struct ggml_backend_buffer_type ggml_backend_buffer_from_ptr_type_metal = {
     /* .iface = */ {
         /* .get_name = */ ggml_backend_metal_buffer_from_ptr_type_get_name,
-        /* .alloc_buffer = */ ggml_backend_metal_buffer_type_alloc_buffer,
+        /* .alloc_buffer = */ NULL,
         /* .get_alignment = */ ggml_backend_metal_buffer_type_get_alignment,
```

@ggerganov (Member Author):

The buffer_from_ptr_type should not need to implement .alloc_buffer, correct?

@slaren (Member):

Yes, it should be ok to leave the alloc_buffer function unimplemented.

@ggerganov (Member Author):

Without this function, one of the server tests fails during llama-server loading like this:

```
bin/llama-server --host 127.0.0.1 --port 8080 --temp 0.0 --seed 42 --hf-repo ggml-org/stories15M_MOE --hf-file stories15M_MOE-F16.gguf --batch-size 1024 --no-slots --alias stories15m-moe --ctx-size 2048 --parallel 1 --n-predict 64 --lora ./tmp/moe_shakespeare15M.gguf

Process 8535 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000000000000
error: memory read failed for 0x0
Target 0: (llama-server) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000000000000
    frame #1: 0x0000000101ba28fc libggml-base.dylib`ggml_backend_buft_alloc_buffer(buft=0x00000001015f0a70, size=15634432) at ggml-backend.cpp:45:12
    frame #2: 0x0000000101ba14b0 libggml-base.dylib`alloc_tensor_range(ctx=0x00006000009cc900, first=0x0000000101ff4020, last=0x0000000000000000, buft=0x00000001015f0a70, size=15634432, buffers=0x000000016fdf81e0, n_buffers=0x000000016fdf81d8) at ggml-alloc.c:937:36
    frame #3: 0x0000000101ba13dc libggml-base.dylib`ggml_backend_alloc_ctx_tensors_from_buft(ctx=0x00006000009cc900, buft=0x00000001015f0a70) at ggml-alloc.c:1004:14
    frame #4: 0x0000000102186f50 libllama.dylib`llama_adapter_lora_init_impl(model=0x0000000143053600, path_lora="./tmp/moe_shakespeare15M.gguf", adapter=0x00006000036cc140) at llama-adapter.cpp:385:43
    frame #5: 0x0000000102185160 libllama.dylib`llama_adapter_lora_init(model=0x0000000143053600, path_lora="./tmp/moe_shakespeare15M.gguf") at llama-adapter.cpp:421:9
    frame #6: 0x000000010039f5a0 llama-server`common_init_from_params(params=0x000000016fdfbf48) at common.cpp:985:20
    frame #7: 0x0000000100007694 llama-server`server_context::load_model(this=0x000000016fdfbf48, params=0x000000016fdfd600) at server.cpp:2144:22
    frame #8: 0x000000010000598c llama-server`main(argc=26, argv=0x000000016fdfef70) at server.cpp:5319:21
    frame #9: 0x000000019ac9ab98 dyld`start + 6076
```

Bringing back the function fixes this.

@slaren (Member):

Um yeah, this is because the LoRA tensors use the same buffer type as the base tensors: it takes the buffer type from the tensor and allocates a new buffer of the same type. I didn't think of this case, so it is better to leave it implemented then.

Comment on lines 46 to 50
```diff
 static struct ggml_backend_metal_device_context {
-    id<MTLDevice> mtl_device;
-    int mtl_device_ref_count;
-    id<MTLLibrary> mtl_library;
+    id<MTLDevice> mtl_device;
+    int mtl_device_ref_count;
+    id<MTLCommandQueue> mtl_queue;
+    id<MTLLibrary> mtl_library;
```

@ggerganov (Member Author):

At some point during the refactoring I decided to move the MTLCommandQueue from the backend context to the device context, i.e. a single queue is now shared by all backend contexts.

I'm not sure if this is better. Maybe I will move it back to the backend context and create separate queues for the synchronous get_tensor, set_tensor and memset_tensor operations.

@slaren (Member):

The intention is that the ggml_backend object represents a queue or stream. In practice, it probably doesn't matter too much if it is shared for Apple Mx iGPUs, but it would prevent some use cases such as running a computation in one backend and a memory transfer in another simultaneously.
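
A hedged sketch of what a per-backend queue would look like (the context struct and field names here are illustrative, and non-ARC compilation is assumed as in ggml-metal.m):

```objc
// illustrative sketch - each ggml_backend instance owning its own queue allows,
// for example, a compute backend and a transfer backend to overlap on the same device
#import <Metal/Metal.h>

struct metal_backend_ctx {
    id<MTLCommandQueue> queue; // per-backend, instead of a single device-wide mtl_queue
};

static void metal_backend_ctx_init(struct metal_backend_ctx * ctx, id<MTLDevice> device) {
    ctx->queue = [device newCommandQueue];
}
```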

```objc
id<MTLCommandQueue> queue = ctx->queue;
id<MTLCommandBuffer> cmd_buf = [queue commandBuffer];
id<MTLBlitCommandEncoder> encoder = [cmd_buf blitCommandEncoder];
[cmd_buf enqueue];
```

@slaren (Member):

As far as I can tell from the doc, there is no reason to call enqueue unless you plan to commit multiple command buffers and want to "reserve a place in the queue"; commit automatically calls enqueue if necessary:

> If your app calls commit for a command buffer that isn’t enqueued, the method effectively calls enqueue for you.

@ggerganov (Member Author):

Yes, these enqueue calls are not needed - removing.

Comment on lines +5753 to +5759
```objc
// here we guarantee the full previous graph has finished computing
// but note that we have already enqueued the first part of the new graph so it can start processing, while
// we continue to encode the rest of the graph
if (ctx->cmd_buf_last) {
    [ctx->cmd_buf_last waitUntilCompleted];
    ctx->cmd_buf_last = nil;
}
```
@slaren (Member):

I am not sure why this is necessary, or why there are both a cmd_buf_last and a cmd_buf_ext_last. It seems that a single reference to the last queued buffer should be enough.

@ggerganov (Member Author):

> I am not sure why this is necessary

Technically, instead of waiting here for the previous command buffers to finish, we could create new command buffers and proceed directly with enqueuing the new graph. But the limitation is that for each cmd_buf we have an associated ggml_metal_mem_pool used for allocating fleeting buffers that are needed for some of the operators. So creating a new command buffer would require creating a new memory pool, which is not desired since the idea of the pools is to reuse previously allocated memory.

> why there are both a cmd_buf_last and a cmd_buf_ext_last

A single reference would work, but I thought this is slightly more optimal in the following scenario:

```
# encoding a graph using 2 command buffers:
#   first cb is a small cb with the first 128 nodes
#   second cb is the rest of the graph
enqueue compute graph 0 (cb0)
enqueue compute graph 0 (cb1)   <--- cmd_buf_last points to this CB
  wait until scheduled cb1
async set tensor A
async set tensor B
...
async set tensor X              <--- cmd_buf_ext_last points to this CB
  wait until complete cb0
enqueue compute graph 1 (cb0)
  wait until complete cb1       <--- here no need to wait for the async copies
enqueue compute graph 1 (cb1)
```

The last wait only has to wait for the previous graph to be finished and does not need to wait for any async copies that might be ongoing.

@slaren (Member) commented Sep 9, 2025

> But the limitation is that for each cmd_buf we have an associated ggml_metal_mem_pool used for allocating fleeting buffers that are needed for some of the operators.

Wouldn't it be possible to use the same pool for all the command buffers? Since they are executed in sequence, not in parallel, I don't see why each one needs a different one.

On a side note, I realized that actually it would be easy to allocate these buffers in the compute buffer and remove the need for the pools completely. This could be done by allocating the necessary memory at the end of the tensor data: change get_alloc_size to consider the op of the tensor and add any extra memory needed to the size of the tensor. Maybe we could add some kind of API to ggml-backend to make this nicer to use, but essentially all the infrastructure necessary to do this is already there.
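
A hedged sketch of the get_alloc_size idea (the op case and extra sizes below are placeholders for illustration):

```c
// hedged sketch - reserve op-specific scratch right after the tensor data so the
// graph allocator accounts for it in the compute buffer (sizes are placeholders)
static size_t ggml_backend_metal_buffer_type_get_alloc_size(ggml_backend_buffer_type_t buft,
                                                            const struct ggml_tensor * tensor) {
    GGML_UNUSED(buft);

    size_t size = ggml_nbytes(tensor);

    switch (tensor->op) {
        case GGML_OP_MUL_MAT_ID:
            // example only: extra space for a temporary mapping used by the kernel
            size += 256 * ggml_nrows(tensor);
            break;
        default:
            break;
    }

    return size;
}
```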

@ggerganov (Member Author):

There is logic for freeing unused heaps (MTLHeap) from the pool when they haven't been used for a while:

```objc
static void ggml_metal_mem_pool_reset(struct ggml_metal_mem_pool * mem_pool) {
    for (NSUInteger i = 0; i < [mem_pool->heaps count]; i++) {
        ggml_metal_heap_ptr * ptr = [mem_pool->heaps objectAtIndex:i];

        struct ggml_metal_heap * heap = ptr.data;
        ggml_metal_heap_reset(heap);

        // if the heap hasn't been used for a while, remove it
        if (heap->n_unused >= 128) {
            [mem_pool->heaps_to_remove addObject:@(i)];
        }
    }

    if (mem_pool->heaps_to_remove.count > 0) {
        // remove in reverse order
        for (NSUInteger i = [mem_pool->heaps_to_remove count] - 1; ; --i) {
            NSUInteger index = [[mem_pool->heaps_to_remove objectAtIndex:i] intValue];
            ggml_metal_heap_ptr * ptr = [mem_pool->heaps objectAtIndex:index];

            struct ggml_metal_heap * heap = ptr.data;
            ggml_metal_heap_free(heap);

            [mem_pool->heaps removeObjectAtIndex:index];
            [ptr release];

            if (i == 0) {
                break;
            }
        }

        [mem_pool->heaps_to_remove removeAllObjects];
    }
}
```

I am not sure what would happen if, while enqueuing the new graph, we decide that a heap should be released while the previous graph is still using a buffer allocated from that heap.

> On a side note, I realized that actually it would be easy to allocate these buffers in the compute buffer

Nice. It would be great to remove the memory pools. Without the pools, I think there would be no reason to wait here.

```objc
[encoder endEncoding];

[cmd_buf commit];
[cmd_buf waitUntilScheduled];
```

@slaren (Member):

It is not obvious to me why waitUntilScheduled is necessary here; from what I can tell, this is something that only needs to be done on iOS when the app is about to be moved to the background.

@ggerganov (Member Author) commented Sep 9, 2025

I was worried that exiting the autoreleasepool section before the command buffer is fully scheduled might be a problem, but my guess is that its destructor likely has this call in it, so we can probably remove it.

@ggerganov (Member Author) commented:

Superseded by #15906.

@ggerganov closed this Sep 10, 2025