
Conversation

@ggerganov (Member) commented Sep 6, 2025

Implement async support for the Metal backend. This benefits very large contexts, since we can now prepare the KQ mask while the previous batch is being computed on the GPU.

| Model | Test | t/s master | t/s gg/metal-async | Speedup |
| --- | --- | --- | --- | --- |
| gemma3 4B Q8_0 | pp65536 | 1837.95 | 2115.10 | 1.15 |
| gpt-oss 20B MXFP4 MoE | pp65536 | 1095.32 | 1178.38 | 1.08 |
| qwen3 4B Q8_0 | pp65536 | 559.52 | 581.84 | 1.04 |

@github-actions bot added the ggml and Apple Metal labels on Sep 6, 2025
@slaren (Member) commented Sep 6, 2025

Making graph_compute async and implementing set_async and get_async could help, but cpy_tensor_async and events are only useful for multi-GPU scenarios.
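
For reference, the relevant entries of the backend interface look roughly like this (paraphrased from ggml-backend-impl.h from memory; treat the exact signatures as approximate):

```c
// approximate excerpt of struct ggml_backend_i (ggml-backend-impl.h) - only the
// entries discussed here, signatures reproduced from memory
struct ggml_backend_i {
    // async tensor I/O - useful even with a single GPU
    void (*set_tensor_async)(ggml_backend_t backend, struct ggml_tensor * tensor,
                             const void * data, size_t offset, size_t size);
    void (*get_tensor_async)(ggml_backend_t backend, const struct ggml_tensor * tensor,
                             void * data, size_t offset, size_t size);

    // direct copy between two backends - only truly async between GPUs
    bool (*cpy_tensor_async)(ggml_backend_t backend_src, ggml_backend_t backend_dst,
                             const struct ggml_tensor * src, struct ggml_tensor * dst);

    // wait for all queued async work on this backend to finish
    void (*synchronize)(ggml_backend_t backend);

    // returning before the GPU finishes is what makes the backend asynchronous
    enum ggml_status (*graph_compute)(ggml_backend_t backend, struct ggml_cgraph * cgraph);

    // event_record / event_wait omitted - only needed to order work across devices
};
```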

@ggerganov (Member Author) commented:

Thanks, that's good to know. I think I got confused by seeing cpy_tensor_async being called, but that is probably because the Metal backend does not register a host buffer type?

@slaren (Member) commented Sep 6, 2025

You can see a cpy_tensor_async call when copying the inputs from the CPU backend; however, that's not something that can really be handled by the backend, and you must return false in this case to indicate that it is not supported. This function is only truly asynchronous when copying tensors between GPUs.
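
Concretely, a backend that cannot perform the copy asynchronously just reports it as unsupported; a minimal illustrative sketch (not the actual ggml-metal code):

```c
// illustrative sketch, not the actual ggml-metal implementation
static bool ggml_backend_metal_cpy_tensor_async(ggml_backend_t backend_src,
                                                ggml_backend_t backend_dst,
                                                const struct ggml_tensor * src,
                                                struct ggml_tensor * dst) {
    (void) backend_src; (void) backend_dst; (void) src; (void) dst;
    // inputs coming from the CPU backend cannot be copied asynchronously here;
    // returning false makes the scheduler fall back to the synchronous copy path
    return false;
}
```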

@ggerganov (Member Author) commented:

Got it, thanks. The main motivation is to allow overlapping the llm_graph_result::set_inputs() for the next ubatch while the current ubatch is being processed. From testing with this branch, it looks like implementing set_async, get_async and synchronize should do the job.
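
Roughly, the pipelining being targeted looks like this (hedged pseudocode; prepare_inputs and the loop structure are placeholders, not actual llama.cpp code):

```c
// hedged pseudocode of the intended overlap - names are placeholders
for (int i = 0; i < n_ubatch; ++i) {
    // queue the graph for ubatch i; with an async backend this returns immediately
    ggml_backend_sched_graph_compute_async(sched, graph[i]);

    // while the GPU runs ubatch i, the CPU builds the KQ mask / inputs for ubatch i+1
    if (i + 1 < n_ubatch) {
        prepare_inputs(i + 1);   // ~ llm_graph_result::set_inputs() for the next ubatch
    }

    // block only when the results of ubatch i are actually needed
    ggml_backend_sched_synchronize(sched);
}
```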

@ggerganov (Member Author) commented:

@slaren Should I expect set_tensor_async to ever get called on a Mac? I got things working with get_tensor_async and synchronize, but I can't figure out a way to test that set_tensor_async works.

@slaren (Member) commented Sep 8, 2025

I don't think there are any situations currently in llama.cpp in which set_tensor_async would be called when using the Metal backend. Maybe enable offload_op (implement it and return true) and keep some layers on the CPU with a MoE model; ggml_backend_tensor_set_async would then be called when copying the experts.
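
For testing, a hedged sketch of what such an offload_op could look like on the Metal device entry (the real heuristic would be more selective):

```c
// hedged sketch - the actual offload heuristic would be more selective
static bool ggml_backend_metal_device_offload_op(ggml_backend_dev_t dev, const struct ggml_tensor * op) {
    GGML_UNUSED(dev);
    // offloading matrix multiplications whose weights live in a CPU buffer makes the
    // scheduler upload them with ggml_backend_tensor_set_async before computing the split
    return op->op == GGML_OP_MUL_MAT || op->op == GGML_OP_MUL_MAT_ID;
}
```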

@ggerganov marked this pull request as ready for review on September 8, 2025 17:44
Comment on lines 6094 to 6102
```c
static bool ggml_backend_metal_buffer_type_is_host(ggml_backend_buffer_type_t buft) {
    return true;
    // TODO: not sure why, but without setting this to `false`, op offloading does not work correctly
    // to reproduce, do the following:
    //
    //   build with: cmake -DGGML_BLAS=OFF -DGGML_METAL=ON
    //   run: ./bin/llama-cli -m ggml-model-mxfp4.gguf -p "$(printf 'hello %.0s' {1..100})" --n-cpu-moe 10
    //
    return false;
```

@ggerganov (Member Author):

I'm not completely sure I understand why we cannot return true here. If set to true it seems to result in a data race at some point.

@slaren Is setting false here a good solution, or is there a better approach?

@slaren (Member):

That's probably a bug in the way the scheduler handles this. By returning true, the CPU backend will use the results from the Metal backend without copies, but there is no synchronization preventing a race when the backend is async.

@slaren (Member):

Having a full solution for this will be tricky; I think it will require tracking the source of every tensor used in a split.

One immediate solution would be to stop using newBufferWithBytesNoCopy in Metal, create device-only buffers, and accept the cost of copying to the CPU backend. This will also likely improve the performance of Metal with discrete GPUs.

If that's not acceptable and these copies have a significant cost, this temporary patch should allow it to work without negatively impacting other cases:

```diff
diff --git a/ggml/src/ggml-backend.cpp b/ggml/src/ggml-backend.cpp
index 7646f3f13..e88974121 100644
--- a/ggml/src/ggml-backend.cpp
+++ b/ggml/src/ggml-backend.cpp
@@ -1426,6 +1426,21 @@ static enum ggml_status ggml_backend_sched_compute_splits(ggml_backend_sched_t s
         int split_backend_id = split->backend_id;
         ggml_backend_t split_backend = sched->backends[split_backend_id];

+        // HACK: if the split backend does not support async, synchronize the previous backend to avoid data races
+        // a full solution would require keeping track of the dependencies between splits, and synchronizing only when needed
+        // normally this is not a problem because backends have incompatible buffer types, and thus cannot share tensors
+        // however, in some cases such as Metal, the CPU backend can use the same buffer type as the GPU backend (i.e. host memory)
+        if (split_id > 0) {
+            ggml_backend_dev_props props;
+            auto * dev = ggml_backend_get_device(split_backend);
+            ggml_backend_dev_get_props(dev, &props);
+            if (!props.caps.async) {
+                int prev_backend_id = splits[split_id - 1].backend_id;
+                ggml_backend_t prev_backend = sched->backends[prev_backend_id];
+                ggml_backend_synchronize(prev_backend);
+            }
+        }
+
         // copy the input tensors to the split backend
         for (int input_id = 0; input_id < split->n_inputs; input_id++) {
             ggml_backend_t input_backend = ggml_backend_sched_get_tensor_backend(sched, split->inputs[input_id]);
```

@ggerganov (Member Author) commented Sep 9, 2025

In 85aaf52 I removed the newBufferWithBytesNoCopy usage and now buffers are created with the Private Metal memory type. Setting is_host == true now works correctly.

I will be testing the performance now to see if there is any significant impact from this change.
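
For reference, the allocation change amounts to something like the following (hedged sketch; the before/after shapes are reconstructed for illustration, not copied from 85aaf52):

```objc
// hedged sketch of the buffer allocation change, not copied from the actual commit
#import <Metal/Metal.h>

static id<MTLBuffer> metal_alloc_buffer(id<MTLDevice> device, void * host_ptr, size_t size_aligned) {
    // before: wrap pre-allocated host memory (shared storage, zero-copy with the CPU)
    if (host_ptr != NULL) {
        return [device newBufferWithBytesNoCopy:host_ptr
                                         length:size_aligned
                                        options:MTLResourceStorageModeShared
                                    deallocator:nil];
    }

    // after: device-only (private) memory; the CPU never maps the buffer, so sharing
    // data with the CPU backend now requires explicit, synchronized copies
    return [device newBufferWithLength:size_aligned
                               options:MTLResourceStorageModePrivate];
}
```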

@ggerganov force-pushed the gg/metal-async branch 2 times, most recently from f369bdb to 8d4835d on September 9, 2025 10:58
Comment on lines 6153 to 6162
```c
#if 1
    // TODO: tmp hack
    static void * p_base = (void *) 0x000000400ULL;

    ctx->all_data = p_base;

    p_base = (void *) ((uintptr_t) p_base + size_aligned);
#else
    ctx->all_data = ggml_metal_host_malloc(size_aligned);
#endif
```

@ggerganov (Member Author) commented Sep 9, 2025

What would be the proper way to handle this?

Previously, we used the allocated host-memory buffer as the base address. What should I use now that we no longer allocate host memory?

@slaren (Member):

If Metal does not give a pointer to the buffers, that's fine; just use a non-zero aligned pointer as the base address.

Comment on lines 6246 to 6250
```diff
 static struct ggml_backend_buffer_type ggml_backend_buffer_from_ptr_type_metal = {
     /* .iface = */ {
         /* .get_name = */ ggml_backend_metal_buffer_from_ptr_type_get_name,
-        /* .alloc_buffer = */ ggml_backend_metal_buffer_type_alloc_buffer,
+        /* .alloc_buffer = */ NULL,
         /* .get_alignment = */ ggml_backend_metal_buffer_type_get_alignment,
```

@ggerganov (Member Author):

The buffer_from_ptr_type should not need to implement .alloc_buffer, correct?

@slaren (Member):

Yes, it should be ok to leave the alloc_buffer function unimplemented.

@ggerganov (Member Author):

Without this function, one of the server tests fails during llama-server loading like this:

```
bin/llama-server --host 127.0.0.1 --port 8080 --temp 0.0 --seed 42 --hf-repo ggml-org/stories15M_MOE --hf-file stories15M_MOE-F16.gguf --batch-size 1024 --no-slots --alias stories15m-moe --ctx-size 2048 --parallel 1 --n-predict 64 --lora ./tmp/moe_shakespeare15M.gguf

Process 8535 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000000000000
error: memory read failed for 0x0
Target 0: (llama-server) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000000000000
    frame #1: 0x0000000101ba28fc libggml-base.dylib`ggml_backend_buft_alloc_buffer(buft=0x00000001015f0a70, size=15634432) at ggml-backend.cpp:45:12
    frame #2: 0x0000000101ba14b0 libggml-base.dylib`alloc_tensor_range(ctx=0x00006000009cc900, first=0x0000000101ff4020, last=0x0000000000000000, buft=0x00000001015f0a70, size=15634432, buffers=0x000000016fdf81e0, n_buffers=0x000000016fdf81d8) at ggml-alloc.c:937:36
    frame #3: 0x0000000101ba13dc libggml-base.dylib`ggml_backend_alloc_ctx_tensors_from_buft(ctx=0x00006000009cc900, buft=0x00000001015f0a70) at ggml-alloc.c:1004:14
    frame #4: 0x0000000102186f50 libllama.dylib`llama_adapter_lora_init_impl(model=0x0000000143053600, path_lora="./tmp/moe_shakespeare15M.gguf", adapter=0x00006000036cc140) at llama-adapter.cpp:385:43
    frame #5: 0x0000000102185160 libllama.dylib`llama_adapter_lora_init(model=0x0000000143053600, path_lora="./tmp/moe_shakespeare15M.gguf") at llama-adapter.cpp:421:9
    frame #6: 0x000000010039f5a0 llama-server`common_init_from_params(params=0x000000016fdfbf48) at common.cpp:985:20
    frame #7: 0x0000000100007694 llama-server`server_context::load_model(this=0x000000016fdfbf48, params=0x000000016fdfd600) at server.cpp:2144:22
    frame #8: 0x000000010000598c llama-server`main(argc=26, argv=0x000000016fdfef70) at server.cpp:5319:21
    frame #9: 0x000000019ac9ab98 dyld`start + 6076
```

Bringing back the function fixes this.

@slaren (Member):

Um yeah, this is because the LoRA tensors use the same buffer type as the base tensors: it takes the buffer type from the tensor and allocates a new buffer of the same type. I didn't think of this case, so it is better to leave it implemented then.

Comment on lines 46 to 50
```diff
 static struct ggml_backend_metal_device_context {
-    id<MTLDevice> mtl_device;
-    int mtl_device_ref_count;
-    id<MTLLibrary> mtl_library;
+    id<MTLDevice> mtl_device;
+    int mtl_device_ref_count;
+    id<MTLCommandQueue> mtl_queue;
+    id<MTLLibrary> mtl_library;
```

@ggerganov (Member Author):

At some point during the refactoring I decided to move the MTLCommandQueue from the backend context to the device context, i.e. a single queue is now shared by all backend contexts.

I'm not sure if this is better. Maybe I will move it back to the backend context and create separate queues for the synchronous get_tensor, set_tensor and memset_tensor operations.

@slaren (Member):

The intention is that the ggml_backend object represents a queue or stream. In practice, it probably doesn't matter too much if it is shared for Apple Mx iGPUs, but it would prevent some use cases such as running a computation in one backend and a memory transfer in another simultaneously.
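
A hedged sketch of what a per-backend queue would look like (the context struct and field names here are illustrative, and non-ARC compilation is assumed as in ggml-metal.m):

```objc
// illustrative sketch - each ggml_backend instance owning its own queue allows,
// for example, a compute backend and a transfer backend to overlap on the same device
#import <Metal/Metal.h>

struct metal_backend_ctx {
    id<MTLCommandQueue> queue; // per-backend, instead of a single device-wide mtl_queue
};

static void metal_backend_ctx_init(struct metal_backend_ctx * ctx, id<MTLDevice> device) {
    ctx->queue = [device newCommandQueue];
}
```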

```objc
id<MTLCommandQueue> queue = ctx->queue;
id<MTLCommandBuffer> cmd_buf = [queue commandBuffer];
id<MTLBlitCommandEncoder> encoder = [cmd_buf blitCommandEncoder];
[cmd_buf enqueue];
```

@slaren (Member):

As far as I can tell from the doc, there is no reason to call enqueue unless you plan to commit multiple command buffers and want to "reserve a place in the queue"; commit automatically calls enqueue if necessary:

> If your app calls commit for a command buffer that isn’t enqueued, the method effectively calls enqueue for you.

@ggerganov (Member Author):

Yes, these enqueue calls are not needed - removing.

Comment on lines +5753 to +5759
```objc
// here we guarantee the full previous graph has finished computing
// but note that we have already enqueued the first part of the new graph so it can start processing, while
// we continue to encode the rest of the graph
if (ctx->cmd_buf_last) {
    [ctx->cmd_buf_last waitUntilCompleted];
    ctx->cmd_buf_last = nil;
}
```
@slaren (Member):

I am not sure why this is necessary, or why there are both a cmd_buf_last and a cmd_buf_ext_last. It seems that a single reference to the last queued buffer should be enough.

@ggerganov (Member Author):

> I am not sure why this is necessary

Technically, instead of waiting here for the previous command buffers to finish, we could create new command buffers and proceed directly with enqueuing the new graph. But the limitation is that for each cmd_buf we have an associated ggml_metal_mem_pool used for allocating fleeting buffers that are needed for some of the operators. So creating a new command buffer would require creating a new memory pool, which is not desired since the idea of the pools is to reuse previously allocated memory.

> why there are both a cmd_buf_last and a cmd_buf_ext_last

A single reference would work, but I thought this is slightly more optimal in the following scenario:

```
# encoding a graph using 2 command buffers:
#   first cb is a small cb with the first 128 nodes
#   second cb is the rest of the graph
enqueue compute graph 0 (cb0)
enqueue compute graph 0 (cb1)   <--- cmd_buf_last points to this CB
  wait until scheduled cb1
async set tensor A
async set tensor B
...
async set tensor X              <--- cmd_buf_ext_last points to this CB
  wait until complete cb0
enqueue compute graph 1 (cb0)
  wait until complete cb1       <--- here no need to wait for the async copies
enqueue compute graph 1 (cb1)
```

The last wait only has to wait for the previous graph to be finished and does not need to wait for any async copies that might be ongoing.

@slaren (Member) commented Sep 9, 2025

> But the limitation is that for each cmd_buf we have an associated ggml_metal_mem_pool used for allocating fleeting buffers that are needed for some of the operators.

Wouldn't it be possible to use the same pool for all the command buffers? Since they are executed in sequence, not in parallel, I don't see why each one needs a different one.

On a side note, I realized that actually it would be easy to allocate these buffers in the compute buffer and remove the need for the pools completely. This could be done by allocating the necessary memory at the end of the tensor data: change get_alloc_size to consider the op of the tensor and add any extra memory needed to the size of the tensor. Maybe we could add some kind of API to ggml-backend to make this nicer to use, but essentially all the infrastructure necessary to do this is already there.
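
A hedged sketch of the get_alloc_size idea (the op case and extra sizes below are placeholders for illustration):

```c
// hedged sketch - reserve op-specific scratch right after the tensor data so the
// graph allocator accounts for it in the compute buffer (sizes are placeholders)
static size_t ggml_backend_metal_buffer_type_get_alloc_size(ggml_backend_buffer_type_t buft,
                                                            const struct ggml_tensor * tensor) {
    GGML_UNUSED(buft);

    size_t size = ggml_nbytes(tensor);

    switch (tensor->op) {
        case GGML_OP_MUL_MAT_ID:
            // example only: extra space for a temporary mapping used by the kernel
            size += 256 * ggml_nrows(tensor);
            break;
        default:
            break;
    }

    return size;
}
```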

@ggerganov (Member Author):

There is logic for freeing unused heaps (MTLHeap) from the pool when they haven't been used for a while:

```objc
static void ggml_metal_mem_pool_reset(struct ggml_metal_mem_pool * mem_pool) {
    for (NSUInteger i = 0; i < [mem_pool->heaps count]; i++) {
        ggml_metal_heap_ptr * ptr = [mem_pool->heaps objectAtIndex:i];

        struct ggml_metal_heap * heap = ptr.data;
        ggml_metal_heap_reset(heap);

        // if the heap hasn't been used for a while, remove it
        if (heap->n_unused >= 128) {
            [mem_pool->heaps_to_remove addObject:@(i)];
        }
    }

    if (mem_pool->heaps_to_remove.count > 0) {
        // remove in reverse order
        for (NSUInteger i = [mem_pool->heaps_to_remove count] - 1; ; --i) {
            NSUInteger index = [[mem_pool->heaps_to_remove objectAtIndex:i] intValue];
            ggml_metal_heap_ptr * ptr = [mem_pool->heaps objectAtIndex:index];

            struct ggml_metal_heap * heap = ptr.data;
            ggml_metal_heap_free(heap);

            [mem_pool->heaps removeObjectAtIndex:index];
            [ptr release];

            if (i == 0) {
                break;
            }
        }

        [mem_pool->heaps_to_remove removeAllObjects];
    }
}
```

I am not sure what would happen if, while enqueuing the new graph, we decide that a heap should be released while the previous graph is still using a buffer allocated from that heap.

> On a side note, I realized that actually it would be easy to allocate these buffers in the compute buffer

Nice. It would be great to remove the memory pools. Without the pools, I think there would be no reason to wait here.

```objc
[encoder endEncoding];

[cmd_buf commit];
[cmd_buf waitUntilScheduled];
```

@slaren (Member):

It is not obvious to me why waitUntilScheduled is necessary here; from what I can tell, this is something that only needs to be done on iOS when the app is about to be moved to the background.

@ggerganov (Member Author) commented Sep 9, 2025

I was worried that exiting the autoreleasepool section before the command buffer is fully scheduled might be a problem, but my guess is that its destructor likely has this call in it, so we can probably remove it.

@ggerganov (Member Author) commented:

Superseded by #15906.

@ggerganov closed this Sep 10, 2025