
Conversation

JohannesGaessler
Collaborator

This PR makes it so that, on exit, a breakdown of memory use is printed. For example:

llama_print_memory_breakdown: memory breakdown:      total   free     self   model   context   compute    unaccounted
llama_print_memory_breakdown:   - CUDA0 (RTX 4090):  24080 = 9436 + (14193 = 13169 +      38 +     985) +         451
llama_print_memory_breakdown:   - CUDA1 (RTX 4090):  24080 = 9868 + (13756 = 13169 +      38 +     548) +         456
llama_print_memory_breakdown:   - CUDA2 (RTX 4090):  24080 = 9868 + (13756 = 13169 +      38 +     548) +         456
llama_print_memory_breakdown:   - CPU (EPYC 7742):  515628              72 =     0 +      57 +      15

Explanation:

size_t memory_total;             // total memory as reported by the device
size_t memory_free;              // free memory as reported by the device
size_t memory_used_self;         // sum of model, context, and compute
size_t memory_used_self_model;   // memory allocated for the model
size_t memory_used_self_context; // memory allocated for the context
size_t memory_used_self_compute; // memory allocated for temporary compute buffers
size_t memory_used_unaccounted;  // memory with unknown use, e.g. drivers or other programs, total - (free + self)
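
To make the relation between the columns explicit: "unaccounted" is simply what remains after subtracting the free memory and llama.cpp's own allocations from the device total. A minimal illustration (not code from this PR), using the CUDA0 row above, all values in MiB:

// illustration only: how the "unaccounted" column relates to the others
// for CUDA0 above: 24080 - (9436 + 14193) = 451 MiB
size_t memory_used_unaccounted(size_t mem_total, size_t mem_free, size_t mem_self) {
    return mem_total - (mem_free + mem_self);
}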

The intended immediate use is to make it easier to distribute models efficiently across devices. I also intend to reuse this code to automatically determine which parts of the model to put on which device for optimal performance. Long-term, I would also like to expose this information via the HTTP server in order to establish a Pareto frontier of quality vs. memory use for different quantizations of different models.

Open problems:

  • I added a function llama_print_memory_breakdown to the llama API which produces the above table on the console. Internally this function uses another new function, llama_backend_info, which returns a struct with information about the backends used by a llama_context. I'm not sure whether the latter should be part of the public API, and if so, in what form.
  • I added methods like llama_model::memory_use(ggml_backend_dev_t dev) which return the memory used on a specified device. However, I'm not sure whether the device is the correct argument type here. Would it make more sense to pass a ggml_backend_buffer_type_t instead? In particular, I think this is the only correct way to handle e.g. CUDA_Host buffers (see the sketch after this list).
  • The memory for e.g. the CUDA pools is currently listed under "unaccounted", but it should be under "compute". Currently it is not possible for llama.cpp to retrieve this information. I think it would make sense to extend the ggml backend interface with a function that returns the total amount of device memory allocated by the backend.
  • I'm not sure what to show, if anything, for the CPU. "Free" memory does not have a clear-cut definition in this context, so I'm only showing total memory and the memory that is definitely allocated by the CPU backend.
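
To make the second bullet concrete, here is a rough sketch of what the buffer-type-based variant could look like. The bufs member and the exact set of buffers to sum over are assumptions made for illustration, not the code in this PR; the ggml calls themselves (ggml_backend_buffer_get_type, ggml_backend_buffer_get_size) exist in the current backend API:

// sketch only: per-buffer-type accounting instead of per-device accounting
size_t llama_model::memory_use(ggml_backend_buffer_type_t buft) const {
    size_t n_bytes = 0;
    for (ggml_backend_buffer_t buf : bufs) { // hypothetical member: the buffers holding the model weights
        if (ggml_backend_buffer_get_type(buf) == buft) {
            n_bytes += ggml_backend_buffer_get_size(buf); // allocated size in bytes
        }
    }
    return n_bytes;
}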

int32_t n_eval;
int32_t n_reused; // number of times a ggml compute graph had been reused
// ms == milliseconds
double t_start_ms; // time needed for startup
Member

I don't think this is correct; it's just the timestamp at startup.

Comment on lines +1361 to +1380

LLAMA_API size_t llama_backend_count(const struct llama_context * ctx);

struct llama_backend_info_data {
    const char * name;

    struct {
        const char * name;
        const char * description;

        // device memory is in bytes
        size_t memory_total;             // total memory as reported by the device
        size_t memory_free;              // free memory as reported by the device
        size_t memory_used_self;         // sum of model, context, and compute
        size_t memory_used_self_model;   // memory allocated for the model
        size_t memory_used_self_context; // memory allocated for the context
        size_t memory_used_self_compute; // memory allocated for temporary compute buffers
        size_t memory_used_unaccounted;  // memory with unknown use, e.g. drivers or other programs, total - (free + self)
    } device;
};
Member

I don't think tracking memory usage per-backend is the right way to do this. There are two reasonable options:

  • Tracking memory per device
  • Tracking memory per buffer type

In practice, it can be hard to map a buffer type to a device. For example, should a CUDA_Host buffer count as a CPU device or as a CUDA device? What device should a CUDA_Split buffer belong to? It allocates memory from multiple devices.

Therefore, I think the only reasonable way to do this is per buffer type.
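
For reference, a minimal sketch of what per-buffer-type aggregation could look like; all_buffers is a placeholder for every buffer owned by the model, context, and scheduler, and the function is illustrative rather than part of this PR:

#include <cstdio>
#include <map>
#include <vector>

#include "ggml-backend.h"

// sketch only: aggregate memory use per buffer type rather than per backend/device;
// CUDA_Host and CUDA_Split buffers are handled naturally because every buffer has
// exactly one buffer type
static void print_memory_per_buft(const std::vector<ggml_backend_buffer_t> & all_buffers) {
    std::map<ggml_backend_buffer_type_t, size_t> bytes_per_buft;
    for (ggml_backend_buffer_t buf : all_buffers) {
        bytes_per_buft[ggml_backend_buffer_get_type(buf)] += ggml_backend_buffer_get_size(buf);
    }
    for (const auto & [buft, n_bytes] : bytes_per_buft) {
        printf("%s: %zu bytes\n", ggml_backend_buft_name(buft), n_bytes);
    }
}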
