llama: print memory breakdown on exit #15860
This PR makes it so that, on exit, a breakdown of memory use is printed. For example:
Explanation:
The intended immediate use is to make it easier to distribute models efficiently across devices. I also intend to reuse this code to automatically determine which parts of the model to put on which device for optimal performance. Long-term, I would also like to expose this information via the HTTP server to establish a Pareto frontier of quality vs. memory use for different quantizations of different models.
Open problems:
- The PR adds a new function `llama_print_memory_breakdown` to the llama API which produces the above table on the console. Internally this function uses another new function, `llama_backend_info`, which returns a struct with information about the backends used by a `llama_context`. I'm not sure whether the latter should be part of the public API, and if yes, in what form. A usage sketch is shown after this list.
- There are internal functions such as `llama_model::memory_use(ggml_backend_dev_t dev)` which return the memory used on a specified device. But I'm not sure whether the device is the correct argument type here. Would it make more sense to pass a `ggml_backend_buffer_type_t`? In particular, I think this is the only correct way to handle e.g. `CUDA_Host` buffers.
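
For illustration, here is a minimal sketch of how the new call might be used from application code. This assumes `llama_print_memory_breakdown` takes only the context as an argument (the exact signature is not shown above), and the model path is a placeholder:

```c
#include "llama.h"

int main(void) {
    llama_backend_init();

    // Load a model (placeholder path) and create a context with default parameters.
    struct llama_model_params mparams = llama_model_default_params();
    struct llama_model * model = llama_model_load_from_file("model.gguf", mparams);

    struct llama_context_params cparams = llama_context_default_params();
    struct llama_context * ctx = llama_init_from_model(model, cparams);

    /* ... run inference ... */

    // Hypothetical call: print the per-device memory breakdown before tearing
    // down the context. The function name comes from the PR description; the
    // argument list is an assumption for this sketch.
    llama_print_memory_breakdown(ctx);

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

In this shape, the breakdown would be driven explicitly by the application rather than printed automatically at exit; which of the two the PR ultimately settles on is part of the API discussion above.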