
Conversation

JohannesGaessler
Collaborator

This PR makes it so that, on exit, a breakdown of memory use is printed. For example:

llama_print_memory_breakdown: memory breakdown:      total   free     self   model   context   compute    unaccounted
llama_print_memory_breakdown:   - CUDA0 (RTX 4090):  24080 = 9436 + (14193 = 13169 +      38 +     985) +         451
llama_print_memory_breakdown:   - CUDA1 (RTX 4090):  24080 = 9868 + (13756 = 13169 +      38 +     548) +         456
llama_print_memory_breakdown:   - CUDA2 (RTX 4090):  24080 = 9868 + (13756 = 13169 +      38 +     548) +         456
llama_print_memory_breakdown:   - CPU (EPYC 7742):  515628              72 =     0 +      57 +      15

Explanation:

size_t memory_total;             // total memory as reported by the device
size_t memory_free;              // free memory as reported by the device
size_t memory_used_self;         // sum of model, context, and compute
size_t memory_used_self_model;   // memory allocated for the model
size_t memory_used_self_context; // memory allocated for the context
size_t memory_used_self_compute; // memory allocated for temporary compute buffers
size_t memory_used_unaccounted;  // memory with unknown use, e.g. drivers or other programs, total - (free + self)
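
To make the relation between the columns explicit: "unaccounted" is simply what remains after subtracting the free memory and llama.cpp's own allocations from the device total. A minimal illustration (not code from this PR), using the CUDA0 row above, all values in MiB:

// illustration only: how the "unaccounted" column relates to the others
// for CUDA0 above: 24080 - (9436 + 14193) = 451 MiB
size_t memory_used_unaccounted(size_t mem_total, size_t mem_free, size_t mem_self) {
    return mem_total - (mem_free + mem_self);
}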

The intended immediate use is to make it easier to distribute models efficiently across devices. I also intend to reuse this code to automatically determine which parts of the model to put on which device for optimal performance. Long-term, I would also like to expose this information via the HTTP server in order to establish a Pareto frontier of quality vs. memory use for different quantizations of different models.

Open problems:

  • I added a function llama_print_memory_breakdown to the llama API which produces the above table on the console. Internally this function uses another new function, llama_backend_info, which returns a struct with information about the backends used by a llama_context. I'm not sure whether the latter should be part of the public API, and if so, in what form.
  • I added methods like llama_model::memory_use(ggml_backend_dev_t dev) which return the memory used on a specified device. However, I'm not sure whether the device is the correct argument type here. Would it make more sense to pass a ggml_backend_buffer_type_t instead? In particular, I think this is the only correct way to handle e.g. CUDA_Host buffers (see the sketch after this list).
  • The memory for e.g. the CUDA pools is currently listed under "unaccounted", but it should be under "compute". Currently it is not possible for llama.cpp to retrieve this information. I think it would make sense to extend the ggml backend interface with a function that returns the total amount of device memory allocated by the backend.
  • I'm not sure what to show, if anything, for the CPU. "Free" memory does not have a clear-cut definition in this context, so I'm only showing total memory and the memory that is definitely allocated by the CPU backend.
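
To make the second bullet concrete, here is a rough sketch of what the buffer-type-based variant could look like. The bufs member and the exact set of buffers to sum over are assumptions made for illustration, not the code in this PR; the ggml calls themselves (ggml_backend_buffer_get_type, ggml_backend_buffer_get_size) exist in the current backend API:

// sketch only: per-buffer-type accounting instead of per-device accounting
size_t llama_model::memory_use(ggml_backend_buffer_type_t buft) const {
    size_t n_bytes = 0;
    for (ggml_backend_buffer_t buf : bufs) { // hypothetical member: the buffers holding the model weights
        if (ggml_backend_buffer_get_type(buf) == buft) {
            n_bytes += ggml_backend_buffer_get_size(buf); // allocated size in bytes
        }
    }
    return n_bytes;
}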

int32_t n_eval;
int32_t n_reused; // number of times a ggml compute graph had been reused
// ms == milliseconds
double t_start_ms; // time needed for startup
Member

I don't think this is correct; it's just the timestamp at startup.

Comment on lines +1361 to +1380

LLAMA_API size_t llama_backend_count(const struct llama_context * ctx);

struct llama_backend_info_data {
    const char * name;

    struct {
        const char * name;
        const char * description;

        // device memory is in bytes
        size_t memory_total;             // total memory as reported by the device
        size_t memory_free;              // free memory as reported by the device
        size_t memory_used_self;         // sum of model, context, and compute
        size_t memory_used_self_model;   // memory allocated for the model
        size_t memory_used_self_context; // memory allocated for the context
        size_t memory_used_self_compute; // memory allocated for temporary compute buffers
        size_t memory_used_unaccounted;  // memory with unknown use, e.g. drivers or other programs, total - (free + self)
    } device;
};
Member

I don't think tracking memory usage per-backend is the right way to do this. There are two reasonable options:

  • Tracking memory per device
  • Tracking memory per buffer type

In practice, it can be hard to map a buffer type to a device. For example, should a CUDA_Host buffer count as a CPU device or as a CUDA device? What device should a CUDA_Split buffer belong to? It allocates memory from multiple devices.

Therefore, I think the only reasonable way to do this is per buffer type.
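
For reference, a minimal sketch of what per-buffer-type aggregation could look like; all_buffers is a placeholder for every buffer owned by the model, context, and scheduler, and the function is illustrative rather than part of this PR:

#include <cstdio>
#include <map>
#include <vector>

#include "ggml-backend.h"

// sketch only: aggregate memory use per buffer type rather than per backend/device;
// CUDA_Host and CUDA_Split buffers are handled naturally because every buffer has
// exactly one buffer type
static void print_memory_per_buft(const std::vector<ggml_backend_buffer_t> & all_buffers) {
    std::map<ggml_backend_buffer_type_t, size_t> bytes_per_buft;
    for (ggml_backend_buffer_t buf : all_buffers) {
        bytes_per_buft[ggml_backend_buffer_get_type(buf)] += ggml_backend_buffer_get_size(buf);
    }
    for (const auto & [buft, n_bytes] : bytes_per_buft) {
        printf("%s: %zu bytes\n", ggml_backend_buft_name(buft), n_bytes);
    }
}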
