Releases · ggerganov/llama.cpp
b3079
ggml : prevent builds with -ffinite-math-only (#7726)

This enforces a check that -fno-finite-math-only was set and that the compiler is not operating in finite-math mode. During the rewrite of silu and softmax for CPU (#7154), an issue emerged where results were nondeterministic when >1 slot was used, as found by @JohannesGaessler. @LostRuins narrowed the problem down to -ffinite-math-only, which was theorised to make SiLU return NaN or other garbage instead of flushing small values to 0. @jart proposed a fix that @ggerganov then implemented in this change.

ref https://github.com/ggerganov/llama.cpp/pull/7154#issuecomment-2145661825
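For reference, GCC and clang define `__FINITE_MATH_ONLY__` to 1 when -ffinite-math-only is in effect (it is implied by -ffast-math and -Ofast), so a compile-time guard along these lines is enough. A minimal sketch of such a guard, not necessarily the exact check that landed:

```c
// Refuse to compile when the compiler is in finite-math-only mode:
// SiLU/softmax rely on well-defined inf/NaN semantics, which
// -ffinite-math-only (implied by -ffast-math/-Ofast) breaks.
#if defined(__FINITE_MATH_ONLY__) && __FINITE_MATH_ONLY__ == 1
#error "compiler is in finite-math-only mode; rebuild with -fno-finite-math-only"
#endif
```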
b3078
llama : offload to RPC in addition to other backends (#7640)

* llama : offload to RPC in addition to other backends
* fix copy_tensor being called on the src buffer instead of the dst buffer
* always initialize views in the view_src buffer
* add RPC backend to Makefile build
* add endpoint to all RPC object names
* add rpc-server to Makefile
* Update llama.cpp

Co-authored-by: slaren <[email protected]>
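For context, the RPC backend proxies tensor operations to a remote rpc-server process, so a remote machine can be treated like just another device. A hedged sketch of connecting to one endpoint, assuming the ggml-rpc API of this era (ggml_backend_rpc_init taking a "host:port" string); the scheduler wiring is elided:

```c
// Sketch: bring up an RPC backend next to local ones (assumes the
// ggml-rpc.h API where ggml_backend_rpc_init takes "host:port").
#include "ggml-rpc.h"
#include <stdio.h>

int main(void) {
    // Hypothetical endpoint; a matching rpc-server must be listening there.
    ggml_backend_t rpc = ggml_backend_rpc_init("192.168.1.10:50052");
    if (rpc == NULL) {
        fprintf(stderr, "failed to connect to RPC server\n");
        return 1;
    }
    // ... register the backend with the scheduler alongside the CPU/GPU
    // backends so layers can be offloaded to it ...
    ggml_backend_free(rpc);
    return 0;
}
```

At the command-line level this is surfaced, if memory serves, as a --rpc host:port[,host:port,...] option on the llama.cpp tools, paired with an rpc-server instance running on each remote host.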
b3077
ggml : use OpenMP as a thread pool (#7606)

* ggml : add OpenMP for multi-threaded processing
* ggml : limit the number of threads used to avoid deadlock
* update shared state n_threads in parallel region
* clear numa affinity for main thread even with openmp
* enable openmp by default
* fix msvc build
* disable openmp on macos
* ci : disable openmp with thread sanitizer
* Update ggml.c

Co-authored-by: slaren <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
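The core idea is to let an OpenMP parallel region act as the thread pool instead of spawning and joining threads per graph evaluation. A minimal sketch of the pattern (illustrative only, not ggml's actual compute loop); note that the thread count is re-read inside the region, since the runtime may grant fewer threads than requested, which is why the shared n_threads state has to be updated there:

```c
// Sketch: an OpenMP parallel region used as a thread pool.
// Build with: cc -fopenmp omp_pool.c
#include <omp.h>
#include <stdio.h>

// Hypothetical per-thread worker: processes rows ith, ith+nth, ith+2*nth, ...
static void compute_chunk(int ith, int nth) {
    printf("worker %d of %d running\n", ith, nth);
}

int main(void) {
    const int n_threads_req = 4;

    #pragma omp parallel num_threads(n_threads_req)
    {
        // Re-read the thread count inside the region: the runtime may
        // grant fewer threads than requested, so shared state derived
        // from it must reflect what was actually provided.
        int nth = omp_get_num_threads();
        int ith = omp_get_thread_num();
        compute_chunk(ith, nth);
    }
    return 0;
}
```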
b3076
make: fix debug options not being applied to NVCC (#7714)
b3075
Vulkan Mixture of Experts (MoE) support (#7628)

* Finish Vulkan mul_mat_id implementation
* Add Vulkan sum_rows and div ops
* Fix MUL_MAT_ID matrix-matrix shader
* Fix MUL_MAT_ID matrix-vector shader dispatch size
* Fix MUL_MAT_ID matrix-vector shader and dispatch code
* Update Vulkan CPU offload for MUL_MAT_ID
* Fix crash when using split mode none and setting a main GPU
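For context, mul_mat_id is the MoE building block: each token's activation is multiplied by the expert matrix selected for it by an id. A scalar reference of those semantics, with hypothetical names (this is what the shaders compute in parallel, not the Vulkan implementation itself):

```c
#include <stddef.h>

// Rough scalar sketch of mul_mat_id-style semantics: for each token t,
// multiply its input vector by the expert matrix chosen by ids[t].
void mul_mat_id_ref(const float *experts, // [n_expert][rows][cols]
                    const float *x,       // [n_tokens][cols]
                    const int   *ids,     // [n_tokens], values in [0, n_expert)
                    float       *y,       // [n_tokens][rows]
                    int n_expert, int rows, int cols, int n_tokens) {
    (void)n_expert; // ids are assumed to be valid expert indices
    for (int t = 0; t < n_tokens; t++) {
        const float *w = experts + (size_t)ids[t] * rows * cols;
        for (int r = 0; r < rows; r++) {
            float sum = 0.0f;
            for (int c = 0; c < cols; c++) {
                sum += w[r * cols + c] * x[t * cols + c];
            }
            y[t * rows + r] = sum;
        }
    }
}
```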
b3074
cmake : add pkg-config spec file for llama.cpp (#7702)
b3073
llama : MiniCPM support tied embeddings (#7664)

* support lm_head
* remove the code block

Co-authored-by: zhangkaihuo <[email protected]>
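Tied embeddings share one weight matrix between the input token embedding and the output projection (lm_head), which is why a separate lm_head tensor may be absent from such checkpoints. A minimal scalar sketch of a tied head, with hypothetical names, not llama.cpp's implementation:

```c
// Sketch of a tied output head: logits are computed against the same
// embedding matrix used for input lookups, so no separate lm_head
// weight is stored.
void logits_tied_head(const float *tok_embd, // [n_vocab][n_embd]
                      const float *hidden,   // [n_embd]
                      float       *logits,   // [n_vocab]
                      int n_vocab, int n_embd) {
    for (int v = 0; v < n_vocab; v++) {
        float dot = 0.0f;
        for (int i = 0; i < n_embd; i++) {
            dot += tok_embd[v * n_embd + i] * hidden[i];
        }
        logits[v] = dot;
    }
}
```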
b3072
llama : avoid double token-to-piece cache (#7654)

ggml-ci
b3071
kompute : implement op_getrows_f32 (#6403)

op_getrows_f32 has been required since https://github.com/ggerganov/llama.cpp/pull/6122 for the Vulkan-with-Kompute backend to be functional. As such, implement this op to make the backend functional again.
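For context, get_rows gathers whole rows of a tensor by index (it is used for embedding lookups, among other things). A scalar reference of the f32 case, with hypothetical names, not the Kompute shader itself:

```c
#include <stddef.h>

// Rough sketch of get_rows semantics for f32: copy the rows of src
// selected by ids into dst, in order.
void get_rows_f32_ref(const float *src, // [n_rows][n_cols]
                      const int   *ids, // [n_ids], values in [0, n_rows)
                      float       *dst, // [n_ids][n_cols]
                      int n_cols, int n_ids) {
    for (int i = 0; i < n_ids; i++) {
        const float *row = src + (size_t)ids[i] * n_cols;
        for (int c = 0; c < n_cols; c++) {
            dst[i * n_cols + c] = row[c];
        }
    }
}
```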
b3070
fix bug introduced by use of calloc (#7701)

compilade pointed this out on the previous PR