Releases: ggerganov/llama.cpp

b3079

04 Jun 11:50
6d16169
ggml : prevent builds with -ffinite-math-only (#7726)

This enforces a check that -fno-finite-math-only was set and that the compiler
is not operating in finite-math mode. During the rewrite of SiLU and softmax
for CPU in #7154, @JohannesGaessler found that the results were nondeterministic
when more than one slot was used.

@LostRuins narrowed the problem down to -ffinite-math-only, which was theorised
to cause SiLU to return NaN or other garbage instead of flushing small values
to 0. @jart proposed a fix that @ggerganov then implemented in this change.

ref https://github.com/ggerganov/llama.cpp/pull/7154#issuecomment-2145661825
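
As a rough sketch of what such a guard can look like (illustrative, not the exact ggml code): GCC and Clang predefine __FINITE_MATH_ONLY__ when finite-math mode is active, so a build can refuse to proceed:

```cpp
// sketch of a compile-time guard against finite-math mode; GCC/Clang set
// __FINITE_MATH_ONLY__ to 1 under -ffinite-math-only (and under -Ofast)
#if defined(__FINITE_MATH_ONLY__) && __FINITE_MATH_ONLY__
#error "builds with -ffinite-math-only are not supported: SiLU/softmax rely on IEEE inf/NaN semantics"
#endif

int main() { return 0; }
```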

b3078

03 Jun 19:33
bde7cd3
llama : offload to RPC in addition to other backends (#7640)

* llama : offload to RPC in addition to other backends

* fix copy_tensor being called on the src buffer instead of the dst buffer

- always initialize views in the view_src buffer

- add RPC backend to Makefile build

- add endpoint to all RPC object names

* add rpc-server to Makefile

* Update llama.cpp

Co-authored-by: slaren <[email protected]>

---------

Co-authored-by: slaren <[email protected]>
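
A minimal sketch of driving this from the C API (assuming a build with the RPC backend enabled; the endpoint and model path below are made-up examples):

```cpp
// minimal sketch: offload to a remote rpc-server alongside local backends;
// the endpoint and model path are hypothetical
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mp = llama_model_default_params();
    mp.rpc_servers  = "192.168.0.2:50052"; // comma-separated rpc-server endpoints
    mp.n_gpu_layers = 99;                  // offload as many layers as possible

    llama_model * model = llama_load_model_from_file("model.gguf", mp);
    if (model == NULL) {
        llama_backend_free();
        return 1;
    }

    // ... create a context and run inference as usual ...

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```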

b3077

03 Jun 19:10
a5735e4
ggml : use OpenMP as a thread pool (#7606)

* ggml : added OpenMP for multi-threaded processing

* ggml : Limit the number of threads used to avoid deadlock

* update shared state n_threads in parallel region

* clear numa affinity for main thread even with openmp

* enable openmp by default

* fix msvc build

* disable openmp on macos

* ci : disable openmp with thread sanitizer

* Update ggml.c

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: slaren <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
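
Roughly, the pattern looks like this (a standalone sketch, not the ggml code itself): one OpenMP parallel region serves as the thread pool, and the shared thread count is refreshed inside the region because the runtime may deliver fewer threads than requested:

```cpp
// standalone sketch of the OpenMP thread-pool pattern; build with -fopenmp
#include <cstdio>
#include <omp.h>

int main() {
    int n_threads = 8; // requested

    #pragma omp parallel num_threads(n_threads)
    {
        #pragma omp single
        n_threads = omp_get_num_threads(); // update shared state in the region
        // implicit barrier after 'single': all threads now see the real count

        const int ith = omp_get_thread_num();
        std::printf("worker %d of %d\n", ith, n_threads);
        // ... each worker would process its slice of the compute graph here ...
    }
    return 0;
}
```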

b3076

03 Jun 19:08
0b832d5
make: fix debug options not being applied to NVCC (#7714)

b3075

03 Jun 11:29
3d7ebf6
Vulkan Mixture of Experts (MoE) support (#7628)

* Finish Vulkan mul_mat_id implementation

* Add Vulkan sum_rows and div ops

* Fix MUL_MAT_ID matrix matrix shader

* Fix MUL_MAT_ID matrix vector shader dispatch size

* Fix MUL_MAT_ID matrix vector shader and dispatch code

* Update Vulkan CPU offload for MUL_MAT_ID

* Fix crash when using split mode none and setting a main GPU
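
For reference, this is what MUL_MAT_ID computes, conceptually (a plain C++ sketch with made-up dimensions, not the Vulkan shader): each token's activation is multiplied by the expert weight matrix selected by its routing id:

```cpp
// conceptual reference for MUL_MAT_ID -- the core of MoE feed-forward routing
#include <cstdio>
#include <vector>

int main() {
    const int n_expert = 4, n_in = 2, n_out = 3, n_tok = 2;

    // n_expert matrices of shape n_out x n_in; expert e is filled with e
    std::vector<float> experts(n_expert * n_out * n_in);
    for (int e = 0; e < n_expert; e++)
        for (int i = 0; i < n_out * n_in; i++)
            experts[e * n_out * n_in + i] = (float) e;

    std::vector<float> x   = {1, 2, 3, 4}; // n_tok x n_in activations
    std::vector<int>   ids = {2, 0};       // routed expert per token
    std::vector<float> y(n_tok * n_out, 0.0f);

    for (int t = 0; t < n_tok; t++) {
        const float * W = &experts[ids[t] * n_out * n_in];
        for (int o = 0; o < n_out; o++)
            for (int i = 0; i < n_in; i++)
                y[t * n_out + o] += W[o * n_in + i] * x[t * n_in + i];
    }

    for (int t = 0; t < n_tok; t++)
        std::printf("tok %d: %.0f %.0f %.0f\n", t, y[t*3], y[t*3+1], y[t*3+2]);
    return 0;
}
```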

b3074

03 Jun 11:16
a10cda5
cmake : add pkg-config spec file for llama.cpp (#7702)

b3073

03 Jun 09:47
6f28a33
llama : support tied embeddings for MiniCPM (#7664)

* support lm_head

* remove the code block

---------

Co-authored-by: zhangkaihuo <[email protected]>
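
Tied embeddings in a nutshell (an illustrative sketch, not MiniCPM's actual code): with no separate lm_head tensor, the output logits are computed against the same token-embedding matrix used for input lookup:

```cpp
// illustrative sketch of weight tying: logits reuse the embedding matrix
#include <cstdio>
#include <vector>

int main() {
    const int n_vocab = 4, n_embd = 3;
    std::vector<float> tok_embd = { // n_vocab x n_embd, shared both ways
        1, 0, 0,
        0, 1, 0,
        0, 0, 1,
        1, 1, 1,
    };
    std::vector<float> h = {0.5f, -1.0f, 2.0f}; // final hidden state

    // logits[v] = dot(tok_embd[v], h) -- same weights as the input lookup
    for (int v = 0; v < n_vocab; v++) {
        float logit = 0.0f;
        for (int i = 0; i < n_embd; i++)
            logit += tok_embd[v * n_embd + i] * h[i];
        std::printf("logit[%d] = %.2f\n", v, logit);
    }
    return 0;
}
```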

b3072

03 Jun 07:20
549279d
llama : avoid double token-to-piece cache (#7654)

ggml-ci

b3071

03 Jun 07:09
9e405b6
kompute : implement op_getrows_f32 (#6403)

op_getrows_f32 has been required since https://github.com/ggerganov/llama.cpp/pull/6122
for the Kompute (Vulkan) backend to be functional.

This change implements the op to make the backend functional again.
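
Conceptually, GET_ROWS gathers selected rows of a matrix by index, as in an embedding lookup; a plain C++ sketch of the f32 case (not the Kompute shader itself):

```cpp
// conceptual sketch of GET_ROWS (f32): gather rows of src by index
#include <cstdio>
#include <vector>

int main() {
    const int n_rows = 4, n_cols = 3;
    std::vector<float> src(n_rows * n_cols);
    for (int i = 0; i < n_rows * n_cols; i++) src[i] = (float) i;

    std::vector<int> rows = {3, 1}; // row indices to gather
    std::vector<float> dst(rows.size() * n_cols);
    for (size_t r = 0; r < rows.size(); r++)
        for (int c = 0; c < n_cols; c++)
            dst[r * n_cols + c] = src[rows[r] * n_cols + c];

    for (size_t r = 0; r < rows.size(); r++)
        std::printf("%.0f %.0f %.0f\n", dst[r*3], dst[r*3+1], dst[r*3+2]);
    return 0;
}
```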

b3070

02 Jun 22:36
3413ae2
fix bug introduced in switching to calloc (#7701)

compilade pointed this out on the previous PR
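
The note does not detail the offending call, but as a generic illustration of why malloc-to-calloc conversions are error-prone (hypothetical code, not the actual bug):

```cpp
// generic illustration only -- not the actual llama.cpp bug.
// calloc takes (count, size); a hasty malloc(n * sizeof(T)) conversion can
// silently allocate the wrong amount
#include <cstdlib>

struct node { int id; node * next; };

int main() {
    const size_t n = 16;

    // wrong: one element of n bytes, not n zeroed elements of sizeof(node)
    // node * bad = (node *) std::calloc(1, n);

    // right: n zero-initialized elements of sizeof(node) bytes each
    node * good = (node *) std::calloc(n, sizeof(node));
    std::free(good);
    return 0;
}
```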