Releases · ggerganov/llama.cpp
b3079
ggml : prevent builds with -ffinite-math-only (#7726)

This enforces a check that -fno-finite-math-only was set and that the compiler is not operating in finite-math mode. During the rewrite of silu and softmax for CPU (#7154), an issue emerged where results were nondeterministic when >1 slot was used, as found by @JohannesGaessler. @LostRuins narrowed the problem down to -ffinite-math-only, which was theorised to make SiLU return NaN or other garbage instead of flushing small values to 0. @jart proposed a fix that @ggerganov then implemented in this change.

ref https://github.com/ggerganov/llama.cpp/pull/7154#issuecomment-2145661825
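For reference, GCC and clang define `__FINITE_MATH_ONLY__` to 1 when -ffinite-math-only is in effect (it is implied by -ffast-math and -Ofast), so a compile-time guard along these lines is enough. A minimal sketch of such a guard, not necessarily the exact check that landed:

```c
// Refuse to compile when the compiler is in finite-math-only mode:
// SiLU/softmax rely on well-defined inf/NaN semantics, which
// -ffinite-math-only (implied by -ffast-math/-Ofast) breaks.
#if defined(__FINITE_MATH_ONLY__) && __FINITE_MATH_ONLY__ == 1
#error "compiler is in finite-math-only mode; rebuild with -fno-finite-math-only"
#endif
```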
b3078
llama : offload to RPC in addition to other backends (#7640)

* llama : offload to RPC in addition to other backends
* fix copy_tensor being called on the src buffer instead of the dst buffer
* always initialize views in the view_src buffer
* add RPC backend to Makefile build
* add endpoint to all RPC object names
* add rpc-server to Makefile
* Update llama.cpp

Co-authored-by: slaren <[email protected]>
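For context, the RPC backend proxies tensor operations to a remote rpc-server process, so a remote machine can be treated like just another device. A hedged sketch of connecting to one endpoint, assuming the ggml-rpc API of this era (ggml_backend_rpc_init taking a "host:port" string); the scheduler wiring is elided:

```c
// Sketch: bring up an RPC backend next to local ones (assumes the
// ggml-rpc.h API where ggml_backend_rpc_init takes "host:port").
#include "ggml-rpc.h"
#include <stdio.h>

int main(void) {
    // Hypothetical endpoint; a matching rpc-server must be listening there.
    ggml_backend_t rpc = ggml_backend_rpc_init("192.168.1.10:50052");
    if (rpc == NULL) {
        fprintf(stderr, "failed to connect to RPC server\n");
        return 1;
    }
    // ... register the backend with the scheduler alongside the CPU/GPU
    // backends so layers can be offloaded to it ...
    ggml_backend_free(rpc);
    return 0;
}
```

At the command-line level this is surfaced, if memory serves, as a --rpc host:port[,host:port,...] option on the llama.cpp tools, paired with an rpc-server instance running on each remote host.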
b3077
ggml : use OpenMP as a thread pool (#7606)

* ggml : add OpenMP for multi-threaded processing
* ggml : limit the number of threads used to avoid deadlock
* update shared state n_threads in parallel region
* clear numa affinity for main thread even with openmp
* enable openmp by default
* fix msvc build
* disable openmp on macos
* ci : disable openmp with thread sanitizer
* Update ggml.c

Co-authored-by: slaren <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
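The core idea is to let an OpenMP parallel region act as the thread pool instead of spawning and joining threads per graph evaluation. A minimal sketch of the pattern (illustrative only, not ggml's actual compute loop); note that the thread count is re-read inside the region, since the runtime may grant fewer threads than requested, which is why the shared n_threads state has to be updated there:

```c
// Sketch: an OpenMP parallel region used as a thread pool.
// Build with: cc -fopenmp omp_pool.c
#include <omp.h>
#include <stdio.h>

// Hypothetical per-thread worker: processes rows ith, ith+nth, ith+2*nth, ...
static void compute_chunk(int ith, int nth) {
    printf("worker %d of %d running\n", ith, nth);
}

int main(void) {
    const int n_threads_req = 4;

    #pragma omp parallel num_threads(n_threads_req)
    {
        // Re-read the thread count inside the region: the runtime may
        // grant fewer threads than requested, so shared state derived
        // from it must reflect what was actually provided.
        int nth = omp_get_num_threads();
        int ith = omp_get_thread_num();
        compute_chunk(ith, nth);
    }
    return 0;
}
```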
b3076
make: fix debug options not being applied to NVCC (#7714)
b3075
Vulkan Mixture of Experts (MoE) support (#7628)

* Finish Vulkan mul_mat_id implementation
* Add Vulkan sum_rows and div ops
* Fix MUL_MAT_ID matrix-matrix shader
* Fix MUL_MAT_ID matrix-vector shader dispatch size
* Fix MUL_MAT_ID matrix-vector shader and dispatch code
* Update Vulkan CPU offload for MUL_MAT_ID
* Fix crash when using split mode none and setting a main GPU
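For context, mul_mat_id is the MoE building block: each token's activation is multiplied by the expert matrix selected for it by an id. A scalar reference of those semantics, with hypothetical names (this is what the shaders compute in parallel, not the Vulkan implementation itself):

```c
#include <stddef.h>

// Rough scalar sketch of mul_mat_id-style semantics: for each token t,
// multiply its input vector by the expert matrix chosen by ids[t].
void mul_mat_id_ref(const float *experts, // [n_expert][rows][cols]
                    const float *x,       // [n_tokens][cols]
                    const int   *ids,     // [n_tokens], values in [0, n_expert)
                    float       *y,       // [n_tokens][rows]
                    int n_expert, int rows, int cols, int n_tokens) {
    (void)n_expert; // ids are assumed to be valid expert indices
    for (int t = 0; t < n_tokens; t++) {
        const float *w = experts + (size_t)ids[t] * rows * cols;
        for (int r = 0; r < rows; r++) {
            float sum = 0.0f;
            for (int c = 0; c < cols; c++) {
                sum += w[r * cols + c] * x[t * cols + c];
            }
            y[t * rows + r] = sum;
        }
    }
}
```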
b3074
cmake : add pkg-config spec file for llama.cpp (#7702)
b3073
llama : MiniCPM support tied embeddings (#7664)

* support lm_head
* remove the code block

Co-authored-by: zhangkaihuo <[email protected]>
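Tied embeddings share one weight matrix between the input token embedding and the output projection (lm_head), which is why a separate lm_head tensor may be absent from such checkpoints. A minimal scalar sketch of a tied head, with hypothetical names, not llama.cpp's implementation:

```c
// Sketch of a tied output head: logits are computed against the same
// embedding matrix used for input lookups, so no separate lm_head
// weight is stored.
void logits_tied_head(const float *tok_embd, // [n_vocab][n_embd]
                      const float *hidden,   // [n_embd]
                      float       *logits,   // [n_vocab]
                      int n_vocab, int n_embd) {
    for (int v = 0; v < n_vocab; v++) {
        float dot = 0.0f;
        for (int i = 0; i < n_embd; i++) {
            dot += tok_embd[v * n_embd + i] * hidden[i];
        }
        logits[v] = dot;
    }
}
```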
b3072
llama : avoid double token-to-piece cache (#7654)

ggml-ci
b3071
kompute : implement op_getrows_f32 (#6403)

op_getrows_f32 has been required since https://github.com/ggerganov/llama.cpp/pull/6122 for the Vulkan-with-Kompute backend to be functional. As such, implement this op to make the backend functional again.
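For context, get_rows gathers whole rows of a tensor by index (it is used for embedding lookups, among other things). A scalar reference of the f32 case, with hypothetical names, not the Kompute shader itself:

```c
#include <stddef.h>

// Rough sketch of get_rows semantics for f32: copy the rows of src
// selected by ids into dst, in order.
void get_rows_f32_ref(const float *src, // [n_rows][n_cols]
                      const int   *ids, // [n_ids], values in [0, n_rows)
                      float       *dst, // [n_ids][n_cols]
                      int n_cols, int n_ids) {
    for (int i = 0; i < n_ids; i++) {
        const float *row = src + (size_t)ids[i] * n_cols;
        for (int c = 0; c < n_cols; c++) {
            dst[i * n_cols + c] = row[c];
        }
    }
}
```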
b3070
fix bug introduced by use of calloc (#7701)

compilade pointed this out on the previous PR