Releases: CodeLinaro/llama.cpp

b3637

27 Aug 14:50
3246fe8
Fix minicpm example directory (#9111)

b3622

26 Aug 05:45
93bc383
common: fix the --n-gpu-layers-draft argument not being found (#9175)
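
For context, this flag sets how many layers of the draft model are offloaded to the GPU during speculative decoding. An illustrative invocation (model paths are placeholders): `llama-speculative -m target.gguf -md draft.gguf --n-gpu-layers-draft 99`.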

b3620

24 Aug 22:33
e11bd85
CPU/CUDA: Gemma 2 FlashAttention support (#8542)

* CPU/CUDA: Gemma 2 FlashAttention support

* apply logit_softcap to scale in kernel (see the sketch below)

* disable logit softcapping tests on Metal

* remove metal check
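
The "apply logit_softcap to scale" item refers to Gemma 2's score softcapping, which bounds each raw attention score to (-cap, cap). A minimal sketch of the math, with illustrative names rather than the actual kernel code:

```cpp
#include <cmath>

// Gemma 2 style logit softcapping: softcap(x) = cap * tanh(x / cap).
// Folding the attention scale into the tanh argument (scale / cap) lets the
// kernel do a single multiply instead of a multiply followed by a divide,
// which is what "apply logit_softcap to scale" refers to.
float softcapped_score(float kq, float scale, float cap) {
    return cap * std::tanh(kq * (scale / cap));
}
```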

b3616

23 Aug 01:18
11b84eb
[SYCL] Add a space to suppress a cmake warning (#9133)

b3614

22 Aug 02:40
a1631e5
llama : simplify Mamba with advanced batch splits (#8526)

* llama : advanced batch splits

This includes equal-sequence-length batch splits, which are useful
for simplifying recurrent model operators.

* llama : always make recurrent state slots contiguous

* ggml : simplify mamba operators

* llama : fix integer signedness mixing

* llama : logits_all has priority over batch->logits

Otherwise, the server embeddings tests failed.
This was likely an existing problem but was only detected here
because of an additional assertion.

* llama : apply suggestions

Co-authored-by: Georgi Gerganov <[email protected]>

* llama : fix t5 segfault

* llama : fix Mamba session save and restore

* llama : minor cosmetic changes

* llama : rename llama_reorder_outputs to llama_output_reorder

Also move it closer to llama_output_reserve.

* llama : fix pooled embeddings when using batches with equal_seqs

* minor : add struct members for clarity

ggml-ci

* llama : fix T5 segfault again

* llama : fix Mamba pooled embeddings with multiple sequences

Until the pooled embeddings are refactored to allow splitting
across ubatches for causal embeddings,
recurrent models can only process a single sequence per ubatch
when calculating pooled embeddings.

* llama : add llama_model_is_recurrent to simplify figuring that out (usage sketched below)

This will make it easier to more cleanly support RWKV-v6 and Mamba-2.

* llama : fix simple splits when the batch contains embeddings

---------

Co-authored-by: Georgi Gerganov <[email protected]>
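
The new `llama_model_is_recurrent` API mentioned above lets callers branch on the architecture without hard-coding model names. A hedged usage sketch; the batching policy below illustrates the constraint described in the pooled-embeddings item and is not code from the PR:

```cpp
#include "llama.h"

// Pick how many sequences to pack per ubatch when computing pooled
// embeddings. Per the notes above, recurrent models (e.g. Mamba) are
// currently limited to one sequence per ubatch in this mode.
int n_seqs_per_ubatch(const llama_model * model, int requested) {
    if (llama_model_is_recurrent(model)) {
        return 1;           // recurrent: single sequence per ubatch
    }
    return requested;       // attention-based: no such restriction
}
```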

b3608

20 Aug 17:12
50addec
[SYCL] fallback mmvq (#9088)

* fallback mmvq to mul_mat

* mmvq in cuda path

* Update ggml/src/ggml-sycl.cpp

Co-authored-by: Alberto Cabrera Pérez <[email protected]>

---------

Co-authored-by: Alberto Cabrera Pérez <[email protected]>
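
For readers unfamiliar with the shorthand: mmvq is the quantized matrix-vector kernel, and this change makes the SYCL backend fall back to the general `mul_mat` path when mmvq does not apply. A schematic of that dispatch, with placeholder types and names rather than the ggml-sycl source:

```cpp
struct tensor { int ne0, ne1; /* shape; everything else elided */ };

// mmvq only covers the matrix-times-single-column case here; this predicate
// is an illustrative stand-in for the backend's real support check.
static bool mmvq_supported(const tensor & src1) { return src1.ne1 == 1; }

static void mul_mat_vec_q(const tensor &, const tensor &, tensor &) { /* fast quantized path */ }
static void mul_mat      (const tensor &, const tensor &, tensor &) { /* general fallback    */ }

static void dispatch_mul_mat(const tensor & src0, const tensor & src1, tensor & dst) {
    if (mmvq_supported(src1)) {
        mul_mat_vec_q(src0, src1, dst);
    } else {
        mul_mat(src0, src1, dst);   // the fallback this release adds
    }
}
```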

b3605

19 Aug 18:44
cfac111
cann: add doc for cann backend (#8867)

Co-authored-by: xuedinge233 <[email protected]>
Co-authored-by: hipudding <[email protected]>
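
The CANN backend targets Huawei Ascend NPUs; per the upstream build docs it is enabled at configure time, e.g. `cmake -B build -DGGML_CANN=on && cmake --build build`, though the documentation added in this change is the authoritative reference.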

b3599

17 Aug 01:23
8b3befc
server : refactor middleware and /health endpoint (#9056)

* server : refactor middleware and /health endpoint

* move "fail_on_no_slot" to /slots

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* fix server tests

* fix CI

* update server docs

---------

Co-authored-by: Georgi Gerganov <[email protected]>
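
After this refactor, `/health` can be probed with a plain GET, e.g. `curl http://localhost:8080/health`, and the `fail_on_no_slot` check is served by `/slots` instead (presumably as a query parameter such as `/slots?fail_on_no_slot=1`; see the updated server docs for the exact form).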

b3590

15 Aug 20:10
4b9afbb
retrieval : fix memory leak in retrieval query handling (#8955)

* retrieval

* Reuse the query batch to avoid frequent memory allocation

* remove unused whitespace
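
The fix follows the usual reuse pattern: allocate one `llama_batch` up front and reset it between queries instead of re-initializing (and previously leaking) a new one each iteration. A minimal sketch of the pattern, not the retrieval example's exact code:

```cpp
#include "llama.h"

void run_queries(llama_context * ctx, int n_queries) {
    // One allocation, reused across all queries: up to 512 tokens,
    // token IDs only (embd = 0), one sequence ID per token.
    llama_batch batch = llama_batch_init(512, 0, 1);
    for (int i = 0; i < n_queries; i++) {
        batch.n_tokens = 0;  // clear the batch instead of free + re-init
        // ... append the tokens of query i, then llama_decode(ctx, batch)
    }
    llama_batch_free(batch);  // single matching free
}
```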

b3584

15 Aug 06:15
5fd89a7
Vulkan Optimizations and Fixes (#8959)

* Optimize Vulkan REPEAT performance

* Use Vulkan GLSL fused multiply-add instruction where possible

* Add GGML_VULKAN_PERF option to output performance data per operator

* Rework and fix Vulkan descriptor set and descriptor pool handling

* Fix float32 concat f16 shader validation error

* Add Vulkan GROUP_NORM eps parameter

* Fix validation error with transfer queue memory barrier flags

* Remove trailing whitespace
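
Two items above are directly user-visible: `GGML_VULKAN_PERF` is a build-time CMake option (e.g. `cmake -B build -DGGML_VULKAN=ON -DGGML_VULKAN_PERF=ON`; the exact spelling follows the note above) that prints per-operator timing data, and the fused multiply-add work uses GLSL's built-in `fma(a, b, c)` where the shaders previously issued a separate multiply and add.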