Releases: CodeLinaro/llama.cpp
Releases · CodeLinaro/llama.cpp
b3637
Fix minicpm example directory (#9111)
b3622
common: fixed not working find argument --n-gpu-layers-draft (#9175)
b3620
CPU/CUDA: Gemma 2 FlashAttention support (#8542) * CPU/CUDA: Gemma 2 FlashAttention support * apply logit_softcap to scale in kernel * disable logit softcapping tests on Metal * remove metal check
b3616
[SYCL] Add a space to supress a cmake warning (#9133)
b3614
llama : simplify Mamba with advanced batch splits (#8526) * llama : advanced batch splits This includes equal-sequence-length batch splits which are useful to simplify recurrent model operators. * llama : always make recurrent state slots contiguous * ggml : simplify mamba operators * llama : fix integer signedness mixing * llama : logits_all has priority over batch->logits Otherwise, the server embeddings tests failed. This was likely an existing problem but was only detected here because of an additional assertion. * llama : apply suggestions Co-authored-by: Georgi Gerganov <[email protected]> * llama : fix t5 segfault * llama : fix Mamba session save and restore * llama : minor cosmetic changes * llama : rename llama_reorder_outputs to llama_output_reorder Also move it closer to llama_output_reserve. * llama : fix pooled embeddings when using batches with equal_seqs * minor : add struct members for clarity ggml-ci * llama : fix T5 segfault again * llama : fix Mamba pooled embeddings with multiple sequences Until the pooled embeddings are refactored to allow splitting across ubatches for causal embeddings, recurrent models can only process a single sequence per ubatch when calculating pooled embeddings. * llama : add llama_model_is_recurrent to simplify figuring that out This will make it easier to more cleanly support RWKV-v6 and Mamba-2. * llama : fix simple splits when the batch contains embeddings --------- Co-authored-by: Georgi Gerganov <[email protected]>
b3608
[SYCL] fallback mmvq (#9088) * fallback mmvq to mul_mat * mmvq in cuda path * Update ggml/src/ggml-sycl.cpp Co-authored-by: Alberto Cabrera Pérez <[email protected]> --------- Co-authored-by: Alberto Cabrera Pérez <[email protected]>
b3605
cann: add doc for cann backend (#8867) Co-authored-by: xuedinge233 <[email protected]> Co-authored-by: hipudding <[email protected]>
b3599
server : refactor middleware and /health endpoint (#9056) * server : refactor middleware and /health endpoint * move "fail_on_no_slot" to /slots * Update examples/server/server.cpp Co-authored-by: Georgi Gerganov <[email protected]> * fix server tests * fix CI * update server docs --------- Co-authored-by: Georgi Gerganov <[email protected]>
b3590
retrieval : fix memory leak in retrieval query handling (#8955) * retrieval * Reuse querybatch to reduce frequent memory allocation * delete unused white space
b3584
Vulkan Optimizations and Fixes (#8959) * Optimize Vulkan REPEAT performance * Use Vulkan GLSL fused multiply-add instruction where possible * Add GGML_VULKAN_PERF option to output performance data per operator * Rework and fix Vulkan descriptor set and descriptor pool handling * Fix float32 concat f16 shader validation error * Add Vulkan GROUP_NORM eps parameter * Fix validation error with transfer queue memory barrier flags * Remove trailing whitespaces