taronaeo (Collaborator) commented Sep 6, 2025

fixes #15414

Not sure if my .supports_buft is implemented incorrectly, but the weight tensors are not going through the .set_tensor function, so the weight zTensors have to be re-initialised on the fly during matmul. Not ideal.

Activates the following data types:

  1. FP16
  2. BF16

Fixes:

  1. LLAMA_SET_ROWS=1 causing incorrect inference (see Eval bug: zDNN backend not inferencing correctly after LLAMA_SET_ROWS enablement #15414)
  2. zTensors not being freed correctly, which would exhaust all available memory when llama-bench was used with more than one model
  3. Moved bias zTensor creation to .init_tensor for better performance

Performance

| model | size | params | threads | test | t/s (master) | t/s (PR) | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| granite 3B all F32 | 9.44 GiB | 2.53 B | 1 | pp512 | 52.14 | 51.92 | 1.00 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | 1 | tg128 | 3.92 | 3.86 | 0.98 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | 2 | pp512 | 92.60 | 81.92 | 0.88 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | 2 | tg128 | 4.44 | 4.48 | 1.01 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | 4 | pp512 | 141.14 | 144.85 | 1.03 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | 4 | tg128 | 4.83 | 4.86 | 1.01 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | 8 | pp512 | 216.55 | 215.82 | 1.00 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | 8 | tg128 | 4.97 | 4.95 | 1.00 |
| granite 3B F16 | 4.72 GiB | 2.53 B | 1 | pp512 | 10.42 | 51.68 | 4.96 |
| granite 3B F16 | 4.72 GiB | 2.53 B | 1 | tg128 | 0.45 | 3.43 | 7.62 |
| granite 3B F16 | 4.72 GiB | 2.53 B | 2 | pp512 | 19.61 | 81.78 | 4.17 |
| granite 3B F16 | 4.72 GiB | 2.53 B | 2 | tg128 | 0.89 | 4.17 | 4.69 |
| granite 3B F16 | 4.72 GiB | 2.53 B | 4 | pp512 | 38.99 | 138.58 | 3.55 |
| granite 3B F16 | 4.72 GiB | 2.53 B | 4 | tg128 | 1.73 | 4.67 | 2.70 |
| granite 3B F16 | 4.72 GiB | 2.53 B | 8 | pp512 | 74.60 | 213.83 | 2.87 |
| granite 3B F16 | 4.72 GiB | 2.53 B | 8 | tg128 | 3.17 | 4.9 | 1.55 |
| granite 3B BF16 | 4.72 GiB | 2.53 B | 1 | pp512 | 11.30 | 51.6 | 4.57 |
| granite 3B BF16 | 4.72 GiB | 2.53 B | 1 | tg128 | 0.31 | 3.08 | 9.94 |
| granite 3B BF16 | 4.72 GiB | 2.53 B | 2 | pp512 | 21.40 | 82.45 | 3.85 |
| granite 3B BF16 | 4.72 GiB | 2.53 B | 2 | tg128 | 0.61 | 3.88 | 6.36 |
| granite 3B BF16 | 4.72 GiB | 2.53 B | 4 | pp512 | 42.28 | 142.97 | 3.38 |
| granite 3B BF16 | 4.72 GiB | 2.53 B | 4 | tg128 | 1.22 | 4.41 | 3.61 |
| granite 3B BF16 | 4.72 GiB | 2.53 B | 8 | pp512 | 80.90 | 213.85 | 2.64 |
| granite 3B BF16 | 4.72 GiB | 2.53 B | 8 | tg128 | 2.40 | 4.79 | 2.00 |

Note

Tests were conducted on an IBM z17 mainframe with 40 IFLs (cores) and 128 GB of memory on a shared R&D LPAR.

test-backend-ops

build/bin/test-backend-ops -b zDNN | grep -v "not supported"
ggml_zdnn_init: allocating
ggml_zdnn_init: found 1 device
ggml_zdnn_init: picking default device: zDNN
ggml_zdnn_init: NNPA name: zDNN
ggml_zdnn_init: NNPA_PARMBLKFORMAT_0 = true
ggml_zdnn_init: NNPA_PARMBLKFORMAT_1 = true
Testing 3 devices

Backend 1/3: zDNN
  Device description: IBM Z Neural Network Processing Assist (NNPA)
  Device memory: 0 MB (0 MB free)

  MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=4,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=16,k=4,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=4,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=4,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f16,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f16,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f16,m=16,n=1,k=4,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f16,m=16,n=16,k=4,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=1,k=1,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
ggml_zdnn_free: deallocating
  12353/12353 tests passed
  Backend zDNN: OK
Backend 2/3: BLAS
  Skipping
Backend 3/3: CPU
  Skipping
3/3 backends passed
OK

@github-actions bot added the `ggml` (changes relating to the ggml tensor library for machine learning) and `IBM zDNN` (issues specific to IBM zDNN Accelerator) labels, Sep 6, 2025
@taronaeo taronaeo requested a review from slaren September 6, 2025 18:25
@github-actions bot added the `documentation` (Improvements or additions to documentation) label, Sep 7, 2025
slaren (Member) commented Sep 8, 2025

> Not sure if my .supports_buft is implemented incorrectly, but the weight tensors are not going through the .set_tensor function, so the weight zTensors have to be re-initialised on the fly during matmul. Not ideal.

Are you sure that you are looking at a weight? It might be part of the attention computation.

taronaeo (Collaborator, Author) commented Sep 9, 2025

Sorry I missed this. Yep I can confirm that I am looking at a weight tensor, unless my debugging code is wrong.

Debug Patch

diff --git a/ggml/src/ggml-zdnn/ggml-zdnn.cpp b/ggml/src/ggml-zdnn/ggml-zdnn.cpp
index 7947aab87..bd04beb2d 100644
--- a/ggml/src/ggml-zdnn/ggml-zdnn.cpp
+++ b/ggml/src/ggml-zdnn/ggml-zdnn.cpp
@@ -130,7 +130,11 @@ static void ggml_zdnn_mul_mat_op(ggml_backend_zdnn_context * ctx, const ggml_ten
     // TODO: Weights are somehow not going through `ggml_backend_zdnn_buffer_set_tensor` during model loading.
     //       So we need to load the weights here. Remove this when the issue is fixed.
     //       Problem might be residing in `ggml_backend_zdnn_device_supports_buft`.
-    if (weights_extra->ztensor.is_transformed == false) ggml_zdnn_load_tensor(weights_extra->ztensor, weights->data);
+    if (weights_extra->ztensor.is_transformed == false) {
+       GGML_LOG_INFO("%s: tensor->name = %s | tensor->buffer->usage = %d\n", __func__, weights->name, weights->buffer->usage);
+       ggml_zdnn_load_tensor(weights_extra->ztensor, weights->data);
+       std::raise(SIGINT);
+    }
 
     // GGML_LOG_INFO("%s: tensor '%s' tensor dimensions: [%ld, %ld, %ld, %ld] pre_tfm_desc dimensions: [%ld, %ld, %ld, %ld]\n",
     //               __func__, weights_extra->name,

And as logged, the buffer usage is 1, which equates to GGML_BACKEND_BUFFER_USAGE_WEIGHTS.

$ gdb --args build/bin/llama-cli -m hf_models/granite-3.3-2b-instruct-be.F32.gguf -t 8 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874

ggml_zdnn_mul_mat_op: tensor->name = blk.0.attn_q.weight | tensor->buffer->usage = 1

Thread 1 "llama-cli" received signal SIGINT, Interrupt.
0x000003fff6b98c26 in __pthread_kill_implementation () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-168.el9_6.23.s390x

taronaeo (Collaborator, Author) commented Sep 9, 2025

I did some digging and found that setting .buffer_from_host_ptr = false allows the weight tensors to go through .set_tensor, whereas before only the compute tensors were going through.

.buffer_from_host_ptr = false

diff --git a/ggml/src/ggml-zdnn/ggml-zdnn.cpp b/ggml/src/ggml-zdnn/ggml-zdnn.cpp
index 7947aab87..d6d1d06c8 100644
--- a/ggml/src/ggml-zdnn/ggml-zdnn.cpp
+++ b/ggml/src/ggml-zdnn/ggml-zdnn.cpp
@@ -432,9 +432,14 @@ static void ggml_backend_zdnn_buffer_set_tensor(ggml_backend_buffer_t buffer, gg
     memcpy((char *)tensor->data + offset, data, size);
 
     ggml_backend_zdnn_buffer * extra = (ggml_backend_zdnn_buffer *)tensor->extra;
+    GGML_LOG_INFO("%s: tensor->name = %s | tensor->buffer->usage = %d | tensor->extra->ztensor.is_transformed = %d\n", __func__, tensor->name, tensor->buffer->usage, extra->ztensor.is_transformed);
+
     if (extra->ztensor.is_transformed) zdnn_reset_ztensor(&extra->ztensor);
     ggml_zdnn_load_tensor(extra->ztensor, tensor->data);
 
+    GGML_LOG_INFO("%s: tensor->name = %s | tensor->buffer->usage = %d | tensor->extra->ztensor.is_transformed = %d\n", __func__, tensor->name, tensor->buffer->usage, extra->ztensor.is_transformed);
+    std::raise(SIGINT);
+
     GGML_UNUSED(buffer);
 }
 
@@ -647,7 +652,7 @@ static void ggml_backend_zdnn_device_get_props(ggml_backend_dev_t dev, ggml_back
     props->caps = (ggml_backend_dev_caps) {
         /* .async                = */ false,
         /* .host_buffer          = */ false,
-        /* .buffer_from_host_ptr = */ true,
+        /* .buffer_from_host_ptr = */ false,
         /* .events               = */ false
     };
 }

First tensor to call .set_tensor

ggml_backend_zdnn_buffer_set_tensor: tensor->name = blk.0.attn_q.weight | tensor->buffer->usage = 1 | tensor->extra->ztensor.is_transformed = 0
ggml_backend_zdnn_buffer_set_tensor: tensor->name = blk.0.attn_q.weight | tensor->buffer->usage = 1 | tensor->extra->ztensor.is_transformed = 1

.buffer_from_host_ptr = true (Current PR)

First tensor to call .set_tensor

ggml_backend_zdnn_buffer_set_tensor: tensor->name = zDNN#attn_norm-0#0 | tensor->buffer->usage = 2 | tensor->extra->ztensor.is_transformed = 0
ggml_backend_zdnn_buffer_set_tensor: tensor->name = zDNN#attn_norm-0#0 | tensor->buffer->usage = 2 | tensor->extra->ztensor.is_transformed = 1

Do let me know if this looks odd. I intend to fix the weight tensor problem in another PR; this PR mainly fixes the issues that have been preventing zDNN from inferencing correctly with the latest upstream code.

slaren (Member) commented Sep 9, 2025

That's expected: you cannot enable user-mapped buffers if you need to modify the tensor data.

taronaeo (Collaborator, Author) commented Sep 9, 2025

Got it. I will create a separate PR by tomorrow to fix it. Do let me know if I need to make any changes to this PR.

@@ -593,27 +603,6 @@ static ggml_guid_t ggml_backend_zdnn_guid(void) {
return reinterpret_cast<ggml_guid_t>((void *)guid_str);
}

// TODO: remove in the future
ggml_backend_t ggml_backend_zdnn_init(void) {
slaren (Member):

This function is still in the header.

taronaeo (Collaborator, Author):

Good catch. Fixed in latest push.

@taronaeo taronaeo requested a review from slaren September 11, 2025 15:24