
Commit 745cbcf

llama-quant : fix the verification of attention layers for encoder-decoder models (#16023)
Signed-off-by: Jie Fu <[email protected]>
1 parent 1cbd80f commit 745cbcf


1 file changed: +3 −1 lines changed

src/llama-quant.cpp

Lines changed: 3 additions & 1 deletion
@@ -725,7 +725,9 @@ static void llama_model_quantize_impl(const std::string & fname_inp, const std::
         // attention layers have a non-zero number of kv heads
         int32_t n_attn_layer = model.hparams.n_layer - std::count(n_head_kv_iter, n_head_kv_iter + model.hparams.n_layer, 0);
         if (llama_model_has_encoder(&model)) {
-            n_attn_layer *= 3;
+            // now n_attn_layer is the number of attention layers in the encoder
+            // for each decoder block, there are 2 attention layers
+            n_attn_layer += 2 * model.hparams.dec_n_layer;
         }
         GGML_ASSERT((qs.n_attention_wv == n_attn_layer - pruned_attention_w) && "n_attention_wv is unexpected");
     }
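Reasoning behind the fix, as it can be read from the diff: the old check multiplied the counted attention-layer total by 3 (encoder self-attention, decoder self-attention, decoder cross-attention), which only matches the real tensor count when the decoder has exactly as many blocks as the encoder. The new check keeps the encoder count derived from the kv-head array and adds 2 attention layers per decoder block. Below is a minimal, self-contained sketch of that arithmetic; the layer counts (12 encoder blocks, 6 decoder blocks) are hypothetical and the standalone program is illustrative only, not code from llama.cpp.

#include <cassert>
#include <cstdint>

// Illustrative sketch of the attention-layer counting before and after the fix.
// The values below are assumptions for demonstration, not real model hparams.
int main() {
    const int32_t enc_n_layer = 12; // encoder blocks, each with one self-attention layer
    const int32_t dec_n_layer = 6;  // decoder blocks, each with self- and cross-attention

    // Old formula: every counted (encoder) layer was assumed to expand to
    // 3 attention layers in total.
    const int32_t old_count = enc_n_layer * 3;               // 36

    // New formula: encoder layers as counted, plus 2 per decoder block.
    const int32_t new_count = enc_n_layer + 2 * dec_n_layer; // 24

    // The two only agree when dec_n_layer == enc_n_layer; here they differ,
    // so the old assertion would have failed for this asymmetric model.
    assert(old_count != new_count);
    return 0;
}

With equal encoder and decoder depths the two formulas coincide (e.g. 12 * 3 == 12 + 2 * 12), which is presumably why the old assertion held for symmetric encoder-decoder models but broke once dec_n_layer differs from the encoder depth.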

0 commit comments
