Fix kv_channels bug in TransformerLayer and add Gemma tutorial #731
Conversation
Signed-off-by: Pawel Gadzinski <[email protected]>
I was thinking about adding two parameters. Nevertheless, FlashAttention assumes that these numbers are equal, so I gave up on this idea.
Hmm, can we get the same effect without changing the API? That would be a breaking change and we would like to avoid that if possible.
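For context on the FlashAttention constraint mentioned above, here is a minimal sketch in plain PyTorch with made-up shapes (not code from this PR): the attention score computation multiplies the query with the transposed key over the per-head dimension, so the per-head sizes of q and k/v must match.

```python
import torch

# Hypothetical shapes: batch=2, heads=4, sequence=8, head_dim=128
q = torch.randn(2, 4, 8, 128)   # query heads
k = torch.randn(2, 4, 8, 128)   # key heads must use the same head_dim as q
v = torch.randn(2, 4, 8, 128)

# Attention scores: (b, h, s, d) @ (b, h, d, s) -> (b, h, s, s)
scores = q @ k.transpose(-2, -1)
out = torch.softmax(scores / q.shape[-1] ** 0.5, dim=-1) @ v
print(out.shape)  # torch.Size([2, 4, 8, 128])

# If q and k had different per-head sizes, the matmul above would raise a shape
# error, which is why a separate per-head size for q vs. k/v is not an option.
```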
```diff
         ])
         if self.input_layernorm:
             self.layernorm_qkv = LayerNormLinear(
                 hidden_size,
-                hidden_size + 2 * self.hidden_size_kv,
+                3 * attention_hidden_size,
```
Suggested change:

```diff
-                3 * attention_hidden_size,
+                self.hidden_size_per_attention_head * num_attention_heads + 2 * self.hidden_size_kv
```
@pggPL, we'd still need the separate hidden size (total) for `k` and `v` heads to handle group query attention.
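For illustration only (the numbers below are hypothetical and not taken from this PR), a quick check of why a single `3 * attention_hidden_size` output width is wrong once there are fewer k/v heads than query heads:

```python
# Hypothetical GQA configuration
num_attention_heads = 16
num_gqa_groups = 4          # fewer k/v heads than query heads
kv_channels = 256           # per-head size

attention_hidden_size = kv_channels * num_attention_heads   # 4096 (query projection width)
hidden_size_kv = kv_channels * num_gqa_groups               # 1024 (k and v projection width each)

qkv_out_features = attention_hidden_size + 2 * hidden_size_kv  # 6144
naive_out_features = 3 * attention_hidden_size                 # 12288

# The fused QKV projection must use the former; the naive size over-allocates
# whenever num_gqa_groups < num_attention_heads.
print(qkv_out_features, naive_out_features)
```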
Here's what we could do alternatively:

```python
# We still keep this line
self.hidden_size_per_attention_head = kv_channels

# Use `kv_channels` to calculate the `attention_hidden_size`
self.attention_hidden_size = self.hidden_size_per_attention_head * num_attention_heads

# Use `self.attention_hidden_size` to calculate `hidden_size_kv` (which is used in case of GQA)
self.hidden_size_kv = int(self.attention_hidden_size * self.num_gqa_groups // num_attention_heads)

...

LayerNormLinear(
    hidden_size,
    self.attention_hidden_size + 2 * self.hidden_size_kv,
    ...
)
```
We could probably also rename `attention_hidden_size` to `hidden_size_q`, because essentially that's what it's used for in the layer/weight creation (in the call to `LayerNormLinear`).
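A worked example of the calculation above, plugging in Gemma-like numbers; the 3072 → 4096 projection comes from the PR description later in this thread, while the head count, per-head size, and `num_gqa_groups` used here are assumptions for illustration:

```python
# Gemma-like numbers: hidden size 3072, attention size 4096
hidden_size = 3072
num_attention_heads = 16
kv_channels = 256                        # assumed per-head size

hidden_size_per_attention_head = kv_channels
attention_hidden_size = hidden_size_per_attention_head * num_attention_heads   # 4096

# With as many GQA groups as heads (plain multi-head attention),
# the k/v width equals the q width.
num_gqa_groups = num_attention_heads
hidden_size_kv = int(attention_hidden_size * num_gqa_groups // num_attention_heads)  # 4096

# Fused QKV projection: 3072 -> 4096 + 2 * 4096 = 12288 output features
qkv_in_features = hidden_size
qkv_out_features = attention_hidden_size + 2 * hidden_size_kv
print(qkv_in_features, qkv_out_features)  # 3072 12288
```

With these formulas, `kv_channels` determines the q/k/v widths independently of `hidden_size`, which is what a hidden dimension that differs from the attention dimension requires.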
I see the point, but I'm not sure whether using `kv_channels` is the best option. Currently the `kv_channels` argument does exactly nothing if it is set to the correct value, and if it is set to an incorrect value, an error is raised. Changing the behavior of this argument is an option, but it would no longer match the name, since we would also be changing the query channels.

I propose adding an optional argument `attn_hidden_size`; legacy code will work correctly without it. We can leave `kv_channels` in the code, but for example only print a "this argument is deprecated" warning, or even simply do nothing. I believe (maybe wrongly) that this will not lead to problems with legacy code, and the argument names will be more accurate. Let me know what you think.
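A minimal sketch, not code from this PR, of how the backward-compatible argument handling described above could look; the class name and defaults here are hypothetical:

```python
import warnings


class TransformerLayerLike:
    """Hypothetical constructor accepting both the old and the proposed argument."""

    def __init__(self, hidden_size, num_attention_heads,
                 kv_channels=None, attn_hidden_size=None):
        if kv_channels is not None:
            # Old argument still accepted, but flagged as deprecated.
            warnings.warn(
                "`kv_channels` is deprecated; use `attn_hidden_size` instead.",
                DeprecationWarning,
            )
        if attn_hidden_size is None:
            # Legacy path: derive the attention width the old way.
            head_dim = kv_channels if kv_channels is not None else hidden_size // num_attention_heads
            attn_hidden_size = head_dim * num_attention_heads
        self.attn_hidden_size = attn_hidden_size
```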
Signed-off-by: Pawel Gadzinski <[email protected]>
I'm working on a tutorial similar to the Llama tutorial, but with the Gemma model. Most parts are similar, but I have encountered a few differences:

1. The official weights are in the Safetensors format, not torch checkpoints as in the case of Llama. I modified the loading function (a sketch is included at the end of this comment).
2. There is a `geglu` activation function instead of `swiglu`, but this is also supported by TE. It was one simple change in the config.
3. Gemma's hidden dimension is different than its attention dimension: the attention projections change the dimension from hidden (3072) to key/query/value (4096). There is a `kv_channels` parameter of the TransformerLayer in TE which seems to be responsible for exactly that. Nevertheless, changing it to a value different than `hidden_dim / num_heads` causes an AssertionError and does not change the number of parameters.

I propose changing the `kv_channels` argument to `attention_hidden_dims`. I changed the description and the number of neurons in the attention projection layers in the `transformer.py` and `attention.py` files.

The results of Gemma with bf16, bf16 + TE, and fp8 + TE are proportional to the results of Llama.
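Regarding point 1 above, here is a minimal sketch of loading sharded Safetensors weights into a single state dict; the checkpoint directory is a placeholder and mapping the weights onto a TE model is not shown:

```python
import glob
import os

from safetensors.torch import load_file


def load_gemma_state_dict(checkpoint_dir):
    """Merge all *.safetensors shards in a checkpoint directory into one state dict."""
    state_dict = {}
    for shard_path in sorted(glob.glob(os.path.join(checkpoint_dir, "*.safetensors"))):
        # load_file returns a dict mapping tensor names to torch.Tensor for one shard
        state_dict.update(load_file(shard_path))
    return state_dict


# Usage (path is a placeholder):
# weights = load_gemma_state_dict("/path/to/gemma-checkpoint")
```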