Fix kv_channels bug in TransformerLayer and add Gemma tutorial #731
Conversation
Signed-off-by: Pawel Gadzinski <[email protected]>
I was thinking about adding two parameters. Nevertheless, FlashAttention assumes that these numbers are equal, so I gave up on this idea.
Hmm, can we get the same effect without changing the API? That would be a breaking change and we would like to avoid that if possible.
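For context on the FlashAttention constraint mentioned above, here is a minimal sketch in plain PyTorch with made-up shapes (not code from this PR): the attention score computation multiplies the query with the transposed key over the per-head dimension, so the per-head sizes of q and k/v must match.

```python
import torch

# Hypothetical shapes: batch=2, heads=4, sequence=8, head_dim=128
q = torch.randn(2, 4, 8, 128)   # query heads
k = torch.randn(2, 4, 8, 128)   # key heads must use the same head_dim as q
v = torch.randn(2, 4, 8, 128)

# Attention scores: (b, h, s, d) @ (b, h, d, s) -> (b, h, s, s)
scores = q @ k.transpose(-2, -1)
out = torch.softmax(scores / q.shape[-1] ** 0.5, dim=-1) @ v
print(out.shape)  # torch.Size([2, 4, 8, 128])

# If q and k had different per-head sizes, the matmul above would raise a shape
# error, which is why a separate per-head size for q vs. k/v is not an option.
```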
```diff
         ])
         if self.input_layernorm:
             self.layernorm_qkv = LayerNormLinear(
                 hidden_size,
-                hidden_size + 2 * self.hidden_size_kv,
+                3 * attention_hidden_size,
```
Suggested change:

```diff
-                3 * attention_hidden_size,
+                self.hidden_size_per_attention_head * num_attention_heads + 2 * self.hidden_size_kv
```
@pggPL, we'd still need the separate hidden size (total) for `k` and `v` heads to handle group query attention.
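For illustration only (the numbers below are hypothetical and not taken from this PR), a quick check of why a single `3 * attention_hidden_size` output width is wrong once there are fewer k/v heads than query heads:

```python
# Hypothetical GQA configuration
num_attention_heads = 16
num_gqa_groups = 4          # fewer k/v heads than query heads
kv_channels = 256           # per-head size

attention_hidden_size = kv_channels * num_attention_heads   # 4096 (query projection width)
hidden_size_kv = kv_channels * num_gqa_groups               # 1024 (k and v projection width each)

qkv_out_features = attention_hidden_size + 2 * hidden_size_kv  # 6144
naive_out_features = 3 * attention_hidden_size                 # 12288

# The fused QKV projection must use the former; the naive size over-allocates
# whenever num_gqa_groups < num_attention_heads.
print(qkv_out_features, naive_out_features)
```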
Here's what we could do alternatively:

```python
# We still keep this line
self.hidden_size_per_attention_head = kv_channels

# Use `kv_channels` to calculate the `attention_hidden_size`
self.attention_hidden_size = self.hidden_size_per_attention_head * num_attention_heads

# Use `self.attention_hidden_size` to calculate `hidden_size_kv` (which is used in case of GQA)
self.hidden_size_kv = int(self.attention_hidden_size * self.num_gqa_groups // num_attention_heads)

...

LayerNormLinear(
    hidden_size,
    self.attention_hidden_size + 2 * self.hidden_size_kv,
    ...
)
```
We could probably also rename `attention_hidden_size` to `hidden_size_q`, because essentially that's what it's used for in the layer/weight creation (in the call to `LayerNormLinear`).
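A worked example of the calculation above, plugging in Gemma-like numbers; the 3072 → 4096 projection comes from the PR description later in this thread, while the head count, per-head size, and `num_gqa_groups` used here are assumptions for illustration:

```python
# Gemma-like numbers: hidden size 3072, attention size 4096
hidden_size = 3072
num_attention_heads = 16
kv_channels = 256                        # assumed per-head size

hidden_size_per_attention_head = kv_channels
attention_hidden_size = hidden_size_per_attention_head * num_attention_heads   # 4096

# With as many GQA groups as heads (plain multi-head attention),
# the k/v width equals the q width.
num_gqa_groups = num_attention_heads
hidden_size_kv = int(attention_hidden_size * num_gqa_groups // num_attention_heads)  # 4096

# Fused QKV projection: 3072 -> 4096 + 2 * 4096 = 12288 output features
qkv_in_features = hidden_size
qkv_out_features = attention_hidden_size + 2 * hidden_size_kv
print(qkv_in_features, qkv_out_features)  # 3072 12288
```

With these formulas, `kv_channels` determines the q/k/v widths independently of `hidden_size`, which is what a hidden dimension that differs from the attention dimension requires.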
I see the point, but I'm not sure whether using `kv_channels` is the best option. Currently the `kv_channels` argument does exactly nothing if it is set to the correct value, and if it is set to an incorrect value, an error is raised. Changing the behavior of this argument is an option, but it would no longer match the name, since we would also be changing the query channels.

I propose adding an optional argument `attn_hidden_size`; legacy code will work correctly without it. We can leave `kv_channels` in the code, but for example only print a "this argument is deprecated" warning, or even simply do nothing. I believe (maybe wrongly) that this will not lead to problems with legacy code, and the argument names will be more accurate. Let me know what you think.
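A minimal sketch, not code from this PR, of how the backward-compatible argument handling described above could look; the class name and defaults here are hypothetical:

```python
import warnings


class TransformerLayerLike:
    """Hypothetical constructor accepting both the old and the proposed argument."""

    def __init__(self, hidden_size, num_attention_heads,
                 kv_channels=None, attn_hidden_size=None):
        if kv_channels is not None:
            # Old argument still accepted, but flagged as deprecated.
            warnings.warn(
                "`kv_channels` is deprecated; use `attn_hidden_size` instead.",
                DeprecationWarning,
            )
        if attn_hidden_size is None:
            # Legacy path: derive the attention width the old way.
            head_dim = kv_channels if kv_channels is not None else hidden_size // num_attention_heads
            attn_hidden_size = head_dim * num_attention_heads
        self.attn_hidden_size = attn_hidden_size
```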
Signed-off-by: Pawel Gadzinski <[email protected]>
I'm working on a tutorial similar to the Llama tutorial, but with the Gemma model. Most parts are similar, but I have encountered a few differences:

1. The official weights are in the Safetensors format, not torch checkpoints as in the case of Llama. I modified the loading function (a sketch is included at the end of this comment).
2. There is a `geglu` activation function instead of `swiglu`, but this is also supported by TE. It was one simple change in the config.
3. Gemma's hidden dimension is different than its attention dimension: the attention projections change the dimension from hidden (3072) to key/query/value (4096). There is a `kv_channels` parameter of the TransformerLayer in TE which seems to be responsible for exactly that. Nevertheless, changing it to a value different than `hidden_dim / num_heads` causes an AssertionError and does not change the number of parameters.

I propose changing the `kv_channels` argument to `attention_hidden_dims`. I changed the description and the number of neurons in the attention projection layers in the `transformer.py` and `attention.py` files.

The results of Gemma with bf16, bf16 + TE, and fp8 + TE are proportional to the results of Llama.
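Regarding point 1 above, here is a minimal sketch of loading sharded Safetensors weights into a single state dict; the checkpoint directory is a placeholder and mapping the weights onto a TE model is not shown:

```python
import glob
import os

from safetensors.torch import load_file


def load_gemma_state_dict(checkpoint_dir):
    """Merge all *.safetensors shards in a checkpoint directory into one state dict."""
    state_dict = {}
    for shard_path in sorted(glob.glob(os.path.join(checkpoint_dir, "*.safetensors"))):
        # load_file returns a dict mapping tensor names to torch.Tensor for one shard
        state_dict.update(load_file(shard_path))
    return state_dict


# Usage (path is a placeholder):
# weights = load_gemma_state_dict("/path/to/gemma-checkpoint")
```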