[Question] scale for pos_embed in Halo and Bottleneck attention #912
-
Just noticed a small difference. In the implementation I'm comparing against, the query is scaled before computing the relative position logits:

```python
query = self.to_q(q_inp)
query *= scale
position = relative_position_logits(query)
```

But here in timm's bottleneck_attn.py and halo_attn.py, I think only the content logits are scaled:

```python
query = self.q(x)
attention = (query @ key.transpose(-1, -2)) * self.scale
attention = attention + self.pos_embed(query)
```

I did some basic tests comparing the two. This behavior may not matter, the model may fit its own weights around either form. I'm just wondering if there is any background for this?
-
@leondgarse you are correct: per the botnet gist and, more importantly, the paper this form of relative position embedding was based on (https://arxiv.org/abs/1904.09925), I should have scaled q for the relative position logits as well. It was an oversight, but it has worked well and seems stable enough. I've thought about fixing it, or at least providing an option to apply the scale to both, but have yet to do that. Thanks for the comparison table. Perhaps I should at least add a comment.
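For reference, a minimal sketch of the two orderings being discussed; `rel_pos_logits` is an illustrative stand-in for the relative position module, not timm's actual API:

```python
import torch

def logits_scale_shared(q, k, rel_pos_logits, scale):
    # Paper-style ordering: q is scaled once, so both the content and
    # the relative position logits see the scaled query.
    q = q * scale
    return q @ k.transpose(-1, -2) + rel_pos_logits(q)

def logits_scale_content_only(q, k, rel_pos_logits, scale):
    # timm's original ordering: only the content logits are scaled;
    # the position logits are computed from the unscaled query.
    return (q @ k.transpose(-1, -2)) * scale + rel_pos_logits(q)

# Quick shape check with a dummy position module.
q = torch.randn(2, 16, 8)  # (batch, tokens, dim_head)
k = torch.randn(2, 16, 8)
rel = lambda q: torch.zeros(q.shape[0], q.shape[1], q.shape[1])
a = logits_scale_shared(q, k, rel, scale=8 ** -0.5)        # (2, 16, 16)
b = logits_scale_content_only(q, k, rel, scale=8 ** -0.5)  # (2, 16, 16)
```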
-
@leondgarse in 02daf2a I added a bool flag; I need to investigate further before making any recommendations or changing defaults.
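To make the flag concrete, here is a rough sketch of how a `scale_pos_embed` bool could gate the two behaviors inside an attention forward (single head, shapes simplified; an illustration under those assumptions, not the actual bottleneck_attn.py/halo_attn.py code):

```python
import torch
import torch.nn as nn

class SimpleAttn(nn.Module):
    """Minimal single-head self-attention showing where scale_pos_embed
    changes the computation. pos_embed is any callable mapping the query
    (B, N, dim) to relative position logits (B, N, N); illustrative only."""
    def __init__(self, dim, pos_embed, scale_pos_embed=False):
        super().__init__()
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.pos_embed = pos_embed
        self.scale_pos_embed = scale_pos_embed

    def forward(self, x):  # x: (B, N, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        if self.scale_pos_embed:
            # scale shared by content and position logits
            attn = (q @ k.transpose(-1, -2) + self.pos_embed(q)) * self.scale
        else:
            # original behavior: scale only the content logits
            attn = (q @ k.transpose(-1, -2)) * self.scale + self.pos_embed(q)
        return attn.softmax(dim=-1) @ v
```

Since the relative position logits are linear in q, scaling the summed logits as in the `True` branch is equivalent to scaling q before computing them, which is the ordering shown in the question.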
-
@leondgarse I've just finished two runs that compare the two: same h-params, seed, etc., just the scale_pos_embed flag toggled.

In one run, for a `haloregnetz_b` model, the end result was within run-to-run noise: 81.03 (scale_pos_embed=False) vs 81.04 (scale_pos_embed=True).

The next one was a re-run of `eca_botnext26ts_256`; here I see results closer to yours, where the original config (`False`) edges out `scale_pos_embed=True` by a small amount, 79.27 vs 79.13.

I will leave it at False and probably won't revisit anytime soon, as it seems at best slightly better, at worst the same.