RotAttentionPool2d Performance Discrepancy and Comparison with naver-ai/rope-vit #2528
Unanswered
ryan-minato asked this question in Q&A
Replies: 1 comment 1 reply
-
@ryan-minato I haven't looked too closely at the naver impl; there are often subtle differences in implementations of RoPE, though they're usually equivalent. It might be possible to port those ViTs to timm using an existing ViT as a base, or to make a new model if it's sufficiently different. The comment there was specific to the RoPE attention pool. I tried it once as a replacement for a standard attention pool with a ResNet model or similar, and it didn't generalize well to other resolutions. I think this might have been before I added resolution scaling support to RoPE though, it was some time ago. However, the RoPE embedding impl does work well in a ViT model. Most (all?) of the RoPE ViTs in timm are in the EVA ViT, as that was the first model to use RoPE, and I've based a number of other (non-EVA) models on it since, including the Meta Perception Encoder ViTs.
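As a rough illustration of the point about the EVA-based RoPE ViTs and resolution overrides, the sketch below constructs one through timm's public factory. The concrete model name is an assumption to verify against timm.list_models on your install, and pretrained=False keeps it runnable offline (use pretrained=True to actually probe how trained weights behave at a new resolution).

```python
import timm
import torch

# List the EVA-family ViTs, which carry timm's RoPE implementation.
# The 'eva02*' pattern (and the model name below) are assumptions; adjust to
# whatever timm.list_models() reports on your install.
print(timm.list_models('eva02*'))

# img_size can be overridden at creation time; with RoPE this changes the
# position grid the rotary angles are computed over.
model = timm.create_model(
    'eva02_base_patch14_224',  # assumed name, substitute one from the list above
    pretrained=False,          # set True to test trained weights at a new resolution
    img_size=336,
    num_classes=1000,
)
model.eval()
with torch.no_grad():
    out = model(torch.randn(1, 3, 336, 336))
print(out.shape)  # torch.Size([1, 1000])
```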
-
A note within the class documentation for RotAttentionPool2d, which was added approximately 4 years ago:
pytorch-image-models/timm/layers/attention_pool2d.py, lines 29 to 30 in 7101adb
This note suggests a significant performance degradation in downstream tasks at different resolutions when using RotAttentionPool2d. However, from my understanding, the implementation here appears to be similar to what is done in naver-ai/rope-vit.
According to naver-ai/rope-vit, Rotary Position Embeddings (RoPE) not only outperform Absolute Positional Embeddings (APE) in Vision Transformers (ViT) but also surpass Relative Position Biases (RPB) in Swin Transformers.
Or is my understanding incorrect, and are the implementations of RotAttentionPool2d here and the RoPE in naver-ai/rope-vit fundamentally different in a way that would explain this discrepancy? Any insights or clarification on this matter would be greatly appreciated. Thank you!
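For context on what both codebases share at a high level, below is a minimal, illustrative sketch of 2D axial RoPE applied to attention queries and keys: half of each head dimension is rotated by angles derived from the token's y coordinate, the other half from its x coordinate. The function names, shapes, and frequency schedule are assumptions for illustration only and are not taken from timm or naver-ai/rope-vit.

```python
import torch

def axial_rope_angles(grid_h, grid_w, dim, temperature=10000.0):
    """Per-position rotation angles for 2D axial RoPE, shape (grid_h*grid_w, dim//2)."""
    assert dim % 4 == 0, "head dim must be divisible by 4 for 2D axial RoPE"
    freqs = 1.0 / (temperature ** (torch.arange(dim // 4, dtype=torch.float32) / (dim // 4)))
    y, x = torch.meshgrid(
        torch.arange(grid_h, dtype=torch.float32),
        torch.arange(grid_w, dtype=torch.float32),
        indexing="ij",
    )
    ang_y = y.flatten()[:, None] * freqs[None, :]  # (N, dim//4) from row position
    ang_x = x.flatten()[:, None] * freqs[None, :]  # (N, dim//4) from column position
    return torch.cat([ang_y, ang_x], dim=-1)       # (N, dim//2)

def apply_rope(t, angles):
    """Rotate interleaved feature pairs of t (..., N, dim) by per-position angles."""
    cos, sin = angles.cos(), angles.sin()
    t_even, t_odd = t[..., 0::2], t[..., 1::2]
    rot_even = t_even * cos - t_odd * sin
    rot_odd = t_even * sin + t_odd * cos
    return torch.stack([rot_even, rot_odd], dim=-1).flatten(-2)

# Rotate q and k before the attention matmul; v is left untouched.
B, heads, H, W, head_dim = 2, 4, 14, 14, 64
q = torch.randn(B, heads, H * W, head_dim)
k = torch.randn(B, heads, H * W, head_dim)
angles = axial_rope_angles(H, W, head_dim)
q, k = apply_rope(q, angles), apply_rope(k, angles)
attn = (q @ k.transpose(-2, -1) * head_dim ** -0.5).softmax(dim=-1)
```

Because the relative rotation between two tokens depends only on their coordinate difference, the attention scores become translation-relative, which is the property the rope-vit work credits for the gains over APE and RPB. Differences between particular implementations tend to lie in details such as the frequency schedule, axial vs. mixed frequencies, and how the coordinate grid is rescaled at resolutions other than the training one.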