RotAttentionPool2d Performance Discrepancy and Comparison with naver-ai/rope-vit #2528
Unanswered
ryan-minato asked this question in Q&A
Replies: 1 comment 1 reply
-
@ryan-minato I haven't looked too closely at the naver impl; there are often subtle differences in implementations of RoPE, though they're usually equivalent. It might be possible to port those ViTs to timm using an existing ViT as a base, or to make a new model if it's sufficiently different. The comment there was specific to the RoPE attention pool. I tried it once as a replacement for a standard attention pool with a ResNet model or similar, and it didn't generalize well to other resolutions. I think this might have been before I added resolution scaling support to RoPE though, it was some time ago. However, the RoPE embedding impl does work well in a ViT model. Most (all?) of the RoPE ViTs in timm are in the EVA ViT, as that was the first model to use RoPE, and I've based a number of other (non-EVA) models on it since, including the Meta Perception Encoder ViTs.
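As a rough illustration of the point about the EVA-based RoPE ViTs and resolution overrides, the sketch below constructs one through timm's public factory. The concrete model name is an assumption to verify against timm.list_models on your install, and pretrained=False keeps it runnable offline (use pretrained=True to actually probe how trained weights behave at a new resolution).

```python
import timm
import torch

# List the EVA-family ViTs, which carry timm's RoPE implementation.
# The 'eva02*' pattern (and the model name below) are assumptions; adjust to
# whatever timm.list_models() reports on your install.
print(timm.list_models('eva02*'))

# img_size can be overridden at creation time; with RoPE this changes the
# position grid the rotary angles are computed over.
model = timm.create_model(
    'eva02_base_patch14_224',  # assumed name, substitute one from the list above
    pretrained=False,          # set True to test trained weights at a new resolution
    img_size=336,
    num_classes=1000,
)
model.eval()
with torch.no_grad():
    out = model(torch.randn(1, 3, 336, 336))
print(out.shape)  # torch.Size([1, 1000])
```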
-
A note within the class documentation for RotAttentionPool2d, which was added approximately 4 years ago:
pytorch-image-models/timm/layers/attention_pool2d.py, lines 29 to 30 in 7101adb
This note suggests a significant performance degradation in downstream tasks at different resolutions when using RotAttentionPool2d. However, from my understanding, the implementation here appears to be similar to what is done in naver-ai/rope-vit.
According to naver-ai/rope-vit, Rotary Position Embeddings (RoPE) not only outperform Absolute Positional Embeddings (APE) in Vision Transformers (ViT) but also surpass Relative Position Biases (RPB) in Swin Transformers.
Or is my understanding incorrect, and are the implementations of RotAttentionPool2d here and the RoPE in naver-ai/rope-vit fundamentally different in a way that would explain this discrepancy? Any insights or clarification on this matter would be greatly appreciated. Thank you!
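For context on what both codebases share at a high level, below is a minimal, illustrative sketch of 2D axial RoPE applied to attention queries and keys: half of each head dimension is rotated by angles derived from the token's y coordinate, the other half from its x coordinate. The function names, shapes, and frequency schedule are assumptions for illustration only and are not taken from timm or naver-ai/rope-vit.

```python
import torch

def axial_rope_angles(grid_h, grid_w, dim, temperature=10000.0):
    """Per-position rotation angles for 2D axial RoPE, shape (grid_h*grid_w, dim//2)."""
    assert dim % 4 == 0, "head dim must be divisible by 4 for 2D axial RoPE"
    freqs = 1.0 / (temperature ** (torch.arange(dim // 4, dtype=torch.float32) / (dim // 4)))
    y, x = torch.meshgrid(
        torch.arange(grid_h, dtype=torch.float32),
        torch.arange(grid_w, dtype=torch.float32),
        indexing="ij",
    )
    ang_y = y.flatten()[:, None] * freqs[None, :]  # (N, dim//4) from row position
    ang_x = x.flatten()[:, None] * freqs[None, :]  # (N, dim//4) from column position
    return torch.cat([ang_y, ang_x], dim=-1)       # (N, dim//2)

def apply_rope(t, angles):
    """Rotate interleaved feature pairs of t (..., N, dim) by per-position angles."""
    cos, sin = angles.cos(), angles.sin()
    t_even, t_odd = t[..., 0::2], t[..., 1::2]
    rot_even = t_even * cos - t_odd * sin
    rot_odd = t_even * sin + t_odd * cos
    return torch.stack([rot_even, rot_odd], dim=-1).flatten(-2)

# Rotate q and k before the attention matmul; v is left untouched.
B, heads, H, W, head_dim = 2, 4, 14, 14, 64
q = torch.randn(B, heads, H * W, head_dim)
k = torch.randn(B, heads, H * W, head_dim)
angles = axial_rope_angles(H, W, head_dim)
q, k = apply_rope(q, angles), apply_rope(k, angles)
attn = (q @ k.transpose(-2, -1) * head_dim ** -0.5).softmax(dim=-1)
```

Because the relative rotation between two tokens depends only on their coordinate difference, the attention scores become translation-relative, which is the property the rope-vit work credits for the gains over APE and RPB. Differences between particular implementations tend to lie in details such as the frequency schedule, axial vs. mixed frequencies, and how the coordinate grid is rescaled at resolutions other than the training one.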