Question about TnT pixel embed implementation #662
alexander-soare asked this question in Q&A
Hi community! I'd love to get your thoughts on a paper <> code discrepancy I've picked up.
First of all, let's talk about ViT. There we do a patch embed by taking 16x16 non-overlapping windows, flattening them, and projecting to a target dimension (to match the transformer model dimension). We could actually do these steps explicitly, or be clever like the code and use a single `Conv2d`. Because `kernel_size == stride == patch_size`, we are effectively doing the same thing as the steps I mentioned.
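As a minimal sketch of the equivalence (illustrative shapes and names, not the actual timm source), the two paths give identical results if we reuse the conv's weights for the linear projection:

```python
import torch
import torch.nn as nn

patch_size, in_chans, embed_dim = 16, 3, 768
x = torch.randn(1, in_chans, 224, 224)

# The ViT-style trick: one strided conv does windowing + flatten + projection.
proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
out_conv = proj(x).flatten(2).transpose(1, 2)  # (1, 196, 768)

# The explicit steps: unfold into non-overlapping 16x16 windows, then project.
unfold = nn.Unfold(kernel_size=patch_size, stride=patch_size)
patches = unfold(x).transpose(1, 2)  # (1, 196, 3*16*16)
linear = nn.Linear(in_chans * patch_size ** 2, embed_dim)
# Reuse the conv's weights so both paths are numerically comparable.
linear.weight.data = proj.weight.data.view(embed_dim, -1)
linear.bias.data = proj.bias.data
out_linear = linear(patches)  # (1, 196, 768)

print(torch.allclose(out_conv, out_linear, atol=1e-5))  # True
```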
Now, as I understand from the TnT paper, "each patch is further transformed into the target size (p', p') with pixel unfold, and with a linear projection". The code looks like it utilises the same trick with `Conv2d`, but if you look carefully, you realise that `kernel_size != stride`, and now we use `padding`. So now our windows are overlapping.
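Here's a small sketch of what I mean (the constants, e.g. `kernel_size=7, padding=3, stride=4`, are my reading of the code, so treat them as illustrative). Both convs produce the same spatial grid, but the second one's receptive fields overlap:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)

# ViT-style: kernel_size == stride, no padding -> non-overlapping windows.
non_overlap = nn.Conv2d(3, 48, kernel_size=4, stride=4)

# TnT-style pixel embed as I read it: kernel_size != stride, plus padding,
# so each output position sees a 7x7 window while stepping by 4 pixels
# -> adjacent windows share 3 pixels.
overlap = nn.Conv2d(3, 48, kernel_size=7, padding=3, stride=4)

print(non_overlap(x).shape)  # torch.Size([1, 48, 56, 56])
print(overlap(x).shape)      # torch.Size([1, 48, 56, 56]), same grid, overlapping windows
```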
I couldn't find where in the paper they refer to this. Don't get me wrong, I like it because it adds more inductive bias into the mix, but I do want to understand the discrepancy.