is paged attention an exact attention #13902
Replies: 3 comments 1 reply
-
Yes, it is exact attention. Take the prefill stage: say you have a prompt of length 10. Then you need to compute the self-attention scores a11, a21, a22, a31, a32, a33, ..., a10,10, which is (1+10)*10/2 = 55 combinations. For i = 6, you need to compute a61, a62, a63, a64, a65, a66.

Without paged attention, you can simply do this in a single block. In the equation above, B is the block size, which is 3 in this example, and j is the block index over all the blocks needed so far; block j = 1 stores tokens 1, 2, 3. So without paged attention you compute (a61, a62, a63, a64, a65, a66) directly. With paged attention you compute the same scores one block at a time (a61, a62, a63 from block 1 and a64, a65, a66 from block 2), so nothing is approximated.
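Here is a minimal NumPy sketch of that claim (not vLLM's actual kernel; the block size B, head dim d, and the random q/K/V are my own illustrative choices). It compares a plain softmax over the first i keys with the same computation done one KV block at a time, using a denominator accumulated over all blocks:

```python
# Minimal sketch: blockwise (paged-style) attention vs. ordinary attention.
# All names and sizes here are illustrative assumptions, not vLLM code.
import numpy as np

np.random.seed(0)
d, i, B = 8, 6, 3                      # head dim, query position, block size
q = np.random.randn(d)                 # query for token i
K = np.random.randn(i, d)              # keys for tokens 1..i
V = np.random.randn(i, d)              # values for tokens 1..i

# 1) Ordinary attention over tokens 1..i in one shot.
scores = K @ q / np.sqrt(d)            # (a_i1, ..., a_ii) before softmax
attn = np.exp(scores) / np.exp(scores).sum()
out_full = attn @ V

# 2) Blockwise: same scores, computed one KV block at a time,
#    with the softmax denominator summed over *all* blocks up to i.
num_blocks = (i + B - 1) // B
exp_blocks, denom = [], 0.0
for j in range(num_blocks):
    Kj = K[j * B:(j + 1) * B]          # block j holds B consecutive tokens
    ej = np.exp(Kj @ q / np.sqrt(d))   # elementwise exp of this block's scores
    exp_blocks.append(ej)
    denom += ej.sum()                  # accumulate the global denominator

out_paged = sum(
    (exp_blocks[j] / denom) @ V[j * B:(j + 1) * B] for j in range(num_blocks)
)

print(np.allclose(out_full, out_paged))  # True: the blockwise result is exact
```

(Real kernels also subtract the running max inside exp for numerical stability; that detail is omitted here for brevity.)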
-
OK, I got it. The denominator summation index runs from the beginning, over all the blocks up to index i. Previously I somehow read it as running over only the 'current' block.
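Written out for the i = 6, B = 3 example from above (the symbols here are my own restatement, not a quote from the paper), that reading of the denominator is:

```latex
% Denominator for query i = 6 with block size B = 3: it sums the
% exponentiated scores of ALL blocks up to ceil(i/B), not only block j.
\sum_{t=1}^{\lceil i/B \rceil} \exp\!\left(q_i^{\top} K_t / \sqrt{d}\right)\mathbf{1}
  \;=\;
  \underbrace{\sum_{m=1}^{3} \exp\!\left(q_6^{\top} k_m / \sqrt{d}\right)}_{\text{block } t=1}
  \;+\;
  \underbrace{\sum_{m=4}^{6} \exp\!\left(q_6^{\top} k_m / \sqrt{d}\right)}_{\text{block } t=2}
```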
-
exp(q^T K 1) = exp(q^T K_1 + q^T K_2 + ...), whereas exp(q^T K) · 1 = exp(q^T K_1) + exp(q^T K_2) + ...; it appears that the latter is the correct calculation method.
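A quick numeric check of the two readings (purely illustrative values; the columns of K are treated here as the individual keys k_1, k_2, k_3):

```python
# Compare "sum the scores, then exp" with "exp each score, then sum".
# Values are made up just to show the two readings give different numbers.
import numpy as np

q = np.array([0.5, -1.0, 2.0])
K = np.array([[1.0, 0.0, 0.5],          # k_1
              [0.2, -0.3, 1.0],         # k_2
              [-1.0, 0.7, 0.1]]).T      # k_3; columns of K are the keys

sum_then_exp = np.exp((q @ K).sum())    # exp(q^T K 1) = exp(q^T k_1 + q^T k_2 + ...)
exp_then_sum = np.exp(q @ K).sum()      # exp(q^T K) . 1 = exp(q^T k_1) + exp(q^T k_2) + ...

print(sum_then_exp, exp_then_sum)       # different numbers; only the second
                                        # matches the softmax denominator
```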
-
For the blockwise computation of paged attention as in the above equation, it does not look like exact attention. In exact attention, all the attention scores sum up to 1, but here the attention scores in each block sum up to 1. So is it correct to say that paged attention is not exact attention?