
Commit db43f8b

Update notes
1 parent 4410e91 commit db43f8b


5 files changed (+45, -1 lines)

base/science-tech-maths/image-processing/resampling/resampling.md

+1
@@ -6,6 +6,7 @@ Resampling is the process of changing the resolution of an image. This is done b
 
 Be careful with `align_corners` parameter in interpolation algorithms (should be set to `true` in most cases):
 
+- <https://bartwronski.com/2021/02/15/bilinear-down-upsampling-pixel-grids-and-that-half-pixel-offset/>
 - <https://leimao.github.io/article/Interpolation/>
 - <https://hackernoon.com/how-tensorflows-tf-image-resize-stole-60-days-of-my-life-aba5eb093f35>
 - <https://bartwronski.com/2021/02/15/bilinear-down-upsampling-pixel-grids-and-that-half-pixel-offset/>
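
For reference, here is a minimal sketch (not part of the commit) of how the `align_corners` flag changes the sampling grid, assuming PyTorch's `torch.nn.functional.interpolate`:

```python
# Hypothetical illustration of the align_corners note above (assumes torch is installed).
import torch
import torch.nn.functional as F

x = torch.arange(4, dtype=torch.float32).reshape(1, 1, 1, 4)  # a 1x4 "image" with values 0..3

# align_corners=True maps the corner pixels of input and output onto each other,
# so the endpoint values are preserved exactly.
up_true = F.interpolate(x, size=(1, 8), mode="bilinear", align_corners=True)

# align_corners=False aligns pixel centers instead, which introduces the
# half-pixel offset discussed in the links above.
up_false = F.interpolate(x, size=(1, 8), mode="bilinear", align_corners=False)

print(up_true.flatten())
print(up_false.flatten())
```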

base/science-tech-maths/machine-learning/algorithms/neural-nets/conv-neural-nets/diffusion-models/diffusion-models.md

+1
@@ -32,3 +32,4 @@ What conditions do diffusion model architectures need to fulfill?
 - <https://www.chenyang.co/diffusion.html>
 - <https://andrewkchan.dev/posts/diffusion.html>
 - <https://sander.ai/2024/06/14/noise-schedules.html>
+- <https://baincapitalventures.notion.site/Diffusion-Without-Tears-14e1469584c180deb0a9ed9aa6ff7a4c>

base/science-tech-maths/machine-learning/algorithms/neural-nets/transformers/transformers.md

+36-1
@@ -461,6 +461,8 @@ If the embedding space consists of more than two dimensions (which it almost alw
 In GPT2, there are two matrices called WTE (word token embedding) and WPE (word position embedding).
 WPE is 1024×768. It means that the maximum number of tokens that we can use in a prompt to GPT2 is 1024.
 
+More information about the reasoning behind the positional encoding: <https://fleetwood.dev/posts/you-could-have-designed-SOTA-positional-encoding>
+
 ### Transformer decoder
 
 <img src="transformer-decoder.png" width="200">
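
As a quick sanity check on the WTE/WPE shapes mentioned in the hunk above, here is a minimal sketch (not part of the commit), assuming the Hugging Face `transformers` package is installed:

```python
# Hypothetical illustration: inspect GPT-2's token and position embedding matrices.
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")
print(model.wte.weight.shape)  # expected: torch.Size([50257, 768]) -> vocab size x hidden dim
print(model.wpe.weight.shape)  # expected: torch.Size([1024, 768])  -> max positions (context length) x hidden dim
```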
@@ -569,9 +571,42 @@ This pairwise communication means a forward pass is O(n²) time complexity in tr
 
 ## KV cache
 
+Imagine you're writing a story, and for each new word you write, you need to re-read the entire story so far to maintain consistency. The longer your story gets, the more time you spend re-reading.
+
+The key insight behind KV caching is that we're doing a lot of redundant work: when generating each new token, we recompute things for all previous tokens that we've already processed.
+
+For each token, we compute and store two things:
+
+- A key ($k$): think of this as an addressing mechanism; it helps determine how relevant this token is to future tokens.
+- A value ($v$): think of this as the actual information that gets used when this token is found to be relevant.
+
 The KV cache is a cache of the key-value pairs of the encoder output. It is used to speed up the inference process.
 
-storing this KV cache requires O(n) space.
+This is a dramatic improvement over $O(n^3)$! While we still have to do the fundamental work of looking at all previous tokens ($O(n^2)$ over the whole sequence), we avoid the costly recomputation at each step.
+
+Let's look at the memory cost of KV caching with a concrete example.
+
+For a modern large language model like Llama3 70B with:
+
+- $L=80$ layers
+- $H=64$ attention heads
+- $B=8$ batch size
+- $d_k=128$ key/value dimension
+- $2$ matrices (K and V)
+- 16-bit precision (2 bytes per value)
+
+For a batch of 8 sequences of 1000 tokens each ($n=1000$), the memory required would be:
+
+$L \times H \times B \times n \times d_k \times 2 \times 2$ bytes $= 80 \times 64 \times 8 \times 1000 \times 128 \times 2 \times 2$ bytes $\approx 20.97$ GB
+
+Where:
+
+- $L \times H \times B \times n$ gives us the total number of key-value pairs
+- $d_k$ is the dimension of each key/value vector
+- the first $\times 2$ is for storing both keys and values
+- the second $\times 2$ is for 16-bit precision (2 bytes per value)
+
+This shows that while KV caching provides a significant speedup by avoiding redundant computation, it comes with substantial memory requirements that grow linearly with sequence length and batch size.
 
 ![](./KVCache.jpeg)
 
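
To double-check the arithmetic in the memory example above, here is a small sketch (not part of the commit) that simply evaluates the same formula in Python:

```python
# Hypothetical helper mirroring the formula L x H x B x n x d_k x 2 x 2 bytes from the note above.
def kv_cache_bytes(layers, heads, batch, seq_len, d_k, bytes_per_value=2):
    # first factor of 2: keys and values are stored separately
    # bytes_per_value=2: 16-bit precision
    return layers * heads * batch * seq_len * d_k * 2 * bytes_per_value

total = kv_cache_bytes(layers=80, heads=64, batch=8, seq_len=1000, d_k=128)
print(f"{total / 1e9:.2f} GB")  # -> 20.97 GB, matching the Llama3-70B example
```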

base/science-tech-maths/programming/algorithms/data-structures/data-structures.md

+6
@@ -18,6 +18,12 @@ Binary tree != binary search tree!
 
 ![](./DFS-BFS.jpg)
 
+### Find shortest path in a graph
+
+Use BFS; it will find the shortest path in an unweighted graph.
+
+If the graph is weighted, use Dijkstra's algorithm: essentially BFS on steroids, replacing the FIFO queue with a priority queue so it can handle weighted edges.
+
 ## LRU Cache
 
 ![](./lru-cache.png)
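
To make the BFS note in the hunk above concrete, here is a minimal sketch (not part of the commit) of shortest-path BFS over an adjacency-list graph; the graph and node names are made up for illustration:

```python
# Hypothetical illustration: BFS shortest path in an unweighted graph.
from collections import deque

def bfs_shortest_path(graph, start, goal):
    # queue holds (node, path-so-far); visited prevents revisiting nodes
    queue = deque([(start, [start])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [neighbor]))
    return None  # no path exists

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs_shortest_path(graph, "A", "D"))  # ['A', 'B', 'D']
```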

base/science-tech-maths/programming/high-performance-computing/hpc.md

+1
@@ -29,3 +29,4 @@ All writes to memory go through the data cache3. When a write is made, the cache
 - <https://theartofhpc.com/>
 - <https://thechipletter.substack.com/p/demystifying-gpu-compute-architectures>
 - <https://blog.codingconfessions.com/p/gpu-computing>
+- <https://www.pyspur.dev/blog/introduction_cuda_programming>
