base/science-tech-maths/machine-learning/algorithms/neural-nets/conv-neural-nets/diffusion-models/diffusion-models.md (+1)

base/science-tech-maths/machine-learning/algorithms/neural-nets/transformers/transformers.md (+36 −1)

In GPT2, there are two matrices called WTE (word token embedding) and WPE (word position embedding).
WPE is 1024×768, which means that the maximum number of tokens we can use in a prompt to GPT2 is 1024.

More information about the reasoning behind the positional encoding: <https://fleetwood.dev/posts/you-could-have-designed-SOTA-positional-encoding>
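
As a quick sanity check, here is a minimal sketch (assuming the Hugging Face `transformers` package with its PyTorch backend is installed) that loads GPT2 and prints the shapes of these two embedding matrices:

```python
# Minimal sketch: inspect GPT-2's token (WTE) and position (WPE) embedding matrices.
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")

print(model.wte.weight.shape)  # torch.Size([50257, 768]) -> vocab size x embedding dim
print(model.wpe.weight.shape)  # torch.Size([1024, 768])  -> max context length x embedding dim
```
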
### Transformer decoder

<img src="transformer-decoder.png" width="200">

## KV cache

Imagine you're writing a story, and for each new word you write, you need to re-read the entire story so far to maintain consistency. The longer your story gets, the more time you spend re-reading.

The key insight behind KV caching is that we're doing a lot of redundant work: when generating each new token, we recompute the keys and values of all the previous tokens that we've already processed.

For each token, we compute and store two things:

- A key ($k$): think of this as an addressing mechanism; it helps determine how relevant this token is to future tokens.
- A value ($v$): think of this as the actual information that gets used when this token is found to be relevant.

The KV cache is a cache of these key-value pairs, kept in every self-attention layer for all the tokens processed so far. It is used to speed up the inference process.

With the cache, generating a sequence of n tokens costs O(n²) attention work in total, a dramatic improvement over the O(n³) of recomputing everything from scratch at every step! We still have to do the fundamental work of looking at all previous tokens, but we avoid the costly recomputation at each step.
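
Here is a toy NumPy sketch of the mechanism (a single attention head with random weights standing in for a trained model; `d_model`, `W_q`, `W_k`, `W_v` and the decode loop are made up for illustration). At each step we only project the *new* token and append its key and value to the cache, then attend over everything cached:

```python
# Toy single-head decode loop with a KV cache (illustrative only, not a real model).
import numpy as np

rng = np.random.default_rng(0)
d_model = 16
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def attend(q, K, V):
    """Attention of one query against all cached keys/values."""
    scores = K @ q / np.sqrt(d_model)    # one score per past token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over past tokens
    return weights @ V                   # weighted sum of cached values

K_cache, V_cache = [], []                # the KV cache: one (k, v) per processed token
for step in range(5):                    # pretend we decode 5 tokens
    x = rng.standard_normal(d_model)     # embedding of the current token (stand-in)
    q, k, v = W_q @ x, W_k @ x, W_v @ x  # only the new token gets projected
    K_cache.append(k)                    # old keys/values are reused, never recomputed
    V_cache.append(v)
    out = attend(q, np.stack(K_cache), np.stack(V_cache))

print(len(K_cache), out.shape)           # 5 cached keys, (16,) output for the last token
```

Without the cache, every iteration would recompute `W_k @ x` and `W_v @ x` for all previous tokens as well, which is where the extra factor of n comes from.
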
Let's look at the memory cost of KV caching with a concrete example.

For a modern large language model like Llama3 70B with:

- $L=80$ layers
- $H=64$ attention heads
- $B=8$ batch size
- $d_k=128$ key/value dimension
- a factor of $2$ for storing both K and V
- 16-bit precision ($2$ bytes per value)

For a batch of $B=8$ sequences of $n=1000$ tokens each, the memory required would be:

$$L \times H \times B \times n \times d_k \times 2 \times 2 = 80 \times 64 \times 8 \times 1000 \times 128 \times 2 \times 2 \approx 21\ \text{GB}$$

where:

- $L \times H \times B \times n$ gives us the total number of key-value pairs
- $d_k$ is the dimension of each key/value vector
- the first $\times 2$ is for storing both keys and values
- the second $\times 2$ is for 16-bit precision (2 bytes per value)

This shows that while KV caching provides a significant speedup by avoiding redundant computation, it comes with substantial memory requirements that grow linearly with sequence length and batch size.
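
The same arithmetic as a tiny script, in case you want to plug in other model shapes (the variable names are just for this sketch):

```python
# Back-of-the-envelope KV cache size for the Llama3 70B example above.
L, H, B, n, d_k = 80, 64, 8, 1000, 128   # layers, heads, batch size, tokens, head dim
bytes_per_value = 2                      # 16-bit precision

kv_cache_bytes = L * H * B * n * d_k * 2 * bytes_per_value  # x2 for keys and values
print(f"{kv_cache_bytes / 1e9:.1f} GB")  # ~21.0 GB, growing linearly with n and B
```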