upd

yzh119 · yzh119 · commit ab6a58d33392 · 2025-03-10T14:26:52.000-07:00
diff --git a/_posts/2025-03-10-sampling.md b/_posts/2025-03-10-sampling.md
@@ -106,7 +106,7 @@ Implementation side, the 2. and 3. parts are orchestrated for better parallelism
 2. If not, we add $\texttt{a\_local}$ to $\texttt{a}$ and move on to the next block.
 3. Once we know the correct block, we perform a prefix sum over its tokens to pinpoint the exact token index.
 
-The per-block partial sum rrand prefix sums are computed leveraging CUB collective primitives like `BlockReduce` and `BlockScan` to maximize  efficiency.
+The per-block partial sum and prefix sums are computed leveraging [CUB collective primitives](https://docs.nvidia.com/cuda/cub/index.html) (now part of [CCCL](https://github.com/NVIDIA/cccl)) like [BlockReduce](https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockReduce.html#_CPPv4I0_i_20BlockReduceAlgorithm_i_iEN3cub11BlockReduceE) and [BlockScan](https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockScan.html#_CPPv4I0_i_18BlockScanAlgorithm_i_iEN3cub9BlockScanE) to maximize efficiency.
 
 ### Rejection Sampling