Commit 2e00301

upd
1 parent 2680cd0 commit 2e00301

1 file changed: +7 -2 lines changed

_posts/2025-03-10-sampling.md

Lines changed: 7 additions & 2 deletions
@@ -1,11 +1,16 @@
 ---
 layout: post
-title: "Use FlashInfer For Fast(er) LLM Sampling"
+title: "Sorting-Free Rejection Sampling GPU-Kernels in FlashInfer for Faster Inference"
 date: 2025-03-10
 comments: true
-author: FlashInfer Community
+author: Shanli Xing (UW), Zihao Ye (UW), Bohan Hou (CMU), Luis Ceze (UW), Tianqi Chen (CMU)
 ---
 
+## Background
+
+As vocabulary sizes grow in Large Language Models (LLMs), the sampling (token-selection) step becomes a performance bottleneck. Sampling is a key operator in LLM inference serving. The [sampling operators](https://docs.flashinfer.ai/api/sampling.html) in FlashInfer were first introduced in [v0.0.5](https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.0.5),
+and the FlashInfer team has been improving their robustness and performance since then. In this blog, we walk through the algorithm and implementation details of FlashInfer's sampling operators.
+
 ## LLM Sampling
 
 Sampling is the process of picking a specific next token from the vector of model logits (one entry per vocabulary token). In practice, heuristics such as Top-P, Top-K, or Min-P thresholds are usually applied to filter out tokens with negligible probability, control generation behavior, and enforce minimum probabilities.
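For reference, here is a minimal, illustrative sketch of what Top-K/Top-P filtering followed by sampling looks like in plain PyTorch. It is not part of this commit and is not FlashInfer's implementation (the post's subject is sorting-free rejection sampling kernels that avoid exactly this kind of sort); the function name and default thresholds below are made up for the example.

```python
import torch

def sample_top_k_top_p(logits: torch.Tensor, top_k: int = 50, top_p: float = 0.9) -> torch.Tensor:
    """Sample one token id from 1-D `logits` after naive Top-K and Top-P (nucleus) filtering."""
    probs = torch.softmax(logits, dim=-1)

    # Top-K: zero out everything below the k-th largest probability.
    k = min(top_k, probs.shape[-1])
    kth_value = torch.topk(probs, k).values[-1]
    probs = torch.where(probs >= kth_value, probs, torch.zeros_like(probs))

    # Top-P: keep the smallest prefix of tokens (by descending probability)
    # whose cumulative probability mass reaches p.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p  # keep a token if the mass before it is still < p
    sorted_probs = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    probs = torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)

    # Renormalize and draw one token.
    probs = probs / probs.sum()
    return torch.multinomial(probs, num_samples=1)

# Example: one decoding step over a hypothetical 32k-token vocabulary.
logits = torch.randn(32000)
next_token = sample_top_k_top_p(logits, top_k=50, top_p=0.9)
```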
