Commit 2e00301

upd
1 parent 2680cd0 commit 2e00301

1 file changed: +7 -2 lines changed

_posts/2025-03-10-sampling.md

Lines changed: 7 additions & 2 deletions
@@ -1,11 +1,16 @@
 ---
 layout: post
-title: "Use FlashInfer For Fast(er) LLM Sampling"
+title: "Sorting-Free Rejection Sampling GPU-Kernels in FlashInfer for Faster Inference"
 date: 2025-03-10
 comments: true
-author: FlashInfer Community
+author: Shanli Xing (UW), Zihao Ye (UW), Bohan Hou (CMU), Luis Ceze (UW), Tianqi Chen (CMU)
 ---
 
+## Background
+
+As vocabulary sizes grow in Large Language Models (LLMs), the sampling (token-selection) step becomes a performance bottleneck. Sampling is a key operator in LLM inference serving. The [sampling operators](https://docs.flashinfer.ai/api/sampling.html) in FlashInfer were first introduced in [v0.0.5](https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.0.5),
+and the FlashInfer team has been improving their robustness and performance since then. In this blog, we walk through the algorithm and implementation details of FlashInfer's sampling operators.
+
 ## LLM Sampling
 
 Sampling is the process of picking a specific next token from the vector of model logits (one entry per vocabulary token). In practice, heuristics such as Top-P, Top-K, or Min-P thresholds are usually applied to filter out tokens with negligible probability, control generation behavior, and enforce minimum probabilities.
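For reference, here is a minimal, illustrative sketch of what Top-K/Top-P filtering followed by sampling looks like in plain PyTorch. It is not part of this commit and is not FlashInfer's implementation (the post's subject is sorting-free rejection sampling kernels that avoid exactly this kind of sort); the function name and default thresholds below are made up for the example.

```python
import torch

def sample_top_k_top_p(logits: torch.Tensor, top_k: int = 50, top_p: float = 0.9) -> torch.Tensor:
    """Sample one token id from 1-D `logits` after naive Top-K and Top-P (nucleus) filtering."""
    probs = torch.softmax(logits, dim=-1)

    # Top-K: zero out everything below the k-th largest probability.
    k = min(top_k, probs.shape[-1])
    kth_value = torch.topk(probs, k).values[-1]
    probs = torch.where(probs >= kth_value, probs, torch.zeros_like(probs))

    # Top-P: keep the smallest prefix of tokens (by descending probability)
    # whose cumulative probability mass reaches p.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p  # keep a token if the mass before it is still < p
    sorted_probs = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    probs = torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)

    # Renormalize and draw one token.
    probs = probs / probs.sum()
    return torch.multinomial(probs, num_samples=1)

# Example: one decoding step over a hypothetical 32k-token vocabulary.
logits = torch.randn(32000)
next_token = sample_top_k_top_p(logits, top_k=50, top_p=0.9)
```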
