Complete step-by-step explanation of text generation: how models generate text using autoregressive generation, sampling, and decoding strategies.
- What is Generation?
- Autoregressive Generation
- Sampling Strategies
- Temperature
- Top-k Sampling
- Top-p (Nucleus) Sampling
- Step-by-Step Generation Process
- Exercise: Complete Generation Example
- Key Takeaways
Generation (text generation) is the process of using a trained model to produce new text, one token at a time, based on a given prompt.
Think of generation like writing a story:
Prompt: "Once upon a time"
Model generates:
"Once upon a time" → "there"
"Once upon a time there" → "was"
"Once upon a time there was" → "a"
"Once upon a time there was a" → "princess"
...
Final: "Once upon a time there was a princess..."
The model predicts the next word, one at a time!
Generation:
- Takes a prompt (starting text)
- Predicts next token probabilities
- Samples a token from distribution
- Appends token to sequence
- Repeats until complete
Result: Generated text continuation!
Autoregressive means the model uses its own previous outputs as inputs for the next prediction.
Step 1: Initial Prompt
Prompt: "Hello"
Sequence: ["Hello"]
Step 2: First Prediction
Input: ["Hello"]
Model output: Probabilities for next token
"World": 0.4
"there": 0.3
"friend": 0.2
...
Step 3: Sample Token
Sample: "World" (selected)
Sequence: ["Hello", "World"]
Step 4: Second Prediction
Input: ["Hello", "World"]
Model output: Probabilities for next token
"!": 0.5
".": 0.3
",": 0.1
...
Step 5: Continue
Sample: "!"
Sequence: ["Hello", "World", "!"]
Continue until max length or stop token...
For a prompt, the autoregressive loop is:

Initialization:
$$\mathbf{T}_0 = \text{tokenize}(\text{prompt})$$

For each step $t = 1, 2, \ldots$:

- Forward pass:
$$\mathbf{L}_t = \text{Model}(\mathbf{T}_{t-1})$$
- Get next-token probabilities:
$$\mathbf{p}_t = \text{softmax}(\mathbf{L}_t[:, -1, :])$$
- Sample token:
$$t_t \sim \text{Categorical}(\mathbf{p}_t)$$
- Append token:
$$\mathbf{T}_t = [\mathbf{T}_{t-1}, t_t]$$

Repeat until a stop condition (maximum length or stop token) is met!
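This loop can be sketched in a few lines of Python. A toy bigram lookup table stands in for a real model here — the `VOCAB`, `BIGRAM_LOGITS`, and `generate` names are illustrative, not from any library:

```python
import numpy as np

# Toy stand-in for Model(T): maps the last token to next-token logits.
# A real model would run a full transformer forward pass over the sequence.
VOCAB = ["Hello", "World", "!", "<eos>"]
BIGRAM_LOGITS = {
    "Hello": np.array([-2.0, 3.0, 0.0, -1.0]),   # "Hello" -> likely "World"
    "World": np.array([-2.0, -1.0, 3.0, 0.0]),   # "World" -> likely "!"
    "!":     np.array([-2.0, -1.0, -1.0, 3.0]),  # "!"     -> likely "<eos>"
    "<eos>": np.zeros(4),
}

def generate(prompt_tokens, max_new_tokens=10, rng=None):
    rng = rng or np.random.default_rng(0)
    seq = list(prompt_tokens)                     # T_0: the prompt
    for _ in range(max_new_tokens):
        logits = BIGRAM_LOGITS[seq[-1]]           # L_t = Model(T_{t-1})
        probs = np.exp(logits - logits.max())     # numerically stable softmax
        probs /= probs.sum()                      # p_t
        token = VOCAB[rng.choice(len(VOCAB), p=probs)]  # t_t ~ Categorical(p_t)
        seq.append(token)                         # T_t = [T_{t-1}, t_t]
        if token == "<eos>":                      # stop condition
            break
    return seq

print(generate(["Hello"]))  # most often: ['Hello', 'World', '!', '<eos>']
```

Because sampling is stochastic, other continuations occur occasionally; only the stop conditions are guaranteed.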
Deterministic (Greedy):
Always pick highest probability:
"World": 0.4 ← Highest
"there": 0.3
"friend": 0.2
→ Always picks "World"
→ Same output every time
Stochastic (Sampling):
Sample from distribution:
"World": 0.4 (40% chance)
"there": 0.3 (30% chance)
"friend": 0.2 (20% chance)
→ Different output each time
→ More diverse generations
Greedy (Deterministic):
- Same output every time
- Can be repetitive
- Less creative
Sampling:
- Different outputs each time
- More diverse
- More creative
- Better for creative tasks
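The contrast is easy to see with the toy probabilities from the example above (a small numpy sketch; the distribution is renormalized to sum to 1):

```python
import numpy as np

tokens = ["World", "there", "friend"]
probs = np.array([0.4, 0.3, 0.2]) / 0.9   # toy distribution, renormalized

# Greedy: always the argmax -- identical output on every run.
greedy = tokens[int(np.argmax(probs))]
print(greedy)  # World

# Sampling: draw from the distribution -- output varies run to run.
rng = np.random.default_rng()
sampled = tokens[rng.choice(len(tokens), p=probs)]
print(sampled)  # "World" ~44% of runs, "there" ~33%, "friend" ~22%
```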
Temperature controls the randomness of sampling by scaling the logits before applying softmax.
$$\mathbf{p}_t = \text{softmax}\left(\frac{\mathbf{l}_t}{T}\right)$$

Where:
- $\mathbf{l}_t$ = logits (raw scores)
- $T$ = temperature
- $\mathbf{p}_t$ = probabilities
T = 0.5 (Low Temperature - More Deterministic):
Logits: [2.0, 1.0, 0.5]
After scaling: [4.0, 2.0, 1.0]
After softmax: [0.84, 0.11, 0.04]
→ Sharp distribution (one token dominates)
→ More deterministic
T = 1.0 (Standard Temperature):
Logits: [2.0, 1.0, 0.5]
After scaling: [2.0, 1.0, 0.5]
After softmax: [0.63, 0.23, 0.14]
→ Moderate distribution
→ Balanced
T = 2.0 (High Temperature - More Random):
Logits: [2.0, 1.0, 0.5]
After scaling: [1.0, 0.5, 0.25]
After softmax: [0.48, 0.29, 0.23]
→ Flat distribution (more uniform)
→ More random
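A quick sketch of the scaling, assuming numpy (values rounded to two decimals):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
for T in (0.5, 1.0, 2.0):
    print(f"T={T}:", np.round(softmax(logits / T), 2))
# T=0.5: [0.84 0.11 0.04]
# T=1.0: [0.63 0.23 0.14]
# T=2.0: [0.48 0.29 0.23]
```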
(Chart: probability of the most likely token across "World", "there", "friend" — highest at T=0.5, intermediate at T=1.0, lowest at T=2.0.)
Lower T = Sharper distribution = More deterministic
Higher T = Flatter distribution = More random
Low Temperature (T < 1.0):
- Factual tasks
- Reproducible outputs
- When you want consistent results
Standard Temperature (T = 1.0):
- Default setting
- Balanced behavior
- Good for most tasks
High Temperature (T > 1.0):
- Creative writing
- Diverse outputs
- When you want variety
Top-k sampling limits the sampling to only the top k most likely tokens.
Step 1: Get Probabilities
All tokens:
"World": 0.4
"there": 0.3
"friend": 0.2
"hello": 0.05
"cat": 0.03
"dog": 0.02
...
Step 2: Select Top-k (e.g., k=3)
Top 3:
"World": 0.4
"there": 0.3
"friend": 0.2
Step 3: Remove Others
Set others to 0:
"World": 0.4
"there": 0.3
"friend": 0.2
"hello": 0.0
"cat": 0.0
"dog": 0.0
...
Step 4: Renormalize
Sum = 0.4 + 0.3 + 0.2 = 0.9
Renormalize:
"World": 0.4/0.9 = 0.44
"there": 0.3/0.9 = 0.33
"friend": 0.2/0.9 = 0.22
Step 5: Sample from Top-k
Sample from these 3 tokens only
Given probabilities $\mathbf{p}$, top-k sampling keeps the $k$ most likely tokens and renormalizes:

$$p'_i = \begin{cases} \dfrac{p_i}{\sum_{j \in \text{top-}k} p_j} & \text{if } i \in \text{top-}k \\ 0 & \text{otherwise} \end{cases}$$
Benefits:
- Removes low-probability tokens
- Focuses on likely candidates
- Reduces randomness from unlikely tokens
- Better quality generations
Example:
Without top-k: Might sample "xyz" (very unlikely)
With top-k=50: Only samples from top 50 tokens
→ Better quality!
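The five steps above can be sketched as a small numpy helper (`top_k_filter` is an illustrative name, not a library function):

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero the rest, renormalize."""
    filtered = np.zeros_like(probs)
    top_idx = np.argsort(probs)[-k:]      # indices of the k largest probabilities
    filtered[top_idx] = probs[top_idx]
    return filtered / filtered.sum()

probs = np.array([0.4, 0.3, 0.2, 0.05, 0.03, 0.02])
print(np.round(top_k_filter(probs, 3), 2))  # ≈ [0.44, 0.33, 0.22, 0, 0, 0]
```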
Top-p (nucleus) sampling keeps the smallest set of tokens whose cumulative probability is at least p.
Step 1: Sort Probabilities
Sorted (descending):
"World": 0.4
"there": 0.3
"friend": 0.2
"hello": 0.05
"cat": 0.03
"dog": 0.02
...
Step 2: Compute Cumulative Probabilities
Cumulative:
"World": 0.4
"there": 0.7 (0.4 + 0.3)
"friend": 0.9 (0.7 + 0.2)
"hello": 0.95 (0.9 + 0.05)
"cat": 0.98 (0.95 + 0.03)
...
Step 3: Find Nucleus (e.g., p=0.9)
Find smallest set where sum ≥ 0.9:
"World": 0.4
"there": 0.3
"friend": 0.2
Cumulative: 0.9 ✓
→ Keep these 3 tokens
Step 4: Remove Others
Keep:
"World": 0.4
"there": 0.3
"friend": 0.2
Others: 0.0
Step 5: Renormalize and Sample
Renormalize and sample
Given probabilities $\mathbf{p}$ sorted in descending order, find the smallest set $S$ such that:

$$\sum_{i \in S} p_i \geq p$$

Then renormalize over $S$:

$$p'_i = \begin{cases} \dfrac{p_i}{\sum_{j \in S} p_j} & \text{if } i \in S \\ 0 & \text{otherwise} \end{cases}$$
Benefits:
- Adapts to distribution shape
- Keeps relevant tokens dynamically
- Better than fixed k in some cases
- More flexible than top-k
Example:
Sharp distribution: Top-p=0.9 might keep 3 tokens
Flat distribution: Top-p=0.9 might keep 50 tokens
→ Adapts automatically!
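Nucleus filtering can be sketched the same way (`top_p_filter` is an illustrative name; the small `eps` guards against floating-point round-off in the cumulative sum, e.g. 0.4 + 0.3 + 0.2 evaluating to 0.8999…):

```python
import numpy as np

def top_p_filter(probs, p, eps=1e-12):
    """Keep the smallest set of most likely tokens whose cumulative prob >= p."""
    order = np.argsort(probs)[::-1]                    # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p - eps)) + 1
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()

probs = np.array([0.4, 0.3, 0.2, 0.05, 0.03, 0.02])
print(np.round(top_p_filter(probs, 0.9), 2))  # ≈ [0.44, 0.33, 0.22, 0, 0, 0]
```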
Given prompt: "Hello"
Prompt: "Hello"
Token IDs: [72]
Input: [72]
Model processes through layers
Output: Logits for all tokens
Token 72: 5.2
Token 87: 4.8 ← "World"
Token 101: 3.2 ← "there"
Token 108: 2.1 ← "friend"
...
Temperature: T = 1.0
Scaled logits: Same as above
Top-k: k = 50
Keep top 50 tokens, remove others
Top-p: p = 0.95
Keep the smallest set of tokens whose cumulative probability is ≥ 0.95
Apply softmax:
"World": 0.4
"there": 0.3
"friend": 0.2
...
Sample from distribution:
Selected: "World" (token 87)
Sequence: [72, 87]
Text: "Hello World"
Input: [72, 87]
→ Predict next token
→ Sample
→ Append
→ Repeat...
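Putting the steps together, one decoding step (temperature → top-k → top-p → sample) might look like this; `sample_next` and its defaults are a sketch, not a library API:

```python
import numpy as np

def sample_next(logits, temperature=1.0, k=50, p=0.95, rng=None):
    """One decoding step: temperature scaling -> top-k -> top-p -> sample."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    # Top-k: mask everything below the k-th largest logit.
    if k < len(logits):
        kth = np.sort(logits)[-k]
        logits = np.where(logits >= kth, logits, -np.inf)
    probs = np.exp(logits - logits.max())      # exp(-inf) = 0 removes masked tokens
    probs /= probs.sum()
    # Top-p: keep the smallest most-likely prefix reaching cumulative p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p - 1e-12)) + 1
    mask = np.zeros(len(probs), dtype=bool)
    mask[order[:cutoff]] = True
    probs = np.where(mask, probs, 0.0)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Exercise settings: logits for "cat", "dog", "car", ...; T=1, k=3, p=0.9
idx = sample_next([10.0, 8.0, 5.0, 2.0, 1.0, 0.5], temperature=1.0, k=3, p=0.9)
print(idx)  # 0 ("cat") most of the time, occasionally 1 ("dog")
```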
Given:
- Prompt: "The"
- Model logits for next token:
  [10.0, 8.0, 5.0, 2.0, 1.0, 0.5, ...] (for tokens: "cat", "dog", "car", "house", "tree", "book", ...)
- Temperature: T = 1.0
- Top-k: k = 3
- Top-p: p = 0.9
Generate the next token step-by-step.
Prompt:
"The"
Token IDs: [32] (assuming "The" = token 32)
Logits:
Token "cat": 10.0
Token "dog": 8.0
Token "car": 5.0
Token "house": 2.0
Token "tree": 1.0
Token "book": 0.5
...
Temperature: T = 1.0
Scaled logits (divide by T):
Token "cat": 10.0 / 1.0 = 10.0
Token "dog": 8.0 / 1.0 = 8.0
Token "car": 5.0 / 1.0 = 5.0
Token "house": 2.0 / 1.0 = 2.0
Token "tree": 1.0 / 1.0 = 1.0
Token "book": 0.5 / 1.0 = 0.5
No change (T=1.0 is identity)
Top-k: k = 3
Select top 3 tokens:
Top 3:
"cat": 10.0
"dog": 8.0
"car": 5.0
Set others to -∞:
Token "cat": 10.0
Token "dog": 8.0
Token "car": 5.0
Token "house": -∞
Token "tree": -∞
Token "book": -∞
First, compute probabilities from top-k tokens:
Apply softmax:
exp(10.0) = 22026.47
exp(8.0) = 2980.96
exp(5.0) = 148.41
Sum = 25155.84
P("cat") = 22026.47 / 25155.84 ≈ 0.876
P("dog") = 2980.96 / 25155.84 ≈ 0.118
P("car") = 148.41 / 25155.84 ≈ 0.006
Cumulative probabilities:
"cat": 0.876
"dog": 0.994 (0.876 + 0.118)
"car": 1.000 (0.994 + 0.006)
Find smallest set where sum ≥ 0.9:
"cat": 0.876 < 0.9
"cat" + "dog": 0.994 ≥ 0.9 ✓
→ Keep "cat" and "dog"
→ Remove "car"
Result:
Token "cat": 10.0
Token "dog": 8.0
Token "car": -∞ (removed)
Apply softmax to remaining tokens:
exp(10.0) = 22026.47
exp(8.0) = 2980.96
Sum = 25007.43
P("cat") = 22026.47 / 25007.43 ≈ 0.881
P("dog") = 2980.96 / 25007.43 ≈ 0.119
Sample from distribution:
Random number: 0.75
Cumulative:
"cat": 0.881 ← 0.75 falls here
"dog": 1.000
→ Selected: "cat"
Generated token: "cat"
Final sequence:
Prompt: "The"
Generated: "cat"
Full text: "The cat"
| Step | Operation | Result |
|---|---|---|
| 1 | Initial logits | [10.0, 8.0, 5.0, 2.0, ...] |
| 2 | Apply temperature (T=1.0) | [10.0, 8.0, 5.0, 2.0, ...] |
| 3 | Top-k filtering (k=3) | Keep top 3: [10.0, 8.0, 5.0] |
| 4 | Top-p filtering (p=0.9) | Keep cumulative ≥0.9: [10.0, 8.0] |
| 5 | Compute probabilities | [0.881, 0.119] |
| 6 | Sample | "cat" selected |
The model generated "cat" following "The"!
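The arithmetic in the solution can be double-checked numerically (a numpy sketch):

```python
import numpy as np

topk_logits = np.array([10.0, 8.0, 5.0])            # "cat", "dog", "car" after top-k

probs = np.exp(topk_logits) / np.exp(topk_logits).sum()
print(np.round(probs, 3))                            # ≈ [0.876, 0.118, 0.006]

cumulative = np.cumsum(probs)
keep = int(np.searchsorted(cumulative, 0.9)) + 1     # tokens surviving top-p (p=0.9)
final = probs[:keep] / probs[:keep].sum()
print(keep, np.round(final, 3))                      # 2 tokens -> ≈ [0.881, 0.119]
```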
✅ Generation produces text one token at a time
✅ Autoregressive: uses previous outputs as inputs
✅ Iterative process: predict → sample → append → repeat
✅ Temperature: Controls randomness (lower = deterministic, higher = random)
✅ Top-k: Limits to top k tokens
✅ Top-p: Keeps smallest set with cumulative probability ≥ p
✅ Combined: Often use temperature + top-k or top-p
✅ Enables text generation from trained models
✅ Different strategies produce different outputs
✅ Essential for language model deployment
This document provides a comprehensive explanation of text generation, including autoregressive generation, sampling strategies, temperature, top-k, and top-p with mathematical formulations and solved exercises.