A complete mathematical formulation of the SheepOp Language Model, treating the entire system as a unified control system with state-space representations, transfer functions, and step-by-step explanations.
- System Overview
- State-Space Representation
- Tokenizer as Input Encoder
- Seed Control System
- Embedding Layer Control
- Positional Encoding State
- Self-Attention Control System
- Feed-Forward Control
- Layer Normalization Feedback
- Complete System Dynamics
- Training as Optimization Control
- Inference Control Loop
The SheepOp LLM can be modeled as a nonlinear dynamical control system with:
- Input: character sequence $\mathbf{c} = [c_1, c_2, ..., c_n]$
- State: hidden representations $\mathbf{h}_t$ at each layer and time step
- Control: model parameters $\theta = \{W_Q, W_K, W_V, W_1, W_2, ...\}$
- Output: probability distribution over vocabulary $\mathbf{p}_t \in \mathbb{R}^V$
System Block Diagram:
Input Sequence → Tokenizer → Embeddings → Positional Encoding →
↓
[Transformer Layer 1] → [Transformer Layer 2] → ... → [Transformer Layer L]
↓
Output Projection → Logits → Softmax → Output Probabilities
The complete system can be expressed as:

$$\mathbf{p}_t = \mathcal{F}(\mathbf{x}_t, \mathbf{h}_t; \theta, \mathbf{s})$$

where:
- $\mathbf{x}_t$ = input at time $t$
- $\mathbf{h}_t$ = hidden state at time $t$
- $\theta$ = system parameters (weights)
- $\mathbf{s}$ = seed for randomness
- $\mathcal{F}$ = complete forward function
For a transformer with $L$ layers and sequence length $n$:

State Vector:

$$\mathbf{H} = [\mathbf{h}^{(0)}, \mathbf{h}^{(1)}, ..., \mathbf{h}^{(L)}]$$

where $\mathbf{h}^{(l)} \in \mathbb{R}^{n \times d}$ and $d$ is the model dimension.

State Update Equation:

$$\mathbf{h}^{(l)} = \text{Block}_l(\mathbf{h}^{(l-1)}), \quad l = 1, ..., L$$

Output Equation:

$$\mathbf{p} = \text{softmax}(\text{LN}(\mathbf{h}^{(L)}) \mathbf{W}_{out})$$
The system is nonlinear due to:
- Attention mechanism (softmax)
- Activation functions (GELU)
- Layer normalization
However, individual components can be analyzed as locally linearized systems around an operating point.
The tokenizer maps a character sequence to a discrete token sequence:
Mathematical Formulation:

For input sequence $\mathbf{c} = [c_1, ..., c_n]$:

$$\mathbf{t} = \mathcal{T}(\mathbf{c}) = [V(c_1), V(c_2), ..., V(c_n)]$$

where $V$ maps each character to its vocabulary index.
Control Properties:
- Deterministic: same input always produces same output
- Invertible: for most tokens, $V^{-1}$ exists
- Bijective: each character maps to a unique token ID
The tokenizer maintains internal state:
State Transition:
Step 1: Character Extraction
- Input: raw text string "Hello"
- Process: extract each character $c \in \{'H', 'e', 'l', 'l', 'o'\}$
- Meaning: break down text into atomic units

Step 2: Vocabulary Lookup
- Process: apply $V(c)$ to each character
- Example: $V(\text{'H'}) = 72$, $V(\text{'e'}) = 101$, $V(\text{'l'}) = 108$, $V(\text{'o'}) = 111$
- Meaning: convert characters to numerical indices

Step 3: Sequence Formation
- Output: $\mathbf{t} = [72, 101, 108, 108, 111]$
- Meaning: numerical representation ready for embedding
Control Impact: Tokenizer creates the foundation for all subsequent processing. Any error here propagates through the entire system.
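As a concrete illustration, here is a minimal Python sketch of the character-level tokenizer, assuming the vocabulary map $V$ is the Unicode code point (`ord`), which matches the example values above; the class name `CharTokenizer` is illustrative, not SheepOp's actual API.

```python
# Minimal sketch of the character-level tokenizer described above.
# Assumes V(c) = ord(c) (code-point lookup), matching the example
# values V('H') = 72, V('e') = 101; the class name is illustrative.

class CharTokenizer:
    def encode(self, text: str) -> list[int]:
        """Map each character to its token ID (Steps 1-3)."""
        return [ord(c) for c in text]

    def decode(self, token_ids: list[int]) -> str:
        """Inverse map V^{-1}: token IDs back to characters."""
        return "".join(chr(t) for t in token_ids)

tokenizer = CharTokenizer()
assert tokenizer.encode("Hello") == [72, 101, 108, 108, 111]
assert tokenizer.decode([72, 101, 108, 108, 111]) == "Hello"
```

The round-trip assertions exercise both the deterministic and invertible control properties listed above.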
The seed $\mathbf{s}$ controls the initial conditions and all stochastic processes in the system.

Initialization:

$$\text{RNG state} \leftarrow \mathbf{s}$$

Mathematical Model:

For weight initialization:

$$W_{ij} \sim \mathcal{N}(0, \sigma^2), \quad \sigma = 0.02$$

where every sample is a deterministic function of the RNG state set by $\mathbf{s}$.

Example - Normal Initialization:
Step 1: Seed Input
- Input: $s = 42$
- Meaning: provides reproducibility guarantee

Step 2: RNG State Initialization
- Process: set all random number generators to a state derived from $s$
- Meaning: ensures deterministic behavior

Step 3: Weight Initialization
- Process: generate all weights using the RNG seeded with $s$
- Example: $W_{ij} = \text{normal}(0, 0.02, \text{seed}=42)$
- Meaning: starting point for optimization
Step 4: Training Determinism
- Process: Same seed + same data → same gradients → same updates
- Meaning: Complete reproducibility
Control Impact: Seed controls initial conditions and stochastic processes throughout training. It's the control parameter for reproducibility.
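A minimal sketch of seed control using NumPy's `default_rng`; the function name and shapes are illustrative, and the standard deviation 0.02 matches the normal-initialization example above.

```python
# Sketch of seed control: the same seed always reproduces the same
# initial weights, which is the reproducibility guarantee above.
import numpy as np

def init_weights(shape, seed=42):
    """Deterministic weight init: same seed -> identical W every run."""
    rng = np.random.default_rng(seed)      # RNG state fixed by s
    return rng.normal(loc=0.0, scale=0.02, size=shape)

W_a = init_weights((512, 512), seed=42)
W_b = init_weights((512, 512), seed=42)
assert np.array_equal(W_a, W_b)            # full reproducibility
```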
The embedding layer performs a lookup operation:

Mathematical Formulation:

$$\mathbf{x}_t = \mathbf{E}[t], \quad \mathbf{E} \in \mathbb{R}^{V \times d}$$

Batch Processing:

$$\mathbf{X} = \mathbf{E}[\mathbf{T}] \in \mathbb{R}^{B \times n \times d}$$

Control Function: the embedding matrix $\mathbf{E}$ is a learned parameter that maps each discrete token ID to a point in continuous space.

Gradient Flow: during backpropagation, only the rows of $\mathbf{E}$ indexed by the batch receive gradient updates.
Step 1: Token ID Input
- Input: $t = 72$ (token ID for 'H')
- Meaning: discrete index into vocabulary

Step 2: Matrix Lookup
- Process: $\mathbf{x} = \mathbf{E}[72]$
- Example: $\mathbf{x} = [0.1, -0.2, 0.3, ..., 0.05] \in \mathbb{R}^{512}$
- Meaning: continuous vector representation
Step 3: Semantic Encoding
- Property: Similar tokens have similar embeddings (after training)
- Meaning: Embeddings capture semantic relationships
Control Impact: Embedding layer projects discrete tokens into continuous space, enabling gradient-based optimization.
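A sketch of the embedding lookup as plain row indexing, assuming an ASCII-sized vocabulary $V = 128$ and $d = 512$ as in the examples; all names here are illustrative.

```python
# Embedding lookup as a row-indexing operation into E in R^{V x d}.
import numpy as np

rng = np.random.default_rng(42)
V, d = 128, 512
E = rng.normal(0.0, 0.02, size=(V, d))    # embedding matrix E

t = np.array([72, 101, 108, 108, 111])    # token IDs for "Hello"
X = E[t]                                  # lookup: X[i] = E[t_i]
print(X.shape)                            # (5, 512)
```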
Additive Control:

$$\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE}, \quad PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

Meaning: Positional encoding injects positional information into the embeddings.
Step 1: Position Index
- Input: position $pos = 0, 1, 2, ..., n-1$
- Meaning: absolute position in sequence

Step 2: Encoding Generation
- Process: compute $PE_{(pos, i)}$ for each dimension $i$
- Example: $PE_{(0, 0)} = 0$, $PE_{(0, 1)} = 1$, $PE_{(1, 0)} \approx 0.84$
- Meaning: unique pattern for each position

Step 3: Addition Operation
- Process: $\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE}$
- Meaning: position information added to embeddings
Step 4: Multi-Scale Representation
- Property: Different dimensions encode different frequency scales
- Meaning: Model can learn both local and global positional patterns
Control Impact: Positional encoding provides temporal/spatial awareness to the model, enabling it to understand sequence order.
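The sinusoidal encoding above can be checked numerically with a short sketch; this assumes the standard interleaved sine/cosine layout, which reproduces the example values $PE_{(0,0)} = 0$, $PE_{(0,1)} = 1$, $PE_{(1,0)} \approx 0.84$.

```python
# Sinusoidal positional encoding sketch (standard interleaved layout).
import numpy as np

def positional_encoding(n: int, d: int) -> np.ndarray:
    pos = np.arange(n)[:, None]             # positions 0..n-1
    i = np.arange(d // 2)[None, :]          # frequency index
    angle = pos / (10000.0 ** (2 * i / d))  # multi-scale frequencies
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angle)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)             # odd dimensions: cosine
    return pe

PE = positional_encoding(11, 512)
print(round(PE[0, 0], 2), round(PE[0, 1], 2), round(PE[1, 0], 2))
# 0.0 1.0 0.84  -> matches the example values above
# Additive control: X_pos = X + PE[:len(X)]
```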
Self-attention can be modeled as a dynamical control system that routes information:

Query, Key, Value Generation:

$$\mathbf{Q} = \mathbf{X}\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{X}\mathbf{W}_K, \quad \mathbf{V} = \mathbf{X}\mathbf{W}_V$$

Attention Scores (Transfer Function):

$$\mathbf{S} = \frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}$$

Attention Weights (Control Signal):

$$\mathbf{A} = \text{softmax}(\mathbf{S})$$

Output (Controlled Response):

$$\mathbf{O} = \mathbf{A}\mathbf{V}$$

Attention as Feedback Control: the control signal $\mathbf{A}$ is computed from the same state $\mathbf{X}$ that it acts on.

Meaning: Attention acts as a learnable routing mechanism controlled by similarities between queries and keys.
Head Splitting:

$$\mathbf{Q}_h = \mathbf{Q}[:, h d_k : (h+1) d_k], \quad h = 0, ..., H-1, \quad d_k = d / H$$

Parallel Processing:

$$\mathbf{O}_h = \text{softmax}\left(\frac{\mathbf{Q}_h \mathbf{K}_h^T}{\sqrt{d_k}}\right) \mathbf{V}_h$$

Concatenation:

$$\mathbf{O} = [\mathbf{O}_0; \mathbf{O}_1; ...; \mathbf{O}_{H-1}] \mathbf{W}_O$$

Causal Mask:

$$M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}$$

Masked Attention:

$$\mathbf{A} = \text{softmax}(\mathbf{S} + \mathbf{M})$$
Effect: Prevents information flow from future positions.
Step 1: Query, Key, Value Generation
- Process: Linear transformations of input
- Meaning: Create three representations: what to look for (Q), what to match (K), what to retrieve (V)
Step 2: Similarity Computation
- Process: $S_{ij} = Q_i \cdot K_j / \sqrt{d_k}$
- Meaning: measure similarity/relevance between positions $i$ and $j$

Step 3: Softmax Normalization
- Process: $A_{ij} = \exp(S_{ij}) / \sum_k \exp(S_{ik})$
- Meaning: convert similarities to probability distribution (attention weights)

Step 4: Weighted Aggregation
- Process: $O_i = \sum_j A_{ij} V_j$
- Meaning: combine values weighted by attention probabilities
Step 5: Information Flow
- Property: Each position receives information from all other positions (with causal masking)
- Meaning: Enables long-range dependencies and context understanding
Control Impact: Self-attention is the core control mechanism that determines what information flows where in the sequence.
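A runnable single-head sketch of Steps 1-5, assuming a causal mask and randomly initialized projection weights for illustration; SheepOp's actual multi-head implementation would split $d$ across $H$ heads as described above.

```python
# Single-head causal self-attention sketch (Steps 1-5 above).
import numpy as np

def causal_self_attention(X, W_Q, W_K, W_V):
    n, d_k = X.shape[0], W_Q.shape[1]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V               # Step 1: projections
    S = Q @ K.T / np.sqrt(d_k)                        # Step 2: similarities
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # future positions
    S = np.where(mask, -np.inf, S)                    # causal masking
    A = np.exp(S - S.max(axis=-1, keepdims=True))     # Step 3: stable softmax
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V                                      # Step 4: weighted sum

rng = np.random.default_rng(42)
d, d_k = 512, 64
X = rng.normal(0, 0.02, (5, d))                       # "Hello", 5 tokens
W_Q, W_K, W_V = (rng.normal(0, 0.02, (d, d_k)) for _ in range(3))
O = causal_self_attention(X, W_Q, W_K, W_V)
print(O.shape)                                        # (5, 64)
```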
Two-Stage Transformation:

$$\text{FFN}(\mathbf{X}) = \text{GELU}(\mathbf{X}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2$$

Control Interpretation: GELU applies smooth gating - values near zero are suppressed, positive values pass through.
Step 1: Expansion
- Process: $\mathbf{H} = \mathbf{X}\mathbf{W}_1$ expands to $d_{ff} > d$
- Example: $d = 512 \rightarrow d_{ff} = 2048$
- Meaning: increases capacity for complex transformations

Step 2: Nonlinear Activation
- Process: $\mathbf{H}' = \text{GELU}(\mathbf{H})$
- Meaning: introduces nonlinearity, enabling complex function approximation

Step 3: Compression
- Process: $\mathbf{O} = \mathbf{H}'\mathbf{W}_2$ compresses back to $d$
- Meaning: projects back to original dimension
Control Impact: FFN provides nonlinear processing power and feature transformation at each position.
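A sketch of the two-stage FFN, using the common tanh approximation of GELU; the dimensions $d = 512 \rightarrow d_{ff} = 2048$ follow the example, and all weights are random placeholders.

```python
# FFN sketch: expand -> GELU gate -> compress.
import numpy as np

def gelu(x):
    # Smooth gating: suppresses values near zero, passes positives.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(X, W1, b1, W2, b2):
    H = X @ W1 + b1        # Step 1: expansion to d_ff
    H = gelu(H)            # Step 2: nonlinearity
    return H @ W2 + b2     # Step 3: compression back to d

rng = np.random.default_rng(42)
d, d_ff = 512, 2048
W1, b1 = rng.normal(0, 0.02, (d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(0, 0.02, (d_ff, d)), np.zeros(d)
X = rng.normal(0, 1.0, (11, d))
print(ffn(X, W1, b1, W2, b2).shape)   # (11, 512)
```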
Normalization as State Regulation:

$$\text{LN}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Meaning: Normalization regulates the distribution of activations, preventing saturation and improving gradient flow.
Transformer Block with Pre-Norm:

$$\mathbf{X}' = \mathbf{X} + \text{Attn}(\text{LN}(\mathbf{X})), \quad \mathbf{X}'' = \mathbf{X}' + \text{FFN}(\text{LN}(\mathbf{X}'))$$
Control Impact: Pre-norm architecture provides stability and better gradient flow.
Step 1: Mean Computation
- Process: $\mu = \frac{1}{d} \sum_i x_i$
- Meaning: find center of distribution

Step 2: Variance Computation
- Process: $\sigma^2 = \frac{1}{d} \sum_i (x_i - \mu)^2$
- Meaning: measure spread of distribution

Step 3: Normalization
- Process: $\hat{x}_i = (x_i - \mu) / \sqrt{\sigma^2 + \epsilon}$
- Meaning: standardize to zero mean, unit variance

Step 4: Scale and Shift
- Process: $x_{out} = \gamma \odot \hat{x} + \beta$
- Meaning: allow model to learn optimal scale and shift
Control Impact: Layer normalization provides stability and faster convergence by maintaining consistent activation distributions.
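Steps 1-4 map directly to a few lines of NumPy; this sketch assumes unit $\gamma$ and zero $\beta$ for the demonstration.

```python
# Layer normalization sketch implementing Steps 1-4 above.
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)        # Step 1: mean
    var = x.var(axis=-1, keepdims=True)        # Step 2: variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # Step 3: normalize
    return gamma * x_hat + beta                # Step 4: scale and shift

d = 512
x = np.random.default_rng(42).normal(3.0, 2.0, (11, d))
y = layer_norm(x, gamma=np.ones(d), beta=np.zeros(d))
print(round(y.mean(), 4), round(y.std(), 4))   # ~0.0, ~1.0
```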
System State Evolution:

$$\mathbf{h}_0 = \mathcal{E}(\mathbf{T}) + \mathbf{PE}, \quad \mathbf{h}_l = \text{Block}_l(\mathbf{h}_{l-1}), \quad l = 1, ..., L$$

The complete system can be viewed as:

$$\mathbf{p} = \text{softmax}(\text{LN}(\mathbf{h}_L)\mathbf{W}_{out})$$
Properties:
- Nonlinear: Due to softmax, GELU, normalization
- Differentiable: All operations have gradients
- Compositional: Built from simpler functions
Step 1: Input Encoding
- Input: token sequence $\mathbf{T}$
- Process: embedding + positional encoding
- Output: $\mathbf{h}_0 \in \mathbb{R}^{B \times n \times d}$
- Meaning: convert discrete tokens to continuous vectors with position info

Step 2: Layer Processing
- For each layer $l = 1, ..., L$:
- Process: self-attention + FFN with residual connections
- Output: $\mathbf{h}_l \in \mathbb{R}^{B \times n \times d}$
- Meaning: transform representations through attention and processing

Step 3: Output Generation
- Process: final layer norm + output projection
- Output: logits $\mathbf{L} \in \mathbb{R}^{B \times n \times V}$
- Meaning: predict probability distribution over vocabulary

Step 4: Probability Computation
- Process: softmax over logits
- Output: probabilities $\mathbf{p} \in \mathbb{R}^{B \times n \times V}$
- Meaning: normalized probability distribution for next-token prediction

A compact, runnable composition of these four steps is sketched below.
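Pulling the pieces together, here is a compact end-to-end forward pass under illustrative assumptions: single-head attention, ReLU in place of GELU for brevity, random weights, and small dimensions ($V = 128$, $d = 64$, $d_{ff} = 256$, $L = 2$). It is a sketch of the system dynamics, not SheepOp's implementation.

```python
# Compact end-to-end forward pass (Steps 1-4) with illustrative sizes.
import numpy as np

rng = np.random.default_rng(42)
V, d, d_ff, L = 128, 64, 256, 2

def init(*shape):
    return rng.normal(0, 0.02, shape)

E, W_out = init(V, d), init(d, V)
blocks = [dict(W_Q=init(d, d), W_K=init(d, d), W_V=init(d, d), W_O=init(d, d),
               W1=init(d, d_ff), W2=init(d_ff, d)) for _ in range(L)]

def ln(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def forward(t):
    n = len(t)
    pos, i = np.arange(n)[:, None], np.arange(d // 2)[None, :]
    ang = pos / 10000.0 ** (2 * i / d)
    pe = np.zeros((n, d))
    pe[:, 0::2], pe[:, 1::2] = np.sin(ang), np.cos(ang)
    h = E[t] + pe                                      # Step 1: encode
    mask = np.triu(np.full((n, n), -np.inf), k=1)      # causal mask
    for blk in blocks:                                 # Step 2: L layers
        x = ln(h)
        A = softmax((x @ blk["W_Q"]) @ (x @ blk["W_K"]).T / np.sqrt(d) + mask)
        h = h + (A @ (x @ blk["W_V"])) @ blk["W_O"]    # attention + residual
        x = ln(h)
        h = h + np.maximum(x @ blk["W1"], 0) @ blk["W2"]  # FFN (ReLU here)
    return softmax(ln(h) @ W_out)                      # Steps 3-4: p

probs = forward(np.array([72, 101, 108, 108, 111]))    # "Hello"
print(probs.shape, round(float(probs[0].sum()), 3))    # (5, 128) 1.0
```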
Objective Function:

$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{t=1}^{N} \log p_\theta(y_t \mid \mathbf{x}_{<t})$$

Optimization Problem:

$$\theta^* = \arg\min_\theta \mathcal{L}(\theta)$$

Gradient Computation:

$$\mathbf{g}_t = \nabla_\theta \mathcal{L}(\theta_t)$$

Parameter Update (AdamW):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\mathbf{g}_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\mathbf{g}_t^2$$

$$\theta_{t+1} = \theta_t - \eta_t \left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t\right)$$

Cosine Annealing Schedule:

$$\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\frac{\pi t}{T_{max}}\right)$$
Control Interpretation: Learning rate acts as gain scheduling - high gain initially for fast convergence, low gain later for fine-tuning.
Clipping Function:

$$\mathbf{g}_{clipped} = \mathbf{g} \cdot \min\left(1, \frac{c}{\|\mathbf{g}\|}\right)$$
Purpose: Prevents explosive gradients that could destabilize training.
Step 1: Forward Pass
- Process: $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{x}, \theta_t)$
- Meaning: compute predictions with current parameters

Step 2: Loss Computation
- Process: $\mathcal{L} = \text{CrossEntropy}(\hat{\mathbf{y}}, \mathbf{y})$
- Meaning: measure prediction error

Step 3: Backward Pass
- Process: $\mathbf{g} = \nabla_\theta \mathcal{L}$
- Meaning: compute gradients for all parameters

Step 4: Gradient Clipping
- Process: $\mathbf{g}_{clipped} = \text{Clip}(\mathbf{g}, c)$
- Meaning: prevent gradient explosion

Step 5: Optimizer Update
- Process: $\theta_{t+1} = \text{AdamW}(\theta_t, \mathbf{g}_{clipped}, \eta_t)$
- Meaning: update parameters using adaptive learning rate

Step 6: Learning Rate Update
- Process: $\eta_{t+1} = \text{Scheduler}(\eta_t, t)$
- Meaning: adjust learning rate according to schedule
Control Impact: The training process is a closed-loop control system (see the runnable sketch after this list) where:
- Error signal: Loss
- Controller: Optimizer (AdamW)
- Actuator: Parameter updates
- Plant: Model forward pass
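Because the closed loop is the point rather than the model, the following sketch runs the controller (clipping, AdamW update, cosine-scheduled gain) on a toy quadratic plant where the gradient is known in closed form; all hyperparameters are illustrative.

```python
# Closed-loop training sketch: error -> clip -> AdamW controller ->
# parameter update, with a cosine-scheduled gain. The "plant" is a toy
# quadratic loss so the gradient is exact without autodiff.
import numpy as np

theta = np.array([5.0, -3.0])                 # parameters (plant state)
target = np.array([1.0, 2.0])                 # minimizer of the toy loss
m, v = np.zeros(2), np.zeros(2)               # AdamW moment estimates
b1, b2, eps, wd = 0.9, 0.999, 1e-8, 0.01
eta_max, eta_min, T_max, clip_c = 0.5, 0.01, 100, 1.0

for t in range(1, T_max + 1):
    g = 2 * (theta - target)                               # Step 3: gradient
    g = g * min(1.0, clip_c / (np.linalg.norm(g) + 1e-12)) # Step 4: clipping
    m = b1 * m + (1 - b1) * g                              # Step 5: AdamW
    v = b2 * v + (1 - b2) * g**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    eta = eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * t / T_max))
    theta -= eta * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)

print(np.round(theta, 3))   # converges toward the target [1, 2]
```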
State-Space Model:

$$\mathbf{p}_t = \text{softmax}(\mathbf{h}_t \mathbf{W}_{out}), \quad \mathbf{h}_{t+1} = \mathcal{F}([\mathbf{x}_{1:t}, x_{t+1}], \theta)$$
Step-by-Step:
- Current State: $\mathbf{h}_t$
- Output Generation: $\mathbf{p}_t = \text{softmax}(\mathbf{h}_t \mathbf{W}_{out})$
- Sampling: $x_{t+1} \sim \mathbf{p}_t$ (with temperature, top-k, top-p)
- State Update: $\mathbf{h}_{t+1} = \mathcal{F}([\mathbf{h}_t, x_{t+1}], \theta)$
- Repeat: until max length or stop token
Temperature Control:

$$\mathbf{p}_t = \text{softmax}(\mathbf{l}_t / T)$$

Top-k Filtering: keep only the $k$ largest logits and set the rest to $-\infty$ before the softmax.

Top-p (Nucleus) Sampling: sample from the smallest token set $S$ whose cumulative probability satisfies $\sum_{i \in S} p_i \ge p$.

These three knobs compose as in the sketch below.
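A sketch combining the three output-control knobs in one sampling function; argument names and defaults are illustrative.

```python
# Output-control sketch: temperature, top-k, and top-p (nucleus).
import numpy as np

def sample(logits, temperature=1.0, top_k=None, top_p=None, seed=0):
    rng = np.random.default_rng(seed)
    logits = logits / temperature                  # temperature control
    if top_k is not None:                          # top-k: keep k largest
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    if top_p is not None:                          # nucleus: smallest set
        order = np.argsort(p)[::-1]                # with cum. prob >= p
        cut = np.searchsorted(np.cumsum(p[order]), top_p) + 1
        mask = np.zeros_like(p)
        mask[order[:cut]] = p[order[:cut]]
        p = mask / mask.sum()
    return rng.choice(len(p), p=p)

logits = np.array([4.0, 2.0, 1.0, 0.5, 0.1])
print(sample(logits, temperature=0.8, top_k=3, top_p=0.9))
```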
Step 1: Initialization
- Input: prompt tokens $\mathbf{P} = [p_1, ..., p_k]$
- Process: initialize state $\mathbf{h}_0 = \mathcal{E}(\mathbf{P}) + \mathbf{PE}$
- Meaning: set initial state from prompt

Step 2: Forward Pass
- Process: $\mathbf{h}_t = \text{Transformer}(\mathbf{h}_{t-1})$
- Output: hidden state $\mathbf{h}_t$
- Meaning: process current sequence

Step 3: Logit Generation
- Process: $\mathbf{l}_t = \mathbf{h}_t \mathbf{W}_{out}$
- Output: logits $\mathbf{l}_t \in \mathbb{R}^V$
- Meaning: unnormalized scores for each token

Step 4: Probability Computation
- Process: $\mathbf{p}_t = \text{softmax}(\mathbf{l}_t / T)$
- Output: probability distribution $\mathbf{p}_t$
- Meaning: normalized probabilities with temperature

Step 5: Sampling
- Process: $x_{t+1} \sim \mathbf{p}_t$ (with optional top-k/top-p)
- Output: next token $x_{t+1}$
- Meaning: stochastically select next token

Step 6: State Update
- Process: append $x_{t+1}$ to the sequence, update $\mathbf{h}_{t+1}$
- Meaning: incorporate new token into state

Step 7: Termination Check
- Condition: $t < \text{max\_length}$ and $x_{t+1} \neq$ stop token
- If true: go to Step 2
- If false: return generated sequence
Control Impact: Inference is a recurrent control system (see the sketch after this list) where:
- State: Current hidden representation
- Control: Sampling strategy (temperature, top-k, top-p)
- Output: Generated token sequence
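A sketch of the full loop (Steps 1-7), assuming the `forward` and `sample` helpers from the earlier sketches are in scope; `max_length` and the stop token ID are illustrative placeholders.

```python
# Recurrent inference control loop (Steps 1-7). Reuses the `forward`
# and `sample` sketches defined earlier in this document.
import numpy as np

def generate(prompt_ids, max_length=20, temperature=0.8, stop_token=0, seed=0):
    tokens = list(prompt_ids)                       # Step 1: init from prompt
    for t in range(max_length - len(tokens)):
        probs = forward(np.array(tokens))           # Steps 2-4: compute p_t
        nxt = sample(np.log(probs[-1] + 1e-12),     # Step 5: sample x_{t+1}
                     temperature=temperature, seed=seed + t)
        if nxt == stop_token:                       # Step 7: termination
            break
        tokens.append(int(nxt))                     # Step 6: state update
    return tokens

print(generate([72, 101, 108, 108, 111]))           # continue "Hello"
```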
- Tokenizer: input encoder $\mathcal{T}$
- Seed: initialization control $\mathbf{s}$
- Embeddings: state projection $\mathcal{E}$
- Positional Encoding: temporal control $\mathbf{PE}$
- Attention: information routing $\mathcal{A}$
- FFN: nonlinear transformation $\mathcal{F}$
- Normalization: state regulation $\mathcal{N}$
- Optimizer: parameter control $\mathcal{O}$
- Scheduler: learning rate control $\mathcal{S}$
- Sampling: output control $\mathcal{P}$
Input Characters
↓ [Tokenizer Control]
Token IDs
↓ [Seed Control]
Initialized Parameters
↓ [Embedding Control]
Vector Representations
↓ [Positional Control]
Position-Aware Vectors
↓ [Attention Control]
Context-Aware Representations
↓ [FFN Control]
Transformed Features
↓ [Normalization Control]
Stabilized Activations
↓ [Output Control]
Probability Distributions
↓ [Sampling Control]
Generated Tokens
Each component acts as a control element in a unified dynamical system, working together to transform input text into meaningful language model outputs.
Block Diagram (a): Detailed Single Transformer Block
Input X
↓
┌─────────────┐
│ LayerNorm │
└──────┬──────┘
↓
┌─────────────┐
│ Multi-Head │
│ Attention │
└──────┬──────┘
↓
┌─────────────┐
│ Dropout │
└──────┬──────┘
↓
┌─────────────┐
│ + │ ←─── (Residual Connection from X)
└──────┬──────┘
↓
┌─────────────┐
│ LayerNorm │
└──────┬──────┘
↓
┌─────────────┐
│ Feed-Forward│
│ Network │
└──────┬──────┘
↓
┌─────────────┐
│ Dropout │
└──────┬──────┘
↓
┌─────────────┐
│ + │ ←─── (Residual Connection)
└──────┬──────┘
↓
Output X'
Mathematical Transfer Function:

$$\mathbf{X}_1 = \mathbf{X} + \text{Attn}(\text{LN}(\mathbf{X})), \quad \mathbf{X}' = \mathbf{X}_1 + \text{FFN}(\text{LN}(\mathbf{X}_1))$$
Block Diagram (b): Simplified Single Block
Input X
↓
┌─────────────────────────────────────┐
│ TransformerBlock │
│ G_block(X) = X + Attn(LN(X)) + │
│ FFN(LN(X + Attn(LN(X))))│
└──────────────┬──────────────────────┘
↓
Output X'
Transfer Function:

$$\mathbf{X}' = G_{block}(\mathbf{X}) = \mathbf{X} + \text{Attn}(\text{LN}(\mathbf{X})) + \text{FFN}(\text{LN}(\mathbf{X} + \text{Attn}(\text{LN}(\mathbf{X}))))$$
Block Diagram (c): Cascaded Transformer Blocks
Input Tokens T
↓
┌─────────────┐
│ Embedding │
│ G_emb │
└──────┬──────┘
↓
┌─────────────┐
│ Positional │
│ G_pos │
└──────┬──────┘
↓
┌─────────────┐
│ Block 1 │
│ G_block₁ │
└──────┬──────┘
↓
┌─────────────┐
│ Block 2 │
│ G_block₂ │
└──────┬──────┘
↓
┌─────────────┐
│ ... │
└──────┬──────┘
↓
┌─────────────┐
│ Block L │
│ G_block_L │
└──────┬──────┘
↓
┌─────────────┐
│ Final Norm │
│ G_norm │
└──────┬──────┘
↓
┌─────────────┐
│ Output Proj │
│ G_out │
└──────┬──────┘
↓
Output Logits
Overall Transfer Function:

$$G_{total} = G_{out} \circ G_{norm} \circ G_{block_L} \circ \cdots \circ G_{block_2} \circ G_{block_1} \circ G_{pos} \circ G_{emb}$$
Block Diagram (d): Training Control Loop
Input Data X
↓
┌─────────────┐
│ Model │
│ Forward │
│ F │
└──────┬──────┘
↓
┌─────────────┐
│ Output │
│ ŷ │
└──────┬──────┘
↓
┌─────────────┐
│ Loss │
│ L(ŷ, y) │
└──────┬──────┘
↓
┌─────────────┐
│ Gradient │
│ ∇θ │
└──────┬──────┘
↓
┌─────────────┐
│ Clipping │
│ Clip │
└──────┬──────┘
↓
┌─────────────┐
│ Optimizer │
│ AdamW │
└──────┬──────┘
↓
┌─────────────┐
│ Parameter │
│ Update │
└──────┬──────┘
↓
┌─────────────┐
│ - │ ←─── (Feedback to Model)
└─────────────┘
Closed-Loop Transfer Function: the parameter dynamics form a closed loop,

$$\theta_{t+1} = \theta_t - \eta_t \cdot \text{AdamW}\big(\text{Clip}(\nabla_\theta \mathcal{L}(\mathcal{F}(\mathbf{x}; \theta_t), \mathbf{y}))\big)$$
We'll trace through the complete system with the phrase "Hello World".
Input: "Hello World"
Process:
Characters: ['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']
Token IDs: [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]
Mathematical:

$$\mathbf{t} = \mathcal{T}(\text{"Hello World"}) = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]$$
Vector Representation:
- Dimension: $n = 11$ tokens
- Token IDs: $\mathbf{t} \in \mathbb{N}^{11}$
Embedding Matrix: $\mathbf{E} \in \mathbb{R}^{V \times 512}$

Lookup Operation: $\mathbf{X} = \mathbf{E}[\mathbf{t}] \in \mathbb{R}^{11 \times 512}$

Vector Visualization (first dimensions shown):
Token 'H' (ID=72): [0.10, -0.20, 0.30, ..., 0.05] (512-dim vector)
Token 'e' (ID=101): [-0.10, 0.30, -0.10, ..., 0.02] (512-dim vector)
Token 'l' (ID=108): [0.05, 0.15, -0.05, ..., 0.01] (512-dim vector)
...
Positional Encoding Matrix: $\mathbf{PE} \in \mathbb{R}^{11 \times 512}$

Computation: $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/512})$, $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/512})$

Addition: $\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE}$

Example (first token, first 3 dimensions): $\mathbf{X}_{pos}[0, :3] = [0.10 + 0, -0.20 + 1, 0.30 + 0] = [0.10, 0.80, 0.30]$
Query, Key, Value Projections:

$$\mathbf{Q} = \mathbf{X}_{pos}\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{X}_{pos}\mathbf{W}_K, \quad \mathbf{V} = \mathbf{X}_{pos}\mathbf{W}_V$$

For head 0, the projections use the first $d_k = d/H$ columns of each matrix.

Attention Score Computation:

$$S_{ij} = \frac{Q_i \cdot K_j}{\sqrt{d_k}}$$

Attention Weights:

$$A_{ij} = \frac{\exp(S_{ij})}{\sum_k \exp(S_{ik})}$$

(the resulting weights for this input appear in the attention matrix visualization below, e.g. $A_{00} = 0.35$)

Output Calculation:

$$O_i = \sum_j A_{ij} V_j$$
Input: $\mathbf{X}_{attn} \in \mathbb{R}^{11 \times 512}$

First Linear Transformation: $\mathbf{H} = \mathbf{X}_{attn}\mathbf{W}_1 + \mathbf{b}_1 \in \mathbb{R}^{11 \times 2048}$

GELU Activation: $\mathbf{H}' = \text{GELU}(\mathbf{H})$

Second Linear Transformation: $\mathbf{O} = \mathbf{H}'\mathbf{W}_2 + \mathbf{b}_2 \in \mathbb{R}^{11 \times 512}$
Input: $\mathbf{X}_{in} = \mathbf{X}_{pos} \in \mathbb{R}^{11 \times 512}$
Step 6.1: Layer Normalization

$$\mathbf{X}_{norm} = \text{LN}(\mathbf{X}_{in})$$

Step 6.2: Attention Output

$$\mathbf{X}_{attn} = \text{MultiHeadAttn}(\mathbf{X}_{norm})$$

Step 6.3: Residual Connection

$$\mathbf{X}_1 = \mathbf{X}_{in} + \mathbf{X}_{attn}$$

Step 6.4: Second Layer Norm + FFN

$$\mathbf{X}_{ffn} = \text{FFN}(\text{LN}(\mathbf{X}_1))$$

Step 6.5: Final Residual

$$\mathbf{X}_{out} = \mathbf{X}_1 + \mathbf{X}_{ffn}$$
After $L$ layers: $\mathbf{h}_L \in \mathbb{R}^{11 \times 512}$

Output Projection:

$$\mathbf{L} = \text{LN}(\mathbf{h}_L)\mathbf{W}_{out} \in \mathbb{R}^{11 \times V}$$

Softmax:

$$\mathbf{p} = \text{softmax}(\mathbf{L}) \in \mathbb{R}^{11 \times V}$$
Let's trace through the complete system with "Hello" step-by-step.

Query Generation: $\mathbf{Q}_0 = \mathbf{X}_{pos}\mathbf{W}_Q^{(0)}$, and similarly $\mathbf{K}_0$ and $\mathbf{V}_0$ for head 0.

Score Matrix (head 0): $\mathbf{S}_0 = \mathbf{Q}_0\mathbf{K}_0^T / \sqrt{d_k} \in \mathbb{R}^{5 \times 5}$

Attention Weights: $\mathbf{A}_0 = \text{softmax}(\mathbf{S}_0)$ (example values appear in the attention matrix visualization below)

Output (head 0): $\mathbf{O}_0 = \mathbf{A}_0\mathbf{V}_0$

Concatenate All Heads: $\mathbf{O} = [\mathbf{O}_0; \mathbf{O}_1; ...; \mathbf{O}_{H-1}]\mathbf{W}_O$

After processing through all $L$ layers, position 4 predicts the next token:

Probability Distribution: $\mathbf{p}_4 = \text{softmax}(\text{LN}(\mathbf{h}_4)\mathbf{W}_{out})$ (shown in the output distribution plot below)
2D Projection Example:
After embedding "Hello", tokens occupy positions in 512-dimensional space. Projected to 2D:
Token Positions (idealized 2D projection):
'l' (0.05, 0.15)
●
'e' (-0.10, 0.30)
●
Origin (0, 0)
●
'H' (0.10, -0.20)
●
'o' (-0.05, 0.20)
●
Distance in Embedding Space: in the 2D projection, e.g. $\|\mathbf{x}_{H} - \mathbf{x}_{e}\| = \sqrt{(0.10 - (-0.10))^2 + (-0.20 - 0.30)^2} \approx 0.54$.
Attention Matrix Visualization:
Position 0 1 2 3 4
┌─────┴─────┴─────┴─────┴──┐
Token 0 │ 0.35 0.15 0.22 0.20 0.28 │ 'H'
│ │
Token 1 │ 0.15 0.38 0.20 0.18 0.27 │ 'e'
│ │
Token 2 │ 0.23 0.18 0.32 0.30 0.26 │ 'l'
│ │
Token 3 │ 0.21 0.19 0.28 0.33 0.25 │ 'l'
│ │
Token 4 │ 0.27 0.22 0.26 0.25 0.36 │ 'o'
└──────────────────────────┘
Interpretation:
- Token 0 ('H') attends most to itself (0.35) and token 4 (0.28)
- Token 4 ('o') attends moderately to all positions
- Higher values indicate stronger attention
Output Distribution for Position 5 (next token after "Hello"):
Probability Distribution p[5, :]
Probability
│
0.3 │ ●
│
0.2 │ ● ●
│
0.1 │ ● ● ● ●
│
0.0 ├─┴───┴───┴───┴───┴───┴───┴───┴─── Token IDs
32 72 87 101 108 111 ... 127
␣ H W e l o
Meaning:
- Highest probability for space (32) ≈ 0.28
- Next: 'o' (111) ≈ 0.23
- Then: 'W' (87) ≈ 0.18
- Model predicts space or continuation
Following control system reduction techniques, we can simplify the transformer model step-by-step:
Diagram (a): Original Complex System
Input R (Tokens)
↓
┌─────────────┐
│ Embedding │
│ G_emb │
└──────┬──────┘
↓
┌─────────────┐
│ Positional │
│ Encoding │
│ G_pos │
└──────┬──────┘
↓
┌─────────────┐
│ + │ ←─── Feedback from Layer 2
└──────┬──────┘
↓
┌─────────────┐
│ Layer 1 │
│ G_block₁ │
└──────┬──────┘
↓
┌─────────────┐
│ + │ ←─── Feedback from Output
└──────┬──────┘
↓
┌─────────────┐
│ Layer 2 │
│ G_block₂ │
└──────┬──────┘
↓
┌─────────────┐
│ + │ ←─── Feedback H₁
└──────┬──────┘
↓
┌─────────────┐
│ Output Proj │
│ G_out │
└──────┬──────┘
↓
Output C (Logits)
Diagram (b): First Simplification (Combine Embedding and Positional)
Input R
↓
┌─────────────────────┐
│ G_emb_pos = │
│ G_pos ∘ G_emb │
└──────┬──────────────┘
↓
┌─────────────┐
│ + │
└──────┬──────┘
↓
┌─────────────┐
│ Layer 1 │
│ G_block₁ │
└──────┬──────┘
↓
┌─────────────┐
│ + │
└──────┬──────┘
↓
┌─────────────┐
│ Layer 2 │
│ G_block₂ │
└──────┬──────┘
↓
┌─────────────┐
│ + │ ←─── H₁
└──────┬──────┘
↓
┌─────────────┐
│ G_out │
└──────┬──────┘
↓
Output C
Diagram (c): Second Simplification (Combine Layers)
Input R
↓
┌─────────────────────┐
│ G_emb_pos │
└──────┬──────────────┘
↓
┌──────────────────────────────────┐
│ G_layers = G_block₂ ∘ G_block₁ │
│ Equivalent to: │
│ X + Δ₁(X) + Δ₂(X + Δ₁(X)) │
└──────┬───────────────────────────┘
↓
┌─────────────┐
│ + │ ←─── H₁
└──────┬──────┘
↓
┌─────────────┐
│ G_out │
└──────┬──────┘
↓
Output C
Diagram (d): Third Simplification (Combine with Output)
Input R
↓
┌──────────────────────────────┐
│ G_forward = │
│ G_out ∘ G_layers ∘ G_emb_pos │
└──────┬───────────────────────┘
↓
┌─────────────┐
│ + │ ←─── H₁ (Feedback)
└──────┬──────┘
↓
Output C
Diagram (e): Final Simplified Transfer Function
Input R
↓
┌────────────────────────────────────────────┐
│ Overall Transfer Function: │
│ │
│ C/R = G_forward / (1 + G_forward × H₁) │
│ │
│ Where: │
│ G_forward = G_out ∘ G_layers ∘ G_emb_pos │
│ │
└──────┬─────────────────────────────────────┘
↓
Output C
Mathematical Derivation:

Step 1: Combine embedding and positional encoding:

$$G_{emb\_pos} = G_{pos} \circ G_{emb}$$

Step 2: Combine transformer layers:

$$G_{layers} = G_{block_L} \circ \cdots \circ G_{block_2} \circ G_{block_1}$$

Step 3: Combine with output projection:

$$G_{forward} = G_{out} \circ G_{layers} \circ G_{emb\_pos}$$

Step 4: Apply feedback reduction:

$$\frac{C}{R} = \frac{G_{forward}}{1 + G_{forward} \cdot H_1}$$
Diagram (a): Detailed Attention
Input X
↓
┌─────────────┐
│ Q │ ←─── W_Q
│ K │ ←─── W_K
│ V │ ←─── W_V
└──────┬──────┘
↓
┌─────────────┐
│ Scores │
│ S = QK^T/√d │
└──────┬──────┘
↓
┌─────────────┐
│ Softmax │
│ A = σ(S) │
└──────┬──────┘
↓
┌─────────────┐
│ Output │
│ O = AV │
└──────┬──────┘
↓
┌─────────────┐
│ Out Proj │
│ W_O │
└──────┬──────┘
↓
Output X'
Diagram (b): Simplified Attention Transfer Function
Input X
↓
┌──────────────────────────────┐
│ G_attn(X) = │
│ W_O · softmax(QK^T/√d) · V │
│ │
│ Where: │
│ Q = XW_Q, K = XW_K, V = XW_V │
└──────┬───────────────────────┘
↓
Output X'
Mathematical Transfer Function:

$$G_{attn}(\mathbf{X}) = \text{softmax}\left(\frac{(\mathbf{X}\mathbf{W}_Q)(\mathbf{X}\mathbf{W}_K)^T}{\sqrt{d_k}}\right)(\mathbf{X}\mathbf{W}_V)\,\mathbf{W}_O$$
Input: "Hello World"
Stage 1: Tokenization
Stage 2: Embedding (showing first 4 dimensions)
Stage 3: Positional Encoding (first 4 dimensions)
Stage 4: Combined Input
Example Row 0 (token 'H'):
Stage 5: Attention (Head 0, showing attention from token 0 to all tokens)
Stage 6: Attention Output
Example (first dimension):
Stage 7: FFN Output
Stage 8: Final Output (after all layers)
Stage 9: Logits
Stage 10: Probabilities
Trajectory Plot:
512-Dimensional Embedding Space (2D Projection)
0.3 │ 'e' (pos 1)
│ ●
0.2 │ 'r' (pos 8)
│ ●
0.1 │ 'l' (pos 2,3,9) 'o' (pos 4,7)
│ ● ●
0.0 ├───────────────────────────────────────────
│ 'H' (pos 0)
-0.1 │ ●
│
-0.2 │
│
-0.3 │ 'W' (pos 6)
│ ●
└───────────────────────────────────────────
-0.3 -0.2 -0.1 0.0 0.1 0.2 0.3
Attention Weight Matrix Visualization:
Attention Weights A[i,j] for "Hello World"
j → 0 1 2 3 4 5 6 7 8 9 10
↓ ['H'] ['e'] ['l'] ['l'] ['o'] [' '] ['W'] ['o'] ['r'] ['l'] ['d']
i=0 ['H'] │ 0.35 0.15 0.22 0.20 0.28 0.14 0.19 0.26 0.17 0.21 0.23 │
i=1 ['e'] │ 0.15 0.38 0.20 0.18 0.27 0.16 0.18 0.25 0.19 0.22 0.20 │
i=2 ['l'] │ 0.23 0.18 0.32 0.30 0.26 0.17 0.21 0.24 0.25 0.31 0.23 │
i=3 ['l'] │ 0.21 0.19 0.28 0.33 0.25 0.18 0.20 0.23 0.24 0.30 0.22 │
i=4 ['o'] │ 0.27 0.22 0.26 0.25 0.36 0.19 0.23 0.29 0.24 0.27 0.25 │
i=5 [' '] │ 0.18 0.20 0.19 0.21 0.24 0.40 0.22 0.25 0.21 0.20 0.22 │
i=6 ['W'] │ 0.22 0.21 0.23 0.24 0.26 0.20 0.45 0.28 0.27 0.23 0.25 │
i=7 ['o'] │ 0.26 0.25 0.24 0.23 0.29 0.21 0.28 0.38 0.26 0.24 0.26 │
i=8 ['r'] │ 0.19 0.21 0.25 0.24 0.24 0.19 0.27 0.26 0.42 0.27 0.28 │
i=9 ['l'] │ 0.21 0.22 0.31 0.30 0.27 0.20 0.23 0.24 0.27 0.35 0.24 │
i=10['d'] │ 0.23 0.20 0.23 0.22 0.25 0.22 0.25 0.26 0.28 0.24 0.48 │
Color Coding:
█ = 0.48-0.50 (very high attention)
█ = 0.35-0.48 (high attention)
█ = 0.25-0.35 (medium attention)
█ = 0.15-0.25 (low attention)
█ = 0.00-0.15 (very low attention)
Logits and Probabilities:
Logits L[5, :] (predicting token after "Hello ")
Logit
Value │
6.0 │ ● (token 87 'W')
│
5.0 │ ● (token 111 'o')
│
4.0 │ ● (token 32 ' ') ● (token 114 'r')
│
3.0 │ ● ● ●
│
2.0 │ ● ● ● ● ● ● ● ● ● ● ●
│
1.0 │ ● ● ● ● ● ● ● ● ● ● ●
│
0.0 ├─┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴── Token IDs
32 72 87 101 108 111 114 ...
␣ H W e l o r
Probabilities p[5, :]
Probability
│
0.3│ ● ('W')
│
0.2│ ● (' ') ● ('o')
│
0.1│ ● ● ● ● ● ● ●
│
0.0├─┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴── Token IDs
32 72 87 101 108 111 114 ...
Hidden State Evolution Through Layers
Layer-by-Layer Transformation:
Hidden State Evolution for Token 'H' (position 0)
Dimension 0:
Layer 0: 0.10 (embedding + positional)
Layer 1: 0.42 (after attention + FFN)
Layer 2: 0.58 (after second layer)
Layer 3: 0.65 (after third layer)
... ...
Layer L: 0.72 (final hidden state)
Dimension 1:
Layer 0: 0.80 (embedding + positional)
Layer 1: 0.25 (after attention + FFN)
Layer 2: 0.18 (after second layer)
Layer 3: 0.22 (after third layer)
... ...
Layer L: 0.15 (final hidden state)
Visualization:
Hidden State Magnitude ||h[l]|| Over Layers
Magnitude
│
1.0│ ●
│ ●
0.8│ ●
│ ●
0.6│ ●
│ ●
0.4│ ●
│ ●
0.2│ ●
│ ●
0.0├───────────────────────── Layer
0 1 2 3 4 5 6
Text: "Hello World"
Complete Mathematical Flow:
- Tokenization: $\mathbf{t} = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]$
- Embedding: $\mathbf{X} = \mathbf{E}[\mathbf{t}]$
- Positional Encoding: $\mathbf{X}_{pos} = \mathbf{X} + \mathbf{PE}$
- Transformer Layers ($L = 6$): $\mathbf{h}_l = \text{Block}_l(\mathbf{h}_{l-1})$
- Output: $\mathbf{L} = \text{LN}(\mathbf{h}_6)\mathbf{W}_{out}$
- Probabilities: $\mathbf{p} = \text{softmax}(\mathbf{L})$
Final Prediction:
For position 5 (after "Hello "):
Most Likely: 'W' → Complete prediction: "Hello World"
This document provides a complete mathematical control system formulation with block diagrams, vector visualizations, numerical examples, and step-by-step calculations for every component of the SheepOp LLM.