LLM Pre-training and Post-training Paradigms x-ref
PEFT: Parameter-Efficient Fine-Tuning (📺) [24 Apr 2023]
- PEFT: Parameter-Efficient Fine-Tuning. PEFT is an approach that fine-tunes only a small number of parameters. [10 Feb 2023]
- Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning: [cnt] [28 Mar 2023]
- Category: Representative approach - Description - Pseudo code ref [22 Sep 2023]
- Adapters: Adapters - Additional layers. Inference can be slower.

  ```python
  def transformer_with_adapter(x):
      residual = x
      x = SelfAttention(x)
      x = FFN(x)  # adapter
      x = LN(x + residual)
      residual = x
      x = FFN(x)  # transformer FFN
      x = FFN(x)  # adapter
      x = LN(x + residual)
      return x
  ```
- Soft Prompts: Prompt-Tuning - Learnable text prompts. May not always yield the desired results.

  ```python
  def soft_prompted_model(input_ids):
      x = Embed(input_ids)
      soft_prompt_embedding = SoftPromptEmbed(task_based_soft_prompt)
      x = concat([soft_prompt_embedding, x], dim=seq)
      return model(x)
  ```
- Selective: BitFit - Update only the bias parameters. Fast but limited.

  ```python
  params = (p for n, p in model.named_parameters() if "bias" in n)
  optimizer = Optimizer(params)
  ```
- Reparametrization: LoRA - Low-rank decomposition. Efficient, but complex to implement.

  ```python
  def lora_linear(x):
      h = x @ W              # regular linear
      h += x @ W_A @ W_B     # low-rank update
      return scale * h
  ```
- LoRA: Low-Rank Adaptation of Large Language Models: [cnt]: LoRA is a PEFT technique that represents weight updates with two smaller matrices (called update matrices) obtained through low-rank decomposition. git [17 Jun 2021]
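  To make the idea concrete, a minimal PyTorch sketch of a LoRA-wrapped linear layer (rank, scaling, and initialization here are illustrative assumptions, not the reference implementation):

  ```python
  import torch
  import torch.nn as nn

  class LoRALinear(nn.Module):
      def __init__(self, in_features, out_features, r=8, alpha=16):
          super().__init__()
          # Frozen pre-trained weight W
          self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
          nn.init.kaiming_uniform_(self.weight)
          # Trainable update matrices A and B (delta_W = B @ A has rank <= r)
          self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
          self.lora_B = nn.Parameter(torch.zeros(out_features, r))
          self.scale = alpha / r

      def forward(self, x):
          h = x @ self.weight.T                                       # frozen base projection
          h = h + self.scale * (x @ self.lora_A.T @ self.lora_B.T)    # low-rank update
          return h
  ```

  At inference time the product B·A can be merged into W, so the adapter adds no extra latency.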
- LoRA learns less and forgets less: Compared to full fine-tuning, LoRA learns less new knowledge but retains more of the original knowledge. [15 May 2024]
- LoRA+: Improves LoRA’s performance and fine-tuning speed by setting different learning rates for the LoRA adapter matrices. [19 Feb 2024]
- LoTR: Tensor decomposition for gradient update. [2 Feb 2024]
- The Expressive Power of Low-Rank Adaptation: Theoretically analyzes the expressive power of LoRA. [26 Oct 2023]
- DoRA: Weight-Decomposed Low-Rank Adaptation. Decomposes pre-trained weight into two components, magnitude and direction, for fine-tuning. [14 Feb 2024]
- LoRA Family ref [11 Mar 2024]
  - LoRA: introduces low-rank matrices A and B that are trained, while the pre-trained weight matrix W is frozen.
  - LoRA+: suggests having a much higher learning rate for B than for A.
  - VeRA: does not train A and B, but initializes them randomly and trains new vectors d and b on top.
  - LoRA-FA: only trains matrix B.
  - LoRA-drop: uses the output of B*A to determine which layers are worth training at all.
  - AdaLoRA: adapts the ranks of A and B dynamically per layer, allowing a higher rank in layers expected to contribute more to the model's performance.
  - DoRA: splits the LoRA adapter into two components, magnitude and direction, and allows training them more independently.
  - Delta-LoRA: changes the weights of W by the gradient of A*B.
- 5 Techniques of LoRA ref: LoRA, LoRA-FA, VeRA, Delta-LoRA, LoRA+ [May 2024]
- Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation) [19 Nov 2023]: A highly practical guide to LoRA.
- QLoRA saves 33% memory but increases runtime by 39%, useful if GPU memory is a constraint.
- Optimizer choice for LLM finetuning isn’t crucial. Adam optimizer’s memory-intensity doesn’t significantly impact LLM’s peak memory.
- Apply LoRA across all layers for maximum performance.
- Adjusting the LoRA rank is essential.
- Multi-epoch training on static datasets may lead to overfitting and deteriorate results.
- Training language models to follow instructions with human feedback: [cnt] [4 Mar 2022]
- QLoRA: Efficient Finetuning of Quantized LLMs: [cnt]: Backpropagates gradients through a frozen, 4-bit quantized pre-trained language model into Low-Rank Adapters (LoRA). git [23 May 2023]
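  A hedged sketch of the usual Hugging Face peft/bitsandbytes recipe for QLoRA (model name, rank, and target modules are placeholders; check the library docs for current arguments):

  ```python
  import torch
  from transformers import AutoModelForCausalLM, BitsAndBytesConfig
  from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,                      # 4-bit NF4 quantized base weights
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.bfloat16,
  )
  model = AutoModelForCausalLM.from_pretrained(
      "meta-llama/Llama-2-7b-hf",             # placeholder base model
      quantization_config=bnb_config,
  )
  model = prepare_model_for_kbit_training(model)

  lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                           target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
  model = get_peft_model(model, lora_config)  # only the LoRA adapters are trainable
  model.print_trainable_parameters()
  ```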
- Fine-tuning a GPT - LoRA: Comprehensive guide for LoRA doc [20 Jun 2023]
- LIMA: Less Is More for Alignment: [cnt]: Fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. LIMA demonstrates remarkably strong performance, with responses either equivalent or strictly preferred to GPT-4 in 43% of cases. [18 May 2023]
- Efficient Streaming Language Models with Attention Sinks: [cnt] 1. StreamingLLM: an efficient framework that enables LLMs trained with a finite-length attention window to generalize to infinite sequence length without any fine-tuning. 2. It neither expands the LLMs' context window nor enhances their long-term memory. git [29 Sep 2023]
- Key-Value (KV) cache is an important component in the StreamingLLM framework.
- Window Attention: Only the most recent Key and Value states (KVs) are cached. This approach fails when the text length surpasses the cache size.
- Sliding Window Attention with Re-computation: Rebuilds the Key-Value (KV) states from the recent tokens for each new token. Evicts the oldest part of the cache.
- StreamingLLM: One of the techniques used is to add a placeholder token as a dedicated attention sink during pre-training. This attention sink attracts the model’s attention and helps it generalize to longer sequences (see the sketch below). Outperforms the sliding window with re-computation baseline by up to a remarkable 22.2× speedup.
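  A toy sketch of the cache policy described above (not the StreamingLLM implementation): keep a few initial "sink" tokens plus a sliding window of recent KV entries. The real framework additionally assigns positions relative to the cache rather than the original text.

  ```python
  import torch

  def evict_kv_cache(keys, values, n_sink=4, window=1024):
      """Keep the first n_sink tokens (attention sinks) plus the most recent `window` tokens.
      keys/values: (batch, seq_len, n_heads, head_dim) tensors of cached states."""
      seq_len = keys.shape[1]
      if seq_len <= n_sink + window:
          return keys, values
      keep = lambda t: torch.cat([t[:, :n_sink], t[:, -window:]], dim=1)
      return keep(keys), keep(values)
  ```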
- LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models: [cnt]: A combination of sparse local attention and LoRA git [21 Sep 2023]
  - Key takeaways from LongLoRA:
    - LoRA alone is not sufficient for long-context extension.
    - Although dense global attention is needed during inference, fine-tuning can be done with sparse local attention, shift short attention (S2-Attn).
    - S2-Attn can be implemented with only two lines of code in training.
- How to continue pretraining an LLM on new data: Continued pretraining can be as effective as retraining on combined datasets. [13 Mar 2024] Three training methods were compared:
- Regular pretraining: A model is initialized with random weights and pretrained on dataset D1.
- Continued pretraining: The pretrained model from 1) is further pretrained on dataset D2.
- Retraining on combined dataset: A model is initialized with random weights and trained on the combined datasets D1 and D2.
Key strategies for successful continued pretraining include (a schematic sketch follows this list):
- Re-warming: Increasing the learning rate at the start of continued pre-training.
- Re-decaying: Gradually reducing the learning rate afterwards.
- Data Mixing: Adding a small portion (e.g., 5%) of the original pretraining data (D1) to the new dataset (D2) to prevent catastrophic forgetting.
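  A schematic sketch of these strategies, assuming a PyTorch optimizer and pre-batched datasets (the 5% mixing ratio and the schedule shape follow the description above; all names are illustrative):

  ```python
  import math
  import random

  # Data mixing: blend ~5% of the original pretraining data (D1) into the new data (D2)
  def mixed_batches(d1_batches, d2_batches, d1_fraction=0.05):
      for d2_batch in d2_batches:
          yield random.choice(d1_batches) if random.random() < d1_fraction else d2_batch

  # Re-warming + re-decaying: linear warmup back to the peak LR, then cosine decay
  def lr_lambda(step, warmup_steps=1_000, total_steps=100_000, min_ratio=0.1):
      if step < warmup_steps:
          return step / warmup_steps                                    # re-warm
      progress = (step - warmup_steps) / (total_steps - warmup_steps)
      return min_ratio + 0.5 * (1 - min_ratio) * (1 + math.cos(math.pi * progress))  # re-decay

  # scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)  # step once per training step
  ```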
- A key difference between Llama 1: [cnt] [27 Feb 2023] and Llama 2: [cnt] [18 Jul 2023] is the architectural change in the attention layer: Llama 2 takes advantage of the Grouped-Query Attention (GQA) mechanism to improve efficiency. > OSS LLM x-ref / Llama 3 > Build an LLM from scratch x-ref
- Multi-query attention (MQA): [cnt] [22 May 2023]
- Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm 📺 / git [03 Sep 2023]
- KV Cache, Grouped Query Attention, Rotary PE
- Rotary PE

  ```python
  def apply_rotary_embeddings(x: torch.Tensor, freqs_complex: torch.Tensor, device: str):
      # Separate the last dimension into pairs of two values, representing the real and imaginary parts of a complex number
      # Two consecutive values become a single complex number
      # (B, Seq_Len, H, Head_Dim) -> (B, Seq_Len, H, Head_Dim/2)
      x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
      # Reshape freqs_complex to match x_complex by adding the batch and head dimensions
      # (Seq_Len, Head_Dim/2) --> (1, Seq_Len, 1, Head_Dim/2)
      freqs_complex = freqs_complex.unsqueeze(0).unsqueeze(2)
      # Multiply each complex number in x_complex by the corresponding complex number in freqs_complex,
      # which rotates the complex number as shown in Figure 1 of the paper
      # (B, Seq_Len, H, Head_Dim/2) * (1, Seq_Len, 1, Head_Dim/2) = (B, Seq_Len, H, Head_Dim/2)
      x_rotated = x_complex * freqs_complex
      # Convert the complex numbers back to real numbers
      # (B, Seq_Len, H, Head_Dim/2) -> (B, Seq_Len, H, Head_Dim/2, 2)
      x_out = torch.view_as_real(x_rotated)
      # (B, Seq_Len, H, Head_Dim/2, 2) -> (B, Seq_Len, H, Head_Dim)
      x_out = x_out.reshape(*x.shape)
      return x_out.type_as(x).to(device)
  ```
- KV Cache, Grouped Query Attention

  ```python
  # Replace the entry in the cache
  self.cache_k[:batch_size, start_pos : start_pos + seq_len] = xk
  self.cache_v[:batch_size, start_pos : start_pos + seq_len] = xv
  # (B, Seq_Len_KV, H_KV, Head_Dim)
  keys = self.cache_k[:batch_size, : start_pos + seq_len]
  # (B, Seq_Len_KV, H_KV, Head_Dim)
  values = self.cache_v[:batch_size, : start_pos + seq_len]
  # Since every group of Q shares the same K and V heads, just repeat the K and V heads for every Q in the same group.
  # (B, Seq_Len_KV, H_KV, Head_Dim) --> (B, Seq_Len_KV, H_Q, Head_Dim)
  keys = repeat_kv(keys, self.n_rep)
  # (B, Seq_Len_KV, H_KV, Head_Dim) --> (B, Seq_Len_KV, H_Q, Head_Dim)
  values = repeat_kv(values, self.n_rep)
  ```
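  The snippet above relies on a `repeat_kv` helper; a minimal version, mirroring the expand-and-reshape trick used in the reference Llama code (exact signature assumed), might look like:

  ```python
  import torch

  def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
      # (B, Seq_Len_KV, H_KV, Head_Dim) -> (B, Seq_Len_KV, H_KV * n_rep, Head_Dim)
      batch_size, seq_len, n_kv_heads, head_dim = x.shape
      if n_rep == 1:
          return x
      return (
          x[:, :, :, None, :]                                          # add a repeat axis
          .expand(batch_size, seq_len, n_kv_heads, n_rep, head_dim)    # broadcast the K/V heads
          .reshape(batch_size, seq_len, n_kv_heads * n_rep, head_dim)  # fold into the query head count
      )
  ```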
- Comprehensive Guide for LLaMA with RLHF: StackLLaMA: A hands-on guide to train LLaMA with RLHF [5 Apr 2023]
- Official Llama Recipes incl. Finetuning: git
- Llama 2 ONNX git [Jul 2023]: ONNX, or Open Neural Network Exchange, is an open standard for machine learning interoperability. It allows AI developers to use models across various frameworks, tools, runtimes, and compilers.
- RLHF: A machine learning technique that trains a "reward model" directly from human feedback and uses the model as a reward function to optimize an agent's policy using reinforcement learning.
- InstructGPT: Training language models to follow instructions with human feedback: [cnt]: A model trained by OpenAI to follow instructions using human feedback. [4 Mar 2022] cite
- Libraries: TRL, trlX, Argilla
  - TRL: covers the pipeline from the Supervised Fine-Tuning (SFT) step and Reward Modeling (RM) step to the Proximal Policy Optimization (PPO) step.
- The three steps in the process: 1. pre-training on large web-scale data, 2. supervised fine-tuning on instruction data (instruction tuning), and 3. RLHF. ref [ⓒ 2023]
- Supervised Fine-Tuning (SFT): Fine-tuning a pre-trained model on a specific task or domain using labeled data. This can cause more significant shifts in the model’s behavior compared to RLHF.
- Reinforcement Learning from Human Feedback (RLHF): A process of pretraining and retraining a language model using human feedback to develop a scoring algorithm that can be reapplied at scale for future training and refinement. As the algorithm is refined to match the human-provided grading, direct human feedback is no longer needed, and the language model continues learning and improving using algorithmic grading alone. [18 Sep 2019] ref [9 Dec 2022]
- Proximal Policy Optimization (PPO): A reinforcement learning method using first-order optimization. It modifies the objective function to penalize large policy changes, specifically those that move the probability ratio away from 1, aiming for TRPO (Trust Region Policy Optimization)-level performance without TRPO's complexity, which requires second-order optimization.
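  For reference, the clipped surrogate objective that PPO maximizes (notation as in the original paper):

  ```latex
  L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right],
  \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
  ```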
- Direct Preference Optimization (DPO): [cnt]: 1. RLHF can be complex because it requires fitting a reward model and performing significant hyperparameter tuning. In contrast, DPO directly solves a classification problem on human preference data in a single stage of policy training, making it more stable, efficient, and computationally lighter than RLHF. 2. Your Language Model Is Secretly a Reward Model. [29 May 2023]
- Direct Preference Optimization (DPO) uses two models: a trained model (or policy model) and a reference model (a copy of the trained model). The goal is for the trained model to assign higher probabilities to preferred answers and lower probabilities to rejected answers than the reference model does. ref: RLHF vs DPO [2 Jan 2024] / ref [1 Jul 2023]
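  A minimal sketch of the DPO loss on a batch of preference pairs (the log-probabilities are assumed to already be summed over the answer tokens; `beta` is the usual temperature hyperparameter):

  ```python
  import torch.nn.functional as F

  def dpo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps, beta=0.1):
      # Implicit rewards: log-prob ratios of the policy model against the frozen reference model
      chosen_rewards = policy_chosen_logps - ref_chosen_logps
      rejected_rewards = policy_rejected_logps - ref_rejected_logps
      # Logistic loss that pushes up the margin between chosen and rejected completions
      return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()
  ```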
- ORPO (Odds Ratio Preference Optimization): Monolithic Preference Optimization without Reference Model. A new method that combines supervised fine-tuning and preference alignment into one process. git [12 Mar 2024] / Fine-tune Llama 3 with ORPO [Apr 2024]
- Reinforcement Learning from AI Feedback (RLAIF): [cnt]: Uses AI feedback to generate instructions for the model. TL;DR: CoT (Chain-of-Thought) improved; few-shot did not. Only explores the task of summarization. After training on a few thousand examples, performance is close to training on the full dataset. RLAIF vs RLHF: in many cases, the two policies produced similar summaries. [1 Sep 2023]
- OpenAI Spinning Up in Deep RL!: An educational resource to help anyone learn deep reinforcement learning. git [Nov 2018]
- A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More [23 Jul 2024]
- Preference optimization techniques: ref [13 Aug 2024]
  - RLHF (Reinforcement Learning from Human Feedback): optimizes a reward policy via an objective function.
  - DPO (Direct Preference Optimization): removes the need for a reward model. Minimizes a loss directly; no reward policy.
  - IPO (Identity Preference Optimization): a change in the objective that is simpler and less prone to overfitting.
  - KTO (Kahneman-Tversky Optimization): scales to more data by replacing the pairs of accepted and rejected generations with a binary label.
  - ORPO (Odds Ratio Preference Optimization): combines instruction tuning and preference optimization into one training process, which is cheaper and faster.
  - TPO (Thought Preference Optimization): generates thoughts before the final response, which are then evaluated by a judge model for preference using Direct Preference Optimization (DPO). [14 Oct 2024]
- SFT vs RL: SFT Memorizes, RL Generalizes. RL enhances generalization across text and vision, while SFT tends to memorize and overfit. git [28 Jan 2025]
- A Survey on Model Compression for Large Language Models ref [15 Aug 2023]
- Quantization-aware training (QAT): The model is further trained with quantization in mind after being initially trained in floating-point precision.
- Post-training quantization (PTQ): The model is quantized after it has been trained, without further optimization during the quantization process.
| Method | Pros | Cons |
| --- | --- | --- |
| Post-training quantization | Easy to use, no need to retrain the model | May result in accuracy loss |
| Quantization-aware training | Can achieve higher accuracy than post-training quantization | Requires retraining the model, can be more complex to implement |

- bitsandbytes: 8-bit optimizers git [Oct 2021]
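As a concrete example of the post-training side, a minimal PyTorch dynamic-quantization sketch (the `model` object and layer selection are assumptions; QAT would instead insert fake-quantization observers before further training):

```python
import torch

# Post-training dynamic quantization: weights stored as int8, activations quantized on the fly
quantized_model = torch.quantization.quantize_dynamic(
    model,                 # an assumed float32 nn.Module
    {torch.nn.Linear},     # layer types to quantize
    dtype=torch.qint8,
)
```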
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. [27 Feb 2024]
- Pruning: The process of removing some of the neurons or layers from a neural network. This can be done by identifying and eliminating neurons or layers that have little or no impact on the network's output.
- Sparsification: A technique used to reduce the size of large language models by removing redundant parameters.
- Wanda Pruning: [cnt]: A Simple and Effective Pruning Approach for Large Language Models [20 Jun 2023] ref
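As a simple baseline for the pruning idea (not the Wanda method itself), PyTorch ships magnitude-based pruning utilities; the sketch below zeroes out the 30% smallest-magnitude weights in every linear layer of an assumed `model`:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

for module in model.modules():            # `model` is an assumed nn.Module
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)  # zero the 30% smallest |w|
        prune.remove(module, "weight")    # make the pruning permanent (bake the mask into the weight)
```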
- phi-series: x-ref: Textbooks Are All You Need.
- Orca 2: [cnt]: Orca learns from rich signals from GPT-4, including explanation traces, step-by-step thought processes, and other complex instructions, guided by teacher assistance from ChatGPT. ref [18 Nov 2023]
- Distilled Supervised Fine-Tuning (dSFT)
- Zephyr 7B: [cnt] Zephyr-7B-β is the second model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). ref [25 Oct 2023]
- Mistral 7B: [cnt]: Outperforms Llama 2 13B on all benchmarks. Uses Grouped-query attention (GQA) for faster inference. Uses Sliding Window Attention (SWA) to handle longer sequences at smaller cost. ref [10 Oct 2023]
- Transformers cache the key-value tensors of context tokens in GPU memory to facilitate fast generation of the next token. However, these caches occupy significant GPU memory. The unpredictable nature of cache size, due to the variability in the length of each request, exacerbates the issue, resulting in significant memory fragmentation in the absence of a suitable memory management mechanism.
- To alleviate this issue, PagedAttention was proposed to store the KV cache in non-contiguous memory spaces. It partitions the KV cache of each sequence into multiple blocks, with each block containing the keys and values for a fixed number of tokens.
- PagedAttention: vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention, 24x Faster LLM Inference doc. ref: vllm [12 Sep 2023]
- PagedAttention for a prompt “the cat is sleeping in the kitchen and the dog is”. Key-Value pairs of tensors for attention computation are stored in virtual contiguous blocks mapped to non-contiguous blocks in the GPU memory.
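  A toy sketch of the block-table idea (not vLLM's implementation): each sequence maps its logical KV positions to fixed-size physical blocks that need not be contiguous in GPU memory.

  ```python
  BLOCK_SIZE = 16  # tokens per physical KV block

  class BlockTable:
      """Maps (sequence, token position) to a physical block and an offset within it."""
      def __init__(self, num_physical_blocks):
          self.free = list(range(num_physical_blocks))  # pool of free physical block ids
          self.table = {}                               # seq_id -> list of physical block ids

      def slot_for(self, seq_id, pos):
          # Allocate a new physical block only when a sequence crosses a block boundary,
          # so memory is reserved block-by-block instead of as one contiguous region.
          if pos % BLOCK_SIZE == 0:
              self.table.setdefault(seq_id, []).append(self.free.pop())
          block = self.table[seq_id][pos // BLOCK_SIZE]
          return block, pos % BLOCK_SIZE                # where to write this token's K/V
  ```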
- TokenAttention: An attention mechanism that manages key and value caching at the token level. git [Jul 2023]
- Flash Attention: [cnt] [27 May 2022]
- In a GPU, A thread is the smallest execution unit, and a group of threads forms a block.
- A block executes the same kernel (function, to simplify), with threads sharing fast SRAM memory.
- All blocks can access the shared global HBM memory.
- First, the query (Q) and key (K) product is computed in threads and returned to HBM. Then, it's redistributed for softmax and returned to HBM.
- Flash attention reduces these movements by caching results in SRAM.
- Tiling splits the attention computation into memory-efficient blocks, while recomputation saves memory by recalculating intermediates during backprop. 📺
- FlashAttention-2: [cnt] [17 Jul 2023]: A method that reorders the attention computation and leverages classical techniques (tiling, recomputation). Instead of storing each intermediate result, it uses kernel fusion and runs every operation in a single kernel to avoid memory read/write overhead. git -> Compared to a standard attention implementation in PyTorch, FlashAttention-2 can be up to 9x faster.
- FlashAttention-3 [11 Jul 2024]
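A minimal way to exercise a FlashAttention-style fused kernel from PyTorch (2.0+) is `scaled_dot_product_attention`; whether the flash kernel is actually selected depends on dtype, hardware, and head size, so treat this as a sketch:

```python
import torch
import torch.nn.functional as F

# (B, heads, seq, head_dim) half-precision tensors on GPU, the typical flash-attention regime
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Fused attention: softmax(QK^T / sqrt(d)) V is computed tile-by-tile in SRAM,
# without materializing the full (seq x seq) attention matrix in HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```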
- CPU vs GPU vs TPU: Threads are grouped into thread blocks. Each thread block has access to fast shared memory (SRAM). All thread blocks also share a large global memory built from high-bandwidth memory (HBM). HBM bandwidth: 1.5-2.0 TB/s vs SRAM bandwidth: ~19 TB/s, roughly 10x HBM. [27 May 2024]
- LLM patterns: 🏆From data to user, from defensive to offensive doc
- What We’ve Learned From A Year of Building with LLMs:💡A practical guide to building successful LLM products, covering the tactical, operational, and strategic. [8 June 2024]
- Large Transformer Model Inference Optimization: Besides the increasing size of SoTA models, there are two main factors contributing to the inference challenge ... [10 Jan 2023]
- Mixture of experts models: Mixtral 8x7B: Sparse mixture of experts models (SMoE) magnet [Dec 2023]
- Huggingface Mixture of Experts Explained: Mixture of Experts, or MoEs for short [Dec 2023]
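  A minimal sketch of the sparse top-2 routing an SMoE layer like Mixtral's uses (the router design and dimensions here are simplified assumptions, not the Mixtral implementation):

  ```python
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class SparseMoE(nn.Module):
      def __init__(self, dim, hidden, n_experts=8, top_k=2):
          super().__init__()
          self.router = nn.Linear(dim, n_experts)          # gating network
          self.experts = nn.ModuleList(
              nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
              for _ in range(n_experts)
          )
          self.top_k = top_k

      def forward(self, x):                                        # x: (tokens, dim)
          scores, idx = self.router(x).topk(self.top_k, dim=-1)    # pick top-k experts per token
          weights = F.softmax(scores, dim=-1)                      # renormalize their gate scores
          out = torch.zeros_like(x)
          for k in range(self.top_k):
              for e, expert in enumerate(self.experts):            # only selected experts run
                  mask = idx[:, k] == e
                  if mask.any():
                      out[mask] += weights[mask, k, None] * expert(x[mask])
          return out
  ```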
- Simplifying Transformer Blocks: A simplified Transformer block that removes several components, including skip connections, projection/value matrices, sequential sub-blocks and normalisation layers, without loss of training speed. [3 Nov 2023]
- Model merging: A technique that combines two or more large language models (LLMs) into a single model, using methods such as SLERP, TIES, DARE, and passthrough. [Jan 2024] git: mergekit
| Method | Pros | Cons |
| --- | --- | --- |
| SLERP | Preserves geometric properties, popular method | Can only merge two models, may decrease magnitude |
| TIES | Can merge multiple models, eliminates redundant parameters | Requires a base model, may discard useful parameters |
| DARE | Reduces overfitting, keeps expectations unchanged | May introduce noise, may not work well with large differences |

- Mamba: Linear-Time Sequence Modeling with Selective State Spaces [1 Dec 2023] git: 1. Structured State Space (S4) - a class of sequence models encompassing traits from RNNs, CNNs, and classical state space models. 2. Hardware-aware (optimized for GPU). 3. Integrates selective SSMs and eliminates attention and MLP blocks. ref / A Visual Guide to Mamba and State Space Models ref [19 Feb 2024]
- Mamba-2: 2-8X faster [31 May 2024]
- Sakana.ai: Evolutionary Optimization of Model Merging Recipes: A Method to Combine 500,000 OSS Models. git [19 Mar 2024]
- Mixture-of-Depths: All tokens should not require the same effort to compute. The idea is to make token passage through a block optional. Each block selects the top-k tokens for processing, and the rest skip it. ref [2 Apr 2024]
- Kolmogorov-Arnold Networks (KANs): KANs use activation functions on connections instead of nodes like Multi-Layer Perceptrons (MLPs) do. Each weight in KANs is replaced by a learnable 1D spline function. KANs’ nodes simply sum incoming signals without applying any non-linearities. git [30 Apr 2024] / ref: A Beginner-friendly Introduction to Kolmogorov Arnold Networks (KAN) [19 May 2024]
- Better & Faster Large Language Models via Multi-token Prediction: Suggests training language models to predict multiple future tokens at once. [30 Apr 2024]
- Lamini Memory Tuning: Mixture of Millions of Memory Experts (MoME). 95% LLM Accuracy, 10x Fewer Hallucinations. ref [Jun 2024]
- Scaling Synthetic Data Creation with 1,000,000,000 Personas: A persona-driven data synthesis methodology using Text-to-Persona and Persona-to-Persona. [28 Jun 2024]
- RouteLLM: a framework for serving and evaluating LLM routers. [Jun 2024]
- KAN or MLP: A Fairer Comparison: Across machine learning, computer vision, audio processing, and natural language processing tasks, MLP generally outperforms KAN; the exception is symbolic formula representation. [23 Jul 2024]
- Differential Transformer: Amplifies attention to the relevant context while minimizing noise using two separate softmax attention mechanisms. [7 Oct 2024]
- Large Concept Models: Operates at a high-level sentence (concept) level rather than on tokens, using SONAR as the sentence embedding space. [11 Dec 2024]