Author: Pritam Trilochan Gouda
Affiliation: CSA, IISc
Date: April 27, 2024
This repository contains the solutions and discussions for Assignment #2 of the course E0270: Machine Learning. The assignment covers three main topics: Text Generation with GPT-2, Low Rank Adaptation (LoRA), and Knowledge Distillation.
## Table of Contents

- Introduction
- Problem 0: Text Generation with GPT-2
- Problem 1: Low Rank Adaptation (LoRA)
- Problem 2: Knowledge Distillation
- Files Included
- Plots
- Usage
- Conclusion
- References
## Introduction

This assignment explores the application and analysis of advanced machine learning techniques, focusing on text generation, efficient model adaptation, and knowledge transfer between models.
## Problem 0: Text Generation with GPT-2

GPT-2's text-generation capabilities were explored by supplying a prompt and analyzing the model's ability to produce a coherent and creative continuation. The model generated a narrative based on the prompt, demonstrating its grasp of context and creative storytelling.
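For reference, a minimal generation sketch using the Hugging Face `transformers` library is shown below. This is an illustration only; the repository's own `model.py`/`run.py` define GPT-2 directly and may load weights and sample differently.

```python
# Minimal GPT-2 text-generation sketch (assumes the Hugging Face transformers library).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Once upon a time"  # placeholder prompt; the assignment's actual prompt may differ
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a continuation; top-k and temperature trade off creativity vs. coherence.
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_k=50,
    temperature=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```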

## Problem 1: Low Rank Adaptation (LoRA)

LoRA is a Parameter-Efficient Fine-Tuning (PEFT) technique: the pretrained weights are kept frozen and only small low-rank update matrices are trained, reducing computational overhead while maintaining performance. LoRA was integrated into the GPT-2 model and evaluated on the CoLA dataset; the fine-tuned model achieved a good balance between computational efficiency and accuracy.
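To illustrate the idea (this is not the repository's exact implementation), a LoRA-wrapped linear layer in PyTorch might look like the following; the class name `LoRALinear`, the scaling factor, and the initialization details are assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Only `A` and `B` (roughly `r * (in_features + out_features)` values per adapted layer) are trainable, which is what drives the parameter reductions reported below.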
- GPT-2 Variant Used: Medium
  - Total Number of Parameters: 356.40M
  - Number of Trainable Parameters: 1.68M
  - Reduction in Parameters: 99.53%
  - Maximum Accuracy on CoLA Validation Dataset: 82.73%
- GPT-2 Variant Used: Base
  - Total Number of Parameters: 125.03M
  - Number of Trainable Parameters: 0.63M
  - Reduction in Parameters: 99.50%
The GPT-2 model was fine-tuned with LoRA using the following hyperparameters (a brief training-setup sketch follows the list):
- Learning Rate: 1e-3
- Number of Epochs: 10
- Batch Size: 128
- Optimizer: Adam
- Loss Function: Cross-Entropy Loss
- LoRA Rank: 4
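The sketch below shows how such a setup typically translates to code. The `backbone`/`model` objects are toy stand-ins for the LoRA-adapted GPT-2 classifier: only the parameters left trainable (in the assignment, the LoRA matrices and, presumably, the classification head) are handed to the Adam optimizer, which is how the ~99.5% reduction in trainable parameters is realized.

```python
import torch
import torch.nn as nn

# Toy stand-in for the LoRA-adapted GPT-2 classifier: a frozen "backbone"
# plus a small trainable head. In the assignment, the trainable part is
# instead the rank-4 LoRA matrices injected into the attention projections.
backbone = nn.Linear(768, 768)
for p in backbone.parameters():
    p.requires_grad = False                              # frozen pretrained weights

model = nn.Sequential(backbone, nn.ReLU(), nn.Linear(768, 2))

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable: {trainable}/{total} ({100 * trainable / total:.2f}%)")

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3  # lr as listed above
)
loss_fn = nn.CrossEntropyLoss()  # CoLA labels (0/1) treated as class indices
```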
## Problem 2: Knowledge Distillation

Knowledge distillation aims to transfer knowledge from a larger teacher model to a smaller student model, enabling efficient deployment in resource-constrained environments. An RNN-based student was trained via knowledge distillation from the fine-tuned GPT-2 model and achieved comparable validation performance, confirming the effectiveness of the distillation process.

To distill knowledge from the fine-tuned GPT-2 model (teacher) to the DistilRNN model (student) on the CoLA classification dataset, the distillation loss combines a soft-target loss against the teacher's outputs with a true-label loss.
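A standard formulation of such a loss is sketched below, assuming PyTorch; the temperature `T` and mixing weight `alpha` are illustrative defaults and may differ from the values actually used in the report.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of soft-target loss (teacher) and hard-label cross-entropy."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    # teacher_logits are assumed to be computed under torch.no_grad().
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # True-label loss: ordinary cross-entropy on the CoLA labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```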
The DistilRNN student architecture (sketched below):
- Embedding layer mapping input tokens to dense vectors of size 768.
- Two-layer RNN with hidden size 768.
- ReLU activation function.
- Linear layer projecting the output to a 2-dimensional space for binary classification.

Training hyperparameters:
- Batch size: 128
- Learning rate: 1e-3
- Number of epochs: 5

Results:
- Maximum Accuracy on CoLA Validation Dataset (with KD): 71%
- Accuracy without KD: 68%
- Accuracy Improvement with KD: 3 percentage points
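A PyTorch sketch of the DistilRNN student described above is given below; the pooling choice (last time step), the ReLU placement, and the `vocab_size` handling are assumptions rather than the repository's exact `model.py` definition.

```python
import torch
import torch.nn as nn

class DistilRNN(nn.Module):
    """Student model: embedding -> 2-layer RNN (hidden 768) -> ReLU -> linear head."""
    def __init__(self, vocab_size: int, embed_dim: int = 768,
                 hidden_dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(input_ids)          # (batch, seq_len, embed_dim)
        out, _ = self.rnn(x)                   # (batch, seq_len, hidden_dim)
        pooled = self.relu(out[:, -1, :])      # last time step as sentence summary
        return self.fc(pooled)                 # (batch, num_classes) logits
```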
## Files Included

Text Generation, LoRA, and Knowledge Distillation
├── plots
│ ├── Distillation_accuracy.png
│ ├── Distillation_loss.png
│ ├── LoRA_accuracy.png
│ ├── LoRA_loss.png
│ ├── rnn_accuracy.png
│ └── rnn_loss.png
├── tuning
│ ├── tuning.txt
│ ├── tuning2.txt
│ ├── tuning3.txt
│ └── tuning4.txt
├── Report_23754.pdf
├── model.py
├── run.py
├── train_utils.py
└── utils.py
- `model.py`: Full definition of a GPT language model, all in this single file.
- `Report_23754.pdf`: Detailed project report with explanations.
## Plots

The loss and accuracy curves are available in the `plots/` directory: `LoRA_loss.png`, `LoRA_accuracy.png`, `Distillation_loss.png`, `Distillation_accuracy.png`, `rnn_loss.png`, and `rnn_accuracy.png`.
## Usage

- Clone the repository:

      git clone https://github.com/yourusername/ml-techniques.git
      cd ml-techniques
## Conclusion

This assignment provides a comprehensive study of advanced ML techniques, demonstrating their practical applications and effectiveness in different scenarios.
## References

- Practical Tips for Finetuning LLMs Using LoRA
- Knowledge Distillation: Principles, Algorithms, Applications