This repository demonstrates how to:
- Measure baseline performance of a pre-trained language model.
- Implement knowledge distillation to train a smaller student model.
- Apply magnitude-based pruning and post-training quantization.
- Compare each optimized version to the original model in terms of size, latency, and accuracy.
Install the dependencies:

```bash
pip install torch transformers datasets scikit-learn sentencepiece
```

## Quick Start

Gather or download the dataset (e.g., SST-2 from Hugging Face).
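If you prefer to pull SST-2 programmatically, the `datasets` library can fetch it directly; a minimal sketch (the column names follow the GLUE SST-2 schema):

```python
from datasets import load_dataset

# SST-2 from the GLUE benchmark; provides train / validation / test splits.
sst2 = load_dataset("glue", "sst2")
print(sst2["train"][0])  # e.g. {'sentence': '...', 'label': 0 or 1, 'idx': ...}
```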
```bash
python baseline.py
```

This trains (or loads) the base model, measures its size, latency, and accuracy, and saves the metrics to `baseline_metrics.json`.
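The measurement logic lives in `baseline.py`; as an illustration of how size and latency can be measured, here is a rough sketch (these helper functions are illustrative, not the script's actual API):

```python
import os
import time

import torch

def model_size_mb(model, tmp_path="tmp_model.pt"):
    # Serialize the state dict to disk and read back the file size.
    torch.save(model.state_dict(), tmp_path)
    size_mb = os.path.getsize(tmp_path) / 1e6
    os.remove(tmp_path)
    return size_mb

def mean_latency_ms(model, batch, n_runs=20):
    # Average forward-pass latency over several runs, without gradients.
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(n_runs):
            model(**batch)
    return (time.perf_counter() - start) / n_runs * 1000.0
```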
If you encounter an error like:

```
2025-03-08 13:19:01,673 [ERROR] Error in baseline script: Invalid pattern: '**' can only be an entire path component
```

update the dataset libraries:

```bash
pip install --upgrade datasets huggingface_hub
```

### Knowledge Distillation
```bash
python knowledge_distillation.py
```

- Loads the teacher (baseline) model.
- Trains a smaller student with the chosen temperature and alpha (see the loss sketch below).
- Saves the student metrics to `data/distilled_model/distilled_metrics.json`.
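Conceptually, the distillation objective blends a temperature-softened KL term against the teacher with ordinary cross-entropy on the hard labels; a minimal sketch (the function itself is illustrative, only `temperature` and `alpha` mirror the script's hyperparameters):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```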
### Pruning & Quantization
```bash
python pruning_quantization.py
```

- Loads the original (baseline) model.
- Optionally applies magnitude-based pruning, either gradual or one-shot (sketched below).
- Applies dynamic int8 quantization to reduce model size on disk and memory usage.
- Saves the final metrics to `data/pruned_quantized_model/pruning_quant_metrics.json`.
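Both steps correspond to standard PyTorch utilities; a minimal sketch of one-shot magnitude (L1) pruning followed by dynamic int8 quantization (the `bert-base-uncased` checkpoint and the 30% pruning amount are placeholders, not necessarily what the script uses):

```python
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForSequenceClassification

# Placeholder checkpoint; the repo's scripts load the trained baseline instead.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Magnitude-based (L1) pruning: zero out 30% of the smallest weights in each Linear layer.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

# Dynamic int8 quantization of Linear layers: weights stored as int8,
# activations quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```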
Finally, run the comparison script:

```bash
python evaluate_compare.py
```

It outputs and compares:

- Baseline (accuracy, size)
- Distilled (accuracy, size)
- Pruned / Pruned + Quantized (accuracy, size)

…along with the relative percentage improvements (size reduction, accuracy retention).
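The comparison itself amounts to reading the metric JSON files and computing relative deltas; a minimal sketch assuming each file exposes `size_mb` and `accuracy` keys (the key names are an assumption, not necessarily the scripts' actual schema):

```python
import json

def load_metrics(path):
    with open(path) as f:
        return json.load(f)

baseline = load_metrics("baseline_metrics.json")
distilled = load_metrics("data/distilled_model/distilled_metrics.json")

# Relative size reduction and accuracy retention of the distilled model vs. the baseline.
size_reduction = 100 * (1 - distilled["size_mb"] / baseline["size_mb"])
acc_retention = 100 * distilled["accuracy"] / baseline["accuracy"]
print(f"Size reduction: {size_reduction:.1f}% | Accuracy retention: {acc_retention:.1f}%")
```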
## Possible Improvements

- **Train Longer & Use More Data**
  - Remove or reduce any `.select(...)` slicing in the dataset scripts for higher accuracy.
  - Increase the number of epochs in both the baseline (`baseline.py`) and distillation (`knowledge_distillation.py`) scripts.
- **Hyperparameter Tuning**
  - Adjust the distillation temperature, alpha, learning rate, and batch size.
  - Explore different pruning ratios or structured pruning techniques.
- **Quantization-Aware Training (QAT)**
  - Instead of post-training dynamic quantization, use QAT to preserve more accuracy.
  - Tools such as bitsandbytes, Intel's Neural Compressor, or OpenVINO can provide additional gains.
- **Structured / Gradual Pruning**
  - Removing entire attention heads or channels can yield better speedups.
  - Gradual pruning with short re-training steps often retains more accuracy than one-shot pruning.
- **Real Edge Deployment**
  - Export the final pruned/quantized model to ONNX or TensorRT (a minimal export sketch follows this list).
  - Test on actual resource-constrained hardware for realistic latency measurements.
- **Alternative Student Model**
  - A 2-layer mini-model might be too small for some tasks.
  - Try DistilBERT or a 4-layer mini-BERT for a better balance of size and accuracy.
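For the ONNX route, `torch.onnx.export` is the standard entry point; a minimal sketch assuming the pruned float checkpoint is exported (the checkpoint path and sequence-classification head are assumptions; dynamically int8-quantized modules generally need ONNX Runtime's own quantization tooling instead):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical checkpoint directory; substitute wherever your pruning script saves the model.
checkpoint = "data/pruned_quantized_model"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

model.eval()
model.config.return_dict = False  # return tuples so tracing/export is straightforward

# Example input used to trace the graph; batch and sequence dims stay dynamic below.
inputs = tokenizer("A sample sentence for export.", return_tensors="pt")

torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=14,
)
```

The exported `model.onnx` can then be benchmarked with ONNX Runtime on the target device, or quantized further with ONNX Runtime's post-training quantization tools.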