# Inference Speed Tests on Local LLMs

Inference speed tests on local large language models across various devices. Feel free to contribute your results.

**Note:** None of the following results are verified.

All models were tested with the same prompt: `Write a 500 word story`.
## GGUF models

| Model | M4 Max (128 GB RAM, 40-core GPU) | M1 Pro (32 GB RAM, 16-core GPU) |
| --- | --- | --- |
| Qwen2.5:7B (4bit) | 72.50 tokens/s | 26.85 tokens/s |
| Qwen2.5:14B (4bit) | 38.23 tokens/s | 14.66 tokens/s |
| Qwen2.5:32B (4bit) | 19.35 tokens/s | 6.95 tokens/s |
| Qwen2.5:72B (4bit) | 8.76 tokens/s | Didn't test |
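The model names in this table are Ollama-style tags, and the Contributing steps at the end of this page describe measuring with `ollama run --verbose`. A minimal session looks like the sketch below; it assumes the default `qwen2.5:7b` tag pulls a 4-bit build, which matches Ollama's current defaults but is worth checking against the model library.

```bash
# Run a GGUF model through Ollama with per-generation timing stats.
ollama run qwen2.5:7b --verbose
# At the interactive prompt, enter: Write a 500 word story
# When generation finishes, the stats block includes an "eval rate"
# line; that tokens/s figure is what the tables report.
```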
## MLX models

| Model | M4 Max (128 GB RAM, 40-core GPU) | M1 Pro (32 GB RAM, 16-core GPU) |
| --- | --- | --- |
| Qwen2.5-7B-Instruct (4bit) | 101.87 tokens/s | 38.99 tokens/s |
| Qwen2.5-14B-Instruct (4bit) | 52.22 tokens/s | 18.88 tokens/s |
| Qwen2.5-32B-Instruct (4bit) | 24.46 tokens/s | 9.10 tokens/s |
| Qwen2.5-32B-Instruct (8bit) | 13.75 tokens/s | Won't complete (crashed) |
| Qwen2.5-72B-Instruct (4bit) | 10.86 tokens/s | Didn't test |
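The source doesn't name the tool used for the MLX results. One common way to get a comparable tokens-per-second figure on Apple Silicon is the `mlx-lm` package, whose generate command prints prompt and generation speeds; the sketch below also assumes the `mlx-community` 4-bit conversion on Hugging Face is a representative model, not necessarily the exact one benchmarked here.

```bash
# Requires Apple Silicon and: pip install mlx-lm
# Downloads the 4-bit MLX conversion, generates a completion, and
# prints prompt and generation speeds in tokens-per-sec at the end.
python -m mlx_lm.generate \
  --model mlx-community/Qwen2.5-7B-Instruct-4bit \
  --prompt "Write a 500 word story" \
  --max-tokens 700
```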
## GGUF models

| Model | M4 Max (128 GB RAM, 40-core GPU) | M1 Pro (32 GB RAM, 16-core GPU) |
| --- | --- | --- |
| Qwen2.5-7B-Instruct (4bit) | 71.73 tokens/s | 26.12 tokens/s |
| Qwen2.5-14B-Instruct (4bit) | 39.04 tokens/s | 14.67 tokens/s |
| Qwen2.5-32B-Instruct (4bit) | 19.56 tokens/s | 4.53 tokens/s |
| Qwen2.5-72B-Instruct (4bit) | 8.31 tokens/s | Didn't test |
## GGUF models

| Model | M1 Max (32 GB RAM, 23-core GPU) | M3 Ultra (256 GB RAM, 80-core GPU) |
| --- | --- | --- |
| mistral-small:23b (4bit) | 15.11 tokens/s | Didn't test |
| mistral-large:123b (4bit) | Didn't test | 8.42 tokens/s |
| llama3.1:8b (4bit) | 38.73 tokens/s | 85.02 tokens/s |
| llama3.2-vision:9b (4bit) | 39.05 tokens/s | Didn't test |
| deepseek-r1:14b (4bit) | 21.16 tokens/s | 46.50 tokens/s |
| deepseek-r1:32b (4bit) | Didn't test | 25.58 tokens/s |
| deepseek-r1:70b (4bit) | Didn't test | 13.16 tokens/s |
| hermes3:405b (4bit) | Didn't test | 2.47 tokens/s |
| Qwen2.5:7B (4bit) | Didn't test | 88.87 tokens/s |
| Qwen2.5:14B (4bit) | Didn't test | 47.25 tokens/s |
| Qwen2.5:32B (4bit) | Didn't test | 26.02 tokens/s |
| Qwen2.5:72B (4bit) | Didn't test | 12.21 tokens/s |
## Contributing

1. Run your model with the verbose flag (e.g. `ollama run mistral-small --verbose`).
2. Enter the prompt `Write a 500 word story`.
3. In the column for your device, add the tokens-per-second (TPS) value from the `eval rate` line of Ollama's output.
4. If your device is not in the list, add it.
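If you'd rather script the measurement than read the interactive output, the same figure can be computed from Ollama's REST API: a non-streamed `/api/generate` response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds). A minimal sketch, assuming a local Ollama server on the default port 11434 and `jq` installed:

```bash
# Request one completion and derive tokens/s from the timing fields:
# eval rate = eval_count / eval_duration (ns) * 1e9
curl -s http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5:7b", "prompt": "Write a 500 word story", "stream": false}' \
  | jq '{model, tokens_per_s: (.eval_count / .eval_duration * 1e9)}'
```

The result should match the `eval rate` line that `--verbose` prints, since both are derived from the same token and duration counters.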