# Inference Speed Tests on Local LLMs

Inference speed tests on local large language models across various devices. Feel free to contribute your results.

**Note:** None of the following results are verified.

All models were tested with the same prompt: `Write a 500 word story`.
## GGUF models

| Model | M4 Max (128 GB RAM, 40-core GPU) | M1 Pro (32 GB RAM, 16-core GPU) |
| --- | --- | --- |
| Qwen2.5:7B (4bit) | 72.50 tokens/s | 26.85 tokens/s |
| Qwen2.5:14B (4bit) | 38.23 tokens/s | 14.66 tokens/s |
| Qwen2.5:32B (4bit) | 19.35 tokens/s | 6.95 tokens/s |
| Qwen2.5:72B (4bit) | 8.76 tokens/s | Didn't test |
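The model names in this table are Ollama-style tags, and the Contributing steps at the end of this page describe measuring with `ollama run --verbose`. A minimal session looks like the sketch below; it assumes the default `qwen2.5:7b` tag pulls a 4-bit build, which matches Ollama's current defaults but is worth checking against the model library.

```bash
# Run a GGUF model through Ollama with per-generation timing stats.
ollama run qwen2.5:7b --verbose
# At the interactive prompt, enter: Write a 500 word story
# When generation finishes, the stats block includes an "eval rate"
# line; that tokens/s figure is what the tables report.
```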
## MLX models

| Model | M4 Max (128 GB RAM, 40-core GPU) | M1 Pro (32 GB RAM, 16-core GPU) |
| --- | --- | --- |
| Qwen2.5-7B-Instruct (4bit) | 101.87 tokens/s | 38.99 tokens/s |
| Qwen2.5-14B-Instruct (4bit) | 52.22 tokens/s | 18.88 tokens/s |
| Qwen2.5-32B-Instruct (4bit) | 24.46 tokens/s | 9.10 tokens/s |
| Qwen2.5-32B-Instruct (8bit) | 13.75 tokens/s | Won't complete (crashed) |
| Qwen2.5-72B-Instruct (4bit) | 10.86 tokens/s | Didn't test |
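The source doesn't name the tool used for the MLX results. One common way to get a comparable tokens-per-second figure on Apple Silicon is the `mlx-lm` package, whose generate command prints prompt and generation speeds; the sketch below also assumes the `mlx-community` 4-bit conversion on Hugging Face is a representative model, not necessarily the exact one benchmarked here.

```bash
# Requires Apple Silicon and: pip install mlx-lm
# Downloads the 4-bit MLX conversion, generates a completion, and
# prints prompt and generation speeds in tokens-per-sec at the end.
python -m mlx_lm.generate \
  --model mlx-community/Qwen2.5-7B-Instruct-4bit \
  --prompt "Write a 500 word story" \
  --max-tokens 700
```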
## GGUF models

| Model | M4 Max (128 GB RAM, 40-core GPU) | M1 Pro (32 GB RAM, 16-core GPU) |
| --- | --- | --- |
| Qwen2.5-7B-Instruct (4bit) | 71.73 tokens/s | 26.12 tokens/s |
| Qwen2.5-14B-Instruct (4bit) | 39.04 tokens/s | 14.67 tokens/s |
| Qwen2.5-32B-Instruct (4bit) | 19.56 tokens/s | 4.53 tokens/s |
| Qwen2.5-72B-Instruct (4bit) | 8.31 tokens/s | Didn't test |
## GGUF models

| Model | M1 Max (32 GB RAM, 23-core GPU) | M3 Ultra (256 GB RAM, 80-core GPU) |
| --- | --- | --- |
| mistral-small:23b (4bit) | 15.11 tokens/s | Didn't test |
| mistral-large:123b (4bit) | Didn't test | 8.42 tokens/s |
| llama3.1:8b (4bit) | 38.73 tokens/s | 85.02 tokens/s |
| llama3.2-vision:9b (4bit) | 39.05 tokens/s | Didn't test |
| deepseek-r1:14b (4bit) | 21.16 tokens/s | 46.50 tokens/s |
| deepseek-r1:32b (4bit) | Didn't test | 25.58 tokens/s |
| deepseek-r1:70b (4bit) | Didn't test | 13.16 tokens/s |
| hermes3:405b (4bit) | Didn't test | 2.47 tokens/s |
| Qwen2.5:7B (4bit) | Didn't test | 88.87 tokens/s |
| Qwen2.5:14B (4bit) | Didn't test | 47.25 tokens/s |
| Qwen2.5:32B (4bit) | Didn't test | 26.02 tokens/s |
| Qwen2.5:72B (4bit) | Didn't test | 12.21 tokens/s |
## Contributing

1. Run your model with the verbose flag (e.g. `ollama run mistral-small --verbose`).
2. Enter the prompt `Write a 500 word story`.
3. In the column for your device, add the tokens-per-second (TPS) value from the `eval rate` line of Ollama's output.
4. If your device is not in the list, add it.
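If you'd rather script the measurement than read the interactive output, the same figure can be computed from Ollama's REST API: a non-streamed `/api/generate` response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds). A minimal sketch, assuming a local Ollama server on the default port 11434 and `jq` installed:

```bash
# Request one completion and derive tokens/s from the timing fields:
# eval rate = eval_count / eval_duration (ns) * 1e9
curl -s http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5:7b", "prompt": "Write a 500 word story", "stream": false}' \
  | jq '{model, tokens_per_s: (.eval_count / .eval_duration * 1e9)}'
```

The result should match the `eval rate` line that `--verbose` prints, since both are derived from the same token and duration counters.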