Performance of llama.cpp on Nvidia CUDA #15013

olegshulyakov · 2025-08-01T15:20:29Z

This is similar to the Performance of llama.cpp on Apple Silicon M-series, Performance of llama.cpp on AMD ROCm(HIP) and Performance of llama.cpp with Vulkan, but for CUDA! I think it's good to consolidate and discuss our results here.

We'll be testing the Llama 2 7B model, like the other threads, to keep things consistent, and using Q4_0 since it's simple to compute and small enough to fit on a 4 GB GPU. You can download it here.

Instructions

Either run the commands below or download one of our CUDA releases. If you have multiple GPUs please run the test on a single GPU using -sm none -mg YOUR_GPU_NUMBER unless the model is too big to fit in VRAM.

wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf
llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
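
On a multi-GPU machine, the single-GPU invocation described above can also be assembled programmatically. This is only a minimal sketch: the helper name `single_gpu_bench_cmd` is hypothetical, and it uses nothing beyond the flags quoted in this post (`-m`, `-ngl`, `-fa`, `-sm none`, `-mg`).

```python
import shlex


def single_gpu_bench_cmd(model_path, gpu_index=0):
    """Return the llama-bench argument list for a run pinned to one GPU.

    Hypothetical helper: it only arranges the flags mentioned in the
    instructions above and does not execute anything.
    """
    return [
        "llama-bench",
        "-m", model_path,
        "-ngl", "99",          # offload all layers
        "-fa", "0,1",          # run without and with flash attention
        "-sm", "none",         # do not split the model across GPUs
        "-mg", str(gpu_index), # GPU to run on
    ]


cmd = single_gpu_bench_cmd("llama-2-7b.Q4_0.gguf", gpu_index=1)
# Print a copy-pasteable shell command line.
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd)` once `llama-bench` is on the `PATH`.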

Share your llama-bench results along with the git hash and CUDA info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.
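
To pull the scoreboard numbers out of a pasted result, the markdown table that llama-bench prints can be parsed with a few lines of Python. The sketch below assumes the usual pipe-separated layout; the function name `parse_bench_table` is illustrative, and the sample rows are the RTX 3090 Ti results posted further down in this thread.

```python
SAMPLE = """
| model         |     size | params | backend | ngl | fa |  test |             t/s |
| ------------- | -------: | -----: | ------- | --: | -: | ----: | --------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |  0 | pp512 | 6567.49 ± 20.30 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |  0 | tg128 |   171.19 ± 3.98 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |  1 | pp512 | 6924.01 ± 10.76 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |  1 | tg128 |   172.26 ± 1.31 |
"""


def parse_bench_table(text):
    """Turn a pipe-separated markdown table into a list of row dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip log lines and blank lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells
            continue
        if all(set(c) <= set("-: ") for c in cells):
            continue  # skip the |---|---| separator row
        rows.append(dict(zip(header, cells)))
    return rows


rows = parse_bench_table(SAMPLE)
# Key by (fa, test); older outputs without an "fa" column default to "0".
results = {
    (r.get("fa", "0"), r["test"]): float(r["t/s"].split("±")[0])
    for r in rows
}
for (fa, test), mean in results.items():
    print(f"fa={fa} {test}: {mean} t/s")
```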

If multiple entries are posted for the same device I'll prioritize newer commits with substantial CUDA updates, otherwise I'll pick the one with the highest overall score at my discretion. Performance may vary depending on driver, operating system, board manufacturer, etc. even if the chip is the same. For integrated graphics note that your memory speed and number of channels will greatly affect your inference speed!
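
Why memory speed matters can be seen from a rough ceiling on token generation. The sketch below assumes that generating one token reads all model weights from memory once (a simplification that ignores caches, the KV cache and compute time). The 19.5 Gbps per-pin data rate for the RTX 3090 is taken from public specifications rather than from this thread, and the function names are illustrative.

```python
GIB = 1024 ** 3


def peak_bandwidth_bytes_per_s(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * data_rate_gbps * 1e9


def tg_upper_bound(model_size_gib, bandwidth_bytes_per_s):
    """Tokens/s ceiling if each token requires one full pass over the weights."""
    return bandwidth_bytes_per_s / (model_size_gib * GIB)


# RTX 3090: 384 bit bus (from the scoreboard), assumed 19.5 Gbps per pin.
bw = peak_bandwidth_bytes_per_s(384, 19.5)
# Llama 2 7B Q4_0 is reported as 3.56 GiB by llama-bench.
bound = tg_upper_bound(3.56, bw)
measured = 158.16  # tg128 for the RTX 3090 on the no-FA scoreboard

print(f"bandwidth ~{bw / 1e9:.0f} GB/s, ceiling ~{bound:.0f} t/s, measured {measured} t/s")
```

Under these assumptions the measured tg128 sits below the ceiling, and halving the bus width or the data rate halves the ceiling, which is why otherwise similar chips can land far apart on tg128.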

CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)

Chip	Memory	pp512 t/s	tg128 t/s	Commit	Thanks to
RTX PRO 6000 Blackwell	96 GB / GDDR7 / 512 bit	14854.63 ± 22.73	274.20 ± 0.14	79c1160	@Tom94
H100 80 GB	80 GB / HBM3 / 5120 bit	9918.34 ± 176.97	267.81 ± 1.54	`5143fa8`	@Hedede
RTX 5090	32 GB / GDDR7 / 512 bit	14751.98 ± 136.24	239.62 ± 0.37	`9c35706`	@RodriMora
A100 80 GB	80 GB / HBM2e / 5120 bit	4849.53 ± 8.94	190.88 ± 0.33	`5143fa8`	@Hedede
RTX 4090 D	24 GB / GDDR6X / 384 bit	10293.86 ± 134.72	189.33 ± 0.19	79c1160	@autonomous-AI-lab
RTX 4090	24 GB / GDDR6X / 384 bit	11992.70 ± 107.99	186.21 ± 0.13	`2241453`	@lhl
RTX 5080	16 GB / GDDR7 / 256 bit	8297.36 ± 9.50	181.99 ± 0.42	8a4280c	@Hedede
RTX 3090 Ti	24 GB / GDDR6X / 384 bit	6567.49 ± 20.30	171.19 ± 3.98	`9c35706`	@slaren
RTX 3090	24 GB / GDDR6X / 384 bit	5174.69 ± 21.83	158.16 ± 0.21	`c76b420`	@m18coppola
RTX 4080	16 GB / GDDR6X / 256 bit	8031.64 ± 26.49	142.49 ± 0.16	`20638e4`	@Ristovski
RTX 3080	10 GB / GDDR6X / 320 bit	5013.86 ± 24.80	139.65 ± 0.99	`9c35706`	@slaren
RTX A6000	48 GB / GDDR6 / 384 bit	4913.93 ± 6.79	138.73 ± 2.75	`4795c91`	@Hedede
RTX 4070 Ti SUPER	16 GB / GDDR6X / 256 bit	6924.53 ± 13.87	132.26 ± 0.16	`9c35706`	@Ristovski
RTX A5000	24 GB / GDDR6 / 384 bit	4028.16 ± 19.14	130.07 ± 2.74	`e5155e6`	@Hedede
RTX 5070	12 GB / GDDR7 / 192 bit	5184.75 ± 18.70	127.54 ± 0.46		@Spyro000
RTX 2080 Ti	11 GB / GDDR6 / 352 bit	2890.66 ± 2.42	107.51 ± 0.21	`9c35706`	@ariya
RTX A4500	20 GB / GDDR6 / 320 bit	2827.20 ± 66.43	97.32 ± 2.80	`5cdb27e`	@aleksyx
RTX 5060 Ti	16 GB / GDDR7 / 128 bit	3737.25 ± 6.79	90.94 ± 0.02	`89d1029`	@mike-llamacpp
RTX A4000	16 GB / GDDR6 / 256 bit	2684.06 ± 15.28	83.77 ± 0.37	`65349f2`	@TinyServal
RTX 3060	12 GB / GDDR6 / 192 bit	2137.50 ± 10.12	75.57 ± 0.07	baa9255	@QuantiusBenignus
RTX 4060 Ti	8 GB / GDDR6 / 128 bit	3394.63 ± 7.44	63.86 ± 0.01	`89d1029`	@mike-llamacpp
GTX 1080 Ti	11 GB / GDDR5X / 352 bit	1084.41 ± 3.01	62.49 ± 0.06	`9c35706`	@ariya
RTX 2060 SUPER	8 GB / GDDR6 / 256 bit	1420.24 ± 1.95	60.04 ± 0.01	`5c0eb5e`	@ggerganov
Tesla P40	24 GB / GDDR5 / 384 bit	1007.42 ± 1.23	54.74 ± 0.07	`c76b420`	@m18coppola
RTX 2000 Ada	16 GB / GDDR6 / 128 bit	1956.22 ± 7.74	50.62 ± 0.04	`756cfea`	@DigitalRudeness
Tesla P100	16 GB / HBM2 / 4096 bit	703.27 ± 3.21	50.20 ± 0.01	9ef5369	@VinnyG9
GTX 1660 Ti Mobile	6 GB / GDDR6 / 192 bit	520.25 ± 2.00	46.46 ± 0.21	`912ff8c`	@pt13762104
Tesla T4	16 GB / GDDR6 / 256 bit	1219.06 ± 4.18	46.38 ± 0.73	`d32e03f`	@pt13762104
RTX 4050 Laptop	6 GB / GDDR6 / 96 bit	1725.85 ± 17.85	43.72 ± 0.41	`d79d8f3`	@TimCabbage
GTX 1660	6 GB / GDDR5 / 192 bit	148.91 ± 0.01	41.35 ± 0.02	`9515c61`	@ariya
GTX 1070 Ti	8 GB / GDDR5 / 256 bit	714.44 ± 2.04	37.82 ± 0.02	79c1160	@pebaryan
Tesla P4	8 GB / GDDR5 / 256 bit	514.53 ± 3.06	33.29 ± 0.00	`c76b420`	@m18coppola
P106-100	6 GB / GDDR5 / 192 bit	406.94 ± 0.25	30.40 ± 0.02	`5fd160b`	@pebaryan
GTX 1060	6 GB / GDDR5 / 192 bit	416.85 ± 1.75	27.79 ± 0.02	`5fd160b`	@pebaryan
Quadro T1000	4 GB / GDDR5 / 128 bit	79.44 ± 0.01	27.82 ± 0.18	`f6da8cb`	@hanabu
Quadro P2000	5 GB / GDDR5 / 160 bit	309.30 ± 0.05	23.63 ± 0.00	baa9255	@TinyServal
Quadro P1000	4 GB / GDDR5 / 128 bit	183.40 ± 0.11	13.99 ± 0.13	1e74897	@aleksyx
Tesla K80	12 GB / GDDR5 / 384 bit	133.14 ± 0.55	13.80 ± 0.02	`32732f2`	@pebaryan

CUDA Scoreboard for Llama 2 7B, Q4_0 (with FA)

Chip	Memory	pp512 t/s	tg128 t/s	Commit	Thanks to
RTX PRO 6000 Blackwell	96 GB / GDDR7 / 512 bit	16618.98 ± 20.66	281.11 ± 0.41	`5143fa8`	@Tom94
H100 80 GB	80 GB / HBM3 / 5120 bit	11263.29 ± 98.34	280.74 ± 1.17	`5143fa8`	@Hedede
RTX 5090	32 GB / GDDR7 / 512 bit	16041.54 ± 85.27	248.57 ± 0.05	`9c35706`	@RodriMora
A100 80 GB	80 GB / HBM2e / 5120 bit	5285.96 ± 6.58	200.90 ± 0.12	`5143fa8`	@Hedede
RTX 4090 D	24 GB / GDDR6X / 384 bit	12506.97 ± 11.51	191.57 ± 0.03	79c1160	@autonomous-AI-lab
RTX 4090	24 GB / GDDR6X / 384 bit	14770.63 ± 102.93	188.96 ± 0.05	`2241453`	@lhl
RTX 5080	16 GB / GDDR7 / 256 bit	9487.70 ± 21.89	184.68 ± 0.05	8a4280c	@Hedede
RTX 3090 Ti	24 GB / GDDR6X / 384 bit	6924.01 ± 10.76	172.26 ± 1.31	`9c35706`	@slaren
RTX 3090	24 GB / GDDR6X / 384 bit	5560.06 ± 16.28	161.89 ± 0.18	`c76b420`	@m18coppola
RTX 4080	16 GB / GDDR6X / 256 bit	9205.93 ± 22.31	143.47 ± 0.02	`20638e4`	@Ristovski
RTX A6000	48 GB / GDDR6 / 384 bit	5662.39 ± 13.87	144.87 ± 0.18	`4795c91`	@Hedede
RTX 3080	10 GB / GDDR6X / 320 bit	5569.56 ± 14.04	139.95 ± 0.95	`9c35706`	@slaren
RTX A5000	24 GB / GDDR6 / 384 bit	4552.15 ± 9.68	135.83 ± 0.11	`e5155e6`	@Hedede
RTX 4070 Ti SUPER	16 GB / GDDR6X / 256 bit	7612.32 ± 37.35	132.85 ± 0.31	`9c35706`	@Ristovski
RTX 5070	12 GB / GDDR7 / 192 bit	5783.44 ± 36.95	128.21 ± 2.52		@Spyro000
RTX 2080 Ti	11 GB / GDDR6 / 352 bit	3107.61 ± 4.34	109.17 ± 0.07	`9c35706`	@ariya
RTX A4500	20 GB / GDDR6 / 320 bit	3453.10 ± 49.19	103.00 ± 0.25	`5cdb27e`	@aleksyx
RTX 5060 Ti	16 GB / GDDR7 / 128 bit	4195.53 ± 1.98	93.46 ± 0.01	`89d1029`	@mike-llamacpp
RTX A4000	16 GB / GDDR6 / 256 bit	2807.83 ± 52.44	85.17 ± 0.66	`65349f2`	@TinyServal
RTX 3060	12 GB / GDDR6 / 192 bit	2407.67 ± 3.73	76.92 ± 0.03	baa9255	@QuantiusBenignus
RTX 4060 Ti	8 GB / GDDR6 / 128 bit	3803.45 ± 70.80	64.03 ± 0.53	`89d1029`	@mike-llamacpp
GTX 1080 Ti	11 GB / GDDR5X / 352 bit	1138.14 ± 2.02	61.38 ± 0.03	`9c35706`	@ariya
RTX 2060 SUPER	8 GB / GDDR6 / 256 bit	1563.77 ± 0.51	61.13 ± 0.05	`5c0eb5e`	@ggerganov
Tesla P40	24 GB / GDDR5 / 384 bit	1079.66 ± 0.18	53.73 ± 0.05	`c76b420`	@m18coppola
RTX 2000 Ada	16 GB / GDDR6 / 128 bit	2250.14 ± 5.91	50.71 ± 0.01	`756cfea`	@DigitalRudeness
Tesla P100	16 GB / HBM2 / 4096 bit	735.19 ± 3.72	51.08 ± 0.00	9ef5369	@VinnyG9
GTX 1660 Ti Mobile	6 GB / GDDR6 / 192 bit	635.21 ± 0.27	46.37 ± 0.07	`912ff8c`	@pt13762104
Tesla T4	16 GB / GDDR6 / 256 bit	1309.73 ± 1.02	44.03 ± 0.57	`d32e03f`	@pt13762104
GTX 1660	6 GB / GDDR5 / 192 bit	154.45 ± 0.52	41.43 ± 0.01	`9515c61`	@ariya
GTX 1070 Ti	8 GB / GDDR5 / 256 bit	790.52 ± 2.39	37.87 ± 0.00	79c1160	@pebaryan
Tesla P4	8 GB / GDDR5 / 256 bit	529.53 ± 2.12	33.12 ± 0.03	`c76b420`	@m18coppola
P106-100	6 GB / GDDR5 / 192 bit	438.49 ± 0.38	30.64 ± 0.06	`5fd160b`	@pebaryan
GTX 1060	6 GB / GDDR5 / 192 bit	446.19 ± 0.81	28.18 ± 0.01	`5fd160b`	@pebaryan
Quadro T1000	4 GB / GDDR5 / 128 bit	27.46 ± 0.23	27.46 ± 0.23	`f6da8cb`	@hanabu
Quadro P2000	5 GB / GDDR5 / 160 bit	311.55 ± 0.19	23.76 ± 0.01	baa9255	@TinyServal
Tesla K80	12 GB / GDDR5 / 384 bit	133.36 ± 0.60	14.27 ± 0.32	`32732f2`	@pebaryan
Quadro P1000	4 GB / GDDR5 / 128 bit	173.82 ± 0.02	13.65 ± 0.14	1e74897	@aleksyx

More detailed test

The main idea of this test is to show a decrease in performance with increasing size.

llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024,2048

m18coppola · 2025-08-01T16:11:12Z

Here are the results for my devices. Not sure how to get a "CUDA info string" though.

CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)

Chip	pp512 t/s	tg128 t/s	Commit
Tesla P4	514.53 ± 3.06	33.29 ± 0.00	`c76b420`
Tesla P40	1007.42 ± 1.23	54.74 ± 0.07	`c76b420`
RTX 3090	5174.69 ± 21.83	158.16 ± 0.21	`c76b420`

CUDA Scoreboard for Llama 2 7B, Q4_0 (with FA)

Chip	pp512 t/s	tg128 t/s	Commit
Tesla P4	529.53 ± 2.12	33.12 ± 0.03	`c76b420`
Tesla P40	1079.66 ± 0.18	53.73 ± 0.05	`c76b420`
RTX 3090	5560.06 ± 16.28	161.89 ± 0.18	`c76b420`

bennmann · 2025-08-01T19:48:15Z

While technically not directly related, there may also be value in comparing the AMD ROCm build here too, as ROCm acts as a replacement (sometimes a directly compatible layer) for most CUDA calls.

I admit there is a risk of confusion for Nvidia users in the thread if this path is taken.

1 reply

olegshulyakov Aug 1, 2025
Author

As far as I know, you cannot run ROCm on an Nvidia GPU. If you would like to see comparative results, check the Vulkan thread; it has results for Vulkan/CUDA and Vulkan/ROCm.

UPD: Created ROCm discussion.

slaren · 2025-08-01T20:21:40Z

Maintainer

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	6567.49 ± 20.30
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	171.19 ± 3.98
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	6924.01 ± 10.76
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	172.26 ± 1.31

build: 9c35706 (6060)

Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	5013.86 ± 24.80
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	139.65 ± 0.99
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	5569.56 ± 14.04
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	139.95 ± 0.95

build: 9c35706 (6060)

Ristovski · 2025-08-01T21:10:34Z

Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	6924.53 ± 13.87
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	132.26 ± 0.16
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	7612.32 ± 37.35
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	132.85 ± 0.31

build: 9c35706 (647)

3 replies

Ristovski Aug 7, 2025

@olegshulyakov One more benchmark for RTX 4080:

Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	8031.64 ± 26.49
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	142.49 ± 0.16
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	9205.93 ± 22.31
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	143.47 ± 0.02

build: 20638e4 (2)

olegshulyakov Aug 7, 2025
Author

@Ristovski why so slow? Have you undervolted it? It's pretty much the same as the RTX 3080; I expected somewhere between the RTX 3090 and 3080 Ti =(

Ristovski Aug 7, 2025

> @Ristovski why so slow? Have you undervolted it? It's pretty much the same as the RTX 3080; I expected somewhere between the RTX 3090 and 3080 Ti =(

Hmm indeed, I didn't give much thought to the score at first. It should be stock but not completely sure as that is one of our work machines. I didn't have much time to investigate today, will check again tomorrow!

RodriMora · 2025-08-01T22:51:37Z

Device 0: 3090. Power limited to 250 W

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	4175.47 ± 27.79
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	137.72 ± 0.46
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	4377.03 ± 89.10
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	138.34 ± 0.96

build: 9c35706 (6060)

Device 2: 5090. Power limited to 400 W

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	12706.26 ± 13.30
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	236.73 ± 1.29
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	13823.36 ± 20.99
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	245.02 ± 1.08

build: 9c35706 (6060)

2 replies

olegshulyakov Aug 2, 2025
Author

Can you please run them at full power, without a limit?

RodriMora Aug 2, 2025

Sure, results with default power limits:

3090 at 390W
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	5405.83 ± 5.80
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	151.04 ± 0.24
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	5932.44 ± 10.87
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	155.36 ± 0.09

build: 9c35706 (6060)

5090 at 600W
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	14751.98 ± 136.24
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	239.62 ± 0.37
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	16041.54 ± 85.27
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	248.57 ± 0.05

build: 9c35706 (6060)

ariya · 2025-08-02T04:34:57Z

Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	100	pp512	1084.41 ± 3.01
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	100	tg128	62.49 ± 0.06

Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	100	1	pp512	1138.14 ± 2.02
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	100	1	tg128	61.38 ± 0.03

build: 9c35706 (6060)

ariya · 2025-08-02T17:01:55Z

@olegshulyakov To help users quickly understand the approximate largest models that can run on each GPU, I suggest adding a VRAM column next to the GPU name on the main scoreboard.

Example:

Chip	VRAM	pp512 t/s	tg128 t/s	Commit
RTX 3090 Ti	24 GB	6567.49 ± 20.30	171.19 ± 3.98	`9c35706`
RTX 3090	24 GB	5174.69 ± 21.83	158.16 ± 0.21	`c76b420`
RTX 3080	10 GB	5013.86 ± 24.80	139.65 ± 0.99	`9c35706`

1 reply

olegshulyakov Aug 2, 2025
Author

Made it a little bit better 🙂

ggerganov · 2025-08-02T19:17:24Z

Maintainer

Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	1420.24 ± 1.95
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	60.04 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	1563.77 ± 0.51
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	61.13 ± 0.05

build: 5c0eb5e (6075)

1 reply

olegshulyakov Aug 2, 2025
Author

@ggerganov Can you please add "performance" label?

mike-llamacpp · 2025-08-02T20:45:23Z

@olegshulyakov I see you grabbed some of my numbers from the Vulkan thread. However, I flooded that post with a bunch of data that probably came across as noise. While you quoted my correct numbers for Non-FA, the FA results you grabbed were actually when run on two GPUs instead of one. To make things easier, here are the numbers from a single card:

RTX 5060 Ti 16 GB

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
  Device 2: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
  Device 3: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes

model	size	params	backend	ngl	sm	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	none	0	pp512	3737.25 ± 6.79
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	none	0	tg128	90.94 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	none	1	pp512	4195.53 ± 1.98
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	none	1	tg128	93.46 ± 0.01

build: 89d10295 (6002)

And here's another GPU for the collection:

RTX 4060 Ti 8 GB

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	3394.63 ± 7.44
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	63.86 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	3803.45 ± 70.80
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	64.03 ± 0.53

build: 89d10295 (6002)

2 replies

rohan-sircar Aug 5, 2025

Nice 64GB VRAM setup you got there!

> And here's another GPU for the collection:

We all be here showing off our GPU collections 😅

mike-llamacpp Aug 5, 2025

Thanks. It isn't the fastest setup around, especially when working with 70B+ models, but it is completely usable for inference. There are also some benefits I like about these particular cards (Gigabyte Windforce):

- Two slots thick and only ~200 mm in length, which makes them easy to fit in a wide variety of cases
- Physical x8 PCIe connector lets them fit in either x8 or x16 slots without modification (5060 Tis only use 8 lanes anyhow)
- Quiet (silent when idle)
- Low idle power consumption (~5 watts per card)
- Relatively low power draw under full load (<180 W each), so it is easy to power all four with an inexpensive PSU

ariya · 2025-08-04T06:20:35Z

Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	100	0	pp512	2890.66 ± 2.42
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	100	0	tg128	107.51 ± 0.21
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	100	1	pp512	3107.61 ± 4.34
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	100	1	tg128	109.17 ± 0.07

build: 9c35706 (6060)

lhl · 2025-08-06T08:26:05Z

Yeah, I also saw numbers for my 4090 taken from the Vulkan thread. I re-ran the CUDA benchmark so you can get the latest FA and non-FA results from the same build:

FA:

❯ CUDA_VISIBLE_DEVICES=0 build/bin/llama-bench -m /models/llm/gguf/llama-2-7b.Q4_0.gguf -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	14770.63 ± 102.93
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	188.96 ± 0.05

Non-FA:

❯ CUDA_VISIBLE_DEVICES=0 build/bin/llama-bench -m /models/llm/gguf/llama-2-7b.Q4_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	pp512	11992.70 ± 107.99
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	tg128	186.21 ± 0.13


build: 224145325 (6098)

nvidia-dkms 575.64.03-1

❯ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_May_27_02:21:03_PDT_2025
Cuda compilation tools, release 12.9, V12.9.86
Build cuda_12.9.r12.9/compiler.36037853_0

pebaryan · 2025-08-07T10:10:35Z

NVIDIA P106-100
6GB VRAM
Win 11
Driver Version: 566.36 CUDA Version: 12.7

I ran the benchmark twice on each of two different builds and kept the best run for each.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA P106-100, compute capability 6.1, VMM: no

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	0	pp512	406.94 ± 0.25
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	0	tg128	30.40 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	1	pp512	438.49 ± 0.38
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	1	tg128	30.64 ± 0.06

build: 5fd160b (6106)

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA P106-100, compute capability 6.1, VMM: no

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	0	pp512	425.73 ± 0.82
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	0	tg128	29.42 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	1	pp512	436.90 ± 0.88
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	1	tg128	29.94 ± 0.03

build: 860a9e4 (5688)

Sadly, Nvidia's Vulkan driver does not support this device.

2 replies

pebaryan Aug 7, 2025

I just bricked my GTX 1070 Ti :( so I won't be able to reproduce the result with a newer build.

olegshulyakov Aug 7, 2025
Author

@pebaryan I've taken the result from the latest build.

DigitalRudeness · 2025-08-07T10:38:52Z

Would like to participate with a slightly exotic one from my cute server cube.. :-) (RTX 2000 Ada, 16GB, 75W)

I did two runs:

pull/compilation of llama.cpp from yesterday:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 2000 Ada Generation, compute capability 8.9, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	1956.22 ± 7.74
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	50.62 ± 0.04
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	2250.14 ± 5.91
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	50.71 ± 0.01

build: 756cfea (6105)

fresh pull/compilation of llama.cpp ~5min ago:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 2000 Ada Generation, compute capability 8.9, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	1952.82 ± 7.35
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	50.59 ± 0.04
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	2237.16 ± 6.18
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	50.67 ± 0.01

build: 1d72c84 (6109)

Seems to make no big difference... ^^

pebaryan · 2025-08-11T09:42:50Z

I finally got my hands on a card similar to the previous one (NP106), but with display output.

NVIDIA GTX 1060
6GB GDDR5 192-bit
Driver 566.36

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	0	pp512	416.85 ± 1.75
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	0	tg128	27.79 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	1	pp512	446.19 ± 0.81
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	1	tg128	28.18 ± 0.01

build: 5fd160b (6106)

1 reply

pebaryan Aug 11, 2025

Just realized I didn't use the latest build; not much of a difference, though.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	0	pp512	413.59 ± 2.04
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	0	tg128	27.74 ± 0.06
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	1	pp512	443.66 ± 0.25
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	1	tg128	28.08 ± 0.04

build: 79c1160 (6123)

pebaryan · 2025-08-22T19:15:33Z

NVIDIA K80

Recompiled using Visual Studio 2022 and CUDA Toolkit 11.4.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: Tesla K80, compute capability 3.7, VMM: no
  Device 1: Tesla K80, compute capability 3.7, VMM: no

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	133.14 ± 0.55
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	13.80 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	133.36 ± 0.60
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	14.27 ± 0.32

build: 32732f2 (6248)

It seems that, in the background, the layers are split across the two devices: the first 2 GB is on Device 0 and the rest is on Device 1. At some point the utilization was 75% and 25%.

1 reply

olegshulyakov Aug 23, 2025
Author

Re-run with --main-gpu 0 --split-mode none

etasnadi · 2025-08-22T21:20:59Z

I have found possible inefficiencies in the CUDA code: many mul_mat kernels, two flash_attn kernels and one rwkv_wkv kernel spill registers on sm_75 (RTX 20-series devices).

The attached file shows the output of compiling llama.cpp with --ptxas-options=-v, stripped to the lines around spill reads/writes or nonzero stack frames.

@JohannesGaessler

llama-cuda-spills.txt
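
A filter along these lines can produce such a stripped log. This is a sketch, not the script used for the attachment: it assumes the usual `ptxas -v` layout in which a "Function properties for ..." line is followed by a "N bytes stack frame, N bytes spill stores, N bytes spill loads" line, and the kernel names and byte counts in the sample are made up for illustration.

```python
import re

# Assumed ptxas -v line formats (see lead-in); adjust if your toolkit differs.
FUNC_RE = re.compile(r"Function properties for (\S+)")
SPILL_RE = re.compile(
    r"(\d+) bytes stack frame, (\d+) bytes spill stores, (\d+) bytes spill loads"
)

# Made-up sample lines, NOT taken from llama-cuda-spills.txt.
SAMPLE_LOG = """\
ptxas info    : Function properties for kernel_a
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 400 bytes cmem[0]
ptxas info    : Function properties for kernel_b
    40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
ptxas info    : Used 255 registers, 400 bytes cmem[0]
"""


def find_spills(log_text):
    """Return (function, stack, spill_stores, spill_loads) for nonzero entries."""
    current = None
    hits = []
    for line in log_text.splitlines():
        m = FUNC_RE.search(line)
        if m:
            current = m.group(1)  # remember which function the next line describes
            continue
        m = SPILL_RE.search(line)
        if m:
            stack, stores, loads = map(int, m.groups())
            if stack or stores or loads:
                hits.append((current, stack, stores, loads))
    return hits


for name, stack, stores, loads in find_spills(SAMPLE_LOG):
    print(f"{name}: stack={stack} B, spill stores={stores} B, spill loads={loads} B")
```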

pwilkin · 2025-08-23T11:31:11Z

Collaborator

ilintar@LinuksowaJaskinia:/devel/models$ llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes

model	size	params	backend	threads	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,BLAS	8	0	pp512	4293.66 ± 24.28
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,BLAS	8	0	tg128	132.18 ± 0.18
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,BLAS	8	1	pp512	4901.77 ± 11.63
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,BLAS	8	1	tg128	135.07 ± 0.20

build: 9ef5369 (6256)

Pretty much in line with @slaren's results.

gilbrotheraway · 2025-08-23T13:15:50Z

Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes

model	size	params	backend	ngl	sm	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	none	0	pp512	703.27 ± 3.21
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	none	0	tg128	50.20 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	none	1	pp512	735.19 ± 3.72
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	none	1	tg128	51.08 ± 0.00

build: 9ef53690 (6256)

autonomous-AI-lab · 2025-08-24T03:55:54Z

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090 D, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090 D, compute capability 8.9, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	10293.86 ± 134.72
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	189.33 ± 0.19
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	12506.97 ± 11.51
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	191.57 ± 0.03

build: 79c1160 (6123)

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090 D, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090 D, compute capability 8.9, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	10259.68 ± 153.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp1024	12555.82 ± 28.04
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp2048	13183.82 ± 20.22
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp4096	11987.23 ± 6.71
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp8192	9442.55 ± 4.48
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp16384	6051.97 ± 1.31
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp32768	3492.30 ± 0.90
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	187.83 ± 2.06
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg256	186.96 ± 0.14
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg512	183.08 ± 0.05
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg1024	179.21 ± 0.22
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg2048	170.17 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	12415.31 ± 20.22
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp1024	16077.52 ± 34.16
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp2048	18410.38 ± 45.13
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp4096	19138.08 ± 19.12
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp8192	16430.60 ± 2.81
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp16384	12031.88 ± 9.03
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp32768	8000.82 ± 2.65
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	189.34 ± 4.04
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg256	191.15 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg512	188.01 ± 0.20
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg1024	182.59 ± 0.07
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg2048	173.16 ± 0.05

build: 79c1160 (6123)
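
The detailed run above can be condensed into throughput relative to each configuration's peak, which makes the drop-off with prompt size easier to see. This is a minimal sketch using the RTX 4090 D prompt-processing means from the table above; the variable and function names are illustrative.

```python
# pp throughput (t/s) by prompt length, copied from the RTX 4090 D table above.
pp_no_fa = {512: 10259.68, 1024: 12555.82, 2048: 13183.82, 4096: 11987.23,
            8192: 9442.55, 16384: 6051.97, 32768: 3492.30}
pp_fa = {512: 12415.31, 1024: 16077.52, 2048: 18410.38, 4096: 19138.08,
         8192: 16430.60, 16384: 12031.88, 32768: 8000.82}


def relative_to_peak(series):
    """Express each value as a fraction of the best value in the series."""
    peak = max(series.values())
    return {n: v / peak for n, v in series.items()}


rel_no_fa = relative_to_peak(pp_no_fa)
rel_fa = relative_to_peak(pp_fa)

for n in pp_no_fa:
    speedup = pp_fa[n] / pp_no_fa[n]
    print(f"pp{n:<6} no FA {rel_no_fa[n]:6.1%}   FA {rel_fa[n]:6.1%}   FA speedup {speedup:.2f}x")
```

On this card throughput first rises with prompt length and only then falls, and the FA run keeps a larger share of its peak at 32768 tokens than the non-FA run.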

1 reply

ehartford Aug 25, 2025

I tried, but I have four 3090s and it only used one of them; the other three were at 0%.

aleksyx · 2025-08-27T11:36:03Z

NVIDIA Quadro P1000 - 4 GB GDDR5 - 128 bit
Fedora Linux 42 (ThinkStation P340)
Driver Version: 575.64.05
CUDA Version: 12.9

./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Quadro P1000, compute capability 6.1, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	183.40 ± 0.11
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	13.99 ± 0.13
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	173.82 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	13.65 ± 0.14

build: 1e74897 (1)

You will not be able to run this benchmark while GNOME or anything else is using GPU memory; boot into text mode to run it. That said, the GPU runs smaller models (< 2.5 GB) fine alongside the graphical environment, for example IBM Granite 3.3 2B, Qwen3 1.7B, etc.

swein · 2025-08-27T14:31:21Z

4090 -pl 300 (300 watts, stock clocks) on official container:latest

docker run -it --rm --gpus all -v /mnt/user/AI/models:/models ghcr.io/ggml-org/llama.cpp:full-cuda --bench -m /models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192,16384 -n 128,256,512,1024,2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	10712.64 ± 135.22
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp1024	9831.51 ± 120.33
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp2048	8534.83 ± 94.59
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp4096	6939.80 ± 24.46
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp8192	5036.04 ± 6.41
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp16384	3012.11 ± 1.49
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	167.53 ± 1.89
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg256	166.85 ± 0.19
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg512	163.54 ± 0.09
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg1024	159.41 ± 0.17
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg2048	150.76 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	12416.64 ± 153.67
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp1024	12362.39 ± 96.52
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp2048	11764.82 ± 198.87
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp4096	11016.56 ± 151.08
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp8192	9918.19 ± 27.37
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp16384	8449.90 ± 5.87
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	169.74 ± 2.16
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg256	170.65 ± 0.06
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg512	167.57 ± 0.18
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg1024	162.95 ± 0.10
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg2048	154.41 ± 0.06

4090 stock 450watts

docker run -it --rm --gpus all -v /mnt/user/AI/models:/models ghcr.io/ggml-org/llama.cpp:full-cuda --bench -m /models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192,16384 -n 128,256,512,1024,2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	11773.24 ± 91.37
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp1024	10885.59 ± 89.51
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp2048	9388.77 ± 8.90
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp4096	7400.17 ± 3.67
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp8192	5216.83 ± 0.57
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp16384	3050.56 ± 0.83
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	167.52 ± 1.75
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg256	166.90 ± 0.22
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg512	163.71 ± 0.09
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg1024	159.68 ± 0.17
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg2048	151.05 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	14017.58 ± 118.81
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp1024	14209.33 ± 21.17
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp2048	13843.49 ± 134.12
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp4096	13251.65 ± 133.79
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp8192	12011.00 ± 38.03
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp16384	10170.22 ± 6.53
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	170.12 ± 2.22
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg256	171.05 ± 0.04
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg512	167.87 ± 0.21
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg1024	163.23 ± 0.12
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg2048	154.68 ± 0.04

% difference (stock 450 W run relative to the first run; the t/s column repeats the first run's values):

model	size	params	backend	ngl	fa	test	t/s	% diff
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	10712.64 ± 135.22	+9.90%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp1024	9831.51 ± 120.33	+10.72%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp2048	8534.83 ± 94.59	+10.01%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp4096	6939.80 ± 24.46	+6.63%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp8192	5036.04 ± 6.41	+3.59%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp16384	3012.11 ± 1.49	+1.28%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	167.53 ± 1.89	-0.01%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg256	166.85 ± 0.19	+0.03%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg512	163.54 ± 0.09	+0.10%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg1024	159.41 ± 0.17	+0.17%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg2048	150.76 ± 0.03	+0.19%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	12416.64 ± 153.67	+12.89%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp1024	12362.39 ± 96.52	+14.95%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp2048	11764.82 ± 198.87	+17.67%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp4096	11016.56 ± 151.08	+20.29%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp8192	9918.19 ± 27.37	+21.09%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp16384	8449.90 ± 5.87	+20.37%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	169.74 ± 2.16	+0.22%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg256	170.65 ± 0.06	+0.23%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg512	167.57 ± 0.18	+0.18%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg1024	162.95 ± 0.10	+0.17%
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg2048	154.41 ± 0.06	+0.17%
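
For anyone who wants to reproduce the percentages, here is a minimal sketch. It assumes the % diff column is the second (stock 450 W) run relative to the first run, and it only uses the mean t/s values, ignoring the ± spread. The helper name `pct_diff` is made up for illustration. A few rows in the table may differ by 0.01 depending on how the original values were rounded.

```python
def pct_diff(baseline: float, candidate: float) -> float:
    """Relative change of candidate vs. baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0

# (first run t/s, stock 450 W run t/s): mean values copied from the two tables above
runs = {
    "fa=0 pp512": (10712.64, 11773.24),
    "fa=0 tg128": (167.53, 167.52),
    "fa=1 pp512": (12416.64, 14017.58),
    "fa=1 pp4096": (11016.56, 13251.65),
}

for name, (first, stock) in runs.items():
    print(f"{name}: {pct_diff(first, stock):+.2f}%")
```

The pattern in the table is that prompt processing gains roughly 10-20% while token generation stays within a fraction of a percent.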

0 replies

Hedede · 2025-08-28T11:31:27Z

Hedede
Aug 28, 2025

RTX 5080 (16GB)

Ubuntu 22.04.1 (Linux 6.8.0)
Driver Version: 575.64.03
CUDA Version: 12.9

$ ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -p 512,1024,2048,4096,8192,16384 -n 128,256,512,1024,2048 -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5080, compute capability 12.0, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	8297.36 ± 9.50
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp1024	7526.74 ± 5.23
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp2048	6764.02 ± 3.40
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp4096	5672.04 ± 2.65
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp8192	4302.98 ± 1.93
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp16384	2764.80 ± 1.63
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	181.99 ± 0.42
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg256	179.77 ± 0.11
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg512	174.80 ± 0.24
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg1024	171.27 ± 0.04
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg2048	163.08 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	9487.70 ± 21.89
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp1024	9359.28 ± 12.61
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp2048	8744.84 ± 54.69
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp4096	8287.33 ± 65.44
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp8192	7596.55 ± 34.52
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp16384	6392.95 ± 26.51
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	184.68 ± 0.05
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg256	183.91 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg512	180.83 ± 0.14
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg1024	175.42 ± 0.08
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg2048	166.30 ± 0.03

build: 8a4280c (6307)
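
If you want to compare results across posts, the rows can be parsed mechanically. This is a rough sketch that assumes the eight tab-separated columns used in the tables pasted here (model, size, params, backend, ngl, fa, test, t/s). It does not handle the pipe-delimited Markdown tables that llama-bench prints by default, or tables without the fa column. `BenchRow` and `parse_row` are illustrative names, not part of llama.cpp.

```python
from typing import NamedTuple

class BenchRow(NamedTuple):
    model: str
    size: str
    params: str
    backend: str
    ngl: int
    fa: int
    test: str
    tps: float      # mean tokens per second
    tps_err: float  # the value after "±"

def parse_row(line: str) -> BenchRow:
    """Parse one tab-separated result row as pasted in this thread."""
    model, size, params, backend, ngl, fa, test, tps = line.rstrip("\n").split("\t")
    # "8297.36 ± 9.50" -> (8297.36, 9.5); float() tolerates the surrounding spaces
    mean, err = (float(x) for x in tps.split("±"))
    return BenchRow(model, size, params, backend, int(ngl), int(fa), test, mean, err)

row = parse_row("llama 7B Q4_0\t3.56 GiB\t6.74 B\tCUDA\t99\t0\tpp512\t8297.36 ± 9.50")
print(row.test, row.fa, row.tps, row.tps_err)
```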

0 replies

Hedede · 2025-09-01T09:37:16Z

Hedede
Sep 1, 2025

RTX A6000 (48 GB / GDDR6 / 384-bit)

Linux 5.15.0
Driver Version: 565.77
CUDA Version: 12.7

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	4913.93 ± 6.79
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp1024	4619.66 ± 5.08
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp2048	4313.47 ± 8.60
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp4096	3740.25 ± 4.44
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp8192	2899.87 ± 3.97
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp16384	1981.68 ± 3.73
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp32768	1091.41 ± 0.73
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	138.73 ± 2.75
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg256	137.87 ± 0.15
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg512	133.61 ± 0.08
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg1024	131.28 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg2048	125.16 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	5662.39 ± 13.87
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp1024	5600.69 ± 18.81
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp2048	5497.16 ± 15.85
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp4096	5035.14 ± 4.70
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp8192	4599.80 ± 2.69
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp16384	3887.28 ± 1.35
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp32768	2954.14 ± 0.52
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	144.87 ± 0.18
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg256	143.27 ± 0.05
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg512	140.72 ± 0.20
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg1024	136.76 ± 0.12
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg2048	130.11 ± 0.06

build: 4795c91 (6342)
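
A quick way to see how much flash attention helps as the prompt grows is to divide the fa=1 number by the fa=0 number at each length. A small sketch using three of the A6000 rows above (mean values only; `fa_speedup` is just an illustrative helper):

```python
# Mean pp t/s for the RTX A6000, copied from the table above:
# {prompt length: (fa=0, fa=1)}
a6000_pp = {
    512: (4913.93, 5662.39),
    4096: (3740.25, 5035.14),
    32768: (1091.41, 2954.14),
}

def fa_speedup(no_fa: float, with_fa: float) -> float:
    """How many times faster the fa=1 run is than the fa=0 run."""
    return with_fa / no_fa

for n_prompt, (no_fa, with_fa) in a6000_pp.items():
    print(f"pp{n_prompt}: {fa_speedup(no_fa, with_fa):.2f}x")
```

On these numbers the gap widens from about 1.15x at pp512 to about 2.7x at pp32768.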

0 replies

hanabu · 2025-09-05T16:00:44Z

hanabu
Sep 5, 2025

Quadro T1000 (TU117, 4 GB / GDDR6 / 128-bit), HP Z2 mini G5

I couldn't run the 512-token prompt test with 4 GB of VRAM, so I tested pp256 instead.

Oracle Linux 9.6
Driver Version: 575.57.08
CUDA Version: 12.9
llama.cpp: f6da8cb 6362

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Quadro T1000, compute capability 7.5, VMM: yes
load_backend: loaded CUDA backend from /usr/lib64/ggml/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib64/ggml/libggml-cpu.so

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp256	79.44 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	27.82 ± 0.18
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp256	81.86 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	27.46 ± 0.23

With pt13762104's GGML_CUDA_NO_TURING_MMA modification

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Quadro T1000, compute capability 7.5, VMM: yes
load_backend: loaded CUDA backend from /usr/lib64/ggml/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib64/ggml/libggml-cpu.so

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp256	385.93 ± 0.29
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	28.43 ± 0.06
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp256	431.16 ± 0.10
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	28.43 ± 0.11
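
To put the two T1000 tables side by side, here is a small sketch that computes the ratio between the modified build and the default build for each test. It uses the mean values only, and `speedup` is an illustrative helper name.

```python
def speedup(before: float, after: float) -> float:
    """Ratio of the modified build's t/s to the default build's t/s."""
    return after / before

# Mean t/s for the Quadro T1000, copied from the two tables above:
# {(fa, test): (default build, modified build)}
t1000 = {
    (0, "pp256"): (79.44, 385.93),
    (0, "tg128"): (27.82, 28.43),
    (1, "pp256"): (81.86, 431.16),
    (1, "tg128"): (27.46, 28.43),
}

for (fa, test), (before, after) in t1000.items():
    print(f"fa={fa} {test}: {speedup(before, after):.2f}x")
```

Prompt processing comes out roughly 4.9x (fa=0) and 5.3x (fa=1) faster, while token generation is essentially unchanged.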

I'm running offline batch inference on this old workstation. Llama.cpp and Qwen3-4B-Thinking-2507 IQ4 work well. Many thanks to the llama.cpp team!

0 replies

Hedede · 2025-09-05T18:51:01Z

Hedede
Sep 5, 2025

A100 80GB PCIe
Driver Version: 570.133.20
CUDA Version: 12.8
Linux 6.8.0 (22.04.1-Ubuntu)

./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -fa 0,1 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024,2048,4096
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	4883.69 ± 5.30
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp1024	4760.67 ± 1.58
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp2048	4572.24 ± 2.23
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp4096	4230.73 ± 0.58
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp8192	3664.65 ± 0.33
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp16384	2851.50 ± 4.65
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp32768	1867.40 ± 1.92
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	185.91 ± 7.07
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg256	186.56 ± 0.17
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg512	180.46 ± 0.09
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg1024	178.87 ± 0.06
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg2048	173.15 ± 0.08
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg4096	163.02 ± 0.07
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	5295.94 ± 5.49
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp1024	5315.97 ± 0.22
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp2048	5265.85 ± 0.65
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp4096	5118.09 ± 3.22
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp8192	4822.77 ± 1.59
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp16384	4306.81 ± 0.93
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp32768	3544.53 ± 4.04
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	196.26 ± 2.41
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg256	197.44 ± 0.06
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg512	193.92 ± 0.40
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg1024	190.27 ± 0.15
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg2048	182.79 ± 0.20
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg4096	170.53 ± 0.08

build: 5143fa8 (6392)

0 replies

Hedede · 2025-09-05T18:56:03Z

Hedede
Sep 5, 2025

A100 80GB SXM4
Driver Version: 565.57.01
CUDA Version: 12.7
Linux 6.5.0 (22.04.1-Ubuntu)

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	4849.53 ± 8.94
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp1024	4755.22 ± 8.14
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp2048	4569.57 ± 6.91
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp4096	4239.42 ± 5.94
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp8192	3663.99 ± 1.79
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp16384	2894.84 ± 0.85
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp32784	1918.93 ± 0.37
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	190.88 ± 0.33
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg256	188.06 ± 0.29
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg512	182.03 ± 0.10
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg1024	180.78 ± 0.71
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg2048	175.78 ± 0.16
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg4096	165.85 ± 0.05
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	5285.96 ± 6.58
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp1024	5308.42 ± 8.91
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp2048	5282.45 ± 5.67
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp4096	5150.01 ± 5.38
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp8192	4883.68 ± 3.31
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp16384	4426.85 ± 0.97
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp32784	3708.18 ± 1.05
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	200.90 ± 0.12
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg256	200.80 ± 0.11
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg512	197.55 ± 0.33
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg1024	193.66 ± 0.29
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg2048	186.59 ± 0.08
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg4096	174.47 ± 0.03

build: 5143fa8 (6392)

0 replies

Hedede · 2025-09-05T18:59:16Z

Hedede
Sep 5, 2025

H100 80GB PCIe
Driver Version: 570.133.20
CUDA Version: 12.8
Linux 6.8.0 (22.04.1-Ubuntu)

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA H100 PCIe, compute capability 9.0, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	7350.98 ± 63.64
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp1024	7277.24 ± 3.96
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp2048	6976.45 ± 8.06
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp4096	6370.87 ± 11.90
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp8192	5298.29 ± 4.18
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp16384	3836.34 ± 6.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp32784	2435.55 ± 3.19
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	219.23 ± 0.55
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg256	215.16 ± 0.89
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg512	208.08 ± 0.40
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg1024	206.23 ± 0.86
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg2048	200.75 ± 0.09
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg4096	189.94 ± 0.45
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	8457.52 ± 18.97
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp1024	8513.08 ± 13.55
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp2048	8442.74 ± 29.20
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp4096	8110.29 ± 7.67
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp8192	7335.59 ± 8.79
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp16384	6214.56 ± 4.93
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp32784	4837.63 ± 2.62
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	229.85 ± 0.07
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg256	230.12 ± 0.09
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg512	227.18 ± 0.37
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg1024	207.57 ± 0.66
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg2048	213.82 ± 0.35
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg4096	200.16 ± 0.21

build: 5143fa8 (6392)

0 replies

Hedede · 2025-09-05T19:04:59Z

Hedede
Sep 5, 2025

H100 80GB SXM5
Driver Version: 560.35.05
CUDA Version: 12.6
Linux 6.5.0 (22.04.1-Ubuntu)

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	9918.34 ± 176.97
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp1024	9794.27 ± 76.14
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp2048	9366.63 ± 50.02
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp4096	8568.65 ± 3.37
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp8192	7152.63 ± 2.61
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp16384	5331.01 ± 1.40
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp32784	3456.25 ± 1.25
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	267.81 ± 1.54
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg256	263.37 ± 0.26
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg512	255.22 ± 0.18
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg1024	253.44 ± 0.25
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg2048	247.29 ± 0.14
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg4096	236.07 ± 0.10
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	11263.29 ± 98.34
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp1024	11454.74 ± 8.49
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp2048	11331.01 ± 10.68
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp4096	10935.28 ± 34.86
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp8192	10340.75 ± 4.30
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp16384	9211.07 ± 2.62
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp32784	7407.12 ± 6.49
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	280.74 ± 1.17
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg256	280.73 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg512	277.60 ± 0.51
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg1024	274.97 ± 0.41
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg2048	264.73 ± 0.31
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg4096	248.77 ± 0.19

build: 5143fa8 (6392)

0 replies

Tom94 · 2025-09-05T19:30:52Z

Tom94
Sep 5, 2025

After seeing @Hedede's numbers for the H100, I had to try the RTX PRO 6000 Blackwell on the latest llama.cpp version to compare. It just barely manages to edge it out. :)

With fa:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  1 |           pp512 |     16618.98 ± 20.66 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  1 |          pp1024 |     16590.31 ± 28.12 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  1 |          pp2048 |     16255.31 ± 42.25 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  1 |          pp4096 |     15571.01 ± 44.15 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  1 |          pp8192 |     14231.99 ± 25.94 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  1 |         pp16384 |     11797.81 ± 27.10 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  1 |         pp32768 |      7462.29 ± 27.29 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  1 |           tg128 |        281.11 ± 0.41 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  1 |           tg256 |        280.98 ± 0.11 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  1 |           tg512 |        277.08 ± 0.48 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  1 |          tg1024 |        270.57 ± 0.17 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  1 |          tg2048 |        256.81 ± 0.21 |

build: 5143fa8 (6392)

Without fa:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |           pp512 |     14854.63 ± 22.73 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          pp1024 |     14046.00 ± 42.45 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          pp2048 |     12772.94 ± 18.17 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          pp4096 |     10429.76 ± 11.19 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          pp8192 |       7786.16 ± 3.03 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         pp16384 |       5128.03 ± 3.09 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         pp32768 |       2845.66 ± 2.95 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |           tg128 |        274.20 ± 0.14 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |           tg256 |        271.15 ± 0.27 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |           tg512 |        264.13 ± 0.14 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          tg1024 |        260.42 ± 0.12 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          tg2048 |        249.18 ± 0.11 |

build: 5143fa8 (6392)

0 replies


Replies: 43 comments · 30 replies
