
Slow performance compared to llama.cpp origin #40

Open
idreamerhx opened this issue Sep 6, 2024 · 7 comments
Labels: question (Further information is requested)

Comments

idreamerhx commented Sep 6, 2024

I have tried two platforms: an i5-12490F with 64 GB DDR5-6400, and an EPYC 7302 (16 cores, 3.0 GHz) with 128 GB DDR4-3200 (measured memory read bandwidth 118 GB/s).

Below are logs from the 7302: first T-MAC, then the latest llama.cpp, for models with a similar configuration (Llama-3 8B Instruct, 4-bit / Q4_0).

I'll post logs for the 12490F, and also a 14700, later. On the 12490F, T-MAC is about 3 times slower for prefill.

So why is that? What's missing, and how can I speed up tokens/s, especially prefill, with T-MAC?

#30

Thanks.

Running command in /opt/T-MAC/3rdparty/llama.cpp/build:
/opt/T-MAC/3rdparty/llama.cpp/build/bin/main -m /opt/T-MAC/Llama-3-8b-instruct-EfficientQAT-w4g128/ggml-model.in.gguf -n 128 -t 4 -p Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington. -ngl 0 -c 2048
Check logs/2024-09-06-08-36-28.log for inference output
(base) root@b96728f2aeed:/opt/T-MAC# cat logs/2024-09-06-08-36-28.log
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 2854 (70c312d)
main: built with clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18) for x86_64-unknown-linux-gnu
main: seed = 1725611788
[08:36:28] /opt/T-MAC/3rdparty/llama.cpp/ggml-tmac.cpp:38: ggml_tmac_init

llm_load_print_meta: model size = 5.41 GiB (5.79 BPW)
llm_load_print_meta: general.name = Llama-3-8b-instruct-EfficientQAT-w4g128
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: PAD token = 128256 '[PAD]'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: CPU buffer size = 5749.03 MiB
..................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 4 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 2048, n_batch = 2048, n_predict = 128, n_keep = 0

<|begin_of_text|>Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington.imersimersimersimersushihandsimersimersushiimersaqimersimersimersindeimersindeimerspinsimersellersimersayasimersindeushiimersushiushiimersayasaqimersindeimersindeushiimershandsushiayasimersushiimersindeushiimersindeimersimersorgeimersimersingoimersushiaqimersimersingoimersimersersimersorgeimersimersaqimersimersellersushiimersimersindeimershandsushihandsimersimersimersimersayasellersantimersushiimersimersimersimersimersimersimersimersimersayasushiimersimersimersuggyindeindeimerspinsaqimersimersimersimershandsimersimersorgeiringimersushiimersimersimersaqimersingohandsimershands
llama_print_timings: load time = 1000.39 ms
llama_print_timings: sample time = 10.08 ms / 128 runs ( 0.08 ms per token, 12704.71 tokens per second)
llama_print_timings: prompt eval time = 2035.98 ms / 18 tokens ( 113.11 ms per token, 8.84 tokens per second)
llama_print_timings: eval time = 18743.23 ms / 127 runs ( 147.58 ms per token, 6.78 tokens per second)
llama_print_timings: total time = 20853.05 ms / 145 tokens
Log end

####################################################################
llama.cpp
####################################################

./bin/llama-cli -m Meta-Llama-3-8B-Instruct-v2.Q4_0.gguf -n 128 -t 4 -p 'Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington.' -ngl 0 -c 2048

llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: CPU buffer size = 4437.80 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 4 (n_threads_batch = 4) / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 2048, n_batch = 2048, n_predict = 128, n_keep = 0

Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington. It is one of the largest and most successful technology companies in the world. Microsoft was founded in 1975 by Bill Gates and Paul Allen, and it is known for its Windows operating system, Microsoft Office software suite, and Xbox gaming console. The company is also known for its cloud computing platform, Azure, and its artificial intelligence (AI) and machine learning (ML) technologies.

Microsoft is a multinational corporation with operations in over 100 countries. It has a market capitalization of over $2 trillion and employs over 140,000 people worldwide. The company is headquartered in Redmond, Washington, and has major offices in New York
llama_print_timings: load time = 684.18 ms
llama_print_timings: sample time = 10.31 ms / 128 runs ( 0.08 ms per token, 12411.52 tokens per second)
llama_print_timings: prompt eval time = 1259.19 ms / 17 tokens ( 74.07 ms per token, 13.50 tokens per second)
llama_print_timings: eval time = 18773.20 ms / 127 runs ( 147.82 ms per token, 6.76 tokens per second)
llama_print_timings: total time = 20089.71 ms / 144 tokens
Log end

kaleid-liner (Collaborator) commented

> llama_print_timings: prompt eval time = 2035.98 ms / 18 tokens ( 113.11 ms per token, 8.84 tokens per second)
> llama_print_timings: eval time = 18743.23 ms / 127 runs ( 147.58 ms per token, 6.78 tokens per second)

It's very weird that the prefill and decoding phases have nearly the same throughput. #30 also demonstrates that even on a memory-bottlenecked machine, prefill still achieves a significant speedup. Can you verify it using llama-bench instead? BTW, the tokens generated by T-MAC look wrong. Was the model converted correctly?
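
For reference, a llama-bench invocation along these lines would exercise both phases in one run (the model paths are placeholders; -p 512 reports prefill as pp512 and -n 128 reports decode as tg128):

# placeholder paths -- run once against the T-MAC model and once against the Q4_0 model
./bin/llama-bench -m /path/to/ggml-model.in.gguf -t 4 -p 512 -n 128
./bin/llama-bench -m /path/to/Meta-Llama-3-8B-Instruct-v2.Q4_0.gguf -t 4 -p 512 -n 128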

> llama_print_timings: eval time = 18743.23 ms / 127 runs ( 147.58 ms per token, 6.78 tokens per second)
> llama_print_timings: eval time = 18773.20 ms / 127 runs ( 147.82 ms per token, 6.76 tokens per second)

The decoding performance seems very close. Can you share the version of your llama-cli? The latest llama.cpp has had some major updates that optimize multi-threading performance. We are working on updating our llama.cpp to the latest version to apply these updates (#32 (comment)).

> llm_load_print_meta: model size = 5.41 GiB (5.79 BPW)

The model size of EfficientQAT is 5.41 GiB, while the model size of Q4_0 is 4.66 GiB (according to https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF-v2/tree/main). This is mainly because the large token embedding is quantized in llama.cpp's Q4_0, while it is left unquantized in EfficientQAT (or GPTQ). This could be further optimized. For now, I recommend using models like Phi-3.5 or Llama-2 for a fair comparison.
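
A rough back-of-envelope supports this (assuming the token embedding is kept in fp16 in the EfficientQAT GGUF and Q4_0 averages roughly 4.5 bits/weight):

# Llama-3 token embedding: 128256 vocab x 4096 dim
echo "scale=2; 128256*4096*2/1024^3" | bc       # ~0.98 GiB in fp16
echo "scale=2; 128256*4096*4.5/8/1024^3" | bc   # ~0.27 GiB at ~4.5 bpw

The ~0.7 GiB difference roughly matches the 5.41 GiB vs 4.66 GiB gap.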

kaleid-liner added the question (Further information is requested) label on Sep 6, 2024

idreamerhx commented Sep 6, 2024

Thanks for your reply.

/opt/llama.cpp/build# ./bin/llama-cli --version
version: 3664 (82e3b03c)

"llm_load_print_meta: model size = 5.41 GiB (5.79 BPW) ..."
I'll try Llama-2 for both T-MAC and llama.cpp, and use the older llama.cpp (b2854) to get a more accurate comparison.
I'll also benchmark an equivalent single kernel for both.

llama-bench results to follow...


idreamerhx commented Sep 6, 2024

/opt/llama.cpp/build# ./bin/llama-bench -m Meta-Llama-3-8B-Instruct-v2.Q4_0.gguf -t 1

| model         | size     | params | backend | threads | test  | t/s         |
| ------------- | -------- | ------ | ------- | ------- | ----- | ----------- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU     | 1       | pp512 | 3.53 ± 0.00 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU     | 1       | tg128 | 1.83 ± 0.00 |

build: 82e3b03c (3664)

/opt/llama.cpp/build# ./bin/llama-bench -m Meta-Llama-3-8B-Instruct-v2.Q4_0.gguf -t 4

| model         | size     | params | backend | threads | test  | t/s          |
| ------------- | -------- | ------ | ------- | ------- | ----- | ------------ |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU     | 4       | pp512 | 13.98 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU     | 4       | tg128 | 6.76 ± 0.03  |

build: 82e3b03c (3664)

./bin/llama-bench -m /opt/T-MAC/Llama-3-8b-instruct-EfficientQAT-w4g128/ggml-model.in.gguf -t 1; sleep 10; ./bin/llama-bench -m /opt/T-MAC/Llama-3-8b-instruct-EfficientQAT-w4g128/ggml-model.in.gguf -t 4
[11:32:18] /opt/T-MAC/3rdparty/llama.cpp/ggml-tmac.cpp:38: ggml_tmac_init

| model       | size     | params | backend | threads | test   | t/s         |
| ----------- | -------- | ------ | ------- | ------- | ------ | ----------- |
| llama 8B IN | 5.41 GiB | 8.03 B | CPU     | 1       | pp 512 | 2.27 ± 0.00 |
| llama 8B IN | 5.41 GiB | 8.03 B | CPU     | 1       | tg 128 | 1.93 ± 0.00 |

build: 70c312d (2854)
[12:00:33] /opt/T-MAC/3rdparty/llama.cpp/ggml-tmac.cpp:38: ggml_tmac_init

| model       | size     | params | backend | threads | test   | t/s         |
| ----------- | -------- | ------ | ------- | ------- | ------ | ----------- |
| llama 8B IN | 5.41 GiB | 8.03 B | CPU     | 4       | pp 512 | 8.91 ± 0.00 |
| llama 8B IN | 5.41 GiB | 8.03 B | CPU     | 4       | tg 128 | 6.74 ± 0.01 |

build: 70c312d (2854)

I'm downloading some other models: w2g128 and QuantFactory/Llama-2-7b.

idreamerhx (Author) commented

./bin/llama-bench -m Llama-2-7b-chat-hf.Q4_0.gguf -t 4

| model         | size     | params | backend | threads | test  | t/s          |
| ------------- | -------- | ------ | ------- | ------- | ----- | ------------ |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     | 4       | pp512 | 23.63 ± 0.01 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     | 4       | tg128 | 13.85 ± 0.04 |

build: 82e3b03c (3664)

./3rdparty/llama.cpp/build/bin/llama-bench -m Llama-2-7b-EfficientQAT-w4g128/ggml-model.in.gguf -t 4
[07:12:55] /opt/T-MAC/3rdparty/llama.cpp/ggml-tmac.cpp:38: ggml_tmac_init

| model       | size     | params | backend | threads | test   | t/s          |
| ----------- | -------- | ------ | ------- | ------- | ------ | ------------ |
| llama 7B IN | 3.69 GiB | 6.74 B | CPU     | 4       | pp 512 | 18.78 ± 0.04 |
| llama 7B IN | 3.69 GiB | 6.74 B | CPU     | 4       | tg 128 | 15.27 ± 0.01 |

build: 70c312d (2854)

idreamerhx (Author) commented

./bin/llama-bench -m Meta-Llama-3-8B-Instruct-v2.Q2_K.gguf -t 4

| model                  | size     | params | backend | threads | test  | t/s          |
| ---------------------- | -------- | ------ | ------- | ------- | ----- | ------------ |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | CPU     | 4       | pp512 | 23.03 ± 0.00 |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | CPU     | 4       | tg128 | 16.45 ± 0.00 |

build: 82e3b03c (3664)

./bin/llama-bench -m /opt/T-MAC/Llama-3-8b-instruct-EfficientQAT-w2g128/ggml-model.in.gguf -t 4
[06:39:01] /opt/T-MAC/3rdparty/llama.cpp/ggml-tmac.cpp:38: ggml_tmac_init

| model       | size     | params | backend | threads | test   | t/s          |
| ----------- | -------- | ------ | ------- | ------- | ------ | ------------ |
| llama 8B IN | 3.99 GiB | 8.03 B | CPU     | 4       | pp 512 | 33.06 ± 0.04 |
| llama 8B IN | 3.99 GiB | 8.03 B | CPU     | 4       | tg 128 | 19.32 ± 0.01 |

build: 70c312d (2854)

idreamerhx (Author) commented

Hi, I found T-MAC faster for 2-bit. However, all of the T-MAC models generate garbled output.
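
A quick sanity check (just a sketch, assuming the perplexity tool from the bundled llama.cpp build and a wikitext test file are available) would be to compare the perplexity of the converted model against the Q4_0 one; a model that only emits garbage should show a dramatically higher value:

# placeholder paths
./3rdparty/llama.cpp/build/bin/perplexity -m Llama-3-8b-instruct-EfficientQAT-w4g128/ggml-model.in.gguf -f wiki.test.raw -t 4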


idreamerhx commented Sep 10, 2024

llama.cpp built at the same commit as T-MAC's bundled llama.cpp (70c312d), for comparison:

(base) root@4b6ac2cf95f0:/opt/llama.cpp-70c312d/build# ./bin/llama-bench -m Llama-2-7b-chat-hf.Q2_K.gguf -t 4

| model                  | size     | params | backend | threads | test   | t/s          |
| ---------------------- | -------- | ------ | ------- | ------- | ------ | ------------ |
| llama 7B Q2_K - Medium | 2.36 GiB | 6.74 B | CPU     | 4       | pp 512 | 23.16 ± 0.13 |
| llama 7B Q2_K - Medium | 2.36 GiB | 6.74 B | CPU     | 4       | tg 128 | 17.90 ± 0.01 |

build: unknown (0)

(base) root@4b6ac2cf95f0:/opt/llama.cpp-70c312d/build# ./bin/llama-bench -m Llama-2-7b-chat-hf.Q4_0.gguf -t 4

| model         | size     | params | backend | threads | test   | t/s          |
| ------------- | -------- | ------ | ------- | ------- | ------ | ------------ |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     | 4       | pp 512 | 19.60 ± 0.01 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     | 4       | tg 128 | 13.62 ± 0.00 |

build: unknown (0)
