# Description

# Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
# Expected Behavior
I have an old blade CPU cluster of HS22s, each running dual Xeon E5540 CPUs; the best node has 72GB of RAM. I have been trying to use llama.cpp with my existing OpenMPI install to distribute Mistral-7B across the cluster and see whether it improves the inference rate.
I was inspired by the person in #2164 who successfully ran llama.cpp across a bunch of Raspberry Pis, so it seems like this should be possible.
I ran

```sh
make CC=mpicc CXX=mpicxx LLAMA_MPI=1 -j
```

to build it with OpenMPI support, and then tried to run it on the model I downloaded:

```sh
mpirun -hostfile ~/hostfile -n 2 ./main -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -i --n-predict -2 -p "Hello, how are you?"
```
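For reference, the hostfile is the standard OpenMPI format listing the blades, something along these lines (hostnames and slot counts here are illustrative, not my exact file):

```
blade8 slots=1
blade9 slots=1
```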
# Current Behavior
It seemed to load the model and start setting things up, but then it aborted. Here's what I got:
```
cluster@blade8:~/llama.cpp$ mpirun -hostfile ~/hostfile -n 2 ./main -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -i --n-predict -2 -p "Hello, how are you?"
Log start
main: build = 1415 (6336701)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1698096051
Log start
main: build = 1415 (6336701)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1698096051
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /home/cluster/models/mistral-7b-v0.1.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor 0: token_embd.weight q4_K [ 4096, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight q4_K [ 4096, 1024, 1, 1 ]
...
llama_model_loader: - tensor 290: output.weight q6_K [ 4096, 32000, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: llama.rope.freq_base f32
llama_model_loader: - kv 11: general.file_type u32
llama_model_loader: - kv 12: tokenizer.ggml.model str
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr
llama_model_loader: - kv 14: tokenizer.ggml.scores arr
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv 19: general.quantization_version u32
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q4_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-v0.1
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.10 MB
llm_load_tensors: mem required = 4165.46 MB
...............................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 256.00 MB
llama_new_context_with_model: compute buffer total size = 162.13 MB
GGML_ASSERT: llama.cpp:5876: false && "not implemented"
[blade8:09703] *** Process received signal ***
[blade8:09703] Signal: Aborted (6)
[blade8:09703] Signal code: (-6)
[blade8:09703] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f90ca49f520]
[blade8:09703] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f90ca4f39fc]
[blade8:09703] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f90ca49f476]
[blade8:09703] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f90ca4857f3]
[blade8:09703] [ 4] ./main(+0x6aa93)[0x55b88ad8ea93]
[blade8:09703] [ 5] ./main(+0xa0628)[0x55b88adc4628]
[blade8:09703] [ 6] ./main(+0x137a9)[0x55b88ad377a9]
[blade8:09703] [ 7] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f90ca486d90]
[blade8:09703] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f90ca486e40]
[blade8:09703] [ 9] ./main(+0x1b3c5)[0x55b88ad3f3c5]
[blade8:09703] *** End of error message ***
Aborted (core dumped)
```
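In case it helps triage, the anonymous `./main(+0x...)` frames in that trace can in principle be resolved back to functions with addr2line (generic binutils usage, sketched below; it only gives meaningful names/lines if the binary kept symbols/debug info):

```sh
# resolve one of the raw offsets from the backtrace, e.g. ./main(+0x6aa93)
# -f prints the function name, -C demangles C++, -e names the binary
addr2line -f -C -e ./main 0x6aa93
```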
I checked line 5876 in llama.cpp and the code surrounding it is this:
```cpp
// ... other code
GGML_ASSERT(n_tokens <= n_batch);
int n_threads = n_tokens == 1 ? cparams.n_threads : cparams.n_threads_batch;
GGML_ASSERT((!batch.token && batch.embd) || (batch.token && !batch.embd)); // NOLINT
const int64_t t_start_us = ggml_time_us();
#ifdef GGML_USE_MPI
// TODO: needs fix after #3228
GGML_ASSERT(false && "not implemented");
//ggml_mpi_eval_init(lctx.ctx_mpi, &n_tokens, &n_past, &n_threads);
#endif
GGML_ASSERT(n_threads > 0);
auto & kv_self = lctx.kv_self;
GGML_ASSERT(!!kv_self.ctx);
// ... more other code
```
Line 5876 is the one with `GGML_ASSERT(false && "not implemented");`.
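Since that assert only exists under `GGML_USE_MPI`, I assume a plain non-MPI build of the same commit would skip this code path entirely. A minimal sanity check would be something along these lines (sketch only, single node):

```sh
# rebuild without LLAMA_MPI; the failing GGML_ASSERT sits inside #ifdef GGML_USE_MPI
make clean && make -j
# run the same model on one blade
./main -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf -c 2048 -p "Hello, how are you?"
```

If that runs, the model file itself is fine and the failure is specific to the MPI path.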
# Environment and Context
System:
- Ubuntu Server 22.04 LTS
- HS22 with dual Xeon E5540 processors and 72GB RAM
- OpenMPI 4.1.2
- Physical (or virtual) hardware you are using, e.g. for Linux:
```
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5540 @ 2.53GHz
CPU family: 6
Model: 26
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 2
Stepping: 5
CPU max MHz: 2527.0000
CPU min MHz: 1596.0000
BogoMIPS: 5066.73
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid dtherm flush_l1d
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 256 KiB (8 instances)
L1i: 256 KiB (8 instances)
L2: 2 MiB (8 instances)
L3: 16 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-3,8-11
NUMA node1 CPU(s): 4-7,12-15
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Meltdown: Mitigation; PTI
Mmio stale data: Unknown: No mitigations
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
```
- Operating System, e.g. for Linux:
```
$ uname -a
Linux blade8 5.15.0-87-generic #97-Ubuntu SMP Mon Oct 2 21:09:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
```
- SDK version, e.g. for Linux:
```
$ python3 --version
Python 3.10.12

$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu

$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
```