# Description

# Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
# Expected Behavior
I have an old blade CPU cluster of HS22s, each running dual Xeon E5540 CPUs; the best node has 72GB of RAM. I have been trying to use llama.cpp with my existing OpenMPI install to distribute Mistral-7B across the cluster and see whether it improves the inference rate.
I was inspired by the person in #2164 who successfully ran llama.cpp across a bunch of Raspberry Pis, so it seems like this should be possible.
I ran

```sh
make CC=mpicc CXX=mpicxx LLAMA_MPI=1 -j
```

to build it with OpenMPI support, and then tried to run it on the model I downloaded:

```sh
mpirun -hostfile ~/hostfile -n 2 ./main -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -i --n-predict -2 -p "Hello, how are you?"
```
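For reference, the hostfile is the standard OpenMPI format listing the blades, something along these lines (hostnames and slot counts here are illustrative, not my exact file):

```
blade8 slots=1
blade9 slots=1
```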
# Current Behavior
It seemed to load the model and start setting things up, but then it aborted. Here's what I got:
```
cluster@blade8:~/llama.cpp$ mpirun -hostfile ~/hostfile -n 2 ./main -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -i --n-predict -2 -p "Hello, how are you?"
Log start
main: build = 1415 (6336701)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1698096051
Log start
main: build = 1415 (6336701)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1698096051
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /home/cluster/models/mistral-7b-v0.1.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor 0: token_embd.weight q4_K [ 4096, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight q4_K [ 4096, 1024, 1, 1 ]
...
llama_model_loader: - tensor 290: output.weight q6_K [ 4096, 32000, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: llama.rope.freq_base f32
llama_model_loader: - kv 11: general.file_type u32
llama_model_loader: - kv 12: tokenizer.ggml.model str
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr
llama_model_loader: - kv 14: tokenizer.ggml.scores arr
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv 19: general.quantization_version u32
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q4_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-v0.1
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.10 MB
llm_load_tensors: mem required = 4165.46 MB
...............................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 256.00 MB
llama_new_context_with_model: compute buffer total size = 162.13 MB
GGML_ASSERT: llama.cpp:5876: false && "not implemented"
[blade8:09703] *** Process received signal ***
[blade8:09703] Signal: Aborted (6)
[blade8:09703] Signal code: (-6)
[blade8:09703] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f90ca49f520]
[blade8:09703] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f90ca4f39fc]
[blade8:09703] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f90ca49f476]
[blade8:09703] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f90ca4857f3]
[blade8:09703] [ 4] ./main(+0x6aa93)[0x55b88ad8ea93]
[blade8:09703] [ 5] ./main(+0xa0628)[0x55b88adc4628]
[blade8:09703] [ 6] ./main(+0x137a9)[0x55b88ad377a9]
[blade8:09703] [ 7] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f90ca486d90]
[blade8:09703] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f90ca486e40]
[blade8:09703] [ 9] ./main(+0x1b3c5)[0x55b88ad3f3c5]
[blade8:09703] *** End of error message ***
Aborted (core dumped)
```
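In case it helps triage, the anonymous `./main(+0x...)` frames in that trace can in principle be resolved back to functions with addr2line (generic binutils usage, sketched below; it only gives meaningful names/lines if the binary kept symbols/debug info):

```sh
# resolve one of the raw offsets from the backtrace, e.g. ./main(+0x6aa93)
# -f prints the function name, -C demangles C++, -e names the binary
addr2line -f -C -e ./main 0x6aa93
```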
I checked line 5876 in llama.cpp and the code surrounding it is this:
```cpp
// ... other code
GGML_ASSERT(n_tokens <= n_batch);
int n_threads = n_tokens == 1 ? cparams.n_threads : cparams.n_threads_batch;
GGML_ASSERT((!batch.token && batch.embd) || (batch.token && !batch.embd)); // NOLINT
const int64_t t_start_us = ggml_time_us();
#ifdef GGML_USE_MPI
// TODO: needs fix after #3228
GGML_ASSERT(false && "not implemented");
//ggml_mpi_eval_init(lctx.ctx_mpi, &n_tokens, &n_past, &n_threads);
#endif
GGML_ASSERT(n_threads > 0);
auto & kv_self = lctx.kv_self;
GGML_ASSERT(!!kv_self.ctx);
// ... more other code
```
Line 5876 is the one with `GGML_ASSERT(false && "not implemented");`.
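Since that assert only exists under `GGML_USE_MPI`, I assume a plain non-MPI build of the same commit would skip this code path entirely. A minimal sanity check would be something along these lines (sketch only, single node):

```sh
# rebuild without LLAMA_MPI; the failing GGML_ASSERT sits inside #ifdef GGML_USE_MPI
make clean && make -j
# run the same model on one blade
./main -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf -c 2048 -p "Hello, how are you?"
```

If that runs, the model file itself is fine and the failure is specific to the MPI path.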
# Environment and Context
System:
- Ubuntu Server 22.04 LTS
- HS22 with dual Xeon E5540 processors and 72GB RAM
- OpenMPI 4.1.2
- Physical (or virtual) hardware you are using, e.g. for Linux:
```
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5540 @ 2.53GHz
CPU family: 6
Model: 26
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 2
Stepping: 5
CPU max MHz: 2527.0000
CPU min MHz: 1596.0000
BogoMIPS: 5066.73
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid dtherm flush_l1d
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 256 KiB (8 instances)
L1i: 256 KiB (8 instances)
L2: 2 MiB (8 instances)
L3: 16 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-3,8-11
NUMA node1 CPU(s): 4-7,12-15
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Meltdown: Mitigation; PTI
Mmio stale data: Unknown: No mitigations
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
```
- Operating System, e.g. for Linux:
```
$ uname -a
Linux blade8 5.15.0-87-generic #97-Ubuntu SMP Mon Oct 2 21:09:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
```
- SDK version, e.g. for Linux:
```
$ python3 --version
Python 3.10.12

$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu

$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
```