llama : add enum for built-in chat templates (ggml-org#10623)
* llama : add enum for supported chat templates
* use "built-in" instead of "supported"
* arg: print list of built-in templates
* fix test
* update server README
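To make the intent concrete, here is a minimal C++ sketch of the idea behind the commit: one enum value per built-in template name accepted by `--chat-template`, resolved through a string lookup so unknown names can be rejected with a clear error. The type and function names (`chat_template_type`, `chat_template_from_name`) are illustrative placeholders, not the actual symbols introduced by this change, and only a few of the built-in templates are shown.

```cpp
// Illustrative sketch only (hypothetical names, not the actual llama.cpp symbols):
// map the template names accepted by --chat-template onto an enum instead of
// comparing raw strings throughout the code.
#include <string>
#include <unordered_map>

enum class chat_template_type {
    CHATML,
    LLAMA2,
    LLAMA3,
    MISTRAL_V7,
    ZEPHYR,
    UNKNOWN,
    // ... one value per built-in template listed in the README below
};

// Resolve a --chat-template argument to its enum value; UNKNOWN lets the caller
// print the list of built-in templates and exit instead of failing silently.
inline chat_template_type chat_template_from_name(const std::string & name) {
    static const std::unordered_map<std::string, chat_template_type> lut = {
        { "chatml",     chat_template_type::CHATML     },
        { "llama2",     chat_template_type::LLAMA2     },
        { "llama3",     chat_template_type::LLAMA3     },
        { "mistral-v7", chat_template_type::MISTRAL_V7 },
        { "zephyr",     chat_template_type::ZEPHYR     },
    };
    const auto it = lut.find(name);
    return it == lut.end() ? chat_template_type::UNKNOWN : it->second;
}
```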
examples/server/README.md (10 additions, 1 deletion)
@@ -69,6 +69,8 @@ The project is under active development, and we are [looking for feedback and co
 |`--mlock`| force system to keep model in RAM rather than swapping or compressing<br/>(env: LLAMA_ARG_MLOCK) |
 |`--no-mmap`| do not memory-map model (slower load but may reduce pageouts if not using mlock)<br/>(env: LLAMA_ARG_NO_MMAP) |
 |`--numa TYPE`| attempt optimizations that help on some NUMA systems<br/>- distribute: spread execution evenly over all nodes<br/>- isolate: only spawn threads on CPUs on the node that execution started on<br/>- numactl: use the CPU map provided by numactl<br/>if run without this previously, it is recommended to drop the system page cache before using this<br/>see https://github.com/ggerganov/llama.cpp/issues/1437<br/>(env: LLAMA_ARG_NUMA) |
+|`-dev, --device <dev1,dev2,..>`| comma-separated list of devices to use for offloading (none = don't offload)<br/>use --list-devices to see a list of available devices<br/>(env: LLAMA_ARG_DEVICE) |
+|`--list-devices`| print list of available devices and exit |
 |`-ngl, --gpu-layers, --n-gpu-layers N`| number of layers to store in VRAM<br/>(env: LLAMA_ARG_N_GPU_LAYERS) |
 |`-sm, --split-mode {none,layer,row}`| how to split the model across multiple GPUs, one of:<br/>- none: use one GPU only<br/>- layer (default): split layers and KV across GPUs<br/>- row: split rows across GPUs<br/>(env: LLAMA_ARG_SPLIT_MODE) |
 |`-ts, --tensor-split N0,N1,N2,...`| fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1<br/>(env: LLAMA_ARG_TENSOR_SPLIT) |
@@ -158,9 +160,16 @@ The project is under active development, and we are [looking for feedback and co
 |`--props`| enable changing global properties via POST /props (default: disabled)<br/>(env: LLAMA_ARG_ENDPOINT_PROPS) |
 |`--slot-save-path PATH`| path to save slot kv cache (default: disabled) |
-|`--chat-template JINJA_TEMPLATE`| set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted:<br/>https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
+|`--chat-template JINJA_TEMPLATE`| set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>list of built-in templates:<br/>chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, exaone3, gemma, granite, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, monarch, openchat, orion, phi3, rwkv-world, vicuna, vicuna-orca, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
 |`-sps, --slot-prompt-similarity SIMILARITY`| how much the prompt of a request must match the prompt of a slot in order to use that slot (default: 0.50, 0.0 = disabled)<br/> |
 |`--lora-init-without-apply`| load LoRA adapters without applying them (apply later via POST /lora-adapters) (default: disabled) |
+|`--draft-max, --draft, --draft-n N`| number of tokens to draft for speculative decoding (default: 16) |
+|`--draft-min, --draft-n-min N`| minimum number of draft tokens to use for speculative decoding (default: 5) |
+|`--draft-p-min P`| minimum speculative decoding probability (greedy) (default: 0.9) |
+|`-cd, --ctx-size-draft N`| size of the prompt context for the draft model (default: 0, 0 = loaded from model) |
+|`-devd, --device-draft <dev1,dev2,..>`| comma-separated list of devices to use for offloading the draft model (none = don't offload)<br/>use --list-devices to see a list of available devices |
+|`-ngld, --gpu-layers-draft, --n-gpu-layers-draft N`| number of layers to store in VRAM for the draft model |
+|`-md, --model-draft FNAME`| draft model for speculative decoding (default: unused) |
 Note: If both command line argument and environment variable are both set for the same param, the argument will take precedence over env var.
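As a side note on the speculative-decoding flags documented above, the defaults they describe (16 drafted tokens, a 5-token minimum, a 0.9 probability floor, draft context taken from the model) can be pictured as one parameter struct. The sketch below is hypothetical and only restates those documented defaults; it is not the server's actual configuration code.

```cpp
// Hypothetical grouping of the speculative-decoding defaults documented above;
// not the actual llama.cpp / server structs.
struct speculative_params {
    int   draft_max      = 16;   // --draft-max: tokens drafted per decoding step
    int   draft_min      = 5;    // --draft-min: minimum number of draft tokens to use
    float draft_p_min    = 0.9f; // --draft-p-min: minimum drafting probability (greedy)
    int   ctx_size_draft = 0;    // -cd: 0 = take the context size from the draft model
};
```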