
Commit 642330a

llama : add enum for built-in chat templates (ggml-org#10623)
* llama : add enum for supported chat templates
* use "built-in" instead of "supported"
* arg: print list of built-in templates
* fix test
* update server README
1 parent 8648c52 commit 642330a

5 files changed: +307 -104 lines changed

common/arg.cpp

Lines changed: 17 additions & 3 deletions
```diff
@@ -348,6 +348,18 @@ bool common_params_parse(int argc, char ** argv, common_params & params, llama_e
     return true;
 }
 
+static std::string list_builtin_chat_templates() {
+    std::vector<const char *> supported_tmpl;
+    int32_t res = llama_chat_builtin_templates(nullptr, 0);
+    supported_tmpl.resize(res);
+    res = llama_chat_builtin_templates(supported_tmpl.data(), supported_tmpl.size());
+    std::ostringstream msg;
+    for (auto & tmpl : supported_tmpl) {
+        msg << tmpl << (&tmpl == &supported_tmpl.back() ? "" : ", ");
+    }
+    return msg.str();
+}
+
 common_params_context common_params_parser_init(common_params & params, llama_example ex, void(*print_usage)(int, char **)) {
     // load dynamic backends
     ggml_backend_load_all();
@@ -1814,9 +1826,11 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
     ).set_examples({LLAMA_EXAMPLE_SERVER}));
     add_opt(common_arg(
         {"--chat-template"}, "JINJA_TEMPLATE",
-        "set custom jinja chat template (default: template taken from model's metadata)\n"
-        "if suffix/prefix are specified, template will be disabled\n"
-        "only commonly used templates are accepted:\nhttps://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template",
+        string_format(
+            "set custom jinja chat template (default: template taken from model's metadata)\n"
+            "if suffix/prefix are specified, template will be disabled\n"
+            "list of built-in templates:\n%s", list_builtin_chat_templates().c_str()
+        ),
         [](common_params & params, const std::string & value) {
             if (!common_chat_verify_template(value)) {
                 throw std::runtime_error(string_format(
```

examples/server/README.md

Lines changed: 10 additions & 1 deletion
```diff
@@ -69,6 +69,8 @@ The project is under active development, and we are [looking for feedback and co
 | `--mlock` | force system to keep model in RAM rather than swapping or compressing<br/>(env: LLAMA_ARG_MLOCK) |
 | `--no-mmap` | do not memory-map model (slower load but may reduce pageouts if not using mlock)<br/>(env: LLAMA_ARG_NO_MMAP) |
 | `--numa TYPE` | attempt optimizations that help on some NUMA systems<br/>- distribute: spread execution evenly over all nodes<br/>- isolate: only spawn threads on CPUs on the node that execution started on<br/>- numactl: use the CPU map provided by numactl<br/>if run without this previously, it is recommended to drop the system page cache before using this<br/>see https://github.com/ggerganov/llama.cpp/issues/1437<br/>(env: LLAMA_ARG_NUMA) |
+| `-dev, --device <dev1,dev2,..>` | comma-separated list of devices to use for offloading (none = don't offload)<br/>use --list-devices to see a list of available devices<br/>(env: LLAMA_ARG_DEVICE) |
+| `--list-devices` | print list of available devices and exit |
 | `-ngl, --gpu-layers, --n-gpu-layers N` | number of layers to store in VRAM<br/>(env: LLAMA_ARG_N_GPU_LAYERS) |
 | `-sm, --split-mode {none,layer,row}` | how to split the model across multiple GPUs, one of:<br/>- none: use one GPU only<br/>- layer (default): split layers and KV across GPUs<br/>- row: split rows across GPUs<br/>(env: LLAMA_ARG_SPLIT_MODE) |
 | `-ts, --tensor-split N0,N1,N2,...` | fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1<br/>(env: LLAMA_ARG_TENSOR_SPLIT) |
@@ -158,9 +160,16 @@ The project is under active development, and we are [looking for feedback and co
 | `--props` | enable changing global properties via POST /props (default: disabled)<br/>(env: LLAMA_ARG_ENDPOINT_PROPS) |
 | `--no-slots` | disables slots monitoring endpoint<br/>(env: LLAMA_ARG_NO_ENDPOINT_SLOTS) |
 | `--slot-save-path PATH` | path to save slot kv cache (default: disabled) |
-| `--chat-template JINJA_TEMPLATE` | set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted:<br/>https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
+| `--chat-template JINJA_TEMPLATE` | set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>list of built-in templates:<br/>chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, exaone3, gemma, granite, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, monarch, openchat, orion, phi3, rwkv-world, vicuna, vicuna-orca, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
 | `-sps, --slot-prompt-similarity SIMILARITY` | how much the prompt of a request must match the prompt of a slot in order to use that slot (default: 0.50, 0.0 = disabled)<br/> |
 | `--lora-init-without-apply` | load LoRA adapters without applying them (apply later via POST /lora-adapters) (default: disabled) |
+| `--draft-max, --draft, --draft-n N` | number of tokens to draft for speculative decoding (default: 16) |
+| `--draft-min, --draft-n-min N` | minimum number of draft tokens to use for speculative decoding (default: 5) |
+| `--draft-p-min P` | minimum speculative decoding probability (greedy) (default: 0.9) |
+| `-cd, --ctx-size-draft N` | size of the prompt context for the draft model (default: 0, 0 = loaded from model) |
+| `-devd, --device-draft <dev1,dev2,..>` | comma-separated list of devices to use for offloading the draft model (none = don't offload)<br/>use --list-devices to see a list of available devices |
+| `-ngld, --gpu-layers-draft, --n-gpu-layers-draft N` | number of layers to store in VRAM for the draft model |
+| `-md, --model-draft FNAME` | draft model for speculative decoding (default: unused) |
 
 
 Note: If both command line argument and environment variable are both set for the same param, the argument will take precedence over env var.
```

include/llama.h

Lines changed: 3 additions & 0 deletions
```diff
@@ -990,6 +990,9 @@ extern "C" {
             char * buf,
             int32_t length);
 
+    // Get list of built-in chat templates
+    int32_t llama_chat_builtin_templates(const char ** output, size_t len);
+
     //
     // Sampling API
     //
```
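The header only documents the declaration, but the `list_builtin_chat_templates()` helper added in `common/arg.cpp` above implies a size-query-then-fill calling convention: call once with a null output buffer to get the number of built-in templates, then call again with an array of that size. A minimal standalone sketch of that pattern (the `main` wrapper and variable names are illustrative, not part of the commit):

```cpp
#include "llama.h"

#include <cstdio>
#include <vector>

int main() {
    // First call with a null output buffer returns the number of built-in templates.
    const int32_t n_tmpl = llama_chat_builtin_templates(nullptr, 0);

    // Second call fills a pre-sized array with pointers to the template names.
    std::vector<const char *> tmpls(n_tmpl);
    llama_chat_builtin_templates(tmpls.data(), tmpls.size());

    for (const char * name : tmpls) {
        std::printf("%s\n", name); // e.g. "chatml", "llama3", "zephyr", ...
    }
    return 0;
}
```

This two-call shape keeps the API C-compatible: the caller owns the array, and the library presumably only hands out pointers to its internal template-name strings, so nothing needs to be freed.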
