Add notes to SGLang installation to use the xgrammar grammar backend for memory efficiency #7977

Open · wants to merge 1 commit into base: main
docs/docs/index.md (1 addition, 1 deletion)

@@ -73,7 +73,7 @@ DSPy stands for Declarative Self-improving Python. Instead of brittle prompts, y
> pip install "sglang[all]"
> pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

-> CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --port 7501 --model-path meta-llama/Llama-3.1-8B-Instruct
+> CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --port 7501 --model-path meta-llama/Llama-3.1-8B-Instruct --grammar-backend xgrammar
```

If you don't have access from Meta to download `meta-llama/Llama-3.1-8B-Instruct`, use `Qwen/Qwen2.5-7B-Instruct` for example.
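Once the server is up, you can sanity-check the endpoint before pointing DSPy at it. A minimal sketch, assuming the server above is running locally on port 7501 and exposes SGLang's standard OpenAI-compatible routes:

```python
import requests

# Query the OpenAI-compatible model list served by SGLang.
resp = requests.get("http://localhost:7501/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # e.g. ['meta-llama/Llama-3.1-8B-Instruct']
```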
docs/docs/learn/programming/language_models.md (1 addition, 1 deletion)

@@ -48,7 +48,7 @@ dspy.configure(lm=lm)
> pip install "sglang[all]"
> pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

-> CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --port 7501 --model-path meta-llama/Meta-Llama-3-8B-Instruct
+> CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --port 7501 --model-path meta-llama/Meta-Llama-3-8B-Instruct --grammar-backend xgrammar
```

Then, connect to it from your DSPy code as an OpenAI-compatible endpoint.
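For reference, a minimal sketch of that connection (the port and model name are assumptions that must match your launch command; the pattern mirrors `examples/migration.ipynb` below):

```python
import dspy

# The URL must match the --port passed to sglang.launch_server, and the
# model name (after the "openai/" prefix) must match --model-path.
sglang_url = "http://localhost:7501/v1"
lm = dspy.LM("openai/meta-llama/Meta-Llama-3-8B-Instruct", api_base=sglang_url)
dspy.configure(lm=lm)
```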
docs/docs/tutorials/agents/index.ipynb (1 addition, 1 deletion)

@@ -58,7 +58,7 @@
"\n",
"A model like this is not very reliable out of the box for long or complex agent loops. However, it's extremely fast and cheap to host, as it needs very little RAM.\n",
"\n",
"You might be able to host the 3B model on your laptop with Ollama, on your GPU server with SGLang, or via a provider that hosts it for you like Databricks or Together.\n",
"You might be able to host the 3B model on your laptop with Ollama, on your GPU server with SGLang (Note: launch with `--grammar-backend xgrammar`), or via a provider that hosts it for you like Databricks or Together.\n",
"\n",
"In the snippet below, we'll configure our main LM as `Llama-3.2-3B`. We'll also set up a larger LM, i.e. `GPT-4o`, as a teacher that we'll invoke a very small number of times to help teach the small LM."
]
docs/docs/tutorials/classification_finetuning/index.ipynb (4 additions, 2 deletions)

@@ -14,14 +14,16 @@
"\n",
"Install the latest DSPy via `pip install -U --pre dspy` and follow along. This tutorial depends on DSPy 2.6.0 (pre-release).\n",
"\n",
"This tutorial requires a local GPU at the moment for inference, though we plan to support ollama serving for finetuned models as well.\n",
"This tutorial requires a local GPU at the moment for inference, though we plan to support Ollama serving for finetuned models as well.\n",
"\n",
"You will also need the following dependencies:\n",
"\n",
"```shell\n",
"> pip install \"sglang[all]\"; pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/\n",
"> pip install -U torch transformers accelerate trl peft\n",
"```"
"```\n",
"\n",
"(Note: launch SGLang with `--grammar-backend xgrammar`)"
]
},
{
docs/docs/tutorials/multihop_search/index.ipynb (1 addition, 1 deletion)

@@ -59,7 +59,7 @@
"source": [
"In this tutorial, we'll use a small local LM, Meta's `Llama-3.1-8B-Instruct` which has 8 billion parameters.\n",
"\n",
"You might be able to host the 8B model on your laptop with Ollama, on your GPU server with SGLang, or via a provider that hosts it for you like Databricks or Together.\n",
"You might be able to host the 8B model on your laptop with Ollama, on your GPU server with SGLang (Note: launch with `--grammar-backend xgrammar`), or via a provider that hosts it for you like Databricks or Together.\n",
"\n",
"In the snippet below, we'll configure this small model as our main LM. We'll also set up a larger LM, i.e. `GPT-4o`, as a teacher that we'll invoke a very small number of times to help teach the small LM. This is technically not necessary; the small model can typically teach itself tasks like this in DSPy. But using a larger teacher will give us some peace of mind, where the initial system or optimizer configuration doesn't matter as much."
]
dspy/clients/lm_local.py (2 additions, 1 deletion)

@@ -57,7 +57,8 @@ def launch(lm: "LM", launch_kwargs: Optional[Dict[str, Any]] = None):
)
port = get_free_port()
timeout = launch_kwargs.get("timeout", 1800)
command = f"python -m sglang.launch_server --model-path {model} --port {port} --host 0.0.0.0"
#NOTE - Launched with grammar-backend xgrammar as it is more memory-friendly
command = f"python -m sglang.launch_server --model-path {model} --port {port} --host 0.0.0.0 --grammar-backend xgrammar"

# We will manually stream & capture logs.
process = subprocess.Popen(
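The hunk is truncated here. `get_free_port` is referenced above but not shown in this diff; a typical implementation of such a helper (an assumption, not necessarily the repository's actual code) binds to port 0 so the OS assigns an unused port:

```python
import socket

def get_free_port() -> int:
    # Bind to port 0; the OS picks an unused ephemeral port.
    # Release the socket and return the port number.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

This pattern has a small race (the port could be claimed between release and the server binding it), which is usually acceptable for launching a local dev server.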
examples/migration.ipynb (2 additions, 0 deletions)

@@ -101,6 +101,8 @@
"metadata": {},
"outputs": [],
"source": [
"#NOTE - Launch your SGLang server with `--grammar-backend xgrammar` as it is more memory-friendly\n",
"\n",
"sglang_port = 7501\n",
"sglang_url = f\"http://localhost:{sglang_port}/v1\"\n",
"sglang_llama = dspy.LM(\"openai/meta-llama/Meta-Llama-3-8B-Instruct\", api_base=sglang_url)\n",