Add notes to SGLang installation to use the xgrammar grammar backend for memory efficiency #7977

Open · wants to merge 1 commit into base: main
docs/docs/index.md (1 addition, 1 deletion)

@@ -73,7 +73,7 @@ DSPy stands for Declarative Self-improving Python. Instead of brittle prompts, y
> pip install "sglang[all]"
> pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

-> CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --port 7501 --model-path meta-llama/Llama-3.1-8B-Instruct
+> CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --port 7501 --model-path meta-llama/Llama-3.1-8B-Instruct --grammar-backend xgrammar
```

If you don't have access from Meta to download `meta-llama/Llama-3.1-8B-Instruct`, use `Qwen/Qwen2.5-7B-Instruct` for example.
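Once the server is up, you can sanity-check the endpoint before pointing DSPy at it. A minimal sketch, assuming the server above is running locally on port 7501 and exposes SGLang's standard OpenAI-compatible routes:

```python
import requests

# Query the OpenAI-compatible model list served by SGLang.
resp = requests.get("http://localhost:7501/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # e.g. ['meta-llama/Llama-3.1-8B-Instruct']
```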
docs/docs/learn/programming/language_models.md (1 addition, 1 deletion)

@@ -48,7 +48,7 @@ dspy.configure(lm=lm)
> pip install "sglang[all]"
> pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

-> CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --port 7501 --model-path meta-llama/Meta-Llama-3-8B-Instruct
+> CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --port 7501 --model-path meta-llama/Meta-Llama-3-8B-Instruct --grammar-backend xgrammar
```

Then, connect to it from your DSPy code as an OpenAI-compatible endpoint.
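For reference, a minimal sketch of that connection (the port and model name are assumptions that must match your launch command; the pattern mirrors `examples/migration.ipynb` below):

```python
import dspy

# The URL must match the --port passed to sglang.launch_server, and the
# model name (after the "openai/" prefix) must match --model-path.
sglang_url = "http://localhost:7501/v1"
lm = dspy.LM("openai/meta-llama/Meta-Llama-3-8B-Instruct", api_base=sglang_url)
dspy.configure(lm=lm)
```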
docs/docs/tutorials/agents/index.ipynb (1 addition, 1 deletion)

@@ -58,7 +58,7 @@
"\n",
"A model like this is not very reliable out of the box for long or complex agent loops. However, it's extremely fast and cheap to host, as it needs very little RAM.\n",
"\n",
"You might be able to host the 3B model on your laptop with Ollama, on your GPU server with SGLang, or via a provider that hosts it for you like Databricks or Together.\n",
"You might be able to host the 3B model on your laptop with Ollama, on your GPU server with SGLang (Note: launch with `--grammar-backend xgrammar`), or via a provider that hosts it for you like Databricks or Together.\n",
"\n",
"In the snippet below, we'll configure our main LM as `Llama-3.2-3B`. We'll also set up a larger LM, i.e. `GPT-4o`, as a teacher that we'll invoke a very small number of times to help teach the small LM."
]
docs/docs/tutorials/classification_finetuning/index.ipynb (4 additions, 2 deletions)

@@ -14,14 +14,16 @@
"\n",
"Install the latest DSPy via `pip install -U --pre dspy` and follow along. This tutorial depends on DSPy 2.6.0 (pre-release).\n",
"\n",
"This tutorial requires a local GPU at the moment for inference, though we plan to support ollama serving for finetuned models as well.\n",
"This tutorial requires a local GPU at the moment for inference, though we plan to support Ollama serving for finetuned models as well.\n",
"\n",
"You will also need the following dependencies:\n",
"\n",
"```shell\n",
"> pip install \"sglang[all]\"; pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/\n",
"> pip install -U torch transformers accelerate trl peft\n",
"```"
"```\n",
"\n",
"(Note: launch SGLang with `--grammar-backend xgrammar`)"
]
},
{
docs/docs/tutorials/multihop_search/index.ipynb (1 addition, 1 deletion)

@@ -59,7 +59,7 @@
"source": [
"In this tutorial, we'll use a small local LM, Meta's `Llama-3.1-8B-Instruct` which has 8 billion parameters.\n",
"\n",
"You might be able to host the 8B model on your laptop with Ollama, on your GPU server with SGLang, or via a provider that hosts it for you like Databricks or Together.\n",
"You might be able to host the 8B model on your laptop with Ollama, on your GPU server with SGLang (Note: launch with `--grammar-backend xgrammar`), or via a provider that hosts it for you like Databricks or Together.\n",
"\n",
"In the snippet below, we'll configure this small model as our main LM. We'll also set up a larger LM, i.e. `GPT-4o`, as a teacher that we'll invoke a very small number of times to help teach the small LM. This is technically not necessary; the small model can typically teach itself tasks like this in DSPy. But using a larger teacher will give us some peace of mind, where the initial system or optimizer configuration doesn't matter as much."
]
dspy/clients/lm_local.py (2 additions, 1 deletion)

@@ -57,7 +57,8 @@ def launch(lm: "LM", launch_kwargs: Optional[Dict[str, Any]] = None):
)
port = get_free_port()
timeout = launch_kwargs.get("timeout", 1800)
command = f"python -m sglang.launch_server --model-path {model} --port {port} --host 0.0.0.0"
#NOTE - Launched with grammar-backend xgrammar as it is more memory-friendly
command = f"python -m sglang.launch_server --model-path {model} --port {port} --host 0.0.0.0 --grammar-backend xgrammar"

# We will manually stream & capture logs.
process = subprocess.Popen(
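The hunk is truncated here. `get_free_port` is referenced above but not shown in this diff; a typical implementation of such a helper (an assumption, not necessarily the repository's actual code) binds to port 0 so the OS assigns an unused port:

```python
import socket

def get_free_port() -> int:
    # Bind to port 0; the OS picks an unused ephemeral port.
    # Release the socket and return the port number.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

This pattern has a small race (the port could be claimed between release and the server binding it), which is usually acceptable for launching a local dev server.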
examples/migration.ipynb (2 additions, 0 deletions)

@@ -101,6 +101,8 @@
"metadata": {},
"outputs": [],
"source": [
"#NOTE - Launch your SGLang server with `--grammar-backend xgrammar` as it is more memory-friendly\n",
"\n",
"sglang_port = 7501\n",
"sglang_url = f\"http://localhost:{sglang_port}/v1\"\n",
"sglang_llama = dspy.LM(\"openai/meta-llama/Meta-Llama-3-8B-Instruct\", api_base=sglang_url)\n",