# Add vllm docs for mbridge ckpt #573
Merged
## Commits (12)
All commits are by oyilmaz-nvidia:

- `e915df8` Add vllm docs for mbridge ckpt
- `14b5c47` Add params
- `1a624a0` Update vllm.md
- `1574682` Update vllm.md
- `d1e3883` Update in-framework.md
- `70ffb6d` Update in-framework-ray.md
- `db79574` Update vllm.md
- `27760b5` Update in-framework.md
- `b318a05` Update in-framework-ray.md
- `478a536` Update in-framework.md
- `308a2aa` Merge branch 'main' into onur/add-mbridge-vllm-docs
- `f8c313d` Update vllm.md
````diff
@@ -1,4 +1,13 @@
 # Deploy Megatron-Bridge LLMs by Exporting to Inference Optimized Libraries
 
-**Note:** Support for exporting and deploying Megatron-Bridge models with TensorRT-LLM and vLLM is coming soon. Please check back for updates.
+Export-Deploy supports optimizing and deploying Megatron-Bridge checkpoints using inference-optimized libraries such as vLLM and TensorRT-LLM.
+
+```{toctree}
+:maxdepth: 1
+:titlesonly:
+
+vLLM <vllm.md>
+```
+
+**Note:** Support for exporting and deploying Megatron-Bridge models with TensorRT-LLM is coming soon. Please check back for updates.
 
````
**New file** (`@@ -0,0 +1,195 @@`):

# Deploy Megatron-Bridge LLMs with vLLM and Triton Inference Server

This section shows how to use scripts and APIs to export a Megatron-Bridge LLM to vLLM and deploy it with the NVIDIA Triton Inference Server.

## Quick Example

1. Follow the steps in [Generate a Megatron-Bridge Checkpoint](../gen_mbridge_ckpt.md) to generate a Megatron-Bridge Llama checkpoint.

2. In a terminal, go to the folder where the ``hf_llama31_8B_mbridge`` checkpoint is located. Pull down and run the Docker container image using the command shown below. Change the ``:vr`` tag to the version of the container you want to use:

```shell
docker pull nvcr.io/nvidia/nemo:vr

docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \
  -v ${PWD}/hf_llama31_8B_mbridge:/opt/checkpoints/hf_llama31_8B_mbridge/ \
  -w /opt/Export-Deploy \
  --name nemo-fw \
  nvcr.io/nvidia/nemo:vr
```

3. If vLLM is not already available in the container, install it by executing the following command inside it:

```shell
cd /opt/Export-Deploy
uv sync --inexact --link-mode symlink --locked --extra vllm $(cat /opt/uv_args.txt)
```

4. Run the following deployment script to verify that everything is working correctly. The script exports the Llama Megatron-Bridge checkpoint to vLLM and then serves it on the Triton server (an optional readiness check is shown after these steps):

```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \
    --model_path_id /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --triton_model_name llama \
    --tensor_parallelism_size 1
```

5. If the test yields a shared-memory-related error, increase the shared memory size using ``--shm-size`` (for example, gradually by 50%).

6. In a separate terminal, access the running container as follows:

```shell
docker exec -it nemo-fw bash
```

7. To send a query to the Triton server, run the following script:

```shell
python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py -mn llama -p "The capital of Canada is" -mat 50
```
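Before sending queries, you can optionally confirm that the server is up. The check below assumes the deployment exposes Triton's standard KServe v2 health endpoint on the configured ``--triton_port`` (8000 by default); if your container does not expose it, skip this step:

```shell
# Prints 200 once the Triton HTTP endpoint reports it is ready to accept requests.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
```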
## Use a Script to Deploy Megatron-Bridge LLMs on a Triton Server

You can deploy an LLM from a Megatron-Bridge checkpoint on Triton using the provided script.

### Export and Deploy a Megatron-Bridge LLM

The script exports the model to vLLM and then starts the service on Triton.

1. Start the container using the steps described in the **Quick Example** section.

2. To begin serving the downloaded model, run the following script:

```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \
    --model_path_id /opt/checkpoints/hf_llama31_8B_mbridge \
    --triton_model_name llama \
    --tensor_parallelism_size 1
```

The following parameters are defined in the ``deploy_vllm_triton.py`` script (an illustrative command that combines several of them follows this list):

- ``--model_path_id``: Path of a Megatron-Bridge checkpoint, or Hugging Face model ID or path. (Required)
- ``--tokenizer``: Tokenizer file if it is not provided in the checkpoint. (Optional)
- ``--lora_ckpt``: List of LoRA checkpoints in HF format. (Optional, can specify multiple)
- ``--tensor_parallelism_size``: Number of GPUs to use for tensor parallelism. Default is 1.
- ``--dtype``: Data type for the model in vLLM. Choices: "auto", "bfloat16", "float16", "float32". Default is "auto".
- ``--quantization``: Quantization method for vLLM. Choices: "awq", "gptq", "fp8". Default is None.
- ``--seed``: Random seed for reproducibility. Default is 0.
- ``--gpu_memory_utilization``: GPU memory utilization percentage for vLLM. Default is 0.9.
- ``--swap_space``: Size (GiB) of CPU memory per GPU to use as swap space. Default is 4.
- ``--cpu_offload_gb``: Size (GiB) of CPU memory to use for offloading model weights. Default is 0.
- ``--enforce_eager``: Whether to enforce eager execution. Default is False.
- ``--max_seq_len_to_capture``: Maximum sequence length covered by CUDA graphs. Default is 8192.
- ``--triton_model_name``: Name for the service/model on Triton. (Required)
- ``--triton_model_version``: Version for the service/model. Default is 1.
- ``--triton_port``: Port for the Triton server to listen for requests. Default is 8000.
- ``--triton_http_address``: HTTP address for the Triton server. Default is 0.0.0.0.
- ``--max_batch_size``: Maximum batch size of the model. Default is 8.
- ``--debug_mode``: Enable debug/verbose output. Default is False.
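For example, the command below is an illustrative sketch that combines several of the flags documented above; the values (two GPUs, bfloat16, a lower GPU memory fraction, a larger batch size) are placeholders to adapt to your hardware and model:

```shell
# Illustrative only: adjust the parallelism, dtype, memory, and batch settings for your setup.
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \
    --model_path_id /opt/checkpoints/hf_llama31_8B_mbridge \
    --triton_model_name llama \
    --triton_port 8000 \
    --tensor_parallelism_size 2 \
    --dtype bfloat16 \
    --gpu_memory_utilization 0.8 \
    --max_batch_size 16 \
    --seed 0
```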
3. Access the models with a Hugging Face token.

If you want to run inference using the StarCoder1, StarCoder2, or Llama 3 models, you need to generate a Hugging Face token that has access to these models. Visit [Hugging Face](https://huggingface.co/) for more information. After you have the token, perform one of the following steps.

- Log in to Hugging Face:

```shell
huggingface-cli login
```

- Or, set the HF_TOKEN environment variable (see the note after these options about passing the token into the container):

```shell
export HF_TOKEN=your_token_here
```
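If you choose the environment-variable route, the token must be visible inside the container where the scripts run. One hedged way to do that is to forward it with Docker's standard ``-e`` flag when starting the container, reusing the earlier ``docker run`` command; tools built on the Hugging Face Hub client pick up ``HF_TOKEN`` automatically:

```shell
# Assumes HF_TOKEN is already exported on the host shell; forwards it into the container.
docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \
  -e HF_TOKEN=${HF_TOKEN} \
  -v ${PWD}/hf_llama31_8B_mbridge:/opt/checkpoints/hf_llama31_8B_mbridge/ \
  -w /opt/Export-Deploy \
  --name nemo-fw \
  nvcr.io/nvidia/nemo:vr
```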
## Supported LLMs

Megatron-Bridge models are supported for export and deployment if they are listed as compatible in the [vLLM supported models list](https://docs.vllm.ai/en/v0.9.2/models/supported_models.html).

## Use NeMo Export and Deploy APIs to Export

So far, we have used scripts to export and deploy LLMs. However, NeMo's deploy and export modules offer straightforward APIs for deploying models to Triton and exporting Megatron-Bridge checkpoints to vLLM.

### Export Megatron-Bridge LLMs

You can use the APIs in the export module to export a Megatron-Bridge checkpoint to vLLM. The following code example assumes the ``hf_llama31_8B_mbridge`` checkpoint has already been downloaded and generated at the ``/opt/checkpoints/`` path.

```python
def run_test():
    from nemo_export.vllm_exporter import vLLMExporter

    checkpoint_file = "/opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/"

    exporter = vLLMExporter()
    exporter.export(
        model_path_id=checkpoint_file,
        tensor_parallel_size=1,
    )

    # Note: the argument that controls output length is 'max_tokens', not 'max_output_len'.
    output = exporter.forward(
        ["What is the best city in the world?"],
        max_tokens=50,
        top_k=1,
        top_p=0.1,
        temperature=1.0,
    )
    print("output: ", output)


if __name__ == "__main__":
    run_test()
```

Be sure to check the ``vLLMExporter`` class docstrings for details.
## How To Send a Query

### Send a Query using the Script

You can send queries to your deployed Megatron-Bridge LLM using the provided query script. This script allows you to interact with the model via HTTP requests, sending prompts and receiving generated responses directly from the Triton server.

The example below demonstrates how to use the query script to send a prompt to your deployed model. You can customize the request with various parameters to control generation behavior, such as output length, sampling strategy, and more. For a full list of supported parameters, see the list below; an example that combines several of them follows it.

```shell
python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py --url "http://localhost:8000" --model_name llama --prompt "What is the capital of the United States?"
```

**Additional parameters:**
- `--prompt_file`: Read the prompt from a file instead of the command line
- `--max_tokens`: Maximum number of tokens to generate (default: 16)
- `--min_tokens`: Minimum number of tokens to generate (default: 0)
- `--n_log_probs`: Number of log probabilities to return per output token
- `--n_prompt_log_probs`: Number of log probabilities to return per prompt token
- `--seed`: Random seed for generation
- `--top_k`: Top-k sampling (default: 1)
- `--top_p`: Top-p sampling (default: 0.1)
- `--temperature`: Sampling temperature (default: 1.0)
- `--lora_task_uids`: List of LoRA task UIDs for LoRA-enabled models (use -1 to disable)
- `--init_timeout`: Init timeout for the Triton server in seconds (default: 60.0)
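As a sketch of how these options combine, the command below passes explicit sampling and length settings; the prompt and values are placeholders, and only flags from the list above are used:

```shell
# Illustrative only: tune the sampling values for your use case.
python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py \
    --url "http://localhost:8000" \
    --model_name llama \
    --prompt "Write a short overview of the Triton Inference Server." \
    --max_tokens 64 \
    --min_tokens 8 \
    --temperature 0.7 \
    --top_p 0.9 \
    --top_k 40 \
    --seed 42
```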
### Send a Query using the NeMo APIs

See the example below if you would like to use the APIs to send a query.

```python
from nemo_deploy.llm import NemoQueryvLLM

nq = NemoQueryvLLM(url="localhost:8000", model_name="llama")
output = nq.query_llm(
    prompts=["What is the capital of the United States?"],
    max_tokens=100,
    top_k=1,
    top_p=0.8,
    temperature=1.0,
)
print("output: ", output)
```