# Add vllm docs for mbridge ckpt #573
Merged
## Commits (12)
All commits are by oyilmaz-nvidia:

- `e915df8` Add vllm docs for mbridge ckpt
- `14b5c47` Add params
- `1a624a0` Update vllm.md
- `1574682` Update vllm.md
- `d1e3883` Update in-framework.md
- `70ffb6d` Update in-framework-ray.md
- `db79574` Update vllm.md
- `27760b5` Update in-framework.md
- `b318a05` Update in-framework-ray.md
- `478a536` Update in-framework.md
- `308a2aa` Merge branch 'main' into onur/add-mbridge-vllm-docs
- `f8c313d` Update vllm.md
````diff
@@ -1,4 +1,13 @@
 # Deploy Megatron-Bridge LLMs by Exporting to Inference Optimized Libraries
 
-**Note:** Support for exporting and deploying Megatron-Bridge models with TensorRT-LLM and vLLM is coming soon. Please check back for updates.
+Export-Deploy supports optimizing and deploying Megatron-Bridge checkpoints using inference-optimized libraries such as vLLM and TensorRT-LLM.
+
+```{toctree}
+:maxdepth: 1
+:titlesonly:
+
+vLLM <vllm.md>
+```
+
+**Note:** Support for exporting and deploying Megatron-Bridge models with TensorRT-LLM is coming soon. Please check back for updates.
 
````
**New file** (`@@ -0,0 +1,195 @@`):

# Deploy Megatron-Bridge LLMs with vLLM and Triton Inference Server

This section shows how to use scripts and APIs to export a Megatron-Bridge LLM to vLLM and deploy it with the NVIDIA Triton Inference Server.

## Quick Example

1. Follow the steps in [Generate a Megatron-Bridge Checkpoint](../gen_mbridge_ckpt.md) to generate a Megatron-Bridge Llama checkpoint.

2. In a terminal, go to the folder where the ``hf_llama31_8B_mbridge`` checkpoint is located. Pull down and run the Docker container image using the command shown below. Change the ``:vr`` tag to the version of the container you want to use:

```shell
docker pull nvcr.io/nvidia/nemo:vr

docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \
  -v ${PWD}/hf_llama31_8B_mbridge:/opt/checkpoints/hf_llama31_8B_mbridge/ \
  -w /opt/Export-Deploy \
  --name nemo-fw \
  nvcr.io/nvidia/nemo:vr
```

3. If vLLM is not already available in the container, install it by executing the following command inside it:

```shell
cd /opt/Export-Deploy
uv sync --inexact --link-mode symlink --locked --extra vllm $(cat /opt/uv_args.txt)
```

4. Run the following deployment script to verify that everything is working correctly. The script exports the Llama Megatron-Bridge checkpoint to vLLM and then serves it on the Triton server (an optional readiness check is shown after these steps):

```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \
    --model_path_id /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --triton_model_name llama \
    --tensor_parallelism_size 1
```

5. If the test yields a shared-memory-related error, increase the shared memory size using ``--shm-size`` (for example, gradually by 50%).

6. In a separate terminal, access the running container as follows:

```shell
docker exec -it nemo-fw bash
```

7. To send a query to the Triton server, run the following script:

```shell
python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py -mn llama -p "The capital of Canada is" -mat 50
```
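Before sending queries, you can optionally confirm that the server is up. The check below assumes the deployment exposes Triton's standard KServe v2 health endpoint on the configured ``--triton_port`` (8000 by default); if your container does not expose it, skip this step:

```shell
# Prints 200 once the Triton HTTP endpoint reports it is ready to accept requests.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
```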
## Use a Script to Deploy Megatron-Bridge LLMs on a Triton Server

You can deploy an LLM from a Megatron-Bridge checkpoint on Triton using the provided script.

### Export and Deploy a Megatron-Bridge LLM

The script exports the model to vLLM and then starts the service on Triton.

1. Start the container using the steps described in the **Quick Example** section.

2. To begin serving the downloaded model, run the following script:

```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \
    --model_path_id /opt/checkpoints/hf_llama31_8B_mbridge \
    --triton_model_name llama \
    --tensor_parallelism_size 1
```

The following parameters are defined in the ``deploy_vllm_triton.py`` script (an illustrative command that combines several of them follows this list):

- ``--model_path_id``: Path of a Megatron-Bridge checkpoint, or Hugging Face model ID or path. (Required)
- ``--tokenizer``: Tokenizer file if it is not provided in the checkpoint. (Optional)
- ``--lora_ckpt``: List of LoRA checkpoints in HF format. (Optional, can specify multiple)
- ``--tensor_parallelism_size``: Number of GPUs to use for tensor parallelism. Default is 1.
- ``--dtype``: Data type for the model in vLLM. Choices: "auto", "bfloat16", "float16", "float32". Default is "auto".
- ``--quantization``: Quantization method for vLLM. Choices: "awq", "gptq", "fp8". Default is None.
- ``--seed``: Random seed for reproducibility. Default is 0.
- ``--gpu_memory_utilization``: GPU memory utilization percentage for vLLM. Default is 0.9.
- ``--swap_space``: Size (GiB) of CPU memory per GPU to use as swap space. Default is 4.
- ``--cpu_offload_gb``: Size (GiB) of CPU memory to use for offloading model weights. Default is 0.
- ``--enforce_eager``: Whether to enforce eager execution. Default is False.
- ``--max_seq_len_to_capture``: Maximum sequence length covered by CUDA graphs. Default is 8192.
- ``--triton_model_name``: Name for the service/model on Triton. (Required)
- ``--triton_model_version``: Version for the service/model. Default is 1.
- ``--triton_port``: Port for the Triton server to listen for requests. Default is 8000.
- ``--triton_http_address``: HTTP address for the Triton server. Default is 0.0.0.0.
- ``--max_batch_size``: Maximum batch size of the model. Default is 8.
- ``--debug_mode``: Enable debug/verbose output. Default is False.
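For example, the command below is an illustrative sketch that combines several of the flags documented above; the values (two GPUs, bfloat16, a lower GPU memory fraction, a larger batch size) are placeholders to adapt to your hardware and model:

```shell
# Illustrative only: adjust the parallelism, dtype, memory, and batch settings for your setup.
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \
    --model_path_id /opt/checkpoints/hf_llama31_8B_mbridge \
    --triton_model_name llama \
    --triton_port 8000 \
    --tensor_parallelism_size 2 \
    --dtype bfloat16 \
    --gpu_memory_utilization 0.8 \
    --max_batch_size 16 \
    --seed 0
```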
3. Access the models with a Hugging Face token.

If you want to run inference using the StarCoder1, StarCoder2, or Llama 3 models, you need to generate a Hugging Face token that has access to these models. Visit [Hugging Face](https://huggingface.co/) for more information. After you have the token, perform one of the following steps.

- Log in to Hugging Face:

```shell
huggingface-cli login
```

- Or, set the HF_TOKEN environment variable (see the note after these options about passing the token into the container):

```shell
export HF_TOKEN=your_token_here
```
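If you choose the environment-variable route, the token must be visible inside the container where the scripts run. One hedged way to do that is to forward it with Docker's standard ``-e`` flag when starting the container, reusing the earlier ``docker run`` command; tools built on the Hugging Face Hub client pick up ``HF_TOKEN`` automatically:

```shell
# Assumes HF_TOKEN is already exported on the host shell; forwards it into the container.
docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \
  -e HF_TOKEN=${HF_TOKEN} \
  -v ${PWD}/hf_llama31_8B_mbridge:/opt/checkpoints/hf_llama31_8B_mbridge/ \
  -w /opt/Export-Deploy \
  --name nemo-fw \
  nvcr.io/nvidia/nemo:vr
```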
## Supported LLMs

Megatron-Bridge models are supported for export and deployment if they are listed as compatible in the [vLLM supported models list](https://docs.vllm.ai/en/v0.9.2/models/supported_models.html).

## Use NeMo Export and Deploy APIs to Export

So far, we have used scripts to export and deploy LLMs. However, NeMo's deploy and export modules offer straightforward APIs for deploying models to Triton and exporting Megatron-Bridge checkpoints to vLLM.

### Export Megatron-Bridge LLMs

You can use the APIs in the export module to export a Megatron-Bridge checkpoint to vLLM. The following code example assumes the ``hf_llama31_8B_mbridge`` checkpoint has already been downloaded and generated at the ``/opt/checkpoints/`` path.

```python
def run_test():
    from nemo_export.vllm_exporter import vLLMExporter

    checkpoint_file = "/opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/"

    exporter = vLLMExporter()
    exporter.export(
        model_path_id=checkpoint_file,
        tensor_parallel_size=1,
    )

    # Note: the argument that controls output length is 'max_tokens', not 'max_output_len'.
    output = exporter.forward(
        ["What is the best city in the world?"],
        max_tokens=50,
        top_k=1,
        top_p=0.1,
        temperature=1.0,
    )
    print("output: ", output)


if __name__ == "__main__":
    run_test()
```

Be sure to check the ``vLLMExporter`` class docstrings for details.
## How To Send a Query

### Send a Query using the Script

You can send queries to your deployed Megatron-Bridge LLM using the provided query script. This script allows you to interact with the model via HTTP requests, sending prompts and receiving generated responses directly from the Triton server.

The example below demonstrates how to use the query script to send a prompt to your deployed model. You can customize the request with various parameters to control generation behavior, such as output length, sampling strategy, and more. For a full list of supported parameters, see the list below; an example that combines several of them follows it.

```shell
python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py --url "http://localhost:8000" --model_name llama --prompt "What is the capital of the United States?"
```

**Additional parameters:**
- `--prompt_file`: Read the prompt from a file instead of the command line
- `--max_tokens`: Maximum number of tokens to generate (default: 16)
- `--min_tokens`: Minimum number of tokens to generate (default: 0)
- `--n_log_probs`: Number of log probabilities to return per output token
- `--n_prompt_log_probs`: Number of log probabilities to return per prompt token
- `--seed`: Random seed for generation
- `--top_k`: Top-k sampling (default: 1)
- `--top_p`: Top-p sampling (default: 0.1)
- `--temperature`: Sampling temperature (default: 1.0)
- `--lora_task_uids`: List of LoRA task UIDs for LoRA-enabled models (use -1 to disable)
- `--init_timeout`: Init timeout for the Triton server in seconds (default: 60.0)
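As a sketch of how these options combine, the command below passes explicit sampling and length settings; the prompt and values are placeholders, and only flags from the list above are used:

```shell
# Illustrative only: tune the sampling values for your use case.
python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py \
    --url "http://localhost:8000" \
    --model_name llama \
    --prompt "Write a short overview of the Triton Inference Server." \
    --max_tokens 64 \
    --min_tokens 8 \
    --temperature 0.7 \
    --top_p 0.9 \
    --top_k 40 \
    --seed 42
```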
### Send a Query using the NeMo APIs

See the example below if you would like to use the APIs to send a query.

```python
from nemo_deploy.llm import NemoQueryvLLM

nq = NemoQueryvLLM(url="localhost:8000", model_name="llama")
output = nq.query_llm(
    prompts=["What is the capital of the United States?"],
    max_tokens=100,
    top_k=1,
    top_p=0.8,
    temperature=1.0,
)
print("output: ", output)
```