
Commit c6eead0

Documentation: Troubleshooting and FAQ updates and the updated documentation structure (#548)
This PR includes the following updates:

- Reviewed and updated the Troubleshooting and FAQ documents.
- Reorganized the documentation structure by replacing the top navigation with a left-side navigation bar. This layout is more common in technical documentation and makes it easier to browse, view available documents, and switch between them.
- Adjusted the document locations in the navigation bar to better fit their categories under the new navigation structure.
- Added custom styling to make category headers in the sidebar more prominent and easier to distinguish from individual documents.
- Removed unnecessary index pages (e.g., for Configuration, User Guides, and Developer Guides), which were previously empty and not really needed.
- Reorganized doc files into appropriate folders (to match the updated website structure) and updated links to these documents.

---------

Signed-off-by: mhelf-intel <[email protected]>
1 parent eee0bbe commit c6eead0

23 files changed (+268, −251 lines)

docs/.nav.yml

Lines changed: 40 additions & 53 deletions
@@ -1,58 +1,45 @@
 nav:
-  - Home:
-    - vLLM Hardware Plugin for Intel® Gaudi®: README.md
-  - Getting Started:
-    - Quick Start:
-      - getting_started/quickstart.md
-      - getting_started/quickstart_configuration.md
-      - getting_started/quickstart_inference.md
-    - Installation: getting_started/installation.md
-  - Quick Links:
-    - User Guide: user_guide/README.md
-    - Developer Guide: dev_guide/README.md
-    - API Reference: api/README.md
-  - User Guide:
-    - Summary: user_guide/README.md
-    - user_guide/v1_guide.md
-    - General:
-      - user_guide/*
-  - Configuration:
-    - Summary: configuration/README.md
-    - configuration/env_vars.md
-    - configuration/long_context.md
-    - Calibration:
-      - configuration/calibration/calibration.md
-      - configuration/calibration/calibration_one_node.md
-      - configuration/calibration/calibration_multi_node.md
-    - Quantization and Inference:
-      - configuration/quantization/quantization.md
-      - configuration/quantization/inc.md
-      - configuration/quantization/auto_awq.md
-      - configuration/quantization/gptqmodel.md
-    - configuration/optimization.md
-    - configuration/pipeline_parallelism.md
-    #- configuration/*
-  - Models:
-    - models/validated_models.md
-  - Features:
-    - features/supported_features.md
-    - features/compatibility_matrix.md
-    - features/*
-  - Developer Guide:
-    - Summary: dev_guide/README.md
-    - General:
-      - dev_guide/ci-failures.md
-    - Profiling:
-      - Summary: dev_guide/profiling/profiling.md
-      - dev_guide/profiling/e2e-profiling.md
-      - dev_guide/profiling/high-level-profiling.md
-      - dev_guide/profiling/pytorch-profiling-async.md
-      - dev_guide/profiling/pytorch-profiling-script.md
-      - dev_guide/profiling/profiling-prompt-decode.md
-  - Design Documents:
-    - design/*
+  - Getting Started:
+    - README.md
+    - Quick Start:
+      - getting_started/quickstart/quickstart.md
+      - getting_started/quickstart/quickstart_configuration.md
+      - getting_started/quickstart/quickstart_inference.md
+    - Installation: getting_started/installation.md
+    - getting_started/compatibility_matrix.md
+    - getting_started/validated_models.md
+  - Configuration Guides:
+    - configuration/env_vars.md
+    - configuration/long_context.md
+    - Calibration:
+      - configuration/calibration/calibration.md
+      - configuration/calibration/calibration_one_node.md
+      - configuration/calibration/calibration_multi_node.md
+    - Quantization and Inference:
+      - configuration/quantization/quantization.md
+      - configuration/quantization/inc.md
+      - configuration/quantization/auto_awq.md
+      - configuration/quantization/gptqmodel.md
+    - configuration/optimization.md
+    - configuration/pipeline_parallelism.md
+  - Features:
+    - features/supported_features.md
+    - features/*
+    - features/quantization
+  - Developer Guides:
+    - dev_guide/plugin_system.md
+    - dev_guide/ci-failures.md
+    - Profiling:
+      - dev_guide/profiling/profiling.md
+      - dev_guide/profiling/e2e-profiling.md
+      - dev_guide/profiling/high-level-profiling.md
+      - dev_guide/profiling/pytorch-profiling-async.md
+      - dev_guide/profiling/pytorch-profiling-script.md
+      - dev_guide/profiling/profiling-prompt-decode.md
   - API Reference:
     - Summary: api/README.md
     - Contents:
       - glob: api/vllm_gaudi/*
-        preserve_directory_names: true
+        preserve_directory_names: true
+  - general/troubleshooting.md
+  - general/faq.md

docs/README.md

Lines changed: 8 additions & 10 deletions
@@ -1,5 +1,3 @@
-# vLLM Hardware Plugin for Intel® Gaudi®
-
 <figure markdown="span" style="display: flex; justify-content: center; align-items: center; gap: 10px; margin: auto;">
   <img src="./assets/logos/vllm-logo-text-light.png" alt="vLLM" style="width: 30%; margin: 0;"> x
   <img src="./assets/logos/gaudi-logo.png" alt="Intel-Gaudi" style="width: 30%; margin: 0;">
@@ -15,9 +13,9 @@
   <a class="github-button" href="https://github.com/vllm-project/vllm-gaudi/fork" data-show-count="true" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
 </p>

-The vLLM Hardware Plugin for Intel® Gaudi® is a community-driven integration layer that enables efficient, high-performance large language model (LLM) inference on Intel® Gaudi® AI accelerators.
+# Overview

-## 🔍 Overview
+The vLLM Hardware Plugin for Intel® Gaudi® is a community-driven integration layer that enables efficient, high-performance large language model (LLM) inference on Intel® Gaudi® AI accelerators.

 The vLLM Hardware Plugin for Intel® Gaudi® connects the [vLLM serving engine](https://docs.vllm.ai/) with [Intel® Gaudi®](https://docs.habana.ai/) hardware, offering optimized inference capabilities for enterprise-scale LLM workloads. It is developed and maintained by the Intel® Gaudi® team and follows the [hardware pluggable RFC](https://github.com/vllm-project/vllm/issues/11162) and [vLLM plugin architecture RFC](https://github.com/vllm-project/vllm/issues/19161) for modular integration.

@@ -33,12 +31,12 @@ The vLLM Hardware Plugin for Intel® Gaudi® offers the following key benefits:

 To get started with vLLM Hardware Plugin for Intel® Gaudi®:

-- [ ] **Set up your environment** using the [quickstart](getting_started/quickstart.md) guide and use the plugin locally or in your containerized environment.
-- [ ] **Run inference** using supported models, such as Llama 3.1, Mixtral, or DeepSeek.
-- [ ] **Explore advanced features**, such as FP8 quantization, recipe caching, and expert parallelism.
-- [ ] **Join the community** by contributing to the [vLLM-Gaudi](https://github.com/vllm-project/vllm-gaudi) GitHub repository.
+- [ ] **Set up your environment** using the [quickstart](getting_started/quickstart/quickstart.md) and plugin locally or in your containerized environment.
+- [ ] **Run inference** using supported models like Llama 3.1, Mixtral, or DeepSeek.
+- [ ] **Explore advanced features** such as FP8 quantization, recipe caching, and expert parallelism.
+- [ ] **Join the community** by contributing to the [vLLM-Gaudi GitHub repo](https://github.com/vllm-project/vllm-gaudi).

 For more information, see:

-- 📚 [Intel Gaudi Documentation](https://docs.habana.ai/en/latest/index.html)
-- 📦 [vLLM Plugin System Overview](design/plugin_system.md)
+- 📚 [Intel® Gaudi® Documentation](https://docs.habana.ai/en/latest/index.html)
+- 📦 [vLLM Plugin System Overview](https://docs.vllm.ai/en/latest/design/plugin_system/)

docs/configuration/README.md

Lines changed: 0 additions & 3 deletions
This file was deleted.

docs/dev_guide/README.md

Lines changed: 0 additions & 3 deletions
This file was deleted.
File renamed without changes.

docs/features/compatibility_matrix.md

Lines changed: 0 additions & 13 deletions
This file was deleted.

docs/features/supported_features.md

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ This document summarizes the features currently supported by the vLLM Hardware P

 | **Feature** | **Description** | **References** |
 |--- |--- |--- |
-| Offline batched inference | Supports offline inference using the LLM class from vLLM Python API. | [Quickstart](https://docs.vllm.ai/en/stable/getting_started/quickstart.html#offline-batched-inference), [Example](https://docs.vllm.ai/en/stable/examples/offline_inference/batch_llm_inference.html) |
+| Offline batched inference | Supports offline inference using the LLM class from vLLM Python API. | [Quickstart](../getting_started/quickstart/quickstart_inference.md#offline-batched-inference), [Example](https://docs.vllm.ai/en/stable/examples/offline_inference/batch_llm_inference.html) |
 | Online inference via the OpenAI-Compatible Server | Supports online inference through an HTTP server that implements the OpenAI Chat and Completions API. | [Documentation](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html), [Example](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html) |
 | HPU autodetection | Enables automatic target platform detection for HPU users at vLLM startup. | N/A |
 | Paged KV cache with algorithms enabled for Intel® Gaudi® accelerators | Provides a custom paged attention and cache operators implementations optimized for Intel® Gaudi® devices. | N/A |

docs/general/faq.md

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
---
title: Frequently Asked Questions
---
[](){ #faq }

## Prerequisites and System Requirements

### What are the system requirements for running vLLM on Intel® Gaudi®?

- Ubuntu 22.04 LTS OS.
- Python 3.10.
- Intel Gaudi 2 or Intel Gaudi 3 AI accelerator.
- Intel Gaudi software version 1.23.0 and above.

### What is the vLLM plugin and where can I find its GitHub repository?

Intel develops and maintains its own vLLM plugin project called vLLM Hardware Plugin for Intel® Gaudi®, located in the [vLLM-gaudi](https://github.com/vllm-project/vllm-gaudi) repository on GitHub.

### How do I verify that the Intel® Gaudi® software is installed correctly?

1. Run ``hl-smi`` to check if Intel® Gaudi® accelerators are visible. Refer to [System Verifications and Final Tests](https://docs.habana.ai/en/latest/Installation_Guide/System_Verification_and_Final_Tests.html#system-verification) for more details.

2. Run ``apt list --installed | grep habana`` to verify installed packages. The output should look similar to the following example:

```text
$ apt list --installed | grep habana
habanalabs-container-runtime
habanalabs-dkms
habanalabs-firmware-tools
habanalabs-graph
habanalabs-qual
habanalabs-rdma-core
habanalabs-thunk
habanalabs-tools
```

3. Check the installed Python packages by running ``pip list | grep habana`` and ``pip list | grep neural``. The output should look similar to this example:

```text
$ pip list | grep habana
habana_gpu_migration 1.19.0.561
habana-media-loader 1.19.0.561
habana-pyhlml 1.19.0.561
habana-torch-dataloader 1.19.0.561
habana-torch-plugin 1.19.0.561
lightning-habana 1.6.0
Pillow-SIMD 9.5.0.post20+habana
$ pip list | grep neural
neural_compressor_pt 3.2
```

### How can I quickly set up the environment for vLLM using Docker?

Use the `Dockerfile.ubuntu.pytorch.vllm` file provided in the [.cd directory on GitHub](https://github.com/vllm-project/vllm-gaudi/tree/main/.cd) to build and run a container with the latest Intel® Gaudi® software release.

For more details, see [Quick Start Using Dockerfile](../getting_started/quickstart/quickstart.md).

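As a rough illustration, building and starting the container typically looks like the sketch below; the image tag and runtime flags are examples, not the authoritative commands from the quick-start guide.

```bash
# Build the image from the Dockerfile shipped in the .cd directory (tag name is illustrative)
docker build -f Dockerfile.ubuntu.pytorch.vllm -t vllm-gaudi:latest .

# Run it with the Habana container runtime so the Gaudi devices are visible inside the container
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all \
    --cap-add=sys_nice --net=host --ipc=host vllm-gaudi:latest
```
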
## Building and Installing vLLM

### How can I install vLLM on Intel Gaudi?

There are two different installation methods:

- [Running vLLM Hardware Plugin for Intel® Gaudi® using a Dockerfile](../getting_started/installation.md#running-vllm-hardware-plugin-for-intel-gaudi-using-dockerfile): We recommend this method as it is the most suitable option for production deployments.

- [Building vLLM Hardware Plugin for Intel® Gaudi® from source](../getting_started/installation.md#building-vllm-hardware-plugin-for-intel-gaudi-from-source): This method is intended for developers working with experimental code or new features that are still under testing.

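For orientation only, a source build generally follows the shape below; the repository URL comes from this page, but the individual steps are assumptions, so follow the linked installation guide for the exact commands.

```bash
# Sketch of a from-source setup (assumed steps; an existing vLLM installation is assumed)
git clone https://github.com/vllm-project/vllm-gaudi
cd vllm-gaudi
pip install -e .   # installs the plugin into the current Python environment
```
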
## Examples and Model Support

### Which models and configurations have been validated on Intel® Gaudi® 2 and Intel® Gaudi® 3 devices?

The list of validated models is available in the [Validated Models](../getting_started/validated_models.md) document. The list includes models such as:

- Llama 2, Llama 3, and Llama 3.1 (7B, 8B, and 70B versions). Refer to the Llama 3.1 Jupyter notebook example.

- Mistral and Mixtral models.

- Different tensor parallelism configurations, such as single HPU, 2x HPU, and 8x HPU.

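For example, a multi-card configuration is launched by setting the tensor parallel size on the standard vLLM CLI; the model ID below is only an illustrative choice from the validated list.

```bash
# Serve Llama 3.1 70B sharded across 8 Gaudi cards (model ID illustrative)
vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 8
```
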
## Features Support

### Which key features does vLLM support on Intel® Gaudi®?

The list of supported features is available in the [Supported Features](../features/supported_features.md) document. It includes features such as:

- Offline Batched Inference

- OpenAI-Compatible Server

- Paged KV cache optimized for Intel® Gaudi® devices

- Speculative decoding (experimental)

- Tensor parallel inference

- FP8 models and KV Cache quantization and calibration with Intel® Neural Compressor (INC). See [FP8 Calibration and Inference with vLLM](../features/quantization/inc.md) for more details.

## Performance Tuning

### Which execution modes does the plugin support?

- PyTorch Eager mode (default)

- torch.compile (default)

- HPU Graphs (recommended for best performance)

- PyTorch Lazy mode

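As a minimal sketch of switching modes: the flag and environment variable below are standard vLLM and Intel® Gaudi® mechanisms, but how each mode maps to settings in this plugin is an assumption here, so check the configuration guides for the authoritative options.

```bash
# Eager execution (skips graph capture) -- standard vLLM flag
vllm serve <model> --enforce-eager

# PyTorch Lazy mode -- standard Intel Gaudi environment variable
PT_HPU_LAZY_MODE=1 vllm serve <model>
```
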
### How does the bucketing mechanism work in vLLM Hardware Plugin for Intel® Gaudi®?

The bucketing mechanism optimizes performance by grouping tensor shapes. This reduces the number of required graphs and minimizes compilations during server runtime. Buckets are determined by parameters for batch size and sequence length. For more information, see [Bucketing Mechanism](../features/bucketing_mechanism.md).

### What should I do if a request exceeds the maximum bucket size?

Consider increasing the upper bucket boundaries using environment variables to avoid potential latency increases due to graph compilation.
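For illustration, bucket boundaries are usually raised through environment variables before starting the server; the variable names below follow the convention used by earlier Gaudi vLLM releases and are assumptions here, so treat the environment variable reference in these docs as the authoritative list.

```bash
# Raise the upper prompt/decode bucket boundaries (variable names assumed, values illustrative)
export VLLM_PROMPT_SEQ_BUCKET_MAX=4096
export VLLM_DECODE_BLOCK_BUCKET_MAX=512
vllm serve <model>
```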
File renamed without changes.

docs/general/troubleshooting.md

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
---
title: Troubleshooting
---
[](){ #troubleshooting }

# Troubleshooting

This document contains troubleshooting instructions for common issues that you may encounter when using the vLLM Hardware Plugin for Intel® Gaudi®.

## FP8 model fails when torch.compile is enabled

If your Floating Point 8-bit (FP8) model is not working when torch.compile is enabled and you receive the following error, the issue is likely caused by the Runtime Scale Patching feature.

```
AssertionError: Scaling method "ScaleMethodString.ACT_MAXABS_PCS_POW2_WEIGHT_MAXABS_PTS_POW2_HW" is not supported for runtime scale patching (graph recompile reduction)
```

The default Runtime Scale Patching feature does not support the scaling method that your workload is using for FP8 execution. To fix the issue, disable Runtime Scale Patching when running this model by exporting `RUNTIME_SCALE_PATCHING=0` in your environment.

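For example, only the exported variable comes from this fix; the serve command itself is illustrative.

```bash
# Turn off Runtime Scale Patching, then launch the FP8 workload as usual
export RUNTIME_SCALE_PATCHING=0
vllm serve <fp8-model>
```
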
## Server error occurs when setting max_concurrency

If setting `max_concurrency` causes the following error, the specified value is likely incorrect.

```
assert num_output_tokens == 0, \
(EngineCore_DP0 pid=545) ERROR 10-13 06:03:56 [core.py:710] AssertionError: req_id: cmpl-benchmark-serving39-0, 236
```

vLLM calculates the maximum available concurrency for the current environment based on KV cache settings. To fix the issue, use the value printed in the logs:

```
[kv_cache_utils.py:1091] Maximum concurrency for 4,352 tokens per request: 10.59x
```

In this example, the correct `max_concurrency` value is `10`.

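As a sketch, assuming the vLLM serving benchmark client is where `max_concurrency` is set (the model and other flags are illustrative):

```bash
# Round the reported 10.59x down to 10 and pass it to the benchmark client
vllm bench serve --model <model> --max-concurrency 10
```
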
## Out-of-memory errors occur when using the plugin

If you encounter out-of-memory errors while using the plugin, consider the following solutions and recommendations:

- Increase `--gpu-memory-utilization` to a higher value than the default `0.9`. This addresses insufficient available memory per card.

- Increase `--tensor-parallel-size` to a higher value than the default `1`. This shards the model weights across multiple cards and can make it possible to load a model that is too big for a single card.

- Disable HPU Graphs completely by switching to any other execution mode to maximize KV cache space allocation.
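For example, a launch command that applies the first two recommendations might look like the following; the model name and exact values are illustrative, and switching execution modes for the third point is covered in the configuration guides.

```bash
# Give vLLM more of each card's memory and shard the model across two cards
vllm serve <model> \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 2
```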
