
Commit c6eead0

Documentation: Troubleshooting and FAQ updates and the updated documentation structure (#548)
This PR includes the following updates:

- Reviewed and updated the Troubleshooting and FAQ documents.
- Reorganized the documentation structure by replacing the top navigation with a left-side navigation bar. This layout is more common in technical documentation and makes it easier to browse, view available documents, and switch between them.
- Adjusted the document locations in the navigation bar to better fit their categories under the new navigation structure.
- Added custom styling to make category headers in the sidebar more prominent and easier to distinguish from individual documents.
- Removed unnecessary index pages (e.g., for Configuration, User Guides, and Developer Guides), which were previously empty and not really needed.
- Reorganized doc files into appropriate folders (to match the updated website structure) and updated links to these documents.

---------

Signed-off-by: mhelf-intel <[email protected]>
1 parent eee0bbe commit c6eead0

23 files changed (+268, −251 lines)

docs/.nav.yml

Lines changed: 40 additions & 53 deletions
@@ -1,58 +1,45 @@
 nav:
-  - Home:
-    - vLLM Hardware Plugin for Intel® Gaudi®: README.md
-  - Getting Started:
-    - Quick Start:
-      - getting_started/quickstart.md
-      - getting_started/quickstart_configuration.md
-      - getting_started/quickstart_inference.md
-    - Installation: getting_started/installation.md
-  - Quick Links:
-    - User Guide: user_guide/README.md
-    - Developer Guide: dev_guide/README.md
-    - API Reference: api/README.md
-  - User Guide:
-    - Summary: user_guide/README.md
-    - user_guide/v1_guide.md
-    - General:
-      - user_guide/*
-  - Configuration:
-    - Summary: configuration/README.md
-    - configuration/env_vars.md
-    - configuration/long_context.md
-    - Calibration:
-      - configuration/calibration/calibration.md
-      - configuration/calibration/calibration_one_node.md
-      - configuration/calibration/calibration_multi_node.md
-    - Quantization and Inference:
-      - configuration/quantization/quantization.md
-      - configuration/quantization/inc.md
-      - configuration/quantization/auto_awq.md
-      - configuration/quantization/gptqmodel.md
-    - configuration/optimization.md
-    - configuration/pipeline_parallelism.md
-    #- configuration/*
-  - Models:
-    - models/validated_models.md
-  - Features:
-    - features/supported_features.md
-    - features/compatibility_matrix.md
-    - features/*
-  - Developer Guide:
-    - Summary: dev_guide/README.md
-    - General:
-      - dev_guide/ci-failures.md
-    - Profiling:
-      - Summary: dev_guide/profiling/profiling.md
-      - dev_guide/profiling/e2e-profiling.md
-      - dev_guide/profiling/high-level-profiling.md
-      - dev_guide/profiling/pytorch-profiling-async.md
-      - dev_guide/profiling/pytorch-profiling-script.md
-      - dev_guide/profiling/profiling-prompt-decode.md
-  - Design Documents:
-    - design/*
+  - Getting Started:
+    - README.md
+    - Quick Start:
+      - getting_started/quickstart/quickstart.md
+      - getting_started/quickstart/quickstart_configuration.md
+      - getting_started/quickstart/quickstart_inference.md
+    - Installation: getting_started/installation.md
+    - getting_started/compatibility_matrix.md
+    - getting_started/validated_models.md
+  - Configuration Guides:
+    - configuration/env_vars.md
+    - configuration/long_context.md
+    - Calibration:
+      - configuration/calibration/calibration.md
+      - configuration/calibration/calibration_one_node.md
+      - configuration/calibration/calibration_multi_node.md
+    - Quantization and Inference:
+      - configuration/quantization/quantization.md
+      - configuration/quantization/inc.md
+      - configuration/quantization/auto_awq.md
+      - configuration/quantization/gptqmodel.md
+    - configuration/optimization.md
+    - configuration/pipeline_parallelism.md
+  - Features:
+    - features/supported_features.md
+    - features/*
+    - features/quantization
+  - Developer Guides:
+    - dev_guide/plugin_system.md
+    - dev_guide/ci-failures.md
+    - Profiling:
+      - dev_guide/profiling/profiling.md
+      - dev_guide/profiling/e2e-profiling.md
+      - dev_guide/profiling/high-level-profiling.md
+      - dev_guide/profiling/pytorch-profiling-async.md
+      - dev_guide/profiling/pytorch-profiling-script.md
+      - dev_guide/profiling/profiling-prompt-decode.md
   - API Reference:
     - Summary: api/README.md
     - Contents:
       - glob: api/vllm_gaudi/*
-        preserve_directory_names: true
+        preserve_directory_names: true
+  - general/troubleshooting.md
+  - general/faq.md

docs/README.md

Lines changed: 8 additions & 10 deletions
@@ -1,5 +1,3 @@
-# vLLM Hardware Plugin for Intel® Gaudi®
-
 <figure markdown="span" style="display: flex; justify-content: center; align-items: center; gap: 10px; margin: auto;">
   <img src="./assets/logos/vllm-logo-text-light.png" alt="vLLM" style="width: 30%; margin: 0;"> x
   <img src="./assets/logos/gaudi-logo.png" alt="Intel-Gaudi" style="width: 30%; margin: 0;">
@@ -15,9 +13,9 @@
   <a class="github-button" href="https://github.com/vllm-project/vllm-gaudi/fork" data-show-count="true" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
 </p>

-The vLLM Hardware Plugin for Intel® Gaudi® is a community-driven integration layer that enables efficient, high-performance large language model (LLM) inference on Intel® Gaudi® AI accelerators.
+# Overview

-## 🔍 Overview
+The vLLM Hardware Plugin for Intel® Gaudi® is a community-driven integration layer that enables efficient, high-performance large language model (LLM) inference on Intel® Gaudi® AI accelerators.

 The vLLM Hardware Plugin for Intel® Gaudi® connects the [vLLM serving engine](https://docs.vllm.ai/) with [Intel® Gaudi®](https://docs.habana.ai/) hardware, offering optimized inference capabilities for enterprise-scale LLM workloads. It is developed and maintained by the Intel® Gaudi® team and follows the [hardware pluggable RFC](https://github.com/vllm-project/vllm/issues/11162) and [vLLM plugin architecture RFC](https://github.com/vllm-project/vllm/issues/19161) for modular integration.

@@ -33,12 +31,12 @@ The vLLM Hardware Plugin for Intel® Gaudi® offers the following key benefits:

 To get started with vLLM Hardware Plugin for Intel® Gaudi®:

-- [ ] **Set up your environment** using the [quickstart](getting_started/quickstart.md) guide and use the plugin locally or in your containerized environment.
-- [ ] **Run inference** using supported models, such as Llama 3.1, Mixtral, or DeepSeek.
-- [ ] **Explore advanced features**, such as FP8 quantization, recipe caching, and expert parallelism.
-- [ ] **Join the community** by contributing to the [vLLM-Gaudi](https://github.com/vllm-project/vllm-gaudi) GitHub repository.
+- [ ] **Set up your environment** using the [quickstart](getting_started/quickstart/quickstart.md) and plugin locally or in your containerized environment.
+- [ ] **Run inference** using supported models like Llama 3.1, Mixtral, or DeepSeek.
+- [ ] **Explore advanced features** such as FP8 quantization, recipe caching, and expert parallelism.
+- [ ] **Join the community** by contributing to the [vLLM-Gaudi GitHub repo](https://github.com/vllm-project/vllm-gaudi).

 For more information, see:

-- 📚 [Intel Gaudi Documentation](https://docs.habana.ai/en/latest/index.html)
-- 📦 [vLLM Plugin System Overview](design/plugin_system.md)
+- 📚 [Intel® Gaudi® Documentation](https://docs.habana.ai/en/latest/index.html)
+- 📦 [vLLM Plugin System Overview](https://docs.vllm.ai/en/latest/design/plugin_system/)

docs/configuration/README.md

Lines changed: 0 additions & 3 deletions
This file was deleted.

docs/dev_guide/README.md

Lines changed: 0 additions & 3 deletions
This file was deleted.
File renamed without changes.

docs/features/compatibility_matrix.md

Lines changed: 0 additions & 13 deletions
This file was deleted.

docs/features/supported_features.md

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ This document summarizes the features currently supported by the vLLM Hardware P

 | **Feature** | **Description** | **References** |
 |--- |--- |--- |
-| Offline batched inference | Supports offline inference using the LLM class from vLLM Python API. | [Quickstart](https://docs.vllm.ai/en/stable/getting_started/quickstart.html#offline-batched-inference), [Example](https://docs.vllm.ai/en/stable/examples/offline_inference/batch_llm_inference.html) |
+| Offline batched inference | Supports offline inference using the LLM class from vLLM Python API. | [Quickstart](../getting_started/quickstart/quickstart_inference.md#offline-batched-inference), [Example](https://docs.vllm.ai/en/stable/examples/offline_inference/batch_llm_inference.html) |
 | Online inference via the OpenAI-Compatible Server | Supports online inference through an HTTP server that implements the OpenAI Chat and Completions API. | [Documentation](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html), [Example](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html) |
 | HPU autodetection | Enables automatic target platform detection for HPU users at vLLM startup. | N/A |
 | Paged KV cache with algorithms enabled for Intel® Gaudi® accelerators | Provides a custom paged attention and cache operators implementations optimized for Intel® Gaudi® devices. | N/A |

docs/general/faq.md

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
---
title: Frequently Asked Questions
---
[](){ #faq }

## Prerequisites and System Requirements

### What are the system requirements for running vLLM on Intel® Gaudi®?

- Ubuntu 22.04 LTS OS.
- Python 3.10.
- Intel Gaudi 2 or Intel Gaudi 3 AI accelerator.
- Intel Gaudi software version 1.23.0 and above.

### What is the vLLM plugin and where can I find its GitHub repository?

Intel develops and maintains its own vLLM plugin project called vLLM Hardware Plugin for Intel® Gaudi®, located in the [vLLM-gaudi](https://github.com/vllm-project/vllm-gaudi) repository on GitHub.

### How do I verify that the Intel® Gaudi® software is installed correctly?

1. Run ``hl-smi`` to check if Intel® Gaudi® accelerators are visible. Refer to [System Verifications and Final Tests](https://docs.habana.ai/en/latest/Installation_Guide/System_Verification_and_Final_Tests.html#system-verification) for more details.

2. Run ``apt list --installed | grep habana`` to verify installed packages. The output should look similar to the following example:

```text
$ apt list --installed | grep habana
habanalabs-container-runtime
habanalabs-dkms
habanalabs-firmware-tools
habanalabs-graph
habanalabs-qual
habanalabs-rdma-core
habanalabs-thunk
habanalabs-tools
```

3. Check the installed Python packages by running ``pip list | grep habana`` and ``pip list | grep neural``. The output should look similar to this example:

```text
$ pip list | grep habana
habana_gpu_migration 1.19.0.561
habana-media-loader 1.19.0.561
habana-pyhlml 1.19.0.561
habana-torch-dataloader 1.19.0.561
habana-torch-plugin 1.19.0.561
lightning-habana 1.6.0
Pillow-SIMD 9.5.0.post20+habana
$ pip list | grep neural
neural_compressor_pt 3.2
```

### How can I quickly set up the environment for vLLM using Docker?

Use the `Dockerfile.ubuntu.pytorch.vllm` file provided in the [.cd directory on GitHub](https://github.com/vllm-project/vllm-gaudi/tree/main/.cd) to build and run a container with the latest Intel® Gaudi® software release.

For more details, see [Quick Start Using Dockerfile](../getting_started/quickstart/quickstart.md).

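As a rough illustration, building and starting the container typically looks like the sketch below; the image tag and runtime flags are examples, not the authoritative commands from the quick-start guide.

```bash
# Build the image from the Dockerfile shipped in the .cd directory (tag name is illustrative)
docker build -f Dockerfile.ubuntu.pytorch.vllm -t vllm-gaudi:latest .

# Run it with the Habana container runtime so the Gaudi devices are visible inside the container
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all \
    --cap-add=sys_nice --net=host --ipc=host vllm-gaudi:latest
```
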
## Building and Installing vLLM

### How can I install vLLM on Intel Gaudi?

There are two different installation methods:

- [Running vLLM Hardware Plugin for Intel® Gaudi® using a Dockerfile](../getting_started/installation.md#running-vllm-hardware-plugin-for-intel-gaudi-using-dockerfile): We recommend this method as it is the most suitable option for production deployments.

- [Building vLLM Hardware Plugin for Intel® Gaudi® from source](../getting_started/installation.md#building-vllm-hardware-plugin-for-intel-gaudi-from-source): This method is intended for developers working with experimental code or new features that are still under testing.

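For orientation only, a source build generally follows the shape below; the repository URL comes from this page, but the individual steps are assumptions, so follow the linked installation guide for the exact commands.

```bash
# Sketch of a from-source setup (assumed steps; an existing vLLM installation is assumed)
git clone https://github.com/vllm-project/vllm-gaudi
cd vllm-gaudi
pip install -e .   # installs the plugin into the current Python environment
```
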
## Examples and Model Support

### Which models and configurations have been validated on Intel® Gaudi® 2 and Intel® Gaudi® 3 devices?

The list of validated models is available in the [Validated Models](../getting_started/validated_models.md) document. The list includes models such as:

- Llama 2, Llama 3, and Llama 3.1 (7B, 8B, and 70B versions). Refer to the Llama 3.1 Jupyter notebook example.

- Mistral and Mixtral models.

- Different tensor parallelism configurations, such as single HPU, 2x HPU, and 8x HPU.

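For example, a multi-card configuration is launched by setting the tensor parallel size on the standard vLLM CLI; the model ID below is only an illustrative choice from the validated list.

```bash
# Serve Llama 3.1 70B sharded across 8 Gaudi cards (model ID illustrative)
vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 8
```
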
## Features Support

### Which key features does vLLM support on Intel® Gaudi®?

The list of supported features is available in the [Supported Features](../features/supported_features.md) document. It includes features such as:

- Offline Batched Inference

- OpenAI-Compatible Server

- Paged KV cache optimized for Intel® Gaudi® devices

- Speculative decoding (experimental)

- Tensor parallel inference

- FP8 models and KV Cache quantization and calibration with Intel® Neural Compressor (INC). See [FP8 Calibration and Inference with vLLM](../features/quantization/inc.md) for more details.

## Performance Tuning

### Which execution modes does the plugin support?

- PyTorch Eager mode (default)

- torch.compile (default)

- HPU Graphs (recommended for best performance)

- PyTorch Lazy mode

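As a minimal sketch of switching modes: the flag and environment variable below are standard vLLM and Intel® Gaudi® mechanisms, but how each mode maps to settings in this plugin is an assumption here, so check the configuration guides for the authoritative options.

```bash
# Eager execution (skips graph capture) -- standard vLLM flag
vllm serve <model> --enforce-eager

# PyTorch Lazy mode -- standard Intel Gaudi environment variable
PT_HPU_LAZY_MODE=1 vllm serve <model>
```
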
### How does the bucketing mechanism work in vLLM Hardware Plugin for Intel® Gaudi®?

The bucketing mechanism optimizes performance by grouping tensor shapes. This reduces the number of required graphs and minimizes compilations during server runtime. Buckets are determined by parameters for batch size and sequence length. For more information, see [Bucketing Mechanism](../features/bucketing_mechanism.md).

### What should I do if a request exceeds the maximum bucket size?

Consider increasing the upper bucket boundaries using environment variables to avoid potential latency increases due to graph compilation.
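For illustration, bucket boundaries are usually raised through environment variables before starting the server; the variable names below follow the convention used by earlier Gaudi vLLM releases and are assumptions here, so treat the environment variable reference in these docs as the authoritative list.

```bash
# Raise the upper prompt/decode bucket boundaries (variable names assumed, values illustrative)
export VLLM_PROMPT_SEQ_BUCKET_MAX=4096
export VLLM_DECODE_BLOCK_BUCKET_MAX=512
vllm serve <model>
```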
File renamed without changes.

docs/general/troubleshooting.md

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
---
title: Troubleshooting
---
[](){ #troubleshooting }

# Troubleshooting

This document contains troubleshooting instructions for common issues that you may encounter when using the vLLM Hardware Plugin for Intel® Gaudi®.

## FP8 model fails when torch.compile is enabled

If your Floating Point 8-bit (FP8) model is not working when torch.compile is enabled and you receive the following error, the issue is likely caused by the Runtime Scale Patching feature.

```
AssertionError: Scaling method "ScaleMethodString.ACT_MAXABS_PCS_POW2_WEIGHT_MAXABS_PTS_POW2_HW" is not supported for runtime scale patching (graph recompile reduction)
```

The default Runtime Scale Patching feature does not support the scaling method that your workload is using for FP8 execution. To fix the issue, disable Runtime Scale Patching when running this model by exporting `RUNTIME_SCALE_PATCHING=0` in your environment.

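For example, only the exported variable comes from this fix; the serve command itself is illustrative.

```bash
# Turn off Runtime Scale Patching, then launch the FP8 workload as usual
export RUNTIME_SCALE_PATCHING=0
vllm serve <fp8-model>
```
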
## Server error occurs when setting max_concurrency

If setting `max_concurrency` causes the following error, the specified value is likely incorrect.

```
assert num_output_tokens == 0, \
(EngineCore_DP0 pid=545) ERROR 10-13 06:03:56 [core.py:710] AssertionError: req_id: cmpl-benchmark-serving39-0, 236
```

vLLM calculates the maximum available concurrency for the current environment based on KV cache settings. To fix the issue, use the value printed in the logs:

```
[kv_cache_utils.py:1091] Maximum concurrency for 4,352 tokens per request: 10.59x
```

In this example, the correct `max_concurrency` value is `10`.

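As a sketch, assuming the vLLM serving benchmark client is where `max_concurrency` is set (the model and other flags are illustrative):

```bash
# Round the reported 10.59x down to 10 and pass it to the benchmark client
vllm bench serve --model <model> --max-concurrency 10
```
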
## Out-of-memory errors occur when using the plugin

If you encounter out-of-memory errors while using the plugin, consider the following solutions and recommendations:

- Increase `--gpu-memory-utilization` to a higher value than the default `0.9`. This addresses insufficient available memory per card.

- Increase `--tensor-parallel-size` to a higher value than the default `1`. This shards the model weights across multiple cards and can make it possible to load a model that is too big for a single card.

- Disable HPU Graphs completely by switching to any other execution mode to maximize KV cache space allocation.
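For example, a launch command that applies the first two recommendations might look like the following; the model name and exact values are illustrative, and switching execution modes for the third point is covered in the configuration guides.

```bash
# Give vLLM more of each card's memory and shard the model across two cards
vllm serve <model> \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 2
```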
