# Basic

The `LLM` class provides the primary Python interface for offline inference, that is, interacting with a model without using a separate model inference server.

## Usage

The first script in this example shows the most basic usage of vLLM. If you are new to Python and vLLM, you should start here.

```bash
python examples/offline_inference/basic/basic.py
```
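If you just want to see the shape of the API before opening the script, a minimal offline-inference loop looks roughly like the sketch below. The model name, prompts, and sampling values are illustrative choices, not necessarily what `basic.py` uses.

```python
from vllm import LLM, SamplingParams

# Illustrative prompts and sampling settings.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load a small model and run batched offline generation.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}, generated: {output.outputs[0].text!r}")
```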

The rest of the scripts include an [argument parser](https://docs.python.org/3/library/argparse.html), which you can use to pass any arguments that are compatible with [`LLM`](https://docs.vllm.ai/en/latest/api/offline_inference/llm.html). Try running the script with `--help` for a list of all available arguments.

```bash
python examples/offline_inference/basic/classify.py
```

```bash
python examples/offline_inference/basic/embed.py
```

```bash
python examples/offline_inference/basic/score.py
```
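The flags these scripts accept map onto `LLM` constructor arguments (dashes become underscores). As a rough sketch with arbitrary example values, passing `--model facebook/opt-125m --max-model-len 1024 --gpu-memory-utilization 0.8` on the command line corresponds to constructing the engine like this:

```python
from vllm import LLM

# The parsed CLI flags end up as LLM constructor keywords; the values here
# are arbitrary examples rather than recommended settings.
llm = LLM(
    model="facebook/opt-125m",
    max_model_len=1024,
    gpu_memory_utilization=0.8,
)
```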

The chat and generate scripts also accept the [sampling parameters](https://docs.vllm.ai/en/latest/api/inference_params.html#sampling-parameters): `max_tokens`, `temperature`, `top_p` and `top_k`.

```bash
python examples/offline_inference/basic/chat.py
```

```bash
python examples/offline_inference/basic/generate.py
```
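As a sketch of what those parameters look like in Python, the snippet below builds a `SamplingParams` object and passes it to `LLM.chat`; the model and values are illustrative, and the real scripts may structure this differently.

```python
from vllm import LLM, SamplingParams

# The four sampling knobs exposed by the chat/generate scripts.
sampling_params = SamplingParams(
    max_tokens=128,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
)

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# LLM.chat applies the model's chat template to the messages before generating.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about GPUs."},
]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```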

## Features

In the scripts that support passing arguments, you can experiment with the following features.

### Default generation config

The `--generation-config` argument specifies where the generation config will be loaded from when calling `LLM.get_default_sampling_params()`. If set to `auto`, the generation config will be loaded from the model path. If set to a folder path, the generation config will be loaded from the specified folder path. If it is not provided, vLLM defaults will be used.

> If `max_new_tokens` is specified in the generation config, it sets a server-wide limit on the number of output tokens for all requests.

Try it yourself with the following argument:

```bash
--generation-config auto
```
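In Python terms, this flag corresponds to the `generation_config` argument of `LLM`, and the resulting defaults can be inspected with `LLM.get_default_sampling_params()`. A minimal sketch, assuming the chosen model ships a `generation_config.json`:

```python
from vllm import LLM

# With generation_config="auto", defaults such as temperature or max_new_tokens
# are read from the model repository's generation_config.json (if present).
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", generation_config="auto")

# Inspect the sampling defaults derived from that file.
print(llm.get_default_sampling_params())
```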

### Quantization

#### AQLM

vLLM supports models that are quantized using AQLM.

Try one yourself by passing one of the following models to the `--model` argument:

- `ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf`
- `ISTA-DASLab/Llama-2-7b-AQLM-2Bit-2x8-hf`
- `ISTA-DASLab/Llama-2-13b-AQLM-2Bit-1x16-hf`
- `ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf`
- `BlackSamorez/TinyLlama-1_1B-Chat-v1_0-AQLM-2Bit-1x16-hf`

> Some of these models are likely to be too large for a single GPU. You can split them across multiple GPUs by setting `--tensor-parallel-size` to the number of required GPUs.
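
A minimal Python sketch of the same idea is shown below; the checkpoint choice and `tensor_parallel_size=2` are illustrative and assume two GPUs are available.

```python
from vllm import LLM

# The AQLM scheme is read from the checkpoint's quantization config, so only
# the model name is needed; tensor_parallel_size splits the weights across
# GPUs when the model is too large for one.
llm = LLM(
    model="ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
    tensor_parallel_size=2,
)
print(llm.generate("The capital of France is")[0].outputs[0].text)
```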
#### GGUF

vLLM supports models that are quantized using GGUF.

Try one yourself by downloading a GGUF quantized model and using the following arguments:

```python
from huggingface_hub import hf_hub_download

repo_id = "bartowski/Phi-3-medium-4k-instruct-GGUF"
filename = "Phi-3-medium-4k-instruct-IQ2_M.gguf"
print(hf_hub_download(repo_id, filename=filename))
```

```bash
--model {local-path-printed-above} --tokenizer microsoft/Phi-3-medium-4k-instruct
```
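Putting the two snippets together in Python, the downloaded file path is passed as the model and the base model's tokenizer is reused; treat this as a sketch rather than the exact contents of any script:

```python
from huggingface_hub import hf_hub_download
from vllm import LLM

# Download the GGUF file, then point vLLM at the local path while reusing the
# base model's tokenizer instead of converting the one embedded in the file.
model_path = hf_hub_download(
    "bartowski/Phi-3-medium-4k-instruct-GGUF",
    filename="Phi-3-medium-4k-instruct-IQ2_M.gguf",
)
llm = LLM(model=model_path, tokenizer="microsoft/Phi-3-medium-4k-instruct")
print(llm.generate("The capital of France is")[0].outputs[0].text)
```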

### CPU offload

The `--cpu-offload-gb` argument can be seen as a virtual way to increase the GPU memory size. For example, if you have one 24 GB GPU and set this to 10, you can effectively treat it as a 34 GB GPU and load a 13B model with BF16 weights, which requires at least 26 GB of GPU memory. Note that this requires a fast CPU-GPU interconnect, as part of the model is loaded from CPU memory to GPU memory on the fly during each forward pass.

Try it yourself with the following arguments:

```bash
--model meta-llama/Llama-2-13b-chat-hf --cpu-offload-gb 10
```
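A Python sketch of the same configuration is below; `cpu_offload_gb` is the keyword form of the flag, and access to the gated Llama 2 repository is assumed:

```python
from vllm import LLM

# Offload 10 GB of weights to CPU so a ~26 GB BF16 13B model can run on a
# single 24 GB GPU; the offloaded part is streamed back to the GPU during
# each forward pass, so interconnect bandwidth matters.
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    cpu_offload_gb=10,
)
print(llm.generate("The capital of France is")[0].outputs[0].text)
```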