DOC: update readme for pytorch model (xorbitsai#207)
pangyoki authored Jul 28, 2023
1 parent 518fdf9 commit 900a8a8
Showing 2 changed files with 96 additions and 9 deletions.
54 changes: 49 additions & 5 deletions README.md
@@ -180,33 +180,77 @@ To view the builtin models, run the following command:
$ xinference list --all
```

### ggmlv3 models

| Name | Type | Language | Format | Size (in billions) | Quantization |
|---------------|------------------|----------|---------|--------------------|-----------------------------------------|
| llama-2 | Foundation Model | en | ggmlv3 | 7, 13 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0' |
| baichuan | Foundation Model | en, zh | ggmlv3 | 7 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0' |
| llama-2-chat | RLHF Model | en | ggmlv3 | 7, 13, 70 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0' |
| chatglm | SFT Model | en, zh | ggmlv3 | 6 | 'q4_0', 'q4_1', 'q5_0', 'q5_1', 'q8_0' |
| chatglm2 | SFT Model | en, zh | ggmlv3 | 6 | 'q4_0', 'q4_1', 'q5_0', 'q5_1', 'q8_0' |
| wizardlm-v1.0 | SFT Model | en | ggmlv3 | 7, 13, 33 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0' |
| wizardlm-v1.1 | SFT Model | en | ggmlv3 | 13 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0' |
| vicuna-v1.3 | SFT Model | en | ggmlv3 | 7, 13 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0' |
| orca | SFT Model | en | ggmlv3 | 3, 7, 13 | 'q4_0', 'q4_1', 'q5_0', 'q5_1', 'q8_0' |

### pytorch models

| Name | Type | Language | Format | Size (in billions) | Quantization |
|---------------|------------------|----------|---------|--------------------|--------------------------|
| baichuan | Foundation Model | en, zh | pytorch | 7, 13 | '4-bit', '8-bit', 'none' |
| baichuan-chat | SFT Model | en, zh | pytorch | 13 | '4-bit', '8-bit', 'none' |
| vicuna-v1.3 | SFT Model | en | pytorch | 7, 13, 33 | '4-bit', '8-bit', 'none' |


**NOTE**:
- Xinference downloads models automatically, and by default they are saved under `${USER}/.xinference/cache`.
- Foundation models provide only the `generate` interface.
- RLHF and SFT models provide both `generate` and `chat`.
- To use the Apple Metal GPU for acceleration, choose the q4_0 or q4_1 quantization method.
- The `llama-2-chat` 70B ggmlv3 model currently supports only q4_0 quantization.


## PyTorch Model Best Practices

PyTorch support has been integrated recently. The usage scenarios are described below:

### supported models
- Foundation Model: baichuan (7B, 13B).
- SFT Model: baichuan-chat (13B), vicuna-v1.3 (7B, 13B, 33B).

### supported devices
- CUDA: On Linux and Windows systems, the `cuda` device is used by default.
- MPS: On Mac M1/M2 devices, the `mps` device is used by default.
- CPU: Using the `cpu` device is not recommended, as it consumes a lot of memory and inference is very slow.
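The device defaults above amount to a simple preference order. A minimal sketch of that logic (the function name and boolean flags are illustrative, not Xinference's actual internals):

```python
def default_device(cuda_available: bool, mps_available: bool) -> str:
    """Mirror the defaults above: prefer CUDA, then MPS, fall back to CPU."""
    if cuda_available:
        return "cuda"  # Linux / Windows with an NVIDIA GPU
    if mps_available:
        return "mps"   # Mac M1/M2
    return "cpu"       # works, but memory-hungry and very slow


print(default_device(True, False))   # → cuda
print(default_device(False, True))   # → mps
print(default_device(False, False))  # → cpu
```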

### quantization methods
- `none`: indicates that no quantization is used.
- `8-bit`: use 8-bit quantization.
- `4-bit`: use 4-bit quantization. Note: 4-bit quantization is only supported on Linux systems with CUDA devices.
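Since the PyTorch backend loads Hugging Face models, the three options conceptually map to `from_pretrained` loading flags. The sketch below is an illustrative mapping under that assumption, not Xinference's actual code; `load_in_8bit` and `load_in_4bit` are the `transformers`/`bitsandbytes` quantization flags:

```python
def quantization_kwargs(quantization: str) -> dict:
    """Translate the quantization option into illustrative model-loading kwargs."""
    if quantization == "none":
        return {}                      # full-precision weights, no quantization
    if quantization == "8-bit":
        return {"load_in_8bit": True}  # 8-bit weights via bitsandbytes
    if quantization == "4-bit":
        # Only supported on Linux with CUDA devices (see the note above).
        return {"load_in_4bit": True}
    raise ValueError(f"unknown quantization: {quantization!r}")
```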

### other instructions
- On macOS, the baichuan-chat model is not supported, and the baichuan model cannot use 8-bit quantization.

### use cases

The table below shows the memory usage and supported devices of some models.

| Name | Size (B) | OS | No quantization (MB) | Quantization 8-bit (MB) | Quantization 4-bit (MB) |
|---------------|----------|-------|----------------------|-------------------------|-------------------------|
| baichuan-chat | 13 | linux | not currently tested | 13275 | 7263 |
| baichuan-chat | 13 | macos | not supported | not supported | not supported |
| vicuna-v1.3 | 7 | linux | 12884 | 6708 | 3620 |
| vicuna-v1.3 | 7 | macos | 12916 | 565 | not supported |
| baichuan | 7 | linux | 13480 | 7304 | 4216 |
| baichuan | 7 | macos | 13480 | not supported | not supported |
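The measured numbers above roughly follow a weights-only back-of-the-envelope estimate: about 2 bytes per parameter unquantized (fp16), 1 byte at 8-bit, and half a byte at 4-bit, with real usage adding framework overhead on top. A quick sketch of that estimate:

```python
def weight_memory_mb(n_params_billions: float, quantization: str) -> float:
    """Weights-only memory estimate in MB; real usage adds framework overhead."""
    bytes_per_param = {"none": 2.0, "8-bit": 1.0, "4-bit": 0.5}[quantization]
    return n_params_billions * 1e9 * bytes_per_param / 2**20


# A 7B model, compared with the measured numbers in the table above:
print(round(weight_memory_mb(7, "none")))   # → 13351 (table: 13480)
print(round(weight_memory_mb(7, "8-bit")))  # → 6676  (table: 7304)
print(round(weight_memory_mb(7, "4-bit")))  # → 3338  (table: 4216)
```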



## Roadmap
Xinference is currently under active development. Here's a roadmap outlining our planned
developments for the next few weeks:

### PyTorch Support
With PyTorch integration, users will be able to seamlessly utilize PyTorch models from Hugging Face
within Xinference.

### Langchain & LlamaIndex integration
With Xinference, it will be much easier for users to use these libraries and build applications
with LLMs.
51 changes: 47 additions & 4 deletions README_zh_CN.md
@@ -171,31 +171,74 @@ model.chat(
$ xinference list --all
```

### ggmlv3 models

| Name | Type | Language | Format | Size (in billions) | Quantization |
|---------------|------------------|----------|---------|--------------------|-----------------------------------------|
| llama-2 | Foundation Model | en | ggmlv3 | 7, 13 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0' |
| baichuan | Foundation Model | en, zh | ggmlv3 | 7 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0' |
| llama-2-chat | RLHF Model | en | ggmlv3 | 7, 13, 70 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0' |
| chatglm | SFT Model | en, zh | ggmlv3 | 6 | 'q4_0', 'q4_1', 'q5_0', 'q5_1', 'q8_0' |
| chatglm2 | SFT Model | en, zh | ggmlv3 | 6 | 'q4_0', 'q4_1', 'q5_0', 'q5_1', 'q8_0' |
| wizardlm-v1.0 | SFT Model | en | ggmlv3 | 7, 13, 33 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0' |
| wizardlm-v1.1 | SFT Model | en | ggmlv3 | 13 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0' |
| vicuna-v1.3 | SFT Model | en | ggmlv3 | 7, 13 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0' |
| orca | SFT Model | en | ggmlv3 | 3, 7, 13 | 'q4_0', 'q4_1', 'q5_0', 'q5_1', 'q8_0' |

### pytorch models

| Name | Type | Language | Format | Size (in billions) | Quantization |
|---------------|------------------|----------|---------|--------------------|--------------------------|
| baichuan | Foundation Model | en, zh | pytorch | 7, 13 | '4-bit', '8-bit', 'none' |
| baichuan-chat | SFT Model | en, zh | pytorch | 13 | '4-bit', '8-bit', 'none' |
| vicuna-v1.3 | SFT Model | en | pytorch | 7, 13, 33 | '4-bit', '8-bit', 'none' |


**NOTE**:
- Xinference downloads models automatically, and by default they are saved under `${USER}/.xinference/cache`.
- Foundation models provide only the `generate` interface.
- RLHF and SFT models provide both `generate` and `chat`.
- To use the Apple Metal GPU for acceleration, choose the q4_0 or q4_1 quantization method.
- The `llama-2-chat` 70B ggmlv3 model currently supports only q4_0 quantization.


## PyTorch Model Best Practices

PyTorch support has been integrated recently. The usage scenarios for PyTorch models are described below:

### supported models
- Foundation Model: baichuan (7B, 13B).
- SFT Model: baichuan-chat (13B), vicuna-v1.3 (7B, 13B, 33B).

### supported devices
- CUDA: On Linux and Windows systems, the `cuda` device is used by default.
- MPS: On Mac M1/M2 devices, the `mps` device is used by default.
- CPU: Using the `cpu` device is not recommended, as it consumes a lot of memory and inference is very slow.

### quantization methods
- `none`: no quantization is used.
- `8-bit`: use 8-bit quantization.
- `4-bit`: use 4-bit quantization. Note: 4-bit quantization is only supported on Linux systems with CUDA devices.

### other instructions
- On macOS, the baichuan-chat model is not supported, and the baichuan model cannot use 8-bit quantization.

### use cases

The table below shows the memory usage and supported devices of some models.

| Name | Size (B) | OS | No quantization (MB) | Quantization 8-bit (MB) | Quantization 4-bit (MB) |
|---------------|----------|-------|----------------------|-------------------------|-------------------------|
| baichuan-chat | 13       | linux | not yet tested       | 13275                   | 7263                    |
| baichuan-chat | 13       | macos | not supported        | not supported           | not supported           |
| vicuna-v1.3   | 7        | linux | 12884                | 6708                    | 3620                    |
| vicuna-v1.3   | 7        | macos | 12916                | 565                     | not supported           |
| baichuan      | 7        | linux | 13480                | 7304                    | 4216                    |
| baichuan      | 7        | macos | 13480                | not supported           | not supported           |

## Roadmap
Xinference is currently under rapid iteration. Our near-term development plans include:

### PyTorch Support
With PyTorch integration, users will be able to seamlessly use a large number of open-source models from Hugging Face within Xinference.

### Langchain & LlamaIndex integration
Through integration with Langchain and LlamaIndex, users will be able to quickly build AI applications on top of open-source models with Xinference.
