# SmoothQuant

LMDeploy provides functions for quantization and inference of large language models using 8-bit integers (INT8). For GPUs such as the NVIDIA H100, LMDeploy also supports 8-bit floating point (FP8).

The following NVIDIA GPUs can be used for INT8 and FP8 inference, respectively (a sketch for checking your GPU's compute capability follows the list):

- INT8
  - Volta (sm70): V100
  - Turing (sm75): 20 series, T4
  - Ampere (sm80, sm86): 30 series, A10, A16, A30, A100
  - Ada Lovelace (sm89): 40 series
  - Hopper (sm90): H100
- FP8
  - Ada Lovelace (sm89): 40 series
  - Hopper (sm90): H100
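If you are unsure which group your GPU belongs to, its compute capability can be queried with PyTorch's standard CUDA API. The check below is a minimal sketch, not part of LMDeploy; it simply mirrors the sm ranges listed above:

```python
import torch

# Query the compute capability of the current CUDA device, e.g. (8, 9) for sm89.
major, minor = torch.cuda.get_device_capability()
sm = major * 10 + minor
print(f"Detected sm{sm}")

if sm >= 89:    # Ada Lovelace and Hopper
    print("INT8 and FP8 inference are both supported")
elif sm >= 70:  # Volta through Ampere
    print("INT8 inference is supported")
else:
    print("Neither INT8 nor FP8 inference is supported")
```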

First of all, run the following command to install `lmdeploy`:

```shell
pip install lmdeploy[all]
```

## 8-bit Weight Quantization

Performing 8-bit weight quantization involves three steps:

1. Smooth Weights: Start by smoothing the weights of the large language model (LLM). This process makes the weights more amenable to quantization (a conceptual sketch of this step follows the list).
2. Replace Modules: Locate the DecoderLayers and replace the `RMSNorm` and `nn.Linear` modules with `QRMSNorm` and `QLinear` modules respectively. These 'Q' modules are available in the `lmdeploy/pytorch/models/q_modules.py` file.
3. Save the Quantized Model: Once you've made the necessary replacements, save the new, quantized model.
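To build intuition for the smoothing step, here is a conceptual sketch of the SmoothQuant scale computation for a single linear layer. It assumes per-channel activation maxima collected from a calibration pass; `smooth_linear` and its arguments are illustrative, not LMDeploy's actual API:

```python
import torch

def smooth_linear(weight: torch.Tensor, act_absmax: torch.Tensor, alpha: float = 0.5):
    """Migrate quantization difficulty from activations to weights.

    weight:     [out_features, in_features] nn.Linear weight
    act_absmax: [in_features] per-channel max |activation| from calibration
    alpha:      migration strength (0.5 is a common default)
    """
    w_absmax = weight.abs().amax(dim=0)  # per-input-channel weight maxima
    # SmoothQuant scale: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)
    scale = (act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)
    smoothed_weight = weight * scale.unsqueeze(0)  # W' = W * diag(s)
    # The preceding RMSNorm output is divided by `scale`, so the layer's
    # output is mathematically unchanged while activations become flatter.
    return smoothed_weight, scale
```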

LMDeploy provides the `lmdeploy lite smooth_quant` command to accomplish all three tasks detailed above. Note that the argument `--quant-dtype` determines whether int8 or fp8 weight quantization is performed. For more information about the CLI's usage, run `lmdeploy lite smooth_quant --help`.

Here are two examples:

- int8

  ```shell
  lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-int8 --quant-dtype int8
  ```

- fp8

  ```shell
  lmdeploy lite smooth_quant internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-fp8 --quant-dtype fp8
  ```
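In both cases, the quantized model is exported to the directory given by `--work-dir`; that directory is what you pass to the inference pipeline or the `api_server` below.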

## Inference

With the following code, you can perform batched offline inference with the quantized model:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# Run the quantized model on the PyTorch engine; tp sets the tensor-parallel degree.
engine_config = PytorchEngineConfig(tp=1)
pipe = pipeline("./internlm2_5-7b-chat-int8", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```
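The fp8 model produced earlier can be run the same way by pointing the pipeline at `./internlm2_5-7b-chat-fp8` instead.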

## Service

LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of starting the service:

```shell
lmdeploy serve api_server ./internlm2_5-7b-chat-int8 --backend pytorch
```

The default port of `api_server` is 23333. After the server is launched, you can communicate with it in the terminal through `api_client`:

```shell
lmdeploy serve api_client http://0.0.0.0:23333
```

You can browse and try out the `api_server` APIs online through the Swagger UI at http://0.0.0.0:23333, or read the API specification here.
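Because the RESTful APIs are OpenAI-compatible, the standard `openai` Python client can also talk to the server. Below is a minimal sketch; the `api_key` value is a placeholder, and the served model name is discovered via the `/v1/models` endpoint rather than hard-coded:

```python
from openai import OpenAI

# Point the client at the local api_server; the key is unused but required.
client = OpenAI(base_url="http://0.0.0.0:23333/v1", api_key="none")

# Query the model name registered by api_server instead of guessing it.
model_name = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Hi, pls intro yourself"}],
)
print(response.choices[0].message.content)
```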