FMS Model Optimizer is a framework for researching and developing reduced-precision numerical representations of neural networks using quantization and sparsification techniques. When such models are properly deployed on specialized hardware, such as modern GPUs, inference becomes both faster and more energy-efficient. The goal of this package is to provide a variety of popular quantization methods out-of-the-box with a simple interface, while maintaining enough flexibility that new, customized methods can be added and experimented with easily.
Quantization can dramatically reduce the memory and computation required to run a model, which is a crucial technique as we extend into the era of large language models (LLMs).
The FMS Model Optimizer goals are:
- To provide an extendable framework for optimization techniques like quantization on LLMs
- To generate quantized checkpoints for model serving purposes or as a base model for further studies
- To provide the capability to verify the inference performance of the resulting quantized models on specialized hardware. For example, once an INT8 model is generated, users may want to utilize the real INT compute engine on the GPU to run evaluation on this model and confirm the resulting accuracy
A 32-bit floating-point tensor $W$ with values in the range $[-\alpha, \alpha]$ can be approximated by an $N$-bit signed integer tensor:

$$W_{\text{int}} = \text{clip}\!\left(\left\lfloor \frac{W}{s} \right\rceil,\; -(2^{N-1}-1),\; 2^{N-1}-1\right), \qquad s = \frac{\alpha}{2^{N-1}-1},$$

where $s$ is the scaling factor and $\lfloor\cdot\rceil$ denotes rounding to the nearest integer. Dequantization, $\hat{W} = s\,W_{\text{int}}$, maps the integers back to real values that approximate the original $W$.

(A detailed example can be found in this tutorial notebook.)
The quantization process can be illustrated in the following plots:
- First row: scaling of $W$, e.g. 2.5 in the original tensor will be translated to 7.
- Second row: quantize (round and clip), e.g. numbers in the range [6.5, 7.5) will become 7.
- Third row: dequantize, scale the integers from [-7, 7] back to [-2.5, 2.5].
NOTE: The rounding process introduces approximation errors; put another way, some of the information in the original tensor is lost.
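For illustration, the whole scale → round/clip → dequantize sequence fits in a few lines. This is a minimal sketch using a symmetric 4-bit example so the integer range is $[-7, 7]$, matching the plots above; the function name and values are illustrative only:

```python
import torch

def quantize_dequantize(w: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Scale, round/clip to integers, then scale back (symmetric quantization)."""
    n_levels = 2 ** (num_bits - 1) - 1            # 7 for the 4-bit case
    scale = w.abs().max() / n_levels              # real value of one integer step
    w_int = torch.clamp(torch.round(w / scale), -n_levels, n_levels)
    return w_int * scale                          # dequantize

w = torch.tensor([2.5, -2.5, 0.9, 0.3])
print(quantize_dequantize(w))  # 2.5 -> integer 7 -> 2.5 again; 0.9 -> 3 -> ~1.07 (rounding error)
```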
To compensate for these quantization errors, one straightforward technique is to take quantization/dequantization into account during the training process, hence the name quantization-aware training (QAT), as illustrated by Step 1 of the following figure. The training optimizer will then adjust the parameters of the model, e.g. the weights, so that the resulting accuracy is comparable to the original FP32 model.
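A minimal sketch of the idea, assuming simple per-tensor fake quantization with a straight-through estimator (this is not the package's own quantizer implementation):

```python
import torch

def fake_quant(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Quantize/dequantize in the forward pass; let gradients pass straight through."""
    n_levels = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max() / n_levels
    x_q = torch.clamp(torch.round(x / scale), -n_levels, n_levels) * scale
    return x + (x_q - x).detach()   # straight-through estimator (STE)

# During QAT, weights (and/or activations) are fake-quantized inside the forward pass,
# so the optimizer "sees" the quantization error and adapts the FP32 weights to it.
w = torch.randn(16, 16, requires_grad=True)
x = torch.randn(4, 16)
y = x @ fake_quant(w).t()
y.sum().backward()   # gradients still reach w despite the non-differentiable rounding
```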
There are many other techniques, such as post-training quantization (PTQ), that can achieve a similar outcome. Users will need to pick the proper method for their specific task based on model size, dataset size, available resources, and other considerations.
Note that when QAT is performed as in the previous section, all matrix multiplications are still done in full precision. In other words, the QAT step only prepares the model for deployment; it does not provide any speed-up by itself. In fact, QAT slows down execution by quite a bit because of the additional quant/dequant operations on top of the full-precision forward passes and the customized backward passes.
To fully benefit from a quantized model, one will need to deploy it on specialized hardware. For example, if an INT8 matrix multiplication is needed in an INT8 quantized model but PyTorch does not yet have native support for this operation, users will need to write an "external kernel" (often in CUDA) to assign the computation to the true INT8 compute engine. Alternatively, users may utilize inference serving engines like TensorRT-LLM or vLLM, which provide their own implementations of external kernels and usually deliver better inference performance thanks to their highly optimized kernels.
Note: FMS Model Optimizer also provides a few simple INT kernels, which are easy to use and quite efficient compared to typical PyTorch FP16 execution, but are by no means the most optimized kernels. Their main purpose is to verify the INT model's behavior at the deployment stage, not to encourage users to serve models with our kernels.
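To make the deployment-stage computation concrete, here is a minimal sketch of what an INT8 linear operation computes, with the integer matmul simulated on the CPU. A real deployment would dispatch this to an INT8 CUDA kernel or a serving engine; the function name and tolerances are illustrative:

```python
import torch

def int8_matmul(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Quantize activations and weights to INT8 range, multiply in the integer domain, dequantize once."""
    s_x = x.abs().max() / 127
    s_w = w.abs().max() / 127
    x_i = torch.clamp(torch.round(x / s_x), -127, 127).to(torch.int64)  # int64 keeps the sketch CPU-friendly;
    w_i = torch.clamp(torch.round(w / s_w), -127, 127).to(torch.int64)  # a real kernel stores INT8 and accumulates in INT32
    acc = x_i @ w_i.t()                          # integer matrix multiply
    return acc.to(torch.float32) * (s_x * s_w)   # dequantize the accumulated output

x, w = torch.randn(4, 64), torch.randn(32, 64)
print((int8_matmul(x, w) - x @ w.t()).abs().max())  # small quantization error, not exactly zero
```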
Because an 8-bit floating-point format can represent at most 256 numbers over a given range, its representable numbers are separated by gaps; hence, it is a "quantized format" similar to INT8. Furthermore, on typical GPUs, e.g. H100 and MI300, FP8 and INT8 are usually designed to have the same computation throughput (although not necessarily the same energy efficiency; custom accelerators may also make different design choices). But these formats behave very differently when it comes to quantization. Here are a few things to keep in mind when choosing an 8-bit numerical format for your task.
An N-bit floating-point number is represented as N consecutive bits composed of 3 sub-parts, i.e. (sign bit, exponent bits, mantissa bits). For example, an FP8 number may allocate its 8 bits in a (1, 4, 3) or (1, 5, 2) manner. The real value of a given bit pattern (for normal, i.e. non-subnormal, numbers) can be calculated by:

$$x = (-1)^{\text{sign}} \times 2^{\,E - \text{bias}} \times \left(1 + \frac{M}{2^{m}}\right), \qquad \text{bias} = 2^{\,e-1} - 1,$$

where $E$ and $M$ are the unsigned integers stored in the $e$ exponent bits and $m$ mantissa bits, respectively.
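As a quick sanity check of this formula, here is a minimal sketch that decodes a hand-picked e4m3 bit pattern. The function name and example pattern are illustrative, and subnormal/special values are ignored:

```python
def fp_value(sign: int, exp_bits: str, man_bits: str) -> float:
    """Decode a normal floating-point value from its bit fields (subnormals/specials omitted)."""
    e, m = len(exp_bits), len(man_bits)
    bias = 2 ** (e - 1) - 1
    exponent = int(exp_bits, 2) - bias
    mantissa = 1 + int(man_bits, 2) / 2 ** m
    return (-1) ** sign * 2 ** exponent * mantissa

# An e4m3 (1, 4, 3) example: 0 1000 101  ->  2^(8-7) * (1 + 5/8) = 3.25
print(fp_value(0, "1000", "101"))
```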
Here we plot all the representable positive numbers for both the INT8 and FP8 formats on a linear scale (upper plot) and a log scale (lower plot). For easy comparison, both formats are scaled to the same maximum value, i.e. 448 in this case. We can see that FP8 has higher "resolution" for the smaller numbers in this range, while INT8 has higher resolution for the larger numbers.
Instead of using the quantization formula introduced previously, FP8 can be converted from FP16 through a simple "type casting", which is close to a truncation of the extra bits. But one should keep in mind that there are two flavors of FP8 commonly in use, i.e. `e4m3` and `e5m2`, which stand for "4-bit exponent, 3-bit mantissa" and "5-bit exponent, 2-bit mantissa", respectively. Their resolutions and representable ranges, compared to FP16's, are illustrated below.
Note: FP16 has a similar representable range to `e5m2` but with many more representable numbers available. That is why it is usually considered a "continuous format" rather than a "quantized format."
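For a quick feel of the two flavors, recent PyTorch builds expose FP8 dtypes that can be used for exactly this kind of type casting. This is a minimal sketch assuming `torch.float8_e4m3fn` and `torch.float8_e5m2` are available in your PyTorch version:

```python
import torch

x = torch.tensor([0.1234, 3.1416, 300.0], dtype=torch.float16)

# Cast down to each FP8 flavor, then back up to FP16 to inspect the rounding.
e4m3 = x.to(torch.float8_e4m3fn).to(torch.float16)  # finer mantissa, smaller representable range
e5m2 = x.to(torch.float8_e5m2).to(torch.float16)    # coarser mantissa, wider exponent range

print(e4m3)  # note how each flavor rounds the same values differently
print(e5m2)
```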
Native FP8 computation is only supported by newer GPUs, such as H100. Older GPUs might be able to support a scheme where data is stored in FP8 but compute is done in FP16.
FMS Model Optimizer supports FP8 in two ways:
- Deployable through vLLM. One can create FP8 checkpoints that can later be deployed with vLLM. This option simply invokes the `llm-compressor` package (for now; it may be integrated with internal functions in the near future). See example.
- Simulation only, which allows users to create and experiment with different scaling methods, e.g. per-token min/max, as sketched below. Users may need to create their own external kernels when deployment is needed. See example.
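As a sketch of what the simulation-only path does conceptually, here is per-token (per-row) max scaling into FP8 and back. This again assumes the `torch.float8_e4m3fn` dtype is available; the scaling choice and function name are illustrative, not the package's API:

```python
import torch

E4M3_MAX = 448.0  # largest normal e4m3 value

def fp8_per_token_sim(x: torch.Tensor) -> torch.Tensor:
    """Simulate FP8 quantization with one scale per token (row of the activation matrix)."""
    scale = x.abs().amax(dim=-1, keepdim=True) / E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)   # the cast performs the rounding
    return x_fp8.to(torch.float32) * scale        # dequantize for simulation

# Rows with very different magnitudes still get a small *relative* error.
x = torch.randn(4, 8) * torch.tensor([[0.1], [1.0], [10.0], [100.0]])
print((fp8_per_token_sim(x) - x).abs().amax(dim=-1) / x.abs().amax(dim=-1))
```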
For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such cases, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can load the compressed data and decompress it at the GPU cache level right before performing an FP16 computation. This approach is very powerful because it can reduce the number of GPUs needed to serve a model by 4X without sacrificing inference speed. (Some constraints may apply, such as the batch size not exceeding a certain number.) FMS Model Optimizer supports this method simply by utilizing the `auto_gptq` package. See this example.
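The core idea can be sketched with a toy 4-bit weight compression. This is not `auto_gptq`'s actual packing scheme or kernel, just an illustration of storing weights compressed and decompressing them right before an FP16 matmul:

```python
import torch

def pack_int4(w_int: torch.Tensor) -> torch.Tensor:
    """Pack two signed 4-bit values (range [-8, 7]) into each uint8 byte."""
    u = (w_int + 8).to(torch.uint8)               # shift to unsigned [0, 15]
    return u[..., 0::2] | (u[..., 1::2] << 4)     # low nibble, high nibble

def unpack_dequant(packed: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Unpack to int4 values and dequantize to FP16 right before the compute."""
    lo = (packed & 0x0F).to(torch.int16) - 8
    hi = (packed >> 4).to(torch.int16) - 8
    w_int = torch.stack((lo, hi), dim=-1).flatten(-2)
    return w_int.to(torch.float16) * scale

w = torch.randn(8, 64)
scale = (w.abs().max() / 7).half()
w_int = torch.clamp(torch.round(w / scale), -8, 7)
packed = pack_int4(w_int)                 # 4x smaller than the FP16 weights in memory
w_fp16 = unpack_dequant(packed, scale)    # decompress on the fly, then matmul in FP16
```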
The key architectural components are:
- `model_analyzer`, which traces the model and identifies the layers/operations to be quantized or to be skipped. It will try to recognize several well-known structures and configure them based on best practice. However, users can also choose to bypass the tracing and manually specify the desired configuration with full flexibility.
- A set of `wrappers`. As shown in the figure above, the preparation for QAT and deployment can be viewed as a "layer swapping" process: one identifies a desired `torch.nn.Linear` layer to be quantized, e.g. Linear1 in the plot, and replaces it with a `QLinear` wrapper, which contains a set of `quantizers` that quantize/dequantize the inputs and weights before the Linear operation. Similarly, the `QLinear` wrapper for the deployment stage will quantize the inputs, perform the INT matmul, then dequantize the outcome. It is mathematically equivalent to the wrapper used in QAT, but it can utilize the INT compute engine. A simplified version of this swap is sketched below.
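The swap itself can be pictured with a few lines of module surgery. This is a simplified sketch; `QuantLinearWrapper` and `swap_linears` are illustrative stand-ins, not the package's actual `QLinear` class or tracing logic:

```python
import torch
from torch import nn

class QuantLinearWrapper(nn.Module):
    """Wrap an nn.Linear and fake-quantize its input and weight before the matmul."""
    def __init__(self, linear: nn.Linear, num_bits: int = 8):
        super().__init__()
        self.linear = linear
        self.n_levels = 2 ** (num_bits - 1) - 1

    def _quantize(self, t: torch.Tensor) -> torch.Tensor:
        scale = t.detach().abs().max() / self.n_levels
        return torch.clamp(torch.round(t / scale), -self.n_levels, self.n_levels) * scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(
            self._quantize(x), self._quantize(self.linear.weight), self.linear.bias
        )

def swap_linears(module: nn.Module, skip: tuple = ("lm_head",)) -> nn.Module:
    """Replace every nn.Linear child (except skipped names) with its quantized wrapper."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear) and name not in skip:
            setattr(module, name, QuantLinearWrapper(child))
        else:
            swap_linears(child, skip)
    return module
```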
There are two main user interfaces:
- A Python script `fms_mo/run_quant.py`, which provides a means to call different optimization techniques like GPTQ, FP8, etc. For example, GPTQ can be run on the `Maykeye/TinyLLama-v0` model as follows:

  ```bash
  python -m fms_mo.run_quant --model_name_or_path Maykeye/TinyLLama-v0 \
      --output_dir TinyLLama-v0-gptq --quant_method gptq --training_data_path data
  ```
- Python functions which wrap model optimizer capability:
  - `quantize()`: the main entry point to quantize a given model with a set of specified hyperparameters
  - `qmodel_prep()`: prepares a PyTorch model for the quantization process with finer-grained control
The examples directory contains examples of how to use these two approaches for different optimization techniques.