NVIDIA TensorRT Model Optimizer Examples

Quantization
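Quantization reduces model weights and activations to lower-precision formats such as int8 or FP8. As a minimal conceptual sketch of the core idea (this is not the Model Optimizer API), symmetric int8 post-training quantization maps floats to integers through a per-tensor scale:

```python
# Conceptual sketch of symmetric int8 post-training quantization.
# NOT the Model Optimizer API -- just the idea the examples build on.

def quantize_int8(weights):
    """Map float weights to int8 values with a per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Approximately reconstruct the original floats."""
    return [v * scale for v in q]

w = [0.02, -1.27, 0.63, 0.001]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)  # close to w, up to rounding error
```

The real examples additionally handle per-channel scales, activation calibration, and export to TensorRT-LLM formats.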

Pruning

  • Pruning demonstrates how to optimally prune Linear and Conv layers, as well as Transformer attention heads, MLP, and depth, using the Model Optimizer for the supported frameworks.
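As a rough conceptual illustration of structured pruning (not the Model Optimizer API), neuron-level magnitude pruning of a Linear layer drops the output rows that contribute least, here measured by L2 norm:

```python
# Conceptual sketch of structured (neuron-level) magnitude pruning.
# NOT the Model Optimizer API: drop the weight rows of a Linear
# layer whose L2 norm is smallest, keeping a fixed fraction.
import math

def prune_neurons(weight_rows, keep_ratio):
    """Keep the top `keep_ratio` fraction of rows by L2 norm."""
    norms = [math.sqrt(sum(w * w for w in row)) for row in weight_rows]
    n_keep = max(1, int(len(weight_rows) * keep_ratio))
    # Indices of the strongest rows, restored to original order.
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [weight_rows[i] for i in kept]

layer = [[3.0, 4.0],   # norm 5.0
         [0.1, 0.1],   # norm ~0.14 (pruned)
         [1.0, 1.0]]   # norm ~1.41
pruned = prune_neurons(layer, keep_ratio=0.7)
# → [[3.0, 4.0], [1.0, 1.0]]
```

The actual examples go further, pruning attention heads and entire layers and searching for the best architecture under a constraint.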

Distillation

  • Distillation for LLMs demonstrates how to use Knowledge Distillation, which can increase accuracy and/or convergence speed for fine-tuning / QAT.
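The core of Knowledge Distillation is a loss that trains the student to match the teacher's temperature-softened output distribution. A minimal stdlib sketch of that loss (not the Model Optimizer API):

```python
# Conceptual sketch of the knowledge-distillation loss: cross-entropy
# between the teacher's and student's temperature-softened outputs.
# NOT the Model Optimizer API.
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the softened teacher."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Scaled by T^2 so gradient magnitudes stay comparable across T.
    return -temperature ** 2 * sum(
        pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

teacher = [4.0, 1.0, 0.2]
student = [3.5, 1.2, 0.1]
loss = distillation_loss(student, teacher)  # smaller as distributions agree
```

In practice this term is combined with the ordinary task loss on hard labels during fine-tuning or QAT.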

Speculative Decoding

  • Speculative Decoding demonstrates how to use speculative decoding to accelerate the text generation of large language models.
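The mechanism being accelerated can be sketched as follows: a cheap draft model proposes several tokens, the large target model verifies them in one pass, and the longest agreeing prefix is accepted. This greedy toy version with stand-in "models" is illustrative only, not the Model Optimizer API:

```python
# Conceptual sketch of (greedy) speculative decoding with toy
# deterministic "models" over integer tokens. NOT the ModelOpt API.

def speculative_step(prefix, draft_model, target_model, k=4):
    """One draft-and-verify step; returns the accepted tokens."""
    # 1. The cheap draft model autoregressively proposes k tokens.
    draft = []
    for _ in range(k):
        draft.append(draft_model(prefix + draft))
    # 2. The target model checks each proposal, accepting while it
    #    agrees (a real implementation scores all k in one pass).
    accepted = []
    for tok in draft:
        if target_model(prefix + accepted) == tok:
            accepted.append(tok)
        else:
            break
    # 3. Emit the target model's own next token, so progress is
    #    guaranteed and output matches target-only greedy decoding.
    accepted.append(target_model(prefix + accepted))
    return accepted

# Toy models: draft predicts last token + 1; target does too, capped at 3.
draft_model = lambda seq: seq[-1] + 1
target_model = lambda seq: min(seq[-1] + 1, 3)

print(speculative_step([0], draft_model, target_model, k=4))
# → [1, 2, 3, 3]: three drafted tokens accepted, then the target's own token.
```

The speedup comes from step 2: the target model validates several tokens per forward pass instead of generating one at a time.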

Sparsity

  • Sparsity for LLMs shows how to perform Post-training Sparsification and Sparsity-aware fine-tuning on a pre-trained Hugging Face model.
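A common target pattern here is 2:4 structured sparsity, which NVIDIA sparse Tensor Cores accelerate: in every group of four weights, the two smallest-magnitude values are zeroed. A conceptual stdlib sketch (not the Model Optimizer API):

```python
# Conceptual sketch of 2:4 structured sparsity: zero the two
# smallest-magnitude weights in every group of four.
# NOT the Model Optimizer API.

def sparsify_2_4(weights):
    """Apply a 2:4 pattern; len(weights) must be a multiple of 4."""
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Keep the two largest magnitudes, zero the rest.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out

w = [0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.0, 0.8]
print(sparsify_2_4(w))
# → [0.9, 0.0, 0.4, 0.0, -0.7, 0.0, 0.0, 0.8]
```

Sparsity-aware fine-tuning then recovers the accuracy lost by zeroing weights while keeping the pattern fixed.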

Evaluation

  • Evaluation for LLMs shows how to evaluate the performance of LLMs on popular benchmarks for quantized models or TensorRT-LLM engines.
  • Evaluation for VLMs shows how to evaluate the performance of VLMs on popular benchmarks for quantized models or TensorRT-LLM engines.

Chaining

  • Chained Optimizations shows how to chain multiple optimizations together (e.g. Pruning + Distillation + Quantization).
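Since each optimization takes a model in and returns a model, a chain is essentially function composition. A schematic sketch with placeholder stages (the stage names mirror the example above, but the bodies are hypothetical, not the Model Optimizer API):

```python
# Conceptual sketch of chaining optimizations as function
# composition. Stage bodies are placeholders that merely record
# that they ran. NOT the Model Optimizer API.

def chain(*stages):
    """Compose model transforms, applied left to right."""
    def run(model):
        for stage in stages:
            model = stage(model)
        return model
    return run

# Placeholder stages; a real pipeline would prune, distill to
# recover accuracy, then quantize the result.
prune = lambda m: m + ["pruned"]
distill = lambda m: m + ["distilled"]
quantize = lambda m: m + ["quantized"]

pipeline = chain(prune, distill, quantize)
print(pipeline(["base"]))
# → ['base', 'pruned', 'distilled', 'quantized']
```

Order matters in practice: pruning before distillation lets the teacher repair the pruned student, and quantization last avoids retraining in low precision.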

Model Hub

  • Model Hub provides an example of deploying and running the quantized Llama 3.1 8B Instruct model from NVIDIA's Hugging Face model hub on both TensorRT-LLM and vLLM.

Windows

  • Windows contains examples for Model Optimizer on Windows.