A high-throughput and memory-efficient inference and serving engine for LLMs
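This tagline matches vLLM's project description. Assuming the repo is vLLM, here is a minimal sketch of offline batch inference with it; the model name and prompt are illustrative, not taken from this listing:

```python
# Minimal offline batch-inference sketch with vLLM (assumes `pip install vllm`;
# the model id below is illustrative, not prescribed by this listing).
from vllm import LLM, SamplingParams

prompts = ["What is AWS Trainium?"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

llm = LLM(model="facebook/opt-125m")  # any Hugging Face causal-LM id works here
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```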
Foundation model benchmarking tool. Run any model on any AWS platform and benchmark its performance across instance types and serving stack options.
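A hedged sketch of driving such a benchmark run from Python, assuming the tool installs as the `fmbench` package and, per its README, exposes an `fmbench` CLI with a `--config-file` flag (both are assumptions about this listing; the config path is hypothetical):

```python
# Hypothetical FMBench invocation from Python; the `--config-file` flag is
# assumed from the project docs, and the config path is illustrative.
import subprocess

subprocess.run(
    ["fmbench", "--config-file", "configs/config.yml"],
    check=True,  # raise if the benchmark run exits non-zero
)
```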
A production-ready inference server supporting any AI model on all major hardware platforms (CPU, GPU, TPU, Apple Silicon). Inferno seamlessly deploys and serves language models from Hugging Face, from local files, or in GGUF format, with automatic memory management and hardware optimization. Developed by HelpingAI.
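Inferno's own API is not shown in this listing; as a generic illustration of loading and querying a local GGUF model, here is a sketch using llama-cpp-python rather than Inferno itself (the model path is hypothetical):

```python
# Generic GGUF inference sketch with llama-cpp-python (not Inferno's API;
# assumes `pip install llama-cpp-python` and a local GGUF file).
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-7b.gguf")  # hypothetical local path
result = llm("Q: What is a GGUF file? A:", max_tokens=64, stop=["Q:"])
print(result["choices"][0]["text"])
```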