- Overview
- Features
- Requirements
- Security considerations
- Monitoring system
- Preparation of the environment
- Running services
This project is a functional prototype built to explore multi-container deployment of ML inference APIs using large language models. It intentionally skips production features (like reverse proxy and autoscaling) to focus on clean modular design, Docker-based deployment and GPU provisioning.
It includes:
- Whisper for automatic speech recognition and transcription
- Stable Diffusion for image generation
- LLaMA Q4_K_M for text generation
- Gemma 2 for text generation
- Silero RU V3 for text-to-speech audio synthesis for the Russian language
Each service is independently deployed using FastAPI and Docker and optimized for GPU inference. Additionally, a monitoring system based on Prometheus/Grafana is included to keep track of the requests that the services process.
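As an illustration of how a deployed service is typically consumed, below is a minimal sketch of sending a request to one of the FastAPI services over HTTP once the stack is up. The port number and the `/generate` route are assumptions made for illustration only; the actual ports and routes are defined by the individual services.

```bash
# Hypothetical request to a text-generation service.
# Port 8000 and the /generate route are placeholders, not the project's actual configuration.
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, world!"}'
```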
- Supports multiple transformer-based models (Hugging Face)
- GPU-accelerated inference (CUDA-compatible)
- Docker Compose with modular containers
- Internal service communication over HTTP
- Optimized Docker images using shared base and caching layers
- Optionally supports rootless Docker for better isolation
In order to use the services efficiently, the local system should have access to CUDA drivers and a GPU. Instructions on how to check GPU visibility/availability can be found below.
Details: GPU visibility check
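A quick way to confirm that the driver and GPU are visible is sketched below; `nvidia-smi` ships with the NVIDIA driver, and the second check assumes PyTorch is installed in the current environment.

```bash
# Check that the NVIDIA driver sees the GPU
nvidia-smi

# Optional: confirm CUDA is available from Python (assumes torch is installed)
python3 -c "import torch; print(torch.cuda.is_available())"
```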
Each service is launched in a separate Docker container, and the containers are orchestrated using Docker Compose. Hence, it is necessary to have Docker installed on the system.
Details: Docker installation for Ubuntu
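After installation, Docker and the Compose plugin can be verified, for example, as follows:

```bash
# Verify Docker Engine and the Compose plugin
docker --version
docker compose version

# Optional sanity check that containers can run at all
docker run --rm hello-world
```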
Since the inference is conducted on the GPU, the containers need access to the system's GPU. This requires the NVIDIA Container Toolkit to be installed.
Details: NVIDIA Container Toolkit Installation
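Once the toolkit is installed and Docker is configured to use it, GPU access from inside containers can be verified with a throwaway test container; the CUDA image tag below is only an example and may need to be adjusted to the locally installed driver.

```bash
# Run nvidia-smi inside a CUDA container to confirm GPU passthrough
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```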
To fit all models on a single GPU, it is recommended that the GPU have at least 16 GB of VRAM.
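The available VRAM can be checked, for instance, with an `nvidia-smi` query:

```bash
# Report total GPU memory to confirm at least 16 GB of VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv
```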
- Containers are configured to run under a non-root user to prevent privilege escalation and align with container hardening best practices.
- This project was also built using rootless Docker to ensure containers do not run with root privileges, improving deployment security and isolating services from the host system. Accordingly, the NVIDIA Container Toolkit has also been configured to run in rootless mode (see the reference sketch below).
Details: Configuring NVIDIA Container Toolkit for rootless mode
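For reference, a typical rootless configuration following the NVIDIA Container Toolkit documentation looks roughly like the sketch below; the exact steps may differ between toolkit versions, so the linked guide remains authoritative.

```bash
# Point the NVIDIA runtime at the rootless Docker daemon's config
nvidia-ctk runtime configure --runtime=docker --config=$HOME/.config/docker/daemon.json
systemctl --user restart docker

# Allow the toolkit to operate without cgroup access (needed in rootless mode)
sudo nvidia-ctk config --set nvidia-container-cli.no-cgroups --in-place
```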
To keep track of key metrics and the state of requests to the APIs, a monitoring system has been embedded. More information about monitoring can be found here.
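Assuming the default ports for Prometheus (9090) and Grafana (3000) are used (an assumption; the actual ports are defined in the Compose configuration), the monitoring endpoints can be probed once the stack is running:

```bash
# Quick health probes (default ports assumed)
curl -s http://localhost:9090/-/healthy   # Prometheus
curl -s http://localhost:3000/api/health  # Grafana (log in with the credentials from .env)
```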
First, we clone the project locally and set up the environment:
# Cloning the repo
git clone https://github.com/spolivin/llm-inference-api.git
cd llm-inference-api
# Copying example environment and configuring
cp .env.example .env
NOTE: File `.env.example` contains only credentials for Grafana, which can be configured with a new username and password after copying.
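For illustration only, the Grafana credentials in `.env` might look like the snippet below; the actual variable names should be taken from `.env.example` itself.

```bash
# Hypothetical .env contents: the real variable names come from .env.example
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=change-me
```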
Next, we need to set up a virtual environment so that the subsequent steps can be executed without problems.
sudo apt-get update
sudo apt-get install python3.12-venv
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements-all.txt
NOTE: During installation of the libraries, a `subprocess-exited-with-error` error can appear because installing some libraries requires compiling C/C++ code. The best way to avoid this error is to install the required compilers (such as `gcc` or `g++`) and other needed tools via `sudo apt install build-essential`.
source setup_env.sh
NOTE: Before running this command, one needs to make sure that Conda is installed via the `which conda` command. In case of its absence, it is necessary to follow the Conda installation guidelines for local installation on Ubuntu.
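If Conda is missing, a common way to get it on Ubuntu is a local Miniconda installation, roughly as sketched below; follow the official guidelines for the authoritative steps.

```bash
# Check whether Conda is already available
which conda

# If not, install Miniconda locally (sketch; see the official guide for details)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"
"$HOME/miniconda3/bin/conda" init bash
```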
The next step is to download all the models that we are about to deploy locally. Before that, we need to log in to Hugging Face, since some models will be pulled from the HF Hub:
sh hf_login.sh <YOUR-HF-TOKEN>
cd llama-api
sh download_llama_weights.sh
cd ..
cd gemma-api
sh download_gemma_model.sh
cd ..
cd stable-diffusion-api
python download_sd_model.py
cd ..
cd tts-api
python download_tts_model.py
cd ..
cd whisper-api
python download_whisper_model.py
cd ..
After downloading all models, it is necessary to log out of Hugging Face to avoid the token leaking into the containers during the build stage:
sh hf_logout.sh
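One can optionally confirm that the token is no longer present before building the images, e.g. via the Hugging Face CLI (assuming `huggingface_hub` is installed in the active environment):

```bash
# Should report that no user is logged in once the logout succeeded
huggingface-cli whoami
```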
In order to save disk space and re-use cached layers, the system is based on a shared base image which we need to build first:
# Building a common image for re-use across images
make build-base
After successfully building the base image, we can go on and build the main services and launch them:
# Building services with deployed models
make build-services
# Launching the system
make up-services
One can check if all services (including Prometheus and Grafana) are up and running:
make services
NOTE: One can optionally check the opened ports via `make open-ports` or `make containers` to make sure that all services are running.
The services can be stopped and removed as follows:
make down-services
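Since the stack is orchestrated with Docker Compose, the usual Compose commands can also be used directly alongside the Make targets, for example to inspect the state of the services or follow a service's logs (run from the repository root):

```bash
# Inspect Compose services and follow the logs of a particular service
docker compose ps
docker compose logs -f <service-name>
```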