FlexLLama is a lightweight, extensible, and user-friendly self-hosted tool that easily runs multiple llama.cpp server instances with OpenAI v1 API compatibility. It's designed to manage multiple models across different GPUs, making it a powerful solution for local AI development and deployment.
- Multiple llama.cpp instances - Run different models simultaneously
- Multi-GPU support - Distribute models across different GPUs
- OpenAI v1 API compatible - Drop-in replacement for OpenAI endpoints
- Real-time dashboard - Monitor model status with a web interface
- Chat & Completions - Full chat and text completion support
- Embeddings & Reranking - Supports models for embeddings and reranking
- Auto-start - Automatically start default runners on launch
- Model switching - Dynamically load/unload models as needed
Want to get started in 5 minutes? Check out our QUICKSTART.md for a simple Docker setup with the Qwen3-4B model!
- Install FlexLLama:

  From GitHub:

      pip install git+https://github.com/yazon/flexllama.git

  From local source (after cloning):

      # git clone https://github.com/yazon/flexllama.git
      # cd flexllama
      pip install .

- Create your configuration: Copy the example configuration file to create your own. If you installed from a local clone, you can run:

      cp config_example.json config.json

  If you installed from git, you may need to download it from the repository.

- Edit `config.json`: Update `config.json` with the correct paths for your `llama-server` binary and your model files (`.gguf`).

- Run FlexLLama:

      python main.py config.json

  or

      flexllama config.json

- Open dashboard: http://localhost:8080 (a sample API request is shown after these steps)
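With FlexLLama running, any OpenAI-compatible client can talk to it. Below is a minimal sketch using the official `openai` Python package; the base URL assumes the default port 8080, and `your-model-alias` is a placeholder for a model name you defined in `config.json`.

```python
# Minimal sketch: send a chat request to FlexLLama through its OpenAI v1 API.
# Assumptions: FlexLLama listens on http://localhost:8080 and "your-model-alias"
# is a placeholder for a model name from your config.json.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # FlexLLama's OpenAI-compatible endpoint
    api_key="not-needed",                 # a local server does not check the key
)

response = client.chat.completions.create(
    model="your-model-alias",  # placeholder: use a model name from your config.json
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```

Text completions and, for suitable models, embeddings go through the corresponding `/v1` endpoints in the same way.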
FlexLLama can be run using Docker and Docker Compose. We provide profiles for both CPU-only and GPU-accelerated (NVIDIA CUDA) environments.
- Clone the repository:

      git clone https://github.com/yazon/flexllama.git
      cd flexllama

After cloning, you can proceed with the quick start script or a manual setup.

For an easier start, the `docker-start.sh` helper script automates several setup steps. It checks your Docker environment, builds the correct image (CPU or GPU), and provides the commands to launch FlexLLama.
- Make the script executable (Linux/Unix):

      chmod +x docker-start.sh

- Run the script: Use the `--gpu` flag for NVIDIA GPU support.

  For CPU-only setup:

      ./docker-start.sh

  or

      ./docker-start.ps1

  For GPU-accelerated setup:

      ./docker-start.sh --gpu

  or

      ./docker-start.ps1 -gpu

- Follow the on-screen instructions: The script will guide you.
Manual Docker and Docker Compose Setup
If you prefer to run the steps manually, follow this guide:
- Place your models:

      # Create the models directory if it doesn't exist
      mkdir -p models
      # Copy your .gguf model files into it
      cp /path/to/your/model.gguf models/

- Configure your models:

      # Edit the Docker configuration to point to your models
      # - CPU-only: keep "n_gpu_layers": 0
      # - GPU: set "n_gpu_layers" to e.g. 99 and specify "main_gpu": 0

- Build and Start FlexLLama with Docker Compose (Recommended): Use the `--profile` flag to select your environment. The service will be available at http://localhost:8080.

  For CPU-only:

      docker compose --profile cpu up --build -d

  For GPU support (NVIDIA CUDA):

      docker compose --profile gpu up --build -d

- View Logs: To monitor the output of your services, you can view their logs in real-time.

  For the CPU service:

      docker compose --profile cpu logs -f

  For the GPU service:

      docker compose --profile gpu logs -f

  (Press `Ctrl+C` to stop viewing the logs.)

- (Alternative) Using `docker run`: You can also build and run the containers manually.

  For CPU-only:

      # Build the image
      docker build -t flexllama:latest .
      # Run the container
      docker run -d -p 8080:8080 \
        -v $(pwd)/models:/app/models:ro \
        -v $(pwd)/docker/config.json:/app/config.json:ro \
        flexllama:latest

  For GPU support (NVIDIA CUDA):

      # Build the image
      docker build -f Dockerfile.cuda -t flexllama-gpu:latest .
      # Run the container
      docker run -d --gpus all -p 8080:8080 \
        -v $(pwd)/models:/app/models:ro \
        -v $(pwd)/docker/config.json:/app/config.json:ro \
        flexllama-gpu:latest

- Open the dashboard: Access the FlexLLama dashboard in your browser: http://localhost:8080 (a quick sanity check follows below)
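Once the container is up, you can do a quick sanity check against the same endpoints the test suite uses (`/health` and `/v1/models`). A small sketch using only the Python standard library; adjust the base URL if you mapped a different host port:

```python
# Quick sanity check for a running FlexLLama container (sketch).
# Assumes the default 8080:8080 port mapping used in the commands above.
import urllib.request

BASE_URL = "http://localhost:8080"

for path in ("/health", "/v1/models"):
    with urllib.request.urlopen(BASE_URL + path, timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
        print(f"{path} -> HTTP {resp.status}")
        print(body[:300])  # print the start of the response body
```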
FlexLLama is highly configurable through the config.json file. You can set up multiple runners, distribute models across GPUs, configure auto-unload timeouts, set environment variables, and much more.
For detailed configuration options, examples, and advanced setups, see CONFIGURATION.md
- Edit `config.json` to add your models and runners (an illustrative sketch follows this list)
- Use `config_example.json` as a reference
- Validate your configuration: `python backend/config.py config.json`
- Set `auto_start_runners: true` to automatically load models on startup
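To give a feel for what goes into the file, here is a rough sketch that writes a `config.json` covering the settings this README mentions (the `llama-server` binary path, `.gguf` model paths, `n_gpu_layers`, `main_gpu`, `auto_start_runners`). The nesting and any key names not mentioned above are assumptions for illustration only; treat `config_example.json` and CONFIGURATION.md as the authoritative reference.

```python
# Illustrative only: the real schema is defined by config_example.json and
# CONFIGURATION.md. Key names such as "runners", "models", "runner", and "port"
# are assumptions made for this sketch, not the documented schema.
import json

config = {
    "auto_start_runners": True,  # mentioned in this README
    "runners": {                 # assumed grouping of llama-server instances
        "gpu0": {
            "path": "/usr/local/bin/llama-server",  # path to your llama-server binary
            "port": 8085,                           # assumed per-runner port
        },
    },
    "models": [                  # assumed list of models served by the runners
        {
            "model": "models/your-model.gguf",  # path to a .gguf file
            "runner": "gpu0",                   # assumed reference to a runner above
            "n_gpu_layers": 99,                 # 0 for CPU-only, high value for GPU offload
            "main_gpu": 0,                      # which GPU to use
        },
    ],
}

with open("config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
print("Wrote illustrative config.json; validate with: python backend/config.py config.json")
```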
FlexLLama includes a comprehensive test suite to validate your setup and ensure everything is working correctly.
The tests/ directory contains scripts for different testing purposes. All test scripts generate detailed logs in the tests/logs/{session_id}/ directory.
Prerequisites:
- For `test_basic.py` and `test_all_models.py`, the main application must be running (`flexllama config.json`).
- For `test_model_switching.py`, the main application should not be running.
test_basic.py performs basic checks on the API endpoints to ensure they are responsive.
    # Run basic tests against the default URL (http://localhost:8080)
    python tests/test_basic.py

What it tests:

- `/v1/models` and `/health` endpoints
- `/v1/chat/completions` with both regular and streaming responses
- Concurrent request handling
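If you want to reproduce the streaming check by hand, here is a short sketch using the `openai` package; the model name is again a placeholder for an entry from your `config.json`.

```python
# Manual streaming check against /v1/chat/completions (sketch).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="your-model-alias",  # placeholder: a model name from your config.json
    messages=[{"role": "user", "content": "Count from one to five."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; some chunks may be empty.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```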
test_all_models.py runs a comprehensive test suite against every model defined in your config.json.
    # Test all configured models
    python tests/test_all_models.py config.json

What it tests:
- Model loading and health checks
- Chat completions (regular and streaming) for each model
- Response time and error handling
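A hand-rolled equivalent of this per-model sweep is straightforward: list the model ids from `/v1/models` and send one request to each. A sketch (note that embedding or reranking models advertised there will reject chat requests):

```python
# Sketch: loop over every model FlexLLama advertises and send one chat request.
# Embedding/reranking models will fail the chat call; errors are just printed here.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

for model in client.models.list().data:
    try:
        reply = client.chat.completions.create(
            model=model.id,
            messages=[{"role": "user", "content": "Reply with the single word OK."}],
        )
        text = (reply.choices[0].message.content or "").strip()
        print(f"{model.id}: {text[:60]}")
    except Exception as exc:  # e.g. a non-chat model
        print(f"{model.id}: request failed ({exc})")
```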
test_model_switching.py verifies the dynamic loading and unloading of models.
    # Run model switching tests
    python tests/test_model_switching.py config.json

What it tests:
- Dynamic model loading and switching
- Runner state management and health monitoring
- Proper cleanup of resources
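Outside of the test script, you can also observe switching directly against a running instance: alternating the `model` field between two configured names should make FlexLLama load and unload the corresponding runners on demand. A sketch with placeholder names:

```python
# Sketch: exercise dynamic model switching by alternating between two model names.
# "model-a" and "model-b" are placeholders; use names from your config.json.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

for alias in ("model-a", "model-b", "model-a"):
    reply = client.chat.completions.create(
        model=alias,
        messages=[{"role": "user", "content": "Which model is answering?"}],
    )
    text = (reply.choices[0].message.content or "").strip()
    print(f"{alias}: {text[:80]}")
```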
This project is licensed under the BSD-3-Clause License. See the LICENSE file for details.
Ready to run multiple LLMs like a pro? Edit your `config.json` and start FlexLLama!

