
FlexLLama Logo

FlexLLama - "One to rule them all"


FlexLLama is a lightweight, extensible, and user-friendly self-hosted tool that easily runs multiple llama.cpp server instances with OpenAI v1 API compatibility. It's designed to manage multiple models across different GPUs, making it a powerful solution for local AI development and deployment.

Key Features of FlexLLama

  • 🚀 Multiple llama.cpp instances - Run different models simultaneously
  • 🎯 Multi-GPU support - Distribute models across different GPUs
  • 🔌 OpenAI v1 API compatible - Drop-in replacement for OpenAI endpoints (see the example below)
  • 📊 Real-time dashboard - Monitor model status with a web interface
  • 🤖 Chat & Completions - Full chat and text completion support
  • 🔍 Embeddings & Reranking - Supports models for embeddings and reranking
  • ⚡ Auto-start - Automatically start default runners on launch
  • 🔄 Model switching - Dynamically load/unload models as needed

FlexLLama Dashboard
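
Because FlexLLama exposes the OpenAI v1 API, existing OpenAI clients can point at it unchanged. The sketch below is illustrative only: it assumes the official openai Python package is installed (pip install openai) and uses "my-model" as a placeholder for a model name from your config.json (the /v1/models endpoint described in the Testing section lists the names the running server knows about).

# Minimal sketch: call FlexLLama through the standard OpenAI Python client.
# Assumes FlexLLama is running on the default port 8080; "my-model" is a
# placeholder for a model name defined in your config.json.
from openai import OpenAI

# The client library requires an api_key; any placeholder works when no auth is configured.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)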

Quickstart

🚀 Want to get started in 5 minutes? Check out our QUICKSTART.md for a simple Docker setup with the Qwen3-4B model!

📦 Local Installation

  1. Install FlexLLama:

    From GitHub:

    pip install git+https://github.com/yazon/flexllama.git

    From local source (after cloning):

    # git clone https://github.com/yazon/flexllama.git
    # cd flexllama
    pip install .
  2. Create your configuration: Copy the example configuration file to create your own. If you installed from a local clone, you can run:

    cp config_example.json config.json

    If you installed directly from GitHub, download config_example.json from the repository first.

  3. Edit config.json: Update config.json with the correct paths for your llama-server binary and your model files (.gguf).

  4. Run FlexLLama:

    python main.py config.json

    or

    flexllama config.json
  5. Open dashboard:

    http://localhost:8080
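
Once FlexLLama is running (step 4 above), the dashboard and the OpenAI-compatible API share port 8080, so you can also verify the setup programmatically. This is a minimal sketch using only the Python standard library against the /health and /v1/models endpoints described in the Testing section; the exact response bodies depend on your configuration.

# Minimal sketch: confirm FlexLLama is up and see which models it exposes.
# Assumes the default address http://localhost:8080.
import urllib.request

for path in ("/health", "/v1/models"):
    with urllib.request.urlopen(f"http://localhost:8080{path}") as resp:
        print(path, resp.status, resp.read().decode()[:200])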
    

🐳 Docker

FlexLLama can be run using Docker and Docker Compose. We provide profiles for both CPU-only and GPU-accelerated (NVIDIA CUDA) environments.

  1. Clone the repository:

    git clone https://github.com/yazon/flexllama.git
    cd flexllama

After cloning, you can proceed with the quick start script or a manual setup.


Using the Quick Start Script (docker-start.sh)

For an easier start, the docker-start.sh helper script (docker-start.ps1 on Windows) automates several setup steps. It checks your Docker environment, builds the correct image (CPU or GPU), and provides the commands to launch FlexLLama.

  1. Make the script executable (Linux/Unix):

    chmod +x docker-start.sh
  2. Run the script: Use docker-start.sh on Linux/Unix or docker-start.ps1 on Windows PowerShell, adding the --gpu flag (-gpu in PowerShell) for NVIDIA GPU support.

    For CPU-only setup:

    ./docker-start.sh

    or

    ./docker-start.ps1

    For GPU-accelerated setup:

    ./docker-start.sh --gpu

    or

    ./docker-start.ps1 -gpu
  3. Follow the on-screen instructions: The script will guide you.


Manual Docker and Docker Compose Setup

If you prefer to run the steps manually, follow this guide:

  1. Place your models:

    # Create the models directory if it doesn't exist
    mkdir -p models
    # Copy your .gguf model files into it
    cp /path/to/your/model.gguf models/
  2. Configure your models: Edit docker/config.json so that each model entry points to your .gguf files.

       • CPU-only: keep "n_gpu_layers": 0
       • GPU: set "n_gpu_layers" high enough to offload the model (e.g. 99) and set "main_gpu" to the target GPU index (e.g. 0)
  3. Build and Start FlexLLama with Docker Compose (Recommended): Use the --profile flag to select your environment. The service will be available at http://localhost:8080.

    For CPU-only:

    docker compose --profile cpu up --build -d

    For GPU support (NVIDIA CUDA):

    docker compose --profile gpu up --build -d
  4. View logs: To monitor the output of your services, you can view their logs in real time.

    For the CPU service:

    docker compose --profile cpu logs -f

    For the GPU service:

    docker compose --profile gpu logs -f

    (Press Ctrl+C to stop viewing the logs.)

  5. (Alternative) Using docker run: You can also build and run the containers manually.

    For CPU-only:

    # Build the image
    docker build -t flexllama:latest .
    # Run the container
    docker run -d -p 8080:8080 \
      -v $(pwd)/models:/app/models:ro \
      -v $(pwd)/docker/config.json:/app/config.json:ro \
      flexllama:latest

    For GPU support (NVIDIA CUDA):

    # Build the image
    docker build -f Dockerfile.cuda -t flexllama-gpu:latest .
    # Run the container
    docker run -d --gpus all -p 8080:8080 \
      -v $(pwd)/models:/app/models:ro \
      -v $(pwd)/docker/config.json:/app/config.json:ro \
      flexllama-gpu:latest
  6. Open the dashboard: Access the FlexLLama dashboard in your browser: http://localhost:8080

Configuration

FlexLLama is highly configurable through the config.json file. You can set up multiple runners, distribute models across GPUs, configure auto-unload timeouts, set environment variables, and much more.

📖 For detailed configuration options, examples, and advanced setups, see CONFIGURATION.md

Quick Configuration Tips

  • Edit config.json to add your models and runners
  • Use config_example.json as a reference
  • Validate your configuration: python backend/config.py config.json
  • Set auto_start_runners: true to automatically load models on startup

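In addition to the built-in validator above, a quick schema-agnostic sanity check is to confirm that every .gguf path referenced anywhere in config.json actually exists on disk. The sketch below assumes nothing about FlexLLama's configuration schema beyond the file being valid JSON; it simply walks the parsed structure.

# Minimal sketch: report whether every ".gguf" path mentioned in config.json exists.
# Schema-agnostic on purpose: it only assumes the file is valid JSON.
import json
import os
import sys

def iter_strings(node):
    """Yield every string value nested anywhere inside the parsed JSON."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_strings(value)
    elif isinstance(node, str):
        yield node

with open(sys.argv[1] if len(sys.argv) > 1 else "config.json") as f:
    config = json.load(f)

for value in iter_strings(config):
    if value.endswith(".gguf"):
        print(("OK      " if os.path.exists(value) else "MISSING ") + value)
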
Testing

FlexLLama includes a comprehensive test suite to validate your setup and ensure everything is working correctly.

Running Tests

The tests/ directory contains scripts for different testing purposes. All test scripts generate detailed logs in the tests/logs/{session_id}/ directory.

Prerequisites:

  • For test_basic.py and test_all_models.py, the main application must be running (flexllama config.json).
  • For test_model_switching.py, the main application should not be running.

Basic API Tests

test_basic.py performs basic checks on the API endpoints to ensure they are responsive.

# Run basic tests against the default URL (http://localhost:8080)
python tests/test_basic.py

What it tests:

  • /v1/models and /health endpoints
  • /v1/chat/completions with both regular and streaming responses
  • Concurrent request handling

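The streaming behavior exercised by test_basic.py can also be reproduced by hand. The sketch below again uses the openai package against the default URL; "my-model" is a placeholder for a model name from your config.json, and FlexLLama must already be running.

# Minimal sketch: request a streaming chat completion and print tokens as they arrive.
# "my-model" is a placeholder for a model name defined in your config.json.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
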
All Models Test

test_all_models.py runs a comprehensive test suite against every model defined in your config.json.

# Test all configured models
python tests/test_all_models.py config.json

What it tests:

  • Model loading and health checks
  • Chat completions (regular and streaming) for each model
  • Response time and error handling

Model Switching Test

test_model_switching.py verifies the dynamic loading and unloading of models.

# Run model switching tests
python tests/test_model_switching.py config.json

What it tests:

  • Dynamic model loading and switching
  • Runner state management and health monitoring
  • Proper cleanup of resources

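Per the feature list, FlexLLama loads and unloads models on demand, so from a client's point of view switching models is just naming a different configured model in the next request. The sketch below illustrates that pattern only; "model-a" and "model-b" are placeholders for two entries from your config.json, any loading or unloading is handled by FlexLLama itself, and unlike test_model_switching.py it assumes FlexLLama is already running.

# Minimal sketch: address two configured models in turn; FlexLLama handles any
# loading/unloading behind the scenes. "model-a" and "model-b" are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

for model_name in ("model-a", "model-b"):
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": "Reply with the single word: ready."}],
    )
    print(model_name, "->", response.choices[0].message.content)
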
License

This project is licensed under the BSD-3-Clause License. See the LICENSE file for details.


🚀 Ready to run multiple LLMs like a pro? Edit your config.json and start FlexLLama!
