RunPod LLM Pod Manager with FastAPI Proxy


This advanced system automates the full lifecycle of GPU-backed LLM pods on RunPod, featuring enterprise-grade caching, monitoring, security, and a high-performance FastAPI reverse proxy for development tools like Continue, CodeGPT, and Prinova Cody.

πŸš€ Purpose

  • πŸš€ Launch ephemeral LLM pods with OpenAI-compatible endpoints
  • πŸ”’ Default to SECURE mode for privacy and isolation
  • ⚑ High-performance FastAPI reverse proxy with intelligent caching
  • πŸ“Š Real-time metrics and health monitoring
  • πŸ” SSL/TLS support for secure communication
  • πŸ’Ύ LRU cache with configurable size limits
  • πŸ“ˆ Structured JSON logging and performance profiling
  • πŸ₯ Comprehensive health checks and dashboard
  • ⏰ Track pod state for restarts, shutdowns, and cost control
  • πŸ’° Enforce runtime limits and prevent lingering charges
  • πŸ€– Support cron-based watchdog execution

πŸ“¦ Files

File Description
manage_pod.py πŸš€ Unified lifecycle controller: start, restart, terminate, watchdog
proxy_fastapi.py ⚑ FastAPI proxy with caching, metrics, SSL, and security monitoring
security_utils.py πŸ”’ Security utilities: SBOM generation, vulnerability scanning, compliance
pod_config.json βš™οΈ Configuration file with model, GPU, cache, SSL, and runtime settings
pod_state.json πŸ“Š Auto-generated state file storing pod ID, model, and runtime info
requirements.txt πŸ“¦ Python dependencies with security and compliance tools
SECURITY.md πŸ›‘οΈ Comprehensive security documentation and compliance guide
test_strategy.md πŸ§ͺ Detailed testing strategy for manual and automated pod management
LICENSE πŸ“œ MIT license for the project with LGPL compliance notes
.github/ βš™οΈ GitHub repository configuration and community health files
.gitignore 🚫 Prevents committing sensitive files and local artifacts
README.md πŸ“– This comprehensive documentation

🧰 Prerequisites & Installation

🧰 System Requirements

Operating System:

  • Linux (Ubuntu 20.04+, Debian, CentOS, etc.)
  • macOS (10.15+)
  • Windows with WSL2

Python Version:

  • Python 3.8 or higher

πŸ“¦ Installation

  1. Clone or download the repository:

    git clone <repository-url>
    cd runpod-llm-manager
  2. Install Python dependencies:

    # Core dependencies (required)
    pip install fastapi httpx uvicorn pydantic aiofiles
    
    # Security and compliance tools (recommended)
    pip install cyclonedx-bom safety pip-licenses

    Or install all at once:

    pip install -r requirements.txt

πŸ“œ License Compliance

This project uses a mix of permissive and copyleft licenses:

Permissive Licenses (MIT, BSD, Apache)

  • fastapi, httpx, uvicorn, pydantic, aiofiles, requests
  • No restrictions on use, modification, or distribution

Copyleft Licenses (LGPL)

  • chardet (LGPL v2.1+): Character encoding detection
  • frozendict (LGPL v3+): Immutable dictionary implementation

Distribution Compliance

Since this software may be distributed via GitHub:

  1. Source Code Availability: βœ… Complete source code is provided
  2. License Texts: βœ… All licenses are included in dependencies
  3. LGPL Compliance: βœ… Users can replace LGPL components if desired
  4. No Modifications: βœ… LGPL libraries are used unmodified

Note: As an individual developer distributing non-commercial software, you have additional fair use protections, but this documentation ensures compliance for all users.

  3. Set up environment variables:

    export RUNPOD_API_KEY="your-runpod-api-key-here"
  4. Create cache directory:

    mkdir -p /tmp/llm_cache

βœ… No NGINX required! The system uses a high-performance FastAPI proxy with built-in caching and monitoring.

⏱️ Cron Setup (Optional)

If you run inside WSL2, work through the notes below first so cron is available; the actual cron entries follow the WSL2 section.

🐧 WSL2 / Ubuntu Setup Notes

To ensure runpod-llm-manager functions correctly inside WSL2 with Ubuntu:

βš™οΈ Enable systemd (Optional but Recommended)

If you're using newer Ubuntu builds on WSL2, you can enable systemd for better service management:

  1. Edit your WSL config:

    sudo nano /etc/wsl.conf

    Add:

    [boot]
    systemd=true
    
  2. Restart WSL:

    wsl.exe --shutdown

Note: systemd support requires WSL version 0.67.6 or newer. Run wsl --version to check.

πŸ”§ Enable cron in WSL2

WSL2 does not start cron automatically. To enable it:

  1. Install cron if not already present:

    sudo apt update
    sudo apt install cron
  2. Allow passwordless startup for cron (optional but recommended for automation):

    sudo visudo

    Add this line at the bottom:

    your_username ALL=NOPASSWD: /usr/sbin/service cron start
    
  3. Start cron manually once:

    sudo service cron start
  4. To ensure cron starts automatically when WSL2 boots, use Windows Task Scheduler to run:

    wsl -d Ubuntu -- sudo service cron start

    on login or system boot.

To automate pod lifecycle management and prevent lingering charges, add the following cron entries:

πŸ”„ Watchdog / Expiry Check (Every 5 Minutes)

Runs manage_pod.py to start, restart, or terminate pods based on runtime limits:

*/5 * * * * /usr/bin/python3 /path/to/runpod-llm-manager/manage_pod.py >> /var/log/runpod_watchdog.log 2>&1

πŸ›‘ Forced Termination (Midnight Daily)

Ensures all pods are terminated at midnight regardless of state:

0 0 * * * /usr/bin/python3 /path/to/runpod-llm-manager/manage_pod.py --shutdown >> /var/log/runpod_shutdown.log 2>&1

Replace /path/to/runpod-llm-manager/ with the actual path to your script. Ensure your user has permission to run Python without sudo. These cron jobs work with the existing code and provide automated lifecycle management.
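
For context, the expiry decision the watchdog makes on each run boils down to comparing elapsed runtime against runtime_seconds. A minimal sketch of that idea in Python, assuming pod_state.json stores a start timestamp (the field names here are illustrative, not the exact schema manage_pod.py uses):

import json
import time

def pod_expired(state_path="pod_state.json", config_path="pod_config.json"):
    # "runtime_seconds" comes from pod_config.json (default 3600).
    with open(config_path) as f:
        runtime_limit = json.load(f).get("runtime_seconds", 3600)
    # "started_at" is an assumed field holding a Unix timestamp.
    with open(state_path) as f:
        started_at = json.load(f).get("started_at", 0)
    # Expired once elapsed time exceeds the configured limit.
    return (time.time() - started_at) > runtime_limit

if __name__ == "__main__":
    print("expired" if pod_expired() else "within runtime limit")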

βš™οΈ Configuration

Create pod_config.json with comprehensive settings:

πŸ“‹ Basic Configuration

{
  "modelStoreId": "deepseek-ai/deepseek-coder-33b-awq",
  "gpu_type_id": "NVIDIA RTX A6000",
  "runtime_seconds": 3600,
  "template_id": "vllm"
}

βš™οΈ Advanced Configuration

{
  "modelStoreId": "deepseek-ai/deepseek-coder-33b-awq",
  "gpu_type_id": "NVIDIA RTX A6000",
  "runtime_seconds": 3600,
  "template_id": "vllm",

  // Proxy Configuration
  "proxy_port": 8000,
  "cache_dir": "/tmp/llm_cache",

  // SSL/TLS Configuration
  "use_https": false,
  "ssl_cert": "/path/to/cert.pem",
  "ssl_key": "/path/to/key.pem",

  // Cache Configuration
  "max_cache_size": 1000,
  "max_cache_bytes": 1073741824,

  // Performance & Monitoring
  "enable_profiling": false
}

Note: strict JSON does not allow comments. The // annotations above are for illustration only; remove them from your actual pod_config.json.

πŸ” Configuration Parameters

Parameter Type Default Description
modelStoreId string required Model Store model identifier
gpu_type_id string required GPU type for pod deployment
runtime_seconds int 3600 Maximum pod runtime in seconds
template_id string "vllm" RunPod template identifier
proxy_port int 8000 Local proxy port
cache_dir string "/tmp/llm_cache" Cache directory path
use_https boolean false Enable SSL/TLS
ssl_cert string null SSL certificate file path
ssl_key string null SSL private key file path
max_cache_size int 1000 Maximum cached responses
max_cache_bytes int 1GB Maximum cache size in bytes
enable_profiling boolean false Enable debug/profiling endpoints
initial_wait_seconds int 10 Seconds to wait after pod creation before checking status
max_startup_attempts int 20 Maximum attempts to wait for pod to become ready
poll_interval_seconds int 5 Seconds between pod status checks during startup
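
A minimal sketch of loading pod_config.json with these defaults applied (the defaults mirror the table above; the loader itself is illustrative, not the exact code in manage_pod.py):

import json

# Defaults taken from the parameter table above.
DEFAULTS = {
    "runtime_seconds": 3600,
    "template_id": "vllm",
    "proxy_port": 8000,
    "cache_dir": "/tmp/llm_cache",
    "use_https": False,
    "ssl_cert": None,
    "ssl_key": None,
    "max_cache_size": 1000,
    "max_cache_bytes": 1073741824,  # 1 GB
    "enable_profiling": False,
    "initial_wait_seconds": 10,
    "max_startup_attempts": 20,
    "poll_interval_seconds": 5,
}

REQUIRED = ("modelStoreId", "gpu_type_id")

def load_config(path="pod_config.json"):
    with open(path) as f:
        config = {**DEFAULTS, **json.load(f)}
    missing = [key for key in REQUIRED if key not in config]
    if missing:
        raise ValueError(f"pod_config.json is missing required keys: {missing}")
    return config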

πŸ” Discovering Supported Models via RunPod UI

RunPod supports a wide range of open-source models for vLLM pods. To explore available options:

🧭 Using Quick Deploy

  1. Go to RunPod Console
  2. Click Deploy a Pod
  3. Select Serverless > vLLM Worker
  4. In the Model dropdown, browse the list of supported Hugging Face models

These models are pre-tested for compatibility with RunPod’s vLLM container and expose an OpenAI-style API endpoint.

πŸ“Œ Notes

  • Most models listed are public and do not require a Hugging Face token.
  • If you select a gated model (e.g. meta-llama/Llama-3-8B-Instruct), you’ll need to provide an HF_TOKEN in your pod config.
  • You can also deploy any compatible Hugging Face model manually by specifying its name in your pod_config.json.

For examples of known working models, see the models list printed during --refresh-catalog in verbose mode.

πŸ§ͺ Usage

πŸš€ Basic Usage

# Start or restart pod with watchdog behavior
python3 manage_pod.py

# Force termination of active pod
python3 manage_pod.py --shutdown

# Dry run mode (no actual API calls)
python3 manage_pod.py --dry-run

# Verbose logging
python3 manage_pod.py --verbose

# Refresh catalog and validate configuration
python3 manage_pod.py --refresh-catalog

🌐 API Endpoints

Once running, your LLM is available at:

http://localhost:8000/v1/chat/completions
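
Because the endpoint is OpenAI-compatible, any OpenAI-style client can talk to it. A quick smoke test using httpx (already a core dependency); the model name below is an example and should match whatever you configured for your pod:

import httpx

# Send a chat completion request through the local FastAPI proxy (default port 8000).
response = httpx.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "deepseek-coder-33b-awq",  # example; use your pod's model
        "messages": [{"role": "user", "content": "Write a Python hello world."}],
    },
    timeout=60.0,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])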

πŸ“Š Monitoring Endpoints

  • Health Check: GET /health - includes rate limiting status
  • Metrics: GET /metrics - performance and cache statistics
  • Dashboard: GET /dashboard - comprehensive system overview with security info
  • Debug Cache (if profiling enabled): GET /debug/cache

πŸ“ˆ Monitoring & Metrics

# Check proxy health
curl http://localhost:8000/health

# Get performance metrics
curl http://localhost:8000/metrics

# View comprehensive dashboard
curl http://localhost:8000/dashboard

πŸ”§ Environment Variables

# Required
export RUNPOD_API_KEY="your-api-key"

# Optional (for advanced features)
export MAX_CACHE_SIZE="2000"          # Increase cache size
export CACHE_SIZE_BYTES="2147483648"  # 2GB cache
export ENABLE_PROFILING="true"        # Enable debug endpoints
export PREWARM_CACHE="true"           # Pre-populate cache with common patterns

# Security configuration
export RATE_LIMIT_REQUESTS="60"       # Requests per window
export RATE_LIMIT_WINDOW="60"         # Window in seconds
export USE_HTTPS="false"              # Enable HTTPS
export SSL_CERT="/path/to/cert.pem"   # SSL certificate path
export SSL_KEY="/path/to/key.pem"     # SSL private key path
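
For reference, variables like these are typically read with os.getenv plus the documented defaults as fallbacks; a sketch (the parsing shown is illustrative, not the exact code in proxy_fastapi.py):

import os

# Required: fail fast if the RunPod API key is missing.
RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]

# Optional knobs, falling back to the documented defaults.
MAX_CACHE_SIZE = int(os.getenv("MAX_CACHE_SIZE", "1000"))
CACHE_SIZE_BYTES = int(os.getenv("CACHE_SIZE_BYTES", "1073741824"))
ENABLE_PROFILING = os.getenv("ENABLE_PROFILING", "false").lower() == "true"
RATE_LIMIT_REQUESTS = int(os.getenv("RATE_LIMIT_REQUESTS", "60"))
RATE_LIMIT_WINDOW = int(os.getenv("RATE_LIMIT_WINDOW", "60"))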

πŸ” Security & Compliance

EU Regulatory Compliance

This system implements security measures aligned with EU regulations including the Cyber Resilience Act (CRA) and GDPR. As non-commercial software developed by an individual, you're likely exempt from most CRA requirements, but these features ensure future compliance readiness.

πŸ›‘οΈ Security Features

  • Rate Limiting: 60 requests/minute per IP with RFC-compliant headers
  • Input Validation: Pydantic-based validation with content sanitization
  • Security Headers: XSS, CSRF, and content-type protection
  • HTTPS Enforcement: HSTS when SSL is enabled
  • CORS Protection: Restricted cross-origin access
  • Security Monitoring: Structured logging of security events
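
To illustrate the per-IP rate limiting listed above (60 requests per 60-second window), here is a minimal sliding-window limiter; the actual mechanism inside proxy_fastapi.py may differ:

import time
from collections import defaultdict, deque

class SlidingWindowRateLimiter:
    """Track recent request timestamps per client IP and reject bursts over the limit."""

    def __init__(self, max_requests=60, window_seconds=60):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)  # client IP -> recent request timestamps

    def allow(self, client_ip: str) -> bool:
        now = time.monotonic()
        window = self.history[client_ip]
        # Drop timestamps that have aged out of the window.
        while window and now - window[0] > self.window_seconds:
            window.popleft()
        if len(window) >= self.max_requests:
            return False  # caller should respond with HTTP 429
        window.append(now)
        return True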

πŸ” Security Tools

# Generate comprehensive security report
python security_utils.py report

# Scan for vulnerabilities
python security_utils.py scan

# Check license compliance
python security_utils.py licenses

# Generate SBOM
python security_utils.py sbom

πŸ“Š Security Monitoring

  • Health Endpoint: /health - includes rate limit status
  • Security Dashboard: /dashboard - comprehensive system security info
  • Structured Logging: JSON-formatted security event logs

✨ Advanced Features

πŸ’Ύ Intelligent Caching System

  • LRU Eviction: Automatically removes least recently used cache entries
  • Size Management: Configurable cache limits (entries and bytes)
  • SHA256 Hashing: Fast, collision-resistant cache keys
  • Thread-Safe: Concurrent access protection
  • Performance: Sub-millisecond cache lookups
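
Conceptually, the cache combines an ordered-dict LRU with SHA256 request keys and entry/byte limits. A simplified sketch of that design, not the exact proxy_fastapi.py implementation:

import hashlib
import json
from collections import OrderedDict
from threading import Lock

class LRUResponseCache:
    def __init__(self, max_entries=1000, max_bytes=1073741824):
        self.max_entries = max_entries
        self.max_bytes = max_bytes
        self.total_bytes = 0
        self.entries = OrderedDict()  # cache key -> cached response body (bytes)
        self.lock = Lock()            # protect concurrent access

    @staticmethod
    def key_for(request_body: dict) -> str:
        # Canonicalise the request so identical prompts hash identically.
        canonical = json.dumps(request_body, sort_keys=True).encode()
        return hashlib.sha256(canonical).hexdigest()

    def get(self, key: str):
        with self.lock:
            if key not in self.entries:
                return None
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]

    def put(self, key: str, value: bytes):
        with self.lock:
            if key in self.entries:
                self.total_bytes -= len(self.entries.pop(key))
            self.entries[key] = value
            self.total_bytes += len(value)
            # Evict least recently used entries until both limits are respected.
            while len(self.entries) > self.max_entries or self.total_bytes > self.max_bytes:
                _, evicted = self.entries.popitem(last=False)
                self.total_bytes -= len(evicted)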

πŸ“Š Real-Time Monitoring

  • Health Checks: Comprehensive system health monitoring
  • Performance Metrics: Response times, cache hit rates, error rates
  • System Dashboard: Complete system overview with configuration
  • Structured Logging: JSON-formatted logs for easy parsing
  • Debug Endpoints: Cache inspection and profiling tools

πŸ” Security & SSL/TLS

  • File Permissions: Restricted access to sensitive files (0o600)
  • SSL/TLS Support: HTTPS with configurable certificates
  • Environment Validation: Early validation of API keys and configuration
  • Process Isolation: Secure subprocess management

⚑ Performance Optimizations

  • Async Processing: Non-blocking I/O with FastAPI and httpx
  • Connection Pooling: Efficient HTTP client reuse
  • Graceful Shutdown: 10-second timeout for clean process termination
  • Memory Management: Controlled cache growth with eviction policies
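
The connection-pooling pattern here is the usual httpx one: create a single AsyncClient at startup and reuse it for every upstream request. A sketch of that pattern (the pod URL is a hypothetical placeholder):

import httpx
from fastapi import FastAPI

app = FastAPI()
POD_BASE_URL = "https://your-pod-endpoint.example"  # hypothetical upstream URL

@app.on_event("startup")
async def create_client():
    # One pooled client reused for all proxied requests.
    app.state.client = httpx.AsyncClient(base_url=POD_BASE_URL, timeout=60.0)

@app.on_event("shutdown")
async def close_client():
    await app.state.client.aclose()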

πŸ€– Automation & Reliability

  • Lock Files: Prevents concurrent execution conflicts
  • State Persistence: Survives system restarts
  • Error Recovery: Automatic pod restart on failures
  • Health Monitoring: Continuous proxy health validation

πŸ” Security Notes

  • βœ… No privileged ports required
  • βœ… No sudo needed for any operations
  • βœ… File permissions automatically restricted
  • βœ… SSL/TLS support for encrypted communication
  • βœ… Environment variables for sensitive data
  • βœ… Process isolation and secure cleanup

🧩 VSCode Extension Configuration

🧩 Continue Extension

To connect the Continue extension to your locally hosted RunPod LLM endpoint, create or update the configuration file at:

~/.continue/config.json

with the following content:

{
  "models": [
    {
      "title": "RunPod DeepSeek",
      "provider": "openai",
      "model": "deepseek-coder-33b-awq",
      "apiBase": "http://localhost:8080/v1"
    }
  ]
}
  • title: Friendly display name for your model in Continue.
  • provider: Must be "openai" since the RunPod endpoint is OpenAI-compatible.
  • model: The exact model identifier you configured for your pod.
  • apiBase: The local URL exposed by your FastAPI proxy (localhost and port should match your config, default: 8000).

This setup tells Continue to send requests to your RunPod pod’s OpenAI-compatible API endpoint running locally. Remember to restart the Continue extension after saving the config for changes to take effect.

🧩 VSCode CodeGPT Extension Configuration Example

To connect CodeGPT to your locally hosted RunPod LLM endpoint, open your VSCode settings.json file:

File β†’ Preferences β†’ Settings β†’ Open Settings (JSON)

Add the following configuration:

{
  "codegpt.model": "openai",
  "codegpt.apiKey": "sk-placeholder",
  "codegpt.apiBaseUrl": "http://localhost:8080/v1"
}
  • model: Set to "openai" to use OpenAI-compatible formatting.
  • apiKey: Required by CodeGPT even for local endpointsβ€”use any placeholder string.
  • apiBaseUrl: Must match your FastAPI proxy URL and port (default: http://localhost:8000/v1).

⚠️ CodeGPT requires a dummy API key even for local endpoints. You can use "sk-local" or "sk-placeholder".


🧩 VSCode Prinova Cody Extension Configuration Example

Prinova Cody (Sourcegraph Cody) connects to LLMs via a Sourcegraph instance. To use a custom LLM like your RunPod pod, you'll need:

  1. A Sourcegraph Enterprise instance
  2. Admin access to configure external LLM endpoints
  3. A generated access token

Once you have those:

  • Open Cody in VSCode
  • Click Sign In to Your Enterprise Instance
  • Enter your Sourcegraph URL
  • Paste your access token
  • Select your custom model from the dropdown (if configured)

⚠️ Cody does not support direct local endpoint configuration in VSCode. You must register your RunPod endpoint with a Sourcegraph instance first.

For full setup instructions, see Sourcegraph's Cody installation guide.


🧩 VSCode Kilo Code Extension Configuration

To connect the Kilo Code extension to your locally hosted RunPod LLM proxy:

  1. Install the Kilo Code extension in VSCode
  2. Go to VSCode Settings β†’ Extensions β†’ Kilo Code
  3. Configure the following basic settings:
{
  "kilo-code.api.baseUrl": "http://localhost:8000/v1",
  "kilo-code.api.key": "sk-local-proxy",
  "kilo-code.model.name": "deepseek-coder-33b-awq",
  "kilo-code.cache.enabled": true,
  "kilo-code.cache.directory": "/tmp/llm_cache"
}

Testing the Configuration

  1. Start the proxy:

    python3 manage_pod.py
  2. Verify proxy health:

    curl http://localhost:8000/health
  3. Test in VSCode:

    • Open a Python file in VSCode
    • Use Kilo Code autocomplete or chat features
    • Verify requests are routed through your local proxy

Troubleshooting

  • Connection issues: Ensure proxy is running on port 8000
  • Authentication errors: Verify the API key matches "sk-local-proxy"
  • Slow responses: Check cache is working with curl http://localhost:8000/metrics

πŸ”§ Troubleshooting

Common Issues & Solutions

❌ Import "uvicorn" could not be resolved

Solution: Install missing dependencies:

pip install uvicorn

❌ Permission denied when creating cache files

Solution: Ensure cache directory permissions:

mkdir -p /tmp/llm_cache
chmod 755 /tmp/llm_cache

❌ Port already in use

Solution: Change proxy port in pod_config.json:

{
  "proxy_port": 8001
}

❌ SSL certificate errors

Solution: Verify certificate files exist and have correct permissions:

ls -la /path/to/cert.pem /path/to/key.pem
chmod 600 /path/to/key.pem

❌ Pod fails health check

Solution: Check pod status and logs:

python3 manage_pod.py --verbose

❌ Cache not working

Solution: Check cache directory and permissions:

ls -la /tmp/llm_cache/

πŸ“Š Monitoring Commands

# Check if proxy is running
ps aux | grep proxy_fastapi

# Check proxy health
curl http://localhost:8000/health

# View cache statistics
curl http://localhost:8000/metrics

# View comprehensive dashboard
curl http://localhost:8000/dashboard

πŸ” Debug Mode

Enable debug endpoints for troubleshooting:

{
  "enable_profiling": true
}

Then access debug information:

curl http://localhost:8000/debug/cache

🧹 Cleanup

Manual Cleanup

# Stop the proxy and terminate pod
python3 manage_pod.py --shutdown

# Clean up cache files (optional)
rm -rf /tmp/llm_cache/*

# Remove state files
rm -f pod_state.json
rm -f /tmp/fastapi_proxy.pid

Automated Cleanup

The system automatically:

  • βœ… Terminates pods on expiry
  • βœ… Cleans up PID files on shutdown
  • βœ… Removes stale lock files
  • βœ… Evicts old cache entries

Cache Management

# View cache size
du -sh /tmp/llm_cache/

# Clear all cache
rm -rf /tmp/llm_cache/*
mkdir -p /tmp/llm_cache

⚑ Performance Optimization

Cache Performance Tips

  • Use fast storage: Place cache directory on SSD/NVMe for better performance
  • Pre-warm cache: Enable PREWARM_CACHE=true to populate cache with common patterns on startup
  • Monitor cache hit rates: Use /metrics endpoint to track cache effectiveness
  • Adjust cache size: Increase MAX_CACHE_SIZE for better hit rates on large codebases
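
If you prefer to track hit rates programmatically rather than reading /metrics by eye, a small script like this works; the cache_hits and cache_misses field names are assumptions, so adjust them to match your actual /metrics payload:

import httpx

metrics = httpx.get("http://localhost:8000/metrics").json()

# Field names are assumed; inspect your /metrics response and adjust as needed.
hits = metrics.get("cache_hits", 0)
misses = metrics.get("cache_misses", 0)
total = hits + misses
print(f"cache hit rate: {hits / total:.1%}" if total else "no cached traffic yet")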

SSL/TLS Configuration

{
  "use_https": true,
  "ssl_cert": "/path/to/cert.pem",
  "ssl_key": "/path/to/key.pem"
}
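
Under the hood, uvicorn accepts certificate paths directly, so launching the proxy with these settings amounts to something like the sketch below (the "proxy_fastapi:app" target is illustrative):

import uvicorn

uvicorn.run(
    "proxy_fastapi:app",               # illustrative module:app target
    host="0.0.0.0",
    port=8000,
    ssl_certfile="/path/to/cert.pem",  # matches "ssl_cert" in pod_config.json
    ssl_keyfile="/path/to/key.pem",    # matches "ssl_key" in pod_config.json
)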

🀝 Contributing

We welcome contributions! Please see our Contributing Guide for detailed information on how to contribute to this project.

Quick Start for Contributors

  1. Read the Code of Conduct
  2. Check the Contributing Guide
  3. Report issues using our issue templates
  4. Submit PRs using our pull request template

Development Guidelines

This system is designed to be:

  • Modular: Easy to extend with new features
  • Configurable: All major settings are configurable
  • Observable: Comprehensive logging and metrics
  • Secure: Follows security best practices

Adding New Features

  1. Add configuration options to pod_config.json
  2. Implement functionality in appropriate module
  3. Add metrics and logging
  4. Update documentation
  5. Test thoroughly

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

License Compliance Requirements

Please ensure compliance with:

  • MIT License: For the original project code
  • LGPL Compliance: For chardet and frozendict dependencies (see License Compliance section above)
  • RunPod Terms of Service: When using RunPod infrastructure
  • Hugging Face model licenses: For any models deployed
  • Local data privacy regulations: GDPR and other applicable laws

πŸ†˜ Support

Getting Help

  1. Check the troubleshooting section above
  2. Check proxy health: curl http://localhost:8000/health
  3. Enable verbose mode: python3 manage_pod.py --verbose
  4. Check pod status: python3 manage_pod.py --refresh-catalog

Common Resources


Last updated: 2025-09-22
