This advanced system automates the full lifecycle of GPU-backed LLM pods on RunPod, featuring enterprise-grade caching, monitoring, security, and a high-performance FastAPI reverse proxy for development tools like Continue, CodeGPT, and Prinova Cody.
- Launch ephemeral LLM pods with OpenAI-compatible endpoints
- Default to SECURE mode for privacy and isolation
- High-performance FastAPI reverse proxy with intelligent caching
- Real-time metrics and health monitoring
- SSL/TLS support for secure communication
- LRU cache with configurable size limits
- Structured JSON logging and performance profiling
- Comprehensive health checks and dashboard
- Track pod state for restarts, shutdowns, and cost control
- Enforce runtime limits and prevent lingering charges
- Support cron-based watchdog execution
| File | Description |
|---|---|
| `manage_pod.py` | Unified lifecycle controller: start, restart, terminate, watchdog |
| `proxy_fastapi.py` | FastAPI proxy with caching, metrics, SSL, and security monitoring |
| `security_utils.py` | Security utilities: SBOM generation, vulnerability scanning, compliance |
| `pod_config.json` | Configuration file with model, GPU, cache, SSL, and runtime settings |
| `pod_state.json` | Auto-generated state file storing pod ID, model, and runtime info |
| `requirements.txt` | Python dependencies with security and compliance tools |
| `SECURITY.md` | Comprehensive security documentation and compliance guide |
| `test_strategy.md` | Detailed testing strategy for manual and automated pod management |
| `LICENSE` | MIT license for the project with LGPL compliance notes |
| `.github/` | GitHub repository configuration and community health files |
| `.gitignore` | Prevents committing sensitive files and local artifacts |
| `README.md` | This documentation |
Operating System:
- Linux (Ubuntu 20.04+, Debian, CentOS, etc.)
- macOS (10.15+)
- Windows with WSL2
Python Version:
- Python 3.8 or higher
- Clone or download the repository:
git clone <repository-url>
cd runpod-llm-manager
- Install Python dependencies:
# Core dependencies (required)
pip install fastapi httpx uvicorn pydantic aiofiles
# Security and compliance tools (recommended)
pip install cyclonedx-bom safety pip-licenses
Or install all at once:
pip install -r requirements.txt
This project uses a mix of permissive and copyleft licenses:
- Permissive (no restrictions on use, modification, or distribution): fastapi, httpx, uvicorn, pydantic, aiofiles, requests
- chardet (LGPL v2.1+): character encoding detection
- frozendict (LGPL v3+): immutable dictionary implementation
Since this software may be distributed via GitHub:
- Source Code Availability: complete source code is provided
- License Texts: all licenses are included in the dependencies
- LGPL Compliance: users can replace LGPL components if desired
- No Modifications: LGPL libraries are used unmodified
Note: As an individual developer distributing non-commercial software, you have additional fair use protections, but this documentation ensures compliance for all users.
- Set up environment variables:
export RUNPOD_API_KEY="your-runpod-api-key-here"
- Create cache directory:
mkdir -p /tmp/llm_cache
No NGINX required! The system uses a high-performance FastAPI proxy with built-in caching and monitoring.
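For orientation, the core request flow can be sketched in a few lines of FastAPI and httpx. This is an illustrative simplification, not the actual `proxy_fastapi.py`: the upstream `POD_URL` and the plain in-memory dict are placeholder assumptions standing in for the real pod endpoint and the size-bounded LRU cache.

```python
# Minimal cache-then-forward sketch (illustrative only; the shipped proxy adds
# LRU eviction, metrics, rate limiting, security headers, and optional SSL).
import hashlib

import httpx
from fastapi import FastAPI, Request, Response

app = FastAPI()
POD_URL = "http://POD_IP:8000"  # placeholder; the real proxy resolves the pod address at startup
_cache = {}                     # stand-in for the size-bounded LRU cache

@app.post("/v1/chat/completions")
async def chat_completions(request: Request) -> Response:
    body = await request.body()
    key = hashlib.sha256(body).hexdigest()        # deterministic cache key for this request
    if key in _cache:                             # cache hit: skip the GPU round trip
        return Response(content=_cache[key], media_type="application/json")
    async with httpx.AsyncClient() as client:     # cache miss: forward to the pod
        upstream = await client.post(
            f"{POD_URL}/v1/chat/completions",
            content=body,
            headers={"Content-Type": "application/json"},
            timeout=60.0,
        )
    _cache[key] = upstream.content                # remember the response for identical requests
    return Response(content=upstream.content, media_type="application/json")
```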
To ensure runpod-llm-manager functions correctly inside WSL2 with Ubuntu:
If you're using newer Ubuntu builds on WSL2, you can enable systemd for better service management:
- Edit your WSL config:
sudo nano /etc/wsl.conf
Add:
[boot]
systemd=true
- Restart WSL:
wsl.exe --shutdown
Note: `systemd` support requires WSL version 0.67.6 or newer. Run `wsl --version` to check.
WSL2 does not start cron automatically. To enable it:
- Install cron if not already present:
sudo apt update
sudo apt install cron
- Allow passwordless startup for cron (optional but recommended for automation):
sudo visudo
Add this line at the bottom:
your_username ALL=NOPASSWD: /usr/sbin/service cron start
- Start cron manually once:
sudo service cron start
- To ensure cron starts automatically when WSL2 boots, use Windows Task Scheduler to run:
wsl -d Ubuntu -- sudo service cron start
on login or system boot.
To automate pod lifecycle management and prevent lingering charges, add the following cron entries:
Runs manage_pod.py to start, restart, or terminate pods based on runtime limits:
*/5 * * * * /usr/bin/python3 /path/to/runpod-llm-manager/manage_pod.py >> /var/log/runpod_watchdog.log 2>&1

Ensures all pods are terminated at midnight regardless of state:

0 0 * * * /usr/bin/python3 /path/to/runpod-llm-manager/manage_pod.py --shutdown >> /var/log/runpod_shutdown.log 2>&1

Replace `/path/to/runpod-llm-manager/` with the actual path to your script. Ensure your user has permission to run Python without `sudo`. These cron jobs work with the existing code and provide automated lifecycle management.
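The watchdog run reduces to comparing the pod's elapsed runtime against `runtime_seconds`. A minimal sketch of that check is shown below; the `pod_state.json` field name (`started_at`) is an assumption for illustration, not the file's actual schema.

```python
# Simplified runtime-limit check in the spirit of the cron watchdog run.
# The state-file field name "started_at" is an illustrative assumption.
import json
import time
from pathlib import Path

STATE_FILE = Path("pod_state.json")
CONFIG_FILE = Path("pod_config.json")

def pod_expired() -> bool:
    """Return True if the tracked pod has exceeded its configured runtime."""
    if not STATE_FILE.exists():
        return False                                  # nothing to watch yet
    state = json.loads(STATE_FILE.read_text())
    config = json.loads(CONFIG_FILE.read_text())
    elapsed = time.time() - state["started_at"]       # seconds since pod creation
    return elapsed > config.get("runtime_seconds", 3600)

if __name__ == "__main__":
    if pod_expired():
        print("Runtime limit exceeded; the watchdog would terminate the pod here.")
```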
Create pod_config.json with comprehensive settings:
{
"modelStoreId": "deepseek-ai/deepseek-coder-33b-awq",
"gpu_type_id": "NVIDIA RTX A6000",
"runtime_seconds": 3600,
"template_id": "vllm"
}

An extended configuration with proxy, SSL, cache, and monitoring options:

{
"modelStoreId": "deepseek-ai/deepseek-coder-33b-awq",
"gpu_type_id": "NVIDIA RTX A6000",
"runtime_seconds": 3600,
"template_id": "vllm",
// Proxy Configuration
"proxy_port": 8000,
"cache_dir": "/tmp/llm_cache",
// SSL/TLS Configuration
"use_https": false,
"ssl_cert": "/path/to/cert.pem",
"ssl_key": "/path/to/key.pem",
// Cache Configuration
"max_cache_size": 1000,
"max_cache_bytes": 1073741824,
// Performance & Monitoring
"enable_profiling": false
}

| Parameter | Type | Default | Description |
|---|---|---|---|
| `modelStoreId` | string | required | Model Store model identifier |
| `gpu_type_id` | string | required | GPU type for pod deployment |
| `runtime_seconds` | int | 3600 | Maximum pod runtime in seconds |
| `template_id` | string | "vllm" | RunPod template identifier |
| `proxy_port` | int | 8000 | Local proxy port |
| `cache_dir` | string | "/tmp/llm_cache" | Cache directory path |
| `use_https` | boolean | false | Enable SSL/TLS |
| `ssl_cert` | string | null | SSL certificate file path |
| `ssl_key` | string | null | SSL private key file path |
| `max_cache_size` | int | 1000 | Maximum number of cached responses |
| `max_cache_bytes` | int | 1 GB (1073741824) | Maximum cache size in bytes |
| `enable_profiling` | boolean | false | Enable debug/profiling endpoints |
| `initial_wait_seconds` | int | 10 | Seconds to wait after pod creation before checking status |
| `max_startup_attempts` | int | 20 | Maximum attempts to wait for the pod to become ready |
| `poll_interval_seconds` | int | 5 | Seconds between pod status checks during startup |
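As a quick illustration of how these defaults can be applied, the sketch below loads `pod_config.json` (assuming a plain JSON file without `//` comments), merges in the table's default values, and checks the two required keys. It is a simplified stand-in for the validation `manage_pod.py` performs itself.

```python
# Sketch: load pod_config.json and apply the defaults from the table above.
import json
from pathlib import Path

DEFAULTS = {
    "runtime_seconds": 3600,
    "template_id": "vllm",
    "proxy_port": 8000,
    "cache_dir": "/tmp/llm_cache",
    "use_https": False,
    "max_cache_size": 1000,
    "max_cache_bytes": 1073741824,   # 1 GB
    "enable_profiling": False,
    "initial_wait_seconds": 10,
    "max_startup_attempts": 20,
    "poll_interval_seconds": 5,
}

def load_config(path="pod_config.json"):
    config = {**DEFAULTS, **json.loads(Path(path).read_text())}
    for required in ("modelStoreId", "gpu_type_id"):   # no defaults for these
        if required not in config:
            raise ValueError(f"pod_config.json is missing required key: {required}")
    return config

print(load_config())
```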
RunPod supports a wide range of open-source models for vLLM pods. To explore available options:
- Go to RunPod Console
- Click Deploy a Pod
- Select Serverless > vLLM Worker
- In the Model dropdown, browse the list of supported Hugging Face models
These models are pre-tested for compatibility with RunPod's vLLM container and expose an OpenAI-style API endpoint.
- Most models listed are public and do not require a Hugging Face token.
- If you select a gated model (e.g. `meta-llama/Llama-3-8B-Instruct`), you'll need to provide an `HF_TOKEN` in your pod config.
- You can also deploy any compatible Hugging Face model manually by specifying its name in your `pod_config.json`.
For examples of known working models, see the `models` list printed during `--refresh-catalog` in verbose mode.
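For gated models, one hypothetical way to wire the token through is to copy it from the environment into the pod config before launching; the sketch below assumes the `HF_TOKEN` key mentioned above and is not an official RunPod mechanism.

```python
# Hypothetical helper: copy a Hugging Face token from the environment into
# pod_config.json so gated models can be pulled. The "HF_TOKEN" key follows
# the note above; how your template consumes it may differ.
import json
import os
from pathlib import Path

config_path = Path("pod_config.json")
config = json.loads(config_path.read_text())

token = os.environ.get("HF_TOKEN")
if token:
    config["HF_TOKEN"] = token
    config_path.write_text(json.dumps(config, indent=2))
    print("Added HF_TOKEN to pod_config.json for gated model access.")
else:
    print("No HF_TOKEN set; gated models (e.g. meta-llama/*) may fail to download.")
```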
Common commands:

# Start or restart pod with watchdog behavior
python3 manage_pod.py
# Force termination of active pod
python3 manage_pod.py --shutdown
# Dry run mode (no actual API calls)
python3 manage_pod.py --dry-run
# Verbose logging
python3 manage_pod.py --verbose
# Refresh catalog and validate configuration
python3 manage_pod.py --refresh-catalog

Once running, your LLM is available at:
http://localhost:8000/v1/chat/completions
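A quick way to exercise the endpoint from Python is a direct `httpx` call using the standard OpenAI chat payload; the model name below is an example and should match whatever you configured for your pod.

```python
# Send a test chat completion through the local proxy.
import httpx

response = httpx.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "deepseek-coder-33b-awq",   # example; match the model configured for your pod
        "messages": [{"role": "user", "content": "Write a Python hello world."}],
        "max_tokens": 128,
    },
    timeout=60.0,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```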
Monitoring endpoints:
- Health Check: `GET /health` - includes rate limiting status
- Metrics: `GET /metrics` - performance and cache statistics
- Dashboard: `GET /dashboard` - comprehensive system overview with security info
- Debug Cache (if profiling enabled): `GET /debug/cache`
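The same endpoints are easy to poll from Python, for example as a readiness check before pointing an editor at the proxy. The sketch below only assumes the endpoints listed above; it does not assume a particular response schema.

```python
# Wait for the proxy's /health endpoint to respond, then print the raw metrics.
import time

import httpx

BASE = "http://localhost:8000"

for attempt in range(10):
    try:
        health = httpx.get(f"{BASE}/health", timeout=5.0)
        if health.status_code == 200:
            print("Proxy healthy:", health.text)
            break
    except httpx.HTTPError:
        pass                  # proxy not up yet; keep retrying
    time.sleep(3)
else:
    raise SystemExit("Proxy did not become healthy in time")

print("Metrics:", httpx.get(f"{BASE}/metrics", timeout=5.0).text)
```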
# Check proxy health
curl http://localhost:8000/health
# Get performance metrics
curl http://localhost:8000/metrics
# View comprehensive dashboard
curl http://localhost:8000/dashboard

Environment variables:

# Required
export RUNPOD_API_KEY="your-api-key"
# Optional (for advanced features)
export MAX_CACHE_SIZE="2000" # Increase cache size
export CACHE_SIZE_BYTES="2147483648" # 2GB cache
export ENABLE_PROFILING="true" # Enable debug endpoints
export PREWARM_CACHE="true" # Pre-populate cache with common patterns
# Security configuration
export RATE_LIMIT_REQUESTS="60" # Requests per window
export RATE_LIMIT_WINDOW="60" # Window in seconds
export USE_HTTPS="false" # Enable HTTPS
export SSL_CERT="/path/to/cert.pem" # SSL certificate path
export SSL_KEY="/path/to/key.pem"    # SSL private key path

This system implements security measures aligned with EU regulations, including the Cyber Resilience Act (CRA) and GDPR. As non-commercial software developed by an individual, you're likely exempt from most CRA requirements, but these features ensure future compliance readiness.
- Rate Limiting: 60 requests/minute per IP with RFC-compliant headers
- Input Validation: Pydantic-based validation with content sanitization
- Security Headers: XSS, CSRF, and content-type protection
- HTTPS Enforcement: HSTS when SSL is enabled
- CORS Protection: Restricted cross-origin access
- Security Monitoring: Structured logging of security events
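Conceptually, the rate limiting amounts to a sliding-window counter per client IP, as in the sketch below. This is an illustration of the idea, not the middleware `proxy_fastapi.py` actually ships, which also emits RFC-compliant rate-limit headers and security logs.

```python
# Illustrative sliding-window rate limiter: 60 requests per 60-second window per IP.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 60
_recent = defaultdict(deque)        # client IP -> timestamps of recent requests

def allow_request(client_ip: str) -> bool:
    now = time.monotonic()
    window = _recent[client_ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()            # drop timestamps that fell out of the window
    if len(window) >= MAX_REQUESTS:
        return False                # over the limit: the caller should answer HTTP 429
    window.append(now)
    return True
```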
# Generate comprehensive security report
python security_utils.py report
# Scan for vulnerabilities
python security_utils.py scan
# Check license compliance
python security_utils.py licenses
# Generate SBOM
python security_utils.py sbom

- Health Endpoint: `/health` - includes rate limit status
- Security Dashboard: `/dashboard` - comprehensive system security info
- Structured Logging: JSON-formatted security event logs
- LRU Eviction: Automatically removes least recently used cache entries
- Size Management: Configurable cache limits (entries and bytes)
- SHA256 Hashing: Fast, collision-resistant cache keys
- Thread-Safe: Concurrent access protection
- Performance: Sub-millisecond cache lookups
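The behaviour described above can be pictured with the small sketch below: SHA256 of the request body as the key, an ordered structure for LRU eviction, and a lock for thread safety. It is a conceptual simplification of the real cache, which additionally enforces a byte-size limit and persists entries under `cache_dir`.

```python
# Conceptual LRU response cache: SHA256 keys, entry-count bound, thread-safe access.
import hashlib
import threading
from collections import OrderedDict

class ResponseCache:
    def __init__(self, max_entries=1000):
        self._entries = OrderedDict()
        self._lock = threading.Lock()
        self._max_entries = max_entries

    @staticmethod
    def key_for(request_body: bytes) -> str:
        return hashlib.sha256(request_body).hexdigest()

    def get(self, key):
        with self._lock:
            if key not in self._entries:
                return None
            self._entries.move_to_end(key)            # mark as most recently used
            return self._entries[key]

    def put(self, key, response_body):
        with self._lock:
            self._entries[key] = response_body
            self._entries.move_to_end(key)
            while len(self._entries) > self._max_entries:
                self._entries.popitem(last=False)     # evict the least recently used entry
```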
- Health Checks: Comprehensive system health monitoring
- Performance Metrics: Response times, cache hit rates, error rates
- System Dashboard: Complete system overview with configuration
- Structured Logging: JSON-formatted logs for easy parsing
- Debug Endpoints: Cache inspection and profiling tools
- File Permissions: Restricted access to sensitive files (0o600)
- SSL/TLS Support: HTTPS with configurable certificates
- Environment Validation: Early validation of API keys and configuration
- Process Isolation: Secure subprocess management
- Async Processing: Non-blocking I/O with FastAPI and httpx
- Connection Pooling: Efficient HTTP client reuse
- Graceful Shutdown: 10-second timeout for clean process termination
- Memory Management: Controlled cache growth with eviction policies
- Lock Files: Prevents concurrent execution conflicts
- State Persistence: Survives system restarts
- Error Recovery: Automatic pod restart on failures
- Health Monitoring: Continuous proxy health validation
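The lock-file mechanism can be implemented with an atomically created PID file, as sketched below. The path and behaviour here are illustrative assumptions rather than a description of what `manage_pod.py` actually does.

```python
# Illustrative lock-file pattern to keep two manager runs from overlapping.
import os
import sys

LOCK_PATH = "/tmp/runpod_manager.lock"   # hypothetical path

def acquire_lock() -> None:
    try:
        # O_CREAT | O_EXCL makes creation atomic: it fails if the file already exists.
        fd = os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        sys.exit("Another instance appears to be running (lock file present).")
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(str(os.getpid()))            # record the owning PID

def release_lock() -> None:
    try:
        os.remove(LOCK_PATH)
    except FileNotFoundError:
        pass                                         # already cleaned up
```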
- No privileged ports required
- No `sudo` needed for any operations
- File permissions automatically restricted
- SSL/TLS support for encrypted communication
- Environment variables for sensitive data
- Process isolation and secure cleanup
To connect the Continue extension to your locally hosted RunPod LLM endpoint, create or update the configuration file at:
~/.continue/config.json
with the following content:
{
"models": [
{
"title": "RunPod DeepSeek",
"provider": "openai",
"model": "deepseek-coder-33b-awq",
"apiBase": "http://localhost:8080/v1"
}
]
}

- `title`: Friendly display name for your model in Continue.
- `provider`: Must be `"openai"` since the RunPod endpoint is OpenAI-compatible.
- `model`: The exact model identifier you configured for your pod.
- `apiBase`: The local URL exposed by your FastAPI proxy (`localhost` and port should match your config, default: 8000).
This setup tells Continue to send requests to your RunPod pod's OpenAI-compatible API endpoint running locally. Remember to restart the Continue extension after saving the config for changes to take effect.
To connect CodeGPT to your locally hosted RunPod LLM endpoint, open your VSCode settings.json file:
File → Preferences → Settings → Open Settings (JSON)

Add the following configuration:
{
"codegpt.model": "openai",
"codegpt.apiKey": "sk-placeholder",
"codegpt.apiBaseUrl": "http://localhost:8080/v1"
}

- `model`: Set to `"openai"` to use OpenAI-compatible formatting.
- `apiKey`: Required by CodeGPT even for local endpoints; use any placeholder string.
- `apiBaseUrl`: Must match your FastAPI proxy URL and port (default: `http://localhost:8000/v1`).

Note: CodeGPT requires a dummy API key even for local endpoints. You can use `"sk-local"` or `"sk-placeholder"`.
Prinova Cody (Sourcegraph Cody) connects to LLMs via a Sourcegraph instance. To use a custom LLM like your RunPod pod, you'll need:
- A Sourcegraph Enterprise instance
- Admin access to configure external LLM endpoints
- A generated access token
Once you have those:
- Open Cody in VSCode
- Click Sign In to Your Enterprise Instance
- Enter your Sourcegraph URL
- Paste your access token
- Select your custom model from the dropdown (if configured)
Note: Cody does not support direct local endpoint configuration in VSCode. You must register your RunPod endpoint with a Sourcegraph instance first.
For full setup instructions, see Sourcegraph's Cody installation guide.
To connect the Kilo Code extension to your locally hosted RunPod LLM proxy:
- Install the Kilo Code extension in VSCode
- Go to VSCode Settings β Extensions β Kilo Code
- Configure the following basic settings:
{
"kilo-code.api.baseUrl": "http://localhost:8000/v1",
"kilo-code.api.key": "sk-local-proxy",
"kilo-code.model.name": "deepseek-coder-33b-awq",
"kilo-code.cache.enabled": true,
"kilo-code.cache.directory": "/tmp/llm_cache"
}

- Start the proxy:
python3 manage_pod.py
- Verify proxy health:
curl http://localhost:8000/health
- Test in VSCode:
- Open a Python file in VSCode
- Use Kilo Code autocomplete or chat features
- Verify requests are routed through your local proxy
- Connection issues: Ensure the proxy is running on port 8000
- Authentication errors: Verify the API key matches `"sk-local-proxy"`
- Slow responses: Check that the cache is working with `curl http://localhost:8000/metrics`
Solution: Install missing dependencies:
pip install uvicorn

Solution: Ensure cache directory permissions:
mkdir -p /tmp/llm_cache
chmod 755 /tmp/llm_cache

Solution: Change the proxy port in `pod_config.json`:
{
"proxy_port": 8001
}

Solution: Verify that the certificate files exist and have correct permissions:
ls -la /path/to/cert.pem /path/to/key.pem
chmod 600 /path/to/key.pem

Solution: Check pod status and logs:

python3 manage_pod.py --verbose

Solution: Check the cache directory and its permissions:

ls -la /tmp/llm_cache/

# Check if proxy is running
ps aux | grep proxy_fastapi
# Check proxy health
curl http://localhost:8000/health
# View cache statistics
curl http://localhost:8000/metrics
# View comprehensive dashboard
curl http://localhost:8000/dashboard

Enable debug endpoints for troubleshooting:
{
"enable_profiling": true
}

Then access debug information:

curl http://localhost:8000/debug/cache

# Stop the proxy and terminate pod
python3 manage_pod.py --shutdown
# Clean up cache files (optional)
rm -rf /tmp/llm_cache/*
# Remove state files
rm -f pod_state.json
rm -f /tmp/fastapi_proxy.pid

The system automatically:
- Terminates pods on expiry
- Cleans up PID files on shutdown
- Removes stale lock files
- Evicts old cache entries
# View cache size
du -sh /tmp/llm_cache/
# Clear all cache
rm -rf /tmp/llm_cache/*
mkdir -p /tmp/llm_cache

- Use fast storage: Place the cache directory on SSD/NVMe for better performance
- Pre-warm cache: Enable `PREWARM_CACHE=true` to populate the cache with common patterns on startup
- Monitor cache hit rates: Use the `/metrics` endpoint to track cache effectiveness (see the sketch after this list)
- Adjust cache size: Increase `MAX_CACHE_SIZE` for better hit rates on large codebases
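For the hit-rate tip, a small helper like the one below can summarise `/metrics` output. It assumes the endpoint returns JSON and that hit/miss counters exist under keys like `cache_hits` and `cache_misses`; adjust the key names to whatever your `/metrics` response actually contains.

```python
# Print an approximate cache hit rate from the proxy's /metrics endpoint.
# The JSON shape and the "cache_hits"/"cache_misses" keys are assumptions.
import httpx

metrics = httpx.get("http://localhost:8000/metrics", timeout=5.0).json()
hits = metrics.get("cache_hits", 0)
misses = metrics.get("cache_misses", 0)
total = hits + misses
if total:
    print(f"Cache hit rate: {hits / total:.1%} ({hits}/{total})")
else:
    print("No cached traffic recorded yet.")
```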
To enable HTTPS, configure the SSL paths in pod_config.json:

{
"use_https": true,
"ssl_cert": "/path/to/cert.pem",
"ssl_key": "/path/to/key.pem"
}

We welcome contributions! Please see our Contributing Guide for detailed information on how to contribute to this project.
- Read the Code of Conduct
- Check the Contributing Guide
- Report issues using our issue templates
- Submit PRs using our pull request template
This system is designed to be:
- Modular: Easy to extend with new features
- Configurable: All major settings are configurable
- Observable: Comprehensive logging and metrics
- Secure: Follows security best practices
- Add configuration options to `pod_config.json`
- Implement functionality in the appropriate module
- Add metrics and logging
- Update documentation
- Test thoroughly
This project is licensed under the MIT License - see the LICENSE file for details.
Please ensure compliance with:
- MIT License: For the original project code
- LGPL Compliance: For chardet and frozendict dependencies (see License Compliance section above)
- RunPod Terms of Service: When using RunPod infrastructure
- Hugging Face model licenses: For any models deployed
- Local data privacy regulations: GDPR and other applicable laws
- Check the troubleshooting section above
- Check proxy health: `curl http://localhost:8000/health`
- Enable verbose mode: `python3 manage_pod.py --verbose`
- Check pod status: `python3 manage_pod.py --refresh-catalog`
Last updated: 2025-09-22