This advanced system automates the full lifecycle of GPU-backed LLM pods on RunPod, featuring enterprise-grade caching, monitoring, security, and a high-performance FastAPI reverse proxy for development tools like Continue, CodeGPT, and Prinova Cody.
- Launch ephemeral LLM pods with OpenAI-compatible endpoints
- Default to SECURE mode for privacy and isolation
- High-performance FastAPI reverse proxy with intelligent caching
- Real-time metrics and health monitoring
- SSL/TLS support for secure communication
- LRU cache with configurable size limits
- Structured JSON logging and performance profiling
- Comprehensive health checks and dashboard
- Track pod state for restarts, shutdowns, and cost control
- Enforce runtime limits and prevent lingering charges
- Support cron-based watchdog execution
| File | Description |
|---|---|
| `manage_pod.py` | Unified lifecycle controller: start, restart, terminate, watchdog |
| `proxy_fastapi.py` | FastAPI proxy with caching, metrics, SSL, and security monitoring |
| `security_utils.py` | Security utilities: SBOM generation, vulnerability scanning, compliance |
| `pod_config.json` | Configuration file with model, GPU, cache, SSL, and runtime settings |
| `pod_state.json` | Auto-generated state file storing pod ID, model, and runtime info |
| `requirements.txt` | Python dependencies with security and compliance tools |
| `SECURITY.md` | Comprehensive security documentation and compliance guide |
| `test_strategy.md` | Detailed testing strategy for manual and automated pod management |
| `LICENSE` | MIT license for the project with LGPL compliance notes |
| `.github/` | GitHub repository configuration and community health files |
| `.gitignore` | Prevents committing sensitive files and local artifacts |
| `README.md` | This documentation |
Operating System:
- Linux (Ubuntu 20.04+, Debian, CentOS, etc.)
- macOS (10.15+)
- Windows with WSL2
Python Version:
- Python 3.8 or higher
- Clone or download the repository:
git clone <repository-url>
cd runpod-llm-manager
- Install Python dependencies:
# Core dependencies (required)
pip install fastapi httpx uvicorn pydantic aiofiles
# Security and compliance tools (recommended)
pip install cyclonedx-bom safety pip-licenses
Or install all at once:
pip install -r requirements.txt
This project uses a mix of permissive and copyleft licenses:
- Permissive (no restrictions on use, modification, or distribution): fastapi, httpx, uvicorn, pydantic, aiofiles, requests
- chardet (LGPL v2.1+): character encoding detection
- frozendict (LGPL v3+): immutable dictionary implementation
Since this software may be distributed via GitHub:
- Source Code Availability: complete source code is provided
- License Texts: all licenses are included in the dependencies
- LGPL Compliance: users can replace LGPL components if desired
- No Modifications: LGPL libraries are used unmodified
Note: As an individual developer distributing non-commercial software, you have additional fair use protections, but this documentation ensures compliance for all users.
- Set up environment variables:
export RUNPOD_API_KEY="your-runpod-api-key-here"
- Create cache directory:
mkdir -p /tmp/llm_cache
No NGINX required! The system uses a high-performance FastAPI proxy with built-in caching and monitoring.
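For orientation, the core request flow can be sketched in a few lines of FastAPI and httpx. This is an illustrative simplification, not the actual `proxy_fastapi.py`: the upstream `POD_URL` and the plain in-memory dict are placeholder assumptions standing in for the real pod endpoint and the size-bounded LRU cache.

```python
# Minimal cache-then-forward sketch (illustrative only; the shipped proxy adds
# LRU eviction, metrics, rate limiting, security headers, and optional SSL).
import hashlib

import httpx
from fastapi import FastAPI, Request, Response

app = FastAPI()
POD_URL = "http://POD_IP:8000"  # placeholder; the real proxy resolves the pod address at startup
_cache = {}                     # stand-in for the size-bounded LRU cache

@app.post("/v1/chat/completions")
async def chat_completions(request: Request) -> Response:
    body = await request.body()
    key = hashlib.sha256(body).hexdigest()        # deterministic cache key for this request
    if key in _cache:                             # cache hit: skip the GPU round trip
        return Response(content=_cache[key], media_type="application/json")
    async with httpx.AsyncClient() as client:     # cache miss: forward to the pod
        upstream = await client.post(
            f"{POD_URL}/v1/chat/completions",
            content=body,
            headers={"Content-Type": "application/json"},
            timeout=60.0,
        )
    _cache[key] = upstream.content                # remember the response for identical requests
    return Response(content=upstream.content, media_type="application/json")
```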
To ensure runpod-llm-manager functions correctly inside WSL2 with Ubuntu:
If you're using newer Ubuntu builds on WSL2, you can enable systemd for better service management:
- Edit your WSL config:
sudo nano /etc/wsl.conf
Add:
[boot]
systemd=true
- Restart WSL:
wsl.exe --shutdown
Note: `systemd` support requires WSL version 0.67.6 or newer. Run `wsl --version` to check.
WSL2 does not start cron automatically. To enable it:
- Install cron if not already present:
sudo apt update
sudo apt install cron
- Allow passwordless startup for cron (optional but recommended for automation):
sudo visudo
Add this line at the bottom:
your_username ALL=NOPASSWD: /usr/sbin/service cron start
- Start cron manually once:
sudo service cron start
- To ensure cron starts automatically when WSL2 boots, use Windows Task Scheduler to run:
wsl -d Ubuntu -- sudo service cron start
on login or system boot.
To automate pod lifecycle management and prevent lingering charges, add the following cron entries:
Runs manage_pod.py to start, restart, or terminate pods based on runtime limits:
*/5 * * * * /usr/bin/python3 /path/to/runpod-llm-manager/manage_pod.py >> /var/log/runpod_watchdog.log 2>&1

Ensures all pods are terminated at midnight regardless of state:

0 0 * * * /usr/bin/python3 /path/to/runpod-llm-manager/manage_pod.py --shutdown >> /var/log/runpod_shutdown.log 2>&1

Replace `/path/to/runpod-llm-manager/` with the actual path to your script. Ensure your user has permission to run Python without `sudo`. These cron jobs work with the existing code and provide automated lifecycle management.
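The watchdog run reduces to comparing the pod's elapsed runtime against `runtime_seconds`. A minimal sketch of that check is shown below; the `pod_state.json` field name (`started_at`) is an assumption for illustration, not the file's actual schema.

```python
# Simplified runtime-limit check in the spirit of the cron watchdog run.
# The state-file field name "started_at" is an illustrative assumption.
import json
import time
from pathlib import Path

STATE_FILE = Path("pod_state.json")
CONFIG_FILE = Path("pod_config.json")

def pod_expired() -> bool:
    """Return True if the tracked pod has exceeded its configured runtime."""
    if not STATE_FILE.exists():
        return False                                  # nothing to watch yet
    state = json.loads(STATE_FILE.read_text())
    config = json.loads(CONFIG_FILE.read_text())
    elapsed = time.time() - state["started_at"]       # seconds since pod creation
    return elapsed > config.get("runtime_seconds", 3600)

if __name__ == "__main__":
    if pod_expired():
        print("Runtime limit exceeded; the watchdog would terminate the pod here.")
```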
Create pod_config.json with comprehensive settings:
{
"modelStoreId": "deepseek-ai/deepseek-coder-33b-awq",
"gpu_type_id": "NVIDIA RTX A6000",
"runtime_seconds": 3600,
"template_id": "vllm"
}

An extended configuration with proxy, SSL, cache, and monitoring options:

{
"modelStoreId": "deepseek-ai/deepseek-coder-33b-awq",
"gpu_type_id": "NVIDIA RTX A6000",
"runtime_seconds": 3600,
"template_id": "vllm",
// Proxy Configuration
"proxy_port": 8000,
"cache_dir": "/tmp/llm_cache",
// SSL/TLS Configuration
"use_https": false,
"ssl_cert": "/path/to/cert.pem",
"ssl_key": "/path/to/key.pem",
// Cache Configuration
"max_cache_size": 1000,
"max_cache_bytes": 1073741824,
// Performance & Monitoring
"enable_profiling": false
}

| Parameter | Type | Default | Description |
|---|---|---|---|
| `modelStoreId` | string | required | Model Store model identifier |
| `gpu_type_id` | string | required | GPU type for pod deployment |
| `runtime_seconds` | int | 3600 | Maximum pod runtime in seconds |
| `template_id` | string | "vllm" | RunPod template identifier |
| `proxy_port` | int | 8000 | Local proxy port |
| `cache_dir` | string | "/tmp/llm_cache" | Cache directory path |
| `use_https` | boolean | false | Enable SSL/TLS |
| `ssl_cert` | string | null | SSL certificate file path |
| `ssl_key` | string | null | SSL private key file path |
| `max_cache_size` | int | 1000 | Maximum number of cached responses |
| `max_cache_bytes` | int | 1 GB (1073741824) | Maximum cache size in bytes |
| `enable_profiling` | boolean | false | Enable debug/profiling endpoints |
| `initial_wait_seconds` | int | 10 | Seconds to wait after pod creation before checking status |
| `max_startup_attempts` | int | 20 | Maximum attempts to wait for the pod to become ready |
| `poll_interval_seconds` | int | 5 | Seconds between pod status checks during startup |
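As a quick illustration of how these defaults can be applied, the sketch below loads `pod_config.json` (assuming a plain JSON file without `//` comments), merges in the table's default values, and checks the two required keys. It is a simplified stand-in for the validation `manage_pod.py` performs itself.

```python
# Sketch: load pod_config.json and apply the defaults from the table above.
import json
from pathlib import Path

DEFAULTS = {
    "runtime_seconds": 3600,
    "template_id": "vllm",
    "proxy_port": 8000,
    "cache_dir": "/tmp/llm_cache",
    "use_https": False,
    "max_cache_size": 1000,
    "max_cache_bytes": 1073741824,   # 1 GB
    "enable_profiling": False,
    "initial_wait_seconds": 10,
    "max_startup_attempts": 20,
    "poll_interval_seconds": 5,
}

def load_config(path="pod_config.json"):
    config = {**DEFAULTS, **json.loads(Path(path).read_text())}
    for required in ("modelStoreId", "gpu_type_id"):   # no defaults for these
        if required not in config:
            raise ValueError(f"pod_config.json is missing required key: {required}")
    return config

print(load_config())
```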
RunPod supports a wide range of open-source models for vLLM pods. To explore available options:
- Go to RunPod Console
- Click Deploy a Pod
- Select Serverless > vLLM Worker
- In the Model dropdown, browse the list of supported Hugging Face models
These models are pre-tested for compatibility with RunPod's vLLM container and expose an OpenAI-style API endpoint.
- Most models listed are public and do not require a Hugging Face token.
- If you select a gated model (e.g. `meta-llama/Llama-3-8B-Instruct`), you'll need to provide an `HF_TOKEN` in your pod config.
- You can also deploy any compatible Hugging Face model manually by specifying its name in your `pod_config.json`.
For examples of known working models, see the `models` list printed during `--refresh-catalog` in verbose mode.
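For gated models, one hypothetical way to wire the token through is to copy it from the environment into the pod config before launching; the sketch below assumes the `HF_TOKEN` key mentioned above and is not an official RunPod mechanism.

```python
# Hypothetical helper: copy a Hugging Face token from the environment into
# pod_config.json so gated models can be pulled. The "HF_TOKEN" key follows
# the note above; how your template consumes it may differ.
import json
import os
from pathlib import Path

config_path = Path("pod_config.json")
config = json.loads(config_path.read_text())

token = os.environ.get("HF_TOKEN")
if token:
    config["HF_TOKEN"] = token
    config_path.write_text(json.dumps(config, indent=2))
    print("Added HF_TOKEN to pod_config.json for gated model access.")
else:
    print("No HF_TOKEN set; gated models (e.g. meta-llama/*) may fail to download.")
```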
Common commands:

# Start or restart pod with watchdog behavior
python3 manage_pod.py
# Force termination of active pod
python3 manage_pod.py --shutdown
# Dry run mode (no actual API calls)
python3 manage_pod.py --dry-run
# Verbose logging
python3 manage_pod.py --verbose
# Refresh catalog and validate configuration
python3 manage_pod.py --refresh-catalog

Once running, your LLM is available at:
http://localhost:8000/v1/chat/completions
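A quick way to exercise the endpoint from Python is a direct `httpx` call using the standard OpenAI chat payload; the model name below is an example and should match whatever you configured for your pod.

```python
# Send a test chat completion through the local proxy.
import httpx

response = httpx.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "deepseek-coder-33b-awq",   # example; match the model configured for your pod
        "messages": [{"role": "user", "content": "Write a Python hello world."}],
        "max_tokens": 128,
    },
    timeout=60.0,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```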
Monitoring endpoints:
- Health Check: `GET /health` - includes rate limiting status
- Metrics: `GET /metrics` - performance and cache statistics
- Dashboard: `GET /dashboard` - comprehensive system overview with security info
- Debug Cache (if profiling enabled): `GET /debug/cache`
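The same endpoints are easy to poll from Python, for example as a readiness check before pointing an editor at the proxy. The sketch below only assumes the endpoints listed above; it does not assume a particular response schema.

```python
# Wait for the proxy's /health endpoint to respond, then print the raw metrics.
import time

import httpx

BASE = "http://localhost:8000"

for attempt in range(10):
    try:
        health = httpx.get(f"{BASE}/health", timeout=5.0)
        if health.status_code == 200:
            print("Proxy healthy:", health.text)
            break
    except httpx.HTTPError:
        pass                  # proxy not up yet; keep retrying
    time.sleep(3)
else:
    raise SystemExit("Proxy did not become healthy in time")

print("Metrics:", httpx.get(f"{BASE}/metrics", timeout=5.0).text)
```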
# Check proxy health
curl http://localhost:8000/health
# Get performance metrics
curl http://localhost:8000/metrics
# View comprehensive dashboard
curl http://localhost:8000/dashboard

Environment variables:

# Required
export RUNPOD_API_KEY="your-api-key"
# Optional (for advanced features)
export MAX_CACHE_SIZE="2000" # Increase cache size
export CACHE_SIZE_BYTES="2147483648" # 2GB cache
export ENABLE_PROFILING="true" # Enable debug endpoints
export PREWARM_CACHE="true" # Pre-populate cache with common patterns
# Security configuration
export RATE_LIMIT_REQUESTS="60" # Requests per window
export RATE_LIMIT_WINDOW="60" # Window in seconds
export USE_HTTPS="false" # Enable HTTPS
export SSL_CERT="/path/to/cert.pem" # SSL certificate path
export SSL_KEY="/path/to/key.pem"    # SSL private key path

This system implements security measures aligned with EU regulations, including the Cyber Resilience Act (CRA) and GDPR. As non-commercial software developed by an individual, you're likely exempt from most CRA requirements, but these features ensure future compliance readiness.
- Rate Limiting: 60 requests/minute per IP with RFC-compliant headers
- Input Validation: Pydantic-based validation with content sanitization
- Security Headers: XSS, CSRF, and content-type protection
- HTTPS Enforcement: HSTS when SSL is enabled
- CORS Protection: Restricted cross-origin access
- Security Monitoring: Structured logging of security events
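Conceptually, the rate limiting amounts to a sliding-window counter per client IP, as in the sketch below. This is an illustration of the idea, not the middleware `proxy_fastapi.py` actually ships, which also emits RFC-compliant rate-limit headers and security logs.

```python
# Illustrative sliding-window rate limiter: 60 requests per 60-second window per IP.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 60
_recent = defaultdict(deque)        # client IP -> timestamps of recent requests

def allow_request(client_ip: str) -> bool:
    now = time.monotonic()
    window = _recent[client_ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()            # drop timestamps that fell out of the window
    if len(window) >= MAX_REQUESTS:
        return False                # over the limit: the caller should answer HTTP 429
    window.append(now)
    return True
```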
# Generate comprehensive security report
python security_utils.py report
# Scan for vulnerabilities
python security_utils.py scan
# Check license compliance
python security_utils.py licenses
# Generate SBOM
python security_utils.py sbom

- Health Endpoint: `/health` - includes rate limit status
- Security Dashboard: `/dashboard` - comprehensive system security info
- Structured Logging: JSON-formatted security event logs
- LRU Eviction: Automatically removes least recently used cache entries
- Size Management: Configurable cache limits (entries and bytes)
- SHA256 Hashing: Fast, collision-resistant cache keys
- Thread-Safe: Concurrent access protection
- Performance: Sub-millisecond cache lookups
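The behaviour described above can be pictured with the small sketch below: SHA256 of the request body as the key, an ordered structure for LRU eviction, and a lock for thread safety. It is a conceptual simplification of the real cache, which additionally enforces a byte-size limit and persists entries under `cache_dir`.

```python
# Conceptual LRU response cache: SHA256 keys, entry-count bound, thread-safe access.
import hashlib
import threading
from collections import OrderedDict

class ResponseCache:
    def __init__(self, max_entries=1000):
        self._entries = OrderedDict()
        self._lock = threading.Lock()
        self._max_entries = max_entries

    @staticmethod
    def key_for(request_body: bytes) -> str:
        return hashlib.sha256(request_body).hexdigest()

    def get(self, key):
        with self._lock:
            if key not in self._entries:
                return None
            self._entries.move_to_end(key)            # mark as most recently used
            return self._entries[key]

    def put(self, key, response_body):
        with self._lock:
            self._entries[key] = response_body
            self._entries.move_to_end(key)
            while len(self._entries) > self._max_entries:
                self._entries.popitem(last=False)     # evict the least recently used entry
```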
- Health Checks: Comprehensive system health monitoring
- Performance Metrics: Response times, cache hit rates, error rates
- System Dashboard: Complete system overview with configuration
- Structured Logging: JSON-formatted logs for easy parsing
- Debug Endpoints: Cache inspection and profiling tools
- File Permissions: Restricted access to sensitive files (0o600)
- SSL/TLS Support: HTTPS with configurable certificates
- Environment Validation: Early validation of API keys and configuration
- Process Isolation: Secure subprocess management
- Async Processing: Non-blocking I/O with FastAPI and httpx
- Connection Pooling: Efficient HTTP client reuse
- Graceful Shutdown: 10-second timeout for clean process termination
- Memory Management: Controlled cache growth with eviction policies
- Lock Files: Prevents concurrent execution conflicts
- State Persistence: Survives system restarts
- Error Recovery: Automatic pod restart on failures
- Health Monitoring: Continuous proxy health validation
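The lock-file mechanism can be implemented with an atomically created PID file, as sketched below. The path and behaviour here are illustrative assumptions rather than a description of what `manage_pod.py` actually does.

```python
# Illustrative lock-file pattern to keep two manager runs from overlapping.
import os
import sys

LOCK_PATH = "/tmp/runpod_manager.lock"   # hypothetical path

def acquire_lock() -> None:
    try:
        # O_CREAT | O_EXCL makes creation atomic: it fails if the file already exists.
        fd = os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        sys.exit("Another instance appears to be running (lock file present).")
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(str(os.getpid()))            # record the owning PID

def release_lock() -> None:
    try:
        os.remove(LOCK_PATH)
    except FileNotFoundError:
        pass                                         # already cleaned up
```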
- No privileged ports required
- No `sudo` needed for any operations
- File permissions automatically restricted
- SSL/TLS support for encrypted communication
- Environment variables for sensitive data
- Process isolation and secure cleanup
To connect the Continue extension to your locally hosted RunPod LLM endpoint, create or update the configuration file at:
~/.continue/config.json
with the following content:
{
"models": [
{
"title": "RunPod DeepSeek",
"provider": "openai",
"model": "deepseek-coder-33b-awq",
"apiBase": "http://localhost:8080/v1"
}
]
}

- `title`: Friendly display name for your model in Continue.
- `provider`: Must be `"openai"` since the RunPod endpoint is OpenAI-compatible.
- `model`: The exact model identifier you configured for your pod.
- `apiBase`: The local URL exposed by your FastAPI proxy (`localhost` and port should match your config, default: 8000).
This setup tells Continue to send requests to your RunPod pod's OpenAI-compatible API endpoint running locally. Remember to restart the Continue extension after saving the config for changes to take effect.
To connect CodeGPT to your locally hosted RunPod LLM endpoint, open your VSCode settings.json file:
File → Preferences → Settings → Open Settings (JSON)

Add the following configuration:
{
"codegpt.model": "openai",
"codegpt.apiKey": "sk-placeholder",
"codegpt.apiBaseUrl": "http://localhost:8080/v1"
}

- `model`: Set to `"openai"` to use OpenAI-compatible formatting.
- `apiKey`: Required by CodeGPT even for local endpoints; use any placeholder string.
- `apiBaseUrl`: Must match your FastAPI proxy URL and port (default: `http://localhost:8000/v1`).

Note: CodeGPT requires a dummy API key even for local endpoints. You can use `"sk-local"` or `"sk-placeholder"`.
Prinova Cody (Sourcegraph Cody) connects to LLMs via a Sourcegraph instance. To use a custom LLM like your RunPod pod, you'll need:
- A Sourcegraph Enterprise instance
- Admin access to configure external LLM endpoints
- A generated access token
Once you have those:
- Open Cody in VSCode
- Click Sign In to Your Enterprise Instance
- Enter your Sourcegraph URL
- Paste your access token
- Select your custom model from the dropdown (if configured)
Note: Cody does not support direct local endpoint configuration in VSCode. You must register your RunPod endpoint with a Sourcegraph instance first.
For full setup instructions, see Sourcegraph's Cody installation guide.
To connect the Kilo Code extension to your locally hosted RunPod LLM proxy:
- Install the Kilo Code extension in VSCode
- Go to VSCode Settings β Extensions β Kilo Code
- Configure the following basic settings:
{
"kilo-code.api.baseUrl": "http://localhost:8000/v1",
"kilo-code.api.key": "sk-local-proxy",
"kilo-code.model.name": "deepseek-coder-33b-awq",
"kilo-code.cache.enabled": true,
"kilo-code.cache.directory": "/tmp/llm_cache"
}

- Start the proxy:
python3 manage_pod.py
- Verify proxy health:
curl http://localhost:8000/health
- Test in VSCode:
- Open a Python file in VSCode
- Use Kilo Code autocomplete or chat features
- Verify requests are routed through your local proxy
- Connection issues: Ensure the proxy is running on port 8000
- Authentication errors: Verify the API key matches `"sk-local-proxy"`
- Slow responses: Check that the cache is working with `curl http://localhost:8000/metrics`
Solution: Install missing dependencies:
pip install uvicorn

Solution: Ensure cache directory permissions:
mkdir -p /tmp/llm_cache
chmod 755 /tmp/llm_cache

Solution: Change the proxy port in `pod_config.json`:
{
"proxy_port": 8001
}

Solution: Verify that the certificate files exist and have correct permissions:
ls -la /path/to/cert.pem /path/to/key.pem
chmod 600 /path/to/key.pem

Solution: Check pod status and logs:

python3 manage_pod.py --verbose

Solution: Check the cache directory and its permissions:

ls -la /tmp/llm_cache/

# Check if proxy is running
ps aux | grep proxy_fastapi
# Check proxy health
curl http://localhost:8000/health
# View cache statistics
curl http://localhost:8000/metrics
# View comprehensive dashboard
curl http://localhost:8000/dashboard

Enable debug endpoints for troubleshooting:
{
"enable_profiling": true
}

Then access debug information:

curl http://localhost:8000/debug/cache

# Stop the proxy and terminate pod
python3 manage_pod.py --shutdown
# Clean up cache files (optional)
rm -rf /tmp/llm_cache/*
# Remove state files
rm -f pod_state.json
rm -f /tmp/fastapi_proxy.pid

The system automatically:
- Terminates pods on expiry
- Cleans up PID files on shutdown
- Removes stale lock files
- Evicts old cache entries
# View cache size
du -sh /tmp/llm_cache/
# Clear all cache
rm -rf /tmp/llm_cache/*
mkdir -p /tmp/llm_cache

- Use fast storage: Place the cache directory on SSD/NVMe for better performance
- Pre-warm cache: Enable `PREWARM_CACHE=true` to populate the cache with common patterns on startup
- Monitor cache hit rates: Use the `/metrics` endpoint to track cache effectiveness (see the sketch after this list)
- Adjust cache size: Increase `MAX_CACHE_SIZE` for better hit rates on large codebases
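For the hit-rate tip, a small helper like the one below can summarise `/metrics` output. It assumes the endpoint returns JSON and that hit/miss counters exist under keys like `cache_hits` and `cache_misses`; adjust the key names to whatever your `/metrics` response actually contains.

```python
# Print an approximate cache hit rate from the proxy's /metrics endpoint.
# The JSON shape and the "cache_hits"/"cache_misses" keys are assumptions.
import httpx

metrics = httpx.get("http://localhost:8000/metrics", timeout=5.0).json()
hits = metrics.get("cache_hits", 0)
misses = metrics.get("cache_misses", 0)
total = hits + misses
if total:
    print(f"Cache hit rate: {hits / total:.1%} ({hits}/{total})")
else:
    print("No cached traffic recorded yet.")
```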
To enable HTTPS, configure the SSL paths in pod_config.json:

{
"use_https": true,
"ssl_cert": "/path/to/cert.pem",
"ssl_key": "/path/to/key.pem"
}

We welcome contributions! Please see our Contributing Guide for detailed information on how to contribute to this project.
- Read the Code of Conduct
- Check the Contributing Guide
- Report issues using our issue templates
- Submit PRs using our pull request template
This system is designed to be:
- Modular: Easy to extend with new features
- Configurable: All major settings are configurable
- Observable: Comprehensive logging and metrics
- Secure: Follows security best practices
- Add configuration options to `pod_config.json`
- Implement functionality in the appropriate module
- Add metrics and logging
- Update documentation
- Test thoroughly
This project is licensed under the MIT License - see the LICENSE file for details.
Please ensure compliance with:
- MIT License: For the original project code
- LGPL Compliance: For chardet and frozendict dependencies (see License Compliance section above)
- RunPod Terms of Service: When using RunPod infrastructure
- Hugging Face model licenses: For any models deployed
- Local data privacy regulations: GDPR and other applicable laws
- Check the troubleshooting section above
- Check proxy health: `curl http://localhost:8000/health`
- Enable verbose mode: `python3 manage_pod.py --verbose`
- Check pod status: `python3 manage_pod.py --refresh-catalog`
Last updated: 2025-09-22