A production-ready, containerized RAG (Retrieval-Augmented Generation) stack with comprehensive monitoring, observability, and enterprise-grade DevOps practices.
This diagram illustrates the complete architecture of the OpenSource RAG LLM Stack, showing the interaction between all components for retrieval-augmented generation, chat history management, and comprehensive monitoring.
RAG Flow: User → Open WebUI → Chroma (retrieve) → Open WebUI → Ollama (generate) → Open WebUI → User
This project demonstrates enterprise-grade AI infrastructure practices:
- Containerized Microservices: Docker Compose orchestration with complete service isolation
- Vector Database: Chroma for semantic search and embeddings storage
- LLM Integration: Containerized Ollama for reproducible LLM inference with Open WebUI interface
- Data Persistence: PostgreSQL with optimized schema for chat history and RAG documents
- Observability: Prometheus metrics collection with Grafana dashboards
- Monitoring: Real-time service health monitoring and performance metrics
- Security: Network isolation, environment-based configuration, and data encryption
- Docker & Docker Compose
- 8GB+ RAM (for LLM models)
```bash
# Clone the repository
git clone <your-repo-url>
cd LLM-RAG-Stack
# Quick start (includes model setup)
./start.sh
# Or manual setup:
# Start all services (includes Ollama)
docker-compose up -d
# Set up Ollama with a model
./scripts/setup-ollama.sh
# Check service status
docker-compose ps
```

Alternatively, to use an existing local Ollama installation:

```bash
# Prerequisites: Install Ollama locally (https://ollama.ai)
# Start Ollama on your host machine
ollama serve
# Start the RAG stack (connects to local Ollama)
docker-compose -f local-ollama-docker-compose.yml up -d
# Check service status
docker-compose -f local-ollama-docker-compose.yml ps
```

Service endpoints (a quick smoke test follows the list):

- Open WebUI: http://localhost:3000 (AI Chat Interface)
- Grafana: http://localhost:3001 (admin/admin123)
- Prometheus: http://localhost:9090 (Metrics)
- Chroma API: http://localhost:8000 (Vector Database)
- PostgreSQL: localhost:5432 (Database)
- Ollama API: http://localhost:11434 (LLM Service)
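Once the stack is running, a quick way to confirm that everything answers over HTTP is to curl each endpoint in turn. A minimal smoke-test sketch using the ports above (Prometheus's /-/healthy and Ollama's /api/tags are their standard health and model-listing paths):

```bash
# Expect 200 from each endpoint listed above
for url in \
  http://localhost:3000 \
  http://localhost:9090/-/healthy \
  http://localhost:8000/api/v2/heartbeat \
  http://localhost:11434/api/tags
do
  printf '%s -> %s\n' "$url" "$(curl -s -o /dev/null -w '%{http_code}' "$url")"
done
```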
```bash
# List available models
docker exec -it ollama ollama list
# Pull a new model
docker exec -it ollama ollama pull llama3.2:3b
# Remove a model
docker exec -it ollama ollama rm llama3.2:3b
# Run the setup script for guided model installation
./scripts/setup-ollama.sh
```

Recommended models (a quick generation test follows the list):

- llama3.2:3b (3B params, ~2GB) - Best balance of speed and quality
- llama3.2:1b (1B params, ~1GB) - Fastest, good for basic tasks
- mistral:7b (7B params, ~4GB) - High quality, slower
- codellama:7b (7B params, ~4GB) - Specialized for coding tasks
- gemma:2b (2B params, ~1.5GB) - Google's efficient model
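After pulling a model, a single non-streaming request against Ollama's standard /api/generate endpoint confirms inference works end to end (the model name here assumes you pulled llama3.2:3b):

```bash
# Send one prompt and print just the generated text
curl -s http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "prompt": "Reply with one short sentence.", "stream": false}' \
  | jq -r '.response'
```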
The project uses Docker Compose for reproducible, self-contained infrastructure:
```yaml
# Complete self-contained setup with Ollama
services:
  ollama:
    image: ollama/ollama:latest
    ports: ["11434:11434"]
    volumes:
      - ollama-data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_ORIGINS=*

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports: ["3000:8080"]
    environment:
      - OLLAMA_API_BASE_URL=http://ollama:11434
      - VECTOR_DB=chroma
      - DATABASE_URL=postgresql://user:password@postgres:5432/chatdb

  chroma:
    image: ghcr.io/chroma-core/chroma:latest
    ports: ["8000:8000"]
    environment:
      - CHROMA_DB_IMPL=duckdb+parquet

  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
      POSTGRES_DB: chatdb
```

For users with an existing Ollama installation, use local-ollama-docker-compose.yml:
```yaml
# Connects to a local Ollama installation on the host
services:
  open-webui:
    environment:
      - OLLAMA_API_BASE_URL=http://host.docker.internal:11434
```
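With this variant the containers reach the host's Ollama through host.docker.internal, so it is worth confirming the host daemon is up before starting the stack (/api/tags is Ollama's standard model-listing endpoint):

```bash
# Confirm the host's Ollama daemon is reachable and list installed models
curl -s http://localhost:11434/api/tags | jq '.models[].name'
```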
```yaml
# Observability Services
prometheus:
  image: prom/prometheus:latest
  ports: ["9090:9090"]
  volumes:
    - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml

grafana:
  image: grafana/grafana-oss:latest
  ports: ["3001:3000"]
  environment:
    - GF_SECURITY_ADMIN_PASSWORD=admin123
```

The database is initialized with an optimized schema for RAG operations:
```sql
-- Chat Sessions Management
CREATE TABLE chat_sessions (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
user_id VARCHAR(255) NOT NULL,
session_name VARCHAR(255),
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
-- Message Storage with Full-Text Search
CREATE TABLE chat_messages (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
session_id UUID REFERENCES chat_sessions(id),
role VARCHAR(50) CHECK (role IN ('user', 'assistant', 'system')),
content TEXT NOT NULL,
token_count INTEGER DEFAULT 0
);
-- RAG Document Storage
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
title VARCHAR(500),
content TEXT NOT NULL,
source VARCHAR(500),
embedding_id VARCHAR(255), -- Chroma reference
metadata JSONB DEFAULT '{}'::jsonb
);
-- Performance Indexes
CREATE INDEX idx_documents_content_gin ON documents
USING gin(to_tsvector('english', content));
```
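With the GIN index in place, ingested documents can be searched with standard Postgres full-text queries. A quick check from the host, assuming documents have already been uploaded (the search term is just an example):

```bash
# Full-text search over the documents table, served by the GIN index above
docker exec -it postgres psql -U user -d chatdb -c \
  "SELECT title, source FROM documents
   WHERE to_tsvector('english', content) @@ plainto_tsquery('english', 'vector database');"
```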
To ingest documents through Open WebUI:

```bash
# Access Open WebUI
open http://localhost:3000
# Navigate to Knowledge section
# Upload documents (PDF, TXT, etc.)
# System automatically:
# - Chunks documents
# - Generates embeddings
# - Stores in Chroma vector database
```

Verify the embeddings landed in Chroma:

```bash
# Check Chroma collections
curl -s http://localhost:8000/api/v2/tenants/default/databases/default/collections | jq '.'
# Verify heartbeat
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/api/v2/heartbeat
```

To query with RAG:

- Ask questions in Open WebUI that reference uploaded content
- System retrieves relevant chunks from Chroma
- Augments prompts with retrieved context
- Generates responses using Ollama LLM
Key metrics available in Prometheus (and queryable directly, as shown after this list):

- Service Health: `up{job=~"prometheus|postgres_exporter"}`
- Database Performance: PostgreSQL exporter metrics
- Request Rates: HTTP request monitoring
- Resource Usage: Container and system metrics
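All of these series can also be queried outside Grafana via Prometheus's standard HTTP API, which is handy for scripting health checks:

```bash
# Ask Prometheus which scrape targets are currently up
curl -s 'http://localhost:9090/api/v1/query?query=up' \
  | jq '.data.result[] | {job: .metric.job, up: .value[1]}'
```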
Pre-configured dashboards include:
- RAG Stack Overview: Service health and performance
- Database Metrics: PostgreSQL performance monitoring
- System Resources: CPU, memory, and disk usage
- Request Analytics: API call patterns and response times
The RAG Stack Monitoring dashboard provides real-time insights into:
- Service Health Status: Live monitoring of all stack components
- Active Services Count: Overview of running services
- Request Rate Monitoring: API performance metrics
- Database Performance: PostgreSQL metrics and health
```yaml
# Grafana automatically configures:
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true

# Dashboards auto-loaded from:
# monitoring/grafana/dashboards/
```
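To confirm provisioning worked, Grafana's HTTP API can list the configured datasources (credentials are the admin defaults from the compose file above):

```bash
# List provisioned datasources using the default admin credentials
curl -s -u admin:admin123 http://localhost:3001/api/datasources | jq '.[].name'
```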
```bash
# Production environment variables
export POSTGRES_PASSWORD=secure_password
export GRAFANA_ADMIN_PASSWORD=secure_admin_password
export OLLAMA_API_BASE_URL=https://your-ollama-instance.com
```
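Compose reads variables like these from the shell environment or from an env file, provided the compose files reference them with ${VAR}-style interpolation rather than hardcoded values. One way to keep production secrets out of the repository (the .env.prod file name is illustrative):

```bash
# Keep production values in a dedicated env file (name is illustrative)
cat > .env.prod <<'EOF'
POSTGRES_PASSWORD=secure_password
GRAFANA_ADMIN_PASSWORD=secure_admin_password
EOF

# Point Compose at it when starting the stack
docker-compose --env-file .env.prod up -d
```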
Scaling options:

- Horizontal Scaling: Multiple Ollama instances behind a load balancer
- Database Scaling: PostgreSQL read replicas for query performance
- Vector DB Scaling: Chroma clustering for high availability
- Monitoring: Prometheus federation for multi-instance monitoring
- Change default passwords in production
- Use Docker secrets for sensitive data
- Configure network security policies
- Enable SSL/TLS for all services
- Implement proper backup strategies
```bash
# View logs
docker-compose logs [service-name]
# Restart services
docker-compose restart [service-name]
# Clean restart
docker-compose down
docker-compose up -d
# For local Ollama setup, use:
# docker-compose -f local-ollama-docker-compose.yml [command]
```

If Chroma is not responding:

```bash
# Check Chroma connection
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/api/v2/heartbeat
# Create tenant/database if needed
curl -X POST http://localhost:8000/api/v2/tenants \
-H "Content-Type: application/json" \
-d '{"name": "default"}'
curl -X POST http://localhost:8000/api/v2/tenants/default/databases \
-H "Content-Type: application/json" \
  -d '{"name": "default"}'
```

If PostgreSQL is misbehaving:

```bash
# Check PostgreSQL status
docker-compose -f local-ollama-docker-compose.yml logs postgres
# Verify database initialization
docker exec -it postgres psql -U user -d chatdb -c "\dt"
```

All data is persisted in Docker volumes (a backup sketch follows the list):
- ollama-data: LLM models and Ollama configuration
- openwebui-data: WebUI configuration and user data
- chroma-data: Vector embeddings and collections
- pgdata: PostgreSQL database files
- grafana-data: Dashboard configuration and user settings
- prometheus-data: Metrics time-series data
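Because all state lives in named volumes, backups reduce to archiving each volume. A minimal sketch for the PostgreSQL volume; note that Compose may prefix volume names with the project name (e.g., llm-rag-stack_pgdata), so check docker volume ls first:

```bash
# Archive the pgdata volume to a tarball in the current directory
docker run --rm \
  -v pgdata:/data:ro \
  -v "$(pwd)":/backup \
  alpine tar czf /backup/pgdata-backup.tar.gz -C /data .
```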
- Infrastructure as Code: Docker Compose for reproducible deployments
- Monitoring: Comprehensive observability with Prometheus and Grafana
- Data Management: Optimized PostgreSQL schema with full-text search
- Security: Network isolation and environment-based configuration
- Scalability: Microservices architecture for horizontal scaling
- Vector Search: Chroma for semantic similarity search
- Containerized LLM: Ollama in Docker for reproducible model inference
- RAG Pipeline: Complete retrieval-augmented generation workflow
- Document Processing: Automatic chunking and embedding generation
- Chat History: Persistent conversation management
- Model Management: Easy model switching and versioning with Docker volumes
AI/ML Infrastructure Engineer with expertise in:

- Containerized AI/ML workloads
- Vector databases and RAG systems
- Observability and monitoring
- Enterprise DevOps practices

Contact:

- GitHub: [Your GitHub Profile]
- LinkedIn: [Your LinkedIn Profile]
- Portfolio: [Your Portfolio Website]
This project demonstrates modern AI infrastructure practices, enterprise-grade monitoring, and production-ready RAG system implementation.

