This repository is a showcase and evolving codebase for building and orchestrating AI systems from the ground up, designed by a Full Stack AI Engineer with end-to-end expertise across mathematics, data engineering, software development, and Kubernetes-based infrastructure.
This project demonstrates a complete, production-grade architecture to:
- Operate an LLM cluster using Ollama models.
- Harness GPU acceleration using the NVIDIA GPU Operator on Kubernetes.
- Use Custom Resource Definitions (CRDs) and controllers to coordinate model behaviors.
- Create an ecosystem where multiple models can collaborate to perform higher-level tasks (question answering, summarization, classification, etc.).
- Establish infrastructure-as-code patterns using Kustomize, Flux, and GitOps principles.
AI systems are rarely "one model fits all." This project introduces a framework where specialized AI agents (LLMs), hosted as services across a Kubernetes cluster, can interoperate to complete sophisticated tasks.
Inspired by:
- Full-stack software engineering principles
- Multi-agent systems
- MLOps best practices
- Declarative infrastructure management
```
.
├── infra/
│   ├── cluster-iac/            # Infrastructure as Code (Terraform) for deploying an EKS cluster with requisite GPU support
│   ├── base/                   # Base Kustomize configurations (Flux, GPU Operator, etc.)
│   ├── overlays/               # Cluster-specific configurations
│   ├── flux/                   # Flux GitOps setup
│   └── monitoring/             # Prometheus/Grafana, if used
│
├── crds/                       # Custom Resource Definitions (YAML) and Go types
│   ├── ollamaagent_crd.yaml    # Defines OllamaAgent behavior/contract
│   ├── ollamamodeldefinition_crd.yaml
│   └── taskorchestration_crd.yaml
│
├── controllers/                # Golang operators/controllers (kubebuilder-based)
│   ├── ollamaagent_controller.go
│   ├── ollamamodeldefinition_controller.go
│   └── taskorchestration_controller.go
│
├── ollama-operators/           # Model server orchestration logic
│   ├── agent-specialization/   # Specialized agent roles (Q&A, summarizer, etc.)
│   ├── service-deployments/    # Helm or Kustomize configs for model deployments
│   └── collab-logic/           # Logic for inter-agent communication & orchestration
│
├── data/                       # Data pipeline logic (ETL, tokenization, chunking, etc.)
│   ├── etl-pipeline/
│   └── example-datasets/
│
├── api/                        # API gateway and backend logic (Go or Python)
│   ├── routes/                 # Task submission endpoints
│   └── orchestration/          # Converts user requests into CRs for processing
│
├── examples/                   # Example workflows and scenarios
│   ├── question-answering/
│   ├── summarization-pipeline/
│   └── multi-model-chat/
│
├── docs/                       # Architecture diagrams and documentation
│   ├── architecture.md
│   ├── ollama-crd-spec.md
│   └── orchestration-diagram.png
│
└── README.md
```
| Layer | Technology | Purpose |
|---|---|---|
| Container Runtime | containerd | Lightweight, Kubernetes-native runtime |
| GPU Provisioning | NVIDIA GPU Operator | Automatically manages GPU drivers and toolkit |
| GitOps | Flux | Declarative and auditable infra delivery |
| K8s Package Manager | Kustomize + Helm | Infra and app lifecycle management |
| Model Hosting | Ollama (on GPU nodes) | LLM serving engine |
| Task Coordination | Custom Resource Definitions (CRDs) | Define and manage complex task orchestration |
| Monitoring | Prometheus + Grafana (optional) | Cluster and model performance observability |
The system uses Kubernetes CRDs to implement the A2A (Agent-to-Agent) protocol, enabling seamless communication between AI agents. Our CRDs define both the agent deployment and task orchestration aspects of the system.
A CRD for deploying and managing individual model agents that implement the A2A protocol.
```yaml
apiVersion: ai.stack/v1alpha1
kind: OllamaAgent
metadata:
  name: summarizer-agent
spec:
  # Reference to the OllamaModelDefinition
  modelDefinition:
    name: summarizer-model
    version: "1.0.0"
  # Core agent configuration
  role: summarizer
  # A2A protocol implementation
  agentCard:
    capabilities:
      - summarization
      - text-analysis
    endpoint: "/api/v1/agent"
    authentication:
      type: "bearer"
  # Resource requirements
  resources:
    gpu: 1
    memory: "8Gi"
    cpu: "2"
  # A2A server configuration
  server:
    streaming: true
    pushNotifications: true
    webhookConfig:
      retryPolicy: exponential
      maxRetries: 3
  # Model-specific settings
  modelConfig:
    temperature: 0.7
    contextWindow: 4096
    responseFormat: "json"
```
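For reference, the Go types backing a spec like this might look roughly as follows in a kubebuilder project. The field names below are inferred from the YAML example above, not taken from the actual `controllers/` source:

```go
// Sketch of kubebuilder-style Go types for the OllamaAgent CRD.
// Field names are inferred from the YAML example above and may not
// match the actual types in controllers/.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ModelDefinitionRef points at the OllamaModelDefinition an agent serves.
type ModelDefinitionRef struct {
	Name    string `json:"name"`
	Version string `json:"version"`
}

// AgentCard describes the agent's A2A capabilities and endpoint.
type AgentCard struct {
	Capabilities []string `json:"capabilities"`
	Endpoint     string   `json:"endpoint"`
}

// OllamaAgentSpec mirrors the spec section of the YAML example;
// resources, server, and modelConfig are omitted for brevity.
type OllamaAgentSpec struct {
	ModelDefinition ModelDefinitionRef `json:"modelDefinition"`
	Role            string             `json:"role"`
	AgentCard       AgentCard          `json:"agentCard"`
}

// OllamaAgent is the top-level custom resource.
type OllamaAgent struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              OllamaAgentSpec `json:"spec,omitempty"`
}
```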
A CRD that defines how to build a custom Ollama model with specific capabilities and behaviors. When created, it triggers the build process within the cluster.
```yaml
apiVersion: ai.stack/v1alpha1
kind: OllamaModelDefinition
metadata:
  name: summarizer-model
spec:
  # Base model configuration
  from: llama2
  # Model build parameters
  build:
    # System prompt defining agent behavior
    system: |
      You are a specialized summarization agent that excels at:
      1. Extracting key information from documents
      2. Creating concise summaries
      3. Identifying main themes and topics
    # Parameters for model behavior
    parameters:
      temperature: 0.7
      contextWindow: 4096
      responseFormat: json
    # Model adaptation and fine-tuning
    template: |
      {{ if .System }}{{.System}}{{ end }}
      Context: {{.Input}}
      Instructions: Create a summary that includes:
      - Main points
      - Key findings
      - Action items
      Response format:
      {{.ResponseFormat}}
    # Custom function definitions
    functions:
      - name: extract_key_points
        description: "Extract main points from the text"
        parameters:
          type: object
          properties:
            main_points:
              type: array
              items:
                type: string
            themes:
              type: array
              items:
                type: string
  # Model tags for versioning and identification
  tags:
    version: "1.0.0"
    type: "summarizer"
    capabilities: ["text-analysis", "summarization"]
  # Resource requirements for the build process
  buildResources:
    gpu: 1
    memory: "16Gi"
    cpu: "4"
status:
  phase: Building # Building, Complete, Failed
  buildStartTime: "2025-04-23T13:30:00Z"
  lastBuildTime: "2025-04-23T13:35:00Z"
  modelHash: "sha256:abc123..."
  conditions:
    - type: Built
      status: "True"
      reason: "BuildSucceeded"
      message: "Model successfully built and registered"
```
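Consumers of this status block (the agent controller, or a client waiting on a build) can lean on the standard condition helpers from apimachinery. A minimal sketch, assuming the conditions are stored as standard `metav1.Condition` values as the example suggests:

```go
// Minimal sketch of gating on the Built condition of an
// OllamaModelDefinition, assuming conditions are standard
// metav1.Condition values as the status example above suggests.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Stand-in for status.conditions as decoded from the cluster.
	conditions := []metav1.Condition{
		{
			Type:    "Built",
			Status:  metav1.ConditionTrue,
			Reason:  "BuildSucceeded",
			Message: "Model successfully built and registered",
		},
	}

	// IsStatusConditionTrue is the conventional way to check a condition.
	if meta.IsStatusConditionTrue(conditions, "Built") {
		fmt.Println("model is ready; OllamaAgent instances can reference it")
	} else if c := meta.FindStatusCondition(conditions, "Built"); c != nil {
		fmt.Printf("build not ready: %s (%s)\n", c.Reason, c.Message)
	}
}
```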
A CRD that manages complex task workflows between multiple agents.
```yaml
apiVersion: ai.stack/v1alpha1
kind: TaskOrchestration
metadata:
  name: document-analysis
spec:
  # Task definition
  input:
    text: "Analyze and summarize this document"
    format: "text/plain"
  # A2A task workflow
  pipeline:
    - name: document-analyzer
      agentRef: analyzer-agent
      timeout: "5m"
      retries: 2
      artifacts:
        - name: analysis-result
          type: "application/json"
    - name: summarizer
      agentRef: summarizer-agent
      dependsOn: ["document-analyzer"]
      inputFrom:
        - taskRef: document-analyzer
          artifactName: analysis-result
    - name: quality-check
      agentRef: qa-agent
      dependsOn: ["summarizer"]
      condition: "success"
  # A2A protocol settings
  communication:
    streaming: true
    pushNotifications:
      enabled: true
      endpoint: "http://callback-service/webhook"
  # Output configuration
  output:
    storage:
      type: "s3"
      bucket: "ai-results"
      prefix: "outputs/"
    format:
      - type: "application/json"
      - type: "text/markdown"
  # Error handling
  errorPolicy:
    maxRetries: 3
    backoffLimit: 600
    failureAction: "rollback"
```
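Before dispatching work, the orchestration controller has to turn `dependsOn` into an execution order. A minimal sketch of that dependency resolution, using hypothetical stand-in types rather than the repository's actual code:

```go
// Minimal sketch of resolving a TaskOrchestration pipeline's dependsOn
// graph into an execution order. Types are hypothetical stand-ins.
package main

import "fmt"

type Step struct {
	Name      string
	DependsOn []string
}

// executionOrder returns the steps in a valid dependency order,
// or an error on a cycle or a missing reference.
func executionOrder(steps []Step) ([]string, error) {
	byName := map[string]Step{}
	for _, s := range steps {
		byName[s.Name] = s
	}
	var order []string
	state := map[string]int{} // 0 = unvisited, 1 = in progress, 2 = done
	var visit func(name string) error
	visit = func(name string) error {
		switch state[name] {
		case 1:
			return fmt.Errorf("dependency cycle at %q", name)
		case 2:
			return nil
		}
		step, ok := byName[name]
		if !ok {
			return fmt.Errorf("unknown step %q", name)
		}
		state[name] = 1
		for _, dep := range step.DependsOn {
			if err := visit(dep); err != nil {
				return err
			}
		}
		state[name] = 2
		order = append(order, name)
		return nil
	}
	for _, s := range steps {
		if err := visit(s.Name); err != nil {
			return nil, err
		}
	}
	return order, nil
}

func main() {
	steps := []Step{
		{Name: "document-analyzer"},
		{Name: "summarizer", DependsOn: []string{"document-analyzer"}},
		{Name: "quality-check", DependsOn: []string{"summarizer"}},
	}
	order, err := executionOrder(steps)
	if err != nil {
		panic(err)
	}
	fmt.Println(order) // [document-analyzer summarizer quality-check]
}
```

Steps with no ordering relationship between them could then be dispatched concurrently.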
The controllers implement the A2A protocol's core functionality:
- **Agent Discovery:**
  - Automatically generates and manages `.well-known/agent.json` endpoints (see the sketch after this list)
  - Handles capability registration and updates
  - Manages agent metadata and health checks
- **Task Management:**
  - Implements the A2A task lifecycle (submitted → working → completed/failed)
  - Handles streaming updates via Server-Sent Events (SSE)
  - Manages task artifacts and state transitions
- **Communication:**
  - Implements A2A message formats and parts
  - Handles both synchronous and streaming communication
  - Manages push notifications and webhooks
- **Resource Orchestration:**
  - GPU allocation and scheduling
  - Memory and compute resource management
  - Model loading and unloading
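As an illustration of the agent-discovery piece, here is a minimal sketch of serving an A2A-style agent card at `/.well-known/agent.json`, populated with values from the OllamaAgent example earlier. This is a sketch of the idea, not the repository's actual handler:

```go
// Minimal sketch of A2A agent discovery: serving an agent card at
// /.well-known/agent.json. Card fields follow the OllamaAgent YAML
// example earlier; this is illustrative, not the actual handler.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// AgentCard is the discovery document other agents fetch.
type AgentCard struct {
	Name         string   `json:"name"`
	Capabilities []string `json:"capabilities"`
	Endpoint     string   `json:"endpoint"`
}

func main() {
	card := AgentCard{
		Name:         "summarizer-agent",
		Capabilities: []string{"summarization", "text-analysis"},
		Endpoint:     "/api/v1/agent",
	}

	http.HandleFunc("/.well-known/agent.json", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		if err := json.NewEncoder(w).Encode(card); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
		}
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```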
We provide a consistent development environment using VS Code Dev Containers. This ensures all developers have the same tools and versions.
1. **Prerequisites:** Docker and VS Code with the Dev Containers (Remote - Containers) extension.

2. **Getting Started:**

   ```bash
   # Clone the repository
   git clone https://github.com/yourusername/fullStackOllama.git
   cd fullStackOllama

   # Open in VS Code
   code .

   # Click "Reopen in Container" when prompted,
   # or use the Command Palette (F1) -> "Remote-Containers: Reopen in Container"
   ```
The dev container includes:
- All required development tools
- Pre-configured pre-commit hooks
- VS Code extensions for Terraform, Go, and Kubernetes
- AWS and Kubernetes config mounting
Alternatively, if you prefer a local installation, set up the tools manually as described below.
This repository uses pre-commit hooks to ensure code quality and consistency. The following checks are performed before each commit:
- **General Checks**
  - Trailing whitespace removal
  - End-of-file fixing
  - YAML syntax validation
  - Large file checks
  - Merge conflict detection
  - Private key detection
- **Terraform Checks**
  - Format validation (`terraform fmt`)
  - Configuration validation (`terraform validate`)
  - Documentation updates
  - Security scanning (Checkov)
  - Linting (TFLint)
- **Go Code Checks**
  - Format validation (`go fmt`)
  - Code analysis (`go vet`)
  - Comprehensive linting (golangci-lint)
- **Custom Validations**
  - CRD syntax and structure validation
  - Model definition validation
  - Kubernetes resource validation
1. Install pre-commit:

   ```bash
   brew install pre-commit
   ```

2. Install the required tools:

   ```bash
   brew install terraform-docs tflint checkov
   go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest
   ```

3. Install the pre-commit hooks:

   ```bash
   pre-commit install
   ```

4. (Optional) Run against all files:

   ```bash
   pre-commit run --all-files
   ```
The same checks are run in CI/CD pipelines to ensure consistency. See the GitHub Actions workflows for details.
The model building process follows GitOps principles, ensuring that all changes are tracked, reviewed, and automatically deployed:
1. **Model Definition**

   ```yaml
   # models/summarizer/model.yaml
   apiVersion: ai.stack/v1alpha1
   kind: OllamaModelDefinition
   metadata:
     name: summarizer-model
   spec:
     from: llama2
     build:
       system: |
         You are a specialized summarization agent...
   ```
2. **Pull Request Flow**
   - Create a branch: `feature/add-summarizer-model`
   - Add/modify the model definition in the `models/` directory
   - Create a PR with the changes
   - Automated validation:
     - YAML syntax
     - Model definition schema
     - Resource requirements check
     - Security scanning
   - PR review and approval
   - Merge to the main branch
3. **Flux Synchronization**

   ```yaml
   # infra/base/models/kustomization.yaml
   apiVersion: kustomize.config.k8s.io/v1beta1
   kind: Kustomization
   resources:
     - ../../models  # Watches the models directory
   ```

   - Flux detects changes in the `models/` directory
   - Applies new/modified OllamaModelDefinitions to the cluster
   - Triggers the build controller
4. **Build Process**

   ```mermaid
   sequenceDiagram
       participant Flux
       participant API Server
       participant Build Controller
       participant Build Job
       participant Registry
       Flux->>API Server: Apply OllamaModelDefinition
       API Server->>Build Controller: Notify new/modified definition
       Build Controller->>Build Job: Create build job
       Build Job->>Build Job: Execute ollama create
       Build Job->>Registry: Push built model
       Build Job->>API Server: Update status
       Build Controller->>API Server: Update conditions
   ```
5. **Build Controller Actions**
   - Creates a Kubernetes Job for the build (a sketch of such a Job appears after this list)
   - Mounts required GPU resources
   - Executes `ollama create` with the definition
   - Monitors build progress
   - Updates status conditions
   - Handles failures and retries
   - Registers successful builds
6. **Model Registration**
   - Successful builds are registered in the cluster
   - The model becomes available to OllamaAgent instances
   - Version tracking and rollback support
   - Automatic cleanup of old versions
7. **Monitoring & Logs**

   ```text
   # Example build job logs
   2025-04-23T13:30:00Z [INFO] Starting build for summarizer-model
   2025-04-23T13:30:05Z [INFO] Downloading base model llama2
   2025-04-23T13:31:00Z [INFO] Applying model adaptations
   2025-04-23T13:32:00Z [INFO] Registering model summarizer-model:1.0.0
   2025-04-23T13:32:05Z [INFO] Build complete
   ```
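To make step 5 concrete, here is a sketch of how a build controller might construct such a Job: it requests a GPU via the `nvidia.com/gpu` resource exposed by the GPU Operator and runs `ollama create`. The image, Modelfile path, and function name are hypothetical, and volume mounting of the Modelfile is omitted:

```go
// Sketch of the build Job a controller might create for an
// OllamaModelDefinition. The image and paths are hypothetical; the
// real logic lives in controllers/ollamamodeldefinition_controller.go.
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func buildJob(modelName, modelfilePath string) *batchv1.Job {
	backoffLimit := int32(3)
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: modelName + "-build"},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoffLimit,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "ollama-build",
						Image: "ollama/ollama:latest", // hypothetical build image
						// ollama create builds a model from a Modelfile.
						Command: []string{"ollama", "create", modelName, "-f", modelfilePath},
						Resources: corev1.ResourceRequirements{
							Limits: corev1.ResourceList{
								// The NVIDIA GPU Operator exposes GPUs as nvidia.com/gpu.
								"nvidia.com/gpu":      resource.MustParse("1"),
								corev1.ResourceMemory: resource.MustParse("16Gi"),
							},
						},
					}},
				},
			},
		},
	}
}

func main() {
	job := buildJob("summarizer-model", "/models/summarizer/Modelfile")
	fmt.Println("would create Job:", job.Name)
}
```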
- All model definitions are version controlled
- PR reviews ensure quality and security
- Base models are pulled from trusted sources
- Build jobs run in isolated environments
- Resource limits are strictly enforced
- Model provenance is tracked and verified
- Build jobs are scheduled based on GPU availability
- Parallel builds are supported with resource quotas
- Failed builds are automatically cleaned up
- Successful builds are cached for reuse
- Version tags ensure reproducibility
- Kubernetes cluster with GPU-enabled nodes (AWS EKS, GKE, or bare-metal)
- NVIDIA GPU Operator installed
- kubectl + Kustomize + Helm
- Golang (for controller development)
1. **Set up Infrastructure**

   ```bash
   # Deploy the EKS cluster using Terraform
   cd infra/cluster-iac
   terraform init
   terraform apply
   ```
2. **Bootstrap Flux**

   The repository includes a bootstrap script to set up Flux with the correct configuration:

   ```bash
   # Option 1: Using an environment variable
   export GITHUB_TOKEN=your_github_token
   ./scripts/bootstrap-flux.sh

   # Option 2: Passing the token directly
   ./scripts/bootstrap-flux.sh -t your_github_token

   # Additional options available:
   ./scripts/bootstrap-flux.sh -h  # Show help
   ```

   The bootstrap script will:
   - Install the Flux CLI if not present
   - Clean up any existing Flux installation
   - Configure Flux with your GitHub repository
   - Set up monitoring and logging components
   - Verify the installation and show status
3. **Apply CRDs**

   ```bash
   kubectl apply -f crds/
   ```
4. **Deploy example agents**

   ```bash
   kubectl apply -f ollama-operators/service-deployments/
   ```
5. **Submit an orchestration task**

   ```bash
   kubectl apply -f examples/question-answering/task.yaml
   ```
See `docs/architecture.md` and `docs/orchestration-diagram.png` for detailed system visuals.
This project is a personal and professional showcase, but contributors are welcome: PRs, issues, and suggestions are encouraged.
This project is also a journey of exploration. Through it, we aim to learn and demonstrate:
- GPU scheduling with Kubernetes
- Multi-agent AI orchestration
- Building CRDs and operators with Go
- Best practices in GitOps and cloud-native ML
- Open-source model hosting and scaling
MIT License