Vaishnav88sk/claritty

Claritty - AI-SRE for Kubernetes

License: MIT Go Version Latest Release CI Status PRs Welcome

Production-grade AIOps platform for cluster observability, incident response & auto-remediation

Claritty is an open-source, cloud-native AI Site Reliability Engineering platform for Kubernetes clusters. It combines real-time cluster telemetry with a 6-stage AI agent pipeline to automatically detect, diagnose, and remediate incidents, reducing MTTR from hours to minutes.


🚀 Installation (For End Users)

You can quickly install and deploy Claritty without needing to build from source. Choose the mode you want to use.

Option 1: Install Clarctl CLI (Local Tool)

Download the pre-compiled binary to your local machine to diagnose your clusters instantly:

# 1. Download the latest binary (Linux/macOS)
curl -sL https://raw.githubusercontent.com/Vaishnav88sk/claritty/clarctl-go/clarctl-go/install.sh | bash

# 2. Run help
clarctl -h

# 3. Run a scan!
clarctl scan

Option 2: Deploy the SRE Agent & Hub (In-Cluster)

Deploy the centralized dashboard and the agent into your clusters for continuous monitoring. For detailed steps, see INSTALLATION.md.

Start the Hub Server (Dashboard)

# Run the Hub via Docker Compose
export DATABASE_URL="postgresql://user:pass@host:5432/claritty?sslmode=require"
curl -sL https://raw.githubusercontent.com/Vaishnav88sk/claritty/master/sre-agent/docker-compose.yml -o docker-compose.yml
docker-compose up -d
# View dashboard at http://localhost:8822

Deploy the Agent to your Clusters

# Apply agent manifests
kubectl apply -f https://raw.githubusercontent.com/Vaishnav88sk/claritty/master/sre-agent/deploy/agent-rbac.yaml
kubectl apply -f https://raw.githubusercontent.com/Vaishnav88sk/claritty/master/sre-agent/deploy/agent-configmap.yaml
kubectl apply -f https://raw.githubusercontent.com/Vaishnav88sk/claritty/master/sre-agent/deploy/agent-deployment.yaml

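(Remember to update the ConfigMap with your Hub IP and cluster name. Because the manifest above is applied straight from the repository, either download and edit agent-configmap.yaml before applying it, or edit the ConfigMap in the cluster afterwards.)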


🌟 The Two Modes of Claritty

Claritty provides two powerful ways to interact with your Kubernetes infrastructure, depending on your needs:

1. Clarctl CLI (Local Remediation Tool)

A powerful command-line interface run from your local machine. It connects to your current Kubernetes context to instantly analyze namespaces or specific pods, generate an RCA (Root Cause Analysis), and offer interactive, step-by-step remediation commands. Perfect for on-call engineers debugging live incidents.

2. SRE Agent & Hub (Centralized Platform)

A lightweight, in-cluster daemon (the Agent) that continuously monitors your infrastructure. It autonomously performs the 6-stage AI pipeline on failing resources and pushes structured incident reports to a centralized Hub server. The Hub provides a beautiful web dashboard for a multi-cluster overview, Slack alerts, and detailed RCA records. Perfect for production monitoring.
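
As a rough illustration of how a staged pipeline like this can hand a finding from one agent to the next, here is a minimal Go sketch. The stage names and their order come from the Features list (Triage, Metrics, Logs, Infra, Runbook, Commander); the `Finding` and `Stage` types, `RunPipeline`, and the note text are hypothetical and are not taken from Claritty's source. No cluster or LLM calls are made.

```go
package main

import (
	"fmt"
	"strings"
)

// Finding is a hypothetical record that accumulates notes as it moves
// through the pipeline. It is not Claritty's actual data model.
type Finding struct {
	Resource string
	Notes    []string
}

// Stage is one agent in the pipeline. A real stage would query the
// cluster or an LLM; in this sketch each stage only appends a note.
type Stage struct {
	Name string
	Run  func(f *Finding) error
}

// note builds a Stage that records a single observation.
func note(name, msg string) Stage {
	return Stage{Name: name, Run: func(f *Finding) error {
		f.Notes = append(f.Notes, name+": "+msg)
		return nil
	}}
}

// RunPipeline executes the stages in order and stops at the first error.
func RunPipeline(f *Finding, stages []Stage) error {
	for _, s := range stages {
		if err := s.Run(f); err != nil {
			return fmt.Errorf("stage %s: %w", s.Name, err)
		}
	}
	return nil
}

// DefaultStages lists the six stage names in the documented order.
// The messages are placeholders.
func DefaultStages() []Stage {
	return []Stage{
		note("Triage", "classify the failing resource"),
		note("Metrics", "collect CPU and memory usage"),
		note("Logs", "collect recent container logs"),
		note("Infra", "inspect node and service state"),
		note("Runbook", "match a runbook to the failure"),
		note("Commander", "assemble RCA and remediation plan"),
	}
}

func main() {
	f := &Finding{Resource: "prod/payment-service"}
	if err := RunPipeline(f, DefaultStages()); err != nil {
		fmt.Println("pipeline failed:", err)
		return
	}
	fmt.Println(strings.Join(f.Notes, "\n"))
}
```

Keeping each stage behind the same small function signature makes it straightforward to test stages in isolation or to reorder them.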


✨ Features

  • 📊 Node-level & Pod-level Metrics: Real-time CPU, memory, and resource usage collection.
  • ⚡ Auto Incident Detection: Detects complex cascading failures, API server throttling, DNS resolution timeouts, Split-Brain StatefulSets, network partition deadlocks, alongside standard CrashLoopBackOff, OOMKilled, and Pending states.
  • 🧠 6-Stage AI Agent Pipeline: Triage -> Metrics -> Logs -> Infra -> Runbook -> Commander agents collaboratively diagnose root causes.
  • 🚨 Interactive Auto-Remediation (CLI): Proposes step-by-step kubectl fixes locally. Prompts y / dry / n before executing anything.
  • 🌐 Centralized Dashboard (Agent): Web UI to view multi-cluster health, active incidents, and automated remediation plans.
  • 🔒 Safety First: All remediation commands are validated against a strict allowlist. Destructive commands are flagged.
  • 📖 Built-in Runbooks: Battle-tested YAML runbooks for common failure modes embedded directly in the logic.
  • 🗄️ Incident History: Database-backed incident logging with MTTR tracking and status lifecycle.
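
The allowlist check mentioned under Safety First could look roughly like the Go sketch below. The verb lists, the `Verdict` type, and `CheckCommand` are assumptions made for illustration; the actual allowlist and validation rules used by Claritty are not documented here. The sketch also assumes the kubectl verb is the second token, so commands with leading global flags would be rejected.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedVerbs and destructiveVerbs are illustrative guesses, not the
// lists Claritty actually ships.
var allowedVerbs = map[string]bool{
	"get": true, "describe": true, "logs": true, "top": true,
	"rollout": true, "scale": true, "set": true, "delete": true,
}

var destructiveVerbs = map[string]bool{"delete": true, "scale": true}

// Verdict is the result of checking one proposed command.
type Verdict struct {
	Allowed     bool
	Destructive bool
	Reason      string
}

// CheckCommand validates a proposed remediation command. Shell
// metacharacters are rejected outright so a command cannot chain
// additional work after the kubectl call.
func CheckCommand(cmd string) Verdict {
	if strings.ContainsAny(cmd, ";|&`$><\n") {
		return Verdict{Reason: "shell metacharacters are not permitted"}
	}
	fields := strings.Fields(cmd)
	if len(fields) < 2 || fields[0] != "kubectl" {
		return Verdict{Reason: "only kubectl commands are permitted"}
	}
	verb := fields[1]
	if !allowedVerbs[verb] {
		return Verdict{Reason: "verb " + verb + " is not on the allowlist"}
	}
	return Verdict{Allowed: true, Destructive: destructiveVerbs[verb]}
}

func main() {
	for _, cmd := range []string{
		"kubectl get svc redis -n prod",
		"kubectl delete pod payment-service-84f9b8c-x2z9 -n prod",
		"kubectl get pods; rm -rf /",
	} {
		v := CheckCommand(cmd)
		fmt.Printf("%-60s allowed=%v destructive=%v %s\n",
			cmd, v.Allowed, v.Destructive, v.Reason)
	}
}
```

Rejecting by default and allowing only known verbs means a new or unexpected command fails closed rather than being executed.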

🏗️ Architecture

Clarctl CLI Architecture

Runs locally on the engineer's machine.

Developer Terminal -> clarctl -> Kubeconfig -> K8s API -> AI Pipeline -> Terminal Output

SRE Agent & Hub Architecture (Hub-Spoke Model)

Cluster A (prod) ──► claritty-agent ─┐
Cluster B (dev)  ──► claritty-agent ─┼──► Hub Server (port 8822) ──► Web Dashboard + Slack Alerts
Cluster C (qa)   ──► claritty-agent ─┘         │
                                    PostgreSQL Database
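
To make the hub-spoke flow concrete, the Go sketch below has three simulated agents push a report to a single hub over HTTP. Everything in it is an assumption for illustration: the `/api/incidents` path, the `Report` fields, the 202 response, and the in-memory map standing in for PostgreSQL. It uses a local test server rather than port 8822 so it can run anywhere.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// Report is a hypothetical, trimmed-down incident payload.
type Report struct {
	Cluster string `json:"cluster"`
	Title   string `json:"title"`
}

// Hub collects reports from many agents, grouped by cluster name.
// An in-memory map stands in for the PostgreSQL database.
type Hub struct {
	mu        sync.Mutex
	byCluster map[string][]Report
}

func NewHub() *Hub { return &Hub{byCluster: map[string][]Report{}} }

// ServeHTTP accepts POSTed JSON reports and stores them.
func (h *Hub) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "POST only", http.StatusMethodNotAllowed)
		return
	}
	var rep Report
	if err := json.NewDecoder(r.Body).Decode(&rep); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	h.mu.Lock()
	h.byCluster[rep.Cluster] = append(h.byCluster[rep.Cluster], rep)
	h.mu.Unlock()
	w.WriteHeader(http.StatusAccepted)
}

// Count returns how many reports the hub holds for one cluster.
func (h *Hub) Count(cluster string) int {
	h.mu.Lock()
	defer h.mu.Unlock()
	return len(h.byCluster[cluster])
}

// Push is what an agent would call to send one report to the hub.
// The /api/incidents path is an assumed endpoint name.
func Push(hubURL string, rep Report) error {
	body, err := json.Marshal(rep)
	if err != nil {
		return err
	}
	resp, err := http.Post(hubURL+"/api/incidents", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusAccepted {
		return fmt.Errorf("hub returned %s", resp.Status)
	}
	return nil
}

// runDemo starts a local hub and pushes one report per cluster.
func runDemo() (*Hub, error) {
	hub := NewHub()
	srv := httptest.NewServer(hub)
	defer srv.Close()
	for _, cluster := range []string{"prod", "dev", "qa"} {
		if err := Push(srv.URL, Report{Cluster: cluster, Title: "demo incident"}); err != nil {
			return nil, err
		}
	}
	return hub, nil
}

func main() {
	hub, err := runDemo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("prod:", hub.Count("prod"), "dev:", hub.Count("dev"), "qa:", hub.Count("qa"))
}
```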

🛠️ Getting Started (For Developers)

If you want to contribute, modify the code, or build from source:

# Clone the repository
git clone https://github.com/Vaishnav88sk/claritty.git
cd claritty

# Building the CLI
cd clarctl-go
go mod tidy
go build -o clarctl .

# Running the Hub from source
cd ../sre-agent/hub
export DATABASE_URL="postgresql://user:pass@host:5432/claritty"
go run .

# Running the Agent from source locally
cd ../agent
export CLARITTY_CLUSTER_NAME="local-dev"
export CLARITTY_HUB_URL="http://localhost:8822"
export GROQ_API_KEY="your_key_here"
go run .
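
For readers curious how an agent might pick up the three environment variables above, here is a small Go sketch. The variable names come from the commands above; the `AgentConfig` type, `LoadAgentConfig`, and the decision to treat all three as required are assumptions, not a description of the agent's real startup code.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// AgentConfig is a hypothetical holder for the agent's settings.
type AgentConfig struct {
	ClusterName string
	HubURL      string
	GroqAPIKey  string
}

// LoadAgentConfig reads settings through getenv (normally os.Getenv)
// so the lookup can be replaced in tests. Treating every variable as
// required is an assumption of this sketch.
func LoadAgentConfig(getenv func(string) string) (AgentConfig, error) {
	cfg := AgentConfig{
		ClusterName: getenv("CLARITTY_CLUSTER_NAME"),
		HubURL:      getenv("CLARITTY_HUB_URL"),
		GroqAPIKey:  getenv("GROQ_API_KEY"),
	}
	required := []struct{ name, value string }{
		{"CLARITTY_CLUSTER_NAME", cfg.ClusterName},
		{"CLARITTY_HUB_URL", cfg.HubURL},
		{"GROQ_API_KEY", cfg.GroqAPIKey},
	}
	var missing []string
	for _, r := range required {
		if r.value == "" {
			missing = append(missing, r.name)
		}
	}
	if len(missing) > 0 {
		return AgentConfig{}, fmt.Errorf("missing environment variables: %s",
			strings.Join(missing, ", "))
	}
	return cfg, nil
}

func main() {
	cfg, err := LoadAgentConfig(os.Getenv)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("agent for cluster %q reporting to %s\n", cfg.ClusterName, cfg.HubURL)
}
```

Failing fast with the full list of missing variables saves a round of trial and error when the agent is first started.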

💻 Sample Examples & Output

CLI Interactive RCA

Running clarctl scan namespace prod when a pod is crash-looping:

[Claritty] Scanning namespace 'prod'...
[!] Detected issue: payment-service-84f9b8c-x2z9 (CrashLoopBackOff)
[AI Pipeline] Triage -> Logs -> Metrics -> Infra -> Commander...

🚨 ROOT CAUSE (SEV 1 - 95% Confidence):
The payment-service pod is failing to start because it cannot connect to the Redis cache at 'redis.prod.svc.cluster.local:6379'. Connection refused.

🔧 PROPOSED REMEDIATION:
Step 1: Check if the Redis service is running.
Command: kubectl get svc redis -n prod
Execute? [y/N/dry]: y
...
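
The `Execute? [y/N/dry]` prompt in the output above can be handled with a small parser like the Go sketch below. The `Choice` type, `ParseChoice`, and `Ask` are hypothetical names, and reading "dry" as a dry run and the capital N as the default answer is an interpretation of the prompt, not confirmed behavior. Input is simulated with a string reader so the example runs without a terminal.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// Choice is the user's decision for one remediation step.
type Choice int

const (
	Skip    Choice = iota // "n", empty input, or anything unrecognized
	Execute               // "y" or "yes"
	DryRun                // "dry"
)

func (c Choice) String() string {
	return [...]string{"skip", "execute", "dry-run"}[c]
}

// ParseChoice maps the user's answer to a Choice. Anything other than
// "y", "yes", or "dry" is treated as "no", matching the capital N.
func ParseChoice(answer string) Choice {
	switch strings.ToLower(strings.TrimSpace(answer)) {
	case "y", "yes":
		return Execute
	case "dry":
		return DryRun
	default:
		return Skip
	}
}

// Ask prints the prompt for one command and reads a single line.
func Ask(in *bufio.Reader, out io.Writer, command string) Choice {
	fmt.Fprintf(out, "Command: %s\nExecute? [y/N/dry]: ", command)
	line, _ := in.ReadString('\n') // on EOF, line holds whatever was read
	return ParseChoice(line)
}

func main() {
	// Simulated user input; a real CLI would wrap os.Stdin instead.
	in := bufio.NewReader(strings.NewReader("dry\n"))
	choice := Ask(in, os.Stdout, "kubectl get svc redis -n prod")
	fmt.Println("\nchoice:", choice)
}
```

Defaulting every unrecognized answer to skip keeps a stray keypress from executing a command.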


Hub Dashboard Incident Card

When the sre-agent runs in the cluster, it pushes structured JSON to the Hub:

{
  "cluster": "prod-us-east",
  "namespace": "billing",
  "severity": "SEV2",
  "title": "OOMKilled Event on Invoice Generator",
  "root_cause": "Container 'worker' exceeded its memory limit of 512Mi. Last usage spike reached 512.4Mi during a large PDF generation task.",
  "remediation_plan": [
    {
      "step_number": 1,
      "description": "Increase memory limits for the invoice deployment.",
      "command": "kubectl set resources deployment invoice-generator -n billing --limits=memory=1Gi",
      "is_destructive": false
    }
  ]
}
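
A consumer of this payload could decode it with Go structs along the following lines. The JSON field names are taken from the sample above; the Go type names and `ParseIncident` are hypothetical and may not match the Hub's own types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RemediationStep mirrors one entry of "remediation_plan".
type RemediationStep struct {
	StepNumber    int    `json:"step_number"`
	Description   string `json:"description"`
	Command       string `json:"command"`
	IsDestructive bool   `json:"is_destructive"`
}

// Incident mirrors the top-level fields of the sample payload.
type Incident struct {
	Cluster         string            `json:"cluster"`
	Namespace       string            `json:"namespace"`
	Severity        string            `json:"severity"`
	Title           string            `json:"title"`
	RootCause       string            `json:"root_cause"`
	RemediationPlan []RemediationStep `json:"remediation_plan"`
}

// ParseIncident decodes one incident report.
func ParseIncident(data []byte) (Incident, error) {
	var inc Incident
	err := json.Unmarshal(data, &inc)
	return inc, err
}

// sample is the payload shown above, unchanged.
const sample = `{
  "cluster": "prod-us-east",
  "namespace": "billing",
  "severity": "SEV2",
  "title": "OOMKilled Event on Invoice Generator",
  "root_cause": "Container 'worker' exceeded its memory limit of 512Mi. Last usage spike reached 512.4Mi during a large PDF generation task.",
  "remediation_plan": [
    {
      "step_number": 1,
      "description": "Increase memory limits for the invoice deployment.",
      "command": "kubectl set resources deployment invoice-generator -n billing --limits=memory=1Gi",
      "is_destructive": false
    }
  ]
}`

func main() {
	inc, err := ParseIncident([]byte(sample))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("[%s] %s/%s: %s\n", inc.Severity, inc.Cluster, inc.Namespace, inc.Title)
	for _, step := range inc.RemediationPlan {
		fmt.Printf("  step %d (destructive=%v): %s\n",
			step.StepNumber, step.IsDestructive, step.Command)
	}
}
```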



📋 Incident Categories Detected

Claritty's pipeline handles the following categories of Kubernetes failure states:

  • Pod Lifecycle Failures: CrashLoopBackOff, ImagePullBackOff, CreateContainerConfigError.
  • Resource Starvation: OOMKilled, CPU Throttling, Node Disk Pressure.
  • Network Issues: Service resolution failures, DNS timeouts, missing endpoints.
  • Storage Issues: Unbound PersistentVolumeClaims, mounting failures.
  • RBAC & Security: Unauthorized API calls, missing service account permissions.
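
One simple way to route a detected failure into the categories above is a lookup from the Kubernetes status reason to a category name, sketched below in Go. The category names are from the list above; the mapping table and `Categorize` are hypothetical and cover only a few well-known reason strings, so this is far less than what an AI pipeline would actually consider.

```go
package main

import "fmt"

// categoryByReason maps a handful of well-known Kubernetes reason and
// condition strings to the categories listed above. The table is an
// illustration and deliberately incomplete.
var categoryByReason = map[string]string{
	"CrashLoopBackOff":           "Pod Lifecycle Failures",
	"ImagePullBackOff":           "Pod Lifecycle Failures",
	"CreateContainerConfigError": "Pod Lifecycle Failures",
	"OOMKilled":                  "Resource Starvation",
	"DiskPressure":               "Resource Starvation",
	"FailedMount":                "Storage Issues",
}

// Categorize returns the category for a reason, or "Uncategorized"
// when the reason is not in the table.
func Categorize(reason string) string {
	if category, ok := categoryByReason[reason]; ok {
		return category
	}
	return "Uncategorized"
}

func main() {
	for _, reason := range []string{"CrashLoopBackOff", "OOMKilled", "FailedMount", "Evicted"} {
		fmt.Printf("%-20s -> %s\n", reason, Categorize(reason))
	}
}
```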

⚖️ Comparison with Industry Tools

| Feature | Claritty | OpenSRE | Datadog / New Relic | Prometheus/Thanos | Robusta |
|---|---|---|---|---|---|
| In-cluster agent | ✅ Deployment 1 replica | ✅ Sidekick framework | ✅ | ✅ | ✅ |
| AI-powered RCA | ✅ 6-stage LLM pipeline | ✅ Episodic Memory LLM | ❌ (Mostly manual) | ❌ | Partial |
| Multi-cluster hub | ✅ Open Source Hub | ❌ Slack/API focused | ✅ SaaS | ✅ Thanos | ✅ SaaS |
| Self-hosted | ✅ | ✅ | ❌ SaaS only | ✅ | Partial |
| Cost | Free / Open Source | Free / Open Source | $$$$ | Free | Free/Paid |

📝 Checkpoints & Future Roadmap

  • CLI for local cluster diagnosis.
  • Multi-agent collaborative LLM pipeline.
  • Agent deployment for continuous in-cluster monitoring.
  • Hub server & Web UI for multi-cluster overview.
  • PostgreSQL persistence & Slack integration.
  • Add complete K8s observability next (Custom metrics, distributed tracing integration, eBPF network flows).

Claritty is actively maintained and built for modern SRE teams. Contributions and feedback are welcome!
