Comprehensive Observability Multi-Agent Platform for Adaptive System Solutions
AI-powered incident investigation platform that reduces MTTR by 67-90% using parallel OODA loops, ICS principles, and scientific methodology.
The Problem: Traditional incident investigation tools require senior engineers to manually connect dots between metrics, logs, and traces. Average MTTR: 2-4 hours. Knowledge concentrated in a few experts.
The Solution: COMPASS uses AI agents with scientific methodology to systematically test hypotheses in parallel, filtering out noise and presenting only high-confidence root causes to humans.
Key Differentiators:
- 🧪 Scientific rigor: systematic hypothesis disproof (not just pattern matching)
- ⚡ Parallel OODA loops: 5+ agents testing simultaneously (10x faster than sequential)
- 🤖 Bring your own LLM: OpenAI, Anthropic, or any provider (cost-controlled)
- 👥 Learning Teams approach: focus on contributing causes, not blame
- 📝 Automatic post-mortems: Markdown documentation for every investigation
- 💰 Cost-aware: $10/investigation routine, $20 critical (transparent budgets)
Current Status: Production-grade foundation ready for Database Agent implementation
🎉 Phase 5 Complete - Multi-Agent Orchestrator (Production-Ready)
Current Capabilities:
- ✅ Multi-Agent Orchestration - Sequential dispatch of Application, Database, Network agents
- ✅ Production-grade agents - ApplicationAgent and NetworkAgent with 95%+ test coverage
- ✅ CLI Interface - `investigate-orchestrator` command with budget management
- ✅ Cost Control - Per-agent budget tracking, transparent cost breakdown
- ✅ Hypothesis Ranking - Confidence-based ranking across all agents
- ✅ Graceful Degradation - Continues investigation even if agents fail
- ✅ OpenTelemetry Tracing - Distributed tracing from day 1
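Confidence-based ranking across agents amounts to merging each agent's candidate hypotheses and sorting by confidence. A minimal sketch — the `Hypothesis` dataclass and `rank_hypotheses` helper below are illustrative, not the project's actual types:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    agent: str         # which agent proposed it, e.g. "network"
    summary: str
    confidence: float  # 0.0 - 1.0

def rank_hypotheses(hypotheses, top_n=5):
    """Merge hypotheses from all agents and keep the most confident."""
    return sorted(hypotheses, key=lambda h: h.confidence, reverse=True)[:top_n]

candidates = [
    Hypothesis("application", "High error rate in payment service", 0.85),
    Hypothesis("network", "DNS resolution timeout detected", 0.92),
    Hypothesis("database", "Connection pool nearing exhaustion", 0.78),
]
ranked = rank_hypotheses(candidates)
for i, h in enumerate(ranked, 1):
    print(f"{i}. [{h.agent}] {h.summary} ({h.confidence:.0%})")
```

Ranking happens once, after all agents have reported, so a low-confidence agent cannot crowd out a high-confidence finding from another domain.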
Recent Achievements (Phase 5):
- Orchestrator: Sequential multi-agent coordination (15/15 tests passing)
- Competitive Review: Agent Beta promoted for architectural simplification
- Complexity Reduction: Removed ThreadPoolExecutor (saved 4 hours, zero threading bugs)
- CLI Integration: Full investigation workflow from command line
- Documentation: Comprehensive decision rationale and design docs
Previous Achievements:
- Day 4: Agent LLM/MCP integration, ADR documentation (Handoff)
- Day 3: OpenAI/Anthropic integration, fixed 8 critical bugs (Report)
- Day 2: Scientific framework with quality-weighted confidence scoring (Report)
Next: Post-implementation competitive review, then Phase 6 optimization
Last Updated: 2025-11-21
Investigate an incident using the orchestrated multi-agent system:

```bash
# Simple investigation
python -m compass.cli.main investigate-orchestrator INC-12345

# With budget and affected services
python -m compass.cli.main investigate-orchestrator INC-12345 \
  --budget 15.00 \
  --affected-services payment,checkout \
  --severity critical
```

What you get:
- Sequential dispatch of Application, Database, and Network agents
- Observations consolidated from all agents
- Top 5 hypotheses ranked by confidence
- Per-agent cost breakdown with budget utilization
Example Output:
```
🔍 Initializing investigation for INC-12345
💰 Budget: $15.00
🎯 Affected Services: payment, checkout
⚠️  Severity: critical
🔍 Observing incident (sequential agent dispatch)...
✅ Collected 12 observations
🧠 Generating hypotheses...
✅ Generated 5 hypotheses
📊 Top Hypotheses (ranked by confidence):

  1. [network] DNS resolution timeout detected
     Confidence: 92.00%

  2. [application] High error rate in payment service
     Confidence: 85.00%

  3. [database] Connection pool nearing exhaustion
     Confidence: 78.00%

💰 Cost Breakdown:
  Application: $2.1500
  Database:    $1.8500
  Network:     $0.9500
  ─────────────────────────
  Total: $4.9500 / $15.00
  Utilization: 33.0%
```
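The cost breakdown is simple arithmetic over per-agent spend against the investigation budget. A minimal sketch — the `budget_report` function and dict shape are illustrative, not the actual COMPASS API:

```python
def budget_report(per_agent_costs, budget):
    """Summarize per-agent spend against the investigation budget."""
    total = sum(per_agent_costs.values())
    return {
        "total": round(total, 4),
        "budget": budget,
        "utilization_pct": round(100 * total / budget, 1),
        "remaining": round(budget - total, 4),
    }

# Figures from the example output above
report = budget_report(
    {"application": 2.15, "database": 1.85, "network": 0.95},
    budget=15.00,
)
print(report)
```

Keeping per-agent costs separate (rather than a single running total) is what makes the transparent per-agent breakdown possible.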
Complete demo environment with real observability stack:

```bash
# 1. Start demo environment
./scripts/run-demo.sh

# 2. Trigger an incident (missing index, lock contention, or pool exhaustion)
./scripts/trigger-incident.sh missing_index

# 3. Investigate with COMPASS (classic mode)
poetry run compass investigate \
  --service payment-service \
  --symptom "slow database queries and high latency" \
  --severity high
```

Full demo guide: DEMO.md (~10 minutes first run)
- Start here: docs/product/COMPASS_Product_Reference_Document_v1_1.md
- Understand the architecture: docs/architecture/COMPASS_MVP_Architecture_Reference.md
- Build guide: docs/guides/COMPASS_MVP_Build_Guide.md
- Development workflow: docs/guides/compass-tdd-workflow.md
```
compass/
├── docs/                 # All documentation
│   ├── architecture/     # System architecture documents
│   ├── product/          # Product strategy and requirements
│   ├── guides/           # Build guides and workflows
│   ├── reference/        # Quick references and indexes
│   └── research/         # Research papers (PDFs)
│
├── src/                  # Source code (in development)
│   ├── compass/          # Main Python package
│   │   ├── core/         # OODA loop, scientific framework
│   │   ├── agents/       # Agent implementations
│   │   ├── cli/          # CLI interface
│   │   ├── api/          # API server
│   │   └── integrations/ # MCP integrations
│   └── tests/            # Test suite
│
├── planning/             # Planning conversations
│   ├── conversations/    # Original HTML chats
│   └── transcripts/      # Extracted text transcripts
│
├── examples/             # Example configurations and templates
│   ├── configurations/   # Sample YAML configs
│   └── templates/        # Agent templates
│
├── deployment/           # Deployment configurations
│   ├── k8s/              # Kubernetes manifests
│   └── docker/           # Docker files
│
└── scripts/              # Utility scripts
```
COMPASS uses AI agents organized according to Incident Command System (ICS) principles to investigate incidents using parallel OODA loops and scientific methodology.
Key Differentiators:
- Parallel OODA Loops: 5+ agents test hypotheses simultaneously
- Scientific Rigor: Systematic hypothesis disproof before human escalation
- Learning Culture: Learning Teams methodology vs traditional RCA
- Human-in-the-Loop: Level 1 autonomy - AI proposes, humans decide
Technology Stack:
- Language: Python only (readability over complexity)
- Database: PostgreSQL + pgvector
- Observability: LGTM stack (Loki, Grafana, Tempo, Mimir)
- Deployment: Kubernetes (Tilt for local dev)
- LLM: Provider agnostic (OpenAI, Anthropic, Copilot, Ollama)
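Provider agnosticism usually comes down to a narrow interface that each backend adapts to. A minimal sketch using `typing.Protocol` — the names here are hypothetical, and real adapters would wrap the OpenAI/Anthropic/Ollama SDKs:

```python
from typing import Protocol

class LLMProvider(Protocol):
    """Minimal provider interface: any backend that can complete a prompt."""
    def complete(self, prompt: str) -> str: ...

class FakeProvider:
    """Stand-in for a real provider client, useful in tests."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

def investigate(provider: LLMProvider, symptom: str) -> str:
    """Agent logic depends only on the protocol, never on a vendor SDK."""
    return provider.complete(f"Propose a hypothesis for: {symptom}")

print(investigate(FakeProvider(), "slow database queries"))
```

Because `Protocol` uses structural typing, a new provider only has to implement `complete` with a matching signature; no shared base class or registration step is required.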
Agent Hierarchy (ICS-based):

```
Orchestrator
├── Database Manager → Workers
├── Network Manager → Workers
├── Application Manager → Workers
└── Infrastructure Manager → Workers
```
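Sequential dispatch with graceful degradation (a failing agent is recorded and skipped, not fatal) can be sketched like this. The callable-based `agents` mapping and all names are illustrative assumptions, not the orchestrator's real interface:

```python
def dispatch_sequential(agents, incident):
    """Run each agent in turn; a failing agent is recorded and skipped."""
    observations, failures = [], []
    for name, run_agent in agents.items():
        try:
            observations.extend(run_agent(incident))
        except Exception as exc:
            failures.append((name, str(exc)))  # degrade gracefully, keep going
    return observations, failures

def broken_database_agent(incident):
    raise RuntimeError("connection timeout")

agents = {
    "application": lambda inc: [f"{inc}: error rate spike in payment service"],
    "database": broken_database_agent,
    "network": lambda inc: [f"{inc}: DNS latency above baseline"],
}
observations, failures = dispatch_sequential(agents, "INC-12345")
print(observations)
print(failures)
```

The investigation still returns two agents' observations even though the database agent failed, which is the behavior the "Graceful Degradation" capability describes.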
OODA Loop Phases:
- Observe: Parallel data gathering
- Orient: Hypothesis generation and ranking
- Decide: Human decision points
- Act: Evidence gathering and hypothesis testing
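The four phases above can be wired together as a single loop iteration. A toy sketch, assuming each phase is a pluggable callable (this is not the actual COMPASS implementation):

```python
def ooda_cycle(incident, observe, orient, decide, act):
    """One OODA iteration over an incident."""
    observations = observe(incident)   # Observe: gather telemetry
    hypotheses = orient(observations)  # Orient: generate and rank hypotheses
    chosen = decide(hypotheses)        # Decide: human decision point
    return act(chosen)                 # Act: gather evidence, test hypothesis

# Toy phase implementations, purely for illustration
result = ooda_cycle(
    "INC-12345",
    observe=lambda inc: [f"{inc}: p99 latency up 4x"],
    orient=lambda obs: ["slow query missing index", "DNS timeout"],
    decide=lambda hyps: hyps[0],  # stand-in for human approval of the top pick
    act=lambda hyp: {"hypothesis": hyp, "evidence": "EXPLAIN shows sequential scan"},
)
print(result)
```

Keeping `decide` as an injected callable is one way to express Level 1 autonomy: the loop structure never acts without the decision step in the middle.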
- Product Overview:
  - COMPASS_Product_Reference_Document_v1_1.md - Complete product specification
- Architecture:
  - COMPASS_MVP_Architecture_Reference.md - MVP architecture
  - COMPASS_MVP_Technical_Design.md - Technical design details
- Build Guides:
  - COMPASS_MVP_Build_Guide.md - Step-by-step build instructions
  - compass-tdd-workflow.md - TDD development process
- Reference:
  - compass-quick-reference.md - Quick reference guide
  - COMPASS_CONVERSATIONS_INDEX.md - Searchable index of all planning conversations
  - INDEXING_SYSTEM_SUMMARY.md - How to use the conversation index
Scientific Framework:
- COMPASS_SCIENTIFIC_FRAMEWORK_DOCS.md
- compass_scientific_framework.py - Core implementation

Enterprise Features:
- COMPASS_Enterprise_Knowledge_Architecture.md
- compass_enterprise_knowledge_guide.md - Enterprise user guide
Human-AI Interface:
Research Papers (in docs/research/):
- ICS-Based Multi-Agent AI Systems for Incident Investigation
- Evaluation of Learning Teams vs Root Cause Analysis
- Problems with Root Cause Analysis
- Product vision and requirements
- Complete architecture design
- Scientific framework specification
- Multi-agent coordination design
- Enterprise knowledge integration design
- CLI interface design
- Prototype code (scientific framework, database agent)
- Comprehensive documentation
- Test framework design
- MVP implementation (not started)
Phase 1: Foundation (Weeks 1-2)
- Basic LGTM integration
- Single agent (database)
- CLI interface
- Cost tracking
Phase 2: Trust (Weeks 3-4)
- Hypothesis confidence scoring
- Evidence linking
- Graceful failure handling
Phase 3: Value (Weeks 5-6)
- Pattern learning
- Personal runbooks
- Metrics tracking
All planning conversations are indexed and searchable:
```bash
# Search the conversation index
grep -i "topic_name" docs/reference/COMPASS_CONVERSATIONS_INDEX.md

# Example: Find information about cost management
grep -i "cost" docs/reference/COMPASS_CONVERSATIONS_INDEX.md
```

See docs/reference/INDEXING_SYSTEM_SUMMARY.md for detailed usage.
- Getting Started: docs/guides/
- Architecture Details: docs/architecture/
- Product Strategy: docs/product/
- Research Background: docs/research/
- Planning History: planning/
From docs/guides/claude.md:
- Production-First: Every component production-ready from inception
- Test-Driven Development: TDD rigorously from day 1
- OODA Loop Focus: Optimize for iteration speed over perfect analysis
- Scientific Method: Systematically disprove hypotheses before presenting
- Human Authority: Humans decide, AI advises and accelerates
- Cost Management: Token budget caps, transparent pricing
- Learning Culture: Focus on contributing causes, not blame
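The "systematically disprove hypotheses" principle can be sketched as a filter: a hypothesis reaches a human only if no disproof test eliminates it. All names below are illustrative, not the framework's real API:

```python
def surviving_hypotheses(hypotheses, disproof_tests):
    """Scientific-method filter: keep only hypotheses no test disproves."""
    return [
        hyp for hyp in hypotheses
        if not any(is_disproved(hyp) for is_disproved in disproof_tests)
    ]

hypotheses = ["missing index", "network partition", "connection pool exhaustion"]
disproof_tests = [
    # e.g. cross-zone health checks came back clean, ruling out a partition
    lambda hyp: hyp == "network partition",
]
print(surviving_hypotheses(hypotheses, disproof_tests))
```

Framing checks as attempts to *disprove* (rather than confirm) is what keeps pattern-matching noise from reaching the human decision point.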
See development guides:
- compass-tdd-workflow.md - Test-driven development
- compass-claude-code-instructions.md - Claude Code workflow
- compass-day1-startup.md - Day 1 setup guide
[To be determined]
[To be added]
Ready to build! See docs/guides/COMPASS_MVP_Build_Guide.md to get started.