HuggingChat MoM (Mixture-of-Models) Integration Proposal 🤗
Status: Proposal
Date: 2025-10-19
Version: 1.0
Authors: vLLM-SR Team
Executive Summary
This proposal outlines the integration of vLLM Semantic Router into HuggingChat as a new MoM (Mixture-of-Models) routing option. The integration will enable advanced intelligent routing capabilities including semantic caching, PII detection, and chain-of-thought (CoT) transparency, while maintaining full backward compatibility with the existing Omni (Arch router) implementation.
1. Motivation
Current State
- HuggingChat currently supports Omni routing via the Arch router (src/lib/server/router/arch.ts)
- The Arch router provides basic route selection using LLM-based decision-making
- Limited visibility into routing decisions and no semantic caching capabilities
Desired State
- Support MoM (Mixture-of-Models) routing via vLLM Semantic Router
- Enable advanced features: semantic caching, PII detection, intelligent routing
- Provide transparent chain-of-thought (CoT) information for routing decisions
- Maintain coexistence of both Omni and MoM routers for gradual rollout
Business Value
- Performance: Semantic caching reduces latency for repeated queries
- Security: PII detection protects user privacy
- Transparency: CoT information builds user trust
- Flexibility: Users can choose between Omni and MoM routing strategies
- Dashboard Integration: vLLM-SR dashboard provides monitoring and analytics
About vLLM Semantic Router
vLLM Semantic Router is an intelligent routing system built around the Mixture-of-Models (MoM) philosophy. It exposes an OpenAI-compatible endpoint that is addressed with the model name "MoM":
curl -X POST http://localhost:8801/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MoM",
"messages": [
{"role": "user", "content": "What is the derivative of x^2?"}
]
}'
- Intelligent Routing: Routes requests to the optimal model based on semantic understanding of the query, not just keyword matching
- Semantic Caching: Leverages semantic similarity to cache responses, dramatically reducing latency for similar queries (not just exact matches)
- Semantic Chain Architecture: Evolving toward a composable semantic chain (the work-in-progress "SemanticChain") in which all stages are orchestrated as an extensible pipeline, enabling future enhancements and custom stage integration
- Three-Stage Pipeline (Extensible & Composable):
- Stage 1 - Prompt Guard: Security-first approach with jailbreak detection and PII protection
- Stage 2 - Router Memory: Intelligent semantic caching for performance optimization
- Stage 3 - Smart Routing: Multi-level intelligent routing combining three complementary strategies:
- Domain Understanding: Semantic classification of queries into domains (math, coding, general, etc.)
- Similarity-Based Routing: Semantic similarity matching to route similar queries to optimal models
- Keyword-Based Routing: Keyword pattern matching for explicit intent detection
- These three routing strategies work together to provide comprehensive query understanding and optimal model selection
- Future stages can be added to the pipeline without disrupting existing functionality
- Mixture-of-Models Philosophy: Recognizes that no single model is optimal for all tasks. By intelligently routing different types of queries to different specialized models, it achieves:
- Better accuracy through task-specific model selection
- Cost optimization by using smaller models for simple tasks
- Performance improvement through semantic understanding
- Transparency via chain-of-thought visibility
- Production-Ready: Battle-tested with comprehensive error handling, monitoring, and dashboard support
- Open Source: Community-driven development within the vLLM project, with active maintenance and feature additions
2. Goals
Primary Goals
- ✅ Integrate vLLM Semantic Router as a new MoM routing option
- ✅ Extract and store chain-of-thought (CoT) metadata from vLLM-SR responses
- ✅ Support both Omni and MoM routers coexisting in the same system
- ✅ Expose CoT information to frontend for visualization
Secondary Goals
- ✅ Support A/B testing between Omni and MoM routers
- ✅ Integrate with vLLM-SR dashboard for monitoring
3. Non-Goals
- ❌ Replace Omni router entirely (maintain coexistence)
- ❌ Modify vLLM Semantic Router codebase
- ❌ Implement custom semantic caching in HuggingChat (use vLLM-SR's caching)
- ❌ Create new dashboard (integrate with existing vLLM-SR dashboard)
- ❌ Support non-OpenAI-compatible endpoints for MoM
4. Design Principles
1. Backward Compatibility
- Existing Omni router functionality remains unchanged
- No breaking changes to current APIs or configurations
- Both routers can be configured independently
2. Transparency
- CoT information is always extracted and stored when available
- Users can see routing decisions and reasoning
- Clear logging for debugging and monitoring
3. Graceful Degradation
- MoM → Omni → Fallback model (three-tier fallback strategy; see the sketch after this list)
- System continues functioning even if vLLM-SR is unavailable
- Clear error messages for integration issues
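A minimal sketch of the three-tier fallback, assuming hypothetical selector signatures (archSelectRoute is a placeholder; the actual Arch entry point may differ):
import type { RouteSelection } from "./types";
import { vllmSelectRoute } from "./vllm-semantic-router"; // MoM (proposed)
import { archSelectRoute } from "./arch"; // Omni (name assumed)

async function selectRouteWithFallback(
  query: string,
  fallbackRoute: string
): Promise<RouteSelection> {
  try {
    // Tier 1: MoM via vLLM Semantic Router
    return await vllmSelectRoute(query);
  } catch (momError) {
    console.warn("MoM router unavailable, falling back to Omni", momError);
    try {
      // Tier 2: Omni via the Arch router
      return await archSelectRoute(query);
    } catch (omniError) {
      console.warn("Omni router unavailable, using fallback model", omniError);
      // Tier 3: static fallback route, resolved to a model by policy.ts
      return { routeName: fallbackRoute };
    }
  }
}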
4. Separation of Concerns
- Router implementations are isolated in separate modules
- Router selection logic is centralized in endpoint.ts
- Configuration is environment-based
5. Performance First
- Minimal overhead for router selection
- Efficient CoT parsing and storage
- Support for streaming responses with CoT metadata
5. Architecture
5.1 System Overview
┌─────────────────────────────────────────────────────────────┐
│ HuggingChat Frontend │
│ (Model Selection UI) │
└────────────────────────┬────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ Router Endpoint (endpoint.ts) │
│ - Detects router type (MoM vs Omni) │
│ - Dispatches to appropriate router │
│ - Handles fallback logic │
└────────────────────────┬────────────────────────────────────┘
│
┌────────────────┼────────────────┐
↓ ↓ ↓
┌─────────┐ ┌──────────────┐ ┌──────────┐
│ Omni │ │ MoM │ │ Fallback │
│ Router │ │ Router │ │ Model │
│(arch.ts)│ │(vllm-sr.ts) │ │ │
└────┬────┘ └──────┬───────┘ └──────────┘
│ │
↓ ↓
┌─────────┐ ┌──────────────────┐
│ Arch │ │ vLLM Semantic │
│ Model │ │ Router │
│ Endpoint│ │ Endpoint │
└─────────┘ └──────────────────┘
│ │
└────────────────┼────────────────┐
↓ ↓
┌─────────────────────────┐
│ Route Resolution & │
│ Model Selection │
│ (policy.ts) │
└─────────────────────────┘
│
↓
┌─────────────────────────┐
│ Candidate Models │
│ (with fallbacks) │
└─────────────────────────┘
5.2 Component Details
A. Router Selection Logic (endpoint.ts)
- Detects model alias: "MoM" → vLLM-SR, "omni" → Arch
- Calls appropriate router function
- Passes CoT metadata to the response stream (a dispatch sketch follows these component descriptions)
B. MoM Router (vllm-semantic-router.ts)
- Sends requests to vLLM-SR endpoint
- Extracts CoT from response headers
- Parses three-stage routing information
- Returns RouteSelection with CoT metadata
C. Type Extensions (types.ts)
- Extends RouteSelection with CoT metadata
- Defines CoT structure for the three stages
- Maintains backward compatibility
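A minimal sketch of the dispatch logic in endpoint.ts; the aliases come from this proposal, while the function signatures are assumptions:
import type { RouteSelection } from "./types";
import { vllmSelectRoute } from "./vllm-semantic-router"; // MoM (new)
import { archSelectRoute } from "./arch"; // Omni (existing; name assumed)

async function selectRoute(modelAlias: string, query: string): Promise<RouteSelection> {
  switch (modelAlias) {
    case "MoM":
      // vLLM Semantic Router: returns the route plus CoT metadata
      return vllmSelectRoute(query);
    case "omni":
    default:
      // Existing Arch-based routing, unchanged
      return archSelectRoute(query);
  }
}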
6. Request Flow
6.1 MoM Routing Flow (Happy Path)
1. User selects "MoM" model
↓
2. Router Endpoint receives request
↓
3. Detect model alias = "MoM"
↓
4. Call vllmSelectRoute()
↓
5. Send request to vLLM-SR endpoint
↓
6. Extract CoT from response headers
↓
7. Parse three-stage routing information
↓
8. Return RouteSelection with CoT metadata
↓
9. Resolve route to candidate models
↓
10. Try each candidate with fallback
↓
11. Stream response with CoT metadata
↓
12. Frontend displays routing decision + CoT
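For illustration, the metadata a MoM-routed message might carry after step 12, using the routerMetadata shape defined in the next section (values are hypothetical and mirror the CoT header example in section 9.1):
import type { Message } from "$lib/types/Message"; // path assumed

const exampleRouterMetadata: NonNullable<Message["routerMetadata"]> = {
  route: "math",
  model: "deepseek-v3",
  routerType: "mom",
  cot: {
    stage1: { jailbreak: false, pii: false, result: "Continue" },
    stage2: { cacheStatus: "MISS", action: "Update Memory", result: "Continue" },
    stage3: {
      domain: "math",
      reasoning: true,
      model: "deepseek-v3",
      optimized: true,
      result: "Continue",
    },
  },
};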
7. Data Structures
7.1 Extended RouteSelection Type
interface RouteSelection {
routeName: string;
error?: {
message: string;
statusCode?: number;
};
cotMetadata?: {
// Stage 1: Prompt Guard
stage1?: {
jailbreak: boolean;
jailbreakConfidence?: number;
pii: boolean;
result: 'Continue' | 'BLOCKED';
};
// Stage 2: Router Memory
stage2?: {
cacheStatus: 'HIT' | 'MISS';
action: 'Retrieve Memory' | 'Update Memory';
result: 'Fast Response' | 'Continue';
};
// Stage 3: Smart Routing
stage3?: {
domain: string;
reasoning: boolean;
model: string;
optimized: boolean;
result: 'Continue';
};
// Raw CoT string for debugging
rawCot?: string;
};
}
7.2 Extended Message Type
type Message = Partial<Timestamps> & {
// ... existing fields ...
routerMetadata?: {
route: string;
model: string;
provider?: string;
// New: CoT information
cot?: RouteSelection['cotMetadata'];
routerType?: 'omni' | 'mom'; // Which router was used
};
};
8. Implementation Phases
Phase 1: Core MoM Router (Week 1-2)
- Create src/lib/server/router/vllm-semantic-router.ts
- Implement vllmSelectRoute() function
- Add MoM configuration environment variables (sketched below)
- Implement basic error handling and fallback
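A sketch of the MoM configuration in config.ts; the environment variable names below are illustrative, not finalized (the default endpoint matches the curl example above):
const momConfig = {
  // Base URL of the vLLM Semantic Router (OpenAI-compatible)
  endpoint: process.env.VLLM_SR_ENDPOINT ?? "http://localhost:8801/v1",
  // Model alias that triggers MoM routing
  modelAlias: process.env.VLLM_SR_MODEL_ALIAS ?? "MoM",
  // Per-request timeout before falling back to the Omni router (ms)
  timeoutMs: Number(process.env.VLLM_SR_TIMEOUT_MS ?? 10_000),
  // Feature flag for gradual rollout and A/B testing
  enabled: process.env.VLLM_SR_ENABLED === "true",
};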
Phase 2: Router Integration (Week 2-3)
- Update endpoint.ts to support router selection
- Add "MoM" router alias in models.ts
- Implement router dispatch logic
- Update types.ts with CoT metadata
Phase 3: CoT Extraction & Storage (Week 3-4)
- Parse CoT from vLLM-SR response headers
- Extend RouteSelection with CoT metadata
- Update Message type with CoT information
- Implement CoT serialization
Phase 4: Frontend Integration (Week 4-5)
- Update router endpoint to pass CoT to frontend
- Implement CoT visualization in UI
- Add CoT display in chat interface
- Support toggling CoT visibility
Phase 5: Monitoring & Optimization (Week 5-6)
- Implement metrics collection
- Add performance benchmarking
- Create A/B testing framework
- Dashboard integration
9. CoT Information Extraction
9.1 Response Header Format
x-vllm-semantic-router-cot: 🔀 vLLM Semantic Router - Chain-Of-Thought 🔀
→ 🛡️ ***Stage 1 - Prompt Guard***: ✅ *No Jailbreak* → ✅ *No PII* → 💯 ***Continue***
→ 🔥 ***Stage 2 - Router Memory***: 🌊 *MISS* → 🧠 *Update Memory* → 💯 ***Continue***
→ 🧠 ***Stage 3 - Smart Routing***: 📂 *math* → 🧠 *Reasoning On* → 🥷 *deepseek-v3* → 🎯 *Prompt Optimized* → 💯 ***Continue***
9.2 Parsing Strategy
- Extract raw CoT string from header
- Parse three stages using regex patterns
- Extract key information: jailbreak, PII, cache status, domain, model, reasoning
- Store structured data in cotMetadata
- Keep the raw string for debugging (see the sketch below)
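A sketch of this parsing strategy in TypeScript; the regex patterns assume the header format shown in 9.1, which still needs to be confirmed with the vLLM team (see the open questions below):
import type { RouteSelection } from "./types";

type CotMetadata = NonNullable<RouteSelection["cotMetadata"]>;

function parseCot(rawCot: string): CotMetadata {
  const cot: CotMetadata = { rawCot };

  const stage1 = rawCot.match(/Stage 1 - Prompt Guard[^\n]*/)?.[0];
  if (stage1) {
    cot.stage1 = {
      jailbreak: !stage1.includes("No Jailbreak"),
      pii: !stage1.includes("No PII"),
      result: stage1.includes("BLOCKED") ? "BLOCKED" : "Continue",
    };
  }

  const stage2 = rawCot.match(/Stage 2 - Router Memory[^\n]*/)?.[0];
  if (stage2) {
    cot.stage2 = {
      cacheStatus: stage2.includes("HIT") ? "HIT" : "MISS",
      action: stage2.includes("Retrieve Memory") ? "Retrieve Memory" : "Update Memory",
      result: stage2.includes("Fast Response") ? "Fast Response" : "Continue",
    };
  }

  const stage3 = rawCot.match(/Stage 3 - Smart Routing[^\n]*/)?.[0];
  if (stage3) {
    cot.stage3 = {
      // 📂 *<domain>* and 🥷 *<model>* markers from the header format in 9.1
      domain: stage3.match(/📂 \*([^*]+)\*/)?.[1] ?? "unknown",
      reasoning: stage3.includes("Reasoning On"),
      model: stage3.match(/🥷 \*([^*]+)\*/)?.[1] ?? "unknown",
      optimized: stage3.includes("Prompt Optimized"),
      result: "Continue",
    };
  }

  return cot;
}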
10. Success Criteria
- ✅ MoM router successfully routes requests to vLLM-SR
- ✅ CoT information extracted and displayed in UI
- ✅ Zero breaking changes to existing APIs
- ✅ Both routers coexist without conflicts
11. Open Questions & Discussion Points
- CoT Header Format: Confirm exact header name and format with vLLM team
- Cache Visibility: Should cache hit rate be exposed to users?
- Default Router: Should MoM become default after stabilization?
- Dashboard Integration: Timeline for vLLM-SR dashboard integration?
12. References
- vLLM Semantic Router: https://github.com/vllm-project/semantic-router
- CoT Format: https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/tools/openwebui-pipe/vllm-sr-cot.md
- Dashboard Integration Issue: HuggingChat Omni Support 🤗 vllm-project/semantic-router#473
- Current Omni Router: src/lib/server/router/arch.ts
- Router Endpoint: src/lib/server/router/endpoint.ts
Appendix A: File Structure
src/lib/server/router/
├── arch.ts (Existing: Omni router)
├── vllm-semantic-router.ts (New: MoM router)
├── endpoint.ts (Updated: Router selection logic)
├── policy.ts (Existing: Route resolution)
├── types.ts (Updated: CoT metadata types)
└── index.ts (Updated: Export new router)
src/lib/server/
├── models.ts (Updated: MoM router alias)
└── config.ts (Updated: MoM configuration)
src/lib/types/
└── Message.ts (Updated: CoT in routerMetadata)
Document Version: 1.0
Last Updated: 2025-10-19
Status: Ready for Review