HuggingChat MoM (Mixture-of-Models) Integration Proposal 🤗
Status: Proposal
Date: 2025-10-19
Version: 1.0
Authors: vLLM-SR Team
Executive Summary
This proposal outlines the integration of vLLM Semantic Router into HuggingChat as a new MoM (Mixture-of-Models) routing option. The integration will enable advanced intelligent routing capabilities including semantic caching, PII detection, and chain-of-thought (CoT) transparency, while maintaining full backward compatibility with the existing Omni (Arch router) implementation.
1. Motivation
Current State
- HuggingChat currently supports Omni routing via the Arch router (src/lib/server/router/arch.ts)
- The Arch router provides basic route selection using LLM-based decision-making
- Limited visibility into routing decisions and no semantic caching capabilities
Desired State
- Support MoM (Mixture-of-Models) routing via vLLM Semantic Router
- Enable advanced features: semantic caching, PII detection, intelligent routing
- Provide transparent chain-of-thought (CoT) information for routing decisions
- Maintain coexistence of both Omni and MoM routers for gradual rollout
Business Value
- Performance: Semantic caching reduces latency for repeated queries
- Security: PII detection protects user privacy
- Transparency: CoT information builds user trust
- Flexibility: Users can choose between Omni and MoM routing strategies
- Dashboard Integration: vLLM-SR dashboard provides monitoring and analytics
About vLLM Semantic Router
vLLM Semantic Router is an intelligent routing system built around the Mixture-of-Models (MoM) philosophy. It exposes an OpenAI-compatible endpoint that is addressed with the model name "MoM":
curl -X POST http://localhost:8801/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MoM",
"messages": [
{"role": "user", "content": "What is the derivative of x^2?"}
]
}'
- Intelligent Routing: Routes requests to the optimal model based on semantic understanding of the query, not just keyword matching
- Semantic Caching: Leverages semantic similarity to cache responses, dramatically reducing latency for similar queries (not just exact matches)
- Semantic Chain Architecture: Evolving toward a composable semantic chain (the work-in-progress "SemanticChain") in which all stages are orchestrated as an extensible pipeline, enabling future enhancements and custom stage integration
- Three-Stage Pipeline (Extensible & Composable):
- Stage 1 - Prompt Guard: Security-first approach with jailbreak detection and PII protection
- Stage 2 - Router Memory: Intelligent semantic caching for performance optimization
- Stage 3 - Smart Routing: Multi-level intelligent routing combining three complementary strategies:
- Domain Understanding: Semantic classification of queries into domains (math, coding, general, etc.)
- Similarity-Based Routing: Semantic similarity matching to route similar queries to optimal models
- Keyword-Based Routing: Keyword pattern matching for explicit intent detection
- These three routing strategies work together to provide comprehensive query understanding and optimal model selection
- Future stages can be added to the pipeline without disrupting existing functionality
- Mixture-of-Models Philosophy: Recognizes that no single model is optimal for all tasks. By intelligently routing different types of queries to different specialized models, it achieves:
- Better accuracy through task-specific model selection
- Cost optimization by using smaller models for simple tasks
- Performance improvement through semantic understanding
- Transparency via chain-of-thought visibility
- Production-Ready: Battle-tested with comprehensive error handling, monitoring, and dashboard support
- Open Source: Community-driven development within the vLLM project, with active maintenance and feature additions
2. Goals
Primary Goals
- ✅ Integrate vLLM Semantic Router as a new MoM routing option
- ✅ Extract and store chain-of-thought (CoT) metadata from vLLM-SR responses
- ✅ Support both Omni and MoM routers coexisting in the same system
- ✅ Expose CoT information to frontend for visualization
Secondary Goals
- ✅ Support A/B testing between Omni and MoM routers
- ✅ Integrate with vLLM-SR dashboard for monitoring
3. Non-Goals
- ❌ Replace Omni router entirely (maintain coexistence)
- ❌ Modify vLLM Semantic Router codebase
- ❌ Implement custom semantic caching in HuggingChat (use vLLM-SR's caching)
- ❌ Create new dashboard (integrate with existing vLLM-SR dashboard)
- ❌ Support non-OpenAI-compatible endpoints for MoM
4. Design Principles
1. Backward Compatibility
- Existing Omni router functionality remains unchanged
- No breaking changes to current APIs or configurations
- Both routers can be configured independently
2. Transparency
- CoT information is always extracted and stored when available
- Users can see routing decisions and reasoning
- Clear logging for debugging and monitoring
3. Graceful Degradation
- MoM → Omni → Fallback model (three-tier fallback strategy; see the sketch after this list)
- System continues functioning even if vLLM-SR is unavailable
- Clear error messages for integration issues
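A minimal sketch of the three-tier fallback, assuming hypothetical selector signatures (archSelectRoute is a placeholder; the actual Arch entry point may differ):
import type { RouteSelection } from "./types";
import { vllmSelectRoute } from "./vllm-semantic-router"; // MoM (proposed)
import { archSelectRoute } from "./arch"; // Omni (name assumed)

async function selectRouteWithFallback(
  query: string,
  fallbackRoute: string
): Promise<RouteSelection> {
  try {
    // Tier 1: MoM via vLLM Semantic Router
    return await vllmSelectRoute(query);
  } catch (momError) {
    console.warn("MoM router unavailable, falling back to Omni", momError);
    try {
      // Tier 2: Omni via the Arch router
      return await archSelectRoute(query);
    } catch (omniError) {
      console.warn("Omni router unavailable, using fallback model", omniError);
      // Tier 3: static fallback route, resolved to a model by policy.ts
      return { routeName: fallbackRoute };
    }
  }
}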
4. Separation of Concerns
- Router implementations are isolated in separate modules
- Router selection logic is centralized in endpoint.ts
- Configuration is environment-based
5. Performance First
- Minimal overhead for router selection
- Efficient CoT parsing and storage
- Support for streaming responses with CoT metadata
5. Architecture
5.1 System Overview
┌─────────────────────────────────────────────────────────────┐
│ HuggingChat Frontend │
│ (Model Selection UI) │
└────────────────────────┬────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ Router Endpoint (endpoint.ts) │
│ - Detects router type (MoM vs Omni) │
│ - Dispatches to appropriate router │
│ - Handles fallback logic │
└────────────────────────┬────────────────────────────────────┘
│
┌────────────────┼────────────────┐
↓ ↓ ↓
┌─────────┐ ┌──────────────┐ ┌──────────┐
│ Omni │ │ MoM │ │ Fallback │
│ Router │ │ Router │ │ Model │
│(arch.ts)│ │(vllm-sr.ts) │ │ │
└────┬────┘ └──────┬───────┘ └──────────┘
│ │
↓ ↓
┌─────────┐ ┌──────────────────┐
│ Arch │ │ vLLM Semantic │
│ Model │ │ Router │
│ Endpoint│ │ Endpoint │
└─────────┘ └──────────────────┘
│ │
└────────────────┼────────────────┐
↓ ↓
┌─────────────────────────┐
│ Route Resolution & │
│ Model Selection │
│ (policy.ts) │
└─────────────────────────┘
│
↓
┌─────────────────────────┐
│ Candidate Models │
│ (with fallbacks) │
└─────────────────────────┘
5.2 Component Details
A. Router Selection Logic (endpoint.ts)
- Detects model alias: "MoM" → vLLM-SR, "omni" → Arch
- Calls appropriate router function
- Passes CoT metadata to the response stream (a dispatch sketch follows these component descriptions)
B. MoM Router (vllm-semantic-router.ts)
- Sends requests to vLLM-SR endpoint
- Extracts CoT from response headers
- Parses three-stage routing information
- Returns RouteSelection with CoT metadata
C. Type Extensions (types.ts)
- Extends RouteSelection with CoT metadata
- Defines CoT structure for the three stages
- Maintains backward compatibility
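A minimal sketch of the dispatch logic in endpoint.ts; the aliases come from this proposal, while the function signatures are assumptions:
import type { RouteSelection } from "./types";
import { vllmSelectRoute } from "./vllm-semantic-router"; // MoM (new)
import { archSelectRoute } from "./arch"; // Omni (existing; name assumed)

async function selectRoute(modelAlias: string, query: string): Promise<RouteSelection> {
  switch (modelAlias) {
    case "MoM":
      // vLLM Semantic Router: returns the route plus CoT metadata
      return vllmSelectRoute(query);
    case "omni":
    default:
      // Existing Arch-based routing, unchanged
      return archSelectRoute(query);
  }
}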
6. Request Flow
6.1 MoM Routing Flow (Happy Path)
1. User selects "MoM" model
↓
2. Router Endpoint receives request
↓
3. Detect model alias = "MoM"
↓
4. Call vllmSelectRoute()
↓
5. Send request to vLLM-SR endpoint
↓
6. Extract CoT from response headers
↓
7. Parse three-stage routing information
↓
8. Return RouteSelection with CoT metadata
↓
9. Resolve route to candidate models
↓
10. Try each candidate with fallback
↓
11. Stream response with CoT metadata
↓
12. Frontend displays routing decision + CoT
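For illustration, the metadata a MoM-routed message might carry after step 12, using the routerMetadata shape defined in the next section (values are hypothetical and mirror the CoT header example in section 9.1):
import type { Message } from "$lib/types/Message"; // path assumed

const exampleRouterMetadata: NonNullable<Message["routerMetadata"]> = {
  route: "math",
  model: "deepseek-v3",
  routerType: "mom",
  cot: {
    stage1: { jailbreak: false, pii: false, result: "Continue" },
    stage2: { cacheStatus: "MISS", action: "Update Memory", result: "Continue" },
    stage3: {
      domain: "math",
      reasoning: true,
      model: "deepseek-v3",
      optimized: true,
      result: "Continue",
    },
  },
};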
7. Data Structures
7.1 Extended RouteSelection Type
interface RouteSelection {
routeName: string;
error?: {
message: string;
statusCode?: number;
};
cotMetadata?: {
// Stage 1: Prompt Guard
stage1?: {
jailbreak: boolean;
jailbreakConfidence?: number;
pii: boolean;
result: 'Continue' | 'BLOCKED';
};
// Stage 2: Router Memory
stage2?: {
cacheStatus: 'HIT' | 'MISS';
action: 'Retrieve Memory' | 'Update Memory';
result: 'Fast Response' | 'Continue';
};
// Stage 3: Smart Routing
stage3?: {
domain: string;
reasoning: boolean;
model: string;
optimized: boolean;
result: 'Continue';
};
// Raw CoT string for debugging
rawCot?: string;
};
}
7.2 Extended Message Type
type Message = Partial<Timestamps> & {
// ... existing fields ...
routerMetadata?: {
route: string;
model: string;
provider?: string;
// New: CoT information
cot?: RouteSelection['cotMetadata'];
routerType?: 'omni' | 'mom'; // Which router was used
};
};
8. Implementation Phases
Phase 1: Core MoM Router (Week 1-2)
- Create src/lib/server/router/vllm-semantic-router.ts
- Implement vllmSelectRoute() function
- Add MoM configuration environment variables (sketched below)
- Implement basic error handling and fallback
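A sketch of the MoM configuration in config.ts; the environment variable names below are illustrative, not finalized (the default endpoint matches the curl example above):
const momConfig = {
  // Base URL of the vLLM Semantic Router (OpenAI-compatible)
  endpoint: process.env.VLLM_SR_ENDPOINT ?? "http://localhost:8801/v1",
  // Model alias that triggers MoM routing
  modelAlias: process.env.VLLM_SR_MODEL_ALIAS ?? "MoM",
  // Per-request timeout before falling back to the Omni router (ms)
  timeoutMs: Number(process.env.VLLM_SR_TIMEOUT_MS ?? 10_000),
  // Feature flag for gradual rollout and A/B testing
  enabled: process.env.VLLM_SR_ENABLED === "true",
};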
Phase 2: Router Integration (Week 2-3)
- Update endpoint.ts to support router selection
- Add "MoM" router alias in models.ts
- Implement router dispatch logic
- Update types.ts with CoT metadata
Phase 3: CoT Extraction & Storage (Week 3-4)
- Parse CoT from vLLM-SR response headers
- Extend RouteSelection with CoT metadata
- Update Message type with CoT information
- Implement CoT serialization
Phase 4: Frontend Integration (Week 4-5)
- Update router endpoint to pass CoT to frontend
- Implement CoT visualization in UI
- Add CoT display in chat interface
- Support toggling CoT visibility
Phase 5: Monitoring & Optimization (Week 5-6)
- Implement metrics collection
- Add performance benchmarking
- Create A/B testing framework
- Dashboard integration
9. CoT Information Extraction
9.1 Response Header Format
x-vllm-semantic-router-cot: 🔀 vLLM Semantic Router - Chain-Of-Thought 🔀
→ 🛡️ ***Stage 1 - Prompt Guard***: ✅ *No Jailbreak* → ✅ *No PII* → 💯 ***Continue***
→ 🔥 ***Stage 2 - Router Memory***: 🌊 *MISS* → 🧠 *Update Memory* → 💯 ***Continue***
→ 🧠 ***Stage 3 - Smart Routing***: 📂 *math* → 🧠 *Reasoning On* → 🥷 *deepseek-v3* → 🎯 *Prompt Optimized* → 💯 ***Continue***
9.2 Parsing Strategy
- Extract raw CoT string from header
- Parse three stages using regex patterns
- Extract key information: jailbreak, PII, cache status, domain, model, reasoning
- Store structured data in cotMetadata
- Keep the raw string for debugging (see the sketch below)
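A sketch of this parsing strategy in TypeScript; the regex patterns assume the header format shown in 9.1, which still needs to be confirmed with the vLLM team (see the open questions below):
import type { RouteSelection } from "./types";

type CotMetadata = NonNullable<RouteSelection["cotMetadata"]>;

function parseCot(rawCot: string): CotMetadata {
  const cot: CotMetadata = { rawCot };

  const stage1 = rawCot.match(/Stage 1 - Prompt Guard[^\n]*/)?.[0];
  if (stage1) {
    cot.stage1 = {
      jailbreak: !stage1.includes("No Jailbreak"),
      pii: !stage1.includes("No PII"),
      result: stage1.includes("BLOCKED") ? "BLOCKED" : "Continue",
    };
  }

  const stage2 = rawCot.match(/Stage 2 - Router Memory[^\n]*/)?.[0];
  if (stage2) {
    cot.stage2 = {
      cacheStatus: stage2.includes("HIT") ? "HIT" : "MISS",
      action: stage2.includes("Retrieve Memory") ? "Retrieve Memory" : "Update Memory",
      result: stage2.includes("Fast Response") ? "Fast Response" : "Continue",
    };
  }

  const stage3 = rawCot.match(/Stage 3 - Smart Routing[^\n]*/)?.[0];
  if (stage3) {
    cot.stage3 = {
      // 📂 *<domain>* and 🥷 *<model>* markers from the header format in 9.1
      domain: stage3.match(/📂 \*([^*]+)\*/)?.[1] ?? "unknown",
      reasoning: stage3.includes("Reasoning On"),
      model: stage3.match(/🥷 \*([^*]+)\*/)?.[1] ?? "unknown",
      optimized: stage3.includes("Prompt Optimized"),
      result: "Continue",
    };
  }

  return cot;
}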
10. Success Criteria
- ✅ MoM router successfully routes requests to vLLM-SR
- ✅ CoT information extracted and displayed in UI
- ✅ Zero breaking changes to existing APIs
- ✅ Both routers coexist without conflicts
11. Open Questions & Discussion Points
- CoT Header Format: Confirm exact header name and format with vLLM team
- Cache Visibility: Should cache hit rate be exposed to users?
- Default Router: Should MoM become default after stabilization?
- Dashboard Integration: Timeline for vLLM-SR dashboard integration?
12. References
- vLLM Semantic Router: https://github.com/vllm-project/semantic-router
- CoT Format: https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/tools/openwebui-pipe/vllm-sr-cot.md
- Dashboard Integration Issue: HuggingChat Omni Support 🤗 vllm-project/semantic-router#473
- Current Omni Router: src/lib/server/router/arch.ts
- Router Endpoint: src/lib/server/router/endpoint.ts
Appendix A: File Structure
src/lib/server/router/
├── arch.ts (Existing: Omni router)
├── vllm-semantic-router.ts (New: MoM router)
├── endpoint.ts (Updated: Router selection logic)
├── policy.ts (Existing: Route resolution)
├── types.ts (Updated: CoT metadata types)
└── index.ts (Updated: Export new router)
src/lib/server/
├── models.ts (Updated: MoM router alias)
└── config.ts (Updated: MoM configuration)
src/lib/types/
└── Message.ts (Updated: CoT in routerMetadata)
Document Version: 1.0
Last Updated: 2025-10-19
Status: Ready for Review