HuggingChat MoM (Mixture-of-Models) Integration Proposal 🤗 #1947

@Xunzhuo

Status: Proposal
Date: 2025-10-19
Version: 1.0
Authors: vLLM-SR Team


Executive Summary

This proposal outlines the integration of vLLM Semantic Router into HuggingChat as a new MoM (Mixture-of-Models) routing option. The integration will enable advanced intelligent routing capabilities including semantic caching, PII detection, and chain-of-thought (CoT) transparency, while maintaining full backward compatibility with the existing Omni (Arch router) implementation.


1. Motivation

Current State

  • HuggingChat currently supports Omni routing via the Arch router (src/lib/server/router/arch.ts)
  • Arch router provides basic route selection using LLM-based decision-making
  • Limited visibility into routing decisions and no semantic caching capabilities

Desired State

  • Support MoM (Mixture-of-Models) routing via vLLM Semantic Router
  • Enable advanced features: semantic caching, PII detection, intelligent routing
  • Provide transparent chain-of-thought (CoT) information for routing decisions
  • Maintain coexistence of both Omni and MoM routers for gradual rollout

Business Value

  1. Performance: Semantic caching reduces latency for repeated queries
  2. Security: PII detection protects user privacy
  3. Transparency: CoT information builds user trust
  4. Flexibility: Users can choose between Omni and MoM routing strategies
  5. Dashboard Integration: vLLM-SR dashboard provides monitoring and analytics

About vLLM Semantic Router

vLLM Semantic Router is an intelligent routing system that embodies the Mixture-of-Models (MoM) philosophy. It exposes an OpenAI-compatible endpoint that is addressed via the model name "MoM":

curl -X POST http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [
      {"role": "user", "content": "What is the derivative of x^2?"}
    ]
  }'

  • Intelligent Routing: Routes requests to the optimal model based on semantic understanding of the query, not just keyword matching
  • Semantic Caching: Leverages semantic similarity to cache responses, dramatically reducing latency for similar queries (not just exact matches)
  • Semantic Chain Architecture: Evolving toward a composable semantic chain in which all stages are orchestrated as an extensible pipeline, enabling future enhancements and custom stage integration via the work-in-progress "SemanticChain".
  • Three-Stage Pipeline (Extensible & Composable):
    • Stage 1 - Prompt Guard: Security-first approach with jailbreak detection and PII protection
    • Stage 2 - Router Memory: Intelligent semantic caching for performance optimization
    • Stage 3 - Smart Routing: Multi-level intelligent routing combining three complementary strategies:
      • Domain Understanding: Semantic classification of queries into domains (math, coding, general, etc.)
      • Similarity-Based Routing: Semantic similarity matching to route similar queries to optimal models
      • Keyword-Based Routing: Keyword pattern matching for explicit intent detection
      • These three routing strategies work together to provide comprehensive query understanding and optimal model selection
    • Future stages can be added to the pipeline without disrupting existing functionality
  • Mixture-of-Models Philosophy: Recognizes that no single model is optimal for all tasks. By intelligently routing different types of queries to different specialized models, it achieves:
    • Better accuracy through task-specific model selection
    • Cost optimization by using smaller models for simple tasks
    • Performance improvement through semantic understanding
    • Transparency via chain-of-thought visibility
  • Production-Ready: Battle-tested with comprehensive error handling, monitoring, and dashboard support
  • Open Source: vLLM Community-driven development with active maintenance and feature additions

2. Goals

Primary Goals

  • ✅ Integrate vLLM Semantic Router as a new MoM routing option
  • ✅ Extract and store chain-of-thought (CoT) metadata from vLLM-SR responses
  • ✅ Support both Omni and MoM routers coexisting in the same system
  • ✅ Expose CoT information to frontend for visualization

Secondary Goals

  • ✅ Support A/B testing between Omni and MoM routers
  • ✅ Integrate with vLLM-SR dashboard for monitoring

3. Non-Goals

  • ❌ Replace Omni router entirely (maintain coexistence)
  • ❌ Modify vLLM Semantic Router codebase
  • ❌ Implement custom semantic caching in HuggingChat (use vLLM-SR's caching)
  • ❌ Create new dashboard (integrate with existing vLLM-SR dashboard)
  • ❌ Support non-OpenAI-compatible endpoints for MoM

4. Design Principles

1. Backward Compatibility

  • Existing Omni router functionality remains unchanged
  • No breaking changes to current APIs or configurations
  • Both routers can be configured independently

2. Transparency

  • CoT information is always extracted and stored when available
  • Users can see routing decisions and reasoning
  • Clear logging for debugging and monitoring

3. Graceful Degradation

  • MoM → Omni → Fallback model (three-tier fallback strategy; sketched below)
  • System continues functioning even if vLLM-SR is unavailable
  • Clear error messages for integration issues
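
A minimal sketch of this three-tier strategy is shown below. The export names archSelectRoute and vllmSelectRoute and the fallback route constant are placeholders, not finalized APIs; the real logic will live in endpoint.ts.

import type { RouteSelection } from "./types";
import { archSelectRoute } from "./arch";                   // existing Omni router (export name assumed)
import { vllmSelectRoute } from "./vllm-semantic-router";   // proposed MoM router (see the MoM Router module in Section 5.2)

// Placeholder: the actual fallback route name is configuration-dependent.
const FALLBACK_ROUTE = process.env.ROUTER_FALLBACK_ROUTE ?? "default";

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

export async function selectRouteWithFallback(messages: ChatMessage[]): Promise<RouteSelection> {
  try {
    // Tier 1: MoM routing via vLLM Semantic Router
    return await vllmSelectRoute(messages);
  } catch (momError) {
    console.warn("[router] MoM (vLLM-SR) unavailable, falling back to Omni", momError);
    try {
      // Tier 2: Omni routing via the Arch router
      return await archSelectRoute(messages);
    } catch (omniError) {
      console.warn("[router] Omni (Arch) unavailable, using fallback model", omniError);
      // Tier 3: static fallback so the chat keeps working
      return { routeName: FALLBACK_ROUTE };
    }
  }
}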

4. Separation of Concerns

  • Router implementations are isolated in separate modules
  • Router selection logic is centralized in endpoint.ts
  • Configuration is environment-based (example variables sketched below)
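
For illustration, the environment-based MoM configuration could look like the sketch below. The variable names and defaults are assumptions rather than finalized keys; the endpoint default mirrors the curl example above.

// config.ts (sketch): hypothetical MoM configuration read from the environment.
// Variable names and defaults are illustrative only.
export const momRouterConfig = {
  enabled: process.env.VLLM_SR_ENABLED === "true",
  endpoint: process.env.VLLM_SR_ENDPOINT ?? "http://localhost:8801/v1/chat/completions",
  apiKey: process.env.VLLM_SR_API_KEY,                        // optional bearer token
  timeoutMs: Number(process.env.VLLM_SR_TIMEOUT_MS ?? 10_000),
};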

5. Performance First

  • Minimal overhead for router selection
  • Efficient CoT parsing and storage
  • Support for streaming responses with CoT metadata

5. Architecture

5.1 System Overview

┌─────────────────────────────────────────────────────────────┐
│                    HuggingChat Frontend                      │
│                  (Model Selection UI)                        │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ↓
┌─────────────────────────────────────────────────────────────┐
│              Router Endpoint (endpoint.ts)                   │
│  - Detects router type (MoM vs Omni)                        │
│  - Dispatches to appropriate router                         │
│  - Handles fallback logic                                   │
└────────────────────────┬────────────────────────────────────┘
                         │
        ┌────────────────┼────────────────┐
        ↓                ↓                ↓
   ┌─────────┐    ┌──────────────┐   ┌──────────┐
   │  Omni   │    │     MoM      │   │ Fallback │
   │ Router  │    │    Router    │   │  Model   │
   │(arch.ts)│    │(vllm-sr.ts)  │   │          │
   └────┬────┘    └──────┬───────┘   └──────────┘
        │                │
        ↓                ↓
   ┌─────────┐    ┌──────────────────┐
   │   Arch  │    │  vLLM Semantic   │
   │  Model  │    │     Router       │
   │ Endpoint│    │    Endpoint      │
   └─────────┘    └──────────────────┘
        │                │
        └────────────────┼────────────────┐
                         ↓                ↓
                    ┌─────────────────────────┐
                    │  Route Resolution &     │
                    │  Model Selection        │
                    │  (policy.ts)            │
                    └─────────────────────────┘
                         │
                         ↓
                    ┌─────────────────────────┐
                    │  Candidate Models       │
                    │  (with fallbacks)       │
                    └─────────────────────────┘

5.2 Component Details

A. Router Selection Logic (endpoint.ts)

  • Detects model alias: "MoM" → vLLM-SR, "omni" → Arch
  • Calls the appropriate router function (see the dispatch sketch below)
  • Passes CoT metadata to response stream
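
A possible shape of this dispatch logic, assuming both routers expose a select-route function with the same signature (the names below are assumptions carried over from the fallback sketch above):

import type { RouteSelection } from "./types";
import { archSelectRoute } from "./arch";                   // export name assumed
import { vllmSelectRoute } from "./vllm-semantic-router";   // proposed module

// endpoint.ts (sketch): route selection keyed on the model alias.
export async function selectRoute(
  modelAlias: string,
  messages: { role: string; content: string }[]
): Promise<RouteSelection> {
  if (modelAlias === "MoM") {
    // New path: vLLM Semantic Router, which also returns CoT metadata
    return vllmSelectRoute(messages);
  }
  // Default path: existing Omni / Arch router, unchanged
  return archSelectRoute(messages);
}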

B. MoM Router (vllm-semantic-router.ts)

  • Sends requests to vLLM-SR endpoint
  • Extracts CoT from response headers
  • Parses three-stage routing information
  • Returns RouteSelection with CoT metadata (see the sketch below)
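
A rough sketch of vllm-semantic-router.ts under this proposal; the endpoint variable, CoT header name, and route-name mapping are assumptions to be confirmed (see Open Questions):

import type { RouteSelection } from "./types";

const VLLM_SR_ENDPOINT =
  process.env.VLLM_SR_ENDPOINT ?? "http://localhost:8801/v1/chat/completions";
const COT_HEADER = "x-vllm-semantic-router-cot";

// Stub; the full parser is sketched in the Parsing Strategy section below.
const parseCot = (rawCot: string): RouteSelection["cotMetadata"] => ({ rawCot });

export async function vllmSelectRoute(
  messages: { role: string; content: string }[]
): Promise<RouteSelection> {
  const res = await fetch(VLLM_SR_ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "MoM", messages }),
  });

  if (!res.ok) {
    // Surface the error so endpoint.ts can fall back to the Omni router.
    return {
      routeName: "",
      error: { message: `vLLM-SR returned ${res.status}`, statusCode: res.status },
    };
  }

  // The CoT travels in a response header; the response body (a chat
  // completion) is not consumed in this sketch.
  const rawCot = res.headers.get(COT_HEADER) ?? undefined;
  const cotMetadata = rawCot ? parseCot(rawCot) : undefined;

  return {
    // How the vLLM-SR decision maps to a HuggingChat route name is still an
    // open design detail; using the Stage 3 domain is a placeholder here.
    routeName: cotMetadata?.stage3?.domain ?? "default",
    cotMetadata,
  };
}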

C. Type Extensions (types.ts)

  • Extends RouteSelection with CoT metadata
  • Defines CoT structure for three stages
  • Maintains backward compatibility

6. Request Flow

6.1 MoM Routing Flow (Happy Path)

1. User selects "MoM" model
   ↓
2. Router Endpoint receives request
   ↓
3. Detect model alias = "MoM"
   ↓
4. Call vllmSelectRoute()
   ↓
5. Send request to vLLM-SR endpoint
   ↓
6. Extract CoT from response headers
   ↓
7. Parse three-stage routing information
   ↓
8. Return RouteSelection with CoT metadata
   ↓
9. Resolve route to candidate models
   ↓
10. Try each candidate with fallback
    ↓
11. Stream response with CoT metadata
    ↓
12. Frontend displays routing decision + CoT

7. Data Structures

7.1 Extended RouteSelection Type

interface RouteSelection {
  routeName: string;
  error?: {
    message: string;
    statusCode?: number;
  };
  cotMetadata?: {
    // Stage 1: Prompt Guard
    stage1?: {
      jailbreak: boolean;
      jailbreakConfidence?: number;
      pii: boolean;
      result: 'Continue' | 'BLOCKED';
    };
    // Stage 2: Router Memory
    stage2?: {
      cacheStatus: 'HIT' | 'MISS';
      action: 'Retrieve Memory' | 'Update Memory';
      result: 'Fast Response' | 'Continue';
    };
    // Stage 3: Smart Routing
    stage3?: {
      domain: string;
      reasoning: boolean;
      model: string;
      optimized: boolean;
      result: 'Continue';
    };
    // Raw CoT string for debugging
    rawCot?: string;
  };
}
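
For reference, the sample CoT header shown later in the Response Header Format section would parse into something like this (illustrative values only):

// Illustrative only: a RouteSelection as it might look for the sample
// CoT header shown in the Response Header Format section.
const example: RouteSelection = {
  routeName: "math",
  cotMetadata: {
    stage1: { jailbreak: false, pii: false, result: "Continue" },
    stage2: { cacheStatus: "MISS", action: "Update Memory", result: "Continue" },
    stage3: {
      domain: "math",
      reasoning: true,
      model: "deepseek-v3",
      optimized: true,
      result: "Continue",
    },
    rawCot: "🔀 vLLM Semantic Router - Chain-Of-Thought 🔀 → ...",
  },
};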

7.2 Extended Message Type

type Message = Partial<Timestamps> & {
  // ... existing fields ...
  
  routerMetadata?: {
    route: string;
    model: string;
    provider?: string;
    // New: CoT information
    cot?: RouteSelection['cotMetadata'];
    routerType?: 'omni' | 'mom'; // Which router was used
  };
};

8. Implementation Phases

Phase 1: Core MoM Router (Week 1-2)

  • Create src/lib/server/router/vllm-semantic-router.ts
  • Implement vllmSelectRoute() function
  • Add MoM configuration environment variables
  • Implement basic error handling and fallback

Phase 2: Router Integration (Week 2-3)

  • Update endpoint.ts to support router selection
  • Add "MoM" router alias in models.ts
  • Implement router dispatch logic
  • Update types.ts with CoT metadata

Phase 3: CoT Extraction & Storage (Week 3-4)

  • Parse CoT from vLLM-SR response headers
  • Extend RouteSelection with CoT metadata
  • Update Message type with CoT information
  • Implement CoT serialization

Phase 4: Frontend Integration (Week 4-5)

  • Update router endpoint to pass CoT to frontend
  • Implement CoT visualization in UI
  • Add CoT display in chat interface
  • Support toggling CoT visibility

Phase 5: Monitoring & Optimization (Week 5-6)

  • Implement metrics collection
  • Add performance benchmarking
  • Create A/B testing framework
  • Dashboard integration

9. CoT Information Extraction

9.1 Response Header Format

x-vllm-semantic-router-cot: 🔀 vLLM Semantic Router - Chain-Of-Thought 🔀
  → 🛡️ ***Stage 1 - Prompt Guard***: ✅ *No Jailbreak* → ✅ *No PII* → 💯 ***Continue***
  → 🔥 ***Stage 2 - Router Memory***: 🌊 *MISS* → 🧠 *Update Memory* → 💯 ***Continue***
  → 🧠 ***Stage 3 - Smart Routing***: 📂 *math* → 🧠 *Reasoning On* → 🥷 *deepseek-v3* → 🎯 *Prompt Optimized* → 💯 ***Continue***

9.2 Parsing Strategy

  • Extract raw CoT string from header
  • Parse three stages using regex patterns
  • Extract key information: jailbreak, PII, cache status, domain, model, reasoning
  • Store structured data in cotMetadata
  • Keep the raw string for debugging (a parsing sketch follows)
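
One possible implementation of this strategy, targeting the sample header above and assuming the extended RouteSelection type from Section 7.1 lives in types.ts; the patterns will likely need adjusting once the header format is confirmed with the vLLM team (see Open Questions):

import type { RouteSelection } from "./types";

type CotMetadata = NonNullable<RouteSelection["cotMetadata"]>;

export function parseCot(rawCot: string): CotMetadata {
  const cot: CotMetadata = { rawCot };

  // Stage 1 - Prompt Guard: jailbreak / PII verdicts
  const stage1 = rawCot.match(/Stage 1 - Prompt Guard\*{0,3}:(.*)/);
  if (stage1) {
    cot.stage1 = {
      jailbreak: !stage1[1].includes("No Jailbreak"),
      pii: !stage1[1].includes("No PII"),
      result: stage1[1].includes("BLOCKED") ? "BLOCKED" : "Continue",
    };
  }

  // Stage 2 - Router Memory: cache status and action
  const stage2 = rawCot.match(/Stage 2 - Router Memory\*{0,3}:(.*)/);
  if (stage2) {
    cot.stage2 = {
      cacheStatus: stage2[1].includes("HIT") ? "HIT" : "MISS",
      action: stage2[1].includes("Retrieve Memory") ? "Retrieve Memory" : "Update Memory",
      result: stage2[1].includes("Fast Response") ? "Fast Response" : "Continue",
    };
  }

  // Stage 3 - Smart Routing: domain, reasoning flag, selected model
  const stage3 = rawCot.match(/Stage 3 - Smart Routing\*{0,3}:(.*)/);
  if (stage3) {
    const domain = stage3[1].match(/📂 \*([^*]+)\*/);
    const model = stage3[1].match(/🥷 \*([^*]+)\*/);
    cot.stage3 = {
      domain: domain?.[1] ?? "unknown",
      reasoning: stage3[1].includes("Reasoning On"),
      model: model?.[1] ?? "unknown",
      optimized: stage3[1].includes("Prompt Optimized"),
      result: "Continue",
    };
  }

  return cot;
}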

10. Success Criteria

  • ✅ MoM router successfully routes requests to vLLM-SR
  • ✅ CoT information extracted and displayed in UI
  • ✅ Zero breaking changes to existing APIs
  • ✅ Both routers coexist without conflicts

11. Open Questions & Discussion Points

  1. CoT Header Format: Confirm exact header name and format with vLLM team
  2. Cache Visibility: Should cache hit rate be exposed to users?
  3. Default Router: Should MoM become default after stabilization?
  4. Dashboard Integration: Timeline for vLLM-SR dashboard integration?

Appendix A: File Structure

src/lib/server/router/
├── arch.ts                      (Existing: Omni router)
├── vllm-semantic-router.ts      (New: MoM router)
├── endpoint.ts                  (Updated: Router selection logic)
├── policy.ts                    (Existing: Route resolution)
├── types.ts                     (Updated: CoT metadata types)
└── index.ts                     (Updated: Export new router)

src/lib/server/
├── models.ts                    (Updated: MoM router alias)
└── config.ts                    (Updated: MoM configuration)

src/lib/types/
└── Message.ts                   (Updated: CoT in routerMetadata)

Document Version: 1.0
Last Updated: 2025-10-19
Status: Ready for Review
