
Commit eab55fe

docs: added more info to load balancing & passthrough endpoints
1 parent ff09685 commit eab55fe

2 files changed: +71 −38 lines changed


docs/my-website/docs/pass_through/intro.md

Lines changed: 40 additions & 0 deletions
@@ -11,3 +11,43 @@ These endpoints are useful for 2 scenarios:

## How is your request handled?

The request is passed through to the provider's endpoint. The response is then passed back to the client. **No translation is done.**

### Request Forwarding Process

1. **Request Reception**: LiteLLM receives your request at `/provider/endpoint`
2. **Authentication**: Your LiteLLM API key is validated and mapped to the provider's API key
3. **Request Transformation**: Request is reformatted for the target provider's API
4. **Forwarding**: Request is sent to the actual provider endpoint
5. **Response Handling**: Provider response is returned directly to you
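
A minimal sketch of the flow above from the client side, assuming the proxy runs at `http://0.0.0.0:4000`, an Anthropic passthrough route is mounted at `/anthropic`, and `sk-1234` is your LiteLLM key (the route prefix, auth header, and payload shape depend on the provider):

```python
import requests

LITELLM_PROXY = "http://0.0.0.0:4000"   # placeholder proxy URL
LITELLM_API_KEY = "sk-1234"             # your LiteLLM key, NOT the provider's key

# Everything after the provider prefix (/anthropic) is forwarded to the
# provider's native endpoint (/v1/messages); the proxy swaps in the provider key.
resp = requests.post(
    f"{LITELLM_PROXY}/anthropic/v1/messages",
    headers={
        "x-api-key": LITELLM_API_KEY,          # auth header used by Anthropic's API
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json={
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Hello via passthrough"}],
    },
)
print(resp.status_code)
print(resp.json())
```
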
### Authentication Flow

```mermaid
graph LR
    A[Client Request] --> B[LiteLLM Proxy]
    B --> C[Validate LiteLLM API Key]
    C --> D[Map to Provider API Key]
    D --> E[Forward to Provider]
    E --> F[Return Response]
```

**Key Points:**

- Use your **LiteLLM API key** in requests, not the provider's key
- LiteLLM handles the provider authentication internally
- Same authentication works across all passthrough endpoints
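
Because the proxy maps your key to the provider's credentials internally, a provider's own SDK can usually be pointed at the passthrough route with only your LiteLLM key. A hedged sketch with the `anthropic` Python SDK (the base URL and `/anthropic` route prefix are assumptions):

```python
from anthropic import Anthropic

# The LiteLLM key stands in for the Anthropic key; the proxy maps it internally.
client = Anthropic(
    api_key="sk-1234",                          # LiteLLM key (placeholder)
    base_url="http://0.0.0.0:4000/anthropic",   # proxy URL + assumed route prefix
)

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=128,
    messages=[{"role": "user", "content": "Hello via the Anthropic SDK"}],
)
print(message.content)
```
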
### Error Handling

**Provider Errors**: Forwarded directly to you with original error codes and messages

**LiteLLM Errors**:

- `401`: Invalid LiteLLM API key
- `404`: Provider or endpoint not supported
- `500`: Internal routing/forwarding errors
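
A sketch of telling the two error classes apart by status code, reusing the hypothetical `/anthropic` route from above:

```python
import requests

resp = requests.post(
    "http://0.0.0.0:4000/anthropic/v1/messages",   # hypothetical route, as above
    headers={
        "x-api-key": "sk-1234",                    # LiteLLM key (placeholder)
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json={
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 64,
        "messages": [{"role": "user", "content": "ping"}],
    },
)

if resp.status_code == 401:
    print("Invalid LiteLLM API key")                 # rejected by the proxy
elif resp.status_code == 404:
    print("Provider or endpoint not supported")      # unknown passthrough route
elif resp.status_code >= 500:
    # proxy-side routing/forwarding error, or a provider 5xx passed through
    print("Server error:", resp.status_code, resp.text)
elif resp.status_code >= 400:
    # anything else is typically the provider's own error, forwarded as-is
    print("Provider error:", resp.status_code, resp.text)
else:
    print(resp.json())
```
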
### Benefits

- **Unified Authentication**: One API key for all providers
- **Centralized Logging**: All requests logged through LiteLLM
- **Cost Tracking**: Usage tracked across all endpoints
- **Access Control**: Same permissions apply to passthrough endpoints

docs/my-website/docs/proxy/load_balancing.md

Lines changed: 31 additions & 38 deletions
@@ -13,6 +13,23 @@ For more details on routing strategies / params, see [Routing](../routing.md)

:::

## How Load Balancing Works

LiteLLM automatically distributes requests across multiple deployments of the same model using its built-in router. The proxy routes traffic to optimize performance and reliability.

The `simple-shuffle` routing strategy is used by default.
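
The distribution is handled by the litellm `Router`; a minimal sketch of the same idea in the Python SDK, with two placeholder Azure deployments registered under one model group (names, keys, and endpoints are illustrative):

```python
from litellm import Router

# Two deployments registered under the same model group name ("gpt-3.5-turbo").
model_list = [
    {
        "model_name": "gpt-3.5-turbo",                 # model group
        "litellm_params": {
            "model": "azure/gpt-35-turbo-eu",          # deployment 1 (placeholder)
            "api_base": "https://my-eu-endpoint.openai.azure.com/",
            "api_key": "azure-key-1",
        },
    },
    {
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {
            "model": "azure/gpt-35-turbo-us",          # deployment 2 (placeholder)
            "api_base": "https://my-us-endpoint.openai.azure.com/",
            "api_key": "azure-key-2",
        },
    },
]

# routing_strategy defaults to "simple-shuffle"
router = Router(model_list=model_list)

response = router.completion(
    model="gpt-3.5-turbo",   # call the group; the router picks a deployment
    messages=[{"role": "user", "content": "hello"}],
)
print(response)
```
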
### Routing Strategies

| Strategy | Description | When to Use |
|----------|-------------|-------------|
| **simple-shuffle** (recommended) | Randomly distributes requests | General purpose, good for even load distribution |
| **least-busy** | Routes to the deployment with the fewest active requests | High-concurrency scenarios |
| **usage-based-routing** (not recommended for performance) | Routes to the deployment with the lowest current usage (RPM/TPM) | When you want to respect rate limits evenly |
| **latency-based-routing** | Routes to the fastest-responding deployment | Latency-critical applications |
| **cost-based-routing** | Routes to the lowest-cost deployment | Cost-sensitive applications |
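
To use a different strategy, set it on the router. A hedged sketch via the Python `Router` (on the proxy, the same choice is typically set as `routing_strategy` under `router_settings` in the config):

```python
from litellm import Router

router = Router(
    model_list=[  # two placeholder deployments of one model group, as above
        {"model_name": "gpt-3.5-turbo",
         "litellm_params": {"model": "azure/gpt-35-turbo-eu",
                            "api_base": "https://my-eu-endpoint.openai.azure.com/",
                            "api_key": "azure-key-1"}},
        {"model_name": "gpt-3.5-turbo",
         "litellm_params": {"model": "azure/gpt-35-turbo-us",
                            "api_base": "https://my-us-endpoint.openai.azure.com/",
                            "api_key": "azure-key-2"}},
    ],
    routing_strategy="least-busy",  # overrides the "simple-shuffle" default
)
```
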
## Quick Start - Load Balancing

#### Step 1 - Set deployments on config

@@ -106,49 +123,13 @@ curl --location 'http://0.0.0.0:4000/chat/completions' \
]
}'
```
-</TabItem>
-<TabItem value="langchain" label="Langchain">
-
-```python
-from langchain.chat_models import ChatOpenAI
-from langchain.prompts.chat import (
-    ChatPromptTemplate,
-    HumanMessagePromptTemplate,
-    SystemMessagePromptTemplate,
-)
-from langchain.schema import HumanMessage, SystemMessage
-import os
-
-os.environ["OPENAI_API_KEY"] = "anything"
-
-chat = ChatOpenAI(
-    openai_api_base="http://0.0.0.0:4000",
-    model="gpt-3.5-turbo",
-)
-
-messages = [
-    SystemMessage(
-        content="You are a helpful assistant that im using to make a test request to."
-    ),
-    HumanMessage(
-        content="test from litellm. tell me why it's amazing in 1 sentence"
-    ),
-]
-response = chat(messages)
-
-print(response)
-```
-
-</TabItem>
-
-</Tabs>

### Test - Loadbalancing

In this request, the following will occur:
1. A rate limit exception will be raised
-2. LiteLLM proxy will retry the request on the model group (default is 3).
+2. LiteLLM proxy will retry the request on the model group (default retries are 3).

```bash
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
@@ -256,4 +237,16 @@ model_group_alias: Optional[Dict[str, Union[str, RouterModelGroupAliasItem]]] =
class RouterModelGroupAliasItem(TypedDict):
    model: str
    hidden: bool  # if 'True', don't return on `/v1/models`, `/v1/model/info`, `/v1/model_group/info`
```

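
For context, a hedged sketch of passing an alias item of this shape to the Python `Router` (the alias, group name, and key are illustrative):

```python
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "gpt-3.5-turbo",
         "litellm_params": {"model": "gpt-3.5-turbo", "api_key": "sk-openai-key"}},  # placeholder
    ],
    # Requests for "gpt-4" are served by the "gpt-3.5-turbo" group; with
    # hidden=True the alias is not returned on `/v1/models`, `/v1/model/info`,
    # or `/v1/model_group/info`.
    model_group_alias={
        "gpt-4": {"model": "gpt-3.5-turbo", "hidden": True},
    },
)
```
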
### When You'll See Load Balancing in Action

**Immediate Effects:**

- Different deployments serve subsequent requests (visible in logs)
- Better response times during high traffic

**Observable Benefits:**

- **Higher throughput**: More requests handled simultaneously across deployments
- **Improved reliability**: If one deployment fails, traffic automatically routes to healthy ones
- **Better resource utilization**: Load spread evenly across all available deployments
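
One way to observe this, sketched with the OpenAI Python client pointed at a local proxy (URL and key are placeholders); which deployment served each request shows up in the proxy logs:

```python
import openai

# Placeholders: your proxy URL and LiteLLM key
client = openai.OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000")

# Fire a handful of identical requests at the model group; with multiple
# deployments configured, the proxy spreads them out (check the proxy logs
# to see which deployment handled each one).
for i in range(5):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"request {i}"}],
    )
    print(i, response.choices[0].message.content)
```
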
