feat: add circuit breaker for upstream provider overload protection #75
Implement per-provider circuit breakers that detect upstream rate limiting (429/503/529 status codes) and temporarily stop sending requests when providers are overloaded.
Key features:
- Per-provider circuit breakers (Anthropic, OpenAI)
- Configurable failure threshold, time window, and cooldown period
- Half-open state allows gradual recovery testing
- Prometheus metrics for monitoring (state gauge, trips counter, rejects counter)
- Thread-safe implementation with proper state machine transitions
- Disabled by default for backward compatibility
Circuit breaker states:
- Closed: normal operation, tracking failures within sliding window
- Open: all requests rejected with 503, waiting for cooldown
- Half-Open: limited requests allowed to test if upstream recovered
Status codes that trigger circuit breaker:
- 429 Too Many Requests
- 503 Service Unavailable
- 529 Anthropic Overloaded
Relates to: coder/internal#1153
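The closed, open, and half-open cycle in this commit message can be sketched as a small state machine. This is a simplified illustration, not the PR's code: `toyBreaker`, its fields, and the injected clock are hypothetical names, the sliding window is reduced to a count of consecutive failures, half-open does not cap the number of probes, and locking is left out.

```go
package main

import (
	"fmt"
	"time"
)

type state int

const (
	closed state = iota
	open
	halfOpen
)

// toyBreaker is a hypothetical, non-thread-safe illustration of the
// closed -> open -> half-open -> closed cycle. It counts consecutive
// failures instead of keeping a sliding window.
type toyBreaker struct {
	st        state
	failures  int
	threshold int              // failures needed to trip
	cooldown  time.Duration    // time spent open before probing
	openedAt  time.Time        // when the breaker last tripped
	now       func() time.Time // injected clock so the example is deterministic
}

// allow reports whether a request may be sent upstream.
func (b *toyBreaker) allow() bool {
	if b.st == open && b.now().Sub(b.openedAt) >= b.cooldown {
		b.st = halfOpen // cooldown elapsed: let a probe through
	}
	return b.st != open
}

// record feeds the upstream status code back into the breaker.
func (b *toyBreaker) record(status int) {
	overloaded := status == 429 || status == 503 || status == 529
	switch {
	case !overloaded:
		b.st, b.failures = closed, 0 // any other response closes the circuit
	case b.st == halfOpen:
		b.trip() // a failed probe reopens immediately
	default:
		b.failures++
		if b.failures >= b.threshold {
			b.trip()
		}
	}
}

func (b *toyBreaker) trip() {
	b.st, b.failures, b.openedAt = open, 0, b.now()
}

// demo trips the breaker, waits out the cooldown, and lets a probe succeed.
func demo() [3]bool {
	clock := time.Unix(0, 0)
	b := &toyBreaker{threshold: 2, cooldown: 30 * time.Second, now: func() time.Time { return clock }}
	b.record(429)
	b.record(529)
	whileOpen := b.allow()
	clock = clock.Add(31 * time.Second)
	afterCooldown := b.allow()
	b.record(200)
	return [3]bool{whileOpen, afterCooldown, b.st == closed}
}

func main() {
	fmt.Println(demo())
}
```

In `demo`, the first `allow` call is refused because the breaker has just tripped, the second is admitted as a half-open probe, and the successful probe closes the circuit again.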
…solation
- Replace custom circuit breaker implementation with sony/gobreaker
- Change from per-provider to per-endpoint circuit breakers (e.g., OpenAI chat completions failing won't block responses API)
- Simplify API: CircuitBreakers manages all breakers internally
- Update metrics to include endpoint label
- Simplify tests to focus on key behaviors
Based on PR review feedback suggesting use of established library and per-endpoint granularity for better fault isolation.
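Per-endpoint isolation amounts to keeping one breaker per provider and endpoint pair. A minimal sketch of such a registry follows, using `sync.Map` as the PR summary mentions. The `registry` and `breaker` types, the key format, and the `/v1/responses` path are illustrative assumptions; the real code stores gobreaker instances rather than this placeholder struct.

```go
package main

import (
	"fmt"
	"sync"
)

// breaker is a placeholder for a real circuit breaker instance.
type breaker struct {
	name string
}

// registry lazily creates one breaker per provider/endpoint pair.
type registry struct {
	breakers sync.Map // key: "provider:endpoint", value: *breaker
}

// get returns the breaker for the pair, creating it on first use.
// LoadOrStore ensures concurrent callers all end up with the same instance.
func (r *registry) get(provider, endpoint string) *breaker {
	key := provider + ":" + endpoint
	if v, ok := r.breakers.Load(key); ok {
		return v.(*breaker)
	}
	v, _ := r.breakers.LoadOrStore(key, &breaker{name: key})
	return v.(*breaker)
}

func main() {
	var r registry
	chat := r.get("openai", "/v1/chat/completions")
	responses := r.get("openai", "/v1/responses")
	// The same endpoint always maps to the same breaker.
	fmt.Println(chat == r.get("openai", "/v1/chat/completions"))
	// Different endpoints get separate breakers, so one tripping
	// does not block the other.
	fmt.Println(chat == responses)
}
```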
Rename fields to match gobreaker naming convention:
- Window -> Interval
- Cooldown -> Timeout
- HalfOpenMaxRequests -> MaxRequests
- FailureThreshold type int64 -> uint32
…onfigs
Address PR review feedback:
1. Middleware pattern - Circuit breaker is now HTTP middleware that wraps handlers, capturing response status codes directly instead of extracting from provider-specific error types.
2. Per-provider configs - NewCircuitBreakers takes map[string]CircuitBreakerConfig keyed by provider name. Providers not in the map have no circuit breaker.
3. Remove provider overfitting - Deleted extractStatusCodeFromError() which hardcoded AnthropicErrorResponse and OpenAIErrorResponse types. Middleware now uses statusCapturingWriter to inspect actual HTTP response codes.
4. Configurable failure detection - IsFailure func in config allows providers to define custom status codes as failures. Defaults to 429/503/529.
5. Fix gauge values - State gauge now uses 0 (closed), 0.5 (half-open), 1 (open)
6. Integration tests - Replaced unit tests with httptest-based integration tests that verify actual behavior: upstream errors trip circuit, requests get blocked, recovery after timeout, per-endpoint isolation.
7. Error message - Changed from 'upstream rate limiting' to 'circuit breaker is open'
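Points 1, 3, and 4 above can be illustrated with a short sketch: a writer that records the response status, a default failure predicate for 429/503/529, and middleware that classifies each response. The names `defaultIsFailure`, `observe`, and `demo` are hypothetical, and the sketch leaves out the `http.Flusher` and `Unwrap()` support that a later commit adds. It only shows the status-capturing idea, not the PR's actual middleware.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusCapturingWriter records the status code written by the wrapped handler.
type statusCapturingWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusCapturingWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// defaultIsFailure mirrors the default described above: 429, 503 and 529.
func defaultIsFailure(status int) bool {
	switch status {
	case http.StatusTooManyRequests, http.StatusServiceUnavailable, 529:
		return true
	}
	return false
}

// observe wraps next and reports, via onResult, whether each response
// counted as a failure according to isFailure.
func observe(next http.Handler, isFailure func(int) bool, onResult func(failed bool)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Default to 200 because handlers may write a body without calling WriteHeader.
		sw := &statusCapturingWriter{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(sw, r)
		onResult(isFailure(sw.status))
	})
}

// demo sends one request through the middleware against a stub upstream that
// replies with upstreamStatus, and returns what the middleware observed.
func demo(upstreamStatus int) (failed bool) {
	upstream := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(upstreamStatus)
	})
	h := observe(upstream, defaultIsFailure, func(f bool) { failed = f })
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodPost, "/v1/messages", nil))
	return failed
}

func main() {
	fmt.Println(demo(529), demo(500), demo(200))
}
```

Because the predicate is passed in as a function, a provider could supply its own rule (for example, treating 500 as a failure) without touching the middleware.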
- Add CircuitBreaker interface with Allow(), RecordSuccess(), RecordFailure()
- Add NoopCircuitBreaker struct for providers without circuit breaker config
- Add gobreakerCircuitBreaker wrapping sony/gobreaker implementation
- CircuitBreakers.Get() returns NoopCircuitBreaker when provider not configured
- Add http.Flusher support to statusCapturingWriter for SSE streaming
- Add Unwrap() for ResponseWriter interface detection
- Changed CircuitBreaker interface to Execute(fn func() int) (statusCode, rejected)
- Use gobreaker.Execute() to properly handle both ErrOpenState and ErrTooManyRequests
- NoopCircuitBreaker.Execute simply runs the function and returns not rejected
- Simplified middleware by removing separate Allow/Record pattern
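The `Execute` shape described in this commit might look roughly like the sketch below. The interface and `NoopCircuitBreaker` follow the commit message, while `alwaysOpen` and `call` are hypothetical helpers added only to show how a caller could branch on `rejected`. The gobreaker-backed implementation is left out so the example stays standard-library only.

```go
package main

import "fmt"

// CircuitBreaker follows the Execute shape from the commit message: fn performs
// the upstream call and returns its HTTP status code.
type CircuitBreaker interface {
	Execute(fn func() int) (statusCode int, rejected bool)
}

// NoopCircuitBreaker is used when a provider has no breaker configured.
// It always runs fn and never rejects.
type NoopCircuitBreaker struct{}

func (NoopCircuitBreaker) Execute(fn func() int) (int, bool) {
	return fn(), false
}

// alwaysOpen is a hypothetical stand-in for a tripped breaker.
type alwaysOpen struct{}

func (alwaysOpen) Execute(fn func() int) (int, bool) {
	return 0, true // fn is never invoked while the circuit is open
}

// call shows how a caller might branch on rejected: the upstream status is
// passed through, and a rejection is answered with 503.
func call(cb CircuitBreaker, upstream func() int) int {
	status, rejected := cb.Execute(upstream)
	if rejected {
		return 503
	}
	return status
}

func main() {
	upstream := func() int { return 200 }
	fmt.Println(call(NoopCircuitBreaker{}, upstream))
	fmt.Println(call(alwaysOpen{}, upstream))
}
```

Folding the allow check and the result recording into one call means the caller cannot forget to report an outcome, which appears to be the motivation for dropping the separate Allow/Record pattern.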
dannykopping
left a comment
It's looking better now, but still a few things which need addressing please.
circuit_breaker.go
Outdated
}

// DefaultCircuitBreakerConfig returns sensible defaults for circuit breaker configuration.
func DefaultCircuitBreakerConfig() CircuitBreakerConfig {
Not used, right?
It may be used like:
cbConfig := aibridge.DefaultCircuitBreakerConfig()
providers := []aibridge.Provider{
	aibridge.NewOpenAIProvider(aibridge.OpenAIConfig{
		BaseURL:        "https://api.openai.com",
		Key:            "test-key",
		CircuitBreaker: &cbConfig,
	}),
}
config.go
Outdated
type ProviderConfig struct {
	BaseURL, Key   string
	CircuitBreaker *CircuitBreakerConfig
How do you imagine this will be configured?
circuitBreakerConfig = &aibridge.CircuitBreakerConfig{
	FailureThreshold: 5,
	Interval:         10 * time.Second,
	Timeout:          30 * time.Second,
	MaxRequests:      3,
	IsFailure:        DefaultIsFailure,
}
providers := []aibridge.Provider{
	aibridge.NewOpenAIProvider(aibridge.OpenAIConfig{
		BaseURL:        coderAPI.DeploymentValues.AI.BridgeConfig.OpenAI.BaseURL.String(),
		Key:            coderAPI.DeploymentValues.AI.BridgeConfig.OpenAI.Key.String(),
		CircuitBreaker: circuitBreakerConfig,
	}),
}
ssncferreira
left a comment
Overall looks like a nice implementation and good test coverage 👍
Added a few comments and suggestions
Resolve conflicts:
- Move circuit breaker code to circuitbreaker package
- Update imports to use new package structure (provider/, config/, metrics/, etc.)
- Update integration tests to use new imports
ssncferreira
left a comment
LGTM 👍 Thanks for addressing the comments!
dannykopping
left a comment
Almost there, but there are a few things we need to address first 👍
dannykopping
left a comment
LGTM, one note about marshaling the errors in a way that will be more maintainable, but I'm fine for that to be a follow-up.
Thanks for your patience on this one 👍 good work @kacpersaw
if p.config.OpenErrorResponse != nil {
	return p.config.OpenErrorResponse()
}
return []byte(`{"error":"circuit breaker is open"}`)
Each provider has its own error type which can be marshaled the way they expect to return errors. We should use that.
See intercept/messages/base.go -> newErrorResponse() for example.
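To illustrate the suggestion, here is a sketch that marshals the "circuit breaker is open" message through a provider-shaped struct instead of a hand-written JSON literal. The struct names, the `openErrorBody` helper, and the error type strings (`overloaded_error`, `server_error`) are assumptions chosen for illustration. They are not aibridge's or either SDK's actual types, and the envelope shapes are only an approximation of each provider's public error format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

const openMessage = "circuit breaker is open"

// errorDetail is the inner object shared by both hypothetical envelopes.
type errorDetail struct {
	Type    string `json:"type"`
	Message string `json:"message"`
}

// anthropicStyleError approximates an Anthropic-style error envelope.
type anthropicStyleError struct {
	Type  string      `json:"type"`
	Error errorDetail `json:"error"`
}

// openAIStyleError approximates an OpenAI-style error envelope.
type openAIStyleError struct {
	Error errorDetail `json:"error"`
}

// openErrorBody returns the body to send when the breaker is open, shaped
// for the given provider. Unknown providers get the generic fallback.
func openErrorBody(provider string) []byte {
	var v any
	switch provider {
	case "anthropic":
		v = anthropicStyleError{Type: "error", Error: errorDetail{Type: "overloaded_error", Message: openMessage}}
	case "openai":
		v = openAIStyleError{Error: errorDetail{Type: "server_error", Message: openMessage}}
	default:
		v = map[string]string{"error": openMessage}
	}
	b, err := json.Marshal(v)
	if err != nil {
		// Marshaling these fixed shapes should not fail; keep a literal as a last resort.
		return []byte(`{"error":"circuit breaker is open"}`)
	}
	return b
}

func main() {
	fmt.Println(string(openErrorBody("anthropic")))
	fmt.Println(string(openErrorBody("openai")))
	fmt.Println(string(openErrorBody("other")))
}
```

Going through typed structs keeps the body in step with whatever error shape each provider's clients already parse, which is the maintainability point raised in the review.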
Co-authored-by: Danny Kopping <[email protected]>
Summary
Implement circuit breakers for upstream AI providers that detect rate limiting and overload conditions (429/503/529 status codes) and temporarily stop sending requests when providers are overloaded.
This completes the overload protection story by adding the aibridge-specific component that couldn't be implemented as generic HTTP middleware in coderd (since it requires understanding upstream provider responses).
Key Features
- … /v1/messages and /v1/chat/completions are isolated), preventing one endpoint's issues from affecting others
- … IsFailure function to determine which status codes trigger the circuit
- … github.com/sony/gobreaker/v2 library
- … sync.Map for concurrent access to per-endpoint breakers
- … CircuitBreakerConfig in provider config; nil means no protection
Circuit Breaker States
Status Codes That Trigger Circuit Breaker (Default)
Other error codes (400, 401, 500, 502, etc.) do not trigger the circuit breaker since they indicate different issues that circuit breaking wouldn't help with. Custom IsFailure functions can be provided to change this behavior.
Default Configuration
- FailureThreshold: 5
- Interval: 10s
- Timeout: 30s
- MaxRequests: 3
Prometheus Metrics
- circuit_breaker_state (labels: provider, endpoint)
- circuit_breaker_trips_total (labels: provider, endpoint)
- circuit_breaker_rejects_total (labels: provider, endpoint)
Files Changed
- circuit_breaker.go - Core implementation: CircuitBreakerConfig, ProviderCircuitBreakers, CircuitBreakerMiddleware
- circuit_breaker_test.go - Unit tests for middleware, failure detection, state transitions
- circuit_breaker_integration_test.go - End-to-end tests with mock upstream for Anthropic and OpenAI providers
- bridge.go - Wire up circuit breakers per provider with metrics and logging callbacks
- provider.go - Add CircuitBreakerConfig() method to Provider interface
- provider_anthropic.go - Implement CircuitBreakerConfig() for Anthropic
- provider_openai.go - Implement CircuitBreakerConfig() for OpenAI
- config.go - Add CircuitBreaker field to ProviderConfig
- metrics.go - Add Prometheus metrics for circuit breaker state, trips, and rejects
Usage
Testing
All tests pass:
Related
aibridged internal#1153