
Implement openai-proxy MVP #1

Open
mfittko wants to merge 30 commits into main from feat/initial-openai-proxy

Conversation


@mfittko mfittko commented Mar 31, 2026

Summary

Implement the first production-shaped openai-proxy MVP as a standalone Ruby OpenAI proxy.

The scope is intentionally narrow:

  • transparent OpenAI-compatible proxying on /v1/*
  • project-scoped upstream API keys stored in MySQL with AES-256-GCM encryption
  • short-lived proxy tokens minted through a minimal management API and CLI
  • per-process in-memory token validation cache on the hot path
  • asynchronous usage logging through an in-memory queue with CloudWatch delivery when configured, otherwise JSONL emission to an opt-in file or stdout
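To make the token-minting shape concrete, here is a minimal sketch. The `sk-` prefix and 22-character random part match the token shape visible in the benchmark runs below; the method name, TTL default, and return shape are illustrative, not the actual generator in `lib/openai_proxy/token_generator.rb`.

```ruby
require "securerandom"
require "time"

# Illustrative sketch: mint a short-lived proxy token.
# "sk-" prefix + 16 bytes of URL-safe base64 (22 chars) is an assumption
# inferred from the benchmark tokens; the real generator may differ.
def mint_proxy_token(ttl_seconds: 3600)
  {
    token: "sk-#{SecureRandom.urlsafe_base64(16)}",
    expires_at: Time.now.utc + ttl_seconds
  }
end

mint_proxy_token(ttl_seconds: 900)
# => { token: "sk-…", expires_at: <15 minutes from now> }
```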

Changes

  • add the Rack + Puma service, routing, management endpoints, and transparent /v1/* proxy behavior
  • add the management CLI for project listing, project updates, and token minting against the same management API
  • add MySQL persistence for projects and tokens, plus encrypted upstream API key storage
  • add explicit hot-path profiling and benchmark helpers, including llm-proxy benchmark compatibility via upstream timing headers
  • add buffered and streaming response handling with usage extraction from JSON and SSE-style responses
  • add browser-facing CORS support, including configurable allowed origins and preview-host handling
  • add a per-process in-memory token cache
  • add async observability delivery with CloudWatch as the configured sink and JSONL fallback logging to file or stdout when CloudWatch is not configured
  • add Docker, Compose, CI workflows, RuboCop, and RSpec coverage for application, proxy, CLI, observability, and integration paths
  • add Ruby 4 runtime/tooling support and a multi-stage runtime image
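The encryption-at-rest piece can be sketched as follows. This is a simplified stand-in for `lib/openai_proxy/project_api_key_cipher.rb`: the record layout (IV + auth tag + ciphertext, base64-encoded) and key handling are assumptions, but the AES-256-GCM primitive matches the PR.

```ruby
require "openssl"
require "base64"

# Sketch of AES-256-GCM encryption for a stored upstream API key.
# Layout assumption: base64( 12-byte IV || 16-byte auth tag || ciphertext ).
def encrypt_api_key(plaintext, key)
  cipher = OpenSSL::Cipher.new("aes-256-gcm").encrypt
  cipher.key = key
  iv = cipher.random_iv
  ciphertext = cipher.update(plaintext) + cipher.final
  # Persist IV and auth tag next to the ciphertext so decryption can verify integrity.
  Base64.strict_encode64(iv + cipher.auth_tag + ciphertext)
end

def decrypt_api_key(encoded, key)
  raw = Base64.strict_decode64(encoded)
  iv, tag, ciphertext = raw[0, 12], raw[12, 16], raw[28..]
  decipher = OpenSSL::Cipher.new("aes-256-gcm").decrypt
  decipher.key = key
  decipher.iv = iv
  decipher.auth_tag = tag
  decipher.update(ciphertext) + decipher.final
end
```

Decryption fails loudly (raises) if the ciphertext or tag was tampered with, which is the point of GCM over plain CBC here.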

Architectural Direction

  • Keep the request path minimal: validate token cheaply, resolve the upstream API key, and forward the request.
  • Avoid external systems on the hot path beyond MySQL fallback and the upstream OpenAI-compatible request.
  • Keep observability off-path, but always on: queue in memory, then ship to CloudWatch or emit structured JSON lines locally.
  • Do not broaden scope into admin UI, provider abstraction, or response caching in this MVP.
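The "validate token cheaply, fall back to MySQL" direction reduces to a small TTL cache on the hot path. This sketch is illustrative (class and method names are not the actual `lib/openai_proxy/token_cache.rb` API); the Mutex matters because Puma serves requests from multiple threads per process.

```ruby
# Minimal per-process token cache with TTL. On a miss or expiry, the block
# (e.g. a MySQL lookup via the token repository) supplies the fresh value.
class InMemoryTokenCache
  Entry = Struct.new(:value, :expires_at)

  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @entries = {}
    @mutex = Mutex.new
    @clock = clock
  end

  def fetch(token, ttl: 60)
    @mutex.synchronize do
      entry = @entries[token]
      return entry.value if entry && entry.expires_at > @clock.call
      @entries.delete(token) # expired: drop before falling through to the block
    end
    value = yield # cold path, e.g. MySQL lookup
    @mutex.synchronize { @entries[token] = Entry.new(value, @clock.call + ttl) }
    value
  end
end
```

Being per-process, the cache needs no invalidation protocol; a short TTL bounds how long a revoked token can keep validating.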

Testing

  • bundle exec rspec
  • make lint
  • make coverage
  • npm run build in sofatutor/.cobain/cdk for the matching Cobain stack cleanup
  • local benchmark validation through llm-proxy benchmark against the compose deployment

Notes

  • main already contains the initial repo bootstrap; this PR contains the actual implementation.
  • The related sofatutor Cobain stack was updated to match the current openai-proxy deployment assumptions.

@mfittko mfittko self-assigned this Mar 31, 2026
@mfittko mfittko requested a review from Ayushi1296 March 31, 2026 18:09

mfittko commented Mar 31, 2026

Tested locally using llm-proxy benchmark suite:

PROXY_TOKEN='sk-JHjWPxjirUU3Hl5UFfQIZw' /Users/manuelfittko/github/llm-proxy/bin/llm-proxy benchmark --base-url http://127.0.0.1:18080 --endpoint /v1/chat/completions --method POST --token-env PROXY_TOKEN --requests 1000 --concurrency 50 --json '{"model":"gpt-4.1-nano","messages":[{"role":"user","content":"Reply with one word: ping"}],"max_tokens":8}'
Requests sent: 1000, completed: 1000, failed: 1
+------------------------------------------------+
| Total requests        | 1000                   |
| Concurrency           | 50                     |
| Duration (s)          | 20.15                  |
| Success               | 999                    |
| Failed                | 1                      |
| Requests/sec          | 49.64                  |
| Avg latency           | 564.714ms              |
| Min latency           | 294.613ms              |
| Max latency           | 4.747s                 |
| p90 latency           | 733.548ms              |
| p90 mean latency      | 486.881ms              |
| Upstream latency avg  | 550.395ms              |
| Upstream latency min  | 292.097ms              |
| Upstream latency max  | 4.744s                 |
| Upstream latency p90  | 663.882ms              |
| Upstream latency p90 mean | 480.739ms              |
| Proxy latency avg     | 14.321ms               |
| Proxy latency min     | 1.281ms                |
| Proxy latency max     | 320.235ms              |
| Proxy latency p90     | 7.140ms                |
| Proxy latency p90 mean | 2.902ms                |
+------------------------------------------------+
| Response code                                  |
| 200                   | 999                    |
| Network error         | 1                      |
+------------------------------------------------+


mfittko commented Mar 31, 2026

I have not yet verified streaming support and a few other paths, but it's generally working with completions. CloudWatch logging is also unverified. I'll roll this out to AWS and then give it a full test later on.


Copilot AI left a comment


Pull request overview

Implements the first MVP of openai-proxy: a standalone Rack + Puma Ruby service that mints short-lived proxy tokens per project and transparently proxies /v1/* requests to OpenAI, with Redis hot-path helpers, optional response caching, and async CloudWatch usage shipping.

Changes:

  • Add core proxy application: routing, management endpoints (projects + token minting), and transparent /v1/* forwarding (buffered + streaming).
  • Add persistence + security primitives: MySQL repositories for projects/tokens and AES-256-GCM encryption for stored upstream API keys.
  • Add hot-path + ops tooling: Redis token cache, Redis HTTP response cache, CloudWatch usage worker/sink, Docker/Compose, CI workflows, RuboCop, and RSpec (unit + integration + optional real-API smoke).

Reviewed changes

Copilot reviewed 61 out of 62 changed files in this pull request and generated 7 comments.

File Description
spec/worker_spec.rb Unit coverage for observability worker batching/shutdown behavior.
spec/usage_queue_spec.rb Unit coverage for Redis-backed usage queue push/pop_batch.
spec/usage_event_builder_spec.rb Unit coverage for extracting usage from JSON + SSE responses.
spec/token_validator_spec.rb Unit coverage for token validation error cases and cache warming.
spec/token_repository_spec.rb Unit coverage for token persistence and schema bootstrapping.
spec/token_generator_spec.rb Unit coverage for token format/validation.
spec/token_cache_spec.rb Unit coverage for Redis token caching TTL/serialization.
spec/support/collecting_usage_queue.rb Test helper queue implementation for event capture assertions.
spec/streaming_body_spec.rb Unit coverage for StreamingBody chunk yielding and close semantics.
spec/spec_helper.rb RSpec + SimpleCov configuration (including branch coverage).
spec/response_cache_spec.rb Unit coverage for Redis-backed response cache (entry + alias behavior).
spec/proxy_spec.rb Proxy behavior tests: JSON, streaming, caching, and upstream failures.
spec/project_repository_spec.rb Unit coverage for project persistence and encryption-at-rest expectations.
spec/project_record_spec.rb Unit coverage for API key obfuscation helper.
spec/project_api_key_cipher_spec.rb Unit coverage for AES-GCM encrypt/decrypt and plaintext pass-through.
spec/openai_proxy_spec.rb App graph construction test ensuring singleton build + dependency wiring.
spec/integration/real_openai_spec.rb Optional real-OpenAI smoke test for end-to-end proxying.
spec/integration/compose_stack_spec.rb Compose-backed integration test for full proxy flow and caching.
spec/config_spec.rb Unit coverage for env-based config parsing and validation.
spec/cloudwatch_log_sink_spec.rb Unit coverage for CloudWatch sink enablement/stream handling/retries.
spec/cache_helpers_spec.rb Unit coverage for cache-control parsing and cache key stability.
spec/application_spec.rb Unit coverage for Rack app routing/auth/validation/proxy dispatch.
Rakefile Adds default RSpec rake task.
openai_proxy.gemspec Defines gem metadata and runtime dependencies.
Makefile Developer commands for test/coverage/lint/run and syntax checks.
lib/openai_proxy/version.rb Introduces gem version constant.
lib/openai_proxy/token_validator.rb Token validation with cache + repository lookup and error codes.
lib/openai_proxy/token_repository.rb Token persistence, lookup join, and schema bootstrap.
lib/openai_proxy/token_record.rb Token record struct with expiry and cache TTL helpers.
lib/openai_proxy/token_generator.rb Token generation + format validation.
lib/openai_proxy/token_cache.rb Redis token cache serialization/TTL behavior.
lib/openai_proxy/streaming_body.rb Streaming Rack body backed by queue + worker thread.
lib/openai_proxy/response_cache.rb Redis response cache (entry + alias indirection).
lib/openai_proxy/proxy.rb Core upstream forwarding (buffered + streaming), caching, usage capture.
lib/openai_proxy/project_repository.rb Project persistence and API key encryption integration.
lib/openai_proxy/project_record.rb Project record struct with API key obfuscation.
lib/openai_proxy/project_api_key_cipher.rb AES-256-GCM encryption/decryption for stored upstream keys.
lib/openai_proxy/observability/worker.rb Background worker loop draining Redis usage queue to sink.
lib/openai_proxy/observability/usage_queue.rb Redis list-based queue implementation for usage events.
lib/openai_proxy/observability/usage_event_builder.rb Builds usage events from request/response (incl. SSE parsing).
lib/openai_proxy/observability/cloudwatch_log_sink.rb CloudWatch Logs sink implementation for publishing usage events.
lib/openai_proxy/log_sanitizer.rb Redaction helper for logs (Bearer/sk-* patterns).
lib/openai_proxy/config.rb Environment-driven configuration (timeouts, cache, limits, etc.).
lib/openai_proxy/cache_helpers.rb Cache-control parsing, TTL decisions, and stable cache key helpers.
lib/openai_proxy/application.rb Rack application routing for health, management API, and proxying.
lib/openai_proxy.rb Top-level require + dependency graph construction (DB/Redis/worker/proxy).
Gemfile.lock Locks dependency versions for the application/gem.
Gemfile Declares dependencies for runtime and development/test groups.
exe/openai_proxy CLI entrypoint to run the Rack app via Rackup::Server.
Dockerfile Container build for running the proxy under Puma.
docker-compose.yml Local stack (proxy + MySQL + Redis) with health checks and env wiring.
docker-compose.integration.yml Integration override using an upstream echo server + cache enabled.
db/schema.sql MySQL schema for projects and tokens tables.
config/puma.rb Puma runtime configuration (bind/port, threads, workers, preload).
config.ru Rack config to run the application and shutdown resources at exit.
.rubocop.yml RuboCop configuration for Ruby 3.3 + RSpec/Performance cops.
.rspec RSpec defaults (require helper, documentation format).
.github/workflows/test.yml CI: unit (coverage) + compose-backed integration job.
.github/workflows/release.yml CI: stable-tag gated GitHub Release creation.
.github/workflows/lint.yml CI: RuboCop + syntax checks.
.github/workflows/docker.yml CI: docker build/push workflow with tag sanitization logic.
.github/scripts/release-tag.sh Tag classification/version extraction helper for releases.


mfittko commented Apr 1, 2026

I like the overall direction here — Rack + Puma, Sequel/MySQL for durable state, Redis for hot-path cache, and off-path usage shipping all make sense for the MVP.

That said, I think the scope may be a bit too broad for a first production-ready cut. The two areas that make me want to narrow scope are:

  1. Response caching - This adds a lot of semantic and operational surface area for an MVP whose main job is transparent proxying. I’d consider deferring response caching and keeping only token caching for now, unless we already know this is required for launch.

  2. Schema management at app boot - Running schema setup from application code is convenient, but it mixes runtime serving with schema lifecycle. I’d prefer an explicit migration/setup step in deploys, with the app assuming the schema already exists.
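An explicit setup step could be as small as the following sketch. The statement splitting and method name are illustrative; `db` is assumed to respond to `run` (e.g. a Sequel database handle), and the naive split on trailing `;` holds for the plain DDL in `db/schema.sql`.

```ruby
# Hypothetical deploy-time schema application, decoupled from app boot.
# Run once per deploy (e.g. from a rake task) before the app starts serving.
def apply_schema!(db, schema_sql)
  schema_sql
    .split(/;\s*(?:\n|\z)/)  # naive split: one statement per trailing ";"
    .map(&:strip)
    .reject(&:empty?)
    .each { |statement| db.run(statement) }
end
```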

A smaller concern, but worth noting:

• the in-process CloudWatch worker is okay for MVP, but if we expect multiple Puma workers or stricter delivery guarantees, we may eventually want to split that into a separate worker process.
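For reference, the in-process worker shape under discussion is roughly the following. Batch size, the STOP sentinel, and the sink's `publish` API are assumptions for illustration, not the actual `lib/openai_proxy/observability/worker.rb` code.

```ruby
# Background thread that drains an in-memory queue in batches and hands
# them to a sink (e.g. CloudWatch). Delivery is best-effort: events queued
# but not yet popped when the process dies are lost, which is the
# at-most-once guarantee being discussed above.
class UsageWorker
  STOP = Object.new # sentinel pushed at shutdown to unblock the pop

  def initialize(queue, sink, batch_size: 100)
    @queue = queue
    @sink = sink
    @batch_size = batch_size
  end

  def start
    @thread = Thread.new do
      batch = []
      loop do
        event = @queue.pop
        break if event.equal?(STOP)
        batch << event
        if batch.size >= @batch_size || @queue.empty?
          @sink.publish(batch)
          batch = []
        end
      end
      @sink.publish(batch) unless batch.empty? # flush on shutdown
    end
  end

  def shutdown
    @queue << STOP
    @thread&.join
  end
end
```

Splitting this into a separate process would mostly change the queue (a durable one instead of an in-memory `Queue`), not this loop.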

If the goal is to get the narrowest reliable replacement shipped quickly, I'd strongly consider trimming the first version down to:

  • transparent /v1/* proxying
  • project + token management
  • MySQL-backed persistence
  • Redis-backed token cache
  • async usage queueing

…and leave response caching / more advanced operational behavior for a follow-up PR.


korny commented Apr 1, 2026

Just a comment, since I saw the versions in the README: We should be forward looking and already use Ruby 4 and MySQL 8.4 here.


mfittko commented Apr 3, 2026

I would split the observability discussion into two parts.

  1. What I think should stay in this MVP PR
  • Keep streamed usage extraction for the SSE variants we actually need to support here.
  • In particular, keep support for:
    • /v1/responses style streams where usage arrives on response.completed
    • chat-completions style streams where a later SSE chunk carries usage
  • That part looks justified, because otherwise streamed requests across different OpenAI-style endpoints will log usage inconsistently.
  2. What I think can move to a follow-up PR
  • The fallback token-count estimation when the upstream does not emit usage at all.
  • The extra response-content reconstruction that exists mainly to support that estimation path.

So the simplification I would recommend is:

  • Keep real upstream streamed usage extraction.
  • Drop the estimated/fallback usage path for now, if we are comfortable with usage being absent in logs when the upstream response does not provide it.

That gives us a simpler and easier-to-defend MVP:

  • streamed usage is handled consistently when upstream provides it
  • we avoid carrying speculative estimation logic in the initial merge
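To make the kept behavior concrete, the streamed extraction reduces to something like this sketch. It is simplified relative to the real parsing in `lib/openai_proxy/observability/usage_event_builder.rb` (which handles more cases); the `usage` and `response.usage` locations match the two SSE variants listed above.

```ruby
require "json"

# Scan SSE data lines and keep the last chunk carrying a "usage" object.
# Covers chat-completions streams (usage on a late chunk) and /v1/responses
# streams (usage nested under "response" on response.completed).
def extract_usage_from_sse(body)
  usage = nil
  body.each_line do |line|
    next unless line.start_with?("data:")
    payload = line.delete_prefix("data:").strip
    next if payload.empty? || payload == "[DONE]"
    chunk = JSON.parse(payload) rescue next # tolerate non-JSON keep-alives
    usage = chunk["usage"] || chunk.dig("response", "usage") || usage
  end
  usage
end
```

Dropping the estimation fallback means this simply returns nil when the upstream never emits usage, which is the trade-off described above.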

Separately, for the profiling work:

  • keeping the x-upstream-request-start and x-upstream-request-stop response headers in lib/openai_proxy/proxy.rb seems fine
  • the part that still feels like follow-up scope is the broader profiler plumbing and benchmark/profiling support layered through the hot path
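The timing-header part is tiny, which is why keeping it seems fine; roughly the following sketch. The header names come from this PR, but the epoch-seconds encoding is an assumption — the benchmark only needs start/stop to separate proxy latency from upstream latency.

```ruby
# Wrap the upstream call and expose its wall-clock start/stop as response
# headers, so an external benchmark can compute proxy overhead as
# (total latency) - (stop - start).
def with_upstream_timing(headers)
  start = Process.clock_gettime(Process::CLOCK_REALTIME)
  response = yield # the upstream OpenAI-compatible request
  stop = Process.clock_gettime(Process::CLOCK_REALTIME)
  headers["x-upstream-request-start"] = format("%.6f", start)
  headers["x-upstream-request-stop"] = format("%.6f", stop)
  response
end
```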


mfittko commented Apr 3, 2026

During iteration on this PR, earlier runtime assumptions around Redis were dropped in favor of an in-process cache and an in-memory observability queue.

Some commits also reflect narrowing the MVP toward a smaller hot path and a simpler runtime model.

That context is still relevant when reading the branch history, but it is easier to keep it in comments than in the main PR description.


mfittko commented Apr 3, 2026

Updated benchmark after moving away from Redis and landing the profiling optimizations:

bin/llm-proxy benchmark --base-url http://127.0.0.1:18080 --endpoint /v1/chat/completions --method POST --token-env PROXY_TOKEN --requests 1000 --concurrency 50 --json '{"model":"gpt-4.1-nano","messages":[{"role":"user","content":"Reply with one word: ping"}],"max_tokens":8}'
Requests sent: 1000, completed: 1000, failed: 0
+------------------------------------------------+
| Total requests        | 1000                   |
| Concurrency           | 50                     |
| Duration (s)          | 11.88                  |
| Success               | 1000                   |
| Failed                | 0                      |
| Requests/sec          | 84.18                  |
| Avg latency           | 485.728ms              |
| Min latency           | 326.550ms              |
| Max latency           | 1.875s                 |
| p90 latency           | 574.351ms              |
| p90 mean latency      | 452.753ms              |
| Upstream latency avg  | 479.429ms              |
| Upstream latency min  | 325.697ms              |
| Upstream latency max  | 1.874s                 |
| Upstream latency p90  | 571.485ms              |
| Upstream latency p90 mean | 448.889ms              |
| Proxy latency avg     | 6.300ms                |
| Proxy latency min     | 532.055µs              |
| Proxy latency max     | 158.905ms              |
| Proxy latency p90     | 5.989ms                |
| Proxy latency p90 mean | 1.986ms                |
+------------------------------------------------+
| Response code                                  |
| 200                   | 1000                   |
+------------------------------------------------+
