Skip to content

feat: Integrate OpenTelemetry#189

Merged
khvn26 merged 20 commits intomainfrom
feat/otel
Apr 1, 2026
Merged

feat: Integrate OpenTelemetry#189
khvn26 merged 20 commits intomainfrom
feat/otel

Conversation

@khvn26
Copy link
Copy Markdown
Member

@khvn26 khvn26 commented Mar 30, 2026

Contributes to #182.

Adds OpenTelemetry integration to flagsmith-common, enabling trace and structured log export over OTLP.

Tracing

When OTEL_EXPORTER_OTLP_ENDPOINT is set, ensure_cli_env() configures:

  • TracerProvider with OTLP/HTTP span export, W3C TraceContext + Baggage propagation
  • DjangoInstrumentor — creates a root span per HTTP request
  • Psycopg2Instrumentor — creates child spans for each SQL query with db.system, db.statement, and db.name attributes. SQL commenter is enabled, adding trace context as SQL comments for database-side correlation.
  • RedisInstrumentor — creates child spans for each Redis command

RouteLoggerMiddleware renames spans using get_route_template (e.g. GET /api/v1/projects/{pk}/) and sets http.route to the same normalised route. URL paths can be excluded from tracing via OTEL_TRACING_EXCLUDED_URL_PATHS.

Structured logs

A custom structlog processor emits each log event as:

  1. An OTLP log record with body, event_name, severity_text, severity_number, and attributes (double-underscore keys are converted to dots, e.g. organisation__idorganisation.id). W3C baggage entries from the current OTel context are copied into log attributes. A flagsmith.event attribute duplicates the event name for backends that don't surface EventName.
  2. A span event on the active span (if one exists) with the same name and attributes, making structlog events visible in trace backends (e.g. SigNoz's Events tab).

A companion processor add_otel_trace_context injects trace_id and span_id from the active span into the structlog event dict for JSON log output.

Configuration

Variable Description Default
OTEL_EXPORTER_OTLP_ENDPOINT Base OTLP endpoint (e.g. http://collector:4318). If unset, no OTel setup occurs. (disabled)
OTEL_SERVICE_NAME The service.name resource attribute. flagsmith-api
OTEL_TRACING_EXCLUDED_URL_PATHS Comma-separated URL paths to exclude from tracing. (none)

How to review

Code structure

The entry point is src/common/core/main.py — the if otel_endpoint: block in ensure_cli_env() conditionally imports and wires up OTel. When the env var is unset, no OTel code is imported.

All OTel logic lives in src/common/core/otel.py:

  • setup_tracing() — context manager that sets the global TracerProvider, configures W3C propagators, and instruments Django, psycopg2, and Redis. Teardown uninstruments in reverse order and shuts down the provider.
  • make_structlog_otel_processor() — factory that returns a structlog processor. The processor maps event dicts to OTLP log records (attribute key conversion, reserved key filtering, EventName construction, baggage propagation) and attaches them as span events on the active span. The flagsmith.event attribute is a workaround for SigNoz not surfacing OTel's EventName field.
  • build_tracer_provider() / build_otel_log_provider() — factory functions for the trace and log providers, both using batch processors with OTLP/HTTP export.
  • add_otel_trace_context — structlog processor that injects trace_id/span_id into the event dict.

src/common/gunicorn/middleware.py has one addition: span.set_attribute("http.route", route_template) — overrides the raw regex from Django's instrumentor with the normalised route template.

pyproject.toml adds opentelemetry-instrumentation-psycopg2, opentelemetry-instrumentation-redis, and redis to the common-core extra.

Testing

  • Unit tests (tests/unit/common/core/test_otel.py) — test the structlog processor in isolation using InMemoryLogExporter, and verify instrumentor setup/teardown via mocks.
  • Integration tests (tests/integration/core/test_otel.py) — test the full stack with real Django test client, real database, and InMemorySpanExporter. The tracing tests assert the complete span output (exact span names, kinds, attributes, parent-child relationships, SQL statements) rather than spot-checking individual fields.

The integration tracing tests use a module-scoped fixture because OTel only allows setting the global TracerProvider once per process. Per-test isolation is achieved via the span_exporter fixture which clears the exporter between tests.

@khvn26 khvn26 requested a review from a team as a code owner March 30, 2026 18:59
@khvn26 khvn26 requested review from emyller and removed request for a team March 30, 2026 18:59
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.88%. Comparing base (2e0459d) to head (7657026).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #189      +/-   ##
==========================================
+ Coverage   96.48%   96.88%   +0.40%     
==========================================
  Files          97      101       +4     
  Lines        3552     3978     +426     
==========================================
+ Hits         3427     3854     +427     
+ Misses        125      124       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@khvn26 khvn26 mentioned this pull request Mar 31, 2026
Copy link
Copy Markdown
Contributor

@emyller emyller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. Left a few questions around, none a blocker I believe. Also, couple non-action feedback:

  • I think context-manager-based design makes it trippy to understand what's going on.
  • Both unit and integration tests look very similar in purpose sometimes. Feels duplicated — but I also appreciate both unit and integration test suites are accomplishing their goal here. Still, in practice, are we maybe probing for the same thing twice?

@khvn26
Copy link
Copy Markdown
Member Author

khvn26 commented Apr 1, 2026

@emyller

I think context-manager-based design makes it trippy to understand what's going on.

It's essential to handle graceful shutdowns in a manageable way. Without this, we'll lose spans.

Both unit and integration tests look very similar in purpose sometimes

Fair observation. The difference is that unit tests configure structlog directly while integration tests go through setup_logging().

I inventoried all tests and found three unit tests that were genuinely redundant with integration counterparts.

Dropped all three in 7657026. The remaining unit tests cover edge cases the integration tests don't exercise: kebab-case normalisation, dunder vs single underscore distinction, all 5 severity levels (parametrised), module name fallback when no logger name is given, span event emission, and the factory/lifecycle functions.

@khvn26 khvn26 requested a review from emyller April 1, 2026 12:10
Copy link
Copy Markdown
Contributor

@emyller emyller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LET'S GO!

@khvn26 khvn26 merged commit 2134b53 into main Apr 1, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants