
Conversation

Contributor

@daniyelnnr daniyelnnr commented Oct 3, 2025

What is the purpose of this pull request?

This PR introduces a modern metrics tooling layer using @vtex/diagnostics-nodejs (OpenTelemetry based) to replace the legacy MetricsAccumulator system. It provides a high-level API for recording metrics while maintaining full backward compatibility with existing code.

Key additions:

  • New DiagnosticsMetrics class with simplified API (recordLatency, incrementCounter, setGauge)
  • Attribute cardinality limiting (max 5 attributes) to prevent metrics explosion
  • Migration of internal HTTP client, HTTP handler, and GraphQL metrics to new system
  • Comprehensive test suite with a minimal-mocking approach (all tests passing)

Migration status:

  • ✅ HTTP Agent statistics (gauges)
  • ✅ Incoming request tracking (counters)
  • ✅ HTTP Client metrics (latency + counters with attributes)
  • ✅ HTTP Handler metrics (latency + counters with attributes)
  • ✅ GraphQL field metrics (@metric directive)

Both legacy and new systems run in parallel - no breaking changes for external apps.

What problem is this solving?

Current limitations:

  • Legacy MetricsAccumulator uses in-memory aggregation with snapshot-and-clear semantics
  • Metrics exported via console.log requiring log parsing
  • Client-side percentile calculation is memory-intensive
  • No cardinality control leads to potential metrics explosion
  • Dynamic metric names instead of attribute-based approach

Solution:

  • OpenTelemetry-native metrics with proper SDK handling of aggregation
  • Attribute-based metrics following OTel best practices (e.g., http_client_requests_total with status attribute)
  • Built-in cardinality protection (5-attribute limit with warnings)
  • Shared histogram (io_app_operation_duration_milliseconds) for all latencies
  • Component differentiation via component attribute (http-client, http-handler, graphql)

How should this be manually tested?

Note: This cannot be fully tested locally, as it requires the VTEX IO runtime in a cloud environment.

Testing approach:

  1. Deploy to test cluster first
  2. Verify metrics exported to observability backend
  3. Validate metric names match spec:
    • Histogram: io_app_operation_duration_milliseconds
    • Counters: http_client_requests_total, http_handler_requests_total, graphql_field_requests_total
    • Gauges: http_agent_sockets_current, http_server_requests_total
  4. Check attributes are present: component, status, status_code, route_id, client_metric, cache_state
  5. Verify legacy metrics still work (backward compatibility)
  6. Test with external IO App (e.g., render-server) to ensure no breaking changes
  7. Monitor for console warnings when global.diagnosticsMetrics is unavailable

Validation checklist:

  • Metrics appear in observability backend
  • Attribute cardinality stays within limits
  • Legacy MetricsAccumulator still functions
  • No errors in service logs
  • Performance overhead acceptable

Screenshots or example usage

New API usage:

// Recording latency (primary use case)
const start = process.hrtime()
// ... do work ...
global.diagnosticsMetrics.recordLatency(process.hrtime(start), {
  component: 'http-client',
  client_metric: 'checkout-api',
  status: 'success',
  cache_state: 'hit'
})

// Incrementing counters
global.diagnosticsMetrics.incrementCounter('http_client_requests_total', 1, {
  component: 'http-client',
  status: 'success',
  status_code: 200
})

// Setting gauge values
global.diagnosticsMetrics.setGauge('http_agent_sockets_current', 42, {})

Metric structure:

- io_app_operation_duration_milliseconds{component="http-client",client_metric="pages-api",status="success",cache_state="hit"}
- http_client_requests_total{component="http-client",status="success",status_code=200}
- http_handler_requests_total{component="http-handler",route_id="render",status="success",status_code=200}
- graphql_field_requests_total{component="graphql",field_name="products",status="success"}

Types of changes

  • Bug fix (a non-breaking change which fixes an issue)
  • New feature (a non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Requires change to documentation, which has been updated accordingly.

This commit introduces the DiagnosticsMetrics class, which provides a high-level API for recording metrics using the `@vtex/diagnostics-nodejs` library. It includes functionality for recording latency, incrementing counters, and setting gauge values while managing attribute limits to prevent high cardinality. The new metrics system is initialized in the startApp function, allowing for backward compatibility with the existing MetricsAccumulator.
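For context while reviewing, here is a minimal sketch of the shape such a class could take. The metric names, the 5-attribute limit, and `getMetricClient` come from this PR; the `MetricsClientLike` interface, the `initialize` method name, and the attribute-trimming details are assumptions, since the actual `@vtex/diagnostics-nodejs` API is not shown here.

// Sketch only: the exact @vtex/diagnostics-nodejs surface is assumed, not taken from this PR.
import { Attributes } from '@opentelemetry/api'

// Assumed shape of the client returned by the diagnostics library.
interface MetricsClientLike {
  recordHistogram(name: string, value: number, attributes: Attributes): void
  incrementCounter(name: string, value: number, attributes: Attributes): void
  setGauge(name: string, value: number, attributes: Attributes): void
}

// Assumed factory; the PR diff below awaits a getMetricClient() call with a timeout.
declare function getMetricClient(): Promise<MetricsClientLike>

const MAX_ATTRIBUTES = 5
const LATENCY_HISTOGRAM = 'io_app_operation_duration_milliseconds'

export class DiagnosticsMetrics {
  private metricsClient?: MetricsClientLike

  public async initialize(timeoutMs = 5000): Promise<void> {
    const timeoutPromise = new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error('Metrics client initialization timed out')), timeoutMs)
    )
    try {
      this.metricsClient = await Promise.race([getMetricClient(), timeoutPromise])
    } catch (err) {
      // Telemetry failures must not impact availability: leave metricsClient undefined.
      console.warn('DiagnosticsMetrics initialization failed', err)
    }
  }

  // Records an hrtime diff on the shared latency histogram, in milliseconds.
  public recordLatency(hrtimeDiff: [number, number], attributes: Attributes = {}): void {
    const millis = hrtimeDiff[0] * 1e3 + hrtimeDiff[1] / 1e6
    this.metricsClient?.recordHistogram(LATENCY_HISTOGRAM, millis, this.limit(attributes))
  }

  public incrementCounter(name: string, value = 1, attributes: Attributes = {}): void {
    this.metricsClient?.incrementCounter(name, value, this.limit(attributes))
  }

  public setGauge(name: string, value: number, attributes: Attributes = {}): void {
    this.metricsClient?.setGauge(name, value, this.limit(attributes))
  }

  // Drops attributes beyond the cardinality limit and warns about it
  // (a later commit in this PR restricts the warning to linked-app contexts).
  private limit(attributes: Attributes): Attributes {
    const entries = Object.entries(attributes)
    if (entries.length <= MAX_ATTRIBUTES) {
      return attributes
    }
    console.warn(`Attribute limit of ${MAX_ATTRIBUTES} exceeded; extra attributes were dropped.`)
    return Object.fromEntries(entries.slice(0, MAX_ATTRIBUTES))
  }
}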
This commit introduces a comprehensive test suite for the DiagnosticsMetrics class, covering initialization, latency recording, counter incrementing, and gauge setting functionalities. The tests ensure proper handling of attributes, including limits on the number of attributes, and verify that metrics are recorded correctly under various scenarios, including error handling during initialization.
This commit introduces the `updateHttpAgentMetrics` method in the HttpAgentSingleton class, which periodically exports the current state of the HTTP agent as diagnostic metrics. It reports the current number of sockets, free sockets, and pending requests as gauges. These stats are also used by `MetricsAccumulator` to produce the same metrics in the legacy log-based format.
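Roughly, the gauge export described above could look like the sketch below. Only `http_agent_sockets_current` is a metric name confirmed in this PR; the other two gauge names and the counting helper are assumptions based on Node's `http.Agent` bookkeeping.

import { Agent } from 'http'

// Assumed local view of the diagnostics instance exposed on the global object.
interface DiagnosticsMetricsLike {
  setGauge(name: string, value: number, attributes: Record<string, string | number>): void
}

// Counts entries across an http.Agent bookkeeping map (sockets, freeSockets, requests).
const countEntries = (map: { readonly [key: string]: unknown[] | undefined }): number =>
  Object.values(map).reduce((total, list) => total + (list ? list.length : 0), 0)

export class HttpAgentSingleton {
  private static httpAgent = new Agent({ keepAlive: true })

  public static updateHttpAgentMetrics(): void {
    const diagnostics = (global as any).diagnosticsMetrics as DiagnosticsMetricsLike | undefined
    if (!diagnostics) {
      return
    }
    const { sockets, freeSockets, requests } = HttpAgentSingleton.httpAgent
    // http_agent_sockets_current is the name listed in the PR; the other two names are assumed.
    diagnostics.setGauge('http_agent_sockets_current', countEntries(sockets), {})
    diagnostics.setGauge('http_agent_free_sockets_current', countEntries(freeSockets), {})
    diagnostics.setGauge('http_agent_pending_requests_current', countEntries(requests), {})
  }
}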
This commit updates the `trackStatus` function to include a call to `HttpAgentSingleton.updateHttpAgentMetrics()`, enhancing the diagnostics metrics by exporting HTTP agent statistics. This addition complements the existing legacy status tracking functionality.
This commit introduces a new test suite for the HttpAgentSingleton class, validating the functionality of HTTP agent statistics retrieval and metrics reporting. The tests cover scenarios for both populated and empty agent states, ensuring accurate gauge reporting to diagnostics metrics.
This commit updates the trackIncomingRequestStats middleware to include detailed metric reporting for total incoming requests. It introduces a cumulative counter for total requests, ensuring that metrics are reported to diagnostics when available.
This commit modifies the requestClosed function to include detailed reporting of closed request metrics to diagnostics. It ensures that metrics are incremented and reported with relevant route and status information, improving the overall monitoring capabilities of the middleware.
This commit enhances the requestAborted function to report metrics for aborted requests to diagnostics. It includes relevant route and status information, improving monitoring capabilities.
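Taken together, the three request-stats commits above amount to a middleware along these lines. `http_server_requests_total` and the aborted-request warning text appear in this PR; the closed/aborted counter names, the attributes, and the context shape are illustrative assumptions.

import { IncomingMessage } from 'http'

// Assumed local view of the diagnostics instance and of the Koa-style context.
interface DiagnosticsMetricsLike {
  incrementCounter(name: string, value?: number, attributes?: Record<string, string | number>): void
}

interface RequestStatsContext {
  req: IncomingMessage
  status: number
  vtex: { route: { id: string } }
}

const diagnostics = (): DiagnosticsMetricsLike | undefined => (global as any).diagnosticsMetrics

export async function trackIncomingRequestStats(ctx: RequestStatsContext, next: () => Promise<void>) {
  // Cumulative counter for every incoming request.
  diagnostics()?.incrementCounter('http_server_requests_total', 1, { component: 'http-handler' })

  ctx.req.once('close', () => {
    // Closed requests are reported with route and status information.
    diagnostics()?.incrementCounter('http_server_requests_closed_total', 1, {
      route_id: ctx.vtex.route.id,
      status_code: ctx.status,
    })
  })

  ctx.req.once('aborted', () => {
    const metrics = diagnostics()
    if (metrics) {
      metrics.incrementCounter('http_server_requests_aborted_total', 1, {
        route_id: ctx.vtex.route.id,
        status_code: ctx.status,
      })
    } else {
      console.warn('DiagnosticsMetrics not available. Request aborted metric not reported.')
    }
  })

  await next()
}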
This commit introduces a comprehensive test suite for the requestStats middleware, validating the functionality of tracking incoming request statistics, including total, aborted, and closed requests. The tests ensure accurate reporting to diagnostics metrics and handle various scenarios, including the absence of global diagnostics metrics.
- Add the OpenTelemetry `Attributes` import in `metricsMiddleware` to annotate HTTP client metric dimensions
- Capture the elapsed request time and feed it to `global.diagnosticsMetrics.recordLatency`
- Create and increment a request counter
- Capture the resolved cache state for each HTTP client response
  - Include `cache_state` on the latency histogram and the primary request counter
  - Emit an `http_client_cache_total` diagnostics counter to replace the legacy extension stats
- Read the retry count from config with a safe default so zero attempts are tracked
  - Retain the legacy status extension updates while preparing to deprecate them
  - Emit an `http_client_requests_retried_total` counter with base attributes for diagnostics (a sketch of the combined flow follows below)
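A sketch of the combined flow, extracted into a hypothetical `reportHttpClientMetrics` helper for readability; the metric names and attributes follow the list above, while the helper itself and the `ClientRequestInfo` shape are not part of the middleware's real structure.

import { Attributes } from '@opentelemetry/api'

interface DiagnosticsMetricsLike {
  recordLatency(hrtimeDiff: [number, number], attributes?: Attributes): void
  incrementCounter(name: string, value?: number, attributes?: Attributes): void
}

// Invented shape for the sketch; the real middleware reads these values from its config and response.
interface ClientRequestInfo {
  clientMetric: string
  status: 'success' | 'error'
  statusCode: number
  cacheState: string
  retries: number
}

export function reportHttpClientMetrics(start: [number, number], info: ClientRequestInfo): void {
  const diagnostics = (global as any).diagnosticsMetrics as DiagnosticsMetricsLike | undefined
  if (!diagnostics) {
    console.warn('DiagnosticsMetrics not available. HTTP client metrics not reported.')
    return
  }

  const baseAttributes: Attributes = {
    component: 'http-client',
    client_metric: info.clientMetric,
    status: info.status,
  }

  // Elapsed time goes to the shared histogram, tagged with cache_state.
  diagnostics.recordLatency(process.hrtime(start), { ...baseAttributes, cache_state: info.cacheState })

  // Primary request counter, carrying the HTTP status code and cache state.
  diagnostics.incrementCounter('http_client_requests_total', 1, {
    ...baseAttributes,
    status_code: info.statusCode,
    cache_state: info.cacheState,
  })

  // Cache counter that replaces the legacy extension stats.
  diagnostics.incrementCounter('http_client_cache_total', 1, { ...baseAttributes, cache_state: info.cacheState })

  // Retry count is read from config with a safe default of 0, so zero-attempt requests are tracked too.
  diagnostics.incrementCounter('http_client_requests_retried_total', info.retries, baseAttributes)
}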
This commit introduces a test suite for the metricsMiddleware, validating the functionality of metrics recording for various HTTP client scenarios, including successful requests, cache hits, retries, and error handling.
The HTTP timings middleware now takes advantage of DiagnosticsMetrics whenever it is available, so we emit request latency and counter telemetry with stable OpenTelemetry-style attributes. This ensures that `http-handler` metrics from the legacy log-based approach are properly instrumented with diagnostics patterns. Legacy batching remains in place for backwards compatibility, and we log a warning when the diagnostics client is missing to make instrumentation gaps visible.

 - Extend context destructuring to capture the route type for tagging
 - Record latency histograms for every request and keep legacy batching intact
 - Increment the new http_handler_requests_total counter with consistent attributes (see the sketch after this list)
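A condensed sketch of that flow, with the legacy batching omitted. The `http_handler_requests_total` name and the component, route_id, status, and status_code attributes come from this PR; the `route_type` attribute name and the success/error classification are assumptions.

interface DiagnosticsMetricsLike {
  recordLatency(hrtimeDiff: [number, number], attributes?: Record<string, string | number>): void
  incrementCounter(name: string, value?: number, attributes?: Record<string, string | number>): void
}

// Minimal context shape for the sketch; the real middleware destructures the service context,
// including the route type mentioned above.
interface TimingsContext {
  status: number
  vtex: { route: { id: string; type: string } }
}

export async function timings(ctx: TimingsContext, next: () => Promise<void>) {
  const start = process.hrtime()
  await next()

  const diagnostics = (global as any).diagnosticsMetrics as DiagnosticsMetricsLike | undefined
  if (!diagnostics) {
    // Make instrumentation gaps visible, as described above.
    console.warn('DiagnosticsMetrics not available. HTTP handler metrics not reported.')
    return
  }

  const attributes = {
    component: 'http-handler',
    route_id: ctx.vtex.route.id,
    route_type: ctx.vtex.route.type, // attribute name assumed; the PR only says the route type is captured
    status: ctx.status >= 400 ? 'error' : 'success', // simplified classification
    status_code: ctx.status,
  }

  // Latency goes to the shared io_app_operation_duration_milliseconds histogram;
  // legacy batching (not shown) stays in place alongside these calls.
  diagnostics.recordLatency(process.hrtime(start), attributes)
  diagnostics.incrementCounter('http_handler_requests_total', 1, attributes)
}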
This commit introduces a test suite for the timings middleware, validating the recording of diagnostics metrics for various HTTP request scenarios, including successful requests, error responses, and legacy compatibility. The tests ensure accurate logging of timing and billing information, as well as proper handling of different status codes and graceful degradation when diagnostics metrics are unavailable.
This commit updates the Metric schema directive to integrate new diagnostics metrics, allowing for the recording of latency histograms and counters with stable attributes. It maintains backward compatibility with legacy metrics while providing warnings when diagnostics metrics are unavailable. Key changes include:

- Refactoring status handling for clarity
- Adding attributes for component, field name, and status in diagnostics metrics
- Recording latency for all requests and incrementing a counter for GraphQL field requests (a sketch follows below)
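A sketch of the resulting resolver wrapper, assuming the graphql-tools `SchemaDirectiveVisitor` style; the metric and attribute names follow this PR, and everything else (including the omitted legacy batching) is illustrative.

import { defaultFieldResolver, GraphQLField } from 'graphql'
import { SchemaDirectiveVisitor } from 'graphql-tools'

interface DiagnosticsMetricsLike {
  recordLatency(hrtimeDiff: [number, number], attributes?: Record<string, string | number>): void
  incrementCounter(name: string, value?: number, attributes?: Record<string, string | number>): void
}

export class Metric extends SchemaDirectiveVisitor {
  public visitFieldDefinition(field: GraphQLField<any, any>) {
    const { resolve = defaultFieldResolver } = field
    field.resolve = async (root, args, context, info) => {
      const start = process.hrtime()
      let status = 'success'
      try {
        return await resolve(root, args, context, info)
      } catch (err) {
        status = 'error'
        throw err
      } finally {
        const diagnostics = (global as any).diagnosticsMetrics as DiagnosticsMetricsLike | undefined
        if (diagnostics) {
          const attributes = { component: 'graphql', field_name: field.name, status }
          // Latency on the shared histogram plus a per-field counter, as described above.
          diagnostics.recordLatency(process.hrtime(start), attributes)
          diagnostics.incrementCounter('graphql_field_requests_total', 1, attributes)
        } else {
          console.warn('DiagnosticsMetrics not available. GraphQL field metric not reported.')
        }
      }
    }
  }
}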
This commit introduces a comprehensive test suite for the Metric schema directive, validating the recording of metrics for both successful and failed GraphQL field resolutions. The tests ensure accurate logging of latency and error metrics, while also checking for proper handling when diagnostics metrics are unavailable. Key additions include:

- Tests for successful field resolution metrics
- Tests for failed field resolution metrics
- Warning checks when diagnostics metrics are not present
@daniyelnnr daniyelnnr requested a review from a team October 3, 2025 20:37
@daniyelnnr daniyelnnr self-assigned this Oct 3, 2025
@daniyelnnr daniyelnnr requested review from wisneycardeal and removed request for a team October 3, 2025 20:37
@daniyelnnr daniyelnnr requested review from a team and removed request for wisneycardeal October 3, 2025 20:37
Contributor

@arturpimentel arturpimentel left a comment

Phew!! That was a long one 🤣

I admit that at first I tried to read everything, but by the 6th commit onwards I started to skim the test files. So don't count only on me in this one 😬

From what I gathered, this is thoroughly tested, which is awesome. The implementation files look good too. Nicely done!

Comment on lines +102 to +105
this.metricsClient = await Promise.race([
  getMetricClient(),
  timeoutPromise
])
Contributor

Out of curiosity: what happens if this initialization times out? Do we try again later?

Contributor Author

In this case, if the initialization times out, there's no retry.
This rationale was chosen to align with what's already being done in the logs scope, but we can rethink this, no problem. If the timeout occurs, the metricsClient remains undefined and the failure is reported in the console, so the application continues normally without crashing. This way, telemetry failures don't impact availability and requests can still be served by the application.

Contributor

But that also means that if telemetry is not available at startup, then we would need to restart the affected pods to re-enable it once it's back, right? I would consider a retry mechanism (maybe with some backoff) so we can ease our life when this situation happens.
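For reference, a retry with backoff along those lines could look roughly like the sketch below; `getMetricClient` is the factory awaited in the diff above, and the attempt cap and delays are placeholders.

// Hypothetical helper, not part of this PR.
async function initMetricsClientWithRetry<T>(
  getMetricClient: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000
): Promise<T | undefined> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await getMetricClient()
    } catch (err) {
      // Exponential backoff between attempts.
      const delay = baseDelayMs * 2 ** (attempt - 1)
      console.warn(`Metrics client init failed (attempt ${attempt}/${maxAttempts}); retrying in ${delay}ms`, err)
      await new Promise((resolve) => setTimeout(resolve, delay))
    }
  }
  console.warn('Metrics client init failed after all attempts; metrics stay disabled for this pod.')
  return undefined
}

Kicking this off in the background instead of awaiting it during startup would keep pod boot non-blocking while still recovering telemetry once the backend is reachable.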

Contributor

BTW this can also be a FUP

    status_code: statusCode,
  })
} else {
  console.warn('DiagnosticsMetrics not available. Request aborted metric not reported.')
Contributor

Should those warnings be observed (not just printed to stdout)?

Contributor Author

I'm really not sure this should be addressed here; we'd end up with the problem of emitting metrics about missing metrics, and I'm not convinced that scope should be covered now. One option would be to treat this scenario as a follow-up and keep the stdout warnings for now. What do you think?

Contributor

@arturpimentel arturpimentel Oct 14, 2025

Ok, that's good enough for me!

This commit modifies the DiagnosticsMetrics class to log warnings about exceeded attribute limits only when running in a linked app context.
