
Conversation

Contributor

@daniyelnnr daniyelnnr commented Oct 3, 2025

What is the purpose of this pull request?

This PR introduces a modern metrics tooling layer using @vtex/diagnostics-nodejs (OpenTelemetry based) to replace the legacy MetricsAccumulator system. It provides a high-level API for recording metrics while maintaining full backward compatibility with existing code.

Key additions:

  • New DiagnosticsMetrics class with simplified API (recordLatency, incrementCounter, setGauge)
  • Attribute cardinality limiting (max 5 attributes) to prevent metrics explosion
  • Migration of internal HTTP client, HTTP handler, and GraphQL metrics to new system
  • Comprehensive test suite with a minimal-mocking approach (all tests passing)

Migration status:

  • ✅ HTTP Agent statistics (gauges)
  • ✅ Incoming request tracking (counters)
  • ✅ HTTP Client metrics (latency + counters with attributes)
  • ✅ HTTP Handler metrics (latency + counters with attributes)
  • ✅ GraphQL field metrics (@metric directive)

Both legacy and new systems run in parallel - no breaking changes for external apps.

What problem is this solving?

Current limitations:

  • Legacy MetricsAccumulator uses in-memory aggregation with snapshot-and-clear semantics
  • Metrics exported via console.log requiring log parsing
  • Client-side percentile calculation is memory-intensive
  • No cardinality control leads to potential metrics explosion
  • Dynamic metric names instead of attribute-based approach

Solution:

  • OpenTelemetry-native metrics with proper SDK handling of aggregation
  • Attribute-based metrics following OTel best practices (e.g., http_client_requests_total with status attribute)
  • Built-in cardinality protection (5-attribute limit with warnings)
  • Shared histogram (io_app_operation_duration_milliseconds) for all latencies
  • Component differentiation via component attribute (http-client, http-handler, graphql)

How should this be manually tested?

Note: This cannot be fully tested locally, as it requires the VTEX IO runtime in a cloud environment.

Testing approach:

  1. Deploy to test cluster first
  2. Verify metrics exported to observability backend
  3. Validate metric names match spec:
    • Histogram: io_app_operation_duration_milliseconds
    • Counters: http_client_requests_total, http_handler_requests_total, graphql_field_requests_total
    • Gauges: http_agent_sockets_current, http_server_requests_total
  4. Check attributes are present: component, status, status_code, route_id, client_metric, cache_state
  5. Verify legacy metrics still work (backward compatibility)
  6. Test with external IO App (e.g., render-server) to ensure no breaking changes
  7. Monitor for console warnings when global.diagnosticsMetrics is unavailable

Validation checklist:

  • Metrics appear in observability backend
  • Attribute cardinality stays within limits
  • Legacy MetricsAccumulator still functions
  • No errors in service logs
  • Performance overhead acceptable

Screenshots or example usage

New API usage:

// Recording latency (primary use case)
const start = process.hrtime()
// ... do work ...
global.diagnosticsMetrics.recordLatency(process.hrtime(start), {
  component: 'http-client',
  client_metric: 'checkout-api',
  status: 'success',
  cache_state: 'hit'
})

// Incrementing counters
global.diagnosticsMetrics.incrementCounter('http_client_requests_total', 1, {
  component: 'http-client',
  status: 'success',
  status_code: 200
})

// Setting gauge values
global.diagnosticsMetrics.setGauge('http_agent_sockets_current', 42, {})

Metric structure:

- io_app_operation_duration_milliseconds{component="http-client",client_metric="pages-api",status="success",cache_state="hit"}
- http_client_requests_total{component="http-client",status="success",status_code=200}
- http_handler_requests_total{component="http-handler",route_id="render",status="success",status_code=200}
- graphql_field_requests_total{component="graphql",field_name="products",status="success"}

Types of changes

  • Bug fix (a non-breaking change which fixes an issue)
  • New feature (a non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Requires change to documentation, which has been updated accordingly.

This commit introduces the DiagnosticsMetrics class, which provides a high-level API for recording metrics using the `@vtex/diagnostics-nodejs` library. It includes functionality for recording latency, incrementing counters, and setting gauge values while managing attribute limits to prevent high cardinality. The new metrics system is initialized in the startApp function, allowing for backward compatibility with the existing MetricsAccumulator.
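For context while reviewing, here is a minimal sketch of the shape such a class could take. The metric names, the 5-attribute limit, and `getMetricClient` come from this PR; the `MetricsClientLike` interface, the `initialize` method name, and the attribute-trimming details are assumptions, since the actual `@vtex/diagnostics-nodejs` API is not shown here.

// Sketch only: the exact @vtex/diagnostics-nodejs surface is assumed, not taken from this PR.
import { Attributes } from '@opentelemetry/api'

// Assumed shape of the client returned by the diagnostics library.
interface MetricsClientLike {
  recordHistogram(name: string, value: number, attributes: Attributes): void
  incrementCounter(name: string, value: number, attributes: Attributes): void
  setGauge(name: string, value: number, attributes: Attributes): void
}

// Assumed factory; the PR diff below awaits a getMetricClient() call with a timeout.
declare function getMetricClient(): Promise<MetricsClientLike>

const MAX_ATTRIBUTES = 5
const LATENCY_HISTOGRAM = 'io_app_operation_duration_milliseconds'

export class DiagnosticsMetrics {
  private metricsClient?: MetricsClientLike

  public async initialize(timeoutMs = 5000): Promise<void> {
    const timeoutPromise = new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error('Metrics client initialization timed out')), timeoutMs)
    )
    try {
      this.metricsClient = await Promise.race([getMetricClient(), timeoutPromise])
    } catch (err) {
      // Telemetry failures must not impact availability: leave metricsClient undefined.
      console.warn('DiagnosticsMetrics initialization failed', err)
    }
  }

  // Records an hrtime diff on the shared latency histogram, in milliseconds.
  public recordLatency(hrtimeDiff: [number, number], attributes: Attributes = {}): void {
    const millis = hrtimeDiff[0] * 1e3 + hrtimeDiff[1] / 1e6
    this.metricsClient?.recordHistogram(LATENCY_HISTOGRAM, millis, this.limit(attributes))
  }

  public incrementCounter(name: string, value = 1, attributes: Attributes = {}): void {
    this.metricsClient?.incrementCounter(name, value, this.limit(attributes))
  }

  public setGauge(name: string, value: number, attributes: Attributes = {}): void {
    this.metricsClient?.setGauge(name, value, this.limit(attributes))
  }

  // Drops attributes beyond the cardinality limit and warns about it
  // (a later commit in this PR restricts the warning to linked-app contexts).
  private limit(attributes: Attributes): Attributes {
    const entries = Object.entries(attributes)
    if (entries.length <= MAX_ATTRIBUTES) {
      return attributes
    }
    console.warn(`Attribute limit of ${MAX_ATTRIBUTES} exceeded; extra attributes were dropped.`)
    return Object.fromEntries(entries.slice(0, MAX_ATTRIBUTES))
  }
}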
This commit introduces a comprehensive test suite for the DiagnosticsMetrics class, covering initialization, latency recording, counter incrementing, and gauge setting functionalities. The tests ensure proper handling of attributes, including limits on the number of attributes, and verify that metrics are recorded correctly under various scenarios, including error handling during initialization.
This commit introduces the `updateHttpAgentMetrics` method in the HttpAgentSingleton class, which periodically exports the current state of the HTTP agent as diagnostic metrics. It reports the current number of sockets, free sockets, and pending requests as gauges. These stats are also used by `MetricsAccumulator` to produce the same metrics in the legacy log-based format.
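Roughly, the gauge export described above could look like the sketch below. Only `http_agent_sockets_current` is a metric name confirmed in this PR; the other two gauge names and the counting helper are assumptions based on Node's `http.Agent` bookkeeping.

import { Agent } from 'http'

// Assumed local view of the diagnostics instance exposed on the global object.
interface DiagnosticsMetricsLike {
  setGauge(name: string, value: number, attributes: Record<string, string | number>): void
}

// Counts entries across an http.Agent bookkeeping map (sockets, freeSockets, requests).
const countEntries = (map: { readonly [key: string]: unknown[] | undefined }): number =>
  Object.values(map).reduce((total, list) => total + (list ? list.length : 0), 0)

export class HttpAgentSingleton {
  private static httpAgent = new Agent({ keepAlive: true })

  public static updateHttpAgentMetrics(): void {
    const diagnostics = (global as any).diagnosticsMetrics as DiagnosticsMetricsLike | undefined
    if (!diagnostics) {
      return
    }
    const { sockets, freeSockets, requests } = HttpAgentSingleton.httpAgent
    // http_agent_sockets_current is the name listed in the PR; the other two names are assumed.
    diagnostics.setGauge('http_agent_sockets_current', countEntries(sockets), {})
    diagnostics.setGauge('http_agent_free_sockets_current', countEntries(freeSockets), {})
    diagnostics.setGauge('http_agent_pending_requests_current', countEntries(requests), {})
  }
}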
This commit updates the `trackStatus` function to include a call to `HttpAgentSingleton.updateHttpAgentMetrics()`, enhancing the diagnostics metrics by exporting HTTP agent statistics. This addition complements the existing legacy status tracking functionality.
This commit introduces a new test suite for the HttpAgentSingleton class, validating the functionality of HTTP agent statistics retrieval and metrics reporting. The tests cover scenarios for both populated and empty agent states, ensuring accurate gauge reporting to diagnostics metrics.
This commit updates the trackIncomingRequestStats middleware to include detailed metric reporting for total incoming requests. It introduces a cumulative counter for total requests, ensuring that metrics are reported to diagnostics when available.
This commit modifies the requestClosed function to include detailed reporting of closed request metrics to diagnostics. It ensures that metrics are incremented and reported with relevant route and status information, improving the overall monitoring capabilities of the middleware.
This commit enhances the requestAborted function to report metrics for aborted requests to diagnostics. It includes relevant route and status information, improving monitoring capabilities.
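Taken together, the three request-stats commits above amount to a middleware along these lines. `http_server_requests_total` and the aborted-request warning text appear in this PR; the closed/aborted counter names, the attributes, and the context shape are illustrative assumptions.

import { IncomingMessage } from 'http'

// Assumed local view of the diagnostics instance and of the Koa-style context.
interface DiagnosticsMetricsLike {
  incrementCounter(name: string, value?: number, attributes?: Record<string, string | number>): void
}

interface RequestStatsContext {
  req: IncomingMessage
  status: number
  vtex: { route: { id: string } }
}

const diagnostics = (): DiagnosticsMetricsLike | undefined => (global as any).diagnosticsMetrics

export async function trackIncomingRequestStats(ctx: RequestStatsContext, next: () => Promise<void>) {
  // Cumulative counter for every incoming request.
  diagnostics()?.incrementCounter('http_server_requests_total', 1, { component: 'http-handler' })

  ctx.req.once('close', () => {
    // Closed requests are reported with route and status information.
    diagnostics()?.incrementCounter('http_server_requests_closed_total', 1, {
      route_id: ctx.vtex.route.id,
      status_code: ctx.status,
    })
  })

  ctx.req.once('aborted', () => {
    const metrics = diagnostics()
    if (metrics) {
      metrics.incrementCounter('http_server_requests_aborted_total', 1, {
        route_id: ctx.vtex.route.id,
        status_code: ctx.status,
      })
    } else {
      console.warn('DiagnosticsMetrics not available. Request aborted metric not reported.')
    }
  })

  await next()
}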
This commit introduces a comprehensive test suite for the requestStats middleware, validating the functionality of tracking incoming request statistics, including total, aborted, and closed requests. The tests ensure accurate reporting to diagnostics metrics and handle various scenarios, including the absence of global diagnostics metrics.
- Add the OpenTelemetry `Attributes` import in `metricsMiddleware` to annotate HTTP client metric dimensions
- Capture the elapsed request time and feed it to `global.diagnosticsMetrics.recordLatency`
- Create and increment a request counter
- Capture the resolved cache state for each HTTP client response
  - Include `cache_state` on the latency histogram and the primary request counter
  - Emit an `http_client_cache_total` diagnostics counter to replace the legacy extension stats
- Read the retry count from config with a safe default so zero attempts are tracked
  - Retain the legacy status extension updates while preparing to deprecate them
  - Emit an `http_client_requests_retried_total` counter with base attributes for diagnostics (a sketch of the combined flow follows below)
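A sketch of the combined flow, extracted into a hypothetical `reportHttpClientMetrics` helper for readability; the metric names and attributes follow the list above, while the helper itself and the `ClientRequestInfo` shape are not part of the middleware's real structure.

import { Attributes } from '@opentelemetry/api'

interface DiagnosticsMetricsLike {
  recordLatency(hrtimeDiff: [number, number], attributes?: Attributes): void
  incrementCounter(name: string, value?: number, attributes?: Attributes): void
}

// Invented shape for the sketch; the real middleware reads these values from its config and response.
interface ClientRequestInfo {
  clientMetric: string
  status: 'success' | 'error'
  statusCode: number
  cacheState: string
  retries: number
}

export function reportHttpClientMetrics(start: [number, number], info: ClientRequestInfo): void {
  const diagnostics = (global as any).diagnosticsMetrics as DiagnosticsMetricsLike | undefined
  if (!diagnostics) {
    console.warn('DiagnosticsMetrics not available. HTTP client metrics not reported.')
    return
  }

  const baseAttributes: Attributes = {
    component: 'http-client',
    client_metric: info.clientMetric,
    status: info.status,
  }

  // Elapsed time goes to the shared histogram, tagged with cache_state.
  diagnostics.recordLatency(process.hrtime(start), { ...baseAttributes, cache_state: info.cacheState })

  // Primary request counter, carrying the HTTP status code and cache state.
  diagnostics.incrementCounter('http_client_requests_total', 1, {
    ...baseAttributes,
    status_code: info.statusCode,
    cache_state: info.cacheState,
  })

  // Cache counter that replaces the legacy extension stats.
  diagnostics.incrementCounter('http_client_cache_total', 1, { ...baseAttributes, cache_state: info.cacheState })

  // Retry count is read from config with a safe default of 0, so zero-attempt requests are tracked too.
  diagnostics.incrementCounter('http_client_requests_retried_total', info.retries, baseAttributes)
}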
This commit introduces a test suite for the metricsMiddleware, validating the functionality of metrics recording for various HTTP client scenarios, including successful requests, cache hits, retries, and error handling.
The HTTP timings middleware now takes advantage of DiagnosticsMetrics whenever it is available, so we emit request latency and counter telemetry with stable OpenTelemetry-style attributes. This ensures that `http-handler` metrics from the legacy log-based approach are properly instrumented with diagnostics patterns. Legacy batching remains in place for backwards compatibility, and we log a warning when the diagnostics client is missing to make instrumentation gaps visible.

 - Extend context destructuring to capture the route type for tagging
 - Record latency histograms for every request and keep legacy batching intact
 - Increment the new http_handler_requests_total counter with consistent attributes (see the sketch after this list)
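A condensed sketch of that flow, with the legacy batching omitted. The `http_handler_requests_total` name and the component, route_id, status, and status_code attributes come from this PR; the `route_type` attribute name and the success/error classification are assumptions.

interface DiagnosticsMetricsLike {
  recordLatency(hrtimeDiff: [number, number], attributes?: Record<string, string | number>): void
  incrementCounter(name: string, value?: number, attributes?: Record<string, string | number>): void
}

// Minimal context shape for the sketch; the real middleware destructures the service context,
// including the route type mentioned above.
interface TimingsContext {
  status: number
  vtex: { route: { id: string; type: string } }
}

export async function timings(ctx: TimingsContext, next: () => Promise<void>) {
  const start = process.hrtime()
  await next()

  const diagnostics = (global as any).diagnosticsMetrics as DiagnosticsMetricsLike | undefined
  if (!diagnostics) {
    // Make instrumentation gaps visible, as described above.
    console.warn('DiagnosticsMetrics not available. HTTP handler metrics not reported.')
    return
  }

  const attributes = {
    component: 'http-handler',
    route_id: ctx.vtex.route.id,
    route_type: ctx.vtex.route.type, // attribute name assumed; the PR only says the route type is captured
    status: ctx.status >= 400 ? 'error' : 'success', // simplified classification
    status_code: ctx.status,
  }

  // Latency goes to the shared io_app_operation_duration_milliseconds histogram;
  // legacy batching (not shown) stays in place alongside these calls.
  diagnostics.recordLatency(process.hrtime(start), attributes)
  diagnostics.incrementCounter('http_handler_requests_total', 1, attributes)
}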
This commit introduces a test suite for the timings middleware, validating the recording of diagnostics metrics for various HTTP request scenarios, including successful requests, error responses, and legacy compatibility. The tests ensure accurate logging of timing and billing information, as well as proper handling of different status codes and graceful degradation when diagnostics metrics are unavailable.
This commit updates the Metric schema directive to integrate new diagnostics metrics, allowing for the recording of latency histograms and counters with stable attributes. It maintains backward compatibility with legacy metrics while providing warnings when diagnostics metrics are unavailable. Key changes include:

- Refactoring status handling for clarity
- Adding attributes for component, field name, and status in diagnostics metrics
- Recording latency for all requests and incrementing a counter for GraphQL field requests (a sketch follows below)
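A sketch of the resulting resolver wrapper, assuming the graphql-tools `SchemaDirectiveVisitor` style; the metric and attribute names follow this PR, and everything else (including the omitted legacy batching) is illustrative.

import { defaultFieldResolver, GraphQLField } from 'graphql'
import { SchemaDirectiveVisitor } from 'graphql-tools'

interface DiagnosticsMetricsLike {
  recordLatency(hrtimeDiff: [number, number], attributes?: Record<string, string | number>): void
  incrementCounter(name: string, value?: number, attributes?: Record<string, string | number>): void
}

export class Metric extends SchemaDirectiveVisitor {
  public visitFieldDefinition(field: GraphQLField<any, any>) {
    const { resolve = defaultFieldResolver } = field
    field.resolve = async (root, args, context, info) => {
      const start = process.hrtime()
      let status = 'success'
      try {
        return await resolve(root, args, context, info)
      } catch (err) {
        status = 'error'
        throw err
      } finally {
        const diagnostics = (global as any).diagnosticsMetrics as DiagnosticsMetricsLike | undefined
        if (diagnostics) {
          const attributes = { component: 'graphql', field_name: field.name, status }
          // Latency on the shared histogram plus a per-field counter, as described above.
          diagnostics.recordLatency(process.hrtime(start), attributes)
          diagnostics.incrementCounter('graphql_field_requests_total', 1, attributes)
        } else {
          console.warn('DiagnosticsMetrics not available. GraphQL field metric not reported.')
        }
      }
    }
  }
}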
This commit introduces a comprehensive test suite for the Metric schema directive, validating the recording of metrics for both successful and failed GraphQL field resolutions. The tests ensure accurate logging of latency and error metrics, while also checking for proper handling when diagnostics metrics are unavailable. Key additions include:

- Tests for successful field resolution metrics
- Tests for failed field resolution metrics
- Warning checks when diagnostics metrics are not present
@daniyelnnr daniyelnnr requested a review from a team October 3, 2025 20:37
@daniyelnnr daniyelnnr self-assigned this Oct 3, 2025
@daniyelnnr daniyelnnr requested review from wisneycardeal and removed request for a team October 3, 2025 20:37
@daniyelnnr daniyelnnr requested review from a team and removed request for wisneycardeal October 3, 2025 20:37
Contributor

@arturpimentel arturpimentel left a comment

Phew!! That was a long one 🤣

I admit that at first I tried to read everything, but by the 6th commit onwards I started to skim the test files. So don't count only on me in this one 😬

From what I gathered, this is thoroughly tested, which is awesome. The implementation files look good too. Nicely done!

Comment on lines +102 to +105
this.metricsClient = await Promise.race([
  getMetricClient(),
  timeoutPromise
])
Contributor

Out of curiosity: what happens if this initialization times out? Do we try again later?

Contributor Author

In this case, if the initialization times out, there's no retry.
This rationale was chosen to align with what's already being done in the logs scope, but we can rethink this, no problem. If the timeout occurs, the metricsClient remains undefined and the failure is reported in the console, so the application continues normally without crashing. This way, telemetry failures don't impact availability and requests can still be served by the application.

Contributor

But that also means that if telemetry is not available at startup, then we would need to restart the affected pods to re-enable it once it's back, right? I would consider a retry mechanism (maybe with some backoff) so we can ease our life when this situation happens.
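For reference, a retry with backoff along those lines could look roughly like the sketch below; `getMetricClient` is the factory awaited in the diff above, and the attempt cap and delays are placeholders.

// Hypothetical helper, not part of this PR.
async function initMetricsClientWithRetry<T>(
  getMetricClient: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000
): Promise<T | undefined> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await getMetricClient()
    } catch (err) {
      // Exponential backoff between attempts.
      const delay = baseDelayMs * 2 ** (attempt - 1)
      console.warn(`Metrics client init failed (attempt ${attempt}/${maxAttempts}); retrying in ${delay}ms`, err)
      await new Promise((resolve) => setTimeout(resolve, delay))
    }
  }
  console.warn('Metrics client init failed after all attempts; metrics stay disabled for this pod.')
  return undefined
}

Kicking this off in the background instead of awaiting it during startup would keep pod boot non-blocking while still recovering telemetry once the backend is reachable.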

Contributor

BTW this can also be a FUP

    status_code: statusCode,
  })
} else {
  console.warn('DiagnosticsMetrics not available. Request aborted metric not reported.')
Contributor

Should those warnings be observed (not just printed to stdout)?

Contributor Author

I'm really not sure this should be addressed here; we'd end up with the problem of emitting metrics about missing metrics, and I'm not convinced that scope should be covered now. One option would be to treat this scenario as a follow-up and keep the stdout warnings for now. What do you think?

Contributor

@arturpimentel arturpimentel Oct 14, 2025

Ok, that's good enough for me!

This commit modifies the DiagnosticsMetrics class to log warnings about exceeded attribute limits only when running in a linked app context.
