Make log-based metrics available with diagnostics tooling #613
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Conversation
This commit introduces the DiagnosticsMetrics class, which provides a high-level API for recording metrics using the `@vtex/diagnostics-nodejs` library. It includes functionality for recording latency, incrementing counters, and setting gauge values while managing attribute limits to prevent high cardinality. The new metrics system is initialized in the startApp function, allowing for backward compatibility with the existing MetricsAccumulator.
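A minimal sketch of what such a wrapper can look like. The method names (`recordLatency`, `incrementCounter`, `setGauge`) and the attribute-limit behavior come from this description; the underlying client interface and the `MAX_ATTRIBUTES` cap shown here are illustrative assumptions, not the actual `@vtex/diagnostics-nodejs` API:

```ts
import { Attributes } from '@opentelemetry/api'

// Illustrative shape of the underlying client; the real @vtex/diagnostics-nodejs
// interface may differ.
interface MetricClient {
  recordHistogram(name: string, value: number, attributes: Attributes): void
  addToCounter(name: string, delta: number, attributes: Attributes): void
  setGauge(name: string, value: number, attributes: Attributes): void
}

// Hypothetical cap used to keep metric cardinality bounded.
const MAX_ATTRIBUTES = 10

export class DiagnosticsMetrics {
  constructor(private metricsClient?: MetricClient) {}

  public recordLatency(name: string, millis: number, attributes?: Attributes) {
    this.metricsClient?.recordHistogram(name, millis, this.limitAttributes(attributes))
  }

  public incrementCounter(name: string, delta = 1, attributes?: Attributes) {
    this.metricsClient?.addToCounter(name, delta, this.limitAttributes(attributes))
  }

  public setGauge(name: string, value: number, attributes?: Attributes) {
    this.metricsClient?.setGauge(name, value, this.limitAttributes(attributes))
  }

  // Drop attributes beyond the cap instead of emitting unbounded label sets.
  private limitAttributes(attributes: Attributes = {}): Attributes {
    const entries = Object.entries(attributes)
    if (entries.length <= MAX_ATTRIBUTES) {
      return attributes
    }
    console.warn(`Attribute limit exceeded (${entries.length} > ${MAX_ATTRIBUTES}); extra attributes dropped`)
    return Object.fromEntries(entries.slice(0, MAX_ATTRIBUTES))
  }
}
```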
This commit introduces a comprehensive test suite for the DiagnosticsMetrics class, covering initialization, latency recording, counter incrementing, and gauge setting functionalities. The tests ensure proper handling of attributes, including limits on the number of attributes, and verify that metrics are recorded correctly under various scenarios, including error handling during initialization.
This commit introduces the `updateHttpAgentMetrics` method in the HttpAgentSingleton class, which periodically exports the current state of the HTTP agent as diagnostic metrics. It reports the current number of sockets, free sockets, and pending requests as gauges. These stats are used in `MetricsAccumulator` to produce the same metrics in log-based format.
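A rough sketch of how that export could look. Only `http_agent_sockets_current` appears in this PR's description; the other two gauge names and the aggregation helper are assumptions for illustration:

```ts
import { Agent } from 'http'

// Sum the per-host arrays kept by Node's http.Agent into a single total.
const countValues = (byHost: { readonly [host: string]: unknown[] | undefined }): number =>
  Object.values(byHost).reduce((total, list) => total + (list?.length ?? 0), 0)

// Illustrative version of the periodic export; names other than
// http_agent_sockets_current are assumed.
export function updateHttpAgentMetrics(agent: Agent) {
  const metrics = (global as any).diagnosticsMetrics
  if (!metrics) {
    return
  }
  metrics.setGauge('http_agent_sockets_current', countValues(agent.sockets))
  metrics.setGauge('http_agent_free_sockets_current', countValues(agent.freeSockets))
  metrics.setGauge('http_agent_pending_requests_current', countValues(agent.requests))
}
```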
This commit updates the `trackStatus` function to include a call to `HttpAgentSingleton.updateHttpAgentMetrics()`, enhancing the diagnostics metrics by exporting HTTP agent statistics. This addition complements the existing legacy status tracking functionality.
This commit introduces a new test suite for the HttpAgentSingleton class, validating the functionality of HTTP agent statistics retrieval and metrics reporting. The tests cover scenarios for both populated and empty agent states, ensuring accurate gauge reporting to diagnostics metrics.
This commit updates the trackIncomingRequestStats middleware to include detailed metric reporting for total incoming requests. It introduces a cumulative counter for total requests, ensuring that metrics are reported to diagnostics if available.
This commit modifies the requestClosed function to include detailed reporting of closed request metrics to diagnostics. It ensures that metrics are incremented and reported with relevant route and status information, improving the overall monitoring capabilities of the middleware.
This commit enhances the requestAborted function to report metrics for aborted requests to diagnostics. It includes relevant route and status information, improving monitoring capabilities.
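A hypothetical helper shared by the incoming/closed/aborted hooks could look like the sketch below. The counter name `http_server_requests_total` comes from this PR's description; the status values, attribute values, and helper signature are placeholders:

```ts
import { Attributes } from '@opentelemetry/api'

// Reports a server-side request event if the diagnostics client is wired up.
function reportServerRequest(status: 'incoming' | 'closed' | 'aborted', attributes: Attributes) {
  const metrics = (global as any).diagnosticsMetrics
  if (!metrics) {
    console.warn(`DiagnosticsMetrics not available. Request ${status} metric not reported.`)
    return
  }
  metrics.incrementCounter('http_server_requests_total', 1, { ...attributes, status })
}

// e.g. from the middleware hooks:
reportServerRequest('incoming', { route_id: 'render' })
reportServerRequest('aborted', { route_id: 'render', status_code: 499 })
reportServerRequest('closed', { route_id: 'render', status_code: 200 })
```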
This commit introduces a comprehensive test suite for the requestStats middleware, validating the functionality of tracking incoming request statistics, including total, aborted, and closed requests. The tests ensure accurate reporting to diagnostics metrics and handle various scenarios, including the absence of global diagnostics metrics.
- Add OpenTelemetry `Attributes` import in `metricsMiddleware` to annotate HTTP client metric dimensions
- Capture elapsed request time and feed it to `global.diagnosticsMetrics.recordLatency`
- Create and increment a request counter
- Capture the resolved cache state for each HTTP client response
- Include `cache_state` on latency histograms and the primary request counter
- Emit `http_client_cache_total` diagnostics counter to replace legacy extension stats
- Read retry count from config with safe default so zero attempts are tracked
- Retain legacy status extension updates while preparing to deprecate them
- Emit `http_client_requests_retried_total` counter with base attributes for diagnostics
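Taken together, the three commit summaries above describe roughly the flow sketched here. The metric and attribute names come from the commit messages; the context shape, cache-state resolution, and retry-config field are illustrative assumptions:

```ts
import { Attributes } from '@opentelemetry/api'

// Rough sketch of the client-side reporting step; `ctx` stands in for the
// HTTP client middleware context, whose real shape is not shown here.
async function metricsMiddleware(ctx: any, next: () => Promise<void>) {
  const start = process.hrtime.bigint()
  await next()
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6

  const metrics = (global as any).diagnosticsMetrics
  if (!metrics) {
    return
  }

  const baseAttributes: Attributes = {
    component: 'http-client',
    status_code: ctx.response?.status,
    cache_state: ctx.cacheHit ? 'hit' : 'miss', // assumed cache-state resolution
  }

  metrics.recordLatency('io_app_operation_duration_milliseconds', elapsedMs, baseAttributes)
  metrics.incrementCounter('http_client_requests_total', 1, baseAttributes)
  metrics.incrementCounter('http_client_cache_total', 1, baseAttributes)

  // Retry count read from config with a safe default so zero attempts are tracked too.
  const retries = ctx.config?.retries ?? 0
  metrics.incrementCounter('http_client_requests_retried_total', retries, baseAttributes)
}
```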
This commit introduces a test suite for the metricsMiddleware, validating the functionality of metrics recording for various HTTP client scenarios, including successful requests, cache hits, retries, and error handling.
The HTTP timings middleware now takes advantage of DiagnosticsMetrics whenever it is available, so we emit request latency and counter telemetry with stable OpenTelemetry-style attributes. This ensures `http-handler` metrics from the legacy log-based approach are properly instrumented with diagnostics patterns. Legacy batching remains in place for backwards compatibility, and we log a warning when the diagnostics client is missing to make instrumentation gaps visible.
- extend context destructuring to capture the route type for tagging
- record latency histograms for every request and keep legacy batching intact
- increment the new `http_handler_requests_total` counter with consistent attributes
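The handler-side counterpart of the previous sketch. The counter name and the warning follow the commit message; the function signature and parameters are assumptions about what the middleware context provides:

```ts
import { Attributes } from '@opentelemetry/api'

// Illustrative reporting step for the timings middleware.
function reportHandlerTimings(routeId: string, routeType: string, statusCode: number, elapsedMs: number) {
  const metrics = (global as any).diagnosticsMetrics
  if (!metrics) {
    console.warn('DiagnosticsMetrics not available. Handler timing metrics not reported.')
    return
  }

  const attributes: Attributes = {
    component: 'http-handler',
    route_id: routeId,
    route_type: routeType,
    status_code: statusCode,
  }

  metrics.recordLatency('io_app_operation_duration_milliseconds', elapsedMs, attributes)
  metrics.incrementCounter('http_handler_requests_total', 1, attributes)
}
```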
This commit introduces a test suite for the timings middleware, validating the recording of diagnostics metrics for various HTTP request scenarios, including successful requests, error responses, and legacy compatibility. The tests ensure accurate logging of timing and billing information, as well as proper handling of different status codes and graceful degradation when diagnostics metrics are unavailable.
This commit updates the Metric schema directive to integrate new diagnostics metrics, allowing for the recording of latency histograms and counters with stable attributes. It maintains backward compatibility with legacy metrics while providing warnings when diagnostics metrics are unavailable. Key changes include:
- Refactoring status handling for clarity
- Adding attributes for component, field name, and status in diagnostics metrics
- Recording latency for all requests and incrementing a counter for GraphQL field requests
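A sketch of the kind of resolver wrapping such a directive can apply. `graphql_field_requests_total` and the component/field/status attributes are named in this PR; everything else (helper name, exact attribute keys) is illustrative:

```ts
import { defaultFieldResolver, GraphQLField } from 'graphql'

// Wraps a field resolver to record latency and a per-field counter.
export function wrapFieldWithMetrics(field: GraphQLField<any, any>) {
  const { resolve = defaultFieldResolver, name } = field

  field.resolve = async (root, args, context, info) => {
    const start = Date.now()
    let status = 'success'
    try {
      return await resolve(root, args, context, info)
    } catch (error) {
      status = 'error'
      throw error
    } finally {
      const metrics = (global as any).diagnosticsMetrics
      if (metrics) {
        const attributes = { component: 'graphql', field_name: name, status }
        metrics.recordLatency('io_app_operation_duration_milliseconds', Date.now() - start, attributes)
        metrics.incrementCounter('graphql_field_requests_total', 1, attributes)
      } else {
        console.warn('DiagnosticsMetrics not available. GraphQL field metric not reported.')
      }
    }
  }
}
```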
This commit introduces a comprehensive test suite for the Metric schema directive, validating the recording of metrics for both successful and failed GraphQL field resolutions. The tests ensure accurate logging of latency and error metrics, while also checking for proper handling when diagnostics metrics are unavailable. Key additions include:
- Tests for successful field resolution metrics
- Tests for failed field resolution metrics
- Warning checks when diagnostics metrics are not present
Phew!! That was a long one 🤣
I admit that at first I tried to read everything, but from the 6th commit onwards I started to skim the test files. So don't count only on me in this one 😬
From what I gathered, this is thoroughly tested, which is awesome. The implementation files look good too. Nicely done!
this.metricsClient = await Promise.race([
  getMetricClient(),
  timeoutPromise
])
Out of curiosity: what happens if this initialization times out? Do we try again later?
In this case, if the initialization times out, there's no retry.
This rationale was chosen to align with what's already being done in the logs scope, but we can rethink this, no problem. If the timeout occurs, the `metricsClient` remains `undefined`, and this is reported in the console, so the application continues normally without crashing. This way telemetry failures don't impact availability, and requests can still be served by the application.
But that also means that if telemetry is not available at startup, then we would need to restart the affected pods to re-enable it once it's back, right? I would consider a retry mechanism (maybe with some backoff) so we can ease our life when this situation happens.
BTW this can also be a FUP
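For reference, a minimal sketch of what that follow-up could look like. Only `getMetricClient` comes from the diff above (declared here just so the sketch is self-contained); the attempt count, delays, and timeout value are arbitrary:

```ts
// Declared only for illustration; the real factory lives in the diagnostics setup code.
declare function getMetricClient(): Promise<unknown>

// Retry initialization with exponential backoff instead of giving up after one timeout.
async function initializeMetricsClientWithRetry(maxAttempts = 5, baseDelayMs = 1_000) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const timeout = new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error('metrics client initialization timed out')), 5_000)
      )
      return await Promise.race([getMetricClient(), timeout])
    } catch (error) {
      console.warn(`Metrics client initialization failed (attempt ${attempt}/${maxAttempts})`, error)
      if (attempt < maxAttempts) {
        // Exponential backoff before the next attempt.
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)))
      }
    }
  }
  return undefined
}
```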
  status_code: statusCode,
})
} else {
console.warn('DiagnosticsMetrics not available. Request aborted metric not reported.')
Should those warnings be observable (not just printed to stdout)?
I'm really not sure this should be addressed here; we'd end up emitting metrics about missing metrics, and I confess I'm not sure that scope should be covered now. One option would be to treat this scenario as a follow-up and keep the stdout warnings for now. What do you think?
Ok, that's good enough for me!
This commit modifies the DiagnosticsMetrics class to log warnings about exceeded attribute limits only when running in a linked app context.
What is the purpose of this pull request?
This PR introduces a modern metrics tooling layer using `@vtex/diagnostics-nodejs` (OpenTelemetry based) to replace the legacy `MetricsAccumulator` system. It provides a high-level API for recording metrics while maintaining full backward compatibility with existing code.

Key additions:
- `DiagnosticsMetrics` class with a simplified API (`recordLatency`, `incrementCounter`, `setGauge`)

Migration status:
- GraphQL field metrics (`@metric` directive)

Both legacy and new systems run in parallel - no breaking changes for external apps.
What problem is this solving?
Current limitations:
- `MetricsAccumulator` uses in-memory aggregation with snapshot-and-clear semantics
- Metrics are emitted through `console.log`, requiring log parsing

Solution:
- Explicit counters (e.g. `http_client_requests_total` with a `status` attribute)
- A single latency histogram (`io_app_operation_duration_milliseconds`) for all latencies
- A `component` attribute (`http-client`, `http-handler`, `graphql`)

How should this be manually tested?
Note: This cannot be fully tested locally, as it requires the VTEX IO runtime in a cloud environment.
Testing approach:
- Latencies recorded in `io_app_operation_duration_milliseconds`
- Counters: `http_client_requests_total`, `http_handler_requests_total`, `graphql_field_requests_total`
- Agent and server metrics: `http_agent_sockets_current`, `http_server_requests_total`
- Attributes: `component`, `status`, `status_code`, `route_id`, `client_metric`, `cache_state`
- Test with an existing app (`render-server`) to ensure no breaking changes
- Behavior when `global.diagnosticsMetrics` is unavailable

Validation checklist:
- `MetricsAccumulator` still functions

Screenshots or example usage
New API usage:
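The original example was not captured here; a hypothetical call site following the conventions described in this PR could look like this:

```ts
// Hypothetical usage; metric and attribute names follow the conventions described above.
const metrics = (global as any).diagnosticsMetrics

metrics?.recordLatency('io_app_operation_duration_milliseconds', 42, {
  component: 'http-handler',
  route_id: 'render',
  status_code: 200,
})

metrics?.incrementCounter('http_handler_requests_total', 1, {
  component: 'http-handler',
  status: 'success',
})

metrics?.setGauge('http_agent_sockets_current', 12)
```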
Metric structure:
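The structure shown in the original PR body is not reproduced here; an illustrative shape of a single data point, based on the attributes listed above, might be:

```ts
// Illustrative only; not the exact wire format emitted by the diagnostics library.
const exampleDataPoint = {
  name: 'http_client_requests_total',
  value: 1,
  attributes: {
    component: 'http-client',
    status: 'success',
    status_code: 200,
    cache_state: 'hit',
  },
}
```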
Types of changes