Support for ExponentialHistogram in OTLP Registry #3959
Conversation
@jonatan-ivanov / @shakuzen Is this something that can be looked at for the next milestone release?
@shakuzen / @jonatan-ivanov Circling back to see if this can be looked at?
@lenin-jaganathan @jonatan-ivanov @shakuzen Is it possible to dust this off and get it merged?
I'll be looking into this and hopefully working with @lenin-jaganathan if he has time this and next week.
@shakuzen @lenin-jaganathan It looks like the code here is ready to merge. Thanks for that. Would it be possible to document the code changes required to use exponential histograms from a consumer's perspective?
Documentation will come when we've finished reviewing and determined what form the feature will take. This pull request is focused on the OTLP registry, but the Prometheus registry (with the latest Prometheus client) can also support exponential histograms (Prometheus calls them native histograms). So one thing I am also considering is whether it makes sense to do something more generic than the OTLP registry at this time or not. That will likely influence what this feature looks like from a consumer's perspective.
Hi
@sfc-gh-rscott The hopeful timeline was to get something merged in time for the upcoming milestone on Monday. I'm becoming less confident that will happen, as exponential histograms seem to be less well supported in the ecosystem than I anticipated. Perhaps you and others can help me understand some things better in this regard. What does your pipeline look like, from the app to whichever backend receives the exponential histograms?

It's a struggle to even figure out which backends support exponential histograms, and in general I don't like shipping a feature that won't work for most backends in a registry that is supposed to produce a common format for backends. Other registries are typically for a specific backend, so we know upfront what features are supported. With OTLP this isn't the case, and it makes it awkward to support a type of metric that may end up getting dropped. It's odd to me to have a metric type defined in OTLP that is marked stable but that most backends seemingly don't support (but again, I can't find a list anywhere describing support for exponential histograms).

My preference for this feature, if all backends supported it, would have been to transparently enable exponential histograms so users don't need to choose. It's become clear it's far too premature to do that, so we need to figure out the best configuration model for this. On the configuration model, the pull request currently has configuration to enable exponential histograms at the OtlpConfig level, which is registry-wide. I am thinking it may be better to have this at the DistributionStatisticConfig level, which can be customized per meter. For your use case, would you want to use exponential histograms for everything, or would you want to customize per meter?
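For illustration, a minimal sketch of what the registry-wide option discussed here could look like. The `otlp.histogramFlavor` property key and its value are assumptions about how the OtlpConfig-level setting might end up being named; the exact shape was still being decided in this PR.

```java
import io.micrometer.core.instrument.Clock;
import io.micrometer.registry.otlp.OtlpConfig;
import io.micrometer.registry.otlp.OtlpMeterRegistry;

public class ExponentialHistogramOptIn {
    public static void main(String[] args) {
        // Registry-wide opt-in: every meter published by this registry would use
        // exponential histograms. Property name/value are hypothetical.
        OtlpConfig config = key -> "otlp.histogramFlavor".equals(key)
                ? "base2_exponential_bucket_histogram"
                : null; // null falls back to the default for every other property

        OtlpMeterRegistry registry = new OtlpMeterRegistry(config, Clock.SYSTEM);
        registry.timer("http.server.requests").record(java.time.Duration.ofMillis(42));
    }
}
```

A per-meter model would instead hang off DistributionStatisticConfig, which is the open question raised in the comment above.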
Could you share the reasons why that is? It would help me understand what aspects are important and why. Do you not use SLOs at all? Are you aware exponential histograms in their current state do not support custom buckets for the purpose of tracking SLOs?
The current state of this pull request would do something like this:
I would let @sfc-gh-rscott answer this. However, the primary use case for our scenario is to accommodate a wide range of latencies with almost zero tuning required from customers. This is especially true for infra/framework teams responsible for serving a large set of audiences with differing needs. We have use cases where the p99 for microservices varies across 10ms, 100ms, 500ms, 1s, 10s, etc. An exponential histogram adapts and uses memory efficiently to cover these ranges.
I think that's reasonable if the only reason not to use exponential histograms is because
I'm hoping to get feedback on whether that's the case. If there are other reasons a user would want to configure the type of histogram at the meter level, I think we'd want to have something in DistributionStatisticConfig. There's also the question of whether people would want to configure
We can still achieve this by adding SLOs to histograms which would toggle between exponential/explicit histograms.
I have tried to address that as part of the latest commit. Generally, maxScale is something configured at the registry level. It is always preferable to use the highest possible scale supported by the back-end (Prometheus supports only up to 8; I guess anything above that is automatically downscaled to 8), because the histogram auto-scales based on the number of buckets. So,

Also, if we want something to be exposed in

Example: a client request would take a few hundred milliseconds, hence we can have ~160 buckets to cover latencies from 1ms to 1 minute with greater than 95% accuracy (in the real world, ~98%), whereas a connection request might take a few tens of milliseconds and we could have ~80 buckets to get the same level of accuracy.
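To make the bucket-count/accuracy trade-off above concrete, here is a rough back-of-the-envelope sketch (not code from this PR) of the arithmetic for base-2 exponential buckets: how many buckets are needed to cover a value range at a given scale, and the worst-case relative error that scale implies. The numbers for 1ms–1 minute at scale 3 come out to roughly 127 buckets and ~4.3% worst-case error, in the same ballpark as the figures quoted above.

```java
public class ExponentialBucketMath {

    /** Buckets needed to cover [min, max] when boundaries sit at 2^(i * 2^-scale). */
    static long bucketsNeeded(double min, double max, int scale) {
        double log2Range = Math.log(max / min) / Math.log(2);
        return (long) Math.ceil(log2Range * Math.pow(2, scale));
    }

    /** Worst-case relative error when a bucket is reported by its midpoint. */
    static double worstCaseRelativeError(int scale) {
        double base = Math.pow(2, Math.pow(2, -scale));
        return (base - 1) / (base + 1);
    }

    public static void main(String[] args) {
        // 1 ms .. 1 minute (in seconds) at scale 3
        System.out.println(bucketsNeeded(0.001, 60.0, 3));        // ~127 buckets
        System.out.println(worstCaseRelativeError(3));            // ~0.043 (4.3%)
    }
}
```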
Yes, but that feels hacky and unintuitive when you don't want an SLO. As for
.getDataPoints(0);
assertThat(dataPoint.getZeroCount()).isEqualTo(2);
assertThat(dataPoint.getCount()).isEqualTo(2);
assertThat(dataPoint.getPositive().getBucketCountsCount()).isZero();
`minimumExpectedValue` is supposed to be inclusive with respect to the buckets included in the histogram, but these test assertions show it's being treated as exclusive with regards to what is included in the bucket counts. I fear we have a bit of a mismatch here trying to use `minimumExpectedValue` for `zeroThreshold`, because OTLP expects `zeroThreshold` to be inclusive of what ends up in the `zeroCount`.
// ZeroThreshold may be optionally set to convey the width of the zero
// region. Where the zero region is defined as the closed interval
// [-ZeroThreshold, ZeroThreshold].
Yeah, that is something I was worried about too. One thing we can attempt is, before passing the `minimumExpectedValue` to `zeroThreshold`, to subtract a fraction from `minimumExpectedValue` so that it is included in the distribution and `zeroThreshold` is strictly less than `minimumExpectedValue`. For time-based measurements, we could reduce it by 1ns; for summaries I am not sure, probably reduce it by the smallest possible fraction?
I have a new change converting the exclusive minValue to an inclusive zero threshold. I am doing `Math.nextDown(minExpectedValue)` and using it for `zeroThreshold`. WDYT?
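As a minimal illustration of that approach (the value below is hypothetical, not taken from the PR):

```java
public class ZeroThresholdConversion {
    public static void main(String[] args) {
        // Hypothetical minimumExpectedValue of 1 ms, expressed in seconds.
        double minimumExpectedValue = 0.001;

        // Step down to the next representable double so the threshold is strictly
        // below the minimum: a recorded value equal to minimumExpectedValue then
        // lands in a positive bucket instead of the closed zero region
        // [-zeroThreshold, zeroThreshold] that OTLP counts into zeroCount.
        double zeroThreshold = Math.nextDown(minimumExpectedValue);

        System.out.println(zeroThreshold < minimumExpectedValue); // true
    }
}
```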
FWIW: Configuring the OTel SDK allows setting
I still have my concerns about this feature being ready for GA, but in the interest of gathering as much feedback as early as possible, I think I would like to try to merge this in time for the 1.14.0-M2 milestone release on Monday. @lenin-jaganathan will you be able to address the review today? If not, @jonatan-ivanov or I can take care of it when merging, if @jonatan-ivanov doesn't have any other concerns to be addressed prior to merging. I think the main thing to take care of before merging is my comment about
With the latest commit,
We have an existing codebase with other distribution summaries and timings already implemented. We would like to customize this per-meter.
We don't use the SLO feature of Micrometer currently. Not having them for this use case is fine. As for backends that support exponential histograms: Prometheus and some other Prometheus-compatible backends like Grafana's Mimir now have support. They call them "native histograms" but they're the same thing.
@sfc-gh-rscott I opened #5459 to track supporting config at the meter level. |
Resolves #3861.
Adds support for https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/data-model.md#exponentialhistogram. The math used for index calculation is reused from the OTel specification, which lays out the formulas and techniques to use for index calculation while also keeping performance in mind.
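For reference, a sketch of the scale-based index calculation the linked specification describes. This mirrors the spec's logarithm-based approach for positive scales; it is not the exact code in `Base2ExponentialHistogram` and omits the exact-power-of-two correction and the faster paths for scale <= 0 that the spec also covers.

```java
public class Base2IndexSketch {
    // Bucket index for a positive value at the given positive scale, with bucket
    // boundaries at 2^(i * 2^-scale). Can be off by one for exact powers of two;
    // the spec describes how to correct for that.
    static int bucketIndex(double value, int scale) {
        double scaleFactor = Math.scalb(1.0 / Math.log(2), scale); // 2^scale / ln(2)
        return (int) Math.ceil(Math.log(value) * scaleFactor) - 1;
    }

    public static void main(String[] args) {
        System.out.println(bucketIndex(10.0, 3)); // 26: 10 falls in (2^(26/8), 2^(27/8)]
    }
}
```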
Did some benchmarking between the explicit bucket histograms used earlier and the exponential histogram, covering GC allocation and throughput in ns.
Known issues:
- For the delta flavor, support for last-minute data is not yet added.
- `DistributionStatisticConfig#getMinimumExpectedValueAsDouble` is used as the zero threshold, but this results in values equal to the minimum expected value being added to the zeroCount instead of a positive histogram bucket.