
Support for ExponentialHistogram in OTLP Registry #3959

Merged

Conversation

lenin-jaganathan
Contributor

@lenin-jaganathan lenin-jaganathan commented Jul 6, 2023

Resolves #3861.

Adds support for https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/data-model.md#exponentialhistogram. The math used for index calculation is reused from the OTel specification, which lays out the formulas and techniques for computing bucket indices with performance in mind.
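For reference, the spec's logarithm-based index computation for positive scales can be sketched as follows. This is a minimal illustration of the technique; the class and method names are mine, not the PR's:

```java
// Sketch of the OTel spec's logarithm-based bucket index computation for
// positive scales: base = 2^(2^-scale), and a positive value v maps to the
// bucket index satisfying base^index < v <= base^(index + 1).
public final class ExponentialIndexSketch {

    // Maps a positive recorded value to its exponential bucket index.
    static int index(double value, int scale) {
        // scaleFactor = 2^scale / ln(2), so
        // log(value) * scaleFactor = log2(value) * 2^scale
        double scaleFactor = Math.scalb(1.0 / Math.log(2), scale);
        return (int) Math.ceil(Math.log(value) * scaleFactor) - 1;
    }

    public static void main(String[] args) {
        // At scale 2 the base is 2^0.25; 10 falls in bucket index 13
        System.out.println(index(10.0, 2));
    }
}
```

Note that, as the known issues below mention, floating-point rounding can misplace exact powers of 2 at positive scales unless an exact computation is added for them.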

Did some benchmarking comparing the explicit bucket histograms used earlier with the exponential histogram; the results are below.

GC allocation: (benchmark screenshot)
Throughput in ns: (benchmark screenshot)

Known issues:

  • For the delta flavor, support for last-minute data is not yet added.
  • When recording exact powers of 2 at a positive scale, the exact computation is not used, since it appears it would add extra computation on every recording cycle and raise performance concerns.
  • DistributionStatisticConfig#getMinimumExpectedValueAsDouble is used as the zero threshold, but this results in values equal to the minimum expected value being added to the zeroCount instead of a positive histogram bucket.


@lenin-jaganathan lenin-jaganathan force-pushed the exponential_histogram branch 3 times, most recently from a79e04d to aa58dde Compare July 10, 2023 13:34
@lenin-jaganathan
Contributor Author

@jonatan-ivanov / @shakuzen Is this something that can be looked at for the next milestone release?

@lenin-jaganathan
Contributor Author

@shakuzen / @jonatan-ivanov Circling back to see if this can be looked at.

@sfc-gh-rscott

@lenin-jaganathan @jonatan-ivanov @shakuzen is it possible to dust this off and get it merged?

@shakuzen
Member

I'll be looking into this and hopefully working with @lenin-jaganathan if he has time this and next week.

@sfc-gh-rscott

@shakuzen @lenin-jaganathan It looks like the code here is ready to merge. Thanks for that. Would it be possible to document the code changes required to use exponential histograms from a consumer's perspective?

@shakuzen
Member

Would it be possible to document the code changes required to use exponential histograms from a consumer's perspective?

Documentation will come when we've finished reviewing and determined what form the feature will take. This pull request is focused on the OTLP registry, but the Prometheus registry (with the latest Prometheus client) can also support exponential histograms (Prometheus calls them native histograms). So one thing I am also considering is whether it makes sense to do something more generic than the OTLP registry at this time. That will likely influence what this feature looks like from a consumer's perspective.

@sfc-gh-dguy

Hi
Thanks for working on this PR to support native histograms. This is highly anticipated by my team, as we have many users waiting for it.
Could you please provide an update on the status of this PR and whether there is an estimated timeline for when it might be merged?

@shakuzen
Member

shakuzen commented Aug 8, 2024

@sfc-gh-rscott The hopeful timeline was to get something merged in time for the upcoming milestone on Monday. I'm becoming less confident that will happen as exponential histograms seem to be less well supported in the ecosystem than I anticipated. Perhaps you and others can help me understand some things better in this regard.

For exponential histograms, what does your pipeline look like from app to backend, and which backend is it?

It's a struggle to even figure out which backends support exponential histograms, and in general, I don't like shipping a feature that won't work for most backends in a registry that is supposed to produce a common format for backends. Other registries are typically for a specific backend and so we know upfront what features are supported. With OTLP, this isn't the case and it makes things awkward to support a type of metric that may end up getting dropped. It's odd to me to have a metric type defined in OTLP that is marked stable but that most backends seemingly don't support (but again, I can't find a list anywhere describing support for exponential histograms).

My preference for this feature, if all backends supported it, would have been to transparently enable exponential histograms so users don't need to choose. It's become clear it's far too premature to do that, so then we need to figure out the best configuration model for this.

On the configuration model, the pull request currently has configuration to enable exponential histograms at the OtlpConfig level, which is registry-wide. I am thinking it may be better to have this at the DistributionStatisticConfig level, which can be customized per meter.

For your use case, would you want to use exponential histograms for everything, or would you want to customize per meter?

This is highly anticipated by my team, as we have many users waiting for it.

Could you share the reasons why that is? It would help me understand what aspects are important and why.

Do you not use SLOs at all? Are you aware exponential histograms in their current state do not support custom buckets for the purpose of tracking SLOs?

@lenin-jaganathan
Contributor Author

I think it may be better to have this at the DistributionStatisticConfig level which can be customized per meter.

The current state of this pull request would do something like this:

  • If an exponential histogram is preferred and no custom SLOs are defined, use it.
  • If SLOs are defined by the user, fall back to the explicit bucket histogram.

Could you share the reasons why that is? It would help me understand what aspects are important and why.

I would let @sfc-gh-rscott answer this. However, the primary use case in our scenario is to accommodate a wide range of latencies with almost zero tuning required from customers. This is especially true for infra/framework teams responsible for serving a large set of audiences with differing needs. We have use cases where p99 for microservices varies from 10 ms to 100 ms, 500 ms, 1 s, 10 s, etc. An exponential histogram adapts and uses memory efficiently to cover these ranges.

@shakuzen
Member

shakuzen commented Aug 8, 2024

The current state of this pull request would do something like this:

  • If an exponential histogram is preferred and no custom SLOs are defined, use it.
  • If SLOs are defined by the user, fall back to the explicit bucket histogram.

I think that's reasonable if the only reasons not to use exponential histograms are:

  • Your backend (or something in the pipeline from your app to the backend) doesn't support exponential histograms (and therefore you want to disable them at the registry level).
  • You want to use SLOs with a distribution.

I'm hoping to get feedback on whether that's the case. If there are other reasons a user would want to configure the type of histogram at the meter level, I think we'd want to have something in DistributionStatisticConfig. There's also the question of whether people would want to configure maxScale and maxBucketCount at the meter level.

@lenin-jaganathan
Contributor Author

configure the type of histogram at the meter level

We can still achieve this by adding SLOs to histograms which would toggle between exponential/explicit histograms.

There's also the question of would people want to configure maxScale and maxBucketCount at the meter level or not?

I have tried to address that as part of the latest commit. Generally, maxScale is something configured at the registry level. It is always preferable to use the highest scale supported by the backend (Prometheus supports only up to 8; I believe anything above that is automatically downscaled to 8), because the histogram auto-scales based on the number of buckets. So maxBuckets is what controls the actual scale of the histogram.

Also, if we want something to be exposed in DistributionStatisticConfig, we would rather expose something generic that can be applied across any registry for which maxBucketCount makes sense. We would keep backward compatibility, and registries could override this, or users could do so at the per-meter level. This way users can also make per-meter customizations for exponential histograms.

Example: a client request would take a few hundred milliseconds, so we could have ~160 buckets to cover latencies from 1 ms to 1 minute with greater than 95% accuracy (in the real world, ~98%). A connection request, on the other hand, might take a few tens of milliseconds, and we could have ~80 buckets to get the same level of accuracy.
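The arithmetic behind that kind of sizing can be sketched: at scale s the bucket base is 2^(2^-s), covering a value ratio R takes about log2(R) * 2^s buckets, and the worst-case relative error of a bucket-midpoint estimate is roughly (base - 1) / (base + 1). The numbers below are illustrative, not taken from the PR:

```java
// Rough sizing math for a base-2 exponential histogram (illustrative only):
// bucketsNeeded(lo, hi, scale) ~= log2(hi / lo) * 2^scale, and the midpoint
// estimate's relative error per bucket is about (base - 1) / (base + 1).
public final class ExpHistogramSizing {

    // Buckets needed to cover the range [lo, hi] at the given scale.
    static int bucketsNeeded(double lo, double hi, int scale) {
        return (int) Math.ceil(Math.log(hi / lo) / Math.log(2) * (1 << scale));
    }

    // Worst-case relative error when estimating with the bucket midpoint.
    static double midpointRelativeError(int scale) {
        double base = Math.pow(2, Math.pow(2, -scale));
        return (base - 1) / (base + 1);
    }

    public static void main(String[] args) {
        // Covering 1 ms to 1 min (a 60000x ratio) at scale 3:
        System.out.println(bucketsNeeded(1, 60_000, 3));               // ~127 buckets
        System.out.printf("%.1f%%%n", midpointRelativeError(3) * 100); // ~4.3% error
    }
}
```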

@shakuzen
Member

shakuzen commented Aug 9, 2024

We can still achieve this by adding SLOs to histograms which would toggle between exponential/explicit histograms.

Yes but that feels hacky and unintuitive when you don't want an SLO.

As for maxBucketCount on DistributionStatisticConfig, I don't think it makes sense to apply it to the fixed bucket histogram. I think it will be better to call it maxExponentialBucketCount or similar and apply it only to exponential histograms.

.getDataPoints(0);
assertThat(dataPoint.getZeroCount()).isEqualTo(2);
assertThat(dataPoint.getCount()).isEqualTo(2);
assertThat(dataPoint.getPositive().getBucketCountsCount()).isZero();
Member

minimumExpectedValue is supposed to be inclusive with respect to buckets included in the histogram, but these test assertions show it's being treated as exclusive with regards to what is included in the bucket counts. I fear we have a bit of a mismatch here trying to use minimumExpectedValue for zeroThreshold because OTLP expects zeroThreshold to be inclusive of what ends up in the zeroCount.

  // ZeroThreshold may be optionally set to convey the width of the zero
  // region. Where the zero region is defined as the closed interval
  // [-ZeroThreshold, ZeroThreshold].

Contributor Author

Yeah, that is something I was worried about too. One thing we can attempt: before passing minimumExpectedValue as zeroThreshold, we could subtract a small fraction from it so that minimumExpectedValue is included in the distribution and zeroThreshold is strictly less than minimumExpectedValue. For time-based measurements we could reduce it by 1 ns; for summaries I am not sure; probably reduce it by the smallest possible fraction?

Contributor Author

I have a new change converting the exclusive minValue to an inclusive zero threshold.

I am doing Math.nextDown(minExpectedValue) and using it for zeroThreshold. WDYT?
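For illustration, Math.nextDown returns the largest double strictly below its argument, so using it as the zero threshold keeps minimumExpectedValue itself out of the closed [-zeroThreshold, zeroThreshold] zero region. A sketch of the idea (names are mine, not the PR's exact code):

```java
// Sketch: converting an exclusive minimumExpectedValue into an inclusive
// OTLP zeroThreshold, where the zero region is the closed interval
// [-zeroThreshold, zeroThreshold].
public final class ZeroThresholdSketch {

    static double zeroThreshold(double minimumExpectedValue) {
        // Largest double strictly less than minimumExpectedValue, so a
        // recorded value equal to the minimum lands in a positive bucket,
        // not in zeroCount.
        return Math.nextDown(minimumExpectedValue);
    }

    public static void main(String[] args) {
        double min = 1.0;
        double threshold = zeroThreshold(min);
        // min is now strictly above the zero region
        System.out.println(min > threshold);
    }
}
```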

@pirgeo
Contributor

pirgeo commented Aug 9, 2024

I think it may be better to have this at the DistributionStatisticConfig level which can be customized per meter.

FWIW: Configuring the OTel SDK allows setting Views, which enable switching the histogram aggregation depending on a meter selection criteria, like meter name, for example. See https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view. That means you have a default aggregation that is used for all histograms, unless you explicitly override the aggregation for one specific histogram. It is also possible to set the default histogram aggregation for all histograms (e.g. by default, record all histograms as explicit bucket OR by default, record all histograms as exponential histograms) and then only override the ones that you want in the other aggregation.
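As a sketch of what that View-based configuration looks like with the OTel Java SDK (this assumes the opentelemetry-sdk-metrics artifact; the instrument name is a hypothetical example, and this is not code from the PR):

```java
import io.opentelemetry.sdk.metrics.Aggregation;
import io.opentelemetry.sdk.metrics.InstrumentSelector;
import io.opentelemetry.sdk.metrics.InstrumentType;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.View;

public final class ViewConfigSketch {

    static SdkMeterProvider meterProvider() {
        return SdkMeterProvider.builder()
            // Default: record every histogram as a base-2 exponential histogram...
            .registerView(
                InstrumentSelector.builder().setType(InstrumentType.HISTOGRAM).build(),
                View.builder()
                    .setAggregation(Aggregation.base2ExponentialBucketHistogram())
                    .build())
            // ...but override one specific instrument back to explicit buckets.
            .registerView(
                InstrumentSelector.builder().setName("http.server.request.duration").build(),
                View.builder()
                    .setAggregation(Aggregation.explicitBucketHistogram())
                    .build())
            .build();
    }
}
```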

@shakuzen
Member

shakuzen commented Aug 9, 2024

I still have my concerns about this feature being ready for GA, but in the interest of gathering as much feedback as early as possible, I would like to try to merge this in time for the 1.14.0-M2 milestone release on Monday. @lenin-jaganathan will you be able to address the review today? If not, @jonatan-ivanov or I can take care of it when merging, provided @jonatan-ivanov doesn't have any other concerns to be addressed prior to merging. I think the main things to take care of before merging are my comment about maxBucketCount on DistributionStatisticConfig and consistently using "flavor" instead of "flavour". Other things we can probably consider post-merge, and even after the milestone if need be.

@lenin-jaganathan
Contributor Author

With the latest commit:

  • removed the changes for maxBucketCount (we will probably plan this post-merge)
  • renamed flavour to flavor and ExponentialBucket to ExponentialBuckets

@shakuzen shakuzen merged commit bb2ff45 into micrometer-metrics:main Aug 13, 2024
6 checks passed
@sfc-gh-rscott

@shakuzen

For your use case, would you want to use exponential histograms for everything, or would you want to customize per meter?

We have an existing codebase with other distribution summaries and timings already implemented. We would like to customize this per-meter.

Do you not use SLOs at all? Are you aware exponential histograms in their current state do not support custom buckets for the purpose of tracking SLOs?

We don't use the SLO feature of micrometer currently. Not having them for this use case is fine.

As far as backends that support exponential histograms, Prometheus and some other Prometheus-compatible backends like Grafana's Mimir now have support. They call them "native histograms" but they're the same thing.

@shakuzen
Member

shakuzen commented Sep 2, 2024

@sfc-gh-rscott I opened #5459 to track supporting config at the meter level.

@lenin-jaganathan lenin-jaganathan deleted the exponential_histogram branch September 23, 2024 07:09