
Support for ExponentialHistogram in OTLP Registry #3959

Merged

Conversation

lenin-jaganathan
Contributor

@lenin-jaganathan lenin-jaganathan commented Jul 6, 2023

Resolves #3861.

Adds support for https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/data-model.md#exponentialhistogram. The math used for index calculation is reused from the OTel specification, which lays out the formulas and techniques for computing bucket indices with performance in mind.
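For reference, the spec's logarithm-based index computation for positive scales can be sketched as follows. This is a minimal illustration of the technique; the class and method names are mine, not the PR's:

```java
// Sketch of the OTel spec's logarithm-based bucket index computation for
// positive scales: base = 2^(2^-scale), and a positive value v maps to the
// bucket index satisfying base^index < v <= base^(index + 1).
public final class ExponentialIndexSketch {

    // Maps a positive recorded value to its exponential bucket index.
    static int index(double value, int scale) {
        // scaleFactor = 2^scale / ln(2), so
        // log(value) * scaleFactor = log2(value) * 2^scale
        double scaleFactor = Math.scalb(1.0 / Math.log(2), scale);
        return (int) Math.ceil(Math.log(value) * scaleFactor) - 1;
    }

    public static void main(String[] args) {
        // At scale 2 the base is 2^0.25; 10 falls in bucket index 13
        System.out.println(index(10.0, 2));
    }
}
```

Note that, as the known issues below mention, floating-point rounding can misplace exact powers of 2 at positive scales unless an exact computation is added for them.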

Did some benchmarking comparing the explicit bucket histograms used earlier with the exponential histogram; the results are below.

GC allocation: (benchmark screenshot)
Throughput in ns: (benchmark screenshot)

Known issues:

  • For the delta flavor, support for last-minute data is not yet added.
  • When recording exact powers of 2 at a positive scale, the exact computation is not used, since it appears it would add extra computation on every recording cycle and raise performance concerns.
  • DistributionStatisticConfig#getMinimumExpectedValueAsDouble is used as the zero threshold, but this results in values equal to the minimum expected value being added to the zeroCount instead of a positive histogram bucket.


@lenin-jaganathan lenin-jaganathan force-pushed the exponential_histogram branch 3 times, most recently from a79e04d to aa58dde Compare July 10, 2023 13:34
@lenin-jaganathan
Contributor Author

@jonatan-ivanov / @shakuzen Is this something that can be looked at for the next milestone release?

@lenin-jaganathan
Contributor Author

@shakuzen / @jonatan-ivanov Circling back to see if this can be looked at.

@sfc-gh-rscott

@lenin-jaganathan @jonatan-ivanov @shakuzen is it possible to dust this off and get it merged?

@shakuzen
Member

I'll be looking into this and hopefully working with @lenin-jaganathan if he has time this and next week.

@sfc-gh-rscott

@shakuzen @lenin-jaganathan It looks like the code here is ready to merge. Thanks for that. Would it be possible to document the code changes required to use exponential histograms from a consumer's perspective?

@shakuzen
Member

Would it be possible to document the code changes required to use exponential histograms from a consumer's perspective?

Documentation will come when we've finished reviewing and determined what form the feature will take. This pull request is focused on the OTLP registry, but the Prometheus registry (with the latest Prometheus client) can also support exponential histograms (Prometheus calls them native histograms). So one thing I am also considering is whether it makes sense to do something more generic than the OTLP registry at this time. That will likely influence what this feature looks like from a consumer's perspective.

@sfc-gh-dguy

Hi
Thanks for working on this PR to support native histograms. This is highly anticipated by my team, as we have many users waiting for it.
Could you please provide an update on the status of this PR and whether there is an estimated timeline for when it might be merged?

@shakuzen
Member

shakuzen commented Aug 8, 2024

@sfc-gh-rscott The hopeful timeline was to get something merged in time for the upcoming milestone on Monday. I'm becoming less confident that will happen as exponential histograms seem to be less well supported in the ecosystem than I anticipated. Perhaps you and others can help me understand some things better in this regard.

For exponential histograms, what does your pipeline look like from app to backend, and which backend is it?

It's a struggle to even figure out which backends support exponential histograms, and in general, I don't like shipping a feature that won't work for most backends in a registry that is supposed to produce a common format for backends. Other registries are typically for a specific backend and so we know upfront what features are supported. With OTLP, this isn't the case and it makes things awkward to support a type of metric that may end up getting dropped. It's odd to me to have a metric type defined in OTLP that is marked stable but that most backends seemingly don't support (but again, I can't find a list anywhere describing support for exponential histograms).

My preference for this feature, if all backends supported it, would have been to transparently enable exponential histograms so users don't need to choose. It's become clear it's far too premature to do that, so then we need to figure out the best configuration model for this.

On the configuration model, the pull request currently has configuration to enable exponential histograms at the OtlpConfig level, which is registry-wide. I am thinking it may be better to have this at the DistributionStatisticConfig level, which can be customized per meter.

For your use case, would you want to use exponential histograms for everything, or would you want to customize per meter?

This is highly anticipated by my team, as we have many users waiting for it.

Could you share the reasons why that is? It would help me understand what aspects are important and why.

Do you not use SLOs at all? Are you aware exponential histograms in their current state do not support custom buckets for the purpose of tracking SLOs?

@lenin-jaganathan
Contributor Author

I think it may be better to have this at the DistributionStatisticConfig level which can be customized per meter.

The current state of this pull request would do something like this:

  • If an exponential histogram is preferred and no custom SLOs are defined, use it.
  • If SLOs are defined by the user, fall back to the explicit bucket histogram.

Could you share the reasons why that is? It would help me understand what aspects are important and why.

I would let @sfc-gh-rscott answer this. However, the primary use case in our scenario is to accommodate a wide range of latencies with almost zero tuning required from customers. This is especially true for infra/framework teams responsible for serving a large set of audiences with differing needs. We have use cases where p99 for microservices varies from 10 ms to 100 ms, 500 ms, 1 s, 10 s, etc. An exponential histogram adapts and uses memory efficiently to cover these ranges.

@shakuzen
Member

shakuzen commented Aug 8, 2024

The current state of this pull request would do something like this:

  • If an exponential histogram is preferred and no custom SLOs are defined, use it.
  • If SLOs are defined by the user, fall back to the explicit bucket histogram.

I think that's reasonable if the only reasons not to use exponential histograms are:

  • Your backend (or something in the pipeline from your app to the backend) doesn't support exponential histograms (and therefore you want to disable them at the registry level).
  • You want to use SLOs with a distribution.

I'm hoping to get feedback on whether that's the case. If there are other reasons a user would want to configure the type of histogram at the meter level, I think we'd want to have something in DistributionStatisticConfig. There's also the question of whether people would want to configure maxScale and maxBucketCount at the meter level.

@lenin-jaganathan
Contributor Author

configure the type of histogram at the meter level

We can still achieve this by adding SLOs to histograms which would toggle between exponential/explicit histograms.

There's also the question of would people want to configure maxScale and maxBucketCount at the meter level or not?

I have tried to address that as part of the latest commit. Generally, maxScale is something configured at the registry level. It is always preferable to use the highest scale supported by the backend (Prometheus supports only up to 8; I believe anything above that is automatically downscaled to 8), because the histogram auto-scales based on the number of buckets. So maxBuckets is what controls the actual scale of the histogram.

Also, if we want something to be exposed in DistributionStatisticConfig, we would rather expose something generic that can be applied across any registry for which maxBucketCount makes sense. We would keep backward compatibility, and registries could override this, or users could do so at the per-meter level. This way users can also make per-meter customizations for exponential histograms.

Example: a client request would take a few hundred milliseconds, so we could have ~160 buckets to cover latencies from 1 ms to 1 minute with greater than 95% accuracy (in the real world, ~98%). A connection request, on the other hand, might take a few tens of milliseconds, and we could have ~80 buckets to get the same level of accuracy.
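The arithmetic behind that kind of sizing can be sketched: at scale s the bucket base is 2^(2^-s), covering a value ratio R takes about log2(R) * 2^s buckets, and the worst-case relative error of a bucket-midpoint estimate is roughly (base - 1) / (base + 1). The numbers below are illustrative, not taken from the PR:

```java
// Rough sizing math for a base-2 exponential histogram (illustrative only):
// bucketsNeeded(lo, hi, scale) ~= log2(hi / lo) * 2^scale, and the midpoint
// estimate's relative error per bucket is about (base - 1) / (base + 1).
public final class ExpHistogramSizing {

    // Buckets needed to cover the range [lo, hi] at the given scale.
    static int bucketsNeeded(double lo, double hi, int scale) {
        return (int) Math.ceil(Math.log(hi / lo) / Math.log(2) * (1 << scale));
    }

    // Worst-case relative error when estimating with the bucket midpoint.
    static double midpointRelativeError(int scale) {
        double base = Math.pow(2, Math.pow(2, -scale));
        return (base - 1) / (base + 1);
    }

    public static void main(String[] args) {
        // Covering 1 ms to 1 min (a 60000x ratio) at scale 3:
        System.out.println(bucketsNeeded(1, 60_000, 3));               // ~127 buckets
        System.out.printf("%.1f%%%n", midpointRelativeError(3) * 100); // ~4.3% error
    }
}
```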

@shakuzen
Member

shakuzen commented Aug 9, 2024

We can still achieve this by adding SLOs to histograms which would toggle between exponential/explicit histograms.

Yes but that feels hacky and unintuitive when you don't want an SLO.

As for maxBucketCount on DistributionStatisticConfig, I don't think it makes sense to apply it to the fixed bucket histogram. I think it will be better to call it maxExponentialBucketCount or similar and apply it only to exponential histograms.

.getDataPoints(0);
assertThat(dataPoint.getZeroCount()).isEqualTo(2);
assertThat(dataPoint.getCount()).isEqualTo(2);
assertThat(dataPoint.getPositive().getBucketCountsCount()).isZero();
Member

minimumExpectedValue is supposed to be inclusive with respect to buckets included in the histogram, but these test assertions show it's being treated as exclusive with regards to what is included in the bucket counts. I fear we have a bit of a mismatch here trying to use minimumExpectedValue for zeroThreshold because OTLP expects zeroThreshold to be inclusive of what ends up in the zeroCount.

  // ZeroThreshold may be optionally set to convey the width of the zero
  // region. Where the zero region is defined as the closed interval
  // [-ZeroThreshold, ZeroThreshold].

Contributor Author

Yeah, that is something I was worried about too. One thing we can attempt: before passing minimumExpectedValue as zeroThreshold, we could subtract a small fraction from it so that minimumExpectedValue is included in the distribution and zeroThreshold is strictly less than minimumExpectedValue. For time-based measurements we could reduce it by 1 ns; for summaries I am not sure; probably reduce it by the smallest possible fraction?

Contributor Author

I have a new change converting the exclusive minValue to an inclusive zero threshold.

I am doing Math.nextDown(minExpectedValue) and using it for zeroThreshold. WDYT?
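For illustration, Math.nextDown returns the largest double strictly below its argument, so using it as the zero threshold keeps minimumExpectedValue itself out of the closed [-zeroThreshold, zeroThreshold] zero region. A sketch of the idea (names are mine, not the PR's exact code):

```java
// Sketch: converting an exclusive minimumExpectedValue into an inclusive
// OTLP zeroThreshold, where the zero region is the closed interval
// [-zeroThreshold, zeroThreshold].
public final class ZeroThresholdSketch {

    static double zeroThreshold(double minimumExpectedValue) {
        // Largest double strictly less than minimumExpectedValue, so a
        // recorded value equal to the minimum lands in a positive bucket,
        // not in zeroCount.
        return Math.nextDown(minimumExpectedValue);
    }

    public static void main(String[] args) {
        double min = 1.0;
        double threshold = zeroThreshold(min);
        // min is now strictly above the zero region
        System.out.println(min > threshold);
    }
}
```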

@pirgeo
Contributor

pirgeo commented Aug 9, 2024

I think it may be better to have this at the DistributionStatisticConfig level which can be customized per meter.

FWIW: Configuring the OTel SDK allows setting Views, which enable switching the histogram aggregation depending on a meter selection criteria, like meter name, for example. See https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view. That means you have a default aggregation that is used for all histograms, unless you explicitly override the aggregation for one specific histogram. It is also possible to set the default histogram aggregation for all histograms (e.g. by default, record all histograms as explicit bucket OR by default, record all histograms as exponential histograms) and then only override the ones that you want in the other aggregation.
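As a sketch of what that View-based configuration looks like with the OTel Java SDK (this assumes the opentelemetry-sdk-metrics artifact; the instrument name is a hypothetical example, and this is not code from the PR):

```java
import io.opentelemetry.sdk.metrics.Aggregation;
import io.opentelemetry.sdk.metrics.InstrumentSelector;
import io.opentelemetry.sdk.metrics.InstrumentType;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.View;

public final class ViewConfigSketch {

    static SdkMeterProvider meterProvider() {
        return SdkMeterProvider.builder()
            // Default: record every histogram as a base-2 exponential histogram...
            .registerView(
                InstrumentSelector.builder().setType(InstrumentType.HISTOGRAM).build(),
                View.builder()
                    .setAggregation(Aggregation.base2ExponentialBucketHistogram())
                    .build())
            // ...but override one specific instrument back to explicit buckets.
            .registerView(
                InstrumentSelector.builder().setName("http.server.request.duration").build(),
                View.builder()
                    .setAggregation(Aggregation.explicitBucketHistogram())
                    .build())
            .build();
    }
}
```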

@shakuzen
Member

shakuzen commented Aug 9, 2024

I still have my concerns about this feature being ready for GA, but in the interest of gathering as much feedback as early as possible, I would like to try to merge this in time for the 1.14.0-M2 milestone release on Monday. @lenin-jaganathan will you be able to address the review today? If not, @jonatan-ivanov or I can take care of it when merging, provided @jonatan-ivanov doesn't have any other concerns to be addressed prior to merging. I think the main things to take care of before merging are my comment about maxBucketCount on DistributionStatisticConfig and consistently using "flavor" instead of "flavour". Other things we can probably consider post-merge, and even after the milestone if need be.

@lenin-jaganathan
Contributor Author

With the latest commit:

  • removed the changes for maxBucketCount (we will probably plan this post-merge)
  • renamed flavour to flavor and ExponentialBucket to ExponentialBuckets

@shakuzen shakuzen merged commit bb2ff45 into micrometer-metrics:main Aug 13, 2024
6 checks passed
@sfc-gh-rscott

@shakuzen

For your use case, would you want to use exponential histograms for everything, or would you want to customize per meter?

We have an existing codebase with other distribution summaries and timings already implemented. We would like to customize this per-meter.

Do you not use SLOs at all? Are you aware exponential histograms in their current state do not support custom buckets for the purpose of tracking SLOs?

We don't use the SLO feature of micrometer currently. Not having them for this use case is fine.

As far as backends that support exponential histograms, Prometheus and some other Prometheus-compatible backends like Grafana's Mimir now have support. They call them "native histograms" but they're the same thing.

@shakuzen
Member

shakuzen commented Sep 2, 2024

@sfc-gh-rscott I opened #5459 to track supporting config at the meter level.

@lenin-jaganathan lenin-jaganathan deleted the exponential_histogram branch September 23, 2024 07:09