Fix Stackdriver Distribution count and bucket count sum mismatch #5836

Draft · shakuzen wants to merge 1 commit into main from stackdriver-histogram

Conversation

@shakuzen (Member) commented Jan 29, 2025

The count passed to the HistogramSnapshot comes from the StepTimer and is step-based, while the histogram buckets are tracked over a separate time window. Because of this mismatch in time windows, the two values generally do not match, which causes a validation error when publishing Distribution metrics to Stackdriver.

This PR solves the problem by always tracking an infinity bucket in the histogram, so we can get a consistent count over the same time window as the rest of the histogram.
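To illustrate the invariant at play, here is a hedged sketch (not code from this PR): Stackdriver requires the sum of the bucket counts to equal the Distribution's count, so the count has to be derived from the same time window as the buckets. Assuming cumulative bucket counts that end in a +Inf bucket, the last bucket already carries a consistent total:

```java
import io.micrometer.core.instrument.distribution.CountAtBucket;

class DistributionCountSketch {

    // With cumulative bucket counts that end in a +Inf bucket, the last bucket
    // already holds the total count for the histogram's own time window.
    static long countFromHistogram(CountAtBucket[] cumulativeBuckets) {
        return cumulativeBuckets.length == 0 ? 0
                : (long) cumulativeBuckets[cumulativeBuckets.length - 1].count();
    }

    public static void main(String[] args) {
        CountAtBucket[] buckets = {
                new CountAtBucket(1.0, 2),                      // <= 1.0
                new CountAtBucket(2.0, 5),                      // <= 2.0 (cumulative)
                new CountAtBucket(Double.POSITIVE_INFINITY, 5)  // +Inf: total in this window
        };
        long stepCount = 4;                                // step-based count, different window
        long histogramCount = countFromHistogram(buckets); // 5, consistent with the buckets
        // Stackdriver validation: the (non-cumulative) bucket counts must sum to the count,
        // which only holds if the count comes from the same window as the buckets.
        System.out.println("step count = " + stepCount + ", histogram count = " + histogramCount);
    }
}
```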

I still need to do more testing with empty (no-op) histograms and with percentiles configured, but I wanted to open the PR in case anyone wants to review the approach in the meantime.

Closes gh-4868

@shakuzen (Member Author) left a comment:

Leaving some notes for anyone reviewing and my future self before I forget.

@@ -508,18 +505,17 @@ private TimeInterval interval(MetricDescriptor.MetricKind metricKind) {
Distribution distribution(HistogramSnapshot snapshot, boolean timeDomain) {
CountAtBucket[] histogram = snapshot.histogramCounts();

// selected finite buckets (represented as a normal histogram)
shakuzen (Member Author):

We no longer need to convert cumulative buckets to non-cumulative because, with the change to dedicated Stackdriver types for Timer and DistributionSummary, we pass the flag for a non-cumulative histogram.
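For readers less familiar with the distinction, a minimal sketch of the conversion being referred to, turning cumulative bucket counts into per-bucket counts; this is illustrative only, not the removed registry code:

```java
import io.micrometer.core.instrument.distribution.CountAtBucket;

class CumulativeToNonCumulativeSketch {

    // Each non-cumulative bucket count is the difference between adjacent
    // cumulative counts; this step becomes unnecessary once the histogram is
    // configured to produce non-cumulative counts directly.
    static long[] toNonCumulative(CountAtBucket[] cumulative) {
        long[] perBucket = new long[cumulative.length];
        double previous = 0;
        for (int i = 0; i < cumulative.length; i++) {
            perBucket[i] = (long) (cumulative[i].count() - previous);
            previous = cumulative[i].count();
        }
        return perBucket;
    }
}
```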

.setMean(timeDomain ? snapshot.mean(getBaseTimeUnit()) : snapshot.mean())
.setCount(snapshot.count())
.setCount(cumulativeCount)
shakuzen (Member Author):

This is the part that fixes things: snapshot.count() is the step-based count, which does not align with the histogram time window, while cumulativeCount is taken from the histogram itself.

bucketCounts = bucketCounts.subList(0, lastNonZeroIndex + 1);
// infinite bucket count of 0 can be omitted
bucketCounts.add(infCount);
shakuzen (Member Author):

I left this note here, but I did not omit zero-count infinity buckets, to keep the behavior the same as before, since this is a change in a maintenance branch.

}

final Set<Double> monitoredValues = distributionStatisticConfig
.getHistogramBuckets(supportsAggregablePercentiles);
shakuzen (Member Author):

This logic of fetching the buckets again to include in the snapshot was problematic for the goal of always having an infinity bucket (in the Stackdriver case), since it relies only on the logic in DistributionStatisticConfig, and at this level (AbstractTimeWindowHistogram) we do not know whether the infinity bucket should be included. Hence this was removed, to avoid trying to keep the buckets tracked in the concrete histogram type (e.g. TimeWindowFixedBoundaryHistogram) in sync with the buckets returned here when Histogram#takeSnapshot is called. With the new package-private method countsAtBuckets, we let the concrete implementation determine which buckets to return, rather than passing the buckets in from here. We could even get rid of the countsAtValues method, I think.
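A hedged sketch of the two method shapes under discussion (signatures taken from the comments in this thread; the surrounding class and comments are an interpretation, not the actual source):

```java
import java.util.Iterator;

import io.micrometer.core.instrument.distribution.CountAtBucket;

abstract class HistogramBucketApiSketch {

    // Older shape: the caller supplies the bucket boundaries, historically from
    // DistributionStatisticConfig#getHistogramBuckets, which made it hard for
    // this level to know whether a +Inf bucket should be included.
    abstract Iterator<CountAtBucket> countsAtValues(Iterator<Double> values);

    // New package-private shape: the concrete histogram decides which buckets
    // to report, so it can always include the +Inf bucket when configured to.
    abstract CountAtBucket[] countsAtBuckets();
}
```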

Member (reviewer):

Indeed, Iterator<CountAtBucket> countsAtValues(Iterator<Double> values) might be unnecessary while we also have CountAtBucket[] countsAtBuckets(). I'm not sure, though, whether there is a use case for countsAtValues where not all the bucket boundaries are passed.

shakuzen (Member Author):

It's package-private, so we can check all the usages. AFAIR, it's always the result of distributionStatisticConfig.getHistogramBuckets being passed, until this PR deviated from that to add the infinity bucket. We previously had a regression where the buckets didn't match - see #3922.

shakuzen (Member Author):

I've removed it now, as I refactored things in TimeWindowPercentileHistogram.

* @see PercentileHistogramBuckets
*/
protected TimeWindowFixedBoundaryHistogram(Clock clock, DistributionStatisticConfig config,
boolean supportsAggregablePercentiles, boolean isCumulativeBucketCounts, boolean includeInfinityBucket) {
shakuzen (Member Author):

This protected constructor is the only (semi-)public API added or changed as part of this pull request. I tried to keep API changes minimal since this is a maintenance branch. I would have liked not to add any, but I did not come up with a way to achieve that without doing something like copying everything from TimeWindowFixedBoundaryHistogram into a Histogram implementation in the stackdriver module, which would probably be more prone to code rot and bugs.
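A hedged sketch of how the new protected constructor could be consumed from the stackdriver module; the subclass name is hypothetical, and the supportsAggregablePercentiles value is an assumption. Only the non-cumulative and infinity-bucket flags follow from this PR:

```java
import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.distribution.DistributionStatisticConfig;
import io.micrometer.core.instrument.distribution.TimeWindowFixedBoundaryHistogram;

// Hypothetical subclass living in the stackdriver module.
class StackdriverFixedBoundaryHistogram extends TimeWindowFixedBoundaryHistogram {

    StackdriverFixedBoundaryHistogram(Clock clock, DistributionStatisticConfig config) {
        super(clock, config,
                /* supportsAggregablePercentiles = */ true,  // assumption for illustration
                /* isCumulativeBucketCounts = */ false,      // Stackdriver wants non-cumulative counts
                /* includeInfinityBucket = */ true);         // the point of the new flag
    }
}
```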

// copied and modified from AbstractDistributionSummary/AbstractTimer
static Histogram stackdriverHistogram(Clock clock, DistributionStatisticConfig distributionStatisticConfig) {
if (distributionStatisticConfig.isPublishingPercentiles()) {
// hdr-based histogram
Member (reviewer):

HdrHistogram-based histogram?
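For orientation, a hedged sketch of the shape such a factory typically takes in Micrometer, based on AbstractTimer/AbstractDistributionSummary (which the diff comment says this code was copied from); the PR's exact flags and types are not reproduced here:

```java
import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.distribution.DistributionStatisticConfig;
import io.micrometer.core.instrument.distribution.Histogram;
import io.micrometer.core.instrument.distribution.NoopHistogram;
import io.micrometer.core.instrument.distribution.TimeWindowFixedBoundaryHistogram;
import io.micrometer.core.instrument.distribution.TimeWindowPercentileHistogram;

class StackdriverHistogramFactorySketch {

    static Histogram histogramSketch(Clock clock, DistributionStatisticConfig config) {
        if (config.isPublishingPercentiles()) {
            // HdrHistogram-based histogram, needed for client-side percentiles
            return new TimeWindowPercentileHistogram(clock, config, true);
        }
        if (config.isPublishingHistogram()) {
            // fixed-boundary histogram backing the Distribution bucket counts
            return new TimeWindowFixedBoundaryHistogram(clock, config, true);
        }
        return NoopHistogram.INSTANCE; // no histogram configured
    }
}
```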

// boundary
bucketCounts.add(Math.max(0, snapshot.count() - truncatedSum.get()));
// Stackdriver Distribution must have at least one bucket count
if (bucketCounts.isEmpty()) {
Member (reviewer):

Do we need this check now that the histogram always contains the inf bucket?

shakuzen (Member Author):

This is for the no-op Histogram case, which will still have no buckets.
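A hedged sketch of the guard being discussed (illustrative; the actual registry code may differ):

```java
import java.util.List;

class AtLeastOneBucketSketch {

    // A no-op histogram reports no buckets, but a Stackdriver Distribution must
    // have at least one bucket count, so fall back to a single zero bucket.
    static void ensureAtLeastOneBucket(List<Long> bucketCounts) {
        if (bucketCounts.isEmpty()) {
            bucketCounts.add(0L);
        }
    }
}
```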

// count is 4, but sum of bucket counts is 5 due to inconsistent snapshotting
HistogramSnapshot histogramSnapshot = new HistogramSnapshot(4, 14.7, 5, null,
new CountAtBucket[] { new CountAtBucket(1.0, 2), new CountAtBucket(2.0, 5) }, null);
// count is 0 from previous step, but sum of bucket counts is 5
Member (reviewer):

Should we also check that the last bucket count is 5?

shakuzen (Member Author):

We can, but we check this in other tests, and I was trying to keep this test case focused on the bug fix mentioned.

new CountAtBucket[] { new CountAtBucket(1.0, 2), new CountAtBucket(2.0, 5) }, null);
// count is 0 from previous step, but sum of bucket counts is 5
HistogramSnapshot histogramSnapshot = ds.takeSnapshot();
assertThat(histogramSnapshot.count()).isEqualTo(0);
Distribution distribution = batch.distribution(histogramSnapshot, false);
List<Long> bucketCountsList = distribution.getBucketCountsList();
assertThat(bucketCountsList.get(bucketCountsList.size() - 1)).isNotNegative();
Member (reviewer):

Shouldn't we do an equality check on the value of the inf bucket instead?

shakuzen (Member Author):

We can now, with these changes, since the infinity bucket is properly tracked. Before, we could not, and the previous fix only ensured we didn't report it as less than zero, which would cause a validation error on the Stackdriver side. I'm not sure how much we want to change this existing test case. In many ways it's probably not needed anymore, given the assertions done in other test cases, but I thought it might be good to keep it as a focused unit test for a previously fixed bug.
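A hedged sketch of what the stricter assertion could look like, continuing the test code shown above; the expected value is a placeholder, not a number taken from this PR:

```java
// Follows on from the test code above; assumes AssertJ's assertThat is statically imported.
long expectedInfinityBucketCount = 0L; // placeholder for the properly tracked +Inf count
List<Long> bucketCounts = distribution.getBucketCountsList();
assertThat(bucketCounts.get(bucketCounts.size() - 1)).isEqualTo(expectedInfinityBucketCount);
```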

* @see PercentileHistogramBuckets
*/
protected TimeWindowFixedBoundaryHistogram(Clock clock, DistributionStatisticConfig config,
boolean supportsAggregablePercentiles, boolean isCumulativeBucketCounts, boolean includeInfinityBucket) {
Member (reviewer):

It's a bit more verbose, but it might describe the intent better: alwaysIncludeInfinityBucket instead of includeInfinityBucket?

public TimeWindowFixedBoundaryHistogram(Clock clock, DistributionStatisticConfig config,
boolean supportsAggregablePercentiles) {
this(clock, config, supportsAggregablePercentiles, true);
}

/**
* Create a {@code TimeWindowFixedBoundaryHistogram} instance.
* Create a {@code TimeWindowFixedBoundaryHistogram} instance with buckets based on
Member (reviewer):

Should we mention here, and in the docs of the previous ctor, what the defaults are? Here: the inf bucket is not forced / might be missing.

shakuzen (Member Author):

I was trying to allude to that by saying the buckets are based on the DistributionStatisticConfig and supportsAggregablePercentiles. If those result in an inf bucket, it'll be there. If not, it won't.

@@ -45,7 +44,7 @@ abstract class AbstractTimeWindowHistogram<T, U> implements Histogram {

private final Clock clock;

private final boolean supportsAggregablePercentiles;
protected final boolean supportsAggregablePercentiles;
shakuzen (Member Author):

I needed to make this protected so that TimeWindowPercentileHistogram can include the logic that was previously here in takeCountSnapshot().

@shakuzen force-pushed the stackdriver-histogram branch from b6b2b1b to 51d3186 on January 30, 2025 at 13:59

Successfully merging this pull request may close these issues.

Distribution value has |bucket_counts| fields that sum to 4 which does not equal the |count| field value of 2