Fix Stackdriver Distribution count and bucket count sum mismatch #5836

Draft · shakuzen wants to merge 1 commit into main from stackdriver-histogram

Conversation

@shakuzen (Member) commented Jan 29, 2025

The count passed to the HistogramSnapshot comes from the StepTimer and is step-based, while the histogram buckets are tracked over a separate time window. Because of this mismatch in time windows, the two values generally do not match, which causes a validation error when publishing Distribution metrics to Stackdriver.

This PR solves the problem by always tracking an infinity bucket in the histogram, so we can get a consistent count over the same time window as the rest of the histogram.
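To illustrate the invariant at play, here is a hedged sketch (not code from this PR): Stackdriver requires the sum of the bucket counts to equal the Distribution's count, so the count has to be derived from the same time window as the buckets. Assuming cumulative bucket counts that end in a +Inf bucket, the last bucket already carries a consistent total:

```java
import io.micrometer.core.instrument.distribution.CountAtBucket;

class DistributionCountSketch {

    // With cumulative bucket counts that end in a +Inf bucket, the last bucket
    // already holds the total count for the histogram's own time window.
    static long countFromHistogram(CountAtBucket[] cumulativeBuckets) {
        return cumulativeBuckets.length == 0 ? 0
                : (long) cumulativeBuckets[cumulativeBuckets.length - 1].count();
    }

    public static void main(String[] args) {
        CountAtBucket[] buckets = {
                new CountAtBucket(1.0, 2),                      // <= 1.0
                new CountAtBucket(2.0, 5),                      // <= 2.0 (cumulative)
                new CountAtBucket(Double.POSITIVE_INFINITY, 5)  // +Inf: total in this window
        };
        long stepCount = 4;                                // step-based count, different window
        long histogramCount = countFromHistogram(buckets); // 5, consistent with the buckets
        // Stackdriver validation: the (non-cumulative) bucket counts must sum to the count,
        // which only holds if the count comes from the same window as the buckets.
        System.out.println("step count = " + stepCount + ", histogram count = " + histogramCount);
    }
}
```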

I still need to do more testing with empty (no-op) histograms and with percentiles configured, but I wanted to open the PR in case anyone wants to review the approach in the meantime.

Closes gh-4868

@shakuzen (Member Author) left a comment:

Leaving some notes for anyone reviewing and my future self before I forget.

@@ -508,18 +505,17 @@ private TimeInterval interval(MetricDescriptor.MetricKind metricKind) {
Distribution distribution(HistogramSnapshot snapshot, boolean timeDomain) {
CountAtBucket[] histogram = snapshot.histogramCounts();

// selected finite buckets (represented as a normal histogram)
shakuzen (Member Author):

We no longer need to convert cumulative buckets to non-cumulative because, with the change to dedicated Stackdriver types for Timer and DistributionSummary, we pass the flag for a non-cumulative histogram.
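For readers less familiar with the distinction, a minimal sketch of the conversion being referred to, turning cumulative bucket counts into per-bucket counts; this is illustrative only, not the removed registry code:

```java
import io.micrometer.core.instrument.distribution.CountAtBucket;

class CumulativeToNonCumulativeSketch {

    // Each non-cumulative bucket count is the difference between adjacent
    // cumulative counts; this step becomes unnecessary once the histogram is
    // configured to produce non-cumulative counts directly.
    static long[] toNonCumulative(CountAtBucket[] cumulative) {
        long[] perBucket = new long[cumulative.length];
        double previous = 0;
        for (int i = 0; i < cumulative.length; i++) {
            perBucket[i] = (long) (cumulative[i].count() - previous);
            previous = cumulative[i].count();
        }
        return perBucket;
    }
}
```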

.setMean(timeDomain ? snapshot.mean(getBaseTimeUnit()) : snapshot.mean())
.setCount(snapshot.count())
.setCount(cumulativeCount)
shakuzen (Member Author):

This is the part that fixes things: snapshot.count() is the step-based count, which does not align with the histogram time window, while cumulativeCount is taken from the histogram itself.

bucketCounts = bucketCounts.subList(0, lastNonZeroIndex + 1);
// infinite bucket count of 0 can be omitted
bucketCounts.add(infCount);
shakuzen (Member Author):

I left this note here, but I did not omit zero-count infinity buckets, to keep the behavior the same as before, since this is a change in a maintenance branch.

}

final Set<Double> monitoredValues = distributionStatisticConfig
.getHistogramBuckets(supportsAggregablePercentiles);
shakuzen (Member Author):

This logic of fetching the buckets again to include in the snapshot was problematic for the goal of always having an infinity bucket (in the Stackdriver case), since it relies only on the logic in DistributionStatisticConfig, and at this level (AbstractTimeWindowHistogram) we do not know whether the infinity bucket should be included. Hence this was removed, to avoid trying to keep the buckets tracked in the concrete histogram type (e.g. TimeWindowFixedBoundaryHistogram) in sync with the buckets returned here when Histogram#takeSnapshot is called. With the new package-private method countsAtBuckets, we let the concrete implementation determine which buckets to return, rather than passing the buckets in from here. We could even get rid of the countsAtValues method, I think.
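A hedged sketch of the two method shapes under discussion (signatures taken from the comments in this thread; the surrounding class and comments are an interpretation, not the actual source):

```java
import java.util.Iterator;

import io.micrometer.core.instrument.distribution.CountAtBucket;

abstract class HistogramBucketApiSketch {

    // Older shape: the caller supplies the bucket boundaries, historically from
    // DistributionStatisticConfig#getHistogramBuckets, which made it hard for
    // this level to know whether a +Inf bucket should be included.
    abstract Iterator<CountAtBucket> countsAtValues(Iterator<Double> values);

    // New package-private shape: the concrete histogram decides which buckets
    // to report, so it can always include the +Inf bucket when configured to.
    abstract CountAtBucket[] countsAtBuckets();
}
```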

Member (reviewer):

Indeed, Iterator<CountAtBucket> countsAtValues(Iterator<Double> values) might be unnecessary while we also have CountAtBucket[] countsAtBuckets(). I'm not sure, though, whether there is a use case for countsAtValues where not all the bucket boundaries are passed.

shakuzen (Member Author):

It's package-private, so we can check all the usages. AFAIR, it's always the result of distributionStatisticConfig.getHistogramBuckets being passed, until this PR deviated from that to add the infinity bucket. We previously had a regression where the buckets didn't match - see #3922.

shakuzen (Member Author):

I've removed it now, as I refactored things in TimeWindowPercentileHistogram.

* @see PercentileHistogramBuckets
*/
protected TimeWindowFixedBoundaryHistogram(Clock clock, DistributionStatisticConfig config,
boolean supportsAggregablePercentiles, boolean isCumulativeBucketCounts, boolean includeInfinityBucket) {
shakuzen (Member Author):

This protected constructor is the only (semi-)public API added or changed as part of this pull request. I tried to keep API changes minimal since this is a maintenance branch. I would have liked not to add any, but I did not come up with a way to achieve that without doing something like copying everything from TimeWindowFixedBoundaryHistogram into a Histogram implementation in the stackdriver module, which would probably be more prone to code rot and bugs.
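A hedged sketch of how the new protected constructor could be consumed from the stackdriver module; the subclass name is hypothetical, and the supportsAggregablePercentiles value is an assumption. Only the non-cumulative and infinity-bucket flags follow from this PR:

```java
import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.distribution.DistributionStatisticConfig;
import io.micrometer.core.instrument.distribution.TimeWindowFixedBoundaryHistogram;

// Hypothetical subclass living in the stackdriver module.
class StackdriverFixedBoundaryHistogram extends TimeWindowFixedBoundaryHistogram {

    StackdriverFixedBoundaryHistogram(Clock clock, DistributionStatisticConfig config) {
        super(clock, config,
                /* supportsAggregablePercentiles = */ true,  // assumption for illustration
                /* isCumulativeBucketCounts = */ false,      // Stackdriver wants non-cumulative counts
                /* includeInfinityBucket = */ true);         // the point of the new flag
    }
}
```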

// copied and modified from AbstractDistributionSummary/AbstractTimer
static Histogram stackdriverHistogram(Clock clock, DistributionStatisticConfig distributionStatisticConfig) {
if (distributionStatisticConfig.isPublishingPercentiles()) {
// hdr-based histogram
Member (reviewer):

HdrHistogram-based histogram?
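For orientation, a hedged sketch of the shape such a factory typically takes in Micrometer, based on AbstractTimer/AbstractDistributionSummary (which the diff comment says this code was copied from); the PR's exact flags and types are not reproduced here:

```java
import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.distribution.DistributionStatisticConfig;
import io.micrometer.core.instrument.distribution.Histogram;
import io.micrometer.core.instrument.distribution.NoopHistogram;
import io.micrometer.core.instrument.distribution.TimeWindowFixedBoundaryHistogram;
import io.micrometer.core.instrument.distribution.TimeWindowPercentileHistogram;

class StackdriverHistogramFactorySketch {

    static Histogram histogramSketch(Clock clock, DistributionStatisticConfig config) {
        if (config.isPublishingPercentiles()) {
            // HdrHistogram-based histogram, needed for client-side percentiles
            return new TimeWindowPercentileHistogram(clock, config, true);
        }
        if (config.isPublishingHistogram()) {
            // fixed-boundary histogram backing the Distribution bucket counts
            return new TimeWindowFixedBoundaryHistogram(clock, config, true);
        }
        return NoopHistogram.INSTANCE; // no histogram configured
    }
}
```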

// boundary
bucketCounts.add(Math.max(0, snapshot.count() - truncatedSum.get()));
// Stackdriver Distribution must have at least one bucket count
if (bucketCounts.isEmpty()) {
Member (reviewer):

Do we need this check now that the histogram always contains the inf bucket?

shakuzen (Member Author):

This is for the no-op Histogram case, which will still have no buckets.
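A hedged sketch of the guard being discussed (illustrative; the actual registry code may differ):

```java
import java.util.List;

class AtLeastOneBucketSketch {

    // A no-op histogram reports no buckets, but a Stackdriver Distribution must
    // have at least one bucket count, so fall back to a single zero bucket.
    static void ensureAtLeastOneBucket(List<Long> bucketCounts) {
        if (bucketCounts.isEmpty()) {
            bucketCounts.add(0L);
        }
    }
}
```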

// count is 4, but sum of bucket counts is 5 due to inconsistent snapshotting
HistogramSnapshot histogramSnapshot = new HistogramSnapshot(4, 14.7, 5, null,
new CountAtBucket[] { new CountAtBucket(1.0, 2), new CountAtBucket(2.0, 5) }, null);
// count is 0 from previous step, but sum of bucket counts is 5
Member (reviewer):

Should we also check that the last bucket count is 5?

shakuzen (Member Author):

We can, but we check this in other tests, and I was trying to keep this test case focused on the bug fix mentioned.

new CountAtBucket[] { new CountAtBucket(1.0, 2), new CountAtBucket(2.0, 5) }, null);
// count is 0 from previous step, but sum of bucket counts is 5
HistogramSnapshot histogramSnapshot = ds.takeSnapshot();
assertThat(histogramSnapshot.count()).isEqualTo(0);
Distribution distribution = batch.distribution(histogramSnapshot, false);
List<Long> bucketCountsList = distribution.getBucketCountsList();
assertThat(bucketCountsList.get(bucketCountsList.size() - 1)).isNotNegative();
Member (reviewer):

Shouldn't we do an equality check on the value of the inf bucket instead?

shakuzen (Member Author):

We can now, with these changes, since the infinity bucket is properly tracked. Before, we could not, and the previous fix only ensured we didn't report it as less than zero, which would cause a validation error on the Stackdriver side. I'm not sure how much we want to change this existing test case. In many ways it's probably not needed anymore, given the assertions done in other test cases, but I thought it might be good to keep it as a focused unit test for a previously fixed bug.
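A hedged sketch of what the stricter assertion could look like, continuing the test code shown above; the expected value is a placeholder, not a number taken from this PR:

```java
// Follows on from the test code above; assumes AssertJ's assertThat is statically imported.
long expectedInfinityBucketCount = 0L; // placeholder for the properly tracked +Inf count
List<Long> bucketCounts = distribution.getBucketCountsList();
assertThat(bucketCounts.get(bucketCounts.size() - 1)).isEqualTo(expectedInfinityBucketCount);
```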

* @see PercentileHistogramBuckets
*/
protected TimeWindowFixedBoundaryHistogram(Clock clock, DistributionStatisticConfig config,
boolean supportsAggregablePercentiles, boolean isCumulativeBucketCounts, boolean includeInfinityBucket) {
Member (reviewer):

It's a bit more verbose, but it might describe the intent better: alwaysIncludeInfinityBucket instead of includeInfinityBucket?

public TimeWindowFixedBoundaryHistogram(Clock clock, DistributionStatisticConfig config,
boolean supportsAggregablePercentiles) {
this(clock, config, supportsAggregablePercentiles, true);
}

/**
* Create a {@code TimeWindowFixedBoundaryHistogram} instance.
* Create a {@code TimeWindowFixedBoundaryHistogram} instance with buckets based on
Member (reviewer):

Should we mention here, and in the docs of the previous ctor, what the defaults are? Here: the inf bucket is not forced / might be missing.

shakuzen (Member Author):

I was trying to allude to that by saying the buckets are based on the DistributionStatisticConfig and supportsAggregablePercentiles. If those result in an inf bucket, it'll be there. If not, it won't.

@@ -45,7 +44,7 @@ abstract class AbstractTimeWindowHistogram<T, U> implements Histogram {

private final Clock clock;

private final boolean supportsAggregablePercentiles;
protected final boolean supportsAggregablePercentiles;
shakuzen (Member Author):

I needed to make this protected so that TimeWindowPercentileHistogram can include the logic that was previously here in takeCountSnapshot().

@shakuzen force-pushed the stackdriver-histogram branch from b6b2b1b to 51d3186 on January 30, 2025 at 13:59

Successfully merging this pull request may close these issues.

Distribution value has |bucket_counts| fields that sum to 4 which does not equal the |count| field value of 2