-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Stackdriver Distribution count and bucket count sum mismatch #5836
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving some notes for anyone reviewing and my future self before I forget.
@@ -508,18 +505,17 @@ private TimeInterval interval(MetricDescriptor.MetricKind metricKind) { | |||
Distribution distribution(HistogramSnapshot snapshot, boolean timeDomain) { | |||
CountAtBucket[] histogram = snapshot.histogramCounts(); | |||
|
|||
// selected finite buckets (represented as a normal histogram) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We no longer need to convert cumulative buckets to non-cumulative because with the change to dedicated Stackdriver types for Timer and DistributionSummary, we pass the flag for non-cumulative histogram.
.setMean(timeDomain ? snapshot.mean(getBaseTimeUnit()) : snapshot.mean()) | ||
.setCount(snapshot.count()) | ||
.setCount(cumulativeCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the part that fixes things - snapshot.count()
is the step-based count, which does not align with the histogram time window, while the cumulativeCount
is taken from the histogram.
bucketCounts = bucketCounts.subList(0, lastNonZeroIndex + 1); | ||
// infinite bucket count of 0 can be omitted | ||
bucketCounts.add(infCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left this note here but did not omit 0 count infinity buckets to try to keep the behavior the same as before since this is a change in a maintenance branch.
} | ||
|
||
final Set<Double> monitoredValues = distributionStatisticConfig | ||
.getHistogramBuckets(supportsAggregablePercentiles); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic of getting the buckets again to include in the snapshot was problematic for the goal of always having an infinity bucket (in the case of Stackdriver), since this relies only on the logic in DistributionStatisticConfig
and we do not know at this level (AbstractTimeWindowHistogram
) if we want to include the infinity bucket or not. Hence this was removed to avoid trying to keep the buckets in the concrete histogram type (e.g. TimeWindowFixedBoundaryHistogram
) to match the buckets returned here when Histogram#takeSnapshot
is called. With the new package-private method countsAtBuckets
we let the concrete implementation determine what buckets to return rather than from here passing the buckets. We could even get rid of the countsAtValues
method, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, Iterator<CountAtBucket> countsAtValues(Iterator<Double> values)
might be unnecessary while we are also having CountAtBucket[] countsAtBuckets()
. I'm not sure though if there is a use-case of countsAtValues
, where not all the bucket boundaries are passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's package-private, so we can check all the usage. AFAIR, it's always distributionStatisticConfig.getHistogramBuckets
being called, until this PR deviated to add the infinity bucket. We previously had a regression where the buckets didn't match - see #3922.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed it now as I refactored things in TimeWindowPercentileHistogram
* @see PercentileHistogramBuckets | ||
*/ | ||
protected TimeWindowFixedBoundaryHistogram(Clock clock, DistributionStatisticConfig config, | ||
boolean supportsAggregablePercentiles, boolean isCumulativeBucketCounts, boolean includeInfinityBucket) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This protected
constructor is the only (semi-)public API added/changed as part of this pull request. I tried to keep API changes minimal since this is in a maintenance branch. I would have liked to not add any, but I did not come up with a way to achieve that without doing something like copying everything from TimeWindowFixedBoundaryHistogram
to an implementation of Histogram in the stackdriver module. That would probably be more prone to code rot and bugs.
// copied and modified from AbstractDistributionSummary/AbstractTimer | ||
static Histogram stackdriverHistogram(Clock clock, DistributionStatisticConfig distributionStatisticConfig) { | ||
if (distributionStatisticConfig.isPublishingPercentiles()) { | ||
// hdr-based histogram |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HdrHistogram-based histogram?
// boundary | ||
bucketCounts.add(Math.max(0, snapshot.count() - truncatedSum.get())); | ||
// Stackdriver Distribution must have at least one bucket count | ||
if (bucketCounts.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this check now that the histogram always contains the inf bucket?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for the no-op Histogram case, which still will have no buckets.
// count is 4, but sum of bucket counts is 5 due to inconsistent snapshotting | ||
HistogramSnapshot histogramSnapshot = new HistogramSnapshot(4, 14.7, 5, null, | ||
new CountAtBucket[] { new CountAtBucket(1.0, 2), new CountAtBucket(2.0, 5) }, null); | ||
// count is 0 from previous step, but sum of bucket counts is 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check that as well that the last bucket count is 5?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can but we check this in other tests and I was trying to keep this test case focused on the mentioned bug fix.
new CountAtBucket[] { new CountAtBucket(1.0, 2), new CountAtBucket(2.0, 5) }, null); | ||
// count is 0 from previous step, but sum of bucket counts is 5 | ||
HistogramSnapshot histogramSnapshot = ds.takeSnapshot(); | ||
assertThat(histogramSnapshot.count()).isEqualTo(0); | ||
Distribution distribution = batch.distribution(histogramSnapshot, false); | ||
List<Long> bucketCountsList = distribution.getBucketCountsList(); | ||
assertThat(bucketCountsList.get(bucketCountsList.size() - 1)).isNotNegative(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should not we do an equals check for the value of the inf bucket instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can now with these changes since the infinity bucket is properly tracked - before we could not and the previous fix was only to ensure we didn't report it as less than zero which would cause a validation error on the Stackdriver side. I'm not sure how much we want to change this existing test case or not. In a lot of ways it's probably not needed anymore with the assertions done in other test cases, but I thought it might be good to keep it as a focused unit test for a previously fixed bug.
} | ||
|
||
final Set<Double> monitoredValues = distributionStatisticConfig | ||
.getHistogramBuckets(supportsAggregablePercentiles); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, Iterator<CountAtBucket> countsAtValues(Iterator<Double> values)
might be unnecessary while we are also having CountAtBucket[] countsAtBuckets()
. I'm not sure though if there is a use-case of countsAtValues
, where not all the bucket boundaries are passed.
* @see PercentileHistogramBuckets | ||
*/ | ||
protected TimeWindowFixedBoundaryHistogram(Clock clock, DistributionStatisticConfig config, | ||
boolean supportsAggregablePercentiles, boolean isCumulativeBucketCounts, boolean includeInfinityBucket) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit more verbose but might describe the intent better: alwaysIncludeInfinityBucket
instead of includeInfinityBucket
?
public TimeWindowFixedBoundaryHistogram(Clock clock, DistributionStatisticConfig config, | ||
boolean supportsAggregablePercentiles) { | ||
this(clock, config, supportsAggregablePercentiles, true); | ||
} | ||
|
||
/** | ||
* Create a {@code TimeWindowFixedBoundaryHistogram} instance. | ||
* Create a {@code TimeWindowFixedBoundaryHistogram} instance with buckets based on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we mention here and in the docs of the previous ctor what are the defaults?
Here: inf bucket is not forced/might be missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to allude to that by saying the buckets are based on the DistributionStatisticConfig and supportsAggregablePercentiles. If those result in an inf bucket, it'll be there. If not, it won't.
@@ -45,7 +44,7 @@ abstract class AbstractTimeWindowHistogram<T, U> implements Histogram { | |||
|
|||
private final Clock clock; | |||
|
|||
private final boolean supportsAggregablePercentiles; | |||
protected final boolean supportsAggregablePercentiles; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I needed to make this protected
for TimeWindowPercentileHistogram
to include the logic that was previously here in takeCountSnapshot()
The count passed to the `HistogramSnapshot` is from the StepTimer and is step-based, while the histogram buckets have a separate time window. This mismatch in time windows caused the two to generally not match which causes a validation error when publishing Distribution metrics to Stackdriver. This solves this by tracking an infinity bucket in the histogram so we can get a consistent count in the same time window as the rest of the histogram.
b6b2b1b
to
51d3186
Compare
The count passed to the
HistogramSnapshot
is from the StepTimer and is step-based, while the histogram buckets have a separate time window. This mismatch in time windows caused the two to generally not match which causes a validation error when publishing Distribution metrics to Stackdriver.This solves this by always tracking an infinity bucket in the histogram so we can get a consistent count in the same time window as the rest of the histogram.
I still need to do some more testing with empty (no-op) histograms and percentiles configured, but I wanted to open the PR in case anyone wants to do some review on the approach in the meantime.
Closes gh-4868