Timer/distribution summary snapshot values not self-consistent #998

Open
john-karp opened this issue Nov 9, 2018 · 15 comments
Labels
bug A general bug

Comments

@john-karp
Contributor

john-karp commented Nov 9, 2018

While the name might suggest otherwise, the timer/distribution summary takeSnapshot() implementations do not atomically capture statistics, so that count(), max(), and histogramCounts() are not necessarily self-consistent. This can lead to situations where the total counts of the histogram buckets may not equal snapshot.count(). Or a downstream system using metrics might compute incorrect rates (total / count) because the time periods they measured are effectively different.
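
A minimal stand-alone sketch of one way to observe this (not code from Micrometer; the meter name, SLO boundaries and timings are made up, and it assumes a Micrometer version that has Timer.Builder#serviceLevelObjectives): record into a Timer from one thread while another thread takes snapshots and compares count() with the highest cumulative bucket.

import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.distribution.CountAtBucket;
import io.micrometer.core.instrument.distribution.HistogramSnapshot;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;

public class SnapshotConsistencyCheck {
    public static void main(String[] args) throws InterruptedException {
        SimpleMeterRegistry registry = new SimpleMeterRegistry();
        Timer timer = Timer.builder("demo.timer")
                .serviceLevelObjectives(Duration.ofMillis(100), Duration.ofMillis(500))
                .register(registry);

        // Writer thread: record values as fast as possible to widen the race window.
        Thread writer = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                timer.record(Duration.ofMillis(150));
            }
        });
        writer.setDaemon(true);
        writer.start();

        // Reader: take snapshots and flag any mismatch between count() and the
        // highest cumulative bucket count.
        for (int i = 0; i < 1_000; i++) {
            HistogramSnapshot snapshot = timer.takeSnapshot();
            CountAtBucket[] buckets = snapshot.histogramCounts();
            double lastBucketCount = buckets[buckets.length - 1].count();
            if (snapshot.count() != (long) lastBucketCount) {
                System.out.printf("inconsistent: count=%d, last bucket=%.0f%n",
                        snapshot.count(), lastBucketCount);
            }
            Thread.sleep(1);
        }
        writer.interrupt();
    }
}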

@jrask

jrask commented Dec 15, 2018

For us this is not a rare case; it happens more or less all the time, so at the moment this is a blocker for us. Are there any thoughts about if and how this could be solved? It would be great if you could point me in the right direction.

@jkschneider
Contributor

jkschneider commented Dec 15, 2018

Without thinking too much about it, seems like in the presence of histograms you could just take the histogram sum as the count. Neither value is more correct really.

@jrask

jrask commented Dec 17, 2018

Thanks, so there is no plan or proposal for how to solve this for "real"? My guess was that this is a concurrency issue. I could look into it; I was just hoping to get some initial introduction to the problem.

What we are considering is that there are two scenarios. One is OK, and that is when we have durations that are longer than our max SLA. In that case it would be acceptable for Count to be higher than CountAtBucket[lastValue].count(), right?

So would a workaround like this be good enough?

timer.count > CountAtBucket[lastValue].count() && Timer.max <= SLA.Max
== Rely on CountAtBucket[lastValue].count() as Count. Do nothing else.

timer.count > CountAtBucket[lastValue].count() && Timer.max > SLA.Max
== Rely on Count as Count, and add an extra bucket of Timer.count - CountAtBucket[lastValue].count()
   covering the range from SLA.Max to +Inf.
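
To make that concrete, here is a rough Java sketch of the workaround as I imagine it (the method and parameter names are mine, not Micrometer API; CountAtBucket is io.micrometer.core.instrument.distribution.CountAtBucket and the values are in nanoseconds):

// Decide which count to report, following the two cases above.
static long consistentCount(long timerCount, CountAtBucket[] buckets,
                            double maxNanos, double slaMaxNanos) {
    double lastBucketCount = buckets[buckets.length - 1].count();
    if (timerCount > lastBucketCount && maxNanos <= slaMaxNanos) {
        // Case 1: everything fits under the SLA max, so rely on the cumulative bucket count.
        return (long) lastBucketCount;
    }
    if (timerCount > lastBucketCount && maxNanos > slaMaxNanos) {
        // Case 2: some observations exceeded the SLA max. Keep the timer count and report
        // the difference (timerCount - lastBucketCount) as an extra (SLA.Max, +Inf) bucket.
        return timerCount;
    }
    return timerCount; // the values already agree
}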

@jkschneider
Contributor

I think we could always rely on the count on a +Inf bucket to be the count when histograms are being recorded. They can only diverge now because count is maintained independently of histogram counts, so if a snapshot is taken between when count has been updated and the histogram is updated, they can appear inconsistent. Neither value is "more correct" in absolute terms, but consistency within a snapshot is preferable.
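
For what it's worth, a consumer can already apply this on its side today; a minimal sketch (the helper name is mine, not Micrometer API):

// Take the highest cumulative bucket as the count when a histogram is present,
// falling back to count() otherwise. HistogramSnapshot and CountAtBucket are from
// io.micrometer.core.instrument.distribution.
static long countFromSnapshot(HistogramSnapshot snapshot) {
    CountAtBucket[] buckets = snapshot.histogramCounts();
    return buckets.length > 0
            ? (long) buckets[buckets.length - 1].count()
            : snapshot.count();
}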

@jkschneider jkschneider modified the milestones: 1.0.9, 1.0.10 Jan 10, 2019
@jrask

jrask commented Jan 21, 2019

Hi, I just have to leave a comment on this again since I really do not understand what is happening. This is the result of a Timer being invoked with record() every 100 ms, and we have steps=2s (I have tried different intervals, same result). Everything just runs in a while(true) loop.

The result shows a big difference between histogram.count and histogram cumulative count. Is this really just "an atomic capture" issue? Anyway, if we are not able to get this to work properly I do not really know what to do. This is not a rare situation, it happens all the time.

It is not really useful to use count=13 when the reality is 20, is it?

Start

The histogram max value jumps between 20 and 13 and then stabilizes around 13-14.

Timer.cnt=20, Histogram.cnt=13, HistogramSnapshot{count=20, total=4.667E10, mean=2.3335E9, max=4.866E9, percentileValues=[(2.277507072E9 at 50.0%), (2.948595712E9 at 75.0%), (4.559208448E9 at 95.0%), (4.827643904E9 at 99.0%)], histogramCounts=[(4.0 at 1000000000), (13.0 at 5000000000), (13.0 at 7000000000)]}

Timer.cnt=20, HistogramSnapshot{count=20, total=4.667E10, mean=2.3335E9, max=4.866E9, percentileValues=[(2.07618048E9 at 50.0%), (2.545942528E9 at 75.0%), (4.290772992E9 at 95.0%), (4.290772992E9 at 99.0%)], histogramCounts=[(4.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}

Timer.count=19, HistogramSnapshot{count=19, total=5.8642E10, mean=3.086421052631579E9, max=4.772E9, percentileValues=[(2.948595712E9 at 50.0%), (4.559208448E9 at 75.0%), (4.827643904E9 at 95.0%), (4.827643904E9 at 99.0%)], histogramCounts=[(1.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}

Timer.cnt=19, HistogramSnapshot{count=19, total=4.2482E10, mean=2.2358947368421054E9, max=4.606E9, percentileValues=[(2.277507072E9 at 50.0%), (3.08281344E9 at 75.0%), (4.290772992E9 at 95.0%), (4.827643904E9 at 99.0%)], histogramCounts=[(3.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}

Timer.cnt=20, HistogramSnapshot{count=20, total=5.2304E10, mean=2.6152E9, max=4.926E9, percentileValues=[(2.948595712E9 at 50.0%), (3.619684352E9 at 75.0%), (4.156555264E9 at 95.0%), (5.09607936E9 at 99.0%)], histogramCounts=[(1.0 at 1000000000), (13.0 at 5000000000), (13.0 at 7000000000)]}

Timer.cnt=19, HistogramSnapshot{count=19, total=4.3376E10, mean=2.2829473684210525E9, max=4.933E9, percentileValues=[(2.950692864E9 at 50.0%), (3.621781504E9 at 75.0%), (4.5613056E9 at 95.0%), (5.098176512E9 at 99.0%)], histogramCounts=[(4.0 at 1000000000), (13.0 at 5000000000), (13.0 at 7000000000)]}

However, one time I forgot to stop the program, and when I came back I noticed that the histogram's max bucket count was slowly but constantly increasing and reached the following, which looks promising: it almost matches the total count of 19 or 20.

Timer.cnt=20, HistogramSnapshot{count=20, total=4.3065E10, mean=2.15325E9, max=4.673E9, percentileValues=[(2.415886336E9 at 50.0%), (3.355410432E9 at 75.0%), (4.1607168E9 at 95.0%), (4.83180544E9 at 99.0%)], histogramCounts=[(5.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}

Timer.cnt=20, HistogramSnapshot{count=20, total=4.4401E10, mean=2.22005E9, max=4.943E9, percentileValues=[(1.946124288E9 at 50.0%), (4.026499072E9 at 75.0%), (4.83180544E9 at 95.0%), (5.100240896E9 at 99.0%)], histogramCounts=[(8.0 at 1000000000), (20.0 at 5000000000), (20.0 at 7000000000)]}

Timer.cnt=20, HistogramSnapshot{count=20, total=5.1766E10, mean=2.5883E9, max=4.94E9, percentileValues=[(2.550104064E9 at 50.0%), (3.48962816E9 at 75.0%), (5.100240896E9 at 95.0%), (5.100240896E9 at 99.0%)], histogramCounts=[(4.0 at 1000000000), (20.0 at 5000000000), (20.0 at 7000000000)]}

Timer.cnt=19, HistogramSnapshot{count=19, total=4.4872E10, mean=2.3616842105263157E9, max=4.977E9, percentileValues=[(2.013233152E9 at 50.0%), (4.026499072E9 at 75.0%), (4.83180544E9 at 95.0%), (5.100240896E9 at 99.0%)], histogramCounts=[(5.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}

And then, all of a sudden, it drops to 13-14 again.

@jrask

jrask commented Jan 21, 2019

@jkschneider Hi again, it really seems like timing is everything here. I changed this gist https://gist.github.com/jrask/52c70475ac744af56f88e43fab80a7c9 and just added

for (Meter m : getMeters()) {
    if (m instanceof Timer) {
        Thread.sleep(20);
        Timer t = (Timer) m;
        HistogramSnapshot histogramSnapshot = t.takeSnapshot();
        System.out.println(t.count() + ", " + histogramSnapshot);
        ...
    }
}

before handling each meter (I only have one). This changed the output from proper results to ones with a noticeable difference. In my code, since I do things like custom JSON generation, I can imagine that 20 ms can pass from the time publish() is invoked until I actually use the snapshot. (Note: this is not related to my comment above; that was another test case, while this is "raw" Micrometer.)

Increasing the sleep does not seem to change much.

Anyway, perhaps you already knew this, but for me this proves that it is broken. I am not really sure what to do next, or whether this is solvable. It is "too" wrong to be useful; I cannot rely on milliseconds being on my side.

No sleep

29, HistogramSnapshot{count=29, total=4.35E9, mean=1.5E8, max=1.5E8, histogramCounts=[(0.0 at 100000000), (29.0 at 200000000), (29.0 at 500000000)]}
30, HistogramSnapshot{count=30, total=4.5E9, mean=1.5E8, max=1.5E8, histogramCounts=[(0.0 at 100000000), (30.0 at 200000000), (30.0 at 500000000)]}

To this, where we lose around 10 in the histogram buckets' total count.

With sleep(20)

30, HistogramSnapshot{count=30, total=4.5E9, mean=1.5E8, max=1.5E8, histogramCounts=[(0.0 at 100000000), (20.0 at 200000000), (20.0 at 500000000)]}
29, HistogramSnapshot{count=29, total=4.35E9, mean=1.5E8, max=1.5E8, histogramCounts=[(0.0 at 100000000), (29.0 at 200000000), (29.0 at 500000000)]}

@shakuzen shakuzen modified the milestones: 1.0.10, 1.0.11 Mar 26, 2019
@shakuzen shakuzen modified the milestones: 1.0.11, 1.1.6 Jun 14, 2019
@shakuzen shakuzen modified the milestones: 1.1.6, 1.1.7 Sep 3, 2019
@shakuzen shakuzen modified the milestones: 1.1.7, 1.1.8 Sep 30, 2019
@shakuzen shakuzen modified the milestones: 1.1.8, 1.1.9 Nov 4, 2019
@shakuzen shakuzen modified the milestones: 1.1.9, 1.1.10 Dec 3, 2019
@shakuzen shakuzen modified the milestones: 1.1.10, 1.1.11 Jan 16, 2020
@shakuzen shakuzen modified the milestones: 1.1.11, 1.1.12 Mar 12, 2020
shakuzen added a commit that referenced this issue Mar 16, 2020
Due to a lack of guarantee that histogram snapshots are self-consistent (see #998), the count value could be lower than the sum of the bucket counts. Before this fix the infinity bucket count would have a negative value, which Stackdriver rejects with an error. This uses zero (0) instead in this situation. This does not fix the potential for inconsistent values in histogram snapshots.

Resolves #1325
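
The fix amounts to clamping the derived overflow bucket at zero; an illustrative sketch of that behavior (not the actual StackdriverMeterRegistry code; the names are made up):

// Given the per-bucket (non-cumulative) counts derived from a snapshot, the implicit
// overflow bucket is whatever remains of the total count, clamped at zero so an
// inconsistent snapshot cannot produce a negative bucket count.
static long overflowBucketCount(long totalCount, long[] perBucketCounts) {
    long sum = 0;
    for (long c : perBucketCounts) {
        sum += c;
    }
    return Math.max(0L, totalCount - sum);
}
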
@shakuzen shakuzen modified the milestones: 1.1.12, 1.1.x Mar 30, 2020
@Fetsivalen
Contributor

Hi, I've faced this issue as well.
I found it is not self-consistent, and not only between the SLO value and the total counter. Here you can see a query that returns the percentage in a specific bucket:
[screenshot]
but it is inconsistent between bucket boundaries as well:
[screenshot]

I've played with aggregations in my queries, found that they are not the cause, and kept the most "self-descriptive" screenshots.

From what I see on my dashboards, most of the fluctuation is near the last two buckets.

I even created a "middle" bucket (which should contain no data) and I don't see such behavior there.

@Fetsivalen
Contributor

Also, I've played with heatmap visualizations for the latency and realized that the same thing happens with percentiles as well as with bucket values.

@shakuzen shakuzen modified the milestones: 1.3.x, 1.5.x Jun 22, 2021
@shakuzen shakuzen modified the milestones: 1.5.x, 1.6.x Aug 12, 2021
@rafal-dudek
Contributor

rafal-dudek commented Oct 29, 2021

Hello,
I noticed problems in two of our applications using StackdriverMeterRegistry from Micrometer 1.6.6. The errors look like this:
com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Field timeSeries[6].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 1 which does not equal the |count| field value of 0.

I believe it is related to the summary snapshot inconsistency.
Maybe many systems do not verify it, but Stackdriver rejects such a metric.

I can see that this issue has not been of high priority, since it has been open for 3 years; can we count on a higher priority?

@cschockaert

Hello,
we see this error too in our Grafana dashboards: we get negative values in the +Inf bucket, which makes it look as if we are getting very high response times for our requests:

[screenshot]

The associated heatmap query is:

sum by (le)(rate(http_server_requests_seconds_bucket{namespace="$namespace", job="$job",method=~"$method",uri=~"$uri",status=~"$status",le=~"0.001|0.003|0.03|0.1|0.3|0.5|1.0|1.5|3.0|5.0|10.0|.Inf"}[$__rate_interval]))

@jonatan-ivanov jonatan-ivanov modified the milestones: 1.6.x, 1.7.x Nov 10, 2021
@shakuzen shakuzen modified the milestones: 1.7.x, 1.8.x May 11, 2022
@jonatan-ivanov jonatan-ivanov modified the milestones: 1.8.x, 1.9.x Jan 12, 2023
@marcingrzejszczak marcingrzejszczak removed this from the 1.9.x milestone Dec 21, 2023
@marcingrzejszczak marcingrzejszczak added the bug A general bug label Dec 21, 2023
@marthursson

> Hello, I noticed problems in two of our applications using StackdriverMeterRegistry from Micrometer 1.6.6. The errors look like this: com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Field timeSeries[6].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 1 which does not equal the |count| field value of 0.
>
> I believe it is related to the summary snapshot inconsistency. Maybe many systems do not verify it, but Stackdriver rejects such a metric.
>
> I can see that this issue has not been of high priority, since it has been open for 3 years; can we count on a higher priority?

Interestingly, running on Micrometer 1.10.x we are not troubled by this particular error (i.e. there are no complaints in the logs), but as soon as we upgrade to 1.11.x or 1.12.x we see warnings similar to this one in our logs.

@gmucha

gmucha commented Feb 6, 2024

We're seeing the same problem with Micrometer 1.11 - not sure if consistency is a problem here, but we noticed that the |count| field is pretty much always 0 while the bucket counts aren't.

Trying to debug the code, I consistently see the count in the underlying StepTimer being 0 when the summary is calculated. A potential workaround would be to disregard the count when creating the distribution for Stackdriver (although, looking at the issue, this may not be related just to Stackdriver).

@marthursson

> We're seeing the same problem with Micrometer 1.11 - not sure if consistency is a problem here, but we noticed that the |count| field is pretty much always 0 while the bucket counts aren't.
>
> Trying to debug the code, I consistently see the count in the underlying StepTimer being 0 when the summary is calculated. A potential workaround would be to disregard the count when creating the distribution for Stackdriver (although, looking at the issue, this may not be related just to Stackdriver).

Interestingly, from a quick look at the code, there is a lot of work involved in calculating the bucket counts. I may be missing some important aspect here (I've basically just looked at this particular function), but since the individual counts are supposed (by the API) to sum to the total count, it should be possible to just calculate the total count in the same function rather than accepting it as a parameter.
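
Something along these lines is what I have in mind; a sketch only (it omits the bucket options and is not the actual StackdriverMeterRegistry code):

import com.google.api.Distribution;
import java.util.List;

// Build the Distribution from the per-bucket counts and derive |count| from them,
// so the exported point can never fail Stackdriver's |bucket_counts| vs |count| check.
static Distribution distributionFrom(List<Long> perBucketCounts, double mean) {
    long count = perBucketCounts.stream().mapToLong(Long::longValue).sum();
    return Distribution.newBuilder()
            .setCount(count)                  // derived from the buckets, not snapshot.count()
            .setMean(mean)
            .addAllBucketCounts(perBucketCounts)
            .build();
}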

@altunkan

We have been observing the same issue ever since we migrated to Spring Boot 3.2.2 (Micrometer 1.12).

com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: One or more TimeSeries could not be written: Field timeSeries[31].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 4 which does not equal the |count| field value of 2.

Any update on this issue?

@rafal-dudek
Contributor

Hello,

More and more applications in our company are having problems with this. For example, with Micrometer 1.12.0:

failed to send metrics to Stackdriver
com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: One or more TimeSeries could not be written: Field timeSeries[82].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 2 which does not equal the |count| field value of 0.;
 Field timeSeries[97].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 2 which does not equal the |count| field value of 0.;
 Field timeSeries[39].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 1 which does not equal the |count| field value of 0.;
 Field timeSeries[5].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 2 which does not equal the |count| field value of 0.;
 Field timeSeries[131].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 1 which does not equal the |count| field value of 0.;
 Field timeSeries[162].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 2 which does not equal the |count| field value of 0.;
 Field timeSeries[196].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 2 which does not equal the |count| field value of 0.
