Timer/distribution summary snapshot values not self-consistent #998

Open
john-karp opened this issue Nov 9, 2018 · 15 comments
Labels
bug A general bug

Comments

@john-karp
Contributor

john-karp commented Nov 9, 2018

While the name might suggest otherwise, the timer/distribution summary takeSnapshot() implementations do not atomically capture statistics, so that count(), max(), and histogramCounts() are not necessarily self-consistent. This can lead to situations where the total counts of the histogram buckets may not equal snapshot.count(). Or a downstream system using metrics might compute incorrect rates (total / count) because the time periods they measured are effectively different.
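
A minimal stand-alone sketch of one way to observe this (not code from Micrometer; the meter name, SLO boundaries and timings are made up, and it assumes a Micrometer version that has Timer.Builder#serviceLevelObjectives): record into a Timer from one thread while another thread takes snapshots and compares count() with the highest cumulative bucket.

import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.distribution.CountAtBucket;
import io.micrometer.core.instrument.distribution.HistogramSnapshot;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;

public class SnapshotConsistencyCheck {
    public static void main(String[] args) throws InterruptedException {
        SimpleMeterRegistry registry = new SimpleMeterRegistry();
        Timer timer = Timer.builder("demo.timer")
                .serviceLevelObjectives(Duration.ofMillis(100), Duration.ofMillis(500))
                .register(registry);

        // Writer thread: record values as fast as possible to widen the race window.
        Thread writer = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                timer.record(Duration.ofMillis(150));
            }
        });
        writer.setDaemon(true);
        writer.start();

        // Reader: take snapshots and flag any mismatch between count() and the
        // highest cumulative bucket count.
        for (int i = 0; i < 1_000; i++) {
            HistogramSnapshot snapshot = timer.takeSnapshot();
            CountAtBucket[] buckets = snapshot.histogramCounts();
            double lastBucketCount = buckets[buckets.length - 1].count();
            if (snapshot.count() != (long) lastBucketCount) {
                System.out.printf("inconsistent: count=%d, last bucket=%.0f%n",
                        snapshot.count(), lastBucketCount);
            }
            Thread.sleep(1);
        }
        writer.interrupt();
    }
}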

@jrask

jrask commented Dec 15, 2018

For us this is not a rare case; it happens more or less all the time, so at the moment this is a blocker for us. Are there any thoughts about if and how this could be solved? It would be great if you could point me in the right direction.

@jkschneider
Contributor

jkschneider commented Dec 15, 2018

Without thinking too much about it, seems like in the presence of histograms you could just take the histogram sum as the count. Neither value is more correct really.

@jrask

jrask commented Dec 17, 2018

Thanks, so there is no plan or proposal for how to solve this for "real"? My guess was that this is a concurrency issue. I could look into it; I was just hoping to get some initial introduction to the problem.

What we are considering is that there are two scenarios. One is OK, and that is when we have durations that are longer than our max SLA. In that case it would be acceptable for Count to be higher than CountAtBucket[lastValue].count(), right?

So would a workaround like this be good enough?

timer.count > CountAtBucket[lastValue].count() && Timer.max <= SLA.Max
== Rely on CountAtBucket[lastValue].count() as Count. Do nothing else.

timer.count > CountAtBucket[lastValue].count() && Timer.max > SLA.Max
== Rely on Count as Count, and add an extra bucket of Timer.count - CountAtBucket[lastValue].count()
   covering the range from SLA.Max to +Inf.
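
To make that concrete, here is a rough Java sketch of the workaround as I imagine it (the method and parameter names are mine, not Micrometer API; CountAtBucket is io.micrometer.core.instrument.distribution.CountAtBucket and the values are in nanoseconds):

// Decide which count to report, following the two cases above.
static long consistentCount(long timerCount, CountAtBucket[] buckets,
                            double maxNanos, double slaMaxNanos) {
    double lastBucketCount = buckets[buckets.length - 1].count();
    if (timerCount > lastBucketCount && maxNanos <= slaMaxNanos) {
        // Case 1: everything fits under the SLA max, so rely on the cumulative bucket count.
        return (long) lastBucketCount;
    }
    if (timerCount > lastBucketCount && maxNanos > slaMaxNanos) {
        // Case 2: some observations exceeded the SLA max. Keep the timer count and report
        // the difference (timerCount - lastBucketCount) as an extra (SLA.Max, +Inf) bucket.
        return timerCount;
    }
    return timerCount; // the values already agree
}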

@jkschneider
Contributor

I think we could always rely on the count on a +Inf bucket to be the count when histograms are being recorded. They can only diverge now because count is maintained independently of histogram counts, so if a snapshot is taken between when count has been updated and the histogram is updated, they can appear inconsistent. Neither value is "more correct" in absolute terms, but consistency within a snapshot is preferable.
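
For what it's worth, a consumer can already apply this on its side today; a minimal sketch (the helper name is mine, not Micrometer API):

// Take the highest cumulative bucket as the count when a histogram is present,
// falling back to count() otherwise. HistogramSnapshot and CountAtBucket are from
// io.micrometer.core.instrument.distribution.
static long countFromSnapshot(HistogramSnapshot snapshot) {
    CountAtBucket[] buckets = snapshot.histogramCounts();
    return buckets.length > 0
            ? (long) buckets[buckets.length - 1].count()
            : snapshot.count();
}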

@jkschneider jkschneider modified the milestones: 1.0.9, 1.0.10 Jan 10, 2019
@jrask

jrask commented Jan 21, 2019

Hi, I just have to leave a comment on this again since I really do not understand what is happening. This is the result of a Timer being invoked with record() every 100 ms, and we have steps=2s (I have tried different intervals, same result). Everything just runs in a while(true) loop.

The result shows a big difference between histogram.count and histogram cumulative count. Is this really just "an atomic capture" issue? Anyway, if we are not able to get this to work properly I do not really know what to do. This is not a rare situation, it happens all the time.

It is not really useful to use count=13 when the reality is 20, is it?

Start

The histogram max value jumps between 20 and 13 and then stabilizes around 13-14.

Timer.cnt=20, Histogram.cnt=13, HistogramSnapshot{count=20, total=4.667E10, mean=2.3335E9, max=4.866E9, percentileValues=[(2.277507072E9 at 50.0%), (2.948595712E9 at 75.0%), (4.559208448E9 at 95.0%), (4.827643904E9 at 99.0%)], histogramCounts=[(4.0 at 1000000000), (13.0 at 5000000000), (13.0 at 7000000000)]}

Timer.cnt=20, HistogramSnapshot{count=20, total=4.667E10, mean=2.3335E9, max=4.866E9, percentileValues=[(2.07618048E9 at 50.0%), (2.545942528E9 at 75.0%), (4.290772992E9 at 95.0%), (4.290772992E9 at 99.0%)], histogramCounts=[(4.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}

Timer.count=19, HistogramSnapshot{count=19, total=5.8642E10, mean=3.086421052631579E9, max=4.772E9, percentileValues=[(2.948595712E9 at 50.0%), (4.559208448E9 at 75.0%), (4.827643904E9 at 95.0%), (4.827643904E9 at 99.0%)], histogramCounts=[(1.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}

Timer.cnt=19, HistogramSnapshot{count=19, total=4.2482E10, mean=2.2358947368421054E9, max=4.606E9, percentileValues=[(2.277507072E9 at 50.0%), (3.08281344E9 at 75.0%), (4.290772992E9 at 95.0%), (4.827643904E9 at 99.0%)], histogramCounts=[(3.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}

Timer.cnt=20, HistogramSnapshot{count=20, total=5.2304E10, mean=2.6152E9, max=4.926E9, percentileValues=[(2.948595712E9 at 50.0%), (3.619684352E9 at 75.0%), (4.156555264E9 at 95.0%), (5.09607936E9 at 99.0%)], histogramCounts=[(1.0 at 1000000000), (13.0 at 5000000000), (13.0 at 7000000000)]}

Timer.cnt=19, HistogramSnapshot{count=19, total=4.3376E10, mean=2.2829473684210525E9, max=4.933E9, percentileValues=[(2.950692864E9 at 50.0%), (3.621781504E9 at 75.0%), (4.5613056E9 at 95.0%), (5.098176512E9 at 99.0%)], histogramCounts=[(4.0 at 1000000000), (13.0 at 5000000000), (13.0 at 7000000000)]}

However, one time I forgot to stop the program, and when I came back I noticed that the histogram's max bucket count was slowly but constantly increasing and reached the following, which looks promising: it almost matches the total count of 19 or 20.

Timer.cnt=20, HistogramSnapshot{count=20, total=4.3065E10, mean=2.15325E9, max=4.673E9, percentileValues=[(2.415886336E9 at 50.0%), (3.355410432E9 at 75.0%), (4.1607168E9 at 95.0%), (4.83180544E9 at 99.0%)], histogramCounts=[(5.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}

Timer.cnt=20, HistogramSnapshot{count=20, total=4.4401E10, mean=2.22005E9, max=4.943E9, percentileValues=[(1.946124288E9 at 50.0%), (4.026499072E9 at 75.0%), (4.83180544E9 at 95.0%), (5.100240896E9 at 99.0%)], histogramCounts=[(8.0 at 1000000000), (20.0 at 5000000000), (20.0 at 7000000000)]}

Timer.cnt=20, HistogramSnapshot{count=20, total=5.1766E10, mean=2.5883E9, max=4.94E9, percentileValues=[(2.550104064E9 at 50.0%), (3.48962816E9 at 75.0%), (5.100240896E9 at 95.0%), (5.100240896E9 at 99.0%)], histogramCounts=[(4.0 at 1000000000), (20.0 at 5000000000), (20.0 at 7000000000)]}

Timer.cnt=19, HistogramSnapshot{count=19, total=4.4872E10, mean=2.3616842105263157E9, max=4.977E9, percentileValues=[(2.013233152E9 at 50.0%), (4.026499072E9 at 75.0%), (4.83180544E9 at 95.0%), (5.100240896E9 at 99.0%)], histogramCounts=[(5.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}

And then, all of a sudden, it drops to 13-14 again.

@jrask

jrask commented Jan 21, 2019

@jkschneider Hi again, it really seems like timing is everything here. I changed this gist https://gist.github.com/jrask/52c70475ac744af56f88e43fab80a7c9 and just added

for (Meter m : getMeters()) {
    if (m instanceof Timer) {
        Thread.sleep(20);
        Timer t = (Timer) m;
        HistogramSnapshot histogramSnapshot = t.takeSnapshot();
        System.out.println(t.count() + ", " + histogramSnapshot);
        ...
    }
}

before handling each meter (I only have one). This changed the output from proper results to ones with a noticeable difference. In my code, since I do things like custom JSON generation, I can imagine that 20 ms can pass from the time publish() is invoked until I actually use the snapshot. (Note: this is not related to my comment above; that was another test case, while this is "raw" Micrometer.)

Increasing the sleep does not seem to change much.

Anyway, perhaps you already knew this, but for me this proves that it is broken. I am not really sure what to do next, or whether this is solvable. It is "too" wrong to be useful; I cannot rely on milliseconds being on my side.

No sleep

29, HistogramSnapshot{count=29, total=4.35E9, mean=1.5E8, max=1.5E8, histogramCounts=[(0.0 at 100000000), (29.0 at 200000000), (29.0 at 500000000)]}
30, HistogramSnapshot{count=30, total=4.5E9, mean=1.5E8, max=1.5E8, histogramCounts=[(0.0 at 100000000), (30.0 at 200000000), (30.0 at 500000000)]}

To this, where we lose around 10 in the histogram buckets' total count.

With sleep(20)

30, HistogramSnapshot{count=30, total=4.5E9, mean=1.5E8, max=1.5E8, histogramCounts=[(0.0 at 100000000), (20.0 at 200000000), (20.0 at 500000000)]}
29, HistogramSnapshot{count=29, total=4.35E9, mean=1.5E8, max=1.5E8, histogramCounts=[(0.0 at 100000000), (29.0 at 200000000), (29.0 at 500000000)]}

@shakuzen shakuzen modified the milestones: 1.0.10, 1.0.11 Mar 26, 2019
@shakuzen shakuzen modified the milestones: 1.0.11, 1.1.6 Jun 14, 2019
@shakuzen shakuzen modified the milestones: 1.1.6, 1.1.7 Sep 3, 2019
@shakuzen shakuzen modified the milestones: 1.1.7, 1.1.8 Sep 30, 2019
@shakuzen shakuzen modified the milestones: 1.1.8, 1.1.9 Nov 4, 2019
@shakuzen shakuzen modified the milestones: 1.1.9, 1.1.10 Dec 3, 2019
@shakuzen shakuzen modified the milestones: 1.1.10, 1.1.11 Jan 16, 2020
@shakuzen shakuzen modified the milestones: 1.1.11, 1.1.12 Mar 12, 2020
shakuzen added a commit that referenced this issue Mar 16, 2020
Due to a lack of guarantee that histogram snapshots are self-consistent (see #998), the count value could be lower than the sum of the bucket counts. Before this fix the infinity bucket count would have a negative value, which Stackdriver rejects with an error. This uses zero (0) instead in this situation. This does not fix the potential for inconsistent values in histogram snapshots.

Resolves #1325
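
The fix amounts to clamping the derived overflow bucket at zero; an illustrative sketch of that behavior (not the actual StackdriverMeterRegistry code; the names are made up):

// Given the per-bucket (non-cumulative) counts derived from a snapshot, the implicit
// overflow bucket is whatever remains of the total count, clamped at zero so an
// inconsistent snapshot cannot produce a negative bucket count.
static long overflowBucketCount(long totalCount, long[] perBucketCounts) {
    long sum = 0;
    for (long c : perBucketCounts) {
        sum += c;
    }
    return Math.max(0L, totalCount - sum);
}
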
@shakuzen shakuzen modified the milestones: 1.1.12, 1.1.x Mar 30, 2020
@Fetsivalen
Contributor

Hi, I've faced this issue as well.
I found it is not self-consistent, and not only between the SLO value and the total counter. Here you can see a query that returns the percentage in a specific bucket:
[screenshot]
but it is inconsistent between bucket boundaries as well:
[screenshot]

I've played with aggregations in my queries, found that they are not the cause, and kept the most "self-descriptive" screenshots.

From what I see on my dashboards, most of the fluctuation is near the last two buckets.

I even created a "middle" bucket (which should contain no data) and I don't see such behavior there.

@Fetsivalen
Contributor

Also, I've played with heatmap visualizations for the latency and realized that the same thing happens with percentiles as well as with bucket values.

@shakuzen shakuzen modified the milestones: 1.3.x, 1.5.x Jun 22, 2021
@shakuzen shakuzen modified the milestones: 1.5.x, 1.6.x Aug 12, 2021
@rafal-dudek
Contributor

rafal-dudek commented Oct 29, 2021

Hello,
I noticed problems in two of our applications using StackdriverMeterRegistry from Micrometer 1.6.6. The errors look like this:
com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Field timeSeries[6].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 1 which does not equal the |count| field value of 0.

I believe it is related to the summary snapshot inconsistency.
Maybe many systems do not verify it, but Stackdriver rejects such a metric.

I can see that this issue has not been of high priority, since it has been open for 3 years; can we count on a higher priority?

@cschockaert

Hello,
we see this error too in our Grafana dashboards: we get negative values in the +Inf bucket, which makes it look as if we are getting very high response times for our requests:

[screenshot]

The associated heatmap query is:

sum by (le)(rate(http_server_requests_seconds_bucket{namespace="$namespace", job="$job",method=~"$method",uri=~"$uri",status=~"$status",le=~"0.001|0.003|0.03|0.1|0.3|0.5|1.0|1.5|3.0|5.0|10.0|.Inf"}[$__rate_interval]))

@jonatan-ivanov jonatan-ivanov modified the milestones: 1.6.x, 1.7.x Nov 10, 2021
@shakuzen shakuzen modified the milestones: 1.7.x, 1.8.x May 11, 2022
@jonatan-ivanov jonatan-ivanov modified the milestones: 1.8.x, 1.9.x Jan 12, 2023
@marcingrzejszczak marcingrzejszczak removed this from the 1.9.x milestone Dec 21, 2023
@marcingrzejszczak marcingrzejszczak added the bug A general bug label Dec 21, 2023
@marthursson

> Hello, I noticed problems in two of our applications using StackdriverMeterRegistry from Micrometer 1.6.6. The errors look like this: com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Field timeSeries[6].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 1 which does not equal the |count| field value of 0.
>
> I believe it is related to the summary snapshot inconsistency. Maybe many systems do not verify it, but Stackdriver rejects such a metric.
>
> I can see that this issue has not been of high priority, since it has been open for 3 years; can we count on a higher priority?

Interestingly, running on Micrometer 1.10.x we are not troubled by this particular error (i.e. there are no complaints in the logs), but as soon as we upgrade to 1.11.x or 1.12.x we see warnings similar to this one in our logs.

@gmucha

gmucha commented Feb 6, 2024

We're seeing the same problem with Micrometer 1.11 - not sure if consistency is a problem here, but we noticed that the |count| field is pretty much always 0 while the bucket counts aren't.

Trying to debug the code, I consistently see the count in the underlying StepTimer being 0 when the summary is calculated. A potential workaround would be to disregard the count when creating the distribution for Stackdriver (although, looking at the issue, this may not be related just to Stackdriver).

@marthursson

> We're seeing the same problem with Micrometer 1.11 - not sure if consistency is a problem here, but we noticed that the |count| field is pretty much always 0 while the bucket counts aren't.
>
> Trying to debug the code, I consistently see the count in the underlying StepTimer being 0 when the summary is calculated. A potential workaround would be to disregard the count when creating the distribution for Stackdriver (although, looking at the issue, this may not be related just to Stackdriver).

Interestingly, from a quick look at the code, there is a lot of work involved in calculating the bucket counts. I may be missing some important aspect here (I've basically just looked at this particular function), but since the individual counts are supposed (by the API) to sum to the total count, it should be possible to just calculate the total count in the same function rather than accepting it as a parameter.
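
Something along these lines is what I have in mind; a sketch only (it omits the bucket options and is not the actual StackdriverMeterRegistry code):

import com.google.api.Distribution;
import java.util.List;

// Build the Distribution from the per-bucket counts and derive |count| from them,
// so the exported point can never fail Stackdriver's |bucket_counts| vs |count| check.
static Distribution distributionFrom(List<Long> perBucketCounts, double mean) {
    long count = perBucketCounts.stream().mapToLong(Long::longValue).sum();
    return Distribution.newBuilder()
            .setCount(count)                  // derived from the buckets, not snapshot.count()
            .setMean(mean)
            .addAllBucketCounts(perBucketCounts)
            .build();
}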

@altunkan

We have been observing the same issue ever since we migrated to Spring Boot 3.2.2 (Micrometer 1.12).

com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: One or more TimeSeries could not be written: Field timeSeries[31].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 4 which does not equal the |count| field value of 2.

Any update on this issue?

@rafal-dudek
Contributor

Hello,

More and more applications in our company are having problems with this. For example, with Micrometer 1.12.0:

failed to send metrics to Stackdriver
com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: One or more TimeSeries could not be written: Field timeSeries[82].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 2 which does not equal the |count| field value of 0.;
 Field timeSeries[97].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 2 which does not equal the |count| field value of 0.;
 Field timeSeries[39].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 1 which does not equal the |count| field value of 0.;
 Field timeSeries[5].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 2 which does not equal the |count| field value of 0.;
 Field timeSeries[131].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 1 which does not equal the |count| field value of 0.;
 Field timeSeries[162].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 2 which does not equal the |count| field value of 0.;
 Field timeSeries[196].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 2 which does not equal the |count| field value of 0.
