Timer/distribution summary snapshot values not self-consistent #998
Comments
For us this is not a rare case; it happens more or less all the time, so at the moment this is a blocker for us. Are there any thoughts about if and how this could be solved? It would be great if you could point me in the right direction.
Without thinking too much about it, it seems like in the presence of histograms you could just take the histogram sum as the count. Neither value is really more correct.
Thanks, so there is no plan or proposal for how to solve this for "real"? My guess was that this was a concurrency issue. I could look into it; I was just hoping to get some initial intro to the problem. What we are considering is that there are two scenarios, one of which is OK. So would a workaround like this be good enough?
I think we could always rely on the count of the +Inf bucket to be the count when histograms are being recorded. They can only diverge now because the count is maintained independently of the histogram counts, so if a snapshot is taken between when the count is updated and when the histogram is updated, they can appear inconsistent. Neither value is "more correct" in absolute terms, but consistency within a snapshot is preferable.
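A minimal sketch of the workaround suggested above, for a publisher that post-processes snapshots: when histogram buckets are present, take the cumulative count of the last (largest, or +Inf when one is published) bucket as the count instead of snapshot.count(). HistogramSnapshot and CountAtBucket are Micrometer's public types; deriveCount is a hypothetical helper, not something Micrometer provides.

```java
import io.micrometer.core.instrument.distribution.CountAtBucket;
import io.micrometer.core.instrument.distribution.HistogramSnapshot;

final class SnapshotCounts {

    private SnapshotCounts() {
    }

    /**
     * Returns a count that is consistent with the histogram buckets: when buckets
     * are published, the cumulative count of the last (largest) bucket is used
     * instead of snapshot.count(), since the two are not captured atomically.
     */
    static long deriveCount(HistogramSnapshot snapshot) {
        CountAtBucket[] buckets = snapshot.histogramCounts();
        if (buckets.length == 0) {
            return snapshot.count();
        }
        // Bucket counts are cumulative, so the last bucket carries the total.
        return (long) buckets[buckets.length - 1].count();
    }
}
```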
Hi, I just have to leave a comment on this again since I really do not understand what happens. This is the result of a Timer being invoked with record() every 100 ms, with steps=2s (I have tried different intervals, same result). Everything just runs in a while-true loop. The result shows a big difference between histogram.count and the histogram cumulative count. Is this really just "an atomic capture" issue? Anyway, if we are not able to get this to work properly I do not really know what to do. This is not a rare situation; it happens all the time. It is not really useful to use count=13 when the reality is 20.

At the start, the histogram max count jumps between 20 and 13 and then stabilizes around 13-14:

```
Timer.cnt=20, HistogramSnapshot{count=20, total=4.667E10, mean=2.3335E9, max=4.866E9, percentileValues=[(2.277507072E9 at 50.0%), (2.948595712E9 at 75.0%), (4.559208448E9 at 95.0%), (4.827643904E9 at 99.0%)], histogramCounts=[(4.0 at 1000000000), (13.0 at 5000000000), (13.0 at 7000000000)]}
Timer.cnt=20, HistogramSnapshot{count=20, total=4.667E10, mean=2.3335E9, max=4.866E9, percentileValues=[(2.07618048E9 at 50.0%), (2.545942528E9 at 75.0%), (4.290772992E9 at 95.0%), (4.290772992E9 at 99.0%)], histogramCounts=[(4.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}
Timer.count=19, HistogramSnapshot{count=19, total=5.8642E10, mean=3.086421052631579E9, max=4.772E9, percentileValues=[(2.948595712E9 at 50.0%), (4.559208448E9 at 75.0%), (4.827643904E9 at 95.0%), (4.827643904E9 at 99.0%)], histogramCounts=[(1.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}
Timer.cnt=19, HistogramSnapshot{count=19, total=4.2482E10, mean=2.2358947368421054E9, max=4.606E9, percentileValues=[(2.277507072E9 at 50.0%), (3.08281344E9 at 75.0%), (4.290772992E9 at 95.0%), (4.827643904E9 at 99.0%)], histogramCounts=[(3.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}
Timer.cnt=20, HistogramSnapshot{count=20, total=5.2304E10, mean=2.6152E9, max=4.926E9, percentileValues=[(2.948595712E9 at 50.0%), (3.619684352E9 at 75.0%), (4.156555264E9 at 95.0%), (5.09607936E9 at 99.0%)], histogramCounts=[(1.0 at 1000000000), (13.0 at 5000000000), (13.0 at 7000000000)]}
Timer.cnt=19, HistogramSnapshot{count=19, total=4.3376E10, mean=2.2829473684210525E9, max=4.933E9, percentileValues=[(2.950692864E9 at 50.0%), (3.621781504E9 at 75.0%), (4.5613056E9 at 95.0%), (5.098176512E9 at 99.0%)], histogramCounts=[(4.0 at 1000000000), (13.0 at 5000000000), (13.0 at 7000000000)]}
```

However, one time I forgot to stop the program, and when I came back I noticed that the histogram max count was slowly but constantly increasing and reached the following, which looks promising; it almost reaches the total count and is 19 or 20:

```
Timer.cnt=20, HistogramSnapshot{count=20, total=4.3065E10, mean=2.15325E9, max=4.673E9, percentileValues=[(2.415886336E9 at 50.0%), (3.355410432E9 at 75.0%), (4.1607168E9 at 95.0%), (4.83180544E9 at 99.0%)], histogramCounts=[(5.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}
Timer.cnt=20, HistogramSnapshot{count=20, total=4.4401E10, mean=2.22005E9, max=4.943E9, percentileValues=[(1.946124288E9 at 50.0%), (4.026499072E9 at 75.0%), (4.83180544E9 at 95.0%), (5.100240896E9 at 99.0%)], histogramCounts=[(8.0 at 1000000000), (20.0 at 5000000000), (20.0 at 7000000000)]}
Timer.cnt=20, HistogramSnapshot{count=20, total=5.1766E10, mean=2.5883E9, max=4.94E9, percentileValues=[(2.550104064E9 at 50.0%), (3.48962816E9 at 75.0%), (5.100240896E9 at 95.0%), (5.100240896E9 at 99.0%)], histogramCounts=[(4.0 at 1000000000), (20.0 at 5000000000), (20.0 at 7000000000)]}
Timer.cnt=19, HistogramSnapshot{count=19, total=4.4872E10, mean=2.3616842105263157E9, max=4.977E9, percentileValues=[(2.013233152E9 at 50.0%), (4.026499072E9 at 75.0%), (4.83180544E9 at 95.0%), (5.100240896E9 at 99.0%)], histogramCounts=[(5.0 at 1000000000), (19.0 at 5000000000), (19.0 at 7000000000)]}
```

And then all of a sudden it dropped to 13, 14 again.
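A minimal reproduction sketch along the lines of the setup described above (record() roughly every 100 ms, a 2 s step, SLO buckets at 1 s / 5 s / 7 s to match the bucket boundaries in the logs). The step-mode SimpleMeterRegistry configuration is an assumption made to keep the example self-contained; the original test may have used a different registry.

```java
import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.distribution.HistogramSnapshot;
import io.micrometer.core.instrument.simple.CountingMode;
import io.micrometer.core.instrument.simple.SimpleConfig;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class SnapshotConsistencyRepro {

    public static void main(String[] args) throws InterruptedException {
        // Step-based registry with a 2 s step, mirroring the "steps=2s" setup above.
        SimpleConfig config = new SimpleConfig() {
            @Override public String get(String key) { return null; }
            @Override public Duration step() { return Duration.ofSeconds(2); }
            @Override public CountingMode mode() { return CountingMode.STEP; }
        };
        SimpleMeterRegistry registry = new SimpleMeterRegistry(config, Clock.SYSTEM);

        Timer timer = Timer.builder("demo.timer")
                .publishPercentiles(0.5, 0.75, 0.95, 0.99)
                .serviceLevelObjectives(Duration.ofSeconds(1), Duration.ofSeconds(5), Duration.ofSeconds(7))
                .register(registry);

        long lastPrint = System.currentTimeMillis();
        while (true) {
            // Record a random latency between 0 and 5 s roughly every 100 ms.
            timer.record(ThreadLocalRandom.current().nextLong(5_000_000_000L), TimeUnit.NANOSECONDS);
            Thread.sleep(100);

            if (System.currentTimeMillis() - lastPrint >= 2_000) {
                lastPrint = System.currentTimeMillis();
                HistogramSnapshot snapshot = timer.takeSnapshot();
                // count() and the cumulative bucket counts come from different
                // accumulators, so they can disagree within a single snapshot.
                System.out.println("Timer.count=" + timer.count() + ", snapshot=" + snapshot);
            }
        }
    }
}
```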
@jkschneider Hi again, it really seems like timing is everything here. I changed this gist https://gist.github.com/jrask/52c70475ac744af56f88e43fab80a7c9 and just added a short sleep before handling each meter (I only have one). This changed the output from proper results to a minor difference. In my code, since I do things like custom JSON generation, I can imagine that 20 ms can pass from the time publish() is invoked until I use it. (Note: this is not related to my comment above, that is another test case; this is "raw" Micrometer.) Increasing the sleep does not seem to affect much. Anyway, perhaps you already knew this, but for me this proves that it is broken. Not really sure what to do next, or whether this is solvable. It is "too" wrong to be useful; I cannot rely on milliseconds to be on my side.

With no sleep the results are correct. With sleep(20), we lose around 10 in the count across the histogram bucket totals.
Due to a lack of guarantee that histogram snapshots are self-consistent (see #998), the count value could be lower than the sum of the bucket counts. Before this fix the infinity bucket count would have a negative value, which Stackdriver rejects with an error. This uses zero (0) instead in this situation. This does not fix the potential for inconsistent values in histogram snapshots. Resolves #1325
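A rough sketch of the guard this fix describes, assuming the overflow (+Inf) bucket is computed as the snapshot count minus everything below it and then clamped at zero; this is illustrative only, not the actual StackdriverMeterRegistry code.

```java
import io.micrometer.core.instrument.distribution.CountAtBucket;
import io.micrometer.core.instrument.distribution.HistogramSnapshot;

final class OverflowBucket {

    private OverflowBucket() {
    }

    /** Count to report for the +Inf bucket, clamped so it can never be negative. */
    static long infinityBucketCount(HistogramSnapshot snapshot) {
        // Assumes histogramCounts() holds only the configured finite buckets,
        // whose counts are cumulative, so the last one already sums everything below it.
        CountAtBucket[] buckets = snapshot.histogramCounts();
        long finiteTotal = buckets.length == 0 ? 0L : (long) buckets[buckets.length - 1].count();
        // If the snapshot is internally inconsistent, count() may be smaller than
        // finiteTotal; report zero instead of a negative value the backend would reject.
        return Math.max(0L, snapshot.count() - finiteTotal);
    }
}
```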
Also, I've played with heatmap visualizations for the latency and realized that the same thing happens with percentiles as well as with bucket values.
Hello, I believe this is related to the summary snapshot inconsistency. I can see that this issue is not of high priority since it has been open for 3 years; can we count on a higher priority?
Hello, the associated heatmap query is:
Interestingly, running on Micrometer 1.10.x we are not troubled by this particular error (i.e. there are no complaints in the logs), but as soon as we upgrade to 1.11.x or 1.12.x we see warnings similar to this one in our logs.
We're seeing the same problem with Micrometer 1.11. Not sure if consistency is the problem here, but we noticed that the |count| field is pretty much always 0 while the bucket counts aren't. Trying to debug the code, I'm consistently seeing the count in the underlying …
Interestingly, quickly looking at the code, there is a lot of work involved in calculating the counts of the buckets. I may be missing some important aspect here (I've basically only looked at this particular function), but since the individual counts are supposed (by the API) to sum up to the total count, it should be possible to just calculate the total count in the same function rather than accepting it as a parameter.
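A hedged sketch of what this comment suggests: derive the total from the per-bucket counts that the same function already produces, instead of taking the total as a separate parameter. totalFromBuckets is a hypothetical helper, and the non-cumulative per-bucket representation is an assumption about the registry code, not its actual signature.

```java
final class BucketTotals {

    private BucketTotals() {
    }

    /**
     * Sums per-bucket (non-cumulative) counts to produce the total, so the reported
     * count and the reported buckets can never drift apart in the published payload.
     */
    static long totalFromBuckets(double[] perBucketCounts) {
        long total = 0;
        for (double count : perBucketCounts) {
            total += (long) count;
        }
        return total;
    }
}
```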
We have been observing the same issue ever since we migrated to Spring Boot 3.2.2 (Micrometer 1.12).
Any update on this issue?
Hello, more and more applications in our company are having problems with this.
While the name might suggest otherwise, the timer/distribution summary takeSnapshot() implementations do not atomically capture statistics, so count(), max(), and histogramCounts() are not necessarily self-consistent. This can lead to situations where the total counts of the histogram buckets may not equal snapshot.count(), or a downstream system using the metrics might compute incorrect rates (total / count) because the time periods they measure are effectively different.
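To make the incorrect-rates point concrete, here is a small worked example using numbers quoted earlier in the thread (the arithmetic is illustrative only): the total reflects 20 recordings, but a bucket-derived count of 13 inflates the computed mean by roughly 50%.

```java
public class RateSkewExample {

    public static void main(String[] args) {
        // Values taken from one of the snapshots quoted earlier in the thread.
        double totalNanos = 4.667e10; // sum of all recorded durations
        long snapshotCount = 20;      // count() captured in the snapshot
        long bucketCount = 13;        // cumulative count of the largest histogram bucket

        // A backend computing mean = total / count gets different answers
        // depending on which count it happens to read.
        double meanFromSnapshotCount = totalNanos / snapshotCount; // ~2.33e9 ns
        double meanFromBucketCount = totalNanos / bucketCount;     // ~3.59e9 ns, ~54% higher

        System.out.printf("mean from count(): %.3e ns, mean from buckets: %.3e ns%n",
                meanFromSnapshotCount, meanFromBucketCount);
    }
}
```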