Metric Timestamps #77
edit: Not the actual real-time stats interval, though — see below. |
Actually, just for my understanding... The real-time API takes a Unix timestamp as input, and returns after an arbitrary period of time with, essentially, deltas of all the stats. So you could give ts=100 and get back ts=105 with 5s worth of data, or get back ts=113 with 13s of data, etc. What timestamp would you apply? The time when the response was received? |
I'm thinking it should get the end timestamp of the delta accumulation. |
Yes. My thinking is that it should, because there is some jitter both in the timing of each real-time poll and in when Prometheus performs the scrape. |
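(For illustration, a minimal sketch of the poll-loop semantics described above, using a hypothetical fetchRealtime helper rather than the exporter's actual client; the key point is that the Timestamp returned with each response becomes both the cursor for the next request and the end timestamp of the delta it accompanies.)
// Hypothetical sketch only: fetchRealtime and realtimeResponse stand in for
// the exporter's real client and the rt.fastly.com payload.
package main

import (
	"fmt"
	"time"
)

// realtimeResponse models the relevant parts of a real-time API response: a
// batch of per-second deltas plus the timestamp up to which they were
// accumulated.
type realtimeResponse struct {
	Timestamp uint64
	Deltas    []float64
}

// fetchRealtime blocks until the API has data newer than ts; the wait and the
// amount of accumulated data are both variable.
func fetchRealtime(ts uint64) realtimeResponse {
	time.Sleep(time.Second) // stand-in for the long poll
	return realtimeResponse{Timestamp: ts + 1, Deltas: []float64{1}}
}

func main() {
	cursor := uint64(time.Now().Unix())
	for i := 0; i < 3; i++ {
		resp := fetchRealtime(cursor)
		// The proposal above: stamp the accumulated delta with the end of the
		// window it covers, not with the local receive (or scrape) time.
		endOfDelta := time.Unix(int64(resp.Timestamp), 0).UTC()
		fmt.Println("apply timestamp:", endOfDelta, "deltas:", len(resp.Deltas))
		cursor = resp.Timestamp
	}
}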
I dug around in the docs for about 15m and couldn't figure out how to do this. Given this is how metrics are updated, how should I apply a specific timestamp? |
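(For reference, one way this can be done with client_golang is a custom Collector that emits const metrics wrapped in prometheus.NewMetricWithTimestamp; the usual Counter.Add path has no way to attach a timestamp. The names below are illustrative, not the exporter's actual code.)
// Illustrative sketch, not the exporter's implementation.
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

type realtimeCollector struct {
	requestsDesc *prometheus.Desc

	// Updated by the real-time poll loop: the accumulated value and the end
	// timestamp of the delta it covers. (A real implementation needs locking.)
	value float64
	ts    time.Time
}

func (c *realtimeCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.requestsDesc
}

func (c *realtimeCollector) Collect(ch chan<- prometheus.Metric) {
	m := prometheus.MustNewConstMetric(c.requestsDesc, prometheus.CounterValue, c.value)
	// Wrap the metric so the exposition carries the delta's end timestamp
	// instead of defaulting to the scrape time.
	ch <- prometheus.NewMetricWithTimestamp(c.ts, m)
}

func main() {
	c := &realtimeCollector{
		requestsDesc: prometheus.NewDesc("rt_requests_total", "Requests reported by the real-time API.", nil, nil),
		value:        42,
		ts:           time.Now(),
	}
	prometheus.MustRegister(c)
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}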
I'm going to continue to do some debugging on my end. I'll get back to you on what I find out. |
Yup, exactly as I thought. The time to pull from the API is highly variable. This means that sometimes data arrives slightly on one side or the other of a scrape. Test patch:
diff --git a/pkg/rt/subscriber.go b/pkg/rt/subscriber.go
index a0ec01d..8551018 100644
--- a/pkg/rt/subscriber.go
+++ b/pkg/rt/subscriber.go
@@ -131,6 +131,7 @@ func (s *Subscriber) Run(ctx context.Context) error {
// Recoverable errors are logged internally and not returned, so any non-nil
// error returned by this method should be considered fatal to the subscriber.
func (s *Subscriber) query(ctx context.Context, ts uint64) (currentName string, result apiResult, delay time.Duration, newts uint64, fatal error) {
+ start := time.Now()
name, ver, found := s.provider.Metadata(s.serviceID)
version := strconv.Itoa(ver)
if !found {
@@ -169,7 +170,7 @@ func (s *Subscriber) query(ctx context.Context, ts uint64) (currentName string,
switch resp.StatusCode {
case http.StatusOK:
- level.Debug(s.logger).Log("status_code", resp.StatusCode, "response_ts", response.Timestamp, "err", apiErr)
+ level.Debug(s.logger).Log("status_code", resp.StatusCode, "response_ts", response.Timestamp, "err", apiErr, "took", time.Since(start).Seconds())
if strings.Contains(apiErr, "No data available") {
result = apiResultNoData
} else {
Results:
|
I guess I don't really understand how timestamps will help here. Are you saying to use the response_ts as the timestamp? Are you seeing that as substantially different from the local time? edit: I ran it for a bit while logging the diff between the response_ts and the local timestamp when the response was received. It varied, but only up to about 1100ms, which is exactly what you'd expect given the response_ts only has 1s granularity. It also seems to line up with your data. Given the API "refresh interval" is about 1s and the Prometheus scrape interval is 15s, I can't see how the dips and peaks in the graph are a function of any kind of data skew. Can you explain like I'm 5? |
The issue is much more visible when you have a large amount of traffic and you're looking at the extrapolated rate between two scrapes. Any small amount of missing accumulation is greatly exaggerated by the extrapolation. Think about it this way: if the update interval is 1s and the scrape interval is 15s, an individual scrape could be missing 1/15th of the data. This will lead to the rate being off by -6.7%. The next scrape includes the missing data, so it's now +6.7%. By including the timestamp, we align the data with the actual time window it represents. Currently the samples are stamped at scrape time, which is offset from the end timestamp of the data by the random lag in the API. |
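(To make the arithmetic above concrete, a tiny worked example with made-up numbers: 1,500 req/s of true traffic and a 15s scrape window.)
package main

import "fmt"

func main() {
	const trueRate = 1500.0               // req/s actually served (illustrative)
	const scrapeInterval = 15.0           // seconds between Prometheus scrapes
	expected := trueRate * scrapeInterval // 22,500 requests per scrape window

	// One 1s update (1,500 requests) lands just after a scrape: that window
	// comes up short and the next window is correspondingly long.
	short := expected - trueRate // 21,000
	long := expected + trueRate  // 24,000

	fmt.Printf("short window rate: %.0f req/s (%+.1f%%)\n", short/scrapeInterval, (short/expected-1)*100)
	fmt.Printf("long window rate:  %.0f req/s (%+.1f%%)\n", long/scrapeInterval, (long/expected-1)*100)
	// Prints roughly -6.7% then +6.7%, matching the 1/15th figure above.
}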
Ah, but the returned data always lags the truth to at least some degree. At the moment, by several seconds! See the AggregateDelay parameter described here, which we don't consider at all right now. That could be a feature request, but it's of arguable benefit, because there's always going to be the possibility of "data drift" where some increments arrive late to the system serving rt.fastly.com itself. So I don't think computing "the extrapolated rate between two scrapes" will ever be precise in the way you're assuming/hoping it is, just by the nature of the system. Manipulating data timestamps also introduces risks related to those timestamps being incorrect, for whatever reason. That would require safeguards, and the semantics of those safeguards in the presence of unsound input data are non-obvious. Can of worms.... |
Lagging behind the truth is fine; what's important is that the timestamp of the truth is reasonable. We do this for importing data from other monitoring systems, for example AWS CloudWatch. |
I'm sure I'm being dense here, but I still don't understand how a difference of less than a second (which is the granularity of the data itself) is meaningful enough to warrant correction. The delta never exceeds 1s, right? Would you not see exactly the same squiggles in the graph if we did the correction you're suggesting, but Prometheus happened to scrape at something other than precisely the .000 of the scrape interval? I'm honestly not trying to be difficult; I just don't understand how this would help. Or, maybe equivalently: I don't understand why you think the extrapolated rate between two scrapes should be as precise as you're expecting. |
The problem is that the API is crazy jittery, and aligns perfectly with second boundaries. So the delay wobbles the results before and after scrape alignment. It's fine if there is enough offset between the Prometheus scrape and the width of the API wobble. But when they're closely aligned, the data becomes excessively noisy. I will work on a patch to see if I can demonstrate this. |
In my tests it has a delta of like +/- 1s. Is that a lot? And any exporter which is a proxy to an API is surely going to have this issue, right? My intuition is that it would typically be much more "jitter" than 1s -- am I off base? Do all exporters like this need to provide timestamps? |
Since it's basically polling another monitoring system, yes. Another way we could handle this is to convert the exporter into a remote write streaming server: rather than scrape -> scrape, it would be scrape -> stream.
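(A hedged sketch of the scrape -> stream idea: push each accumulated delta, with its own timestamp, to a Prometheus server that accepts remote write. The URL, metric name, and label values are illustrative, and error handling is trimmed.)
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

// pushSample sends a single sample via the Prometheus remote write protocol.
func pushSample(url, serviceID string, value float64, ts time.Time) error {
	req := &prompb.WriteRequest{
		Timeseries: []prompb.TimeSeries{{
			Labels: []prompb.Label{
				{Name: "__name__", Value: "fastly_rt_requests_total"},
				{Name: "service_id", Value: serviceID},
			},
			// Remote write carries an explicit millisecond timestamp per
			// sample, so the delta's end time travels with the data.
			Samples: []prompb.Sample{{Value: value, Timestamp: ts.UnixMilli()}},
		}},
	}

	raw, err := proto.Marshal(req)
	if err != nil {
		return err
	}
	body := snappy.Encode(nil, raw)

	httpReq, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	httpReq.Header.Set("Content-Encoding", "snappy")
	httpReq.Header.Set("Content-Type", "application/x-protobuf")
	httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")

	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("remote write failed: %s", resp.Status)
	}
	return nil
}

func main() {
	// Illustrative only: the receiver URL and service ID are stand-ins.
	_ = pushSample("http://prometheus:9090/api/v1/write", "example-service-id", 42, time.Now())
}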
While it's normally not best practice to expose timestamps, I think this exporter may need them.
I have been seeing issues with update vs scrape time alignment.
This may also be an artifact of the latency of ingesting updates from the real-time API. If the update arrives just slightly after the scrape, that scrape will be missing some data, but the next scrape will catch up to the current value.
This causes artifacts in the graphs.