Conversation

@fionaliao (Contributor) commented Mar 27, 2025

@fionaliao (Contributor Author):

I'm planning to make some updates to the proposal after the dev summit (the consensus was that we want to support push-based use cases and continue to explore delta support) and some additional discussion with @beorn7.

Based on these discussions, I also have a PR up for basic native delta support (behind a feature flag) in Prometheus: prometheus/prometheus#16360. It just stores the sample value at TimeUnixNano without additional labels; for querying, we advise people to use sum_over_time() (divided by the interval) for now. The idea is to start with this simple case, without changing any PromQL functions, and gather feedback that could help us figure out how to proceed with temporality-aware functions. I'll update the proposal with this.

Additional updates to make:

  • Flesh out the pros/cons of a __temporality__ label vs delta_ types (I am actually leaning more towards the temporality label now)
  • Add an example of query interval < collection interval, which could mess up rate calculations
  • Add some material about the serverless/ephemeral jobs use case. This is not specific to OTEL per se, but it was the use case that kept coming up when talking about deltas with various people/users during KubeCon.

I'm out for the next week, but will apply the updates after that

@fionaliao (Contributor Author):

Updates:

  • Simplified proposal - moved CT-per-sample to a possible future extension instead of embedding it within the proposal
  • Changed the proposal to have a new __temporality__ label instead of extending __type__ - probably better to keep the metric type concept distinct from metric temporality. This also aligns with how OTEL models it.
  • Updated the remote-write section - delta ingestion will actually be fully supported via remote write (since CT-per-sample is moved out of the main proposal for now)
  • Moved the temporary delta_rate() and delta_increase() functions suggestion to discarded alternatives - not sure this is actually necessary if we have a feature flag for temporality-aware functions anyway
  • Fleshed out implementation plan


#### rate() calculation

In general: `sum of second to last sample values / (last sample ts - first sample ts) * range`. We skip the value of the first sample as we do not know its interval.
@enisoc commented Apr 17, 2025:

We skip the value of the first sample as we do not know its interval.

Perhaps we could get utility out of the first sample's value by guessing that each sample's interval is equal to the average time between samples in the window. One motivation for this is a case that we see often with non-sparse deltas produced by an intermediate processor.

Suppose the actual instrumented app is sending delta samples at a regular 60s interval. We'll assume for simplicity that these deltas are always greater than zero. Then there is an intermediate processor that's configured to accumulate data and flush every 30s. To avoid sparseness, it's configured to flush a 0 value if nothing was seen.

The data stream will look like this, with a sample every 30s:

5 0 2 0 10 0 8 0

Note that every other value is 0 because of the mismatch between the flush intervals of the instrumented app and the intermediate processor.

If we then do a rate(...[1m]) on this timeseries, with the current proposal, we might end up with the 1m windows broken up like this:

5 0 | 2 0 | 10 0 | 8 0

If we can't make use of the value from the first sample in each window, we will end up computing a rate of 0 for all of these windows. That feels like it fails to make use of all the available information, since as humans we can clearly see that the rate was not zero.

If instead we guess that each sample represents the delta for a 30s interval, because that's the average distance between the two datapoints in the window, then we will compute the correct rates. Of course it was only a guess and you could contrive a scenario that would fool the guess, but the idea would be to assume that the kind of scenario described here is more likely than those weird ones.
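
To make the arithmetic concrete, here is a rough Go sketch (made-up timestamps, not the actual PromQL implementation) comparing the two approaches over one of those 1m windows containing the samples 5 and 0:

```go
package main

import "fmt"

func main() {
	// One 1m window from the example stream above: two delta samples 30s apart.
	values := []float64{5, 0}
	timestamps := []float64{0, 30} // seconds, made up for illustration

	// Proposal as written: skip the first sample's value because its interval
	// is unknown, and normalise by (last ts - first ts).
	sumSkippingFirst := 0.0
	for _, v := range values[1:] {
		sumSkippingFirst += v
	}
	rateSkippingFirst := sumSkippingFirst / (timestamps[1] - timestamps[0])
	fmt.Println("rate skipping first sample:", rateSkippingFirst) // 0

	// Suggested alternative: guess that each sample covers the average spacing
	// between samples in the window (30s here), so the first value counts too.
	sumAll := values[0] + values[1]
	avgInterval := (timestamps[1] - timestamps[0]) / float64(len(values)-1)
	rateWithGuess := sumAll / (float64(len(values)) * avgInterval)
	fmt.Println("rate with avg-interval guess:", rateWithGuess) // 5/60 ≈ 0.083/s
}
```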

@fionaliao (Contributor Author):

A cumulative series could generate the same values (with the zeroes being where the series resets). And in that case rate() would return no results. So though this doesn't accurately capture the rate, the behaviour would be consistent for cumulative and delta metrics.

However, the first point in a window is more meaningful in the delta case - you know it's a delta from the preceding sample, while in the cumulative case you have to look outside the window to get the same information, so maybe we should do better because of that. That example leans me more towards "just do sum_over_time() / range for delta rate()" - in this case that would probably give more useful information. Or at least do that before CT-per-sample is available, at which point we'd have more accurate interval data.

Member:

I guess the root of evil here is that we are essentially reconstructing the "problem" of the current rate calculation, which is that we do not take into account samples from outside the range (but still extrapolates the calculated rate to match the range). I have made arguments why this is actually a good thing in the case of the classic rate calculation, and those arguments partially carry over to the delta case. But not entirely. If we had the accurate interval data, we could reason about how far outside of the range the seen increments are. We could weigh them (but then we should probably also take into account the delta sample "from the future", i.e. after the "right" end of the range), or we could accept if the interval is small enough.
Given that we do not want to take into account the collection interval in this first iteration, we could argue that a delta sample usually implies that the increments it represents are "recent", so we could actually take into account all delta samples in the range. This would ensure "complete coverage" if we graph something with 1m spacing of points and a [1m] range for the rate calculation. That's essentially what "xrate" does for classic rate calculation, but with the tweak that it is unlikely to include increments from the "distant past" because delta samples are supposed to represent "recent" increments. (If you miss a few scrapes with a cumulative counter, you don't miss increments, but now the increment you see is over a multiple of the usual scrape interval, which an "xrate" like approach will happily count as still within the given range.)
From a high level perspective, I'm a bit concerned that we are throwing away one of the advantages that delta temporality has if we ignore the first sample in the range.

Member:

Another perspective on this subject (partially discussed with @fionaliao in person):

One reason to do all the extrapolation and estimation magic with the current rate calculation is that the Prometheus collection model deliberately gives you "unaligned" sampling, i.e. targets with the same scrape interval are still scraped at different phases (not all at the full minute, but hashed over the minute). PromQL has to deal with this in a suitable manner.

While delta samples may be unaligned as well, the usual use case is to collect increments over the collection interval (let's say again 1m), and then send out the collected increments at the full minute. So all samples are at the full minute. If we now run a query like rate(delta_request_counter[5m]), and we run this query at an aligned "full minute" timestamp, we get the perfect match: All the delta samples in the range perfectly cover the 5m range. The sample at the very left end of the range is excluded (thanks to the new "left open" behavior in Prometheus v3). So it would be a clear loss in this case to exclude the earliest sample in the range. (The caveat here is that you do not have to run the query at the full minute. In fact, if you create recording rules in Prometheus, the evaluation time is again deliberately hashed around the rule evaluation interval to avoid the "thundering herd". The latter could be avoided, though, if we accept delayed rule evaluation, i.e. evaluate in a staggered fashion, but use a timestamp "in the past" that matches the full minute.)

There is a use case where delta samples are not aligned at all, and that's the classic statsd use case where you send increments of one immediately upon each counter increment. However, in this case, the collection interval is effectively zero, and we should definitely not remove the earliest sample from the range.
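
A small sketch of that aligned case (made-up values; only meant to illustrate the left-open range selection, not the real engine):

```go
package main

import "fmt"

func main() {
	type sample struct{ ts, value float64 }

	// Hypothetical delta samples pushed exactly at full minutes (ts in seconds).
	samples := []sample{{0, 3}, {60, 7}, {120, 4}, {180, 6}, {240, 5}, {300, 8}}

	// Increase over a 5m range ending at t=300s with Prometheus v3's
	// "left open" selection: samples with ts in (0, 300] are included.
	rangeStart, rangeEnd := 0.0, 300.0
	increase := 0.0
	for _, s := range samples {
		if s.ts > rangeStart && s.ts <= rangeEnd {
			increase += s.value
		}
	}
	// The five in-range samples exactly cover the 5m range, so simply summing
	// them (including the earliest one at ts=60) gives the full increase.
	fmt.Println("increase over [5m]:", increase) // 7+4+6+5+8 = 30
}
```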


Downsides:

* This will not work if there is only a single sample in the range, which is more likely with delta metrics (due to sparseness, or being used in short-lived jobs).

For cumulative samples, it makes sense that with a single sample in the window you can't guess anything at all about how much increase happened in that window. With a single delta sample, even if we don't know the start time, we should be able to make a better guess than "no increase happened".

For example, we could guess that the interval is equal to the window size -- in other words return the single delta value as is with no extrapolation. The argument would be that you picked an arbitrary window of size T and found 1 sample, so the best guess for the frequency of the samples is 1/T. This seems like it would be more useful on average than returning no value in the case of a single sample.
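
A sketch of that single-sample guess, with made-up numbers:

```go
package main

import "fmt"

func main() {
	// A single delta sample with value 7 found in a 60s window (made up).
	value := 7.0
	windowSeconds := 60.0

	// Guess that the sample's interval equals the window size: return the
	// delta value as the increase for the window, with no extrapolation.
	increase := value
	rate := increase / windowSeconds

	fmt.Println("guessed increase:", increase) // 7
	fmt.Printf("guessed rate: %.3f/s\n", rate) // 0.117/s
}
```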

@fionaliao (Contributor Author):

My concern is with mixing extrapolation and non-extrapolation logic, because that might end up surprising users.

If we do decide to generally extrapolate to fill the whole window, but have this special case for a single datapoint, someone might rely on the non-extrapolation behaviour and be surprised when there are two points and the behaviour changes.

Member:

Yeah, another point why extrapolation (while not completely without merit) probably has another trade-off in the delta case and might just not be worth it.

@fionaliao force-pushed the fionaliao/delta-proposal branch from d0474da to f2433c8 on April 18, 2025 16:08
@fionaliao (Contributor Author):

Next steps for this proposal:

  • Wait for the type and unit metadata proposal to be finalised, which might result in updates to how exactly the temporality label will be implemented
  • Get the primitive OTEL delta support PR merged - hopefully having that out will help get some feedback on how querying should be done
  • Write some code to experiment with delta rate implementations, and see what edge cases there are for each option

@fionaliao (Contributor Author):

Write some code to experiment with delta rate implementations

Started rough implementation for rate functions here:

prometheus/prometheus@fionaliao/basic-delta-support...fionaliao/delta-rate

Including some example queries: https://github.com/prometheus/prometheus/blob/4c72cba2e76ac55c77c46af7b2b9348e8cf67b59/promql/promqltest/testdata/delta.test

@beorn7 (Member) left a comment:

Thanks for this design doc.

I realize that my comments are a bit all over the place, and often they discuss things that are already taken into account in this document, just maybe not in the order or emphasis I would prefer.

An attempt to summarize my thoughts and concerns:

I wholeheartedly agree with "Milestone 1". However, I do think we should let the experience gained from it inform our further steps. The design doc explains most of the possible approaches quite well, but it essentially already proposes a preferred solution along the following lines:

  1. Introduce a temporality label.
  2. Make rate/increase/irate behave differently depending on that label.
  3. Embrace an extrapolation approach in that rate/increase calculation.

I have concerns about each of these points. I wouldn't go as far as to say that they are prohibitive, but I would have trouble approving a design doc that frames them as the preferred way to go forward, while the alternatives that I find more likely to be viable are already "demoted" to "alternatives that we have dismissed for now".

My concerns summarized:

  1. I would only introduce a temporality label once we have established we need one. I would go for "treat deltas as gauges" until we hit a wall where we clearly see that this is not enough. In the form of the outcome of recording rules, Prometheus had "delta samples" from the beginning, and never did we consider marking them as such.
  2. I have a very bad feeling about "overloading" and have certain functions behave fundamentally different depending on the type of the argument (and that even more so as we are just establishing this strong typing of metrics as we go). (We kind-of do that already for some histogram functions, but there the difference in type is firmly established, plus it's not really fundamentally different what we are doing, we are doing "the same" on different representations of histograms (classic vs. native), plus we will just get rid of the "classic" part eventually.) Additionally, I don't think it makes sense to claim that we are calculating an "increase" based on something that is already an increase (a delta sample). The "rate'ing" is then just the normalization step, which is just one part of the "actual" rate calculation. Even though it might be called that way in other metrics system, I don't think that should inform Prometheus naming. I do understand the migration point, but I see it more as a "lure" into something that looks convenient at first glance but has the potential of causing trouble precisely because it is implicit (or "automagic"). What might convince me would be a handling of ranges that contain "mixed samples" (both cumulative and delta samples) because that would actually allow a seamless migration, but that would open up a whole different can of worms.
  3. Extrapolation caused a whole lot of confusion and controversy for the existing rate conversion. I believe that it was necessary, but I see a different trade-off for delta samples. Given that we have active work on non-extrapolation (anchored in the PoC) and "a different kind of interpolation" (smoothed in the PoC) for rate calculation, we should hold back introducing a questionable extrapolation mechanism in delta handling. With the tracking of CT (aka StartTimeUnixNano), we are well set up to do something like smoothed for deltas (which is something to be fleshed out maybe here or maybe in a separate design doc), and in many cases, the naive non-extrapolation approach might just be the best for deltas. (An "aligned" rule evaluation feature might be easier to implement and more helpful for the use case of aligned delta samples.)1

To summarize the summary: I would pivot this design doc more as a list of alternatives we have to explore, and only state the first step as "already decided", namely to ingest the delta samples "as is", which puts us into a position to explore the alternatives in practice.

Footnotes

  1. If you feel that aligned rule evaluation and "smoothed" increase calculation from deltas should be included in this doc, I'm willing to flesh them out in more detail.


For the initial implementation, reuse existing chunk encodings.

Currently the counter reset behaviour for cumulative native histograms is to cut a new chunk if a counter reset is detected. If a value in a bucket drops, that counts as a counter reset. As delta samples don’t build on top of each other, there could be many false counter resets detected, causing unnecessary chunks to be cut. Therefore a new counter reset hint/header is required, to indicate that the cumulative counter reset behaviour for chunk cutting should not apply.
Member:

There is more to that than just creating a new counter reset hint. Counter histogram chunks have the invariant that no (bucket or total) count ever goes down baked into their implementation (e.g. to store numbers more efficiently).

The histogram chunks storing delta histogram samples should use the current gauge histogram chunks. Whether we really need a different counter reset hint then (rather than just using the existing "gauge histogram" hint) is a more subtle question. (I still tend to just view deltas as gauges, but if we want to mark them differently, the counter reset hint could be one way. However, simple float samples do not have that way, so we need some other way to mark a sample as "delta" anyway. If we use the same way for histogram samples, then we can just keep using the "gauge histogram" counter reset hint combined with that new way to mark delta samples.)


No scraped metrics should have delta temporality, as there is no additional benefit over cumulative in this case. To produce delta samples from scrapes, the application being scraped has to keep track of when a scrape is done and reset the counter. If the scraped value fails to be written to storage, the application will not know about it and therefore cannot correctly calculate the delta for the next scrape.

Delta metrics will be filtered out from metrics being federated. If the current value of the delta series is exposed directly, data can be incorrectly collected if the ingestion interval is not the same as the scrape interval for the federate endpoint. The alternative is to convert the delta metric to a cumulative one, which has issues detailed above.
Member:

As delta temporality is essentially the same as the outcome of a rate(...) recording rule (provided the delta metric does not wildly change its collection interval), I wouldn't rule out federation completely. It is very common to federate the outcome of a rate(...) recording rule, so why not federate delta metrics in the same way?
If the delta metric has e.g. a constant collection interval of 1m, and we do a federation scrape at least as often (or better more often, like 15s), we can still work with the resulting federated metrics. Prerequisite is essentially a (mostly) constant and known collection interval.
In contrast, a delta metric that has samples at irregular intervals (most extreme: classic statsd approach with deltas of one whenever an event happens) would not work via federation.

@fionaliao (Contributor Author):

Updated the design doc to soften the approach - instead of explicitly blocking deltas from being federated, in documentation we can provide example configs to avoid scraping deltas if the user decides their deltas are not appropriate for scraping.


#### rate() calculation

In general: `sum of second to last sample values / (last sample ts - first sample ts) * range`. We skip the value of the first sample as we do not know its interval.
Member:

sum of second to last sample values / (last sample ts - first sample ts) * range

Technical note: This formula calculates the extrapolated increase. You have to leave out the * range to get the extrapolated rate:

sum of second to last sample values / (last sample ts - first sample ts)
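
For illustration, a minimal Go sketch of the difference, with made-up numbers (not the actual implementation):

```go
package main

import "fmt"

func main() {
	// Hypothetical delta samples in a 5m (300s) range.
	sumExceptFirst := 12.0         // sum of second to last sample values
	firstTS, lastTS := 30.0, 270.0 // timestamps of first and last sample (seconds)
	rangeSeconds := 300.0

	extrapolatedRate := sumExceptFirst / (lastTS - firstTS) // per-second rate
	extrapolatedIncrease := extrapolatedRate * rangeSeconds // rate scaled to the range

	fmt.Println("extrapolated rate:", extrapolatedRate)         // 12 / 240 = 0.05
	fmt.Println("extrapolated increase:", extrapolatedIncrease) // 0.05 * 300 = 15
}
```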


CT-per-sample is not a blocker for deltas - before this is ready, `StartTimeUnixNano` will just be ignored.

Having CT-per-sample can improve the `rate()` calculation - the ingestion interval for each sample will be directly available, rather than having to guess the interval based on gaps. It also means a single sample in the range can still produce a result from `rate()`, as the range will effectively have an additional point at `StartTimeUnixNano`.
Member:

A similar effect could be created from a separate and explicit tracking of counter resets (rather than relying on the detection via "value has gone down"). If we were able to mark every sample as having a counter reset, rate'ing a delta counter would implicitly give the "correct" result as described in this paragraph.

Or in other words: CT gives us tracking counter resets explicitly as a byproduct. And maybe it should just become the way. (NH can track counter resets explicitly, but need a new chunk for that. It would not be efficient if it happened on every sample. Counter resets could be tracked in metadata, but again, it would be expensive to track frequent counter resets that way.)

(This is more an academic comment to improve our collective understanding, not necessarily something to include in the design doc. Maybe just mention that CT allows precise counter-reset tracking so that the reader is made aware that those topics are related.)

#### Treat as gauge
To avoid introducing a new type, deltas could be represented as gauges instead and the start time ignored.

This could be confusing as gauges are usually used for sampled data (for example, in OTEL: "Gauges do not provide an aggregation semantic, instead “last sample value” is used when performing operations like temporal alignment or adjusting resolution.") rather than data that should be summed/rated over time.
Member:

I would like to note that gauges in Prometheus are in fact the metric type that is aggregatable. "First the rate, then aggregate!" Rate'ing a counter creates a gauge. The gauge is then what you can aggregate. Delta samples are already aggregatable. They are, for all Prometheus intents and purposes, gauges.

If we end up with a new metric type "delta-counter" that is treated in exactly the same way as gauges, then we are arguably creating a greater confusion than having a gauge in Prometheus that has a slightly different semantics from gauges in other metrics systems.

In other words, I think it is a good idea that each (Prometheus) metric type is actually handled differently within Prometheus. A type should not just be "informational".

Maybe there are cases where we want to treat "real" gauges differently from deltas, but that has to be seen.

@fionaliao (Contributor Author):

Added this point about the "informational" type to the proposal (note the __temporality__ option also has been moved to a future extension rather than something to do now). On a similar note, I added a downside to function overloading as well - functions working inconsistently depending on type and temporality. Both increase() and sum_over_time() could be used for aggregating deltas over time. However, sum_over_time() would not work for cumulative metrics, and increase() would not work for gauges. This also gets even more complicated if using recording rules on deltas.


This also does not work for samples missing StartTimeUnixNano.

#### Convert to rate on ingest
Member:

Just as a note, as the person who came up with this idea: I have come to the conclusion that this approach has a bad trade-off. Being able to store as much as possible of the original sample (the value, and ideally the CT aka StartTimeUnixNano) and then process that at query time is better than doing some calculation at ingest time and losing the original data.


`sum_over_time()` between T0 and T5 will get 10. Divided by 5 for the rate results in 2.

However, if you only query between T4 and T5, the rate would be 10/1 = 10, and queries between earlier times (T0-T1, T1-T2 etc.) will have a rate of zero. These results may be misleading.
Member:

But will the result be so much different with the rate approach described above? In fact, we won't get any rate with that because there is only one sample in the range.

I do think there is a way to estimate rates/increases if the samples do not align with the range boundaries and we have the StartTimeUnixNano aka CT. Then we could do some weighting according to the proportion of the increase that is expected to have happened inside the range (including for this particular example, where the range is just a fraction of the collection interval - we could say the collection interval is 5x the range, so we only take into account 1/5th of the increase). But this approach isn't described anywhere in this design doc (is it?). It would be similar to the upcoming "smoothed" rate modeling (aka "mrate" in my rate braindump). It would also be the key to a "proper" integral function, see prometheus/prometheus#14440 – to connect all the dots... ;)
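
A rough sketch of that weighting idea, assuming StartTimeUnixNano/CT is available (made-up numbers, single sample for simplicity):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Hypothetical delta sample: value 10, covering the collection interval
	// from its start time (CT) at T0 to its timestamp at T5.
	value := 10.0
	ct, ts := 0.0, 5.0

	// Query range covering only the last fifth of that interval: T4..T5.
	rangeStart, rangeEnd := 4.0, 5.0

	// Weight the delta by the fraction of its collection interval that
	// overlaps the query range.
	overlap := math.Min(ts, rangeEnd) - math.Max(ct, rangeStart)
	weightedIncrease := value * overlap / (ts - ct)
	rate := weightedIncrease / (rangeEnd - rangeStart)

	fmt.Println("weighted increase:", weightedIncrease) // 10 * 1/5 = 2
	fmt.Println("rate:", rate)                          // 2 / 1 = 2
}
```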

@fionaliao (Contributor Author):

@beorn7 Thanks for your comments :) I need more time to go through all of them, but as a start:

To summarize the summary: I would pivot this design doc more as a list of alternatives we have to explore, and only state the first step as "already decided", namely to ingest the delta samples "as is", which puts us into a position to explore the alternatives in practice.

It makes sense to separate the first step. Not a strong opinion, but I was thinking of making a separate proposal PR for the first step, and then putting this proposal back into draft to indicate it's still being figured out - that way we can have a merged PR for the first step, and there can be discussion on the possible future steps within this open PR. WDYT?


I would only introduce a temporality label once we have established we need one. I would go for "treat deltas as gauges" until we hit a wall where we clearly see that this is not enough.

I agree with doing this delta as gauge approach first. I do think eventually we will want to treat deltas separately from gauges, but we should get more user feedback to confirm this is the case.

Chronosphere have already gained insights into this, as they've implemented their own version of delta support, and @enisoc wrote up this document and noted: "Users don't like having to think about the temporality of their metrics and learn to use different functions (e.g. increase vs. sum_over_time). They want to have one recommended way to query counters that just works regardless of temporality.".

One problem with treating deltas as gauges is that gauge means different things in Prometheus and OTEL - in Prometheus, it's just a value that can go up and down, while in OTEL it's the "last-sampled event for a given time window". While it technically makes sense to represent an OTEL delta counter as a Prometheus gauge, this could be a point of confusion for OTEL users who see their counter being mapped to a Prometheus gauge, rather than a Prometheus counter. There could also be uncertainty for the user on whether the metric was accidentally instrumented as a gauge or whether it was converted from a delta counter to a gauge.

Another problem is that the temporality of a metric might not be completely under the control of the user instrumenting the metric - it could change in the metric ingestion pipeline (e.g. with the cumulativetodelta or deltatocumulative processors), so it can be hard to determine at query time what function to use. If we properly mark deltas as gauges - i.e. with the metric type gauge - and have warnings when using rate() on Prometheus gauges and sum_over_time() on Prometheus counters, this is alleviated. (However, alerts don't integrate with warnings so may end up being incorrect without detection).

We kind-of do that already for some histogram functions, but there the difference in type is firmly established

How is type being firmly established in the native histogram case vs not being firmly established in the delta and cumulative case if there's a "temporality" label?

Additionally, I don't think it makes sense to claim that we are calculating an "increase" based on something that is already an increase (a delta sample).

increase() could be considered the increase in the underlying thing being measured, which makes it reasonable to apply increase() to a delta metric.

Deltas could also be seen as cumulative counters with resets between each sample. (On the other hand, as discussed, delta metrics have different characteristics, so while they could be seen as cumulative or converted to cumulative, that might not be the best representation.)

I do understand the migration point, but I see it more as a "lure" into something that looks convenient at first glance but has the potential of causing trouble precisely because it is implicit (or "automagic"). What might convince me would be a handling of ranges that contain "mixed samples" (both cumulative and delta samples) because that would actually allow a seamless migration, but that would open up a whole different can of worms.

As well as one-off migrations where you might just have to update queries once, a case which might cause more constant frustration is when there is a mix of sources with different temporalities. So a single series might have the same temporality over time, but different series have different temporalities. If you want a single query to combine the results and we didn't do function overloading, you'd need something like rate(cumulative metrics only) + sum_over_time(delta metrics only). (Is this what you were referring to when you said mixed samples, or did you just mean the case where a single series had different temporality over time?)

@beorn7 (Member) commented Apr 29, 2025:

I was thinking of making a separate proposal PR for the first step, and then put this proposal back into draft to indicate it's still being figured out

I don't think it would help with clarity to have multiple design docs. Personally, I don't believe a design doc has to make a call all the way through. I would think it's fine if a design doc says "We want to do this first step, and then we want to do research and want to make a call between options X, Y, and Z based on these following rationales."

About the general argument about "overloading" increase and rate for delta temporality: I think the arguments are already well made in the design doc. I'm just not sure we can make a call right now without practical experience. We can repeat and refine both sides of the argument, but I think it will be much easier and much more convincing once we have played with it.

How is type being firmly established in the native histogram case vs not being firmly established in the delta and cumulative case if there's a "temporality" label?

First of all, that label does not exist yet. So currently, it is not established at all. Secondly, a histogram sample looks completely different from a float sample in the low-level TSDB data. There is no way the code can confuse one for the other. But a label is just a label. It could accidentally get removed, or added (or maybe even on purpose, "hard-casting" the metric type, if you want), so a relatively lightweight thing like a label will change how a function processes something that is just a float in the TSDB in either case.

Is this what you were referring to when you said mixed samples, or did you just mean the case where a single series had different temporality over time?

I was thinking mostly about one and the same series that changes from cumulative to delta over time. (Your point about mixed vectors is also valid, but that would be solved by the proposed "overloaded" functions just fine.)

@fionaliao (Contributor Author):

@beorn7 I'll update this doc as you suggested (with the first step + laying out the options for future steps without committing to any), and incorporate your and @enisoc's comments.

@beorn7 (Member) commented Apr 30, 2025:

Thank you. Feel free to express a preference (like putting the "most likely" outcome first). As said, I just would have a problem making the call at this time.

@fionaliao (Contributor Author) commented May 28, 2025:

As an update - I am still working on updating this proposal, but progress has been slow due to other work priorities

@fionaliao (Contributor Author):

@beorn7 Would you be open to having deltas ingested as gauges by default, with an option to ingest as counters with a __temporality__="delta" label? With documentation making it clear that all of this is experimental and could be removed. This won't include implementing any function overloading, it just adds the label so users can distinguish between delta counters and gauges if they want to.

I think to explore whether it's worth pursuing the __temporality__ label, we should offer it as an option and see how users interact with it.

@beorn7 (Member) commented Jun 3, 2025:

Sounds good to me.

@ArthurSens (Member):

@beorn7 Would you be open to having deltas ingested as gauges by default, with an option to ingest as counters with a __temporality__="delta" label? With documentation making it clear that all of this is experimental and could be removed. This won't include implementing any function overloading, it just adds the label so users can distinguish between delta counters and gauges if they want to.

I think to explore whether it's worth pursuing the __temporality__ label, we should offer it as an option and see how users interact with it.

Just to clarify, how would this optionality be granted to users? Is it yet another feature flag? A config option?

@fionaliao (Contributor Author):

@ArthurSens I was thinking of just having a single feature flag for all delta ingestion options. Currently we already have two flags for delta ingestion: --enable-feature=otlp-deltatocumulative and --enable-feature=otlp-native-delta-ingestion. Instead we could have a single --enable-feature=otlp-delta-ingestion flag, and then have a config option to a) set the type as gauge (default), b) add the temporality label, or c) convert to cumulative. This way we don't have to check that multiple delta ingestion features are enabled (like we currently have to do).

@ArthurSens (Member):

Interesting, my idea was to eventually sunset the delta to cumulative conversion 🤔

I'm a bit concerned with adding new fields to the config file, because removing them would be a breaking change. Removing feature flags isn't as hard as removing config options.

@fionaliao (Contributor Author) commented Jun 6, 2025:

@ArthurSens In that case, maybe having two feature flags is better, since the cumulative conversion is going away - just two feature flags for delta ingestion is manageable. I think we can use --enable-feature=otlp-native-delta-ingestion as the one with the __temporality__ label (since this is introducing a specific "type" for deltas), and then a --enable-feature=otlp-delta-as-gauge-ingestion for setting the type as gauge.

And then in the documentation, highlight that the gauge option is relatively more stable, since it's a pre-existing type and has been used for delta-like use cases in Prometheus already, while the temporality label option is very experimental and dependent on other experimental features.

Comment on lines 119 to 120
2. `--enable-feature=otlp-native-delta-ingestion`: Ingests OTLP deltas as a new delta "type", using a new `__temporality__` label to explicitly mark metrics as delta.


Should we consider a more generic label to keep track of the original OTEL type (e.g. __otel_type__) - so we can use this information to make other decisions in the future - similar to what we are proposing for the rate function here?

@fionaliao (Contributor Author):

I did like this idea a lot - for the deltas-as-gauge option, this would allow us to keep the original OTEL type information, but lets us treat deltas as a special case for OTEL compatibility rather than modifying the core Prometheus model. So I've actually updated the proposal to do this plus some extensions - the idea now is to keep the Prometheus type as gauge (as it fits the Prometheus gauge definition) and add __otel_type__="sum" and __otel_temporality__="delta".

fionaliao added a commit to grafana/mimir that referenced this pull request Jul 7, 2025
Introduces the `-distributor.otel-native-delta-ingestion` flag
(and corresponding per-tenant setting), which enables primitive OTEL
delta metrics ingestion via the OTLP endpoint. This feature was
implemented in Prometheus in
prometheus/prometheus#16360. This PR allows
Mimir users to enable this feature too.

As per the Prometheus PR:

> This allows otlp metrics with delta temporality to be ingested and
stored as-is, with metric type unknown. To get "increase" or "rate",
`sum_over_time(metric[<interval>])` (`/ <interval>`) can be used.

> This is the first step towards implementing
prometheus/proposals#48. That proposal has
additional suggestions around type-aware functions and making the rate()
and increase() functions work for deltas too. However, there are some
questions around the best way to do querying over deltas, so having this
simple implementation without changing any PromQL functions allows us to
get some form of delta ingestion out there and gather some feedback to
decide the best way to go further.

---------

Co-authored-by: Taylor C <[email protected]>
@fionaliao force-pushed the fionaliao/delta-proposal branch from 3de5158 to 055e315 on July 18, 2025 21:28
@fionaliao force-pushed the fionaliao/delta-proposal branch from 78ca5b8 to cd7a703 on July 24, 2025 17:35

Therefore it is important to maintain information about the OTEL metric properties. Alongside setting the type to `gauge` / `gaugehistogram`, the original OTEL metric properties will also be added as labels:

* `__otel_type__="sum"` - this will allow the metric to be converted back to an OTEL sum (with delta temporality) rather than a gauge if the metrics are exported back to OTEL.
@fionaliao (Contributor Author):

Wondering if we need the __otel_type__ label, since you can infer a metric is an OTEL sum if it has the __temporality__ label (OTEL gauges don't have temporality).

Metrics ingested from other sources with these labels could then be accidentally converted to OTEL sums. That might be acceptable though - these labels would probably only be added to indicate deltas.

@bwplotka (Member), Aug 27, 2025:

Yes, it sounds like we don't need this, so I would skip this step. Also, we don't need it if we use counters from day 1.


#### Function warnings

To help users use the correct functions, warnings will be added if the metric type/temporality does not match the types that should be used with the function. The warnings will be based on the `__type__` label, so they will only work if `--enable-feature=type-and-unit-labels` is enabled. Because of this, while not necessary, it will be recommended to enable the type and unit labels feature alongside the delta support feature.
@fionaliao (Contributor Author):

Right now the proposal suggests using the __type__ label for warnings, but should we also add a specific __temporality__="delta" check for rate()? That way users can still get warnings if they've turned on delta ingestion but don't have type and unit metadata turned on.

Member:

I think that's a good thing to add :)

@fionaliao (Contributor Author) commented Jul 24, 2025:

Updated to remove the __otel_ prefix where unnecessary, following the dev summit consensus: https://docs.google.com/document/d/1uurQCi5iVufhYHGlBZ8mJMK_freDFKPG0iYBQqJ9fvA/edit?tab=t.0#heading=h.xp8na12byu2i

Consensus: There are no objections to the proposal to ingest delta metrics directly with additional labels (sans the otel_ prefix) into the tsdb as gauges as an experimental feature.


* `__otel_type__="sum"` - this will allow the metric to be converted back to an OTEL sum (with delta temporality) rather than a gauge if the metrics are exported back to OTEL.
* `__temporality__="delta"`
* `__monotonicity__="true"/"false"` - as mentioned in [Monotonicity](#monotonicity), it is important to be able to ingest non-monotonic counters. Therefore this label is added to be able to distinguish between monotonic and non-monotonic cases.
Member:

Did you mean __monotonicity__="monotonic/non-monotonic"? Or maybe __monotonic__="true/false"?

@fionaliao (Contributor Author):

I like __monotonicity__="monotonic/non-monotonic" (having descriptive label values is nice)

Member:

Good question, I think __monotonic__ or __is_monotonic__ would be ok choices.

Comment on lines +181 to +185
### Federation

Federating delta series directly could be usable if there is a constant and known collection interval for the delta series, and the metrics are scraped at least as often as the collection interval. However, this is not the case for all deltas and the scrape interval cannot be enforced.

Therefore we will add a warning to the delta documentation explaining the issue with federating delta metrics, and provide a scrape config for ignoring deltas if the delta labels are set.
Member:

Instead of giving a warning, could we update the federation endpoint to just not expose metrics with __temporality__="delta" and/or __monotonicity__="non-monotonic"?

@fionaliao (Contributor Author):

It is possible to get useful information from scraping delta metrics, as per #48 (comment). So I think it's worth giving users the choice of what to do


@bwplotka (Member) left a comment:

Thanks for the amazing work on this one!!!

Can I challenge the need for the delta-as-gauge idea? Given the new changes, I'd argue we should use counter + temporality from day 1.

Feel free to dismiss this, but wanted to put some findings here! 🤗

Context

The initial version of this proposal assumed we DON'T want a __temporality__ label, but instead will have gauge for now and a specific delta type later.

However @aknuds1 pinged me on the PR prometheus/prometheus#16971 that is adding it, so I dived into the newest state of this proposal.

So the current plan is to:

  • Add a temporality label to gauges that represent deltas
  • NOT add a special type in the future, but instead ingest them as counters one day.

I would argue that IF we decided to switch direction and share existing types and add another temporality dimension, it might be more beneficial to put that dimension on counters and ingest deltas as counters (with temporality=delta) from day 1.

Rationales

  1. We wanted to experiment and gather data. So why not start with counters? We already plan to "confuse/break" people by asking them to ingest delta-gauges, name them like counters, and use counter-based functions (increase etc.), but we are afraid to add some unused (see why unused below) metadata to the counter type here?
  2. I would challenge the claim that we break people by breaking the "counter definition" when putting deltas on counters (do you have at least one practical example?). Our metadata-wal-records is unusable for long term storages (overhead, not reliable), and none of the internal Prometheus code does anything different between gauge vs counter. type and unit will improve this, but it's too new a feature to be widely adopted already. So why not just do this, given the experimentation phase? There might be some tools that maybe do something different for _total names, but from your examples I see you planned to add deltas as gauges with counter-like naming (!), so we already break this? 🙈
  3. Using counter with a new dimension is more aligned with OpenTelemetry, no?
  4. Using counter with a new dimension feels aligned with the histogram vs gaugehistogram story, no? A gauge histogram is NOT a delta sum histogram.
    1. Switching OTel Prometheus users from delta-as-gauge to delta-as-counter will be MUCH more painful later, once type-and-unit is adopted AND we get users to use gauge now.
  5. Maybe I get this wrong, but we planned to add deltas as gauges. Then we noticed some PromQL functions would work better if we knew whether it's a normal gauge vs a delta gauge, so we planned to add temporality labels to gauges... Wouldn't it be 100x easier if we start PromQL handling with counter + temporality from day 1, vs trying to remove code later on to handle special deltas with temporality?

To sum up

I believe we have two options:

  1. We follow the OTel model of counter/histogram + temporality. For this I'd argue using gauges now is more damaging than using counters from day 1.
  2. We create a new model without a new label dimension - with gauge, counter, delta (and gaugehist, hist, deltahist). If we follow this, using gauges might make sense temporarily.

Rightly so, it seems that this proposal proposes (1) - in which case I'd use counters from day 1.


**Cons**
* Introduces additional complexity to the Prometheus data model.
* Confusing overlap between gauge and delta counters. Essentially deltas already exist in Prometheus as gauges, and deltas can be viewed as a subset of gauges under the Prometheus definition. The same `sum_over_time()` would be used for aggregating these pre-existing deltas-as-gauges and delta counters, creating confusion on why there are two different "types".
Member:

Essentially deltas already exist in Prometheus as gauges, and deltas can be viewed as a subset of gauges under the Prometheus definition.

Is it really true? What exactly is that subset? That a gauge can go down and a counter cannot? A counter can go down too, meaning a reset. A delta can go down because it's a diff. Gauge is for a current value, delta is for a diff. I see equal confusion when used on either type.

@fionaliao (Contributor Author):

I'm using the definition of "A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.". Though that technically fits for any kind of data, even cumulative counters 😅. What I'm trying to say is that delta data is already being represented as gauges in Prometheus:

  1. Some outputs of recording rules could be considered deltas:

    While gauges ingested into Prometheus via scraping represent sampled values rather than a count within an interval, there are other sources that can ingest "delta" gauges. For example, increase() outputs the delta count of a series over a specified interval. While the output type is not explicitly defined, it's considered a gauge. A common optimisation is to use recording rules with increase() to generate “delta” samples at regular intervals. When calculating the increase over a longer period of time, instead of loading large volumes of raw cumulative counter data, the stored deltas can be summed over time.

    Regarding considering the output of increase() a gauge, it's not explicitly stated anywhere, but in the metadata labels proposal, it says

    we might want to make certain functions return a useful type and unit e.g. a rate over a counter is technically a gauge metric.

    I assume we'd apply the same type to the output of increase() too.

  2. sum_over_time() is available in Prometheus. This is the best-suited function for deltas at the moment, but it's already present, which means it's probably being used for summing gauges already. If you're summing gauges, it probably means the data you have are deltas and you want to aggregate the diff over a larger period of time.

  3. We have to set the counter reset header for delta histograms as GaugeType because the data doesn't fit the assumptions of the counter histograms (i.e. multiple samples before a reset).

Additionally, non-monotonic cumulative sums in OTEL are already ingested as Prometheus gauges, meaning there is precedent for counter-like OTEL metrics being converted to Prometheus gauge types.

Gauge is for a current value, Delta is for diff

I think Prometheus has two meanings for gauge which can lead to some confusion of what a gauge means:

  1. Gauge on the instrumentation side e.g. in the Go client. This has to represent the current value because Prometheus instruments are scraped.
  2. Gauge in the Prometheus data model definition, which is just a value that can go up or down, with no restrictions on whether it represents just the current value or a value collected over time.

I know there is a difference between whether it technically makes sense to mark deltas as gauges and whether it's a good idea to do so. But I'd say in the current state of Prometheus, if we had to map deltas to an existing type with minimal changes, gauge is the best type.

@bwplotka (Member), Aug 27, 2025:

For example, increase() outputs the delta count of a series over a specified interval.

That's a fair observation. We do have some gauge-as-delta precedent, however with no temporality label. Would we add __temporality__="delta" to some function outputs with the delta-gauge idea, and then type="counter", temporality="delta" one day to those?

Delta histograms will use native histogram chunks with the GaugeType counter reset hint/header. The counter reset behaviour for cumulative native histograms is to cut a new chunk if a counter reset is detected. A (bucket or total) count drop is detected as a counter reset. As delta samples don’t build on top of each other, there could be many false counter resets detected and cause unnecessary chunks to be cut. Additionally, counter histogram chunks have the invariant that no count ever goes down baked into their implementation. GaugeType allows counts to go up and down, and does not cut new chunks on counter resets.

Interesting, so this means gauge histograms have some special handling that helps with the temporality="delta" and/or is_monotonic="true" cases, as a side effect. This is opposed to the current gauge, which is really semantics for how the metric should be understood, more than a practical storage/PromQL engine influencer.

What's the long term game here? To represent delta histograms as counter histograms with type="counter", temporality="delta", is_monotonic="true|false" labels, right?

I wonder how hard it would be to implement now. In fact this might literally remove the need for the GaugeType histogram in our storage, as we really only need to care about sample types like float (with CT) and histogram (maybe float histogram vs int histogram) and nothing else, and adjust how we cut chunks and encode based on those labels.

Additionally, non-monotonic cumulative sums in OTEL are already ingested as Prometheus gauges, meaning there is precedent for counter-like OTEL metrics being converted to Prometheus gauge types.

Sure, but no label or special care later on for those gauges is added, so it's "fine" (wild-west, best effort, no special handling for this anywhere, great starting point).

I think Prometheus has two meanings for gauge which can lead to some confusion of what a gauge means:

  1. Gauge on the instrumentation side e.g. in the Go client. This has to represent the current value because Prometheus instruments are scraped.
  2. Gauge in the Prometheus data model definition, which is just a value that can go up or down, with no restrictions on whether it represents just the current value or a value collected over time.

True, it's open to interpretation. I would add the OM definition, which is closer to (1), and the Otel definition, which is literally (1), despite Otel being mainly push based. (So I would challenge the "because Prometheus instruments are scraped" statement.)

(2) literally mentions the __type__=counter and __is_monotonic__=false case as a valid use of gauge (at the current moment), because there are no other options:

Gauges are typically used for measured values like temperatures or current memory usage, but also "counts" that can go up and down, like the number of concurrent requests.

But we are designing those options here, so it's up to us how we shape this definition.

And essentially the high level question is, for the long term, do we want:

a) We go deeper into (2) and have the Prometheus gauge literally be an __is_monotonic__=false setting, so type=gauge only means __is_monotonic__=false and type=counter means __is_monotonic__=true. This is the current situation. A temporality label could then be added to both types, and it would control whether a CT/start time is present or not.
b) We follow the Otel sum model and add __temporality__="cumulative|delta" and __is_monotonic__="true|false" ONLY to counters, and we follow definition (1). So the gauge vs counter difference is more of a "current value" vs "count over time" semantic distinction for humans, and there is no cumulative or delta notion for gauges.

Sounds like we want to go in the direction of (b), but we also go deeper into (a) as a temporary measure; is my understanding correct?

I know there is a difference between it technically makes sense to mark deltas as gauges and whether it's a good idea to do so. But I'd say in the current state of Prometheus, if we had to map deltas to an existing type with minimal changes, gauge is the best type.

I agree, and I thought that was the initial plan. But we are discussing more changes to gauge type handling here (the "Add otel metric properties as informational labels", "Federation", and "Remote write ingestion" sections) and this is what worries me. If those changes are required, with the risk of people starting to depend on them and us maintaining some of the code... is it still the best type? (:

* Pre-existing deltas-as-gauges could be converted to counters with `__temporality__="delta"`, to have one consistent "type" which should be summed over time.
* Systems or scripts that handle Prometheus metrics may be unaware of the new `__temporality__` label and could incorrectly treat all counter-like metrics as cumulative, resulting in hard-to-notice calculation errors.

We decided not to go for this approach for the initial version as it is more invasive to the Prometheus model - it changes the definition of a Prometheus counter, especially if we allow non-monotonic deltas to be ingested as counters. This would mean we won't always be able to convert Prometheus delta counters to Prometheus cumulative counters, as Prometheus cumulative counters have to be monotonic (drops are detected as resets). There's no actual use case yet for being able to convert delta counters to cumulative ones, but it makes the Prometheus data model more complex (counters sometimes need to be monotonic, but not always). An alternative is to allow cumulative counters to be non-monotonic (by setting a `__monotonicity__="false"` label) and add warnings if a non-monotonic cumulative counter is used with `rate()`, but that makes `rate()` usage more confusing.
Member


We decided not to go for this approach for the initial version as it is more invasive to the Prometheus model - it changes the definition of a Prometheus counter, especially if we allow non-monotonic deltas to be ingested as counters

Again, I disagree. I would assume that a type=counter that:

  • does not have _total suffix (arguable even?)
  • has temporality=delta

.. is simply another type - a delta.

Ofc I would NOT put temporality=cumulative on all counters now, but assume that's the default.

@fionaliao
Contributor Author

fionaliao commented Aug 22, 2025

@bwplotka Thanks for the detailed review - it's definitely worth challenging whether to map to gauges or not.

The main argument for treating as gauge is that Prometheus basically uses gauges to represent deltas already (#48 (comment)) and it works, so let's see if that's enough before potentially advancing to more complex implementations.

This does require the user to distinguish delta gauges from other gauges (e.g. scraped ones) to use the appropriate functions, but something similar would be needed if we mapped them to counters and had to distinguish between cumulative and delta counters.

I agree it would be more disruptive to users depending on deltas-as-gauge if we do make the change to counter type later though. However, what's the extent of the disruption?

  • If there's any logic like if gauge && temporality == delta then ..., sure this would break
  • If we don't modify rate() and get them to use sum_over_time() to query (see next section), changing the type won't affect that query, sum_over_time() would still work.

We plan to already "confuse/break" people by asking them to ingest delta-gauges and name them as counters

What do you mean by "name them as counters"? Do you mean retaining __otel_type__ as a label?

use counter based functions (increase etc)..

Using counter-based functions (e.g. rate()) for deltas is a potential future extension, though there are questions about how to implement this properly, especially with the new smoothed and anchored modifiers. I'd rather wait for those to become more established before making any changes wrt deltas. Also, sum_over_time() is an existing function that would work for deltas, so do we need to modify the rate() function as well?

I agree if we get to the point of implementing delta-based rate(), it would be more painful to switch users from typing as gauge to typing as counter. Right now, if they're using sum_over_time() for querying deltas, that should always work no matter the type. If we decide to first support counter-based functions for deltas-as-gauges (aka function overloading) then switch the type to counter and then remove counter-based functions for deltas-as-gauges support then that will be messy.
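
For reference, a sketch of the sum_over_time() usage in question (the metric name is illustrative); it works the same whether the delta series is typed as a gauge or a counter:

```promql
# Approximate per-second rate of a delta series over the last 5 minutes:
# sum the deltas in the window, then divide by the window length (5m = 300s).
sum_over_time(otel_http_server_request_count[5m]) / 300
```

Dropping the `/ 300` gives the increase over the window rather than the rate.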

Maybe the extension needs to be - we'll only support function overloading if we also change the delta type to counter.


I would challenge the fact we break people given we break the "counter definition" by putting delta as counters (do you have at least one practical example)?

I do have Grafana as an example of something that tries to use metric type information - to give hints about whether to use rate() or not, but it is a best effort approach: #48 (comment).

However, I don't have other examples; my concern is mostly hypothetical - Prometheus is widely used and types have been pretty stable, so I wouldn't be surprised if there are systems that depend on that.


but in https://github.com/prometheus/prometheus/pull/16360 I see you planned to add deltas as gauges with counter-like naming (!), so we already break this? 🙈

That's not the case anymore - the PR was created before we decided to drop _total from the metric name in the proposal:

The `_total` suffix will not be added to OTEL deltas. The `_total` suffix is used to help users figure out whether a metric is a counter. The `__otel_type__` and `__temporality__` labels will be able to provide the distinction and the suffix is unnecessary.


Another thing that makes me wary of using the counter type is what we do with non-monotonic sums. Do we just treat them as Prometheus gauges? But statsd counters are non-monotonic by definition - and as statsd is so popular, we could end up with lots of deltas being converted into gauges, which defeats the purpose of mapping to counters. If we say that counters can be non-monotonic, this is fine for the delta case.

But how do we handle non-monotonic cumulative sums in otel? Currently we map them to gauges. We can keep mapping them to gauges, but then we're still being inconsistent: we're saying that all otel sums can be mapped to counters except for non-monotonic cumulative sums. If we map them to Prometheus counters, we end up with an unrateable counter, as we use decreases to indicate counter resets.

Assuming we get function overloading and rate() and increase() working for deltas, we're now saying all counters can use rate() except non-monotonic cumulative ones, so there's an inconsistency again.

Here is where CT-per-sample (or some kind of precise CT tracking) would be extremely useful - then we wouldn't need to use drops to figure out counter resets and could potentially support rating non-monotonic cumulative counters.


CT-per-sample is actually another reason why it might be better to map deltas to gauges for now. If we had CT-per-sample, we could map deltas to counters without really needing temporality information as per https://github.com/prometheus/proposals/blob/fionaliao/delta-proposal/proposals/0048-otel_delta_temporality_support.md#treat-as-mini-cumulative and we wouldn't need function overloading. If we mapped deltas to counters now, we'd then have to continue to support deltas-as-counter-without-CT-per-sample for backwards compatibility.
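
For intuition, a rough sketch of the "mini-cumulative" reading of CT-per-sample deltas (timestamps and values purely illustrative):

```
# Delta samples carrying an explicit start timestamp (CT) per sample:
#   (CT=10:00:00, T=10:01:00, value=5)
#   (CT=10:01:00, T=10:02:00, value=3)
# Each sample can be read as a tiny cumulative counter that is 0 at CT and
# reaches its value at T, so rate()/increase() could attribute each increase
# to its exact interval without relying on value drops to detect resets.
```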

@bwplotka
Member

Thanks for considering! I will take a deeper look from Tuesday, perhaps worth chatting during the work group too.

But.. what if we have the CT-per-sample feature literally next week or month? Would it be enough to convince us to use counters (without the temporality label)? The mini-cumulatives idea makes sense to me. If yes, then getting CT-per-sample (which is needed for other reasons too) might be easier if we "swarm" this together vs getting through a delta-as-gauge period, no? (:

I don't yet understand the non-monotonic sum from statsd question. Do you mean a statsd counter is literally like an Otel UpDownCounter or something? I would need to learn more about it here, but it sounds like a monotonicity dimension separate from delta, no?

@fionaliao
Contributor Author

But.. what if we have the CT-per-sample feature literally next week or month? Would it be enough to convince us to use counters (without the temporality label)? The mini-cumulatives idea makes sense to me. If yes, then getting CT-per-sample (which is needed for other reasons too) might be easier if we "swarm" this together vs getting through a delta-as-gauge period, no? (:

Yeah I think if CT-per-sample is available soon then deltas-as-gauge wouldn't be necessary. I would like to retain the temporality label even if we had CT-per-sample, but it would be mostly informational (or useful if we needed to translate back to otel).

However, we have most of the code for deltas-as-gauges ready, so would it be okay to have that merged while working on CT-per-sample? The delta feature is experimental and there are warnings around label changes.

I don't yet understand the non-monotonic sum from statsd question. Do you mean a statsd counter is literally like an Otel UpDownCounter or something? I would need to learn more about it here, but it sounds like a monotonicity dimension separate from delta, no?

Yes, from the statsd spec: "A counter is a gauge calculated at the server. Metrics sent by the client increment or decrement the value of the gauge rather than giving its current value."
Also see this discussion: open-telemetry/opentelemetry-collector-contrib#1789
(also counters are subsets of gauges in statsd lol)

It's separate from temporality, but it's another property of an otel counter.

@fionaliao
Contributor Author

However, we have most of the code for deltas-as-gauges ready, so would it be okay to have that merged while working on CT-per-sample? The delta feature is experimental and there are warnings around label changes.

To expand on this - I think realistically, it'll take a while to have CT-per-sample implemented - for it to be useful for deltas, we'd need the storage part to be done, test performance, update rate() and increase() to take these into consideration, including updating the new range vector selectors (which are currently a WIP).

I think there are three options for what we should do before CT-per-sample is done:

  1. Set deltas with type gauge, add __temporality__ and __monotonicity__ labels
  2. Set deltas with no type, add __temporality__ and __monotonicity__ labels
  3. Do nothing - we already have primitive delta ingestion which adds no types and labels, so people aren't blocked from ingesting deltas

For all three options, only sum_over_time() can be used to aggregate deltas over time. rate() and increase() will not work. Even if we had CT-per-sample, sum_over_time() would still work for delta samples. So users would not have to update their queries once we can ingest deltas as counters with CT-per-sample. They might want to, so that there's a single consistent function for getting the rate or increase over cumulative and delta counters, but it's optional.

I like the first option because it can take advantage of any gauge type checking if __type__="gauge" is set, for example, warning if rate() or increase() is used rather than sum_over_time(). And it shouldn't break any gauge semantics (sum_over_time() is usable for gauges outside of otel deltas). Essentially it makes deltas more usable than the other options before CT-per-sample is available.
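
To illustrate what option 1 could look like (the metric name and label values are hypothetical, and the exact label names are whatever this proposal settles on), an ingested OTEL delta sum might be stored and selectable as:

```promql
# Hypothetical series ingested under option 1: gauge type plus the proposed
# temporality and monotonicity metadata labels.
otel_http_server_request_count{job="checkout", __type__="gauge",
  __temporality__="delta", __monotonicity__="true"}
```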

The two issues I see with migration from Option 1 to CT-per-sample (plus delta type set to counter):

  • if people are ingesting deltas as gauges without the __type__ label, while rate() would work for the new samples, we wouldn't flag up warnings if it's also used on the older gauge samples
  • suffix handling for counters and gauges is different (_total is appended to counters by default - but with type and unit metadata we want to stop doing that anyway)

We can write in the Prometheus documentation that CT-per-sample is our preferred long-term solution, and what will change when that is done.

For this proposal - I can move "Treat as mini-cumulative" to the top of the potential future extensions, as the preferred future solution (over function overloading)?

@bwplotka
Member

bwplotka commented Aug 27, 2025

Thanks for explaining, let's dive in a bit. I commented on the subthreads, but maybe we should close those and have a single thread.

See summary below for TL;DR

To answer your top-level comments

It's separate from temporality, but it's another property of an otel counter.

Yea, that means ~upAndDownCounter support then. I see Otel got to the same conclusion. Good question how that could be represented in Prometheus (if we ever want it). Have we thought about that?

It does feel like a counter with a monotonicity dimension (set to "false") and temporality=delta, no? Gauge feels wrong for this if we follow the second plan (b) mentioned in the comment, so:

We follow the Otel sum model and add __temporality__="cumulative|delta" and __is_monotonic__="true|false" ONLY to counters, and we follow definition (1). So the gauge vs counter difference is more of a "current value" vs "count over time" semantic distinction for humans, and there is no cumulative or delta notion for gauges.

However, we have most of the code for deltas-as-gauges ready, so would it be okay to have that merged while working on CT-per-sample? The delta feature is experimental and there are warnings around label changes.

I'm biased towards "achievable perfection" 🙃 Meaning, if we are a short step away from doing something we want long term, my preference is to skip temporary steps in short-term directions.

This is because:

  • We are busy engineers and there's never time to fix tech debt (leftovers, moving to the long-term option). This is not a biggie if we just ingest this as a delta with a new label. But I saw some ideas to introduce extra complex logic that assumes gauges can have a temporality label (the "Add otel metric properties as informational labels", "Federation", and "Remote write ingestion" sections).
  • Hyrum's Law -- it might be an experimental feature but it will be recommended to all Otel users. Imagine 100% of those using deltas enabling this and using it in PromQL and external systems. Would they be happy with sudden changes to how the new metric integrates with vendors when it goes through Prometheus, or having to suddenly add the _total suffix, or some other changed PromQL behaviours, etc.?
  • We'd spend time on something short term, where we could join forces on the long-term one and achieve it... short term (:
  • Finally: we do this for feedback. Why don't we try using counter and see the feedback for what it might look like with counters? Wouldn't this be more useful?

To expand on this - I think realistically, it'll take a while to have CT-per-sample implemented - for it to be useful for deltas, we'd need the storage part to be done, test performance, update rate() and increase() to take these into consideration, including updating prometheus/prometheus#16457.
I think there are three options for what we should do before CT-per-sample is done:

  1. Set deltas with type gauge, add temporality and monotonicity labels
  2. Set deltas with no type, add temporality and monotonicity labels
  3. Do nothing - we already have primitive delta ingestion which adds no types and labels, so people aren't blocked from ingesting deltas

I think we have more options (:

  4. Set deltas with type counter, add __temporality__ and __monotonicity__ labels (or __is_monotonic__), without CT-per-sample.
  5. Set deltas with no type / type gauge, add otel_converted_from="sum/delta/monotonic" or "sum/cumulative/non-monotonic" or "sum/delta/monotonic"

I like the first option because it can take advantage of any gauge type checking if __type__="gauge" is set, for example, warning if rate() or increase() is used rather than sum_over_time()

None of those warnings exist at the moment, plus we can easily extend them if we pursue the experimental long-term feature (__type__="counter", __temporality__="delta"), no? 🙈

And it shouldn't break any gauge semantics (sum_over_time() is usable for gauges outside of otel deltas). Essentially it makes deltas more usable than the other options before CT-per-sample is available.

Again, what are the real consequences of "breaking counter semantics"? For other, softer consequences (human confusion), are we sure it's not a better idea to let them learn, or to fix downstream systems, by slowly adopting the temporality label on a counter? We already ask users to make a conscious decision to enable a special metric, so let's make it special the way we want it, and not something we will change ASAP in the future.

The two issues I see with migration from Option 1 to CT-per-sample (plus delta type set to counter):

  • if people are ingesting deltas as gauges without the type label, while rate() would work for the new samples, we wouldn't flag up warnings if it's also used on the older gauge samples

Hm, I checked the recent PR for warnings, and I think it could be improved to handle unknown types better, which removes this issue (https://github.com/prometheus/prometheus/pull/16632/files#r2303429110).

  • suffix handling for counters and gauges are different (_total is appended to counters by default - but with type and unit metadata we want to stop doing that anyway)

Well, that to me is an argument to NOT use gauges for clear counts in a delta form - this is rather a beneficial side effect if we use counters, no? (:

Summary

Lots of discussion points and arguments. Thank you so much @fionaliao for considering my points. Let's try to summarize/distill what the root problem is here.

My main worry is around the desired gauge definition in Prometheus and how our temporary experiment can confuse downstream devs and users who don't have time to read this proposal or the detailed type nuances.

Gauge definition

I think we summarized it well together in #48 (comment)

essentially the high level questions is for long term, do we want:

a) We go deeper into (2) and have the Prometheus gauge literally be an __is_monotonic__=false setting, so type=gauge only means __is_monotonic__=false and type=counter means __is_monotonic__=true. This is the current situation. A temporality label could then be added to both types, and it would control whether a CT/start time is present or not.
b) We follow the Otel sum model and add __temporality__="cumulative|delta" and __is_monotonic__="true|false" ONLY to counters, and we follow definition (1). So the gauge vs counter difference is more of a "current value" vs "count over time" semantic distinction for humans, and there is no cumulative or delta notion for gauges.

Do you know the answer? This proposal seems to agree with (b) but actually implements (a) as a short term thing, no? 🙃

Options

Let's pull our extended list again:

  1. Set deltas with type gauge, add temporality and monotonicity labels
  2. Set deltas with no type, add temporality and monotonicity labels
  3. Do nothing - we already have primitive delta ingestion which adds no types and labels, so people aren't blocked from ingesting deltas
  4. Set deltas with type counter, add __temporality__ and __monotonicity__ labels (or __is_monotonic__), without CT-per-sample.
  5. Set deltas with no type / type gauge, add otel_converted_from="sum/delta/monotonic" or "sum/cumulative/non-monotonic" or "sum/delta/monotonic"

In the current proposal we pursue (1) temporarily. So our downstream devs and users will see a distinct gauge type with temporality, monotonicity and potentially otel labels, and think:

Oh God, what is this, why does Prometheus have to do this, totally opposite to Otel, even for a new feature... they now use gauge for my statsd delta counts? Well.. fine, let's embrace it.

And then 1y later...

wait what.. why my metric does not work/changed?

That's why my preference would be to avoid the confusion of (1) and do the long-term option (4) straight away.
My second preference would be to do the least confusing experiment, so (5) first: a clearly temporary Otel conversion thing that will not spread like a virus too much (even if somebody depends on it, it will be clear that it was a temporary thing), with a native implementation coming next.
Then my 3rd preference is actually (3) which is what we have now.

(1) is too odd - too much extra work for a temporary short-term experiment that already spreads to other parts of Prometheus and the ecosystem, where devs will need to think about handling those special gauges. It also goes too deep into a gauge definition that we don't want to have (my assumption, with the (a) gauge definition above).

However, we have most of the code for deltas-as-gauges ready, so would it be okay to have that merged while working on CT-per-sample? The delta feature is experimental and there are warnings around label changes.

So yea... I would NOT approve this - I think there are better ways, even without the counter type - but of course I won't block you; it's just my opinion, and I only hope you will be there to explain all of this to downstream devs and users 🙈

@bwplotka bwplotka dismissed their stale review August 29, 2025 04:13

We chatted offline with @fionaliao -- I did my best to convince the quorum here, but the decision (so far) is to keep going (as a tmp step) into delta as gauge with generic temporality and monotonicity labels.

This is similar to the DevSummit consensus (similar, because consensus was about _otel... specific labels, not generic ones, but that's a nit).

So far I still disagree with the tmp step (my preference is (4), (5) or (3)), but I'm happy to "commit" -- it's not a big deal, given it's only a tmp step.

I think we agree on the long term points, so let's collab on that more once we are ready!

Thanks for bringing your points and constructive discussion so far!
