Commit d4eb35a

cijothomas and lalitb authored
docs: on how to set right cardinality limit (#2998)
Co-authored-by: Lalit Kumar Bhasin <[email protected]>

1 parent 1f0d9a9 commit d4eb35a

File tree: 1 file changed (+110 -5 lines)

docs/metrics.md: 110 additions & 5 deletions
@@ -485,9 +485,9 @@ practice is important:

* **Delta Temporality**: The SDK "forgets" the state after each
  collection/export cycle. This means in each new interval, the SDK can track
  up to the cardinality limit of distinct attribute combinations. Over time,
  your metrics backend might see far more than the configured limit of
  distinct combinations from a single process.

* **Cumulative Temporality**: Since the SDK maintains state across export
  intervals, once the cardinality limit is reached, new attribute combinations
@@ -560,7 +560,108 @@ The exported metrics would be:
words, attributes used to create `Meter` or `Resource` attributes are not
subject to this cap.

#### Cardinality Limits - How to Choose the Right Limit

Choosing the right cardinality limit is crucial for maintaining efficient memory
usage and predictable performance in your metrics system. The optimal limit
depends on your temporality choice and application characteristics.

Setting the limit incorrectly can have consequences:

* **Limit too high**: Due to the SDK's [memory
  preallocation](#memory-preallocation) strategy, excess memory will be
  allocated upfront and remain unused, leading to resource waste.
* **Limit too low**: Measurements will be folded into the overflow bucket
  (`{"otel.metric.overflow": true}`), losing granular attribute information and
  making attribute-based queries unreliable.
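The overflow fold can be sketched as a capped map of attribute sets. This is a deliberately simplified model of the SDK's aggregation store, not its actual implementation; the `simulate_overflow` helper and the string key standing in for `{"otel.metric.overflow": true}` are made up for illustration:

```rust
use std::collections::HashMap;

/// Illustrative stand-in for the overflow attribute set
/// `{"otel.metric.overflow": true}`.
const OVERFLOW_KEY: &str = "otel.metric.overflow=true";

/// Fold counter increments into a store capped at `limit` distinct attribute
/// sets; new sets arriving after the cap is hit land in the overflow bucket.
fn simulate_overflow(limit: usize, attr_sets: &[&str]) -> HashMap<String, u64> {
    let mut store: HashMap<String, u64> = HashMap::new();
    for &attrs in attr_sets {
        // Already-tracked series keep updating; brand-new series beyond the
        // cap are redirected to the overflow key.
        let key = if store.contains_key(attrs) || store.len() < limit {
            attrs.to_string()
        } else {
            OVERFLOW_KEY.to_string()
        };
        *store.entry(key).or_insert(0) += 1;
    }
    store
}

fn main() {
    // Limit of 3: "d" and "e" arrive after the cap is hit and are folded
    // together, so their individual attribute values are lost.
    let store = simulate_overflow(3, &["a", "b", "c", "d", "e", "a"]);
    assert_eq!(store["a"], 2);
    assert_eq!(store[OVERFLOW_KEY], 2);
}
```

Once measurements land in the overflow bucket, no query can recover which attribute sets they came from, which is why attribute-based queries become unreliable.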
Consider these guidelines when determining the appropriate limit:
##### Choosing the Right Limit for Cumulative Temporality

Cumulative metrics retain every unique attribute combination that has *ever*
been observed since the start of the process.

* You must account for the theoretical maximum number of attribute combinations.
* This can be estimated by multiplying the number of possible values for each
  attribute.
* If certain attribute combinations are invalid or will never occur in practice,
  you can reduce the limit accordingly.

###### Example - Fruit Sales Scenario

Attributes:

* `name` can be "apple" or "lemon" (2 values)
* `color` can be "red", "yellow", or "green" (3 values)

The theoretical maximum is 2 × 3 = 6 unique attribute sets.

For this example, the simplest approach is to use the theoretical maximum and
**set the cardinality limit to 6**.

However, if you know that certain combinations will never occur (for example,
if "red lemons" don't exist in your application domain), you could reduce the
limit to only account for valid combinations. In this case, if only 5
combinations are valid, **setting the cardinality limit to 5** would be more
memory-efficient.
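The multiplication rule above can be sketched in plain Rust. This is standalone arithmetic, not an SDK call; the `theoretical_cardinality` function is a name made up for illustration:

```rust
/// Worst-case number of distinct attribute sets: the product of the number
/// of possible values for each attribute.
fn theoretical_cardinality(values_per_attribute: &[usize]) -> usize {
    values_per_attribute.iter().product()
}

fn main() {
    // Fruit sales scenario: `name` has 2 values, `color` has 3.
    let limit = theoretical_cardinality(&[2, 3]);
    println!("cardinality limit for cumulative temporality: {limit}");
    assert_eq!(limit, 6);
}
```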
##### Choosing the Right Limit for Delta Temporality

Delta metrics reset their aggregation state after every export interval. This
enables more efficient memory utilization by tracking only the attributes
observed during each interval, rather than maintaining state for every
combination ever seen.

* **When attributes are low-cardinality** (as in the fruit example), use the
  same calculation method as with cumulative temporality.
* **When high-cardinality attributes exist** (like `user_id`), leverage Delta
  temporality's "forget state" nature to set a much lower limit based on active
  usage patterns. This is where Delta temporality truly excels: when the set of
  active values changes dynamically and only a small subset is active during
  any given interval.
###### Example - High Cardinality Attribute Scenario

Export interval: 60 sec

Attributes:

* `user_id` (up to 1 million unique users)
* `success` (true or false, 2 values)

Theoretical limit: 1 million users × 2 = 2 million attribute sets

But if only 10,000 users are typically active during a 60 sec export interval:
10,000 × 2 = 20,000

**You can set the limit to 20,000, dramatically reducing memory usage during
normal operation.**
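The effect of Delta's per-interval reset can be sketched with sets. The workload below is hypothetical (10 intervals, a disjoint cohort of 10,000 active users per interval) and the `simulate` helper is illustrative, not SDK code:

```rust
use std::collections::HashSet;

/// Track distinct (user_id, success) attribute sets over several export
/// intervals. Returns (peak sets per interval, all-time distinct sets) -
/// roughly what Delta vs Cumulative temporality must hold in memory.
fn simulate(intervals: u32, active_users_per_interval: u32) -> (usize, usize) {
    let mut cumulative: HashSet<(u32, bool)> = HashSet::new();
    let mut delta_peak = 0usize;
    for interval in 0..intervals {
        // Delta state is recreated (the SDK "forgets") every interval.
        let mut delta: HashSet<(u32, bool)> = HashSet::new();
        // Assume a different cohort of users is active in each interval.
        let start = interval * active_users_per_interval;
        for user_id in start..start + active_users_per_interval {
            for success in [true, false] {
                delta.insert((user_id, success));
                cumulative.insert((user_id, success));
            }
        }
        delta_peak = delta_peak.max(delta.len());
    }
    (delta_peak, cumulative.len())
}

fn main() {
    let (delta_peak, cumulative_total) = simulate(10, 10_000);
    // Delta only ever needs ~20,000 tracked sets at once; cumulative grows.
    println!("delta peak = {delta_peak}, cumulative = {cumulative_total}");
    assert_eq!((delta_peak, cumulative_total), (20_000, 200_000));
}
```

A limit of 20,000 suffices for Delta here, while Cumulative would need to accommodate every combination ever observed.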
###### Export Interval Tuning

Shorter export intervals further reduce the required cardinality:

* If your interval is halved (e.g., from 60 sec to 30 sec), the number of unique
  attribute sets seen per interval may also be halved.

> [!NOTE]
> More frequent exports increase CPU/network overhead due to serialization and
> transmission costs.
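Under that steady-rate assumption, a limit sized for one interval can be rescaled proportionally for another. Simple arithmetic, not an SDK API; `scaled_limit` is a name made up for illustration:

```rust
/// Rescale a per-interval cardinality limit when the export interval changes,
/// assuming new attribute sets arrive at a roughly steady rate (real traffic
/// is bursty, so treat this as an estimate, not a guarantee).
fn scaled_limit(limit_at_base: usize, base_secs: u64, new_secs: u64) -> usize {
    (limit_at_base as u64 * new_secs / base_secs) as usize
}

fn main() {
    // 20,000 sets per 60 sec interval suggests ~10,000 per 30 sec interval.
    let halved = scaled_limit(20_000, 60, 30);
    println!("limit for 30 sec interval: {halved}");
    assert_eq!(halved, 10_000);
}
```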
##### Choosing the Right Limit - Backend Considerations

While delta temporality offers certain advantages for cardinality management,
your choice may be constrained by backend support:

* **Backend Restrictions:** Some metrics backends only support cumulative
  temporality. For example, Prometheus requires cumulative temporality and
  cannot directly consume delta metrics.
* **Collector Conversion:** To leverage delta temporality's memory advantages
  while maintaining backend compatibility, configure your SDK to use delta
  temporality and deploy an OpenTelemetry Collector with a delta-to-cumulative
  conversion processor. This approach pushes the memory overhead from your
  application to the collector, which can be more easily scaled and managed
  independently.
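A minimal Collector pipeline for that conversion path might look like the following sketch. The `deltatocumulative` processor is provided by opentelemetry-collector-contrib; the receiver, exporter, endpoint, and `max_stale` value shown here are placeholder assumptions to adapt to your deployment:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  # Converts delta metrics from the SDK into cumulative for the backend.
  deltatocumulative:
    max_stale: 5m   # forget series not seen for this long

exporters:
  prometheusremotewrite:
    endpoint: http://localhost:9090/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [deltatocumulative]
      exporters: [prometheusremotewrite]
```

With this split, the application keeps the small per-interval delta state while the Collector carries the cumulative state for the backend.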
TODO: Add the memory cost incurred by each data point, so users can know the
memory impact of setting a higher limit.

TODO: Add an example of how queries can be affected when overflow occurs, using
the [Aspire](https://github.com/dotnet/aspire/pull/7784) tool.

### Memory Preallocation

@@ -622,7 +723,7 @@ Follow these guidelines when deciding where to attach metric attributes:

* **Meter-level attributes**: If the dimension applies only to a subset of
  metrics (e.g., library version), model it as meter-level attributes via
  `meter_with_scope`.

  ```rust
  // Example: Setting meter-level attributes
  let scope = InstrumentationScope::builder("payment_library")
@@ -660,3 +761,7 @@ Common pitfalls that can result in missing metrics include:
used, some metrics may be placed in the overflow bucket.

// TODO: Add more specific examples

## References

[OTel Metrics Specification - Supplementary Guidelines](https://opentelemetry.io/docs/specs/otel/metrics/supplementary-guidelines/)
