
Memory-efficiency pass on ClientStatsAggregator + adversarial benchmark #11389

Draft

dougqh wants to merge 9 commits into dougqh/control-tag-cardinality from dougqh/metrics-memory-efficiency

Conversation

Contributor

@dougqh dougqh commented May 16, 2026

Summary

Sits on top of #11387. Reduces the per-entry footprint of the metrics aggregator, replaces HashMap-backed cardinality state with flat open-addressed tables, and adds three new JMH benchmarks (miss-path, full-pipeline, adversarial). All consumer-side changes; producer-thread hot path is unchanged or marginally faster.

Memory + design changes

  • Fold AggregateMetric into AggregateEntry — drops one object header + reference field per entry (~16-24 bytes) and one indirection on the consumer's hot recording path (sketched after this list).
  • Peer tags as UTF8BytesString[] instead of List<UTF8BytesString> — saves the ArrayList wrapper allocation per miss and eliminates one iterator allocation per entry per report cycle.
  • Cardinality handlers backed by flat open-addressed tables — no HashMap$Node allocations. At full cardinality saves ~7 KB on the 9 property handlers and ~16 KB per TagCardinalityHandler (cap 512). reset() is two Arrays.fill calls.
  • Lazy error histogram — AggregateEntry.errorLatencies starts null and is allocated on the first error. Entries that never error save ~80 B of empty-histogram wrapper; at the full table cap with sparse errors that is ~150 KB reclaimed. The serializer writes a cached empty-histogram byte payload for entries that never errored, keeping the wire format unchanged.
  • Raise per-field cardinality limits to match real workloads — RESOURCE 32→512, HTTP_ENDPOINT 32→256, OPERATION 64→128, SERVICE 32→128, TYPE 8→32, HTTP_METHOD 8→16, GRPC_STATUS_CODE 32→24. The old limits collapsed even a single-service web app to blocked_by_tracer on day one.
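
A minimal sketch of the resulting entry shape, assuming the names in the bullets above; everything else, including the histogram stand-in, is illustrative and not the real DDSketch-backed type:

```java
// Illustrative sketch only: counters and histograms live directly on the entry
// (no separate AggregateMetric), peer tags are a bare array, and the error
// histogram starts null. LatencyHistogram is a stand-in for the real DDSketch wrapper.
final class AggregateEntrySketch {
  private static final String[] EMPTY_PEER_TAGS = new String[0];

  private final String[] peerTags;          // flat array: no List wrapper, no iterator per report cycle
  private long hits;                        // counters formerly held by AggregateMetric
  private long errors;
  private final LatencyHistogram okLatencies = new LatencyHistogram();
  private LatencyHistogram errorLatencies;  // lazy: allocated on the first error only

  AggregateEntrySketch(String[] peerTags) {
    this.peerTags = (peerTags == null || peerTags.length == 0) ? EMPTY_PEER_TAGS : peerTags;
  }

  void recordOneDuration(long durationNanos, boolean error) {
    hits++;
    if (error) {
      errors++;
      if (errorLatencies == null) {
        errorLatencies = new LatencyHistogram(); // only erroring entries ever pay for this
      }
      errorLatencies.accept(durationNanos);
    } else {
      okLatencies.accept(durationNanos);
    }
  }

  // Minimal stand-in for the DDSketch-backed histogram.
  static final class LatencyHistogram {
    private long count;
    void accept(long durationNanos) { count++; }
  }
}
```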

Benchmarks

  • ClientStatsAggregatorMissPathBenchmark — pool of 4096 single-span traces with unique labels per op. Producer measures 57 ns/op + 96 B/op vs master's 305 ns/op + 399 B/op (5.3× faster, 4.2× less allocation per metrics-eligible span).
  • TracePipelineBenchmark — end-to-end 3-span HTTP-style trace through CoreTracer with metrics aggregator wired in. Multi-threaded (8 producers, throughput mode). Shows the metrics subsystem improvements at the full-pipeline level: under varied labels, branch is +1.8% throughput and -734 B/op vs master (real signal; stable mode is identical, since the metrics-aggregator improvements only manifest on miss).
  • AdversarialMetricsBenchmark — 8 threads × unique labels per op × random durations across 1ns-1s × random error/top-level flags. Bounded-design verification + drop counters at teardown.

Under adversarial load (this is the headline):

| | master | this branch |
| --- | --- | --- |
| Throughput, iterations 1 / 2 / 3 | 1.5M / 1.25M / 0.41M ops/s | ~5.85M ops/s sustained |
| GC time in iteration 3 (15 s window) | 18.6 s (multi-thread GC saturation) | ~150 ms |
| Producer allocation | 1,108 B/op | 823 B/op |
| Aggregate-table drops | n/a | 0 |
| Inbox-full drops | (no counter) | ~139M (76% backpressure) |

Master enters a GC death spiral because MetricKey construction, DDCache lookups, and LRUCache entry insertion all allocate synchronously on every producer thread; this branch's producer/consumer split instead converts overflow into counted drops via onStatsInboxFull.
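
A minimal sketch of that offer-based backpressure, using a plain bounded queue as a stand-in for the real MPSC inbox; onStatsInboxFull is the counter named above, everything else here is illustrative:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.LongAdder;

// Producer side of the split: publish() never blocks and never grows the heap on
// overflow; a full inbox becomes a counted drop instead of synchronous allocation.
final class StatsInboxSketch<T> {
  private final BlockingQueue<T> inbox;              // stand-in for the bounded MPSC inbox
  private final LongAdder inboxFullDrops = new LongAdder();

  StatsInboxSketch(int capacity) {
    this.inbox = new ArrayBlockingQueue<>(capacity);
  }

  void publish(T snapshot) {
    if (!inbox.offer(snapshot)) {   // non-blocking; fails fast when the consumer lags
      onStatsInboxFull();           // overflow is observable, not silent
    }
  }

  private void onStatsInboxFull() {
    inboxFullDrops.increment();
  }

  long droppedAtInbox() {
    return inboxFullDrops.sum();
  }
}
```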

Docs

Updated docs/client_metrics_design.md with a new "Behavior under adversarial load" section, the AggregateMetric→AggregateEntry fold, and a benchmark summary that lists all four producer-side JMH benchmarks.

Test plan

  • :dd-trace-core:test --tests "datadog.trace.common.metrics.*" — all pass (existing + updated cardinality test for the new 128-cap service handler).
  • ClientStatsAggregatorMissPathBenchmark numbers reproduce locally (57 ns/op ± 8, 96 B/op).
  • AdversarialMetricsBenchmark sustains throughput across iterations (no GC death spiral).
  • Hit-path benchmarks within noise of pre-change numbers (no producer-thread regression).
  • Wire format unchanged — SerializingMetricWriterTest and MetricsIntegrationTest pass with the lazy-error-histogram empty payload.

🤖 Generated with Claude Code

dougqh and others added 9 commits May 15, 2026 19:11
The aggregator allocated one AggregateMetric per entry and reached its
counters/histograms through entry.aggregate.X. Folding the state and
methods onto AggregateEntry removes:

  - one object header + alignment (~16-24 bytes) per entry,
  - one reference field on AggregateEntry,
  - one virtual dispatch hop on the consumer hot path
    (entry.recordOneDuration replaces entry.aggregate.recordOneDuration).

Moves ERROR_TAG / TOP_LEVEL_TAG onto AggregateEntry. recordOneDuration,
recordDurations, clearAggregate, and the counter/histogram accessors all
live directly on the entry now. equals/hashCode still depend only on the
immutable label fields (the counters are not part of identity).

Also migrates AggregateMetricTest.groovy to AggregateEntryRecordingTest.java
under JUnit 5, per project conventions.

Benchmark unchanged (within noise) since the fold is purely a consumer-side
+ memory change:
  before: 1.995 us/op
  after:  2.040 us/op (stdev 0.011; prior stdev 0.023)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each entry held a List<UTF8BytesString> for peer tags, which meant:

  - on miss, Canonical.toEntry allocated an ArrayList plus its internal
    Object[] -- two allocations and ~24 bytes of List wrapper overhead
    on top of the array itself (or a SingletonList wrapper at N=1);
  - on every report cycle, SerializingMetricWriter.add did a for-each
    over the list, allocating an ArrayList$Itr per entry.

Switching the field to a flat UTF8BytesString[] eliminates the wrapper
and the iterator allocation. For empty entries, a shared EMPTY_PEER_TAGS
constant is used. Canonical's scratch buffer also moves from
ArrayList<UTF8BytesString> to UTF8BytesString[] + count with on-demand
growth, so populate / matches / hashOf all operate directly on the array
without bounds-check method-call overhead.
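
A rough sketch of that scratch-buffer shape; plain Strings stand in for UTF8BytesString, and the class and method names are illustrative, not the real Canonical code:

```java
import java.util.Arrays;

// Peer tags are collected into a bare array plus a count, grown on demand, so a miss
// can hand the new entry an exact-size copy without any List wrapper or iterator.
final class PeerTagScratchSketch {
  private static final String[] EMPTY = new String[0];

  private String[] buf = new String[8];
  private int count;

  void add(String tag) {
    if (count == buf.length) {
      buf = Arrays.copyOf(buf, buf.length * 2);           // on-demand growth, rare after warmup
    }
    buf[count++] = tag;
  }

  String[] toEntryTags() {
    return count == 0 ? EMPTY : Arrays.copyOf(buf, count); // shared constant when empty
  }

  void reset() {
    Arrays.fill(buf, 0, count, null);
    count = 0;
  }
}
```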

Benchmark unchanged (consumer-side change; producer-bound benchmark):
  before: 2.040 us/op
  after:  2.023 us/op

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing hit-path benchmarks publish the same 64 spans every op, so
after warmup every lookup is a cache hit and the consumer-side
allocations (AggregateEntry construction, peer-tag array, etc.) never
fire. That hides the impact of the entry/peer-tag memory work.

This variant builds a pool of 4096 single-span traces with unique
(service, operation, resource) tuples and publishes one per op,
cycling. Steady state exercises canonicalize + miss + insert + the
blocked_by_tracer fallback once cardinality budgets fill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Property/TagCardinalityHandler each held a pre-sized HashMap<...,
UTF8BytesString>. Every distinct value cost a HashMap$Node (~40 bytes:
header + hash + 3 refs + alignment). For the 9 property handlers (caps
8-64) plus N tag handlers (cap 512 each), that's a lot of Node objects
surviving each reporting interval.

Swap for two parallel arrays (CharSequence/String keys + UTF8BytesString
values) sized to the next power of two >= 2 * cardinalityLimit, with
linear probing. No per-entry object allocation; reset is just two
Arrays.fill calls.

Semantics preserved exactly:
- size-vs-cap check still runs before lookup, so values registered
  earlier in the cycle also collapse to the sentinel once the cap
  is reached;
- reset drops every entry (callers must obtain fresh references after
  reset);
- cacheBlocked survives reset, keeping the sentinel instance stable.

Memory comparison at full cardinality (property handler, cap 64):
  HashMap: 48 (obj) + 528 (table[128]) + 64*40 (Nodes) = 3136 B
  Flat:    40 (obj) + 2*528 (two tables)               = 1096 B
For a tag handler at cap 512:
  HashMap: 4160 + 512*40 = 24640 B
  Flat:    40 + 2*4112    =  8264 B  (saves ~16 KB per tag handler)

Trade-off: empty handlers are slightly larger (flat keeps both arrays
prealloc'd; HashMap only has the bucket table). Crossover is ~13
entries -- typical steady-state for any handler in production.
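
A minimal sketch of the flat table under the assumptions above; plain Strings stand in for the CharSequence keys and UTF8BytesString values, and the class name and sizing expression are illustrative:

```java
import java.util.Arrays;

// Two parallel arrays with linear probing, sized to the next power of two >= 2 * cap.
// No per-entry node allocation; the cap check runs before lookup so previously
// registered values also collapse to the sentinel once the budget is exhausted;
// reset() is just two Arrays.fill calls and the sentinel instance survives it.
final class FlatCardinalityTableSketch {
  private final int cap;
  private final String sentinel;    // stands in for the cached blocked_by_tracer value
  private final String[] keys;
  private final String[] values;
  private final int mask;
  private int size;

  FlatCardinalityTableSketch(int cardinalityLimit, String sentinel) {
    this.cap = cardinalityLimit;
    this.sentinel = sentinel;
    int tableSize = Integer.highestOneBit(Math.max(2, 2 * cardinalityLimit - 1)) << 1;
    this.keys = new String[tableSize];
    this.values = new String[tableSize];
    this.mask = tableSize - 1;
  }

  String getOrCreate(String key) {
    if (size >= cap) {
      return sentinel;              // budget exhausted: collapse everything to the sentinel
    }
    int idx = key.hashCode() & mask;
    while (keys[idx] != null) {
      if (keys[idx].equals(key)) {
        return values[idx];
      }
      idx = (idx + 1) & mask;       // linear probing
    }
    keys[idx] = key;
    values[idx] = key;              // the real handler would canonicalize to UTF8BytesString here
    size++;
    return values[idx];
  }

  void reset() {
    Arrays.fill(keys, null);
    Arrays.fill(values, null);
    size = 0;
  }
}
```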

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each AggregateEntry allocated two DDSketchHistograms in its constructor
(ok + error latencies). DDSketchHistogram wraps a DDSketch + lazy store,
roughly 60-80 bytes per histogram even when empty. Most spans aren't
errors, so most entries' errorLatencies sit empty for life.

Now the field starts null. recordOneDuration / recordDurations
lazy-allocate on the first error; if no error ever lands on the entry,
it stays null and ~80 bytes of empty-histogram overhead are reclaimed.
Across a full 2048-entry table that's ~150 KB if 95% of entries never
error -- the typical case.

For the wire format, SerializingMetricWriter caches the serialized form
of an empty histogram (~17 bytes) on first use and writes those cached
bytes when an entry's errorLatencies is null. The cache is per-writer
(not a global static) so each writer instance picks up the Histograms
factory state at the time of its first report, avoiding races with
test setup that registers the DDSketch factory at varying points.

Trade-off: entries that DO see an error retain the histogram across
clearAggregate (just cleared, not nulled), so always-erroring entries
allocate exactly once. Same total allocation as before for that case.
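
A rough sketch of the serializer side under these assumptions; the long[] parameter and serialize() placeholder are illustrative, not the real DDSketch encoding — the caching pattern is the point:

```java
// An entry whose errorLatencies is still null gets a cached, per-writer serialization
// of an empty histogram, so the wire format is unchanged. The cache is an instance
// field, not a global static, so each writer sees the histogram factory state as of
// its own first report.
final class LazyErrorHistogramWriterSketch {
  private byte[] cachedEmptyHistogramBytes;   // per-writer cache, built on first use

  byte[] errorHistogramBytes(long[] errorLatenciesOrNull) {
    if (errorLatenciesOrNull == null) {
      if (cachedEmptyHistogramBytes == null) {
        cachedEmptyHistogramBytes = serialize(new long[0]);
      }
      return cachedEmptyHistogramBytes;       // entry never errored: reuse the empty payload
    }
    return serialize(errorLatenciesOrNull);
  }

  private byte[] serialize(long[] latencies) {
    return new byte[latencies.length];        // placeholder for the real histogram encoding
  }
}
```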

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous limits (RESOURCE=32, OPERATION=64, TYPE=8, ...) were
inherited from the prior DDCache sizes and were chosen for memory
conservation, not to cover the cardinality typical APM workloads
actually produce. A single REST API with 200 routes exhausted the
RESOURCE budget within seconds and collapsed everything onto the
blocked_by_tracer sentinel.

Recalibrated against realistic distinct-value counts per 10-second
reporting window:

  RESOURCE         32 -> 512  (route + SQL template count for web apps)
  HTTP_ENDPOINT    32 -> 256  (parameterized routes)
  OPERATION        64 -> 128  (one per integration kind)
  SERVICE          32 -> 128  (microservice / peer-service hub)
  TYPE              8 ->  32  (DDSpanTypes has ~30 known values)
  HTTP_METHOD       8 ->  16  (9 standard verbs + custom)
  GRPC_STATUS_CODE 32 ->  24  (17 gRPC codes; tightened)
  SERVICE_SOURCE   16 ->  16  (unchanged)
  SPAN_KIND        16 ->  16  (unchanged)

Memory cost with the flat handler tables: ~3.8 KB -> ~19.6 KB across
all 9 handlers. Negligible relative to the maxAggregates=2048 entry
table.

Benchmark numbers unchanged (the limits don't show up in producer
allocation or the existing hit-path benchmarks):
  ClientStatsAggregatorBenchmark        3.632 us/op (was 3.555)
  ClientStatsAggregatorDDSpanBenchmark  2.027 us/op (was 1.967)
  ClientStatsAggregatorMissPathBenchmark 0.057 us/op, 96 B/op (same)

Also updates cardinalityBlockedValuesCollapseIntoOneEntry to push the
test beyond the new SERVICE cap (150 services -> 128 in-budget + 1
sentinel = 129 entries).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Existing JMH benchmarks cover stages of the trace pipeline in
isolation -- span creation (CoreTracerBenchmark), scope activation
(ScopeLifecycleBenchmark), metrics aggregation
(ClientStatsAggregator*Benchmark), wire serialization
(TraceMapperBenchmark), propagation -- but nothing exercises a full
startSpan -> setTag -> finish -> trace assembly -> metrics publish
trip in one op.

This benchmark fills that gap. It builds a real CoreTracer with a
NoopWriter, reflectively swaps in a real ClientStatsAggregator
(default tracer init leaves it as the no-op until agent discovery
runs), then runs the full single-span lifecycle per @Benchmark op.

Two modes (a rough skeleton follows this list):
  - "stable" -- identical labels each op (consumer-side hit path)
  - "varied" -- unique (service, operation, resource) per op
    (consumer-side miss path until cardinality budgets fill)
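
A rough JMH skeleton of the two-mode setup; the class name, label strings, and the empty span-lifecycle placeholder are illustrative, and only the @Param stable/varied pattern is the point:

```java
import java.util.concurrent.atomic.AtomicLong;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

// "stable" replays identical labels every op (consumer-side hit path);
// "varied" generates a unique tuple per op (miss path until cardinality budgets fill).
@State(Scope.Benchmark)
public class TracePipelineBenchmarkSketch {
  @Param({"stable", "varied"})
  public String mode;

  private final AtomicLong counter = new AtomicLong();

  @Benchmark
  public void tracePipeline(Blackhole bh) {
    long n = "varied".equals(mode) ? counter.getAndIncrement() : 0L;
    String resource = "GET /users/" + n;   // unique per op only in varied mode
    bh.consume(runSpanLifecycle("web-service", "servlet.request", resource));
  }

  private Object runSpanLifecycle(String service, String operation, String resource) {
    // startSpan -> setTag -> finish -> trace assembly -> metrics publish would go here.
    return resource;
  }
}
```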

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The metrics aggregator is bounded by design at every layer (inbox cap,
aggregate-table cap, per-field cardinality cap, fixed-size flat handler
tables, bounded histogram dense stores). This benchmark stresses all of
those simultaneously to verify the bounds hold and to measure throughput
+ allocation under attack:

  - 8 producer threads hammering publish() concurrently.
  - Unique (service, operation, resource, peer.hostname) per op so
    handlers saturate to blocked_by_tracer instantly and the aggregate
    table fills+evicts continuously.
  - Random durations across a 1ns..1s range so histograms accept many
    distinct bins.
  - Random error / topLevel flags so both histograms get exercised.

Per-fork tearDown prints the CountingHealthMetrics drop counters so we
can see how the subsystem absorbed the burst: snapshots dropped at the
inbox (backpressure when consumer can't keep up) versus snapshots
dropped at the table (cap reached with no stale entry). The expected
shape under attack is "lots of inbox drops, zero table drops" --
producer-driven backpressure stays ahead of table saturation.
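
A sketch of that per-fork teardown, assuming illustrative counter fields in place of the real CountingHealthMetrics accessors:

```java
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;

// After each fork, report where snapshots were dropped. The expected shape under
// attack is many inbox-full drops and zero aggregate-table drops.
@State(Scope.Benchmark)
public class AdversarialTeardownSketch {
  long droppedAtInbox;   // backpressure: consumer could not keep up
  long droppedAtTable;   // cap reached with no stale entry to evict (expected: 0)

  @TearDown(Level.Trial)
  public void reportDrops() {
    System.out.printf("inbox-full drops=%d, aggregate-table drops=%d%n",
        droppedAtInbox, droppedAtTable);
  }
}
```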

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new "Behavior under adversarial load" section comparing the new
ClientStatsAggregator producer/consumer split against master's
ConflatingMetricsAggregator under high-cardinality + random-duration
attack. Captures the measured numbers from AdversarialMetricsBenchmark:

  - master degrades 73% by iteration 3 (1.5M -> 410K ops/s) with the
    GC threads consuming >100% of wall-clock budget;
  - this design sustains ~5.85M ops/s steady-state with ~10% GC time;
  - 139M snapshots dropped via the inbox-full backpressure counter,
    zero dropped at the AggregateTable cap;
  - worst-case heap ceiling of ~32 MB regardless of attack rate.

Explains why master degrades (no producer/consumer split -> all
canonicalization allocation on the calling thread -> GC death spiral)
and why this design holds (offer-based backpressure at the MPSC inbox
boundary turns overflow into a counted drop rather than heap growth).

Also updates two stale references in the doc:
  - AggregateMetric -> "AggregateEntry's per-bucket counters" (fold);
  - benchmark summary now lists all four JMH benchmarks and includes
    the miss-path numbers (5.3x faster, 4.2x less producer allocation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dougqh dougqh added the type: enhancement, tag: performance, tag: no release notes, comp: metrics, and tag: ai generated labels on May 16, 2026