
Memory-efficiency pass on ClientStatsAggregator + adversarial benchmark #11389

Draft

dougqh wants to merge 9 commits into dougqh/control-tag-cardinality from dougqh/metrics-memory-efficiency

Conversation

Contributor

@dougqh dougqh commented May 16, 2026

Summary

Sits on top of #11387. Reduces the per-entry footprint of the metrics aggregator, replaces HashMap-backed cardinality state with flat open-addressed tables, and adds three new JMH benchmarks (miss-path, full-pipeline, adversarial). All consumer-side changes; producer-thread hot path is unchanged or marginally faster.

Memory + design changes

  • Fold AggregateMetric into AggregateEntry — drops one object header + reference field per entry (~16-24 bytes) and one indirection on the consumer's hot recording path (sketched after this list).
  • Peer tags as UTF8BytesString[] instead of List<UTF8BytesString> — saves the ArrayList wrapper allocation per miss and eliminates one iterator allocation per entry per report cycle.
  • Cardinality handlers backed by flat open-addressed tables — no HashMap$Node allocations. At full cardinality saves ~7 KB on the 9 property handlers and ~16 KB per TagCardinalityHandler (cap 512). reset() is two Arrays.fill calls.
  • Lazy error histogram — AggregateEntry.errorLatencies starts null and is allocated on the first error. Entries that never error save ~80 B of empty-histogram wrapper; at the full table cap with sparse errors that is ~150 KB reclaimed. The serializer writes a cached empty-histogram byte payload for entries that never errored, keeping the wire format unchanged.
  • Raise per-field cardinality limits to match real workloads — RESOURCE 32→512, HTTP_ENDPOINT 32→256, OPERATION 64→128, SERVICE 32→128, TYPE 8→32, HTTP_METHOD 8→16, GRPC_STATUS_CODE 32→24. The old limits collapsed even a single-service web app to blocked_by_tracer on day one.
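
A minimal sketch of the resulting entry shape, assuming the names in the bullets above; everything else, including the histogram stand-in, is illustrative and not the real DDSketch-backed type:

```java
// Illustrative sketch only: counters and histograms live directly on the entry
// (no separate AggregateMetric), peer tags are a bare array, and the error
// histogram starts null. LatencyHistogram is a stand-in for the real DDSketch wrapper.
final class AggregateEntrySketch {
  private static final String[] EMPTY_PEER_TAGS = new String[0];

  private final String[] peerTags;          // flat array: no List wrapper, no iterator per report cycle
  private long hits;                        // counters formerly held by AggregateMetric
  private long errors;
  private final LatencyHistogram okLatencies = new LatencyHistogram();
  private LatencyHistogram errorLatencies;  // lazy: allocated on the first error only

  AggregateEntrySketch(String[] peerTags) {
    this.peerTags = (peerTags == null || peerTags.length == 0) ? EMPTY_PEER_TAGS : peerTags;
  }

  void recordOneDuration(long durationNanos, boolean error) {
    hits++;
    if (error) {
      errors++;
      if (errorLatencies == null) {
        errorLatencies = new LatencyHistogram(); // only erroring entries ever pay for this
      }
      errorLatencies.accept(durationNanos);
    } else {
      okLatencies.accept(durationNanos);
    }
  }

  // Minimal stand-in for the DDSketch-backed histogram.
  static final class LatencyHistogram {
    private long count;
    void accept(long durationNanos) { count++; }
  }
}
```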

Benchmarks

  • ClientStatsAggregatorMissPathBenchmark — pool of 4096 single-span traces with unique labels per op. Producer measures 57 ns/op + 96 B/op vs master's 305 ns/op + 399 B/op (5.3× faster, 4.2× less allocation per metrics-eligible span).
  • TracePipelineBenchmark — end-to-end 3-span HTTP-style trace through CoreTracer with metrics aggregator wired in. Multi-threaded (8 producers, throughput mode). Shows the metrics subsystem improvements at the full-pipeline level: under varied labels, branch is +1.8% throughput and -734 B/op vs master (real signal; stable mode is identical, since the metrics-aggregator improvements only manifest on miss).
  • AdversarialMetricsBenchmark — 8 threads × unique labels per op × random durations across 1ns-1s × random error/top-level flags. Bounded-design verification + drop counters at teardown.

Under adversarial load (this is the headline):

| | master | this branch |
| --- | --- | --- |
| Throughput, iterations 1 / 2 / 3 | 1.5M / 1.25M / 0.41M ops/s | ~5.85M ops/s sustained |
| GC time in iteration 3 (15 s window) | 18.6 s (multi-thread GC saturation) | ~150 ms |
| Producer allocation | 1,108 B/op | 823 B/op |
| Aggregate-table drops | n/a | 0 |
| Inbox-full drops | (no counter) | ~139M (76% backpressure) |

Master enters a GC death spiral because MetricKey construction, DDCache lookups, and LRUCache entry insertion all allocate synchronously on every producer thread; this branch's producer/consumer split instead converts overflow into counted drops via onStatsInboxFull.
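
A minimal sketch of that offer-based backpressure, using a plain bounded queue as a stand-in for the real MPSC inbox; onStatsInboxFull is the counter named above, everything else here is illustrative:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.LongAdder;

// Producer side of the split: publish() never blocks and never grows the heap on
// overflow; a full inbox becomes a counted drop instead of synchronous allocation.
final class StatsInboxSketch<T> {
  private final BlockingQueue<T> inbox;              // stand-in for the bounded MPSC inbox
  private final LongAdder inboxFullDrops = new LongAdder();

  StatsInboxSketch(int capacity) {
    this.inbox = new ArrayBlockingQueue<>(capacity);
  }

  void publish(T snapshot) {
    if (!inbox.offer(snapshot)) {   // non-blocking; fails fast when the consumer lags
      onStatsInboxFull();           // overflow is observable, not silent
    }
  }

  private void onStatsInboxFull() {
    inboxFullDrops.increment();
  }

  long droppedAtInbox() {
    return inboxFullDrops.sum();
  }
}
```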

Docs

Updated docs/client_metrics_design.md with a new "Behavior under adversarial load" section, the AggregateMetric→AggregateEntry fold, and a benchmark summary that lists all four producer-side JMH benchmarks.

Test plan

  • :dd-trace-core:test --tests "datadog.trace.common.metrics.*" — all pass (existing + updated cardinality test for the new 128-cap service handler).
  • ClientStatsAggregatorMissPathBenchmark numbers reproduce locally (57 ns/op ± 8, 96 B/op).
  • AdversarialMetricsBenchmark sustains throughput across iterations (no GC death spiral).
  • Hit-path benchmarks within noise of pre-change numbers (no producer-thread regression).
  • Wire format unchanged — SerializingMetricWriterTest and MetricsIntegrationTest pass with the lazy-error-histogram empty payload.

🤖 Generated with Claude Code

dougqh and others added 9 commits May 15, 2026 19:11
The aggregator allocated one AggregateMetric per entry and reached its
counters/histograms through entry.aggregate.X. Folding the state and
methods onto AggregateEntry removes:

  - one object header + alignment (~16-24 bytes) per entry,
  - one reference field on AggregateEntry,
  - one virtual dispatch hop on the consumer hot path
    (entry.recordOneDuration replaces entry.aggregate.recordOneDuration).

Moves ERROR_TAG / TOP_LEVEL_TAG onto AggregateEntry. recordOneDuration,
recordDurations, clearAggregate, and the counter/histogram accessors all
live directly on the entry now. equals/hashCode still depend only on the
immutable label fields (the counters are not part of identity).

Also migrates AggregateMetricTest.groovy to AggregateEntryRecordingTest.java
under JUnit 5, per project conventions.

Benchmark unchanged (within noise) since the fold is purely a consumer-side
+ memory change:
  before: 1.995 us/op
  after:  2.040 us/op (stdev 0.011; prior stdev 0.023)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each entry held a List<UTF8BytesString> for peer tags, which meant:

  - on miss, Canonical.toEntry allocated an ArrayList plus its internal
    Object[] -- two allocations and ~24 bytes of List wrapper overhead
    on top of the array itself (or a SingletonList wrapper at N=1);
  - on every report cycle, SerializingMetricWriter.add did a for-each
    over the list, allocating an ArrayList$Itr per entry.

Switching the field to a flat UTF8BytesString[] eliminates the wrapper
and the iterator allocation. For empty entries, a shared EMPTY_PEER_TAGS
constant is used. Canonical's scratch buffer also moves from
ArrayList<UTF8BytesString> to UTF8BytesString[] + count with on-demand
growth, so populate / matches / hashOf all operate directly on the array
without bounds-check method-call overhead.
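
A rough sketch of that scratch-buffer shape; plain Strings stand in for UTF8BytesString, and the class and method names are illustrative, not the real Canonical code:

```java
import java.util.Arrays;

// Peer tags are collected into a bare array plus a count, grown on demand, so a miss
// can hand the new entry an exact-size copy without any List wrapper or iterator.
final class PeerTagScratchSketch {
  private static final String[] EMPTY = new String[0];

  private String[] buf = new String[8];
  private int count;

  void add(String tag) {
    if (count == buf.length) {
      buf = Arrays.copyOf(buf, buf.length * 2);           // on-demand growth, rare after warmup
    }
    buf[count++] = tag;
  }

  String[] toEntryTags() {
    return count == 0 ? EMPTY : Arrays.copyOf(buf, count); // shared constant when empty
  }

  void reset() {
    Arrays.fill(buf, 0, count, null);
    count = 0;
  }
}
```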

Benchmark unchanged (consumer-side change; producer-bound benchmark):
  before: 2.040 us/op
  after:  2.023 us/op

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing hit-path benchmarks publish the same 64 spans every op, so
after warmup every lookup is a cache hit and the consumer-side
allocations (AggregateEntry construction, peer-tag array, etc.) never
fire. That hides the impact of the entry/peer-tag memory work.

This variant builds a pool of 4096 single-span traces with unique
(service, operation, resource) tuples and publishes one per op,
cycling. Steady state exercises canonicalize + miss + insert + the
blocked_by_tracer fallback once cardinality budgets fill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Property/TagCardinalityHandler each held a pre-sized HashMap<...,
UTF8BytesString>. Every distinct value cost a HashMap$Node (~40 bytes:
header + hash + 3 refs + alignment). For the 9 property handlers (caps
8-64) plus N tag handlers (cap 512 each), that's a lot of Node objects
surviving each reporting interval.

Swap for two parallel arrays (CharSequence/String keys + UTF8BytesString
values) sized to the next power of two >= 2 * cardinalityLimit, with
linear probing. No per-entry object allocation; reset is just two
Arrays.fill calls.

Semantics preserved exactly:
- size-vs-cap check still runs before lookup, so values registered
  earlier in the cycle also collapse to the sentinel once the cap
  is reached;
- reset drops every entry (callers must obtain fresh references after
  reset);
- cacheBlocked survives reset, keeping the sentinel instance stable.

Memory comparison at full cardinality (property handler, cap 64):
  HashMap: 48 (obj) + 528 (table[128]) + 64*40 (Nodes) = 3136 B
  Flat:    40 (obj) + 2*528 (two tables)               = 1096 B
For a tag handler at cap 512:
  HashMap: 4160 + 512*40 = 24640 B
  Flat:    40 + 2*4112    =  8264 B  (saves ~16 KB per tag handler)

Trade-off: empty handlers are slightly larger (flat keeps both arrays
prealloc'd; HashMap only has the bucket table). Crossover is ~13
entries -- typical steady-state for any handler in production.
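
A minimal sketch of the flat table under the assumptions above; plain Strings stand in for the CharSequence keys and UTF8BytesString values, and the class name and sizing expression are illustrative:

```java
import java.util.Arrays;

// Two parallel arrays with linear probing, sized to the next power of two >= 2 * cap.
// No per-entry node allocation; the cap check runs before lookup so previously
// registered values also collapse to the sentinel once the budget is exhausted;
// reset() is just two Arrays.fill calls and the sentinel instance survives it.
final class FlatCardinalityTableSketch {
  private final int cap;
  private final String sentinel;    // stands in for the cached blocked_by_tracer value
  private final String[] keys;
  private final String[] values;
  private final int mask;
  private int size;

  FlatCardinalityTableSketch(int cardinalityLimit, String sentinel) {
    this.cap = cardinalityLimit;
    this.sentinel = sentinel;
    int tableSize = Integer.highestOneBit(Math.max(2, 2 * cardinalityLimit - 1)) << 1;
    this.keys = new String[tableSize];
    this.values = new String[tableSize];
    this.mask = tableSize - 1;
  }

  String getOrCreate(String key) {
    if (size >= cap) {
      return sentinel;              // budget exhausted: collapse everything to the sentinel
    }
    int idx = key.hashCode() & mask;
    while (keys[idx] != null) {
      if (keys[idx].equals(key)) {
        return values[idx];
      }
      idx = (idx + 1) & mask;       // linear probing
    }
    keys[idx] = key;
    values[idx] = key;              // the real handler would canonicalize to UTF8BytesString here
    size++;
    return values[idx];
  }

  void reset() {
    Arrays.fill(keys, null);
    Arrays.fill(values, null);
    size = 0;
  }
}
```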

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each AggregateEntry allocated two DDSketchHistograms in its constructor
(ok + error latencies). DDSketchHistogram wraps a DDSketch + lazy store,
roughly 60-80 bytes per histogram even when empty. Most spans aren't
errors, so most entries' errorLatencies sit empty for life.

Now the field starts null. recordOneDuration / recordDurations
lazy-allocate on the first error; if no error ever lands on the entry,
it stays null and ~80 bytes of empty-histogram overhead are reclaimed.
Across a full 2048-entry table that's ~150 KB if 95% of entries never
error -- the typical case.

For the wire format, SerializingMetricWriter caches the serialized form
of an empty histogram (~17 bytes) on first use and writes those cached
bytes when an entry's errorLatencies is null. The cache is per-writer
(not a global static) so each writer instance picks up the Histograms
factory state at the time of its first report, avoiding races with
test setup that registers the DDSketch factory at varying points.

Trade-off: entries that DO see an error retain the histogram across
clearAggregate (just cleared, not nulled), so always-erroring entries
allocate exactly once. Same total allocation as before for that case.
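
A rough sketch of the serializer side under these assumptions; the long[] parameter and serialize() placeholder are illustrative, not the real DDSketch encoding — the caching pattern is the point:

```java
// An entry whose errorLatencies is still null gets a cached, per-writer serialization
// of an empty histogram, so the wire format is unchanged. The cache is an instance
// field, not a global static, so each writer sees the histogram factory state as of
// its own first report.
final class LazyErrorHistogramWriterSketch {
  private byte[] cachedEmptyHistogramBytes;   // per-writer cache, built on first use

  byte[] errorHistogramBytes(long[] errorLatenciesOrNull) {
    if (errorLatenciesOrNull == null) {
      if (cachedEmptyHistogramBytes == null) {
        cachedEmptyHistogramBytes = serialize(new long[0]);
      }
      return cachedEmptyHistogramBytes;       // entry never errored: reuse the empty payload
    }
    return serialize(errorLatenciesOrNull);
  }

  private byte[] serialize(long[] latencies) {
    return new byte[latencies.length];        // placeholder for the real histogram encoding
  }
}
```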

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous limits (RESOURCE=32, OPERATION=64, TYPE=8, ...) were
inherited from the prior DDCache sizes and were chosen for memory
conservation, not to cover the cardinality typical APM workloads
actually produce. A single REST API with 200 routes exhausted the
RESOURCE budget within seconds and collapsed everything onto the
blocked_by_tracer sentinel.

Recalibrated against realistic distinct-value counts per 10-second
reporting window:

  RESOURCE         32 -> 512  (route + SQL template count for web apps)
  HTTP_ENDPOINT    32 -> 256  (parameterized routes)
  OPERATION        64 -> 128  (one per integration kind)
  SERVICE          32 -> 128  (microservice / peer-service hub)
  TYPE              8 ->  32  (DDSpanTypes has ~30 known values)
  HTTP_METHOD       8 ->  16  (9 standard verbs + custom)
  GRPC_STATUS_CODE 32 ->  24  (17 gRPC codes; tightened)
  SERVICE_SOURCE   16 ->  16  (unchanged)
  SPAN_KIND        16 ->  16  (unchanged)

Memory cost with the flat handler tables: ~3.8 KB -> ~19.6 KB across
all 9 handlers. Negligible relative to the maxAggregates=2048 entry
table.

Benchmark numbers unchanged (the limits don't show up in producer
allocation or the existing hit-path benchmarks):
  ClientStatsAggregatorBenchmark        3.632 us/op (was 3.555)
  ClientStatsAggregatorDDSpanBenchmark  2.027 us/op (was 1.967)
  ClientStatsAggregatorMissPathBenchmark 0.057 us/op, 96 B/op (same)

Also updates cardinalityBlockedValuesCollapseIntoOneEntry to push the
test beyond the new SERVICE cap (150 services -> 128 in-budget + 1
sentinel = 129 entries).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Existing JMH benchmarks cover stages of the trace pipeline in
isolation -- span creation (CoreTracerBenchmark), scope activation
(ScopeLifecycleBenchmark), metrics aggregation
(ClientStatsAggregator*Benchmark), wire serialization
(TraceMapperBenchmark), propagation -- but nothing exercises a full
startSpan -> setTag -> finish -> trace assembly -> metrics publish
trip in one op.

This benchmark fills that gap. It builds a real CoreTracer with a
NoopWriter, reflectively swaps in a real ClientStatsAggregator
(default tracer init leaves it as the no-op until agent discovery
runs), then runs the full single-span lifecycle per @Benchmark op.

Two modes (a rough skeleton follows this list):
  - "stable" -- identical labels each op (consumer-side hit path)
  - "varied" -- unique (service, operation, resource) per op
    (consumer-side miss path until cardinality budgets fill)
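
A rough JMH skeleton of the two-mode setup; the class name, label strings, and the empty span-lifecycle placeholder are illustrative, and only the @Param stable/varied pattern is the point:

```java
import java.util.concurrent.atomic.AtomicLong;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

// "stable" replays identical labels every op (consumer-side hit path);
// "varied" generates a unique tuple per op (miss path until cardinality budgets fill).
@State(Scope.Benchmark)
public class TracePipelineBenchmarkSketch {
  @Param({"stable", "varied"})
  public String mode;

  private final AtomicLong counter = new AtomicLong();

  @Benchmark
  public void tracePipeline(Blackhole bh) {
    long n = "varied".equals(mode) ? counter.getAndIncrement() : 0L;
    String resource = "GET /users/" + n;   // unique per op only in varied mode
    bh.consume(runSpanLifecycle("web-service", "servlet.request", resource));
  }

  private Object runSpanLifecycle(String service, String operation, String resource) {
    // startSpan -> setTag -> finish -> trace assembly -> metrics publish would go here.
    return resource;
  }
}
```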

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The metrics aggregator is bounded by design at every layer (inbox cap,
aggregate-table cap, per-field cardinality cap, fixed-size flat handler
tables, bounded histogram dense stores). This benchmark stresses all of
those simultaneously to verify the bounds hold and to measure throughput
+ allocation under attack:

  - 8 producer threads hammering publish() concurrently.
  - Unique (service, operation, resource, peer.hostname) per op so
    handlers saturate to blocked_by_tracer instantly and the aggregate
    table fills+evicts continuously.
  - Random durations across a 1ns..1s range so histograms accept many
    distinct bins.
  - Random error / topLevel flags so both histograms get exercised.

Per-fork tearDown prints the CountingHealthMetrics drop counters so we
can see how the subsystem absorbed the burst: snapshots dropped at the
inbox (backpressure when consumer can't keep up) versus snapshots
dropped at the table (cap reached with no stale entry). The expected
shape under attack is "lots of inbox drops, zero table drops" --
producer-driven backpressure stays ahead of table saturation.
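
A sketch of that per-fork teardown, assuming illustrative counter fields in place of the real CountingHealthMetrics accessors:

```java
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;

// After each fork, report where snapshots were dropped. The expected shape under
// attack is many inbox-full drops and zero aggregate-table drops.
@State(Scope.Benchmark)
public class AdversarialTeardownSketch {
  long droppedAtInbox;   // backpressure: consumer could not keep up
  long droppedAtTable;   // cap reached with no stale entry to evict (expected: 0)

  @TearDown(Level.Trial)
  public void reportDrops() {
    System.out.printf("inbox-full drops=%d, aggregate-table drops=%d%n",
        droppedAtInbox, droppedAtTable);
  }
}
```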

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new "Behavior under adversarial load" section comparing the new
ClientStatsAggregator producer/consumer split against master's
ConflatingMetricsAggregator under high-cardinality + random-duration
attack. Captures the measured numbers from AdversarialMetricsBenchmark:

  - master degrades 73% by iteration 3 (1.5M -> 410K ops/s) with the
    GC threads consuming >100% of wall-clock budget;
  - this design sustains ~5.85M ops/s steady-state with ~10% GC time;
  - 139M snapshots dropped via the inbox-full backpressure counter,
    zero dropped at the AggregateTable cap;
  - worst-case heap ceiling of ~32 MB regardless of attack rate.

Explains why master degrades (no producer/consumer split -> all
canonicalization allocation on the calling thread -> GC death spiral)
and why this design holds (offer-based backpressure at the MPSC inbox
boundary turns overflow into a counted drop rather than heap growth).

Also updates two stale references in the doc:
  - AggregateMetric -> "AggregateEntry's per-bucket counters" (fold);
  - benchmark summary now lists all four JMH benchmarks and includes
    the miss-path numbers (5.3x faster, 4.2x less producer allocation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dougqh dougqh added the type: enhancement, tag: performance, tag: no release notes, comp: metrics, and tag: ai generated labels on May 16, 2026