Memory-efficiency pass on ClientStatsAggregator + adversarial benchmark#11389
Draft
dougqh wants to merge 9 commits into
Conversation
The aggregator allocated one AggregateMetric per entry and reached its
counters/histograms through entry.aggregate.X. Folding the state and
methods onto AggregateEntry removes:
- one object header + alignment (~16-24 bytes) per entry,
- one reference field on AggregateEntry,
- one virtual dispatch hop on the consumer hot path
(entry.recordOneDuration replaces entry.aggregate.recordOneDuration).
Moves ERROR_TAG / TOP_LEVEL_TAG onto AggregateEntry. recordOneDuration,
recordDurations, clearAggregate, and the counter/histogram accessors all
live directly on the entry now. equals/hashCode still depend only on the
immutable label fields (the counters are not part of identity).
Also migrates AggregateMetricTest.groovy to AggregateEntryRecordingTest.java
under JUnit 5, per project conventions.
Benchmark unchanged (within noise) since the fold is purely a consumer-side
+ memory change:
before: 1.995 us/op
after: 2.040 us/op (stdev 0.011; prior stdev 0.023)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
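The fold can be sketched roughly like this. Names and the label set are illustrative only (the real AggregateEntry carries histograms, more counters, and more label fields); the point is that the mutable aggregate state lives directly on the entry while equals/hashCode depend only on the immutable labels:

```java
// Hypothetical sketch of the fold; not the actual tracer class.
final class AggregateEntry {
  // immutable identity labels
  final String service;
  final String operation;

  // mutable aggregate state, formerly on a separate AggregateMetric object
  private long hitCount;
  private long errorCount;
  private long durationNanos;

  AggregateEntry(String service, String operation) {
    this.service = service;
    this.operation = operation;
  }

  // consumer hot path: one virtual call, no entry.aggregate indirection
  void recordOneDuration(long nanos, boolean error) {
    hitCount++;
    durationNanos += nanos;
    if (error) {
      errorCount++;
    }
  }

  void clearAggregate() {
    hitCount = 0;
    errorCount = 0;
    durationNanos = 0;
  }

  long hitCount() {
    return hitCount;
  }

  // identity depends only on the immutable labels, never the counters
  @Override
  public boolean equals(Object o) {
    if (!(o instanceof AggregateEntry)) return false;
    AggregateEntry e = (AggregateEntry) o;
    return service.equals(e.service) && operation.equals(e.operation);
  }

  @Override
  public int hashCode() {
    return 31 * service.hashCode() + operation.hashCode();
  }
}
```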
Each entry held a List<UTF8BytesString> for peer tags, which meant:
- on miss, Canonical.toEntry allocated an ArrayList plus its internal
Object[] -- two allocations and ~24 bytes of List wrapper overhead
on top of the array itself (or a SingletonList wrapper at N=1);
- on every report cycle, SerializingMetricWriter.add did a for-each
over the list, allocating an ArrayList$Itr per entry.
Switching the field to a flat UTF8BytesString[] eliminates the wrapper
and the iterator allocation. For empty entries, a shared EMPTY_PEER_TAGS
constant is used. Canonical's scratch buffer also moves from
ArrayList<UTF8BytesString> to UTF8BytesString[] + count with on-demand
growth, so populate / matches / hashOf all operate directly on the array
without bounds-check method-call overhead.
Benchmark unchanged (consumer-side change; producer-bound benchmark):
before: 2.040 us/op
after: 2.023 us/op
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
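The shape of the change can be sketched as below. This is an illustrative stand-in, not the real Canonical/entry pair: String replaces UTF8BytesString, and the class/method names are invented for the sketch. It shows the three pieces described above: a shared empty constant, a flat scratch array with on-demand growth, and index-based iteration that avoids an Iterator allocation on the write path.

```java
import java.util.Arrays;

// Hypothetical sketch of the flat peer-tag storage.
final class PeerTags {
  // shared constant for the common zero-tag case
  static final String[] EMPTY_PEER_TAGS = new String[0];

  // scratch buffer with on-demand growth, replacing an ArrayList
  private String[] scratch = new String[4];
  private int count;

  void add(String tag) {
    if (count == scratch.length) {
      scratch = Arrays.copyOf(scratch, scratch.length * 2);
    }
    scratch[count++] = tag;
  }

  // snapshot for the entry: a right-sized flat array, or the shared constant
  String[] toEntryTags() {
    return count == 0 ? EMPTY_PEER_TAGS : Arrays.copyOf(scratch, count);
  }

  // index-based loop on the report path: no ArrayList$Itr per entry
  static int totalLength(String[] tags) {
    int len = 0;
    for (int i = 0; i < tags.length; i++) {
      len += tags[i].length();
    }
    return len;
  }
}
```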
The existing hit-path benchmarks publish the same 64 spans every op, so after warmup every lookup is a cache hit and the consumer-side allocations (AggregateEntry construction, peer-tag array, etc.) never fire. That hides the impact of the entry/peer-tag memory work.

This variant builds a pool of 4096 single-span traces with unique (service, operation, resource) tuples and publishes one per op, cycling. Steady state exercises canonicalize + miss + insert + the blocked_by_tracer fallback once cardinality budgets fill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
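The workload shape can be sketched as follows (illustrative class and label names, not the actual benchmark code): a pre-built pool of unique label tuples, one consumed per op, cycling so every lookup in steady state is a miss until cardinality budgets fill.

```java
// Hypothetical sketch of the miss-path workload driver.
final class MissPathWorkload {
  private final String[][] pool;
  private int cursor;

  MissPathWorkload(int poolSize) {
    pool = new String[poolSize][];
    for (int i = 0; i < poolSize; i++) {
      // unique (service, operation, resource) per trace in the pool
      pool[i] = new String[] {"service-" + i, "operation-" + i, "resource-" + i};
    }
  }

  // one trace per benchmark op, cycling through the pool
  String[] nextOp() {
    String[] labels = pool[cursor];
    cursor = (cursor + 1) % pool.length;
    return labels;
  }
}
```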
Property/TagCardinalityHandler each held a pre-sized HashMap<..., UTF8BytesString>. Every distinct value cost a HashMap$Node (~40 bytes: header + hash + 3 refs + alignment). For the 9 property handlers (caps 8-64) plus N tag handlers (cap 512 each), that's a lot of Node objects surviving each reporting interval.

Swap for two parallel arrays (CharSequence/String keys + UTF8BytesString values) sized to the next power of two >= 2 * cardinalityLimit, with linear probing. No per-entry object allocation; reset is just two Arrays.fill calls.

Semantics preserved exactly:
- size-vs-cap check still runs before lookup, so values registered earlier
  in the cycle also collapse to the sentinel once the cap is reached;
- reset drops every entry (callers must obtain fresh references after reset);
- cacheBlocked survives reset, keeping the sentinel instance stable.

Memory comparison at full cardinality (property handler, cap 64):
  HashMap: 48 (obj) + 528 (table[128]) + 64*40 (Nodes) = 3136 B
  Flat:    40 (obj) + 2*528 (two tables)               = 1096 B
For a tag handler at cap 512:
  HashMap: 4160 + 512*40 = 24640 B
  Flat:    40 + 2*4112   =  8264 B  (saves ~16 KB per tag handler)

Trade-off: empty handlers are slightly larger (flat keeps both arrays prealloc'd; HashMap only has the bucket table). Crossover is ~13 entries -- typical steady-state for any handler in production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
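A minimal sketch of the flat table, under simplifying assumptions (String keys and values instead of CharSequence/UTF8BytesString, no cacheBlocked field, invented class name): parallel arrays sized to the next power of two >= 2 * cardinalityLimit, linear probing, the cap check before lookup so everything collapses to the sentinel once the cap is hit, and reset as Arrays.fill.

```java
import java.util.Arrays;

// Hypothetical sketch of the open-addressed cardinality table.
final class FlatCardinalityTable {
  private static final String BLOCKED = "blocked_by_tracer";

  private final String[] keys;   // parallel key table
  private final String[] values; // parallel value table
  private final int mask;        // tableSize - 1 (power-of-two sizing)
  private final int cap;         // cardinality limit
  private int size;

  FlatCardinalityTable(int cardinalityLimit) {
    int tableSize = 1;
    while (tableSize < 2 * cardinalityLimit) {
      tableSize <<= 1; // next power of two >= 2 * cardinalityLimit
    }
    keys = new String[tableSize];
    values = new String[tableSize];
    mask = tableSize - 1;
    cap = cardinalityLimit;
  }

  String intern(String key) {
    // cap check before lookup: once full, even previously registered
    // values collapse onto the sentinel for the rest of the cycle
    if (size >= cap) {
      return BLOCKED;
    }
    int i = key.hashCode() & mask;
    while (keys[i] != null) {
      if (keys[i].equals(key)) {
        return values[i]; // hit, no allocation
      }
      i = (i + 1) & mask; // linear probe
    }
    keys[i] = key;
    values[i] = key;
    size++;
    return key;
  }

  // reset is just two Arrays.fill calls; no Node graph to drop
  void reset() {
    Arrays.fill(keys, null);
    Arrays.fill(values, null);
    size = 0;
  }
}
```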
Each AggregateEntry allocated two DDSketchHistograms in its constructor (ok + error latencies). DDSketchHistogram wraps a DDSketch + lazy store, roughly 60-80 bytes per histogram even when empty. Most spans aren't errors, so most entries' errorLatencies sit empty for life.

Now the field starts null. recordOneDuration / recordDurations lazy-allocate on the first error; if no error ever lands on the entry, it stays null and ~80 bytes of empty-histogram overhead are reclaimed. Across a full 2048-entry table that's ~150 KB if 95% of entries never error -- the typical case.

For the wire format, SerializingMetricWriter caches the serialized form of an empty histogram (~17 bytes) on first use and writes those cached bytes when an entry's errorLatencies is null. The cache is per-writer (not a global static) so each writer instance picks up the Histograms factory state at the time of its first report, avoiding races with test setup that registers the DDSketch factory at varying points.

Trade-off: entries that DO see an error retain the histogram across clearAggregate (just cleared, not nulled), so always-erroring entries allocate exactly once. Same total allocation as before for that case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
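The lazy-allocation pattern can be sketched like this. The Histogram type here is a trivial counting stand-in for DDSketchHistogram, and the class name is invented; the sketch only demonstrates the null-until-first-error field and the clear-not-null behavior on reset.

```java
// Hypothetical sketch of the lazy error-histogram field.
final class LazyErrorEntry {
  // trivial stand-in for DDSketchHistogram
  static final class Histogram {
    long count;
    void accept(long nanos) { count++; }
    void clear() { count = 0; }
  }

  private final Histogram okLatencies = new Histogram();
  private Histogram errorLatencies; // starts null; most entries never error

  void recordOneDuration(long nanos, boolean error) {
    if (error) {
      if (errorLatencies == null) {
        errorLatencies = new Histogram(); // allocated on the first error only
      }
      errorLatencies.accept(nanos);
    } else {
      okLatencies.accept(nanos);
    }
  }

  // retained (cleared, not nulled), so always-erroring entries allocate once
  void clearAggregate() {
    okLatencies.clear();
    if (errorLatencies != null) {
      errorLatencies.clear();
    }
  }

  // the serializer would write a cached empty-histogram payload when false
  boolean hasErrorHistogram() {
    return errorLatencies != null;
  }
}
```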
The previous limits (RESOURCE=32, OPERATION=64, TYPE=8, ...) were inherited from the prior DDCache sizes and were chosen for memory conservation, not to cover the cardinality typical APM workloads actually produce. A single REST API with 200 routes exhausted the RESOURCE budget within seconds and collapsed everything onto the blocked_by_tracer sentinel.

Recalibrated against realistic distinct-value counts per 10-second reporting window:

  RESOURCE          32 -> 512  (route + SQL template count for web apps)
  HTTP_ENDPOINT     32 -> 256  (parameterized routes)
  OPERATION         64 -> 128  (one per integration kind)
  SERVICE           32 -> 128  (microservice / peer-service hub)
  TYPE               8 ->  32  (DDSpanTypes has ~30 known values)
  HTTP_METHOD        8 ->  16  (9 standard verbs + custom)
  GRPC_STATUS_CODE  32 ->  24  (17 gRPC codes; tightened)
  SERVICE_SOURCE    16 ->  16  (unchanged)
  SPAN_KIND         16 ->  16  (unchanged)

Memory cost with the flat handler tables: ~3.8 KB -> ~19.6 KB across all 9 handlers. Negligible relative to the maxAggregates=2048 entry table.

Benchmark numbers unchanged (the limits don't show up in producer allocation or the existing hit-path benchmarks):

  ClientStatsAggregatorBenchmark          3.632 us/op (was 3.555)
  ClientStatsAggregatorDDSpanBenchmark    2.027 us/op (was 1.967)
  ClientStatsAggregatorMissPathBenchmark  0.057 us/op, 96 B/op (same)

Also updates cardinalityBlockedValuesCollapseIntoOneEntry to push the test beyond the new SERVICE cap (150 services -> 128 in-budget + 1 sentinel = 129 entries).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Existing JMH benchmarks cover stages of the trace pipeline in isolation -- span creation (CoreTracerBenchmark), scope activation (ScopeLifecycleBenchmark), metrics aggregation (ClientStatsAggregator*Benchmark), wire serialization (TraceMapperBenchmark), propagation -- but nothing exercises a full startSpan -> setTag -> finish -> trace assembly -> metrics publish trip in one op. This benchmark fills that gap.

It builds a real CoreTracer with a NoopWriter, reflectively swaps in a real ClientStatsAggregator (default tracer init leaves it as the no-op until agent discovery runs), then per @Benchmark op runs the full single-span lifecycle. Two modes:
- "stable" -- identical labels each op (consumer-side hit path)
- "varied" -- unique (service, operation, resource) per op (consumer-side
  miss path until cardinality budgets fill)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The metrics aggregator is bounded by design at every layer (inbox cap,
aggregate-table cap, per-field cardinality cap, fixed-size flat handler
tables, bounded histogram dense stores). This benchmark stresses all of
those simultaneously to verify the bounds hold and to measure throughput
+ allocation under attack:
- 8 producer threads hammering publish() concurrently.
- Unique (service, operation, resource, peer.hostname) per op so
handlers saturate to blocked_by_tracer instantly and the aggregate
table fills+evicts continuously.
- Random durations across a 1ns..1s range so histograms accept many
distinct bins.
- Random error / topLevel flags so both histograms get exercised.
Per-fork tearDown prints the CountingHealthMetrics drop counters so we
can see how the subsystem absorbed the burst: snapshots dropped at the
inbox (backpressure when consumer can't keep up) versus snapshots
dropped at the table (cap reached with no stale entry). The expected
shape under attack is "lots of inbox drops, zero table drops" --
producer-driven backpressure stays ahead of table saturation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new "Behavior under adversarial load" section comparing the new
ClientStatsAggregator producer/consumer split against master's
ConflatingMetricsAggregator under high-cardinality + random-duration
attack. Captures the measured numbers from AdversarialMetricsBenchmark:
- master degrades 73% by iteration 3 (1.5M -> 410K ops/s) with the
GC threads consuming >100% of wall-clock budget;
- this design sustains ~5.85M ops/s steady-state with ~10% GC time;
- 139M snapshots dropped via the inbox-full backpressure counter,
zero dropped at the AggregateTable cap;
- worst-case heap ceiling of ~32 MB regardless of attack rate.
Explains why master degrades (no producer/consumer split -> all
canonicalization allocation on the calling thread -> GC death spiral)
and why this design holds (offer-based backpressure at the MPSC inbox
boundary turns overflow into a counted drop rather than heap growth).
Also updates two stale references in the doc:
- AggregateMetric -> "AggregateEntry's per-bucket counters" (fold);
- benchmark summary now lists all four JMH benchmarks and includes
the miss-path numbers (5.3x faster, 4.2x less producer allocation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Sits on top of #11387. Reduces the per-entry footprint of the metrics aggregator, replaces HashMap-backed cardinality state with flat open-addressed tables, and adds three new JMH benchmarks (miss-path, full-pipeline, adversarial). All consumer-side changes; producer-thread hot path is unchanged or marginally faster.
Memory + design changes
- Fold `AggregateMetric` into `AggregateEntry` — drops one object header + reference field per entry (~16-24 bytes) and one indirection on the consumer's hot recording path.
- Peer tags as a flat `UTF8BytesString[]` instead of a `List` — saves the `ArrayList` wrapper allocation per miss; eliminates one iterator allocation per entry per report cycle.
- Flat open-addressed cardinality tables — no `HashMap$Node` allocations. At full cardinality saves ~7 KB on the 9 property handlers and ~16 KB per `TagCardinalityHandler` (cap 512). `reset()` is two `Arrays.fill` calls.
- Lazy error histograms — `AggregateEntry.errorLatencies` starts null; allocated on first error. Entries that never error save ~80 B of empty-histogram wrapper. At full table cap with sparse errors: ~150 KB reclaimed. The serializer writes a cached empty-histogram byte payload for entries that never errored, keeping the wire format unchanged.
- Recalibrated cardinality limits — so typical workloads no longer collapse onto `blocked_by_tracer` on day one.

Benchmarks
- `ClientStatsAggregatorMissPathBenchmark` — pool of 4096 single-span traces with unique labels per op. Producer measures 57 ns/op + 96 B/op vs master's 305 ns/op + 399 B/op (5.3× faster, 4.2× less allocation per metrics-eligible span).
- `TracePipelineBenchmark` — end-to-end 3-span HTTP-style trace through `CoreTracer` with the metrics aggregator wired in. Multi-threaded (8 producers, throughput mode). Shows the metrics-subsystem improvements at the full-pipeline level: under varied labels, the branch is +1.8% throughput and -734 B/op vs master (real signal; stable mode is identical, since the metrics-aggregator improvements only manifest on miss).
- `AdversarialMetricsBenchmark` — 8 threads × unique labels per op × random durations across 1ns-1s × random error/top-level flags. Bounded-design verification + drop counters at teardown.

Under adversarial load (this is the headline):
Master enters a GC death spiral because it allocates `MetricKey`s + `DDCache` lookups + `LRUCache` entries synchronously on every producer thread; this branch's producer/consumer split converts overflow into counted drops via `onStatsInboxFull` instead.
Updated `docs/client_metrics_design.md` with a new "Behavior under adversarial load" section, the AggregateMetric → AggregateEntry fold, and a benchmark summary that lists all four producer-side JMH benchmarks.
- `:dd-trace-core:test --tests "datadog.trace.common.metrics.*"` — all pass (existing + updated cardinality test for the new 128-cap service handler).
- `ClientStatsAggregatorMissPathBenchmark` numbers reproduce locally (57 ns/op ± 8, 96 B/op).
- `AdversarialMetricsBenchmark` sustains throughput across iterations (no GC death spiral).
- `SerializingMetricWriterTest` and `MetricsIntegrationTest` pass with the lazy-error-histogram empty payload.

🤖 Generated with Claude Code