Event Deltas: Deterministic sampling with adaptive sample size #1827

@alex-fedotyev

Summary

Replace ORDER BY rand() LIMIT 1000 with deterministic cityHash64(SpanId) sampling and adaptive sample sizing based on total span count, with a visible sample annotation in the legend.

Problem

  • Non-deterministic sampling: ORDER BY rand() means the same hover on the same bar highlights different heatmap cells after each query re-fetch, creating a confusing experience
  • Fixed sample size: 1,000 rows works for small datasets but under-represents rare attribute values at scale (100K+ spans). For tiny datasets (< 1,000), the sample IS the population but users can't tell
  • No transparency: Users see percentages like "2.3%" without knowing if that's 23/1000 sampled or 23/23 total

Changes

Deterministic sampling

  • STABLE_SAMPLE_EXPR = 'cityHash64(SpanId)' — used in ORDER BY clause for all sample queries (outlier, inlier, all-spans, and PartIds CTE). Same data always produces the same sample
  • Set to 'rand()' to restore non-deterministic behavior (tunable constant)
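
A minimal sketch of how the ordering expression might be spliced into a sample query. Only `STABLE_SAMPLE_EXPR` and the `ORDER BY ... LIMIT` shape come from this issue. The helper `buildSampleQuery`, its parameters, and the table name `otel_traces` are hypothetical and do not reflect the actual query configs in `DBDeltaChart.tsx`.

```typescript
// Ordering expression named in this issue; set to 'rand()' to restore the old behavior.
const STABLE_SAMPLE_EXPR = 'cityHash64(SpanId)';

// Hypothetical helper (not from the codebase): builds a sampled SELECT.
// `table` and `whereClause` are assumed to be trusted, pre-escaped SQL fragments.
function buildSampleQuery(
  table: string,
  whereClause: string,
  sampleSize: number,
  orderExpr: string = STABLE_SAMPLE_EXPR,
): string {
  if (!Number.isInteger(sampleSize) || sampleSize <= 0) {
    throw new RangeError(`sampleSize must be a positive integer, got ${sampleSize}`);
  }
  return [
    'SELECT *',
    `FROM ${table}`,
    `WHERE ${whereClause}`,
    `ORDER BY ${orderExpr}`, // a per-row hash gives every row a fixed sort position
    `LIMIT ${sampleSize}`,
  ].join('\n');
}

// Example table name and filter are illustrative only.
const query = buildSampleQuery('otel_traces', 'Duration > 1000000', 1000);
console.log(query);
```

Because the hash of a given SpanId never changes, the same rows sort first on every re-fetch as long as the underlying data and filter are unchanged. Newly ingested spans can still shift which rows fall inside the limit.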

Adaptive sample sizing

  • computeEffectiveSampleSize(totalCount) = clamp(MIN_SAMPLE_SIZE, ceil(totalCount * SAMPLE_RATIO), MAX_SAMPLE_SIZE)
  • Constants: SAMPLE_SIZE=1000 (fallback), MIN_SAMPLE_SIZE=500, MAX_SAMPLE_SIZE=5000, SAMPLE_RATIO=0.01 (1%)
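  • Lightweight count() query runs in parallel with the sample queries. ClickHouse can answer an unfiltered count() from MergeTree metadata (near-instant); a count() with a WHERE filter may still need to read data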
  • Falls back to SAMPLE_SIZE when count is unavailable (query still loading)
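
The sizing rule above can be sketched as follows. The constant names and values, the clamp formula, and the fallback come from this issue. The handling of negative or non-finite counts is an assumption added for robustness, and the real function in `deltaChartUtils.ts` may differ.

```typescript
const SAMPLE_SIZE = 1000; // fallback while the count query is still loading
const MIN_SAMPLE_SIZE = 500;
const MAX_SAMPLE_SIZE = 5000;
const SAMPLE_RATIO = 0.01; // 1% of the total span count

function computeEffectiveSampleSize(totalCount?: number | null): number {
  // Count unavailable: use the fixed fallback.
  // Assumption: invalid counts (NaN, Infinity, negative) are treated the same way.
  if (totalCount == null || !Number.isFinite(totalCount) || totalCount < 0) {
    return SAMPLE_SIZE;
  }
  const proportional = Math.ceil(totalCount * SAMPLE_RATIO);
  // Clamp the proportional size into [MIN_SAMPLE_SIZE, MAX_SAMPLE_SIZE].
  return Math.min(MAX_SAMPLE_SIZE, Math.max(MIN_SAMPLE_SIZE, proportional));
}

console.log(computeEffectiveSampleSize(100));        // small dataset: clamped up to the minimum
console.log(computeEffectiveSampleSize(200_000));    // medium dataset: 1% of the total
console.log(computeEffectiveSampleSize(10_000_000)); // large dataset: clamped down to the maximum
console.log(computeEffectiveSampleSize(undefined));  // count not yet available: fallback
```

Note that for totals below MIN_SAMPLE_SIZE the limit exceeds the population, so the query simply returns every row.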

Legend annotation

  • Shows (n=X of Y sampled) when total count is available
  • Shows (n=X sampled) as fallback
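
One way the two legend variants could be produced is sketched below. The function name `formatSampleAnnotation` and its signature are hypothetical. Only the two output shapes come from this issue, and the thousands separators follow the example in the test plan.

```typescript
// Hypothetical formatter (not from the codebase) for the legend annotation.
function formatSampleAnnotation(
  sampledCount: number,
  totalCount?: number | null,
): string {
  // en-US grouping yields "483,291"-style separators.
  const fmt = (n: number): string => n.toLocaleString('en-US');
  return totalCount != null
    ? `(n=${fmt(sampledCount)} of ${fmt(totalCount)} sampled)`
    : `(n=${fmt(sampledCount)} sampled)`;
}

console.log(formatSampleAnnotation(4833, 483291)); // total count available
console.log(formatSampleAnnotation(1000));         // total count not available
```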

Files

  • packages/app/src/components/deltaChartUtils.ts (SAMPLE_SIZE, MIN/MAX_SAMPLE_SIZE, SAMPLE_RATIO, STABLE_SAMPLE_EXPR, computeEffectiveSampleSize)
  • packages/app/src/components/DBDeltaChart.tsx (count query, effectiveSampleSize in all query configs, legend annotation)
  • packages/app/src/components/__tests__/DBDeltaChart.test.ts (computeEffectiveSampleSize tests)

Dependencies

None — standalone improvement to the sampling mechanism.

Test plan

  • Same data + same hover always highlights the same heatmap cells (deterministic)
  • Small dataset (100 spans) → sample size = MIN_SAMPLE_SIZE (500)
  • Medium dataset (200K spans) → sample size = 2,000 (1% of 200K)
  • Large dataset (10M spans) → sample size = MAX_SAMPLE_SIZE (5,000)
  • Legend shows "(n=4,833 of 483,291 sampled)" with actual numbers (1% of 483,291, rounded up)
  • Setting STABLE_SAMPLE_EXPR = 'rand()' restores non-deterministic behavior
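
The determinism check in the first item can be illustrated in memory. This sketch uses a 32-bit FNV-1a-style hash purely as a stand-in for cityHash64 (it is not the same function), and `stableSample` is a hypothetical helper. The point is only that ordering by a per-row hash selects the same rows regardless of input order.

```typescript
// Stand-in hash: FNV-1a-style over UTF-16 code units. NOT cityHash64.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // convert to unsigned 32-bit
}

// Hypothetical helper: mimics ORDER BY hash(SpanId) LIMIT n on an in-memory list.
function stableSample(spanIds: readonly string[], limit: number): string[] {
  return [...spanIds]
    // Sort by hash; break hash ties by id so the order is fully determined.
    .sort((a, b) => fnv1a(a) - fnv1a(b) || (a < b ? -1 : a > b ? 1 : 0))
    .slice(0, limit);
}

const spanIds = Array.from({ length: 50 }, (_, i) => `span-${i}`);
const first = stableSample(spanIds, 5);
const second = stableSample([...spanIds].reverse(), 5); // same data, different arrival order
console.log(JSON.stringify(first) === JSON.stringify(second));
```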

Context

Part of the Event Deltas improvement series. Reference implementation in PR #1797.

Metadata

Labels: enhancement (New feature or request)
