feat(timing): precise client/server timing split — pure-kernel backend, full-dispatch overhead, participant-only residual #115

Merged: spMohanty merged 21 commits into main from feat/client-timing-split on Jun 6, 2026
@spMohanty
Collaborator

Summary

The per-MLP timing split (wall = backend + overhead + residual) feeds the
leaderboard — residual is the billed bucket (C_m = F_m + λ·R_m). On the
server-backed path the client proxy reported it imprecisely: the framework's own
cost (request encode/decode, the ZMQ round-trip, and result reconstruction
such as .tolist()) leaked into the billed residual. Concretely, the grading
harness serializes the participant's predictions with preds.tolist() inside
the budget context — so participants were billed for the harness materializing
their own output.

This PR makes the decomposition precise and physically grounded:

| bucket | definition | measured |
|---|---|---|
| backend | the pure numpy kernel: the actual numerical computation | server: times only the numpy call (_run_kernel); reports it as compute_time_ns |
| overhead | all flopscope machinery: client encode/decode/reconstruction + the wire + server-side marshaling/storage; not billed | client: dispatch − backend |
| residual | the participant's own Python, outside any flopscope call (the sandbox has no numpy); the billed bucket | client: wall − dispatch |
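
To make the arithmetic in the table concrete, here is a minimal sketch of the three-way split. It is not the PR's `_decompose_timing`: the function name, the `TimingSplit` type, the nanosecond units, and the clamping rules are all assumptions made for illustration. The PR mentions clamp branches but does not spell them out.

```python
from typing import NamedTuple


class TimingSplit(NamedTuple):
    """Hypothetical container for one budget window, all values in ns."""
    wall_ns: int
    backend_ns: int
    overhead_ns: int
    residual_ns: int


def decompose_timing(wall_ns: int, dispatch_ns: int, kernel_ns: int) -> TimingSplit:
    """Split wall time into backend + overhead + residual.

    Assumed clamping (not confirmed by the PR): dispatch time cannot exceed
    wall time, and kernel time cannot exceed dispatch time, so no bucket
    ever goes negative because of clock skew between client and server.
    """
    dispatch_ns = min(max(dispatch_ns, 0), wall_ns)
    kernel_ns = min(max(kernel_ns, 0), dispatch_ns)
    overhead_ns = dispatch_ns - kernel_ns   # flopscope machinery, not billed
    residual_ns = wall_ns - dispatch_ns     # participant Python, billed
    return TimingSplit(wall_ns, kernel_ns, overhead_ns, residual_ns)


split = decompose_timing(wall_ns=1_000, dispatch_ns=300, kernel_ns=100)
# The identity wall = backend + overhead + residual holds by construction.
assert split.backend_ns + split.overhead_ns + split.residual_ns == split.wall_ns
print(split)
```

With these inputs the split is 100 ns backend, 200 ns overhead, and 700 ns residual. Clamping before subtracting is what keeps the identity exact even when the reported kernel time is larger than the measured dispatch time.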

What changed

  • flopscope-server: a _run_kernel chokepoint times only the numpy call;
    compute_time_ns now reports kernel-only (arg marshaling, cost model, result
    storage, and fetch/serialize contribute 0). No new wire field — the
    existing total_compute_time_ns simply narrows in meaning.
  • flopscope-client: a new _dispatch.py accumulator
    (dispatch_span / timed_dispatch, with baseline/delta nesting so each op's
    wall is counted exactly once) wraps every op-dispatch entry point —
    including the data-materialization methods (tolist, __repr__, __str__,
    __float__, __int__, __bool__) so result reconstruction lands in
    overhead, not residual. BudgetContext computes
    overhead = dispatch − kernel, residual = wall − dispatch.
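
The baseline/delta nesting described above can be sketched as follows. This is one plausible reading, not the contents of `_dispatch.py`: the `DispatchAccumulator` class, the injectable `clock` parameter, and the decorator signature are hypothetical. Only the names `dispatch_span` and `timed_dispatch` come from the PR.

```python
import functools
import time
from contextlib import contextmanager


class DispatchAccumulator:
    """Hypothetical accumulator of time spent inside op-dispatch calls."""

    def __init__(self, clock=time.perf_counter_ns):
        self._clock = clock   # injectable so the behaviour can be tested
        self.total_ns = 0

    @contextmanager
    def dispatch_span(self):
        start = self._clock()
        baseline = self.total_ns          # what was counted before we began
        try:
            yield
        finally:
            elapsed = self._clock() - start
            # Nested spans have already added their share since `baseline`;
            # add only the part of `elapsed` that they did not cover.
            already_counted = self.total_ns - baseline
            self.total_ns += max(elapsed - already_counted, 0)


def timed_dispatch(acc):
    """Decorator form: wrap a dispatch entry point in a span on `acc`."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with acc.dispatch_span():
                return fn(*args, **kwargs)
        return wrapper
    return decorate


# Deterministic demo: outer span covers ticks 0..50, inner covers 10..30.
ticks = iter([0, 10, 30, 50])
acc = DispatchAccumulator(clock=lambda: next(ticks))
with acc.dispatch_span():
    with acc.dispatch_span():
        pass
print(acc.total_ns)  # 50, not 70: the inner 20 ns is counted once
```

The inner span adds 20 ns on exit; the outer span then sees 50 ns elapsed, subtracts the 20 ns already recorded, and adds the remaining 30 ns. A wrapped method such as `tolist` that internally triggers another wrapped call therefore contributes its wall time exactly once.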

Test plan

  • Server: test_compute_time_is_kernel_only, test_fetch_contributes_no_kernel; full server suite green (206).
  • Client: _dispatch nesting unit tests (no double-count); decomposition unit tests (identity + both clamp branches); real client↔server integration suite.
  • Acceptance criteria (all green): test_tolist_is_overhead_not_residual — 10× .tolist() of a 128² array with no participant Python yields residual ≈ 0.66 ms (reconstruction correctly in overhead); test_residual_is_only_python (a sleep(0.2) is the only thing in residual); test_worker_tolist_not_billed; the no-double-count identity test; and a coverage test asserting every op family increments the dispatch accumulator.
  • Zero regressions: full client suite failure/error count unchanged from the pre-existing baseline.

Rollout (not in this PR)

  • Matched-version release: cz bump of flopscope / flopscope-client / flopscope-server together. The hello handshake enforces exact version match, so the narrowed compute_time_ns meaning can't drift across a version mismatch.
  • The consuming evaluator re-pins both flopscope[server] and flopscope-client to the new version.
  • This turns on real residual billing — re-scoring is a separate, owned workstream.
  • The participant-facing contract docs (whestbench-public) are updated separately.

spMohanty added 17 commits June 6, 2026 12:20
Add _extract_compute_ns and _decompose_timing to flopscope-client/_budget.py.
Both are pure functions (no I/O) that carry the close-response parsing and
wall/backend/overhead/residual decomposition math needed by __exit__ wiring.
… + round-trip

Wire __enter__ to snapshot wall-clock and round-trip baseline before the
budget_open send, and __exit__ to read server compute time from the close
response's comms_summary, then call _decompose_timing to fill
_wall_time_s / _flopscope_backend_time / _flopscope_overhead_time /
_residual_wall_time.  Add integration regression suite (6 tests) that
spawns a real FlopscopeServer and asserts the split is non-zero,
decomposes wall, and correctly assigns participant sleep to residual.
Update unit test mocks to configure comms_tracker.total_round_trip_ns
as an int so __enter__/__exit__ arithmetic does not TypeError.

Apply @timed_dispatch / timed_dispatch(proxy) at all client op-dispatch
entry points so every flopscope op family increments the dispatch
accumulator: _fetch_data, __getitem__, _dispatch_op,
RemoteGenerator._call, _make_proxy, _make_linalg_proxy,
_make_random_proxy, _DistributionProxy.{pdf,cdf,ppf},
flops.{einsum,svd}_cost, and the special-cased array()/einsum().

Coverage test (test_every_op_family_increments_dispatch) verifies each
family before and after with a real server subprocess.
…antics

Replace Option-1 body tests with Option-3 equivalents: backend = pure kernel,
overhead = all flopscope machinery (incl. .tolist() and implicit fetches),
residual = participant Python only. Adds test_tolist_is_overhead_not_residual
as the Option-3 acceptance criterion and keeps test_every_op_family_increments_dispatch.
spMohanty force-pushed the feat/client-timing-split branch from fc16292 to 43cdd36 on June 6, 2026 14:36
spMohanty merged commit 17e85a3 into main on Jun 6, 2026 (22 checks passed)
spMohanty deleted the feat/client-timing-split branch on June 6, 2026 16:32