feat(timing): precise client/server timing split (pure-kernel backend, full-dispatch overhead, participant-only residual) #115
Merged
Add _extract_compute_ns and _decompose_timing to flopscope-client/_budget.py. Both are pure functions (no I/O) that carry the close-response parsing and wall/backend/overhead/residual decomposition math needed by __exit__ wiring.
… + round-trip
Wire __enter__ to snapshot wall-clock and round-trip baseline before the budget_open send, and __exit__ to read server compute time from the close response's comms_summary, then call _decompose_timing to fill _wall_time_s / _flopscope_backend_time / _flopscope_overhead_time / _residual_wall_time. Add integration regression suite (6 tests) that spawns a real FlopscopeServer and asserts the split is non-zero, decomposes wall, and correctly assigns participant sleep to residual. Update unit test mocks to configure comms_tracker.total_round_trip_ns as an int so __enter__/__exit__ arithmetic does not TypeError.
Apply @timed_dispatch / timed_dispatch(proxy) at all client op-dispatch
entry points so every flopscope op family increments the dispatch
accumulator: _fetch_data, __getitem__, _dispatch_op,
RemoteGenerator._call, _make_proxy, _make_linalg_proxy,
_make_random_proxy, _DistributionProxy.{pdf,cdf,ppf},
flops.{einsum,svd}_cost, and the special-cased array()/einsum().
Coverage test (test_every_op_family_increments_dispatch) verifies each
family before and after with a real server subprocess.
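A minimal sketch of how a dispatch accumulator can count nested entry points exactly once, using a baseline/delta scheme. The names `DispatchAccumulator`, `span`, `timed`, and `FakeClock` are illustrative assumptions, not the actual contents of `_dispatch.py`; the fake clock only makes the example deterministic.

```python
import functools
import time
from contextlib import contextmanager


class DispatchAccumulator:
    """Hypothetical accumulator: sums wall time spent inside dispatch spans."""

    def __init__(self, clock=time.perf_counter_ns):
        self.total_ns = 0
        self._clock = clock

    @contextmanager
    def span(self):
        baseline = self.total_ns          # total before this span started
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            # Time already added by spans nested inside this one.
            already_counted = self.total_ns - baseline
            # Add only the part of this span that no inner span covered.
            self.total_ns += max(0, elapsed - already_counted)


def timed(acc):
    """Hypothetical decorator factory wrapping a function in acc.span()."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with acc.span():
                return fn(*args, **kwargs)
        return wrapper
    return decorate


class FakeClock:
    """Deterministic stand-in for time.perf_counter_ns."""

    def __init__(self):
        self.now = 0

    def __call__(self):
        return self.now

    def advance(self, ns):
        self.now += ns


clock = FakeClock()
acc = DispatchAccumulator(clock)


@timed(acc)
def fetch_data():
    clock.advance(300)      # pretend round-trip


@timed(acc)
def tolist():
    clock.advance(100)      # pretend request encode
    fetch_data()            # nested dispatch entry point
    clock.advance(50)       # pretend result reconstruction


tolist()
clock.advance(1000)         # participant Python, outside any span
print(acc.total_ns)         # 450
```

The outer span sees 450 ns elapsed, of which the nested span already recorded 300 ns, so it adds only the remaining 150 ns. The 1000 ns spent outside any span never reaches the accumulator.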
…antics
Replace Option-1 body tests with Option-3 equivalents: backend = pure kernel, overhead = all flopscope machinery (incl. .tolist() and implicit fetches), residual = participant Python only. Adds test_tolist_is_overhead_not_residual as the Option-3 acceptance criterion and keeps test_every_op_family_increments_dispatch.
…ead, not residual
fc16292 to 43cdd36
Summary
The per-MLP timing split (`wall = backend + overhead + residual`) feeds the leaderboard; `residual` is the billed bucket (`C_m = F_m + λ·R_m`). On the server-backed path the client proxy reported it imprecisely: the framework's own cost (request encode/decode, the ZMQ round-trip, and result reconstruction such as `.tolist()`) leaked into the billed `residual`. Concretely, the grading harness serializes the participant's predictions with `preds.tolist()` inside the budget context, so participants were billed for the harness materializing their own output.
This makes the decomposition precise and physically grounded:
- `backend`: pure kernel time, measured at `_run_kernel`; reported as `compute_time_ns`
- `overhead`: `dispatch − backend`
- `residual`: `wall − dispatch`
What changed
- The `_run_kernel` chokepoint times only the numpy call; `compute_time_ns` now reports kernel-only time (arg marshaling, cost model, result storage, and fetch/serialize contribute 0). No new wire field: the existing `total_compute_time_ns` simply narrows in meaning.
- A `_dispatch.py` accumulator (`dispatch_span` / `timed_dispatch`, with baseline/delta nesting so each op's wall is counted exactly once) wraps every op-dispatch entry point, including the data-materialization methods (`tolist`, `__repr__`, `__str__`, `__float__`, `__int__`, `__bool__`), so result reconstruction lands in `overhead`, not `residual`.
- `BudgetContext` computes `overhead = dispatch − kernel`, `residual = wall − dispatch`.
Test plan
- `test_compute_time_is_kernel_only`, `test_fetch_contributes_no_kernel`; full server suite green (206).
- `_dispatch` nesting unit tests (no double-count); decomposition unit tests (identity + both clamp branches); real client↔server integration suite.
- `test_tolist_is_overhead_not_residual`: 10× `.tolist()` of a 128² array with no participant Python yields `residual ≈ 0.66 ms` (reconstruction correctly in `overhead`); `test_residual_is_only_python` (a `sleep(0.2)` is the only thing in residual); `test_worker_tolist_not_billed`; the no-double-count identity test; and a coverage test asserting every op family increments the dispatch accumulator.
Rollout (not in this PR)
- `cz bump` of `flopscope` / `flopscope-client` / `flopscope-server` together. The `hello` handshake enforces exact version match, so the narrowed `compute_time_ns` meaning can't drift across a version mismatch.
- … `flopscope[server]` and `flopscope-client` to the new version.
- … `residual` billing; re-scoring is a separate, owned workstream.
- … (`whestbench-public`) are updated separately.
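For reference, the arithmetic described under What changed (`overhead = dispatch − kernel`, `residual = wall − dispatch`) can be sketched as a pure function. This is a hedged illustration, not the PR's `_decompose_timing`: the name `decompose`, the `TimingSplit` type, and the exact clamping rules are assumptions chosen so that the three buckets stay non-negative and always sum to `wall`. The sample numbers are made up.

```python
from typing import NamedTuple


class TimingSplit(NamedTuple):
    backend_ns: int    # pure kernel time reported by the server
    overhead_ns: int   # dispatch time that was not kernel time
    residual_ns: int   # wall time outside any dispatch span


def decompose(wall_ns: int, dispatch_ns: int, kernel_ns: int) -> TimingSplit:
    """Split wall time into backend + overhead + residual (assumed clamping)."""
    # Assumed clamp 1: dispatch cannot exceed wall (or go negative).
    dispatch_ns = min(max(dispatch_ns, 0), wall_ns)
    # Assumed clamp 2: kernel cannot exceed dispatch (or go negative).
    backend_ns = min(max(kernel_ns, 0), dispatch_ns)
    overhead_ns = dispatch_ns - backend_ns
    residual_ns = wall_ns - dispatch_ns
    return TimingSplit(backend_ns, overhead_ns, residual_ns)


# Illustrative values: 250 ms wall, 50 ms inside dispatch, 30 ms in the kernel.
split = decompose(wall_ns=250_000_000, dispatch_ns=50_000_000, kernel_ns=30_000_000)
assert sum(split) == 250_000_000   # identity: the buckets sum to wall
print(split)

# Clock skew can make the kernel appear longer than dispatch; overhead clamps to 0.
print(decompose(wall_ns=100, dispatch_ns=40, kernel_ns=60))
```

Clamping before subtracting keeps the identity `wall = backend + overhead + residual` intact even when the measurements are slightly inconsistent, at the cost of attributing the inconsistency to one bucket.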