feat(benchmarking): enable tests to run in dedicated environment or in docker #3157
Conversation
…mark
- Create test/e2e/benchmark/ subpackage with SpamoorSuite (testify/suite)
- Move spamoor smoke test into suite as TestSpamoorSmoke
- Split helpers into focused files: traces.go, output.go, metrics.go
- Introduce resultWriter for defer-based benchmark JSON output
- Export shared symbols from evm_test_common.go for cross-package use
- Restructure CI to fan-out benchmark jobs and fan-in publishing
- Run benchmarks on PRs only when benchmark-related files change
Resolve conflicts keeping the benchmark suite refactoring:
- benchmark.yml: keep path filters and suite-style test command
- evm_spamoor_smoke_test.go: keep deleted (moved to benchmark pkg)
- evm_test_common.go: keep exported types, drop writeTraceBenchmarkJSON (now in benchmark/output.go)
go test sets the working directory to the package under test, so the env var should be relative to test/e2e/benchmark/, not test/e2e/.
go test treats all arguments after an unknown flag (--evm-binary) as test binary args, so ./benchmark/ was never recognized as a package pattern.
go test sets the cwd to the package directory (test/e2e/benchmark/), so the binary path needs an extra parent traversal.
The benchmark package doesn't define the --binary flag that test-e2e passes. It has its own CI workflow so it doesn't need to run here.
…nfig
collectBlockMetrics hit reth's 20K FilterLogs limit at high tx volumes. Replace with direct header iteration over [startBlock, endBlock] and add Phase 1 metrics: non-empty ratio, block interval p50/p99, gas/block and tx/block p50/p99.
Optimize spamoor configuration for 100ms block time:
- --slot-duration 100ms, --startup-delay 0 on daemon
- throughput=50 per 100ms slot (500 tx/s per spammer)
- max_pending=50000 to avoid 3s block poll backpressure
- 5 staggered spammers with 50K txs each
Results: 55 MGas/s, 1414 TPS, 19.8% non-empty blocks (up from 6%).
- Move startBlock capture after spammer creation to exclude warm-up
- Replace 20s drain sleep with smart poll (waitForDrain)
- Add deleteAllSpammers cleanup to handle stale spamoor DB entries
- Lower trace sample rate to 10% to prevent Jaeger OOM

- make reth tag configurable via EV_RETH_TAG env var (default pr-140)
- fix OTLP config: remove duplicate env vars, use http/protobuf protocol
- use require.Eventually for host readiness polling
- rename requireHTTP to requireHostUp
- use non-fatal logging in resultWriter.flush deferred context
- fix stale doc comment (setupCommonEVMEnv -> SetupCommonEVMEnv)
- rename loop variable to avoid shadowing testing.TB convention
- add block/internal/executing/** to CI path trigger
- remove unused require import from output.go
# Conflicts:
#	scripts/test.mk
# Conflicts:
#	test/e2e/benchmark/suite_test.go
move EV_RETH_TAG resolution and rpc connection limits into setupEnv so all benchmark tests share the same reth configuration. lower ERC20 spammer count from 5 to 2 to reduce resource contention on local hardware while keeping the loop for easy scaling on dedicated infra.
- add blockMetricsSummary with summarize(), log(), and entries() methods
- add evNodeOverhead() for computing ProduceBlock vs ExecuteTxs overhead
- add collectTraces() suite method to deduplicate trace collection pattern
- add addEntries() convenience method on resultWriter
- slim TestERC20Throughput from ~217 to ~119 lines
- reuse collectTraces in TestSpamoorSmoke
Bumps tastora to pick up host network support in the spamoor builder. Spamoor in external mode now uses host networking so it can resolve the same hostnames as the host machine.
Record startTime when the provider is created and use it as the lower bound for trace queries, preventing spans from previous runs being included in the analysis.
…run-without-infra
# Conflicts:
#	.github/workflows/benchmark.yml
#	test/e2e/benchmark/gasburner_test.go
#	test/e2e/benchmark/helpers.go
#	test/e2e/benchmark/spamoor_erc20_test.go
#	test/e2e/benchmark/suite_test.go
#	test/e2e/benchmark/traces.go
#	test/e2e/go.mod
#	test/e2e/go.sum
…metrics
- Make gas_units_to_burn, max_wallets, num_spammers, throughput, and warmup_txs configurable via BENCH_* env vars
- Add rethExecutionRate() for ev-reth GGas/s measurement
- Add engineSpanEntries() for ProduceBlock/GetPayload/NewPayload timing
- Switch local benchmarks from Jaeger to VictoriaTraces
- Add setupExternalEnv for running against pre-deployed infrastructure
- Update tastora to 2ee1b0a (victoriatraces support)
- benchConfig consolidates all BENCH_* env vars into a single struct constructed once per test via newBenchConfig(serviceName)
- benchmarkResult collects all output metrics (block summary, overhead, GGas/s, engine span timing, seconds_per_gigagas, span averages) and produces entries via a single entries() call
- Removes scattered envInt/envOrDefault calls from test files
- Removes manual entry-by-entry result assembly from each test
- Net reduction of ~129 lines across existing files
📝 Walkthrough
Refactors e2e benchmark infrastructure from hard-coded configuration to environment-variable-driven setup, replaces Jaeger tracing with a VictoriaTraces backend, and introduces ASCII flowchart visualization and unified benchmark result aggregation across tests.
Sequence Diagram(s)
sequenceDiagram
participant Test as Benchmark Test
participant Cfg as BenchConfig
participant Spammer as Spammers
participant VT as VictoriaTraces
participant Result as BenchmarkResult
Test->>Cfg: newBenchConfig(serviceName)
Cfg-->>Cfg: Load BENCH_* env vars
Cfg-->>Test: Return config with parameters
Test->>Spammer: Start spammers (NumSpammers)
Spammer->>VT: Emit trace spans via OTLP
Spammer-->>Test: Spammers running
Test->>Test: Execute benchmark workload
Spammer->>VT: Send transaction traces
Test->>VT: collectSpans(serviceName)
VT-->>Test: Stream spans via LogsQL
Test->>Test: Parse spans + extract hostNames
Test-->>Test: Build richSpan hierarchy
Test->>Result: newBenchmarkResult(name, metrics, traces)
Result->>Result: Aggregate span durations
Result->>Result: Build ASCII flowchart
Result->>Test: Entries with summary + visualization
Test->>Test: Log results & flowcharts
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
Claude finished @chatton's task in 2m 0s — Code Review
Good overall structure and clean separation of concerns. The refactoring into …

Issues
1. The docstring says "fails the test on timeout" but the implementation only logs a warning and returns silently. If a test depends on this reaching the target, it may pass incorrectly when the warmup was incomplete.
2. …
3. …
4. Duplicate network roundtrip for rich and flat spans (…)

Nits / Polish
5. Invalid env var values (e.g. …)
6. If a trace has multiple candidate roots (no parent found in set), the one with the longest duration wins. This is a reasonable heuristic but can silently produce misleading trees if span IDs are missing from the collection window. A comment explaining the tie-breaking rule would help future readers.
7. The …
8. Each span line is unmarshalled twice: once fully into …
9. Running ev-reth with …
10. The reviewer …
11. For each block in the range, …
12. Missing documentation of …

Summary
The core architecture (config extraction, …)
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #3157 +/- ##
==========================================
- Coverage 60.75% 60.74% -0.01%
==========================================
Files 117 117
Lines 11988 11988
==========================================
- Hits 7283 7282 -1
- Misses 3886 3887 +1
Partials 819 819
Flags with carried forward coverage won't be shown.
test/e2e/benchmark/gasburner_test.go (outdated)
// give spammers a moment to start, then verify none failed immediately
time.Sleep(3 * time.Second)
assertSpammersRunning(t, api, spammerIDs)
nit: avoid sleep.
require.EventuallyWithT(t, func(collect *assert.CollectT) {
assertSpammersRunning(t, api, spammerIDs)
}, 3*time.Second, time.Second/10)
- replace time.Sleep with require.EventuallyWithT for spammer checks
- use benchConfig env vars in TestERC20Throughput instead of hardcoded constants
- remove dead truncateID function
- fix stale Jaeger comment in smoke test
- deduplicate HTTP boilerplate in trace fetching via fetchLogStream helper
- fix fragile string comparison for ProduceBlock avg logging
- make waitForMetricTarget responsive to context cancellation
- add BENCH_WAIT_TIMEOUT env var support
This PR is already getting a bit too large, I can do some cleanups in a followup if there are no major blockers once CI is green
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/e2e/benchmark/suite_test.go (1)
48-55: ⚠️ Potential issue | 🟠 Major
Pin the default reth image instead of latest. Using a floating tag makes benchmark tests non-deterministic: the same commit can pull different binaries over time, causing baseline drift and failures unrelated to code changes. Default to an immutable release tag or digest, keeping EV_RETH_TAG as the override mechanism.
EV_RETH_TAGas the override mechanism.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/benchmark/suite_test.go` around lines 48 - 55, The defaultRethTag constant is set to the floating "latest" tag which makes rethTag() non-deterministic; replace defaultRethTag with a pinned immutable release tag or digest (e.g., a specific semver or sha256 digest) so tests pull a fixed image, while keeping the EV_RETH_TAG environment variable override in rethTag() intact; update any test docs/comments to reflect that the default is now a pinned release rather than "latest".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/benchmark/config.go`:
- Around line 33-47: newBenchConfig currently relies on envInt/envDuration which
silently masks bad/misspelled or nonsensical values; change newBenchConfig (and
any other place constructing benchConfig) to validate parsed environment inputs
and return an error instead of falling back silently. Update the signature of
newBenchConfig to return (benchConfig, error), parse each env var (using
envInt/envDuration) for BENCH_NUM_SPAMMERS, BENCH_COUNT_PER_SPAMMER,
BENCH_THROUGHPUT, BENCH_WARMUP_TXS, BENCH_GAS_UNITS_TO_BURN, BENCH_MAX_WALLETS,
BENCH_WAIT_TIMEOUT, etc., enforce sensible bounds (e.g., non-negative integers,
positive durations, and explicit max limits), and return a descriptive error
when a value is out of range or unparseable; apply the same validation logic
wherever benchConfig is built so invalid BENCH_* overrides fail fast.
In `@test/e2e/benchmark/flowchart.go`:
- Around line 307-315: The code currently uses a per-trace dedupe map `seen`
when populating `countsByName` (in the loop that also fills `dursByName`), so
`countsByName` represents traces-per-operation but `renderAggregateTree` prints
it as "(%d calls)"; either remove the dedupe so you count every occurrence or
change the label to "traces". Fix: in the span-aggregation loop (the block that
creates `seen := make(map[string]bool)` and increments `countsByName[s.name]`),
either delete the `seen` logic and always do `countsByName[s.name]++` for each
span to count calls, or keep `seen` and update `renderAggregateTree` to display
"traces" instead of "calls"; apply the same change to the similar block around
lines 345-351 where `seen`/`countsByName` is used.
In `@test/e2e/benchmark/helpers.go`:
- Around line 456-482: waitForMetricTarget currently logs and returns on timeout
or ctx.Done(), allowing callers like TestGasBurner to continue as if successful;
change the two branches handling <-ctx.Done() and <-timer.C to fail the test
instead (use t.Fatalf with a clear message including name, target, and timeout)
so the benchmark stops immediately when the metric target isn't met; if you
prefer to propagate errors instead, change waitForMetricTarget's signature to
return an error and update callers (e.g., TestGasBurner) to handle the error
instead of assuming success.
In `@test/e2e/benchmark/spamoor_erc20_test.go`:
- Around line 81-86: The block metrics are including warm-up spans because the
trace window isn't reset before collecting traces; mirror TestGasBurner by
waiting for the warm-up gate and calling e.traces.resetStartTime() right before
calling s.collectTraces so engineSpanEntries/spanAvgEntries exclude
deploy/funding spans; specifically, add the same warm-up synchronization used in
TestGasBurner and invoke e.traces.resetStartTime() immediately before
collectTraces/ newBenchmarkResult("ERC20Throughput", ...) so both trace-derived
metrics and the block summary use the same post-warmup window.
In `@test/e2e/benchmark/suite_test.go`:
- Around line 231-237: The current best-effort rich-span collection calls
richSpanCollector.collectRichSpans twice and can block ~3 minutes each even when
results are optional; change the logic in the branch that checks
e.traces.(richSpanCollector) so: 1) only call rc.collectRichSpans(ctx,
"ev-reth") if len(tr.evReth) > 0 to gate that extra call, and 2) replace the
blocking collectRichSpans calls with a short-timeout or non-blocking/try variant
(or wrap the call with a short context timeout) when populating tr.evNodeRich
and tr.evRethRich to avoid multi-minute waits when rich spans are unavailable.
Ensure you update references to richSpanCollector, collectRichSpans,
serviceName, tr.evNodeRich, tr.evRethRich, and tr.evReth accordingly.
- Around line 166-173: The test currently logs and prints raw external secrets
(rpcURL and the BENCH_PRIVATE_KEY env var) and may include them in failure
messages; change logging to never include full URLs or private keys: in the
block using rpcURL, replace the s.Logf call to log only the mode or a sanitized
host (e.g., extract and log url.Host or mask credentials) and stop logging
privateKey entirely; update the s.Require().NotEmpty and s.Require().NoError
messages to avoid interpolating rpcURL or privateKey (use generic messages like
"failed to dial external RPC" or include the sanitized host only), and apply the
same sanitization/no-logging pattern to the other occurrences that reference
rpcURL/BENCH_* (the ethclient.Dial call, s.Require assertions, and any other
s.Logf around those vars).
In `@test/e2e/benchmark/traces.go`:
- Around line 61-67: The uiURL builder in victoriaTraceProvider (function uiURL)
interpolates start and end times directly which can contain '+' from RFC3339
offsets and become spaces in the query; update uiURL to URL-encode the start and
end parameters (e.g., use neturl.QueryEscape or build the query with url.Values)
so both start and end are escaped before concatenation, and apply the same
change to the other analogous builder around lines 155-162 to ensure both
timestamps are properly encoded.
- Line 90: The error messages currently include the raw trace query URL
(v.queryURL) which may contain sensitive data; remove the URL from returned
errors and instead report only the service name and status/error. Locate the
fmt.Errorf calls that reference v.queryURL (e.g., the timed out message using
"timed out waiting for %s traces from %s: %w") and change them to omit
v.queryURL, e.g. format with serviceName and ctx.Err() or a non-sensitive status
string; apply the same replacement for the other similar fmt.Errorf/log calls
that reference v.queryURL at the other noted locations (lines around 118, 145,
and 171-176). Ensure v.queryURL is not included in any returned error or test
output.
- Around line 205-214: The code currently discards errors from strconv.ParseInt
when computing ns and startNs and appends a malformed richSpan
(zero-time/duration); change the ParseInt calls to capture their errors and, if
either returns an error, skip the current row (or log the error) instead of
appending. Specifically, update the parsing around ns, startNs (the
strconv.ParseInt calls) and the subsequent append to richSpan so you check both
err values and continue to the next iteration when a parse fails; preserve use
of row.TraceID, row.SpanID, extractHostName(line), richSpan, startTime and
duration when the parses succeed.
---
Outside diff comments:
In `@test/e2e/benchmark/suite_test.go`:
- Around line 48-55: The defaultRethTag constant is set to the floating "latest"
tag which makes rethTag() non-deterministic; replace defaultRethTag with a
pinned immutable release tag or digest (e.g., a specific semver or sha256
digest) so tests pull a fixed image, while keeping the EV_RETH_TAG environment
variable override in rethTag() intact; update any test docs/comments to reflect
that the default is now a pinned release rather than "latest".
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c92be09c-4806-4c9f-9301-e1af376518a5
⛔ Files ignored due to path filters (1)
test/e2e/go.sum is excluded by !**/*.sum
📒 Files selected for processing (11)
test/e2e/benchmark/config.go
test/e2e/benchmark/flowchart.go
test/e2e/benchmark/gasburner_test.go
test/e2e/benchmark/helpers.go
test/e2e/benchmark/output.go
test/e2e/benchmark/result.go
test/e2e/benchmark/spamoor_erc20_test.go
test/e2e/benchmark/spamoor_smoke_test.go
test/e2e/benchmark/suite_test.go
test/e2e/benchmark/traces.go
test/e2e/go.mod
func newBenchConfig(serviceName string) benchConfig {
	return benchConfig{
		ServiceName:     serviceName,
		BlockTime:       envOrDefault("BENCH_BLOCK_TIME", "100ms"),
		SlotDuration:    envOrDefault("BENCH_SLOT_DURATION", "250ms"),
		GasLimit:        envOrDefault("BENCH_GAS_LIMIT", ""),
		ScrapeInterval:  envOrDefault("BENCH_SCRAPE_INTERVAL", "1s"),
		NumSpammers:     envInt("BENCH_NUM_SPAMMERS", 2),
		CountPerSpammer: envInt("BENCH_COUNT_PER_SPAMMER", 2000),
		Throughput:      envInt("BENCH_THROUGHPUT", 200),
		WarmupTxs:       envInt("BENCH_WARMUP_TXS", 200),
		GasUnitsToBurn:  envInt("BENCH_GAS_UNITS_TO_BURN", 1_000_000),
		MaxWallets:      envInt("BENCH_MAX_WALLETS", 500),
		WaitTimeout:     envDuration("BENCH_WAIT_TIMEOUT", 10*time.Minute),
	}
Fail fast on invalid BENCH_* overrides.
envInt/envDuration currently mask typos, and parseable but nonsensical values like negative counts or timeouts still flow through. That means a mis-set BENCH_NUM_SPAMMERS, BENCH_WARMUP_TXS, or BENCH_WAIT_TIMEOUT can still produce benchmark output for a different workload than the job requested. Please validate bounds when building benchConfig and return an error instead of silently defaulting.
As per coding guidelines, "Validate all inputs from external sources".
Also applies to: 68-94
for _, spans := range byTrace {
	// count each unique operation once per trace
	seen := make(map[string]bool)
	for _, s := range spans {
		dursByName[s.name] = append(dursByName[s.name], s.duration)
		if !seen[s.name] {
			countsByName[s.name]++
			seen[s.name] = true
		}
The aggregate chart is labeling per-trace counts as calls.
countsByName is incremented once per trace because of seen, but renderAggregateTree prints that number as (%d calls). Any operation that appears multiple times in one trace will be underreported, so the aggregate chart becomes misleading. Either count every occurrence or relabel the output as traces.
Also applies to: 345-351
// waitForMetricTarget polls a metric getter function every 2s until the
// returned value >= target, or fails the test on timeout.
func waitForMetricTarget(t testing.TB, name string, poll func() (float64, error), target float64, timeout time.Duration) {
	t.Helper()
	ctx := t.Context()
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()

	for {
		v, err := poll()
		if err == nil && v >= target {
			t.Logf("metric %s reached %.0f (target %.0f)", name, v, target)
			return
		}
		select {
		case <-ctx.Done():
			t.Logf("metric %s: context cancelled (target %.0f)", name, target)
			return
		case <-timer.C:
			t.Logf("metric %s did not reach target %.0f within %v", name, target, timeout)
			return
		case <-ticker.C:
		}
	}
}
Don't let timeout/cancel turn into a partial "successful" benchmark.
On timeout or ctx.Done(), this helper only logs and returns. TestGasBurner then keeps collecting blocks and writes benchmark output even though the target metric was never reached. Please fail here or return an error that callers must handle.
💡 Suggested change
func waitForMetricTarget(t testing.TB, name string, poll func() (float64, error), target float64, timeout time.Duration) {
t.Helper()
ctx := t.Context()
timer := time.NewTimer(timeout)
defer timer.Stop()
ticker := time.NewTicker(2 * time.Second)
defer ticker.Stop()
+ var lastValue float64
+ var lastErr error
for {
v, err := poll()
+ lastValue, lastErr = v, err
if err == nil && v >= target {
t.Logf("metric %s reached %.0f (target %.0f)", name, v, target)
return
}
select {
case <-ctx.Done():
- t.Logf("metric %s: context cancelled (target %.0f)", name, target)
- return
+ require.FailNowf(t, "metric wait cancelled",
+ "%s reached %.0f/%.0f before cancellation: %v", name, lastValue, target, ctx.Err())
case <-timer.C:
- t.Logf("metric %s did not reach target %.0f within %v", name, target, timeout)
- return
+ require.FailNowf(t, "metric target not reached",
+ "%s reached %.0f/%.0f within %v (last error: %v)", name, lastValue, target, timeout, lastErr)
case <-ticker.C:
}
}
}
traces := s.collectTraces(e, cfg.ServiceName)

// collect and report traces
traces := s.collectTraces(e, serviceName)

if overhead, ok := evNodeOverhead(traces.evNode); ok {
	t.Logf("ev-node overhead: %.1f%%", overhead)
	w.addEntry(entry{Name: "ERC20Throughput - ev-node overhead", Unit: "%", Value: overhead})
}

w.addEntries(summary.entries("ERC20Throughput"))
w.addSpans(traces.allSpans())
result := newBenchmarkResult("ERC20Throughput", bm, traces)
s.Require().Greater(result.summary.SteadyState, time.Duration(0), "expected non-zero steady-state duration")
result.log(t, wallClock)
w.addEntries(result.entries())
Trace-derived metrics still include warm-up work.
This test records block metrics after the launcher setup phase, but the trace window is never reset before collectTraces. The block summary therefore measures steady-state throughput while engineSpanEntries and spanAvgEntries still include deploy/funding spans. Mirror TestGasBurner's warm-up gate plus e.traces.resetStartTime() so both data sources describe the same window.
t.Logf("external mode: using RPC %s", rpcURL)

// collectTraces fetches ev-node traces (required) and ev-reth traces (optional)
// from Jaeger, then prints reports for both.
func (s *SpamoorSuite) collectTraces(e *env, serviceName string) *traceResult {
	t := s.T()
	tr := &traceResult{
		evNode: s.collectServiceTraces(e, serviceName),
		evReth: s.tryCollectServiceTraces(e, "ev-reth"),
	}
	e2e.PrintTraceReport(t, serviceName, tr.evNode)
	if len(tr.evReth) > 0 {
		e2e.PrintTraceReport(t, "ev-reth", tr.evReth)
	}
	return tr
}

privateKey := os.Getenv("BENCH_PRIVATE_KEY")
s.Require().NotEmpty(privateKey, "BENCH_PRIVATE_KEY must be set in external mode")

// collectServiceTraces fetches traces from Jaeger for the given service and returns the spans.
func (s *SpamoorSuite) collectServiceTraces(e *env, serviceName string) []e2e.TraceSpan {
	ctx, cancel := context.WithTimeout(s.T().Context(), 3*time.Minute)
	defer cancel()

// eth client
ethClient, err := ethclient.Dial(rpcURL)
s.Require().NoError(err, "failed to dial external RPC %s", rpcURL)
Redact external BENCH_* URLs before logging them.
These values can carry basic-auth credentials, API tokens, or signed query params, and the current logs/failure output will copy them into CI artifacts. Log only the mode or a sanitized host identifier instead. As per coding guidelines: "Never expose private keys in logs or errors."
Also applies to: 198-200, 227-229
if rc, ok := e.traces.(richSpanCollector); ok {
	if spans, err := rc.collectRichSpans(ctx, serviceName); err == nil {
		tr.evNodeRich = spans
	}
	if spans, err := rc.collectRichSpans(ctx, "ev-reth"); err == nil {
		tr.evRethRich = spans
	}
Best-effort rich-span collection shouldn't add minutes of blocking.
This branch ignores rich-span errors, but collectRichSpans still waits up to 3 minutes per call. When rich spans are unavailable, optional diagnostics can add roughly 3-6 minutes of dead time to a run. Gate the ev-reth branch on len(tr.evReth) > 0, and use a short-timeout/try variant for rich spans as well.
func (v *victoriaTraceProvider) uiURL(serviceName string) string {
	query := fmt.Sprintf(`_stream:{resource_attr:service.name="%s"}`, serviceName)
	return fmt.Sprintf("%s/select/vmui/#/query?query=%s&start=%s&end=%s",
		strings.TrimRight(v.queryURL, "/"),
		neturl.QueryEscape(query),
		v.startTime.Format(time.RFC3339),
		time.Now().Format(time.RFC3339))
Encode start and end when building VictoriaTraces URLs.
time.RFC3339/time.RFC3339Nano emits +HH:MM on positive-offset machines. Interpolating that directly into the query string turns + into a space, so local trace lookups break outside UTC. Build both URLs with url.Values (or escape start and end too).
Suggested fix

```diff
 func (v *victoriaTraceProvider) uiURL(serviceName string) string {
 	query := fmt.Sprintf(`_stream:{resource_attr:service.name="%s"}`, serviceName)
-	return fmt.Sprintf("%s/select/vmui/#/query?query=%s&start=%s&end=%s",
-		strings.TrimRight(v.queryURL, "/"),
-		neturl.QueryEscape(query),
-		v.startTime.Format(time.RFC3339),
-		time.Now().Format(time.RFC3339))
+	params := neturl.Values{
+		"query": []string{query},
+		"start": []string{v.startTime.UTC().Format(time.RFC3339)},
+		"end":   []string{time.Now().UTC().Format(time.RFC3339)},
+	}
+	return fmt.Sprintf("%s/select/vmui/#/query?%s",
+		strings.TrimRight(v.queryURL, "/"),
+		params.Encode())
 }
 ...
 	end := time.Now()
 	query := fmt.Sprintf(`_stream:{resource_attr:service.name="%s"}`, serviceName)
 	baseURL := strings.TrimRight(v.queryURL, "/")
-	url := fmt.Sprintf("%s/select/logsql/query?query=%s&start=%s&end=%s",
-		baseURL,
-		neturl.QueryEscape(query),
-		v.startTime.Format(time.RFC3339Nano),
-		end.Format(time.RFC3339Nano))
+	params := neturl.Values{
+		"query": []string{query},
+		"start": []string{v.startTime.UTC().Format(time.RFC3339Nano)},
+		"end":   []string{end.UTC().Format(time.RFC3339Nano)},
+	}
+	url := fmt.Sprintf("%s/select/logsql/query?%s", baseURL, params.Encode())
```

Also applies to: 155-162
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/benchmark/traces.go` around lines 61 - 67, The uiURL builder in
victoriaTraceProvider (function uiURL) interpolates start and end times directly
which can contain '+' from RFC3339 offsets and become spaces in the query;
update uiURL to URL-encode the start and end parameters (e.g., use
neturl.QueryEscape or build the query with url.Values) so both start and end are
escaped before concatenation, and apply the same change to the other analogous
builder around lines 155-162 to ensure both timestamps are properly encoded.
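A small, self-contained demonstration of why `url.Values` fixes this. `encodeWindow` is a hypothetical helper, not code from the PR; the point is that `Encode` percent-encodes the `+` in an RFC3339 offset as `%2B`, whereas bare interpolation leaves it to be decoded as a space:

```go
package main

import (
	"fmt"
	neturl "net/url"
)

// encodeWindow builds the time-window query string with url.Values so
// '+' in RFC3339 offsets survives as %2B instead of decoding to a space.
func encodeWindow(query, start, end string) string {
	v := neturl.Values{
		"query": []string{query},
		"start": []string{start},
		"end":   []string{end},
	}
	return v.Encode() // keys are emitted in sorted order
}

func main() {
	fmt.Println(encodeWindow("a",
		"2025-01-02T15:04:05+02:00",
		"2025-01-02T16:04:05+02:00"))
}
```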
```go
select {
case <-ctx.Done():
	return nil, fmt.Errorf("timed out waiting for %s traces from %s: %w", serviceName, v.queryURL, ctx.Err())
```
Keep raw trace query URLs out of returned errors.
BENCH_TRACE_QUERY_URL is externally supplied and may include credentials or signed parameters. These messages bubble into failed test output, so keep the raw URL out of the error text and report only the service name / status instead. As per coding guidelines: "Never expose private keys in logs or errors."
Also applies to: 118-118, 145-145, 171-176
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/benchmark/traces.go` at line 90, The error messages currently
include the raw trace query URL (v.queryURL) which may contain sensitive data;
remove the URL from returned errors and instead report only the service name and
status/error. Locate the fmt.Errorf calls that reference v.queryURL (e.g., the
timed out message using "timed out waiting for %s traces from %s: %w") and
change them to omit v.queryURL, e.g. format with serviceName and ctx.Err() or a
non-sensitive status string; apply the same replacement for the other similar
fmt.Errorf/log calls that reference v.queryURL at the other noted locations
(lines around 118, 145, and 171-176). Ensure v.queryURL is not included in any
returned error or test output.
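If the URL must appear in output at all, one option is to strip everything sensitive first. `redactQueryURL` is a hypothetical helper sketching that approach with the standard `net/url` package:

```go
package main

import (
	"fmt"
	neturl "net/url"
)

// redactQueryURL keeps only scheme, host, and path so userinfo
// credentials and signed query parameters never reach error messages
// or test output.
func redactQueryURL(raw string) string {
	u, err := neturl.Parse(raw)
	if err != nil {
		return "<invalid trace query URL>"
	}
	u.User = nil     // drop user:password
	u.RawQuery = ""  // drop signed/query parameters
	u.Fragment = ""
	return u.String()
}

func main() {
	fmt.Println(redactQueryURL("https://user:secret@vt.example:8428/select?sig=abc"))
}
```

Reporting only the service name and status, as the comment suggests, is still the simpler fix; redaction is a fallback when the endpoint identity genuinely aids debugging.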
```go
ns, _ := strconv.ParseInt(row.Duration, 10, 64)
startNs, _ := strconv.ParseInt(row.StartTimeUnixNano, 10, 64)
spans = append(spans, richSpan{
	traceID:      row.TraceID,
	spanID:       row.SpanID,
	parentSpanID: row.ParentSpanID,
	name:         row.Name,
	hostName:     extractHostName(line),
	startTime:    time.Unix(0, startNs),
	duration:     time.Duration(ns) * time.Nanosecond,
```
Drop malformed rich-span rows instead of synthesizing zero-value spans.
Both ParseInt errors are discarded here, so a bad row becomes a span at Unix epoch with zero duration. That can scramble flowchart ordering and span summaries; skip the row (or log it) when either parse fails.
Suggested fix

```diff
-	ns, _ := strconv.ParseInt(row.Duration, 10, 64)
-	startNs, _ := strconv.ParseInt(row.StartTimeUnixNano, 10, 64)
+	ns, err := strconv.ParseInt(row.Duration, 10, 64)
+	if err != nil {
+		continue
+	}
+	startNs, err := strconv.ParseInt(row.StartTimeUnixNano, 10, 64)
+	if err != nil {
+		continue
+	}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/benchmark/traces.go` around lines 205 - 214, The code currently
discards errors from strconv.ParseInt when computing ns and startNs and appends
a malformed richSpan (zero-time/duration); change the ParseInt calls to capture
their errors and, if either returns an error, skip the current row (or log the
error) instead of appending. Specifically, update the parsing around ns, startNs
(the strconv.ParseInt calls) and the subsequent append to richSpan so you check
both err values and continue to the next iteration when a parse fails; preserve
use of row.TraceID, row.SpanID, extractHostName(line), richSpan, startTime and
duration when the parses succeed.
Will clean up the rich spans/flowchart in a follow-up; this is still experimental.
Overview
This PR:
- introduces `BENCH_*` variables
- adds a `richSpan` type, which is used to output a flowchart capturing spans and subspans

Below is a sample of runs (locally with Docker) which we can use to inform settings for the dedicated hardware runs.
Summary by CodeRabbit
Release Notes
New Features
Tests