feat(benchmarking): enable tests to run in dedicated environment or in docker #3157

Open
chatton wants to merge 59 commits into main from cian/enable-tests-to-run-without-infra
Conversation


@chatton chatton commented Mar 12, 2026

Overview

This PR

  • enables the tests to run against existing infrastructure via the BENCH_* environment variables
  • switches the Jaeger Docker implementation to VictoriaTraces, matching what we deploy in real environments
  • adds a richSpan type, used to output a flowchart that captures spans and subspans

Below is a sample of runs (executed locally with Docker) that we can use to inform settings for the dedicated hardware runs.

Throughput Results (sorted by MGas/s)

  ┌─────┬───────┬────────────────────────┬──────────┬────────┬────────┬───────┬──────────┬─────────────┬──────────────┐
  │  #  │  Run  │  Block / Gas / Scrape  │ Spammers │ MGas/s │ s/Ggas │  TPS  │ blocks/s │ non-empty % │    Status     │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 1   │ run21 │ 500ms / 2G / 250ms     │        8 │  383.0 │   2.61 │  98.0 │     0.21 │        57.0 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 2   │ run22 │ 500ms / 3G / 250ms     │        8 │  374.0 │   2.67 │  91.0 │     0.43 │       100.0 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 3   │ run18 │ 250ms / 1G / 250ms     │        6 │  361.8 │   2.76 │ 106.8 │     0.44 │        58.1 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 4   │ run19 │ 250ms / 2G / 250ms     │        8 │  353.3 │   2.83 │ 101.4 │     0.54 │       100.0 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 5   │ run23 │ 1s / 2G / 250ms        │        8 │  350.0 │   2.85 │  87.0 │     0.21 │        68.0 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 6   │ run25 │ 1s / 5G / 250ms        │       10 │  337.0 │   2.97 │  72.0 │     0.16 │       100.0 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 7   │ run24 │ 1s / 3G / 250ms        │        8 │  332.0 │   3.01 │ 132.0 │     0.38 │        75.0 │ PASS (retry) │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 8   │ run20 │ 500ms / 1G / 250ms     │        6 │  310.1 │   3.23 │  90.0 │     0.35 │        57.4 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 9   │ run07 │ 250ms / 300M / 250ms   │        4 │  297.1 │   3.37 │  59.5 │     0.99 │        48.6 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 10  │ run11 │ 500ms / 500M / 250ms   │        4 │  288.7 │   3.46 │  57.8 │     0.58 │        49.8 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 11  │ run04 │ 100ms / 100M / 50ms    │        4 │  283.4 │   3.53 │  56.8 │     2.84 │        49.9 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 12  │ run14 │ 1s / 500M / 50ms       │        4 │  245.5 │   4.07 │  49.2 │     0.99 │       100.0 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 13  │ run10 │ 500ms / 300M / 250ms   │        4 │  235.8 │   4.24 │  47.3 │     0.79 │        49.8 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 14  │ run13 │ 1s / 500M / 250ms      │        4 │  226.3 │   4.42 │  45.3 │     0.45 │        49.8 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 15  │ run03 │ 100ms / 100M / 250ms   │        4 │  200.5 │   4.99 │  40.2 │     2.01 │        29.6 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 16  │ run06 │ 250ms / 100M / 250ms   │        4 │  190.6 │   5.25 │  38.2 │     1.91 │        48.8 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 17  │ run12 │ 1s / 300M / 250ms      │        4 │  189.4 │   5.28 │  37.9 │     0.63 │        62.9 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 18  │ run15 │ 2s / 500M / 250ms      │        4 │  166.4 │   6.01 │  33.3 │     0.33 │        65.8 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 19  │ run02 │ 100ms / 30M / 50ms     │        4 │  163.8 │   6.10 │  32.8 │     5.47 │        55.4 │ PASS (retry) │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 20  │ run17 │ 2s / 500M / 50ms       │        4 │  125.4 │   7.98 │  25.1 │     0.50 │       100.0 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 21  │ run09 │ 500ms / 100M / 50ms    │        4 │  117.5 │   8.51 │  23.5 │     1.18 │        58.8 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 22  │ run08 │ 500ms / 100M / 250ms   │        4 │  115.5 │   8.66 │  23.1 │     1.16 │        57.6 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 23  │ run16 │ 2s / 300M / 250ms      │        4 │   96.4 │  10.37 │  19.3 │     0.32 │        63.7 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ 24  │ run05 │ 250ms / 30M / 250ms    │        4 │   64.9 │  15.40 │  13.0 │     2.17 │        54.3 │ PASS         │
  ├─────┼───────┼────────────────────────┼──────────┼────────┼────────┼───────┼──────────┼─────────────┼──────────────┤
  │ —   │ run01 │ 100ms / 30M / 250ms    │        4 │      — │      — │     — │        — │           — │ FAILED       │
  └─────┴───────┴────────────────────────┴──────────┴────────┴────────┴───────┴──────────┴─────────────┴──────────────┘

Summary by CodeRabbit

Release Notes

  • New Features

    • Benchmarking tests now support dynamic configuration via environment variables
    • Added ASCII flowchart visualization for distributed trace analysis
    • Switched tracing backend for improved trace collection and analysis capabilities
  • Tests

    • Enhanced benchmark test suite with improved trace result collection and reporting
    • Updated benchmark tests to use configuration-driven approach for greater flexibility

…mark

- Create test/e2e/benchmark/ subpackage with SpamoorSuite (testify/suite)
- Move spamoor smoke test into suite as TestSpamoorSmoke
- Split helpers into focused files: traces.go, output.go, metrics.go
- Introduce resultWriter for defer-based benchmark JSON output
- Export shared symbols from evm_test_common.go for cross-package use
- Restructure CI to fan-out benchmark jobs and fan-in publishing
- Run benchmarks on PRs only when benchmark-related files change
Resolve conflicts keeping the benchmark suite refactoring:
- benchmark.yml: keep path filters and suite-style test command
- evm_spamoor_smoke_test.go: keep deleted (moved to benchmark pkg)
- evm_test_common.go: keep exported types, drop writeTraceBenchmarkJSON
  (now in benchmark/output.go)
go test sets the working directory to the package under test, so the
env var should be relative to test/e2e/benchmark/, not test/e2e/.
go test treats all arguments after an unknown flag (--evm-binary) as
test binary args, so ./benchmark/ was never recognized as a package
pattern.
go test sets the cwd to the package directory (test/e2e/benchmark/),
so the binary path needs an extra parent traversal.
The benchmark package doesn't define the --binary flag that test-e2e
passes. It has its own CI workflow so it doesn't need to run here.
…nfig

collectBlockMetrics hit reth's 20K FilterLogs limit at high tx volumes.
Replace with direct header iteration over [startBlock, endBlock] and add
Phase 1 metrics: non-empty ratio, block interval p50/p99, gas/block and
tx/block p50/p99.

Optimize spamoor configuration for 100ms block time:
- --slot-duration 100ms, --startup-delay 0 on daemon
- throughput=50 per 100ms slot (500 tx/s per spammer)
- max_pending=50000 to avoid 3s block poll backpressure
- 5 staggered spammers with 50K txs each

Results: 55 MGas/s, 1414 TPS, 19.8% non-empty blocks (up from 6%).
- Move startBlock capture after spammer creation to exclude warm-up
- Replace 20s drain sleep with smart poll (waitForDrain)
- Add deleteAllSpammers cleanup to handle stale spamoor DB entries
- Lower trace sample rate to 10% to prevent Jaeger OOM
- make reth tag configurable via EV_RETH_TAG env var (default pr-140)
- fix OTLP config: remove duplicate env vars, use http/protobuf protocol
- use require.Eventually for host readiness polling
- rename requireHTTP to requireHostUp
- use non-fatal logging in resultWriter.flush deferred context
- fix stale doc comment (setupCommonEVMEnv -> SetupCommonEVMEnv)
- rename loop variable to avoid shadowing testing.TB convention
- add block/internal/executing/** to CI path trigger
- remove unused require import from output.go
# Conflicts:
#	scripts/test.mk
# Conflicts:
#	test/e2e/benchmark/suite_test.go
# Conflicts:
#	test/e2e/benchmark/suite_test.go
move EV_RETH_TAG resolution and rpc connection limits into setupEnv
so all benchmark tests share the same reth configuration. lower ERC20
spammer count from 5 to 2 to reduce resource contention on local
hardware while keeping the loop for easy scaling on dedicated infra.
- add blockMetricsSummary with summarize(), log(), and entries() methods
- add evNodeOverhead() for computing ProduceBlock vs ExecuteTxs overhead
- add collectTraces() suite method to deduplicate trace collection pattern
- add addEntries() convenience method on resultWriter
- slim TestERC20Throughput from ~217 to ~119 lines
- reuse collectTraces in TestSpamoorSmoke
Bumps tastora to pick up host network support in the spamoor builder.
Spamoor in external mode now uses host networking so it can resolve
the same hostnames as the host machine.
Record startTime when the provider is created and use it as the lower
bound for trace queries, preventing spans from previous runs being
included in the analysis.
chatton added 17 commits March 4, 2026 15:13
…run-without-infra

# Conflicts:
#	.github/workflows/benchmark.yml
#	test/e2e/benchmark/gasburner_test.go
#	test/e2e/benchmark/helpers.go
#	test/e2e/benchmark/spamoor_erc20_test.go
#	test/e2e/benchmark/suite_test.go
#	test/e2e/benchmark/traces.go
#	test/e2e/go.mod
#	test/e2e/go.sum
…metrics

- Make gas_units_to_burn, max_wallets, num_spammers, throughput, and
  warmup_txs configurable via BENCH_* env vars
- Add rethExecutionRate() for ev-reth GGas/s measurement
- Add engineSpanEntries() for ProduceBlock/GetPayload/NewPayload timing
- Switch local benchmarks from Jaeger to VictoriaTraces
- Add setupExternalEnv for running against pre-deployed infrastructure
- Update tastora to 2ee1b0a (victoriatraces support)
- benchConfig consolidates all BENCH_* env vars into a single struct
  constructed once per test via newBenchConfig(serviceName)
- benchmarkResult collects all output metrics (block summary, overhead,
  GGas/s, engine span timing, seconds_per_gigagas, span averages) and
  produces entries via a single entries() call
- Removes scattered envInt/envOrDefault calls from test files
- Removes manual entry-by-entry result assembly from each test
- Net reduction of ~129 lines across existing files
@coderabbitai
Contributor

coderabbitai bot commented Mar 12, 2026

📝 Walkthrough

Refactors e2e benchmark infrastructure from hard-coded configuration to environment-variable-driven setup, replaces Jaeger tracing with VictoriaTraces backend, and introduces ASCII flowchart visualization and unified benchmark result aggregation across tests.

Changes

  • Configuration & Metrics Infrastructure — test/e2e/benchmark/config.go, test/e2e/benchmark/result.go
    Introduces the environment-variable-driven benchConfig structure with sensible defaults, plus traceResult/benchmarkResult aggregation for unified metric logging and entry generation.
  • Span Visualization & Output — test/e2e/benchmark/flowchart.go, test/e2e/benchmark/output.go
    Adds ASCII flowchart rendering for trace trees with span hierarchies and aggregation; simplifies output.go to delegate span aggregation via spanAvgEntries.
  • Tracing Backend Migration — test/e2e/benchmark/traces.go, test/e2e/benchmark/suite_test.go
    Replaces Jaeger with VictoriaTraces via a new traceProvider abstraction; introduces richSpan for hierarchical representation and streaming LogsQL-based span collection with host extraction.
  • Benchmark Test Helpers — test/e2e/benchmark/helpers.go
    Adds span name constants, a requireSpammersRunning poller, rethExecutionRate computation, and a context-driven waitForMetricTarget with refactored metric-waiting logic.
  • Test Refactoring — test/e2e/benchmark/gasburner_test.go, test/e2e/benchmark/spamoor_erc20_test.go, test/e2e/benchmark/spamoor_smoke_test.go
    Converts tests from hard-coded constants to the newBenchConfig factory; unifies result reporting via benchmarkResult.entries(); updates gasburner with warmup and trace collection; refactors ERC20 to use the result flow.
  • Dependencies — test/e2e/go.mod
    Updates the tastora dependency pseudo-version from -0.20260302131806- to -0.20260312082036-.

Sequence Diagram(s)

sequenceDiagram
    participant Test as Benchmark Test
    participant Cfg as BenchConfig
    participant Spammer as Spammers
    participant VT as VictoriaTraces
    participant Result as BenchmarkResult
    
    Test->>Cfg: newBenchConfig(serviceName)
    Cfg-->>Cfg: Load BENCH_* env vars
    Cfg-->>Test: Return config with parameters
    
    Test->>Spammer: Start spammers (NumSpammers)
    Spammer->>VT: Emit trace spans via OTLP
    Spammer-->>Test: Spammers running
    
    Test->>Test: Execute benchmark workload
    Spammer->>VT: Send transaction traces
    
    Test->>VT: collectSpans(serviceName)
    VT-->>Test: Stream spans via LogsQL
    Test->>Test: Parse spans + extract hostNames
    Test-->>Test: Build richSpan hierarchy
    
    Test->>Result: newBenchmarkResult(name, metrics, traces)
    Result->>Result: Aggregate span durations
    Result->>Result: Build ASCII flowchart
    Result->>Test: Entries with summary + visualization
    
    Test->>Test: Log results & flowcharts

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • evstack/ev-node#3115: Adds gas-burner benchmark and supporting utilities to the refactored benchmark suite in gasburner_test.go and shared helpers.
  • evstack/ev-node#3114: Modifies benchmark codebase (helpers.go, traces.go, output.go, suite_test.go) with ERC20 benchmarking functionality and tracing helpers.
  • evstack/ev-node#3107: Reorganizes Spamoor e2e benchmark into testify-based suite, touching suite_test.go, traces.go, and spamoor test files with tracing infrastructure changes.

Suggested reviewers

  • alpe
  • julienrbrt
  • tac0turtle

Poem

🐰 Hops with joy as benchmarks now take flight,
Config-driven, traces glow with VictoriaLight!
Flowcharts bloom in ASCII art so fine,
From Jaeger's past to futures so divine,
Results aggregate in harmony true—huzzah!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description check — ✅ Passed. The PR description provides an overview of the three main changes: BENCH_* variable support, Jaeger→VictoriaTraces migration, and richSpan flowchart visualization, with substantial benchmark result data demonstrating practical outcomes.
  • Title check — ✅ Passed. The PR title accurately summarizes the main objective: enabling benchmark tests to run in either a dedicated environment or in Docker by making configuration dynamic and switching tracing backends.




github-actions bot commented Mar 12, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed | Format: ⏩ skipped | Lint: ✅ passed | Breaking: ✅ passed | Updated (UTC): Mar 18, 2026, 9:18 AM


claude bot commented Mar 12, 2026

Claude finished @chatton's task in 2m 0s.


Code Review

  • Read and analyze changed files
  • Review code quality & architecture
  • Review error handling & resilience
  • Review testing patterns
  • Review performance & resource efficiency
  • Compile and post final review

Good overall structure and clean separation of concerns. The refactoring into config.go, result.go, and flowchart.go is a clear improvement over the prior monolith. Below are findings grouped by severity.


Issues

1. waitForMetricTarget docstring vs. implementation mismatch (helpers.go:458)

The docstring says "fails the test on timeout" but the implementation only logs a warning and returns silently. If a test depends on this reaching the target, it may pass incorrectly when the warmup was incomplete.

2. setupExternalEnv: spamoor started before required env var check (suite_test.go:198-199)

BENCH_TRACE_QUERY_URL is checked with s.Require().NotEmpty after spamoor has already been built and started. If the env var is absent the test fails leaving spamoor cleanup to the deferred handler, which works — but the order is surprising and the spamoor spinup was wasteful. Consider moving the env var validation to the top of the function before any Docker work.

3. http.DefaultClient has no timeout (traces.go:169)

fetchLogStream uses http.DefaultClient.Do(req) which has no read or response header timeout. The context passed in provides cancellation, but if the server stalls mid-response bufio.Scanner.Scan() can block even after context cancellation. Prefer a short-lived http.Client with explicit timeouts or at least a ResponseHeaderTimeout.

4. Duplicate network roundtrip for rich and flat spans (suite_test.go:215, result.go:14)

collectTraces calls both collectSpans (flat) and collectRichSpans (rich) against the same VictoriaTraces endpoint. Each call re-fetches the same log stream. Since richSpan is a strict superset of a flat span, the flat spans could be derived from the rich response, halving the number of requests and eliminating the duplicate 3-minute timeout window.


Nits / Polish

5. envInt / envDuration silently swallow bad values (config.go:70-93)

Invalid env var values (e.g. BENCH_NUM_SPAMMERS=abc) silently fall back to defaults. Consider adding a t.Logf("warning: …") or at least a comment noting this is intentional, so operators aren't confused when misconfigured runs produce unexpected results.

6. buildTree: arbitrary root selection when multiple roots exist (flowchart.go:203-207)

If a trace has multiple candidate roots (no parent found in set), the one with the longest duration wins. This is a reasonable heuristic but can silently produce misleading trees if span IDs are missing from the collection window. A comment explaining the tie-breaking rule would help future readers.

7. buildAggregateTree aggregates by name across all positions (flowchart.go:305-317)

The dursByName map accumulates durations across all traces without considering position in the tree. If the same operation name appears in two different parent contexts (e.g. a retry span and a normal span both named "Engine.GetPayload"), their durations are averaged together. This can silently skew the aggregate view.

8. extractHostName allocates a full map[string]string per span (traces.go:276)

Each span line is unmarshalled twice: once fully into logsqlRichSpan and once into map[string]string just to scan for a key containing "host.name". For large trace collections this creates significant GC pressure. A simple strings.Index / bytes.Contains scan on the raw bytes would avoid the allocation.

9. RUST_LOG=debug in benchmark environment (suite_test.go:95)

Running ev-reth with RUST_LOG=debug generates a large volume of log output and adds non-trivial I/O overhead to the process under test. For benchmarks this likely distorts the results. Consider RUST_LOG=info or RUST_LOG=warn unless debug-level spans are explicitly needed.

10. smoke_test.go: time.Sleep(60s) still present (spamoor_smoke_test.go:68)

The reviewer alpe flagged a sleep in gasburner_test.go which has been correctly replaced with requireSpammersRunning. However the 60-second sleep in the smoke test is still a bare time.Sleep. Even for smoke tests it would be better to poll for a condition (e.g. enough transactions sent) rather than sleeping a fixed interval.

11. collectBlockMetrics makes 2N sequential RPC calls (helpers.go:487)

For each block in the range, HeaderByNumber and TransactionCount are called sequentially. For large ranges (hundreds of blocks) this adds measurable latency to test teardown. The header's GasUsed and TxHash/bloom fields could be used to determine emptiness without the extra TransactionCount call in many cases.
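The suggested optimization can be sketched as follows. headerFetcher stands in for ethclient's HeaderByNumber (GasUsed is the only header field the check needs), so the names here are illustrative rather than the PR's actual helpers.

```go
package main

// headerFetcher abstracts the per-block header lookup; in the real code
// this would wrap ethclient.Client.HeaderByNumber.
type headerFetcher func(n uint64) (gasUsed uint64, err error)

// countNonEmpty counts blocks in [start, end] that contain transactions
// using only the header: every transaction consumes gas, so GasUsed > 0
// implies at least one transaction and the extra TransactionCount RPC
// per block can be skipped, halving the call count for the range.
func countNonEmpty(fetch headerFetcher, start, end uint64) (int, error) {
	nonEmpty := 0
	for n := start; n <= end; n++ {
		gas, err := fetch(n)
		if err != nil {
			return 0, err
		}
		if gas > 0 {
			nonEmpty++
		}
	}
	return nonEmpty, nil
}
```

The remaining N header calls could additionally be issued concurrently with a bounded worker pool if teardown latency still matters.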

12. Missing documentation of BENCH_* env vars in one place

The config.go file is a good start, but suite_test.go introduces additional required vars (BENCH_ETH_RPC_URL, BENCH_PRIVATE_KEY, BENCH_TRACE_QUERY_URL) for external mode that aren't listed in benchConfig. A small doc comment block or README section enumerating all recognized BENCH_* variables (and which are required vs optional, and their defaults) would significantly help operators running against dedicated hardware.


Summary

The core architecture (config extraction, traceProvider interface, richSpan / flowchart additions) is well-designed. The main actionable items before merging from main are: the waitForMetricTarget docstring/behavior mismatch (#1), the env-var ordering in setupExternalEnv (#2), and the RUST_LOG=debug benchmark skew (#9). The rest are quality-of-life improvements that can follow in subsequent PRs.


codecov bot commented Mar 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.74%. Comparing base (6e0bd9a) to head (0a4c8a4).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3157      +/-   ##
==========================================
- Coverage   60.75%   60.74%   -0.01%     
==========================================
  Files         117      117              
  Lines       11988    11988              
==========================================
- Hits         7283     7282       -1     
- Misses       3886     3887       +1     
  Partials      819      819              
Flag: combined — Coverage Δ: 60.74% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@alpe alpe left a comment


Great progress.


// give spammers a moment to start, then verify none failed immediately
time.Sleep(3 * time.Second)
assertSpammersRunning(t, api, spammerIDs)

nit: avoid sleep.

	require.EventuallyWithT(t, func(collect *assert.CollectT) {
		assertSpammersRunning(t, api, spammerIDs)
	}, 3*time.Second, time.Second/10)

chatton added 2 commits March 18, 2026 09:02
- replace time.Sleep with require.EventuallyWithT for spammer checks
- use benchConfig env vars in TestERC20Throughput instead of hardcoded constants
- remove dead truncateID function
- fix stale Jaeger comment in smoke test
- deduplicate HTTP boilerplate in trace fetching via fetchLogStream helper
- fix fragile string comparison for ProduceBlock avg logging
- make waitForMetricTarget responsive to context cancellation
- add BENCH_WAIT_TIMEOUT env var support
@chatton chatton marked this pull request as ready for review March 18, 2026 09:18

chatton commented Mar 18, 2026

This PR is already getting a bit too large; I can do some cleanups in a follow-up if there are no major blockers once CI is green.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/e2e/benchmark/suite_test.go (1)

48-55: ⚠️ Potential issue | 🟠 Major

Pin the default reth image instead of latest.

Using a floating tag makes benchmark tests non-deterministic—the same commit can pull different binaries over time, causing baseline drift and failures unrelated to code changes. Default to an immutable release tag or digest, keeping EV_RETH_TAG as the override mechanism.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/benchmark/suite_test.go` around lines 48 - 55, The defaultRethTag
constant is set to the floating "latest" tag which makes rethTag()
non-deterministic; replace defaultRethTag with a pinned immutable release tag or
digest (e.g., a specific semver or sha256 digest) so tests pull a fixed image,
while keeping the EV_RETH_TAG environment variable override in rethTag() intact;
update any test docs/comments to reflect that the default is now a pinned
release rather than "latest".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/benchmark/config.go`:
- Around line 33-47: newBenchConfig currently relies on envInt/envDuration which
silently masks bad/misspelled or nonsensical values; change newBenchConfig (and
any other place constructing benchConfig) to validate parsed environment inputs
and return an error instead of falling back silently. Update the signature of
newBenchConfig to return (benchConfig, error), parse each env var (using
envInt/envDuration) for BENCH_NUM_SPAMMERS, BENCH_COUNT_PER_SPAMMER,
BENCH_THROUGHPUT, BENCH_WARMUP_TXS, BENCH_GAS_UNITS_TO_BURN, BENCH_MAX_WALLETS,
BENCH_WAIT_TIMEOUT, etc., enforce sensible bounds (e.g., non-negative integers,
positive durations, and explicit max limits), and return a descriptive error
when a value is out of range or unparseable; apply the same validation logic
wherever benchConfig is built so invalid BENCH_* overrides fail fast.

In `@test/e2e/benchmark/flowchart.go`:
- Around line 307-315: The code currently uses a per-trace dedupe map `seen`
when populating `countsByName` (in the loop that also fills `dursByName`), so
`countsByName` represents traces-per-operation but `renderAggregateTree` prints
it as "(%d calls)"; either remove the dedupe so you count every occurrence or
change the label to "traces". Fix: in the span-aggregation loop (the block that
creates `seen := make(map[string]bool)` and increments `countsByName[s.name]`),
either delete the `seen` logic and always do `countsByName[s.name]++` for each
span to count calls, or keep `seen` and update `renderAggregateTree` to display
"traces" instead of "calls"; apply the same change to the similar block around
lines 345-351 where `seen`/`countsByName` is used.

In `@test/e2e/benchmark/helpers.go`:
- Around line 456-482: waitForMetricTarget currently logs and returns on timeout
or ctx.Done(), allowing callers like TestGasBurner to continue as if successful;
change the two branches handling <-ctx.Done() and <-timer.C to fail the test
instead (use t.Fatalf with a clear message including name, target, and timeout)
so the benchmark stops immediately when the metric target isn't met; if you
prefer to propagate errors instead, change waitForMetricTarget's signature to
return an error and update callers (e.g., TestGasBurner) to handle the error
instead of assuming success.

In `@test/e2e/benchmark/spamoor_erc20_test.go`:
- Around line 81-86: The block metrics are including warm-up spans because the
trace window isn't reset before collecting traces; mirror TestGasBurner by
waiting for the warm-up gate and calling e.traces.resetStartTime() right before
calling s.collectTraces so engineSpanEntries/spanAvgEntries exclude
deploy/funding spans; specifically, add the same warm-up synchronization used in
TestGasBurner and invoke e.traces.resetStartTime() immediately before
collectTraces/ newBenchmarkResult("ERC20Throughput", ...) so both trace-derived
metrics and the block summary use the same post-warmup window.

In `@test/e2e/benchmark/suite_test.go`:
- Around line 231-237: The current best-effort rich-span collection calls
richSpanCollector.collectRichSpans twice and can block ~3 minutes each even when
results are optional; change the logic in the branch that checks
e.traces.(richSpanCollector) so: 1) only call rc.collectRichSpans(ctx,
"ev-reth") if len(tr.evReth) > 0 to gate that extra call, and 2) replace the
blocking collectRichSpans calls with a short-timeout or non-blocking/try variant
(or wrap the call with a short context timeout) when populating tr.evNodeRich
and tr.evRethRich to avoid multi-minute waits when rich spans are unavailable.
Ensure you update references to richSpanCollector, collectRichSpans,
serviceName, tr.evNodeRich, tr.evRethRich, and tr.evReth accordingly.
- Around line 166-173: The test currently logs and prints raw external secrets
(rpcURL and the BENCH_PRIVATE_KEY env var) and may include them in failure
messages; change logging to never include full URLs or private keys: in the
block using rpcURL, replace the s.Logf call to log only the mode or a sanitized
host (e.g., extract and log url.Host or mask credentials) and stop logging
privateKey entirely; update the s.Require().NotEmpty and s.Require().NoError
messages to avoid interpolating rpcURL or privateKey (use generic messages like
"failed to dial external RPC" or include the sanitized host only), and apply the
same sanitization/no-logging pattern to the other occurrences that reference
rpcURL/BENCH_* (the ethclient.Dial call, s.Require assertions, and any other
s.Logf around those vars).

In `@test/e2e/benchmark/traces.go`:
- Around line 61-67: The uiURL builder in victoriaTraceProvider (function uiURL)
interpolates start and end times directly which can contain '+' from RFC3339
offsets and become spaces in the query; update uiURL to URL-encode the start and
end parameters (e.g., use neturl.QueryEscape or build the query with url.Values)
so both start and end are escaped before concatenation, and apply the same
change to the other analogous builder around lines 155-162 to ensure both
timestamps are properly encoded.
- Line 90: The error messages currently include the raw trace query URL
(v.queryURL) which may contain sensitive data; remove the URL from returned
errors and instead report only the service name and status/error. Locate the
fmt.Errorf calls that reference v.queryURL (e.g., the timed out message using
"timed out waiting for %s traces from %s: %w") and change them to omit
v.queryURL, e.g. format with serviceName and ctx.Err() or a non-sensitive status
string; apply the same replacement for the other similar fmt.Errorf/log calls
that reference v.queryURL at the other noted locations (lines around 118, 145,
and 171-176). Ensure v.queryURL is not included in any returned error or test
output.
- Around line 205-214: The code currently discards errors from strconv.ParseInt
when computing ns and startNs and appends a malformed richSpan
(zero-time/duration); change the ParseInt calls to capture their errors and, if
either returns an error, skip the current row (or log the error) instead of
appending. Specifically, update the parsing around ns, startNs (the
strconv.ParseInt calls) and the subsequent append to richSpan so you check both
err values and continue to the next iteration when a parse fails; preserve use
of row.TraceID, row.SpanID, extractHostName(line), richSpan, startTime and
duration when the parses succeed.

---

Outside diff comments:
In `@test/e2e/benchmark/suite_test.go`:
- Around line 48-55: The defaultRethTag constant is set to the floating "latest"
tag which makes rethTag() non-deterministic; replace defaultRethTag with a
pinned immutable release tag or digest (e.g., a specific semver or sha256
digest) so tests pull a fixed image, while keeping the EV_RETH_TAG environment
variable override in rethTag() intact; update any test docs/comments to reflect
that the default is now a pinned release rather than "latest".

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c92be09c-4806-4c9f-9301-e1af376518a5

📥 Commits

Reviewing files that changed from the base of the PR and between 6e0bd9a and 0a4c8a4.

⛔ Files ignored due to path filters (1)
  • test/e2e/go.sum is excluded by !**/*.sum
📒 Files selected for processing (11)
  • test/e2e/benchmark/config.go
  • test/e2e/benchmark/flowchart.go
  • test/e2e/benchmark/gasburner_test.go
  • test/e2e/benchmark/helpers.go
  • test/e2e/benchmark/output.go
  • test/e2e/benchmark/result.go
  • test/e2e/benchmark/spamoor_erc20_test.go
  • test/e2e/benchmark/spamoor_smoke_test.go
  • test/e2e/benchmark/suite_test.go
  • test/e2e/benchmark/traces.go
  • test/e2e/go.mod

Comment on lines +33 to +47
func newBenchConfig(serviceName string) benchConfig {
	return benchConfig{
		ServiceName:     serviceName,
		BlockTime:       envOrDefault("BENCH_BLOCK_TIME", "100ms"),
		SlotDuration:    envOrDefault("BENCH_SLOT_DURATION", "250ms"),
		GasLimit:        envOrDefault("BENCH_GAS_LIMIT", ""),
		ScrapeInterval:  envOrDefault("BENCH_SCRAPE_INTERVAL", "1s"),
		NumSpammers:     envInt("BENCH_NUM_SPAMMERS", 2),
		CountPerSpammer: envInt("BENCH_COUNT_PER_SPAMMER", 2000),
		Throughput:      envInt("BENCH_THROUGHPUT", 200),
		WarmupTxs:       envInt("BENCH_WARMUP_TXS", 200),
		GasUnitsToBurn:  envInt("BENCH_GAS_UNITS_TO_BURN", 1_000_000),
		MaxWallets:      envInt("BENCH_MAX_WALLETS", 500),
		WaitTimeout:     envDuration("BENCH_WAIT_TIMEOUT", 10*time.Minute),
	}

⚠️ Potential issue | 🟠 Major

Fail fast on invalid BENCH_* overrides.

envInt/envDuration currently mask typos, and parseable but nonsensical values like negative counts or timeouts still flow through. That means a mis-set BENCH_NUM_SPAMMERS, BENCH_WARMUP_TXS, or BENCH_WAIT_TIMEOUT can still produce benchmark output for a different workload than the job requested. Please validate bounds when building benchConfig and return an error instead of silently defaulting.

As per coding guidelines, "Validate all inputs from external sources".

Also applies to: 68-94

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/benchmark/config.go` around lines 33 - 47, newBenchConfig currently
relies on envInt/envDuration which silently masks bad/misspelled or nonsensical
values; change newBenchConfig (and any other place constructing benchConfig) to
validate parsed environment inputs and return an error instead of falling back
silently. Update the signature of newBenchConfig to return (benchConfig, error),
parse each env var (using envInt/envDuration) for BENCH_NUM_SPAMMERS,
BENCH_COUNT_PER_SPAMMER, BENCH_THROUGHPUT, BENCH_WARMUP_TXS,
BENCH_GAS_UNITS_TO_BURN, BENCH_MAX_WALLETS, BENCH_WAIT_TIMEOUT, etc., enforce
sensible bounds (e.g., non-negative integers, positive durations, and explicit
max limits), and return a descriptive error when a value is out of range or
unparseable; apply the same validation logic wherever benchConfig is built so
invalid BENCH_* overrides fail fast.

Comment on lines +307 to +315
for _, spans := range byTrace {
	// count each unique operation once per trace
	seen := make(map[string]bool)
	for _, s := range spans {
		dursByName[s.name] = append(dursByName[s.name], s.duration)
		if !seen[s.name] {
			countsByName[s.name]++
			seen[s.name] = true
		}

⚠️ Potential issue | 🟡 Minor

The aggregate chart is labeling per-trace counts as calls.

countsByName is incremented once per trace because of seen, but renderAggregateTree prints that number as (%d calls). Any operation that appears multiple times in one trace will be underreported, so the aggregate chart becomes misleading. Either count every occurrence or relabel the output as traces.

Also applies to: 345-351

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/benchmark/flowchart.go` around lines 307 - 315, The code currently
uses a per-trace dedupe map `seen` when populating `countsByName` (in the loop
that also fills `dursByName`), so `countsByName` represents traces-per-operation
but `renderAggregateTree` prints it as "(%d calls)"; either remove the dedupe so
you count every occurrence or change the label to "traces". Fix: in the
span-aggregation loop (the block that creates `seen := make(map[string]bool)`
and increments `countsByName[s.name]`), either delete the `seen` logic and
always do `countsByName[s.name]++` for each span to count calls, or keep `seen`
and update `renderAggregateTree` to display "traces" instead of "calls"; apply
the same change to the similar block around lines 345-351 where
`seen`/`countsByName` is used.
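A standalone illustration of the two counting semantics the comment distinguishes (toy data, not the flowchart.go aggregation itself):

```go
package main

import "fmt"

func main() {
	// one trace in which "execute" appears twice
	spanNames := []string{"execute", "execute", "commit"}

	calls := map[string]int{}      // every occurrence
	tracesSeen := map[string]int{} // at most once per trace
	seen := map[string]bool{}
	for _, name := range spanNames {
		calls[name]++
		if !seen[name] {
			tracesSeen[name]++
			seen[name] = true
		}
	}
	// labeling tracesSeen as "calls" would report 1 instead of 2
	fmt.Println(calls["execute"], tracesSeen["execute"]) // 2 1
}
```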

Comment on lines 456 to 482
// waitForMetricTarget polls a metric getter function every 2s until the
// returned value >= target, or fails the test on timeout.
func waitForMetricTarget(t testing.TB, name string, poll func() (float64, error), target float64, timeout time.Duration) {
t.Helper()
deadline := time.Now().Add(timeout)
for time.Now().Before(deadline) {
ctx := t.Context()
timer := time.NewTimer(timeout)
defer timer.Stop()
ticker := time.NewTicker(2 * time.Second)
defer ticker.Stop()

for {
v, err := poll()
if err == nil && v >= target {
t.Logf("metric %s reached %.0f (target %.0f)", name, v, target)
return
}
time.Sleep(2 * time.Second)
select {
case <-ctx.Done():
t.Logf("metric %s: context cancelled (target %.0f)", name, target)
return
case <-timer.C:
t.Logf("metric %s did not reach target %.0f within %v", name, target, timeout)
return
case <-ticker.C:
}
}
t.Fatalf("metric %s did not reach target %.0f within %v", name, target, timeout)
}

⚠️ Potential issue | 🟠 Major

Don't let timeout/cancel turn into a partial "successful" benchmark.

On timeout or ctx.Done(), this helper only logs and returns. TestGasBurner then keeps collecting blocks and writes benchmark output even though the target metric was never reached. Please fail here or return an error that callers must handle.

💡 Suggested change
 func waitForMetricTarget(t testing.TB, name string, poll func() (float64, error), target float64, timeout time.Duration) {
 	t.Helper()
 	ctx := t.Context()
 	timer := time.NewTimer(timeout)
 	defer timer.Stop()
 	ticker := time.NewTicker(2 * time.Second)
 	defer ticker.Stop()
+	var lastValue float64
+	var lastErr error
 
 	for {
 		v, err := poll()
+		lastValue, lastErr = v, err
 		if err == nil && v >= target {
 			t.Logf("metric %s reached %.0f (target %.0f)", name, v, target)
 			return
 		}
 		select {
 		case <-ctx.Done():
-			t.Logf("metric %s: context cancelled (target %.0f)", name, target)
-			return
+			require.FailNowf(t, "metric wait cancelled",
+				"%s reached %.0f/%.0f before cancellation: %v", name, lastValue, target, ctx.Err())
 		case <-timer.C:
-			t.Logf("metric %s did not reach target %.0f within %v", name, target, timeout)
-			return
+			require.FailNowf(t, "metric target not reached",
+				"%s reached %.0f/%.0f within %v (last error: %v)", name, lastValue, target, timeout, lastErr)
 		case <-ticker.C:
 		}
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/benchmark/helpers.go` around lines 456 - 482, waitForMetricTarget
currently logs and returns on timeout or ctx.Done(), allowing callers like
TestGasBurner to continue as if successful; change the two branches handling
<-ctx.Done() and <-timer.C to fail the test instead (use t.Fatalf with a clear
message including name, target, and timeout) so the benchmark stops immediately
when the metric target isn't met; if you prefer to propagate errors instead,
change waitForMetricTarget's signature to return an error and update callers
(e.g., TestGasBurner) to handle the error instead of assuming success.

Comment on lines +81 to +86
// collect and report traces
traces := s.collectTraces(e, cfg.ServiceName)

if overhead, ok := evNodeOverhead(traces.evNode); ok {
	t.Logf("ev-node overhead: %.1f%%", overhead)
	w.addEntry(entry{Name: "ERC20Throughput - ev-node overhead", Unit: "%", Value: overhead})
}

w.addEntries(summary.entries("ERC20Throughput"))
w.addSpans(traces.allSpans())
result := newBenchmarkResult("ERC20Throughput", bm, traces)
s.Require().Greater(result.summary.SteadyState, time.Duration(0), "expected non-zero steady-state duration")
result.log(t, wallClock)
w.addEntries(result.entries())

⚠️ Potential issue | 🟠 Major

Trace-derived metrics still include warm-up work.

This test records block metrics after the launcher setup phase, but the trace window is never reset before collectTraces. The block summary therefore measures steady-state throughput while engineSpanEntries and spanAvgEntries still include deploy/funding spans. Mirror TestGasBurner's warm-up gate plus e.traces.resetStartTime() so both data sources describe the same window.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/benchmark/spamoor_erc20_test.go` around lines 81 - 86, The block
metrics are including warm-up spans because the trace window isn't reset before
collecting traces; mirror TestGasBurner by waiting for the warm-up gate and
calling e.traces.resetStartTime() right before calling s.collectTraces so
engineSpanEntries/spanAvgEntries exclude deploy/funding spans; specifically, add
the same warm-up synchronization used in TestGasBurner and invoke
e.traces.resetStartTime() immediately before collectTraces/
newBenchmarkResult("ERC20Throughput", ...) so both trace-derived metrics and the
block summary use the same post-warmup window.

Comment on lines +166 to +173
t.Logf("external mode: using RPC %s", rpcURL)

privateKey := os.Getenv("BENCH_PRIVATE_KEY")
s.Require().NotEmpty(privateKey, "BENCH_PRIVATE_KEY must be set in external mode")

// eth client
ethClient, err := ethclient.Dial(rpcURL)
s.Require().NoError(err, "failed to dial external RPC %s", rpcURL)

// collectTraces fetches ev-node traces (required) and ev-reth traces (optional)
// from Jaeger, then prints reports for both.
func (s *SpamoorSuite) collectTraces(e *env, serviceName string) *traceResult {
	t := s.T()
	tr := &traceResult{
		evNode: s.collectServiceTraces(e, serviceName),
		evReth: s.tryCollectServiceTraces(e, "ev-reth"),
	}
	e2e.PrintTraceReport(t, serviceName, tr.evNode)
	if len(tr.evReth) > 0 {
		e2e.PrintTraceReport(t, "ev-reth", tr.evReth)
	}
	return tr
}

// collectServiceTraces fetches traces from Jaeger for the given service and returns the spans.
func (s *SpamoorSuite) collectServiceTraces(e *env, serviceName string) []e2e.TraceSpan {
	ctx, cancel := context.WithTimeout(s.T().Context(), 3*time.Minute)
	defer cancel()

⚠️ Potential issue | 🟠 Major

Redact external BENCH_* URLs before logging them.

These values can carry basic-auth credentials, API tokens, or signed query params, and the current logs/failure output will copy them into CI artifacts. Log only the mode or a sanitized host identifier instead. As per coding guidelines: "Never expose private keys in logs or errors."

Also applies to: 198-200, 227-229

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/benchmark/suite_test.go` around lines 166 - 173, The test currently
logs and prints raw external secrets (rpcURL and the BENCH_PRIVATE_KEY env var)
and may include them in failure messages; change logging to never include full
URLs or private keys: in the block using rpcURL, replace the s.Logf call to log
only the mode or a sanitized host (e.g., extract and log url.Host or mask
credentials) and stop logging privateKey entirely; update the
s.Require().NotEmpty and s.Require().NoError messages to avoid interpolating
rpcURL or privateKey (use generic messages like "failed to dial external RPC" or
include the sanitized host only), and apply the same sanitization/no-logging
pattern to the other occurrences that reference rpcURL/BENCH_* (the
ethclient.Dial call, s.Require assertions, and any other s.Logf around those
vars).
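One way to do the sanitization the prompt suggests; `sanitizeRPCURL` is a hypothetical helper, not code from the PR:

```go
package main

import (
	"fmt"
	neturl "net/url"
)

// sanitizeRPCURL keeps only scheme and host, so userinfo, paths, and
// query parameters (tokens, signatures) never reach test logs.
func sanitizeRPCURL(raw string) string {
	u, err := neturl.Parse(raw)
	if err != nil || u.Host == "" {
		return "<invalid-url>"
	}
	return u.Scheme + "://" + u.Host
}

func main() {
	raw := "https://user:secret@rpc.example.com:8545/v1?token=abc"
	fmt.Println(sanitizeRPCURL(raw)) // https://rpc.example.com:8545
}
```

The `s.Logf` call would then log `sanitizeRPCURL(rpcURL)` rather than `rpcURL`, and the `s.Require` messages can drop the URL entirely.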

Comment on lines +231 to +237
if rc, ok := e.traces.(richSpanCollector); ok {
	if spans, err := rc.collectRichSpans(ctx, serviceName); err == nil {
		tr.evNodeRich = spans
	}
	if spans, err := rc.collectRichSpans(ctx, "ev-reth"); err == nil {
		tr.evRethRich = spans
	}

⚠️ Potential issue | 🟠 Major

Best-effort rich-span collection shouldn't add minutes of blocking.

This branch ignores rich-span errors, but collectRichSpans still waits up to 3 minutes per call. When rich spans are unavailable, optional diagnostics can add roughly 3-6 minutes of dead time to a run. Gate the ev-reth branch on len(tr.evReth) > 0, and use a short-timeout/try variant for rich spans as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/benchmark/suite_test.go` around lines 231 - 237, The current
best-effort rich-span collection calls richSpanCollector.collectRichSpans twice
and can block ~3 minutes each even when results are optional; change the logic
in the branch that checks e.traces.(richSpanCollector) so: 1) only call
rc.collectRichSpans(ctx, "ev-reth") if len(tr.evReth) > 0 to gate that extra
call, and 2) replace the blocking collectRichSpans calls with a short-timeout or
non-blocking/try variant (or wrap the call with a short context timeout) when
populating tr.evNodeRich and tr.evRethRich to avoid multi-minute waits when rich
spans are unavailable. Ensure you update references to richSpanCollector,
collectRichSpans, serviceName, tr.evNodeRich, tr.evRethRich, and tr.evReth
accordingly.

Comment on lines +61 to +67
func (v *victoriaTraceProvider) uiURL(serviceName string) string {
	query := fmt.Sprintf(`_stream:{resource_attr:service.name="%s"}`, serviceName)
	return fmt.Sprintf("%s/select/vmui/#/query?query=%s&start=%s&end=%s",
		strings.TrimRight(v.queryURL, "/"),
		neturl.QueryEscape(query),
		v.startTime.Format(time.RFC3339),
		time.Now().Format(time.RFC3339))

⚠️ Potential issue | 🟠 Major

Encode start and end when building VictoriaTraces URLs.

time.RFC3339 and time.RFC3339Nano emit +HH:MM on positive-offset machines. Interpolating that directly into the query string turns the + into a space, so local trace lookups break outside UTC. Build both URLs with url.Values (or escape start and end too).

Suggested fix
 func (v *victoriaTraceProvider) uiURL(serviceName string) string {
 	query := fmt.Sprintf(`_stream:{resource_attr:service.name="%s"}`, serviceName)
-	return fmt.Sprintf("%s/select/vmui/#/query?query=%s&start=%s&end=%s",
-		strings.TrimRight(v.queryURL, "/"),
-		neturl.QueryEscape(query),
-		v.startTime.Format(time.RFC3339),
-		time.Now().Format(time.RFC3339))
+	params := neturl.Values{
+		"query": []string{query},
+		"start": []string{v.startTime.UTC().Format(time.RFC3339)},
+		"end":   []string{time.Now().UTC().Format(time.RFC3339)},
+	}
+	return fmt.Sprintf("%s/select/vmui/#/query?%s",
+		strings.TrimRight(v.queryURL, "/"),
+		params.Encode())
 }
 ...
 	end := time.Now()
 	query := fmt.Sprintf(`_stream:{resource_attr:service.name="%s"}`, serviceName)
 	baseURL := strings.TrimRight(v.queryURL, "/")
-	url := fmt.Sprintf("%s/select/logsql/query?query=%s&start=%s&end=%s",
-		baseURL,
-		neturl.QueryEscape(query),
-		v.startTime.Format(time.RFC3339Nano),
-		end.Format(time.RFC3339Nano))
+	params := neturl.Values{
+		"query": []string{query},
+		"start": []string{v.startTime.UTC().Format(time.RFC3339Nano)},
+		"end":   []string{end.UTC().Format(time.RFC3339Nano)},
+	}
+	url := fmt.Sprintf("%s/select/logsql/query?%s", baseURL, params.Encode())

Also applies to: 155-162

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/benchmark/traces.go` around lines 61 - 67, The uiURL builder in
victoriaTraceProvider (function uiURL) interpolates start and end times directly
which can contain '+' from RFC3339 offsets and become spaces in the query;
update uiURL to URL-encode the start and end parameters (e.g., use
neturl.QueryEscape or build the query with url.Values) so both start and end are
escaped before concatenation, and apply the same change to the other analogous
builder around lines 155-162 to ensure both timestamps are properly encoded.


select {
case <-ctx.Done():
	return nil, fmt.Errorf("timed out waiting for %s traces from %s: %w", serviceName, v.queryURL, ctx.Err())

⚠️ Potential issue | 🟠 Major

Keep raw trace query URLs out of returned errors.

BENCH_TRACE_QUERY_URL is externally supplied and may include credentials or signed parameters. These messages bubble into failed test output, so keep the raw URL out of the error text and report only the service name / status instead. As per coding guidelines: "Never expose private keys in logs or errors."

Also applies to: 118-118, 145-145, 171-176

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/benchmark/traces.go` at line 90, The error messages currently
include the raw trace query URL (v.queryURL) which may contain sensitive data;
remove the URL from returned errors and instead report only the service name and
status/error. Locate the fmt.Errorf calls that reference v.queryURL (e.g., the
timed out message using "timed out waiting for %s traces from %s: %w") and
change them to omit v.queryURL, e.g. format with serviceName and ctx.Err() or a
non-sensitive status string; apply the same replacement for the other similar
fmt.Errorf/log calls that reference v.queryURL at the other noted locations
(lines around 118, 145, and 171-176). Ensure v.queryURL is not included in any
returned error or test output.

Comment on lines +205 to +214
ns, _ := strconv.ParseInt(row.Duration, 10, 64)
startNs, _ := strconv.ParseInt(row.StartTimeUnixNano, 10, 64)
spans = append(spans, richSpan{
	traceID:      row.TraceID,
	spanID:       row.SpanID,
	parentSpanID: row.ParentSpanID,
	name:         row.Name,
	hostName:     extractHostName(line),
	startTime:    time.Unix(0, startNs),
	duration:     time.Duration(ns) * time.Nanosecond,

⚠️ Potential issue | 🟡 Minor

Drop malformed rich-span rows instead of synthesizing zero-value spans.

Both ParseInt errors are discarded here, so a bad row becomes a span at Unix epoch with zero duration. That can scramble flowchart ordering and span summaries; skip the row (or log it) when either parse fails.

Suggested fix
-		ns, _ := strconv.ParseInt(row.Duration, 10, 64)
-		startNs, _ := strconv.ParseInt(row.StartTimeUnixNano, 10, 64)
+		ns, err := strconv.ParseInt(row.Duration, 10, 64)
+		if err != nil {
+			continue
+		}
+		startNs, err := strconv.ParseInt(row.StartTimeUnixNano, 10, 64)
+		if err != nil {
+			continue
+		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/benchmark/traces.go` around lines 205 - 214, The code currently
discards errors from strconv.ParseInt when computing ns and startNs and appends
a malformed richSpan (zero-time/duration); change the ParseInt calls to capture
their errors and, if either returns an error, skip the current row (or log the
error) instead of appending. Specifically, update the parsing around ns, startNs
(the strconv.ParseInt calls) and the subsequent append to richSpan so you check
both err values and continue to the next iteration when a parse fails; preserve
use of row.TraceID, row.SpanID, extractHostName(line), richSpan, startTime and
duration when the parses succeed.

@chatton chatton changed the title feat(benchmarking): [WIP] enable tests to run in dedicated environment or in docker feat(benchmarking): enable tests to run in dedicated environment or in docker Mar 18, 2026
chatton (Contributor, Author) commented Mar 18, 2026

Will clean up the richSpans/flowchart in a follow-up; this is still experimental.
