-
Notifications
You must be signed in to change notification settings - Fork 24
Add benchmark metrics interpretation guide #1407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Co-authored-by: rkuris <[email protected]>
Co-authored-by: rkuris <[email protected]>
| - **Interpretation**: | ||
| - Indicates the I/O amplification factor for write operations | ||
| - Lower values mean more efficient inserts | ||
| - Varies by benchmark type due to different access patterns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@copilot Let's add some more details here. Typically, reads per insert will increase as the size of the merkle trie gets bigger, so this number will steadily grow as the trie splits when more data is added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added explanation that reads per insert increases as the trie grows larger and more splitting occurs, typically stabilizing at tree depth 6-7. (00b0d46)
benchmark/README.md
Outdated
| - **Expected values**: | ||
| - `single`: ~2-5 (minimal tree traversal) | ||
| - `zipf`: ~5-15 (skewed access pattern reduces tree depth) | ||
| - `tenkrandom`: ~10-30 (uniform random access requires deeper traversals) | ||
| - **Poor performance**: Values 2-3x higher than expected may indicate cache or index issues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These numbers look very different from what we see. Typically the trie ends up at depth 6 or 7 depending on how many values are inserted. We should never see 10 unless the data all happens to have exactly the same prefixes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected the expected values to reflect typical trie depth of 6-7: single ~2-3, zipf ~4-7, tenkrandom ~6-8. Updated the poor performance note to indicate values >10 suggest issues. (00b0d46)
benchmark/README.md
Outdated
| - `single` benchmark: 10k-50k ops/sec (limited by commit overhead) | ||
| - `zipf`/`tenkrandom`: 50k-200k ops/sec (batched commits) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Due to caching, single actually outperforms the others. Single will always hit the cache and do very little I/O which is the limiting factor. I think these are backwards -- single should get 50k-200k and zipf about half that, and tenkrandom will be the slowest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the operation rate values - single is now correctly listed as fastest (50k-200k ops/sec) due to cache hits and minimal I/O, while zipf and tenkrandom are 25k-100k ops/sec. Updated all benchmark-specific sections for consistency. (00b0d46)
Co-authored-by: rkuris <[email protected]>
The benchmark documentation explained setup and execution but lacked guidance on interpreting collected metrics, leaving users unable to effectively analyze performance results or identify bottlenecks.
Changes
Added comprehensive "Understanding Metrics" section to
benchmark/README.mddocumenting:Metrics catalog: Cache performance (node/freelist hit rates, reads per insert), operation rates (insert/update/delete breakdown, merkle op types), throughput (proposals/sec), and system resources (dirty bytes)
Interpretation thresholds: "Good" vs "bad" performance criteria for each metric with quantitative values based on actual trie behavior
single, 4-7 forzipf, 6-8 fortenkrandom(reflecting typical trie depth of 6-7)single(highest due to cache hits), 25k-100k ops/sec forzipf/tenkrandomBenchmark-specific patterns: Expected metric distributions and performance characteristics for each benchmark type (
create,single,zipf,tenkrandom)Correlation analysis: How metrics interact (cache hit rate ↔ operation rate, dirty bytes ↔ storage bottleneck)
Bottleneck identification: Symptom-based troubleshooting guide mapping observed metric patterns to root causes
Grafana workflow: Recommended analysis sequence through dashboard panels
Comparison methodology: Guidelines for consistent cross-run comparisons and avoiding confounding factors
Metrics Reference Example
Original prompt
✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.