benchmark/README.md: 318 additions, 0 deletions
@@ -9,6 +9,7 @@ code and resources for the Firewood project.
- [Benchmark Description](#benchmark-description)
- [Installation](#installation)
- [Usage](#usage)
- [Understanding Metrics](#understanding-metrics)

## Introduction

@@ -170,3 +171,320 @@

```
docker run -p 127.0.0.1:4318:4318 -p 127.0.0.1:55679:55679 otel/openteleme
```

Then, pass the `-e` option to the benchmark.

## Understanding Metrics

This section explains the metrics collected during benchmarks, how to interpret them, and how to identify performance characteristics and bottlenecks.

### Overview

Firewood exposes Prometheus metrics that are visualized in the Grafana dashboard. These metrics track cache performance, database operations, throughput, and resource usage. Understanding these metrics is essential for analyzing benchmark results and identifying performance issues.

### Key Metrics Categories

#### Cache Performance Metrics

##### Node Cache Misses (read+deserialize)

- **Metric**: `firewood_cache_node{type="miss", mode!="open"}`
- **What it measures**: Rate of cache misses when reading nodes from storage, broken down by operation mode (e.g., read, write)
- **Unit**: Operations per second
- **Interpretation**:
- Lower values indicate better cache performance
- High miss rates suggest the cache is too small or access patterns are not cache-friendly
- Different modes (read/write) can have different miss patterns
- **Good performance**: Miss rate should be low relative to total operations (< 10% of total cache accesses)
- **Poor performance**: Consistently high miss rates (> 50%) indicate inadequate cache size or poor locality
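
As a reference point, the panel behind this metric can be reproduced with a Prometheus query along these lines (a sketch, assuming `firewood_cache_node` is a counter; the 5-minute window stands in for whatever `$__rate_interval` the dashboard uses):

```
# Per-second node cache misses, excluding the "open" mode, grouped by operation mode
sum by (mode) (rate(firewood_cache_node{type="miss", mode!="open"}[5m]))
```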

##### Cache Hit Rate

- **Metrics**:
- Node cache: `firewood_cache_node{type="hit"}` / `firewood_cache_node`
- Freelist cache: `firewood_cache_freelist{type="hit"}` / `firewood_cache_freelist`
- **What it measures**: Percentage of cache accesses that find data in cache (node cache and freelist cache tracked separately)
- **Unit**: Percentage (0-100%)
- **Interpretation**:
- Node cache hit rate: How often node reads find data in memory
- Freelist cache hit rate: How often free space lookups succeed in cache
- Higher is better - indicates effective caching
- **Good performance**:
- Node cache: > 90% for steady-state workloads
- Freelist cache: > 80% is typical
- **Poor performance**:
- Node cache: < 70% indicates cache thrashing
- May need to increase `--cache-size` parameter
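
As a sketch, the hit rates above translate into PromQL like this (assuming both cache metrics are counters; the window is illustrative):

```
# Node cache hit rate (%): hits divided by all node cache accesses
100 * sum(rate(firewood_cache_node{type="hit"}[5m])) / sum(rate(firewood_cache_node[5m]))

# Freelist cache hit rate (%)
100 * sum(rate(firewood_cache_freelist{type="hit"}[5m])) / sum(rate(firewood_cache_freelist[5m]))
```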

##### Reads per Insert

- **Metric**: `sum(firewood_cache_node) / sum(firewood_insert)`
- **What it measures**: Average number of node reads required per insert operation
- **Unit**: Ratio (reads/insert)
- **Interpretation**:
- Indicates the I/O amplification factor for write operations
- Lower values mean more efficient inserts
- Varies by benchmark type due to different access patterns
- This number will steadily grow as the trie gets larger and more splitting occurs, typically stabilizing once the trie reaches a depth of 6-7
- **Expected values**:
- `single`: ~2-3 (minimal tree traversal, single node update)
- `zipf`: ~4-7 (skewed access pattern with hot keys)
- `tenkrandom`: ~6-8 (uniform random access across the trie)
- **Poor performance**: Values significantly higher than expected (>10) may indicate cache issues or data with identical prefixes causing excessive depth
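
The dashboard expression above divides the raw counters over the whole run; a windowed variant (a sketch, same assumptions as above) keeps the ratio meaningful for any selected time range:

```
# Average node reads per insert over the last 5 minutes
sum(rate(firewood_cache_node[5m])) / sum(rate(firewood_insert[5m]))
```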

#### Operation Rate Metrics

##### Operation Rate

- **Metrics**:
- `firewood_remove{prefix="true/false"}` - Removals (by prefix or exact key)
- `firewood_insert{merkle!="update"}` - New insertions
- `firewood_insert{merkle="update"}` - Updates to existing keys
- **What it measures**: Rate of database operations per second
- **Unit**: Operations per second
- **Interpretation**:
- Shows the throughput of different operation types
- Removal operations split by whether they're prefix-based or exact-key
- Inserts split between new keys and updates
- **Good performance**:
- Depends on hardware and benchmark type
- Rates should be stable over time during steady-state
- `single` benchmark: 50k-200k ops/sec (highest due to cache hits and minimal I/O)
- `zipf`: 25k-100k ops/sec (moderate performance with skewed access)
- `tenkrandom`: 25k-100k ops/sec (slowest due to random access patterns)
- **Poor performance**:
- Declining operation rates over time
- High variance in operation rates
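
Per-second versions of these series can be queried roughly as follows (a sketch, assuming the metrics are counters):

```
# Removals per second, split into prefix-based vs. exact-key
sum by (prefix) (rate(firewood_remove[5m]))

# New insertions per second (any merkle operation other than "update")
sum(rate(firewood_insert{merkle!="update"}[5m]))

# Updates to existing keys per second
sum(rate(firewood_insert{merkle="update"}[5m]))
```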

##### Insert Merkle Ops by Type

- **Metric**: `firewood_insert{merkle="update/above/below/split"}`
- **What it measures**: Categorizes insertions by the type of merkle tree operation performed
- **Unit**: Operations per second
- **Interpretation**:
- `update`: Updating an existing leaf node (fastest)
- `above`: Inserting above an existing node in the tree
- `below`: Inserting below an existing node
- `split`: Splitting a node to accommodate new data (most expensive)
- **Expected distribution**:
- `single`: Almost all `update` operations
- `zipf`: Mostly `update` with occasional `split`
- `tenkrandom`: Mix of all types, more `split` operations
- **Poor performance**: Excessive `split` operations indicate tree fragmentation
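
A single query can produce one series per operation type for this panel (a sketch):

```
# Insert rate broken down by merkle operation type (update / above / below / split)
sum by (merkle) (rate(firewood_insert[5m]))
```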

#### Throughput Metrics

##### Proposals Submitted (Blocks Processed)

- **Metric**: `firewood_proposals`
- **What it measures**: Total number of proposals (transactions/batches) committed to the database
- **Unit**: Cumulative count
- **Interpretation**:
- Each proposal represents a committed batch of operations
- Rate of increase shows throughput in terms of batches per second
- Use `irate()` or `rate()` to see proposals per second
- **Good performance**: Steady linear increase over time
- **Poor performance**: Stalls, drops, or irregular patterns indicate commit issues
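
For example (a sketch; the windows are assumptions, not the dashboard's exact settings):

```
# Proposals (committed batches) per second
irate(firewood_proposals[5m])

# Proposals committed over the last hour
increase(firewood_proposals[1h])
```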

#### System Resource Metrics

##### Dirty Bytes (OS pending write to disk)

- **Metric**: `node_memory_Dirty_bytes`
- **What it measures**: Amount of modified data in OS page cache waiting to be written to disk
- **Unit**: Bytes
- **Interpretation**:
- Shows backlog of data waiting to be flushed to storage
- High values indicate write pressure on storage subsystem
- Can cause latency spikes when kernel forces flushes
- **Good performance**: Stays relatively stable and bounded
- **Poor performance**: Constantly growing or very high values (> 1GB) indicate I/O bottleneck
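
Because this is a gauge exported by node_exporter, it can be graphed directly; a simple threshold check mirroring the guidance above might look like this (a sketch):

```
# Dirty page-cache bytes on the benchmark host
node_memory_Dirty_bytes

# Flag sustained write pressure: dirty bytes above ~1 GB
node_memory_Dirty_bytes > 1e9
```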

### Metric Correlation and Analysis

Understanding how metrics relate to each other helps identify root causes:

#### Cache Performance ↔ Operation Rate

- Low cache hit rate typically correlates with lower operation rates
- Improving cache size/strategy should improve both metrics

#### Operation Rate ↔ Dirty Bytes

- High operation rates increase dirty bytes
- Storage can become a bottleneck if dirty bytes grow unbounded

#### Reads per Insert ↔ Cache Hit Rate

- Poor cache hit rate increases reads per insert
- Both metrics should improve together with cache tuning

### Benchmark-Specific Metrics Interpretation

Different benchmarks exhibit different metric patterns:

#### `create` Benchmark

- **Focus**: Initial database population throughput
- **Key metrics**: Operation rate, proposals per second
- **Expected pattern**: Steady operation rate, increasing proposals count
- **Notes**: The cache is less relevant because there is no existing data to cache

#### `single` Benchmark

- **Focus**: Minimal transaction overhead, single-row update performance
- **Key metrics**: Proposals per second, update operation rate
- **Expected pattern**:
- Nearly all operations are `update` type
- Very high cache hit rate (> 99%)
- Low reads per insert (~2-3)
- Highest proposals/sec rate due to cache hits and minimal I/O
- **What constitutes good performance**: 50k-200k commits/sec depending on hardware

#### `zipf` Benchmark

- **Focus**: Realistic skewed workload (hot keys get more updates)
- **Key metrics**: Cache hit rate, operation rate, insert types distribution
- **Expected pattern**:
- High cache hit rate (90%+) due to hot key concentration
- Mostly `update` operations with some `split`
- Medium reads per insert (~4-7)
- **What constitutes good performance**: 25k-100k ops/sec with > 90% cache hit rate

#### `tenkrandom` Benchmark

- **Focus**: Uniform random access with mixed operations (insert/update/delete)
- **Key metrics**: All metrics relevant, cache hit rate critical
- **Expected pattern**:
- Moderate cache hit rate (depends on cache size vs. database size)
- Mix of all insert types
- Higher reads per insert (~6-8)
- Balanced insert/update/delete rates
- **What constitutes good performance**: 25k-100k ops/sec with stable cache metrics

### Comparing Results Across Runs

To effectively compare benchmark results:

1. **Ensure consistent configuration**:
- Same `--cache-size`, `--batch-size`, `--number-of-batches`
- Same database size for steady-state tests
- Same hardware or account for differences

2. **Compare steady-state metrics**:
- Ignore warm-up period (first few minutes)
- Compare averages over stable time windows (see the query sketch after this list)
- Look for trends (improving/degrading over time)

3. **Key comparison metrics**:
- Operation rate (ops/sec)
- Proposals per second
- Cache hit rates
- Reads per insert ratio

4. **Use Grafana time range selection**:
- Select the same time window for comparison
- Use "Compare" feature to overlay different runs

### Identifying Performance Bottlenecks

#### Symptom: Low operation rates

- Check: Cache hit rate
- If low (< 70%): Increase cache size
- Check: Dirty bytes
- If high and growing: Storage I/O bottleneck, consider faster storage
- Check: Reads per insert
- If very high: Possible trie fragmentation or suboptimal access patterns

#### Symptom: Declining performance over time

- Check: Cache hit rate trend
- If declining: Database growing beyond cache effectiveness
- Check: Insert merkle ops distribution
- If increasing `split` operations: Tree fragmentation increasing
- Check: Dirty bytes trend
- If growing: Storage falling behind

#### Symptom: High variability in metrics

- Check: System resources (CPU, memory, I/O)
- May indicate interference from other processes
- Consider dedicated benchmark environment
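
If node_exporter is running on the benchmark host (as it is for the dirty-bytes panel), CPU and disk saturation can be checked with standard queries such as these (a sketch):

```
# CPU busy fraction per host (1 = fully busy)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Fraction of time each disk spent servicing I/O (1 = saturated)
rate(node_disk_io_time_seconds_total[5m])
```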

#### Symptom: Poor cache hit rate despite large cache

- Check: Reads per insert
- If very high: Access pattern is not cache-friendly
- Consider: Different cache read strategy (see `--cache-read-strategy` option)
- Check: Database size vs. cache size ratio

### Understanding Grafana Dashboard Panels

The Grafana dashboard organizes metrics into logical groups:

1. **Cache** section: Node cache misses and hit rates - start here to assess cache efficiency
2. **Throughput** section: Overall system throughput (proposals submitted)
3. **Operation Rate** section: Detailed breakdown of operation types and rates
4. **Internals** section: Deep-dive metrics like reads per insert
5. **System** section: OS-level metrics like dirty bytes

**Recommended analysis workflow**:

1. Start with "Proposals Submitted" to see overall throughput
2. Check "Cache hit rate" to assess cache effectiveness
3. Review "Operation Rate" to understand workload composition
4. Examine "Reads per insert" for write amplification
5. Monitor "Dirty bytes" for storage pressure

### Troubleshooting Anomalous Results

#### Sudden drops in operation rate

- Possible causes:
- Allocator stalls or fragmentation (when using the default system allocator instead of jemalloc)
- Kernel forced flush of dirty pages
- Background OS operations
- Solutions:
- Use jemalloc allocator (already configured in benchmark)
- Tune kernel write-back parameters
- Ensure dedicated benchmark environment

#### Cache hit rate suddenly drops

- Possible causes:
- Database size crossed cache capacity threshold
- Benchmark switched phases (e.g., from warm-up to steady-state)
- Cache eviction policy pressure
- Solutions:
- Increase `--cache-size`
- Reduce `--number-of-batches` to keep the database smaller
- Check if pattern is expected for benchmark type

#### Irregular/spiky metrics

- Possible causes:
- Insufficient warm-up time
- Interference from other processes
- Storage device issues (e.g., SSD garbage collection)
- Solutions:
- Allow longer warm-up period
- Use dedicated hardware
- Monitor system-level metrics

#### Very high dirty bytes

- Possible causes:
- Storage I/O bottleneck
- Kernel writeback settings too permissive
- Solutions:
- Use faster storage (NVMe SSD)
- Tune kernel parameters: `/proc/sys/vm/dirty_*`
- Reduce operation rate if necessary

### Tips for Optimal Benchmarking

1. **Warm-up period**: Run for 5-10 minutes before measuring to allow caches to stabilize
2. **Duration**: Run for at least 30 minutes to get representative results
3. **Isolation**: Minimize other processes on benchmark machine
4. **Consistency**: Keep configuration parameters consistent across comparison runs
5. **Monitoring**: Always monitor system resources (CPU, memory, I/O) alongside Firewood metrics
6. **Documentation**: Record configuration parameters with results for reproducibility