Commit 47dc4c7

Copilot and rkuris committed
Move metrics documentation to METRICS.md and add per-process note
Co-authored-by: rkuris <[email protected]>
1 parent f9fa291 commit 47dc4c7

2 files changed: +292 -287 lines changed

METRICS.md

Lines changed: 291 additions & 0 deletions
@@ -0,0 +1,291 @@
1+
# Firewood Metrics
2+
3+
Firewood provides comprehensive metrics for monitoring database performance, resource utilization, and operational characteristics. These metrics are built using the [Prometheus](https://prometheus.io/) format and can be exposed for collection by monitoring systems.
4+
5+
**Note**: Metric names in this documentation use dots (e.g., `firewood.proposal.commit`), but when exported to Prometheus, dots are automatically converted to underscores (e.g., `firewood_proposal_commit`) following Prometheus naming conventions.
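
The renaming can be illustrated with a minimal sketch. The helper name `to_prometheus_name` is hypothetical and is not part of Firewood or the exporter; the real exporter applies its own sanitization rules, which may cover more characters than dots.

```rust
/// Illustrative only: mirrors the dot-to-underscore renaming described above.
/// This is an assumption-level sketch, not the exporter's actual sanitizer.
fn to_prometheus_name(name: &str) -> String {
    // Replace every '.' separator with '_' and leave other characters as-is.
    name.replace('.', "_")
}
```

For example, `to_prometheus_name("firewood.proposal.commit")` yields the name used in the query examples, `firewood_proposal_commit`.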
6+
7+
## Enabling Metrics
8+
9+
Metrics are available when Firewood is built with the `metrics` feature. By default, metrics collection is enabled in the library but needs to be explicitly started in applications.
10+
11+
**Important**: Only one metrics instance can be created per process. Attempting to initialize metrics multiple times will result in an error.
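
The once-per-process rule behaves like a process-wide, set-once guard. The sketch below illustrates the pattern with the standard library only; `MetricsHandle` and `init_metrics` are hypothetical names, and this is not how Firewood itself is implemented.

```rust
use std::sync::OnceLock;

// Hypothetical stand-in for a process-wide metrics recorder.
struct MetricsHandle {
    port: u16,
}

// At most one handle can ever be stored here for the life of the process.
static METRICS: OnceLock<MetricsHandle> = OnceLock::new();

/// Succeeds on the first call and returns an error on every later call.
fn init_metrics(port: u16) -> Result<&'static MetricsHandle, String> {
    METRICS
        .set(MetricsHandle { port })
        .map_err(|_| "metrics already initialized".to_string())?;
    Ok(METRICS.get().expect("handle was just stored"))
}
```

Under this pattern an application would typically initialize metrics once at startup and treat a second initialization as a programming error rather than retrying.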
12+
13+
### For Rust Applications
14+
15+
Metrics are automatically registered when the instrumented code paths are executed. To expose metrics via HTTP:
16+
17+
```rust
use metrics_exporter_prometheus::PrometheusBuilder;

// Install the Prometheus recorder and its HTTP exporter.
// The builder's default HTTP listener uses port 9000.
PrometheusBuilder::new()
    .install()
    .expect("failed to install Prometheus recorder");
```
25+
26+
### For FFI/Go Applications
27+
28+
In the Go FFI layer, metrics must be explicitly enabled:
29+
30+
```go
import "github.com/ava-labs/firewood-go-ethhash/ffi"

// Choose ONE of the two options below: metrics can only be started
// once per process, so calling both results in an error.

// Option 1: Start metrics with HTTP exporter on a specific port
ffi.StartMetricsWithExporter(9000)

// Option 2: Start metrics without exporter (use Gatherer to access)
ffi.StartMetrics()

// Retrieve metrics programmatically
gatherer := ffi.Gatherer{}
metrics, err := gatherer.Gather()
```
43+
44+
See the [FFI README](ffi/README.md) for more details on FFI metrics configuration.
45+
46+
## Available Metrics
47+
48+
### Database Operations
49+
50+
#### Proposal Metrics
51+
52+
- **`firewood.proposals`** (counter)
53+
- Description: Total number of proposals created
54+
- Use: Track proposal creation rate and throughput
55+
56+
- **`firewood.proposal.create`** (counter with `success` label)
57+
- Description: Count of proposal creation operations
58+
- Labels: `success=true|false`
59+
- Use: Monitor proposal creation success rate
60+
61+
- **`firewood.proposal.create_ms`** (counter with `success` label)
62+
- Description: Time spent creating proposals in milliseconds
63+
- Labels: `success=true|false`
64+
- Use: Track proposal creation latency
65+
66+
- **`firewood.proposal.commit`** (counter with `success` label)
67+
- Description: Count of proposal commit operations
68+
- Labels: `success=true|false`
69+
- Use: Monitor commit success rate
70+
71+
- **`firewood.proposal.commit_ms`** (counter with `success` label)
72+
- Description: Time spent committing proposals in milliseconds
73+
- Labels: `success=true|false`
74+
- Use: Track commit latency and identify slow commits
75+
76+
#### Revision Management
77+
78+
- **`firewood.active_revisions`** (gauge)
79+
- Description: Current number of active revisions in memory
80+
- Use: Monitor memory usage and revision retention
81+
82+
- **`firewood.max_revisions`** (gauge)
83+
- Description: Maximum number of revisions configured
84+
- Use: Track configuration setting
85+
86+
### Merkle Trie Operations
87+
88+
#### Insert Operations
89+
90+
- **`firewood.insert`** (counter with `merkle` label)
91+
- Description: Count of insert operations by type
92+
- Labels: `merkle=update|above|below|split`
93+
- `update`: Value updated at existing key
94+
- `above`: New node inserted above existing node
95+
- `below`: New node inserted below existing node
96+
- `split`: Node split during insertion
97+
- Use: Understand insert patterns and trie structure evolution
98+
99+
#### Remove Operations
100+
101+
- **`firewood.remove`** (counter with `prefix` and `result` labels)
102+
- Description: Count of remove operations
103+
- Labels:
104+
- `prefix=true|false`: Whether operation is prefix-based removal
105+
- `result=success|nonexistent`: Whether key(s) were found
106+
- Use: Track deletion patterns and key existence
107+
108+
### Storage and I/O Metrics
109+
110+
#### Node Reading
111+
112+
- **`firewood.read_node`** (counter with `from` label)
113+
- Description: Count of node reads by source
114+
- Labels: `from=file|memory`
115+
- Use: Monitor read patterns and storage layer usage
116+
117+
#### Cache Performance
118+
119+
- **`firewood.cache.node`** (counter with `mode` and `type` labels)
120+
- Description: Node cache hit/miss statistics
121+
- Labels:
122+
- `mode`: Read operation mode
123+
- `type=hit|miss`: Cache hit or miss
124+
- Use: Evaluate cache effectiveness for nodes
125+
126+
- **`firewood.cache.freelist`** (counter with `type` label)
127+
- Description: Free list cache hit/miss statistics
128+
- Labels: `type=hit|miss`
129+
- Use: Monitor free list cache efficiency
130+
131+
#### I/O Operations
132+
133+
- **`firewood.io.read`** (counter)
134+
- Description: Total number of I/O read operations
135+
- Use: Track I/O operation count
136+
137+
- **`firewood.io.read_ms`** (counter)
138+
- Description: Total time spent in I/O reads in milliseconds
139+
- Use: Identify I/O bottlenecks and disk performance issues
140+
141+
#### Node Persistence
142+
143+
- **`firewood.flush_nodes`** (counter)
144+
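- Description: Cumulative time spent flushing nodes to disk in milliseconds (the counter is incremented by each flush's duration)
- Use: Monitor flush performance and identify slow disk writes; `rate()` on this counter gives milliseconds spent flushing per second. No flush count is listed here, so a per-flush average cannot be derived from this metric alone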
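
Because this counter accumulates milliseconds, the difference between two scrapes divided by the scrape interval gives the share of wall-clock time spent flushing. The sketch below shows that arithmetic; the function name and parameters are illustrative assumptions, not Firewood APIs.

```rust
/// Fraction of the interval spent flushing, from two readings of a
/// cumulative-milliseconds counter. Returns None if the interval is zero
/// or the counter went backwards (for example, after a process restart).
fn flush_busy_fraction(prev_ms: u64, curr_ms: u64, interval_ms: u64) -> Option<f64> {
    if interval_ms == 0 {
        return None;
    }
    // checked_sub yields None when curr_ms < prev_ms (counter reset).
    let spent = curr_ms.checked_sub(prev_ms)?;
    Some(spent as f64 / interval_ms as f64)
}
```

A value close to 1.0 would mean flushing occupied nearly the whole interval.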
146+
147+
### Memory Management
148+
149+
#### Space Allocation
150+
151+
- **`firewood.space.reused`** (counter with `index` label)
152+
- Description: Bytes reused from free list
153+
- Labels: `index`: Size index of allocated area
154+
- Use: Track memory reuse efficiency
155+
156+
- **`firewood.space.wasted`** (counter with `index` label)
157+
- Description: Bytes wasted when allocating from free list (allocated more than needed)
158+
- Labels: `index`: Size index of allocated area
159+
- Use: Monitor allocation efficiency and fragmentation
160+
161+
- **`firewood.space.from_end`** (counter with `index` label)
162+
- Description: Bytes allocated from end of nodestore when free list was insufficient
163+
- Labels: `index`: Size index of allocated area
164+
- Use: Track database growth and free list effectiveness
165+
166+
- **`firewood.space.freed`** (counter with `index` label)
167+
- Description: Bytes freed back to free list
168+
- Labels: `index`: Size index of freed area
169+
- Use: Monitor memory reclamation
170+
171+
#### Node Management
172+
173+
- **`firewood.delete_node`** (counter with `index` label)
174+
- Description: Count of nodes deleted
175+
- Labels: `index`: Size index of deleted node
176+
- Use: Track node deletion patterns
177+
178+
#### Ring Buffer
179+
180+
- **`ring.full`** (counter)
181+
- Description: Count of times the ring buffer became full during node flushing
182+
- Use: Identify backpressure in node persistence pipeline
183+
184+
### FFI Layer Metrics
185+
186+
These metrics are specific to the Foreign Function Interface (Go) layer:
187+
188+
#### Batch Operations
189+
190+
- **`firewood.ffi.batch`** (counter)
191+
- Description: Count of batch operations completed
192+
- Use: Track FFI batch throughput
193+
194+
- **`firewood.ffi.batch_ms`** (counter)
195+
- Description: Time spent processing batches in milliseconds
196+
- Use: Monitor FFI batch latency
197+
198+
#### Proposal Operations
199+
200+
- **`firewood.ffi.propose`** (counter)
201+
- Description: Count of proposal operations via FFI
202+
- Use: Track FFI proposal throughput
203+
204+
- **`firewood.ffi.propose_ms`** (counter)
205+
- Description: Time spent creating proposals via FFI in milliseconds
206+
- Use: Monitor FFI proposal latency
207+
208+
#### Commit Operations
209+
210+
- **`firewood.ffi.commit`** (counter)
211+
- Description: Count of commit operations via FFI
212+
- Use: Track FFI commit throughput
213+
214+
- **`firewood.ffi.commit_ms`** (counter)
215+
- Description: Time spent committing via FFI in milliseconds
216+
- Use: Monitor FFI commit latency
217+
218+
#### View Caching
219+
220+
- **`firewood.ffi.cached_view.hit`** (counter)
221+
- Description: Count of cached view hits
222+
- Use: Monitor view cache effectiveness
223+
224+
- **`firewood.ffi.cached_view.miss`** (counter)
225+
- Description: Count of cached view misses
226+
- Use: Monitor view cache effectiveness
227+
228+
## Interpreting Metrics
229+
230+
### Performance Monitoring
231+
232+
1. **Latency Tracking**: The `*_ms` metrics track operation durations. Monitor these for:
233+
- Sudden increases indicating performance degradation
234+
- Baseline establishment for SLA monitoring
235+
- Correlation with system load
236+
237+
2. **Throughput Monitoring**: Counter metrics without `_ms` suffix track operation counts:
238+
- Rate of change indicates throughput
239+
- Compare with expected load patterns
240+
- Identify anomalies in operation rates
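
Each `*_ms` counter is paired with a count counter, so mean latency over a window is the change in milliseconds divided by the change in operations. A minimal sketch of that calculation follows; the `Sample` type and `mean_latency_ms` function are hypothetical helpers for illustration.

```rust
/// One scrape of a paired operation-count and cumulative-milliseconds counter.
#[derive(Clone, Copy)]
struct Sample {
    count: u64,
    total_ms: u64,
}

/// Mean latency in milliseconds per operation between two scrapes.
/// Returns None if no operations completed or either counter was reset.
fn mean_latency_ms(prev: Sample, curr: Sample) -> Option<f64> {
    let ops = curr.count.checked_sub(prev.count)?;
    let ms = curr.total_ms.checked_sub(prev.total_ms)?;
    if ops == 0 {
        return None;
    }
    Some(ms as f64 / ops as f64)
}
```

This is the same ratio a Prometheus query computes when it divides the rate of a `*_ms` counter by the rate of its count counter.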
241+
242+
### Resource Utilization
243+
244+
1. **Cache Efficiency**:
245+
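- Calculate hit rate: `hits / (hits + misses)`, where hits and misses are the `type="hit"` and `type="miss"` series of `firewood.cache.node` or `firewood.cache.freelist`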
246+
- Target: >90% for node cache, >80% for free list cache
247+
- Low hit rates may indicate insufficient cache size
248+
249+
2. **Memory Management**:
250+
- Monitor `space.reused` vs `space.from_end` ratio
251+
- High `space.from_end` indicates database growth
252+
- High `space.wasted` suggests fragmentation issues
253+
254+
3. **Active Revisions**:
255+
- `active_revisions` approaching `max_revisions` triggers cleanup
256+
- Sustained high values may indicate memory pressure
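
The cache hit rate described above can be sketched as a small function over the hit and miss counts observed in a window. The function name is an illustrative assumption; the inputs would come from the `type="hit"` and `type="miss"` series of a cache metric.

```rust
/// Hit rate in the range 0.0..=1.0, or None when there were no lookups
/// (or the sum would overflow), so callers never divide by zero.
fn hit_rate(hits: u64, misses: u64) -> Option<f64> {
    let total = hits.checked_add(misses)?;
    if total == 0 {
        None
    } else {
        Some(hits as f64 / total as f64)
    }
}
```

Returning `None` for an idle window keeps an empty interval from being read as a 0% hit rate.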
257+
258+
### Debugging
259+
260+
1. **Failed Operations**:
261+
- Check metrics with `success=false` label
262+
- Correlate with error logs for root cause analysis
263+
264+
2. **Ring Buffer Backpressure**:
265+
- `ring.full` counter increasing indicates persistence bottleneck
266+
- May require tuning of flush parameters or disk subsystem
267+
268+
3. **Insert/Remove Patterns**:
269+
- `firewood.insert` labels show trie structure evolution
270+
- High `split` counts indicate complex key distributions
271+
- A high rate of removes with `result=nonexistent` means the application is deleting keys that do not exist, which suggests an application-level issue
272+
273+
## Example Monitoring Queries
274+
275+
For Prometheus-based monitoring (note: metric names use underscores in queries):
276+
277+
```promql
# Average commit latency over 5 minutes
# (reported separately for each value of the success label)
rate(firewood_proposal_commit_ms[5m]) / rate(firewood_proposal_commit[5m])

# Cache hit rate
sum(rate(firewood_cache_node{type="hit"}[5m])) /
sum(rate(firewood_cache_node[5m]))

# Database growth rate (bytes/sec)
rate(firewood_space_from_end[5m])

# Failed commit ratio
# (sum() on both sides so label matching does not pair the failed series
# only with itself, which would always yield 1)
sum(rate(firewood_proposal_commit{success="false"}[5m])) /
sum(rate(firewood_proposal_commit[5m]))
```
