Commit 1945efa

Copilot, rkuris, and demosdemon authored
docs: Add comprehensive metrics documentation in METRICS.md (#1402)
- [x] Successfully rebased on main (commit 03f12df)
- [x] Created METRICS.md with comprehensive documentation of all 29 metrics
- [x] Added note about one metrics instance per process
- [x] Updated README.md to reference METRICS.md
- [x] Added METRICS.md to license header exclusion list in .github/check-license-headers.yaml
- [x] Clean single commit with only documentation changes
- [x] No unrelated PR changes visible in diff

<issue_title>Add comprehensive metrics documentation to README</issue_title>

> ### Problem
> There is no comprehensive document listing or describing the available metrics in Firewood. Users and developers cannot easily understand what metrics are available for monitoring or how to interpret them.
>
> ### Proposed Solution
> Add a dedicated "Metrics" section to the top-level README.md that documents all available metrics.
>
> ### Content to Include
> - List of all available metrics with their names
> - Description of what each metric measures
> - Usage examples for enabling and gathering metrics
> - Information about metric labels and values
> - Examples of how to interpret metrics for monitoring and debugging
>
> ### Current State
> - Metrics are mentioned briefly in `ffi/README.md`
> - The `firewood-macros/README.md` shows how metrics are instrumented
> - No comprehensive listing exists of what metrics are actually available
>
> ### References
> - See `storage/src/macros.rs` for metric macro implementations
> - See `ffi/README.md` for existing metrics documentation
> - See `firewood-macros/README.md` for metrics macro usage

- Fixes #1398

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: rkuris <[email protected]>
Co-authored-by: demosdemon <[email protected]>
Co-authored-by: Brandon LeBlanc <[email protected]>
1 parent 2f71077 commit 1945efa

3 files changed: +297 -0 lines changed

.github/check-license-headers.yaml

Lines changed: 2 additions & 0 deletions
@@ -13,6 +13,7 @@
 "grpc-testtool/**",
 "README*",
 "**/README*",
+"METRICS.md",
 "Cargo.toml",
 "Cargo.lock",
 "*/Cargo.toml",
@@ -40,6 +41,7 @@
 "grpc-testtool/**",
 "README*",
 "**/README*",
+"METRICS.md",
 "Cargo.toml",
 "Cargo.lock",
 "*/Cargo.toml",

METRICS.md

Lines changed: 291 additions & 0 deletions
@@ -0,0 +1,291 @@
# Firewood Metrics

Firewood provides comprehensive metrics for monitoring database performance, resource utilization, and operational characteristics. These metrics are built using the [Prometheus](https://prometheus.io/) format and can be exposed for collection by monitoring systems.

**Note**: Metric names in this documentation use dots (e.g., `firewood.proposal.commit`), but when exported to Prometheus, dots are automatically converted to underscores (e.g., `firewood_proposal_commit`) following Prometheus naming conventions.
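
When moving between this document and a Prometheus query, the name mapping can be sketched as a one-line helper. This is only an illustration of the dot-to-underscore rule stated in the note; `to_prometheus_name` is a hypothetical name, not part of Firewood or of the exporter, and it assumes no other characters need rewriting.

```rust
/// Hypothetical helper (not part of Firewood): maps a dotted metric name
/// used in this document to the underscore form seen in Prometheus.
/// Assumption: only `.` is rewritten; every other character passes through.
fn to_prometheus_name(name: &str) -> String {
    name.replace('.', "_")
}
```

For example, `to_prometheus_name("firewood.io.read_ms")` gives the name to use in a PromQL query for the I/O read time counter.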

## Enabling Metrics

Metrics are available when Firewood is built with the `metrics` feature. By default, metrics collection is enabled in the library but needs to be explicitly started in applications.

**Important**: Only one metrics instance can be created per process. Attempting to initialize metrics multiple times will result in an error.

### For Rust Applications

Metrics are automatically registered when the instrumented code paths are executed. To expose metrics via HTTP:

```rust
use metrics_exporter_prometheus::PrometheusBuilder;

// Install the Prometheus recorder together with its HTTP exporter.
// With the builder's default configuration the exporter listens on port 9000.
PrometheusBuilder::new()
    .install()
    .expect("failed to install Prometheus recorder");
```
25+
26+
### For FFI/Go Applications
27+
28+
In the Go FFI layer, metrics must be explicitly enabled:
29+
30+
```go
31+
import "github.com/ava-labs/firewood-go-ethhash/ffi"
32+
33+
// Option 1: Start metrics with HTTP exporter on a specific port
34+
ffi.StartMetricsWithExporter(9000)
35+
36+
// Option 2: Start metrics without exporter (use Gatherer to access)
37+
ffi.StartMetrics()
38+
39+
// Retrieve metrics programmatically
40+
gatherer := ffi.Gatherer{}
41+
metrics, err := gatherer.Gather()
42+
```
43+
44+
See the [FFI README](ffi/README.md) for more details on FFI metrics configuration.

## Available Metrics

### Database Operations

#### Proposal Metrics

- **`firewood.proposals`** (counter)
  - Description: Total number of proposals created
  - Use: Track proposal creation rate and throughput

- **`firewood.proposal.create`** (counter with `success` label)
  - Description: Count of proposal creation operations
  - Labels: `success=true|false`
  - Use: Monitor proposal creation success rate

- **`firewood.proposal.create_ms`** (counter with `success` label)
  - Description: Cumulative time spent creating proposals in milliseconds
  - Labels: `success=true|false`
  - Use: Track proposal creation latency

- **`firewood.proposal.commit`** (counter with `success` label)
  - Description: Count of proposal commit operations
  - Labels: `success=true|false`
  - Use: Monitor commit success rate

- **`firewood.proposal.commit_ms`** (counter with `success` label)
  - Description: Cumulative time spent committing proposals in milliseconds
  - Labels: `success=true|false`
  - Use: Track commit latency and identify slow commits

#### Revision Management

- **`firewood.active_revisions`** (gauge)
  - Description: Current number of active revisions in memory
  - Use: Monitor memory usage and revision retention

- **`firewood.max_revisions`** (gauge)
  - Description: Maximum number of revisions configured
  - Use: Track configuration setting

### Merkle Trie Operations

#### Insert Operations

- **`firewood.insert`** (counter with `merkle` label)
  - Description: Count of insert operations by type
  - Labels: `merkle=update|above|below|split`
    - `update`: Value updated at existing key
    - `above`: New node inserted above existing node
    - `below`: New node inserted below existing node
    - `split`: Node split during insertion
  - Use: Understand insert patterns and trie structure evolution

#### Remove Operations

- **`firewood.remove`** (counter with `prefix` and `result` labels)
  - Description: Count of remove operations
  - Labels:
    - `prefix=true|false`: Whether operation is prefix-based removal
    - `result=success|nonexistent`: Whether key(s) were found
  - Use: Track deletion patterns and key existence

### Storage and I/O Metrics

#### Node Reading

- **`firewood.read_node`** (counter with `from` label)
  - Description: Count of node reads by source
  - Labels: `from=file|memory`
  - Use: Monitor read patterns and storage layer usage

#### Cache Performance

- **`firewood.cache.node`** (counter with `mode` and `type` labels)
  - Description: Node cache hit/miss statistics
  - Labels:
    - `mode`: Read operation mode
    - `type=hit|miss`: Cache hit or miss
  - Use: Evaluate cache effectiveness for nodes

- **`firewood.cache.freelist`** (counter with `type` label)
  - Description: Free list cache hit/miss statistics
  - Labels: `type=hit|miss`
  - Use: Monitor free list cache efficiency

#### I/O Operations

- **`firewood.io.read`** (counter)
  - Description: Total number of I/O read operations
  - Use: Track I/O operation count

- **`firewood.io.read_ms`** (counter)
  - Description: Total time spent in I/O reads in milliseconds
  - Use: Identify I/O bottlenecks and disk performance issues

#### Node Persistence

- **`firewood.flush_nodes`** (counter)
  - Description: Cumulative time spent flushing nodes to disk in milliseconds (counter incremented by flush duration)
  - Use: Monitor flush performance and identify slow disk writes; `rate()` on this counter gives the milliseconds spent flushing per second (no flush count is listed here, so a per-flush average cannot be derived from this metric alone)

### Memory Management

#### Space Allocation

- **`firewood.space.reused`** (counter with `index` label)
  - Description: Bytes reused from free list
  - Labels: `index`: Size index of allocated area
  - Use: Track memory reuse efficiency

- **`firewood.space.wasted`** (counter with `index` label)
  - Description: Bytes wasted when allocating from free list (allocated more than needed)
  - Labels: `index`: Size index of allocated area
  - Use: Monitor allocation efficiency and fragmentation

- **`firewood.space.from_end`** (counter with `index` label)
  - Description: Bytes allocated from end of nodestore when free list was insufficient
  - Labels: `index`: Size index of allocated area
  - Use: Track database growth and free list effectiveness

- **`firewood.space.freed`** (counter with `index` label)
  - Description: Bytes freed back to free list
  - Labels: `index`: Size index of freed area
  - Use: Monitor memory reclamation

#### Node Management

- **`firewood.delete_node`** (counter with `index` label)
  - Description: Count of nodes deleted
  - Labels: `index`: Size index of deleted node
  - Use: Track node deletion patterns

#### Ring Buffer

- **`ring.full`** (counter)
  - Description: Count of times the ring buffer became full during node flushing
  - Use: Identify backpressure in node persistence pipeline

### FFI Layer Metrics

These metrics are specific to the Foreign Function Interface (Go) layer:

#### Batch Operations

- **`firewood.ffi.batch`** (counter)
  - Description: Count of batch operations completed
  - Use: Track FFI batch throughput

- **`firewood.ffi.batch_ms`** (counter)
  - Description: Cumulative time spent processing batches in milliseconds
  - Use: Monitor FFI batch latency

#### Proposal Operations

- **`firewood.ffi.propose`** (counter)
  - Description: Count of proposal operations via FFI
  - Use: Track FFI proposal throughput

- **`firewood.ffi.propose_ms`** (counter)
  - Description: Cumulative time spent creating proposals via FFI in milliseconds
  - Use: Monitor FFI proposal latency

#### Commit Operations

- **`firewood.ffi.commit`** (counter)
  - Description: Count of commit operations via FFI
  - Use: Track FFI commit throughput

- **`firewood.ffi.commit_ms`** (counter)
  - Description: Cumulative time spent committing via FFI in milliseconds
  - Use: Monitor FFI commit latency

#### View Caching

- **`firewood.ffi.cached_view.hit`** (counter)
  - Description: Count of cached view hits
  - Use: Monitor view cache effectiveness

- **`firewood.ffi.cached_view.miss`** (counter)
  - Description: Count of cached view misses
  - Use: Monitor view cache effectiveness

## Interpreting Metrics

### Performance Monitoring

1. **Latency Tracking**: The `*_ms` metrics are counters that accumulate operation durations in milliseconds. Dividing the increase in a `*_ms` counter by the increase in its matching count counter gives the average latency for that window. Monitor these for:
   - Sudden increases indicating performance degradation
   - Baseline establishment for SLA monitoring
   - Correlation with system load

2. **Throughput Monitoring**: Most counter metrics without the `_ms` suffix track operation counts (the exceptions are `firewood.flush_nodes`, which accumulates milliseconds, and the `firewood.space.*` counters, which accumulate bytes):
   - Rate of change indicates throughput
   - Compare with expected load patterns
   - Identify anomalies in operation rates
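
Since each `*_ms` metric is a counter paired with a count metric (for example `firewood.proposal.commit_ms` and `firewood.proposal.commit`), an average latency for a window can be derived from two scrapes. The sketch below is an illustration only: the `Sample` type and `average_latency_ms` function are hypothetical names, and it assumes both counter values were read at the same two points in time.

```rust
/// Hypothetical snapshot of one duration/count counter pair at scrape time,
/// e.g. `firewood.proposal.commit_ms` and `firewood.proposal.commit`.
#[derive(Clone, Copy, Debug)]
struct Sample {
    /// Value of the `*_ms` counter (cumulative milliseconds).
    total_ms: u64,
    /// Value of the matching count counter (cumulative operations).
    count: u64,
}

/// Average latency in milliseconds between two scrapes.
/// Returns `None` if no operations completed in the window or if either
/// counter went backwards (for example after a process restart).
fn average_latency_ms(earlier: Sample, later: Sample) -> Option<f64> {
    let ops = later.count.checked_sub(earlier.count)?;
    let ms = later.total_ms.checked_sub(earlier.total_ms)?;
    if ops == 0 {
        return None;
    }
    Some(ms as f64 / ops as f64)
}
```

Returning `None` instead of zero keeps an idle window or a counter reset from being mistaken for a very fast operation.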

### Resource Utilization

1. **Cache Efficiency**:
   - Calculate hit rate as `hit / (hit + miss)`, using the `type=hit` and `type=miss` series of `firewood.cache.node` or `firewood.cache.freelist`
   - Target: >90% for node cache, >80% for free list cache
   - Low hit rates may indicate insufficient cache size

2. **Memory Management**:
   - Monitor the ratio of `firewood.space.reused` to `firewood.space.from_end`
   - High `firewood.space.from_end` indicates database growth
   - High `firewood.space.wasted` suggests fragmentation issues

3. **Active Revisions**:
   - `firewood.active_revisions` approaching `firewood.max_revisions` triggers cleanup
   - Sustained high values may indicate memory pressure
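
The cache hit rate and the reuse ratio above are both "part of a total" calculations over counter increases. A minimal sketch, assuming the inputs are increases read from the relevant counters over the same window; `share`, `cache_hit_rate`, and `reuse_fraction` are hypothetical names used only for illustration.

```rust
/// Share of `part` in `part + rest`. Returns `None` when both are zero
/// (or the sum overflows), so callers never divide by zero.
fn share(part: u64, rest: u64) -> Option<f64> {
    let total = part.checked_add(rest)?;
    if total == 0 {
        return None;
    }
    Some(part as f64 / total as f64)
}

/// Cache hit rate from the increases of the `type=hit` and `type=miss` series.
fn cache_hit_rate(hits: u64, misses: u64) -> Option<f64> {
    share(hits, misses)
}

/// Fraction of allocated bytes that came from the free list
/// (`firewood.space.reused`) rather than from the end of the nodestore
/// (`firewood.space.from_end`).
fn reuse_fraction(reused_bytes: u64, from_end_bytes: u64) -> Option<f64> {
    share(reused_bytes, from_end_bytes)
}
```

A `cache_hit_rate` result can then be compared against the targets listed above, and a falling `reuse_fraction` points to growth at the end of the nodestore.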

### Debugging

1. **Failed Operations**:
   - Check metrics with `success=false` label
   - Correlate with error logs for root cause analysis

2. **Ring Buffer Backpressure**:
   - `ring.full` counter increasing indicates persistence bottleneck
   - May require tuning of flush parameters or disk subsystem

3. **Insert/Remove Patterns**:
   - `firewood.insert` labels show trie structure evolution
   - High `split` counts indicate complex key distributions
   - A high rate of `firewood.remove` with `result=nonexistent` suggests application-level issues, such as removing keys that were never inserted

## Example Monitoring Queries

For Prometheus-based monitoring (note: metric names use underscores in queries):

```promql
# Average commit latency over 5 minutes (per `success` label value)
rate(firewood_proposal_commit_ms[5m]) / rate(firewood_proposal_commit[5m])

# Cache hit rate
sum(rate(firewood_cache_node{type="hit"}[5m])) /
sum(rate(firewood_cache_node[5m]))

# Database growth rate (bytes/sec)
rate(firewood_space_from_end[5m])

# Failed commit ratio
# sum() removes the `success` label so failures are divided by all commits;
# without it the division only matches success="false" against itself.
sum(rate(firewood_proposal_commit{success="false"}[5m])) /
sum(rate(firewood_proposal_commit[5m]))
```

README.md

Lines changed: 4 additions & 0 deletions
@@ -74,6 +74,10 @@ as well as carefully managing the free list during the creation and expiration o
 - `Commit` - The operation of applying one or more `Proposal`s to the most recent
   `Revision`.
 
+## Metrics
+
+Firewood provides comprehensive metrics for monitoring database performance, resource utilization, and operational characteristics. For detailed information about all available metrics, how to enable them, and how to interpret them, see [METRICS.md](METRICS.md).
+
 ## Build
 
 In order to build firewood, the following dependencies must be installed: