Add zfs pool and dataset usage metrics. by jmcarp · Pull Request #10453 · oxidecomputer/omicron

jmcarp · 2026-05-16T03:11:27Z

Collect zpool and dataset usage metrics from sled-agent, using a background worker to collect stats and a producer to expose them to oximeter. To be used to monitor and alert on disk use of internal applications.

Context: we had an escalation related to an internal service filling up its disk. We shipped a mitigation in #10366, but the mitigation still requires the user to take action before the disk fills up. This patch adds metrics that the user can monitor and alert on proactively, and that we can use to understand storage use of internal services (and eventually set quotas).
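For readers unfamiliar with the collector/producer split described above, here is a minimal sketch of the general shape, not the code in this patch. `UsageSample`, `SampleBuffer`, `spawn_collector`, and `produce` are hypothetical names. The `collect` closure stands in for whatever actually queries ZFS, and the buffer is left unbounded for brevity.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// One usage observation. Hypothetical shape, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct UsageSample {
    pub pool: String,
    pub used_bytes: u64,
}

/// Buffer shared between the background worker and the producer.
pub type SampleBuffer = Arc<Mutex<Vec<UsageSample>>>;

/// Spawn a background worker that calls `collect` `rounds` times,
/// sleeping `interval` after each call, and appends the results to `buffer`.
pub fn spawn_collector<F>(
    buffer: SampleBuffer,
    interval: Duration,
    rounds: usize,
    mut collect: F,
) -> thread::JoinHandle<()>
where
    F: FnMut() -> Vec<UsageSample> + Send + 'static,
{
    thread::spawn(move || {
        for _ in 0..rounds {
            let new_samples = collect();
            // Hold the lock only long enough to append the new samples.
            buffer.lock().unwrap().extend(new_samples);
            thread::sleep(interval);
        }
    })
}

/// Producer side: hand over everything collected so far and leave the
/// buffer empty for the next collection pass.
pub fn produce(buffer: &SampleBuffer) -> Vec<UsageSample> {
    std::mem::take(&mut *buffer.lock().unwrap())
}
```

Keeping collection off the producer's call path means a slow stats query only delays the worker, not whoever is draining the samples.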

karencfv · 2026-05-18T06:21:42Z

This patch adds metrics that the user can monitor and alert on proactively, and that we can use to understand storage use of internal services

Is there consensus on what internal data we plan to expose to the customer? If not, it would be a good idea to reach one. AFAIK tracking these kinds of things will fall under the FM umbrella, so this PR and any similar ones should probably get some eyes from someone on the FM project, I think 🤔.

CC @hawkw @rmustacc

bnaecker

Looks good overall to me. I've a few suggestions and nits, but nothing major. I can't speak to the intersection with the FM subsystem though. Thanks for doing this!

bnaecker · 2026-05-18T21:38:33Z

@@ -0,0 +1 @@
+pub mod usage;


This file, and the other new files, need the common preamble: // This Source Code Form ...

bnaecker · 2026-05-18T21:42:22Z

+
+        let mut q = samples.lock().unwrap();
+        q.extend(new_samples);
+        if q.len() > MAX_QUEUE_LENGTH {


Kind of a nit, but this will temporarily allow the queue to grow larger than the maximum length, potentially by an arbitrary amount. I think we've solved this a few different times now, most recently here. We should probably make an abstraction for this, or use an existing one that fits our needs if we can find it.

I lifted your version for a little BoundedQueue abstraction, and used that throughout oximeter.

Excellent, thanks. Definitely handy.
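As an illustration of the concern above, here is a minimal sketch of a queue that enforces its bound on every push, so it never exceeds the maximum even briefly. It is not the `BoundedQueue` that landed in oximeter. The method names and the drop-oldest eviction policy are assumptions made for this example.

```rust
use std::collections::VecDeque;

/// A FIFO queue that never holds more than `capacity` items.
/// Hypothetical sketch; the drop-oldest policy is an assumption.
pub struct BoundedQueue<T> {
    items: VecDeque<T>,
    capacity: usize,
}

impl<T> BoundedQueue<T> {
    pub fn new(capacity: usize) -> Self {
        Self { items: VecDeque::with_capacity(capacity), capacity }
    }

    /// Push one item. If the queue is full, the oldest item is evicted
    /// first and returned, so `len()` never exceeds `capacity`.
    pub fn push(&mut self, item: T) -> Option<T> {
        if self.capacity == 0 {
            // Nothing can be stored; the new item itself is dropped.
            return Some(item);
        }
        let evicted = if self.items.len() == self.capacity {
            self.items.pop_front()
        } else {
            None
        };
        self.items.push_back(item);
        evicted
    }

    /// Push many items one at a time, returning how many were dropped.
    pub fn extend(&mut self, new_items: impl IntoIterator<Item = T>) -> usize {
        let mut dropped = 0;
        for item in new_items {
            if self.push(item).is_some() {
                dropped += 1;
            }
        }
        dropped
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }

    pub fn is_empty(&self) -> bool {
        self.items.is_empty()
    }

    /// Remove and return all queued items, oldest first.
    pub fn drain(&mut self) -> Vec<T> {
        self.items.drain(..).collect()
    }
}
```

Compared with extending first and truncating afterwards, the per-item check costs a little more per push but keeps peak memory bounded regardless of how many samples arrive in one batch.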

bnaecker · 2026-05-18T21:50:36Z

+
+[fields.pool_kind]
+type = "string"
+description = "Oxide kind of the parent zpool (external or internal; empty for non-Oxide pools)"


I'm not entirely sure what "non-Oxide pools" means. Are those pools that aren't automatically managed by the control plane? I wonder if we ought to report statistics only for pools that we can positively identify as "Oxide pools".

This was an infelicitous phrasing choice. Most pools and datasets follow a standard naming scheme, like ox(i|p)_<pool_uuid> for zpools or ox(i|p)_<pool_uuid>/crypt/zone/<zone_uuid>, etc., for datasets. If we can parse metadata from that kind of naming convention, using helpers that predate this patch, we include the metadata. If not, e.g. for the rpool pool, we omit the metadata. I added notes on this inline, and dropped the oxide/non-oxide phrasing from the descriptions.
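To make the naming convention concrete, here is a rough sketch of how such a name could be parsed. It is not the pre-existing helper mentioned above. `PoolKind`, `parse_pool_name`, and `parse_dataset_pool` are hypothetical names, and the sketch assumes the prefixes are spelled `oxi_` (internal) and `oxp_` (external), followed by a hyphenated UUID.

```rust
/// Kind of zpool, inferred from its name prefix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PoolKind {
    Internal,
    External,
}

/// Returns true if `s` has the canonical 8-4-4-4-12 hex UUID layout.
fn looks_like_uuid(s: &str) -> bool {
    let bytes = s.as_bytes();
    bytes.len() == 36
        && bytes.iter().enumerate().all(|(i, b)| match i {
            8 | 13 | 18 | 23 => *b == b'-',
            _ => b.is_ascii_hexdigit(),
        })
}

/// Parse a zpool name into its kind and UUID string.
/// Returns `None` for names outside the convention, such as `rpool`.
/// The `oxi_` / `oxp_` prefixes are an assumption of this sketch.
pub fn parse_pool_name(name: &str) -> Option<(PoolKind, &str)> {
    let (kind, rest) = if let Some(rest) = name.strip_prefix("oxi_") {
        (PoolKind::Internal, rest)
    } else if let Some(rest) = name.strip_prefix("oxp_") {
        (PoolKind::External, rest)
    } else {
        return None;
    };
    looks_like_uuid(rest).then_some((kind, rest))
}

/// Parse the parent pool out of a full dataset name, which starts with
/// the pool name followed by `/`-separated path components.
pub fn parse_dataset_pool(dataset: &str) -> Option<(PoolKind, &str)> {
    let pool = dataset.split('/').next()?;
    parse_pool_name(pool)
}
```

With this shape, a caller would attach `pool_kind` and `pool_id` only when parsing returns `Some`, and omit the metadata otherwise.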

bnaecker · 2026-05-18T21:50:46Z

+
+[fields.pool_id]
+type = "uuid"
+description = "Oxide UUID of the parent zpool (empty for non-Oxide pools)"


Same q as above here.

bnaecker · 2026-05-18T21:51:01Z

+
+[fields.dataset_kind]
+type = "string"
+description = "Oxide kind of the dataset (e.g. clickhouse, debug, transient_zone; empty for non-Oxide datasets)"


Same q as above here.

bnaecker · 2026-05-18T21:56:46Z

+
+[fields.dataset_id]
+type = "uuid"
+description = "Oxide UUID of the dataset (empty when unset)"


I'd say "nil" rather than "empty". Do you know the situations in which this is actually unset in practice?

bnaecker · 2026-05-18T21:57:11Z

+
+[fields.pool_kind]
+type = "string"
+description = "Oxide kind of the zpool (external or internal; empty for non-Oxide pools)"


Same note here about "non-Oxide".

bnaecker · 2026-05-18T21:57:25Z

+
+[fields.pool_id]
+type = "uuid"
+description = "Oxide UUID of the zpool (empty for non-Oxide pools)"


Same note here, and I'd use "nil" instead of "empty".

jmcarp · 2026-05-20T14:14:19Z

tracking these types of things will fall under the FM umbrella

I checked with @hawkw, and I think (but correct me if I'm wrong) that this sort of instrumentation would be complementary to FM, not duplicative.

bnaecker · 2026-05-20T19:20:10Z

I think (but correct me if I'm wrong) that this sort of instrumentation would be complementary to FM, not duplicative.

This sounds right to me. We'd use this in FM, maybe, or otherwise consume it when dealing with a fault.

hawkw · 2026-05-20T19:37:35Z

tracking these types of things will fall under the FM umbrella

I checked with @hawkw, and I think (but correct me if I'm wrong) that this sort of instrumentation would be complementary to FM, not duplicative.

Yeah, I think it was really just the phrase "monitor and alert" that suggested a potential duplication of stuff falling under the FM umbrella, but this PR is just adding metrics collection. I think here you were referring to the customer consuming our metrics and producing their own alerts, which is also not particularly duplicative. We should have the metrics regardless, and when we add automated diagnosis for these issues in the fault management subsystem, I'm sure we will want to have these metrics already there to use in FM!

jmcarp requested a review from bnaecker May 16, 2026 03:11

bnaecker reviewed May 18, 2026

jmcarp force-pushed the jmcarp/zfs-stats branch 4 times, most recently from 4582312 to d6b192a Compare May 20, 2026 14:01

Add zfs pool and dataset usage metrics.

8a5fc84

jmcarp force-pushed the jmcarp/zfs-stats branch from d6b192a to 8a5fc84 Compare May 20, 2026 14:47

4 participants