Add zfs pool and dataset usage metrics. #10453

Open: jmcarp wants to merge 1 commit into main from jmcarp/zfs-stats

@jmcarp
Contributor

@jmcarp jmcarp commented May 16, 2026

Collect zpool and dataset usage metrics from sled-agent, using a background worker to collect stats and a producer to expose them to oximeter. To be used to monitor and alert on disk use of internal applications.

Context: we had an escalation related to an internal service filling up its disk. We shipped a mitigation in #10366, but the mitigation still requires the user to take action before the disk fills up. This patch adds metrics that the user can monitor and alert on proactively, and that we can use to understand storage use of internal services (and eventually set quotas).
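The worker/producer split described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `UsageSample`, `SharedSamples`, `record_round`, and `produce` are hypothetical names, and the real producer would implement oximeter's producer interface rather than return a plain `Vec`.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

/// Hypothetical sample type: one usage reading for a pool or dataset.
#[derive(Debug, Clone, PartialEq)]
pub struct UsageSample {
    pub name: String,
    pub used_bytes: u64,
}

/// Queue shared between the background worker and the producer.
pub type SharedSamples = Arc<Mutex<VecDeque<UsageSample>>>;

/// Worker side: append one round of freshly collected readings.
pub fn record_round(samples: &SharedSamples, round: Vec<UsageSample>) {
    let mut q = samples.lock().unwrap();
    q.extend(round);
}

/// Producer side: hand back everything collected so far and leave the
/// shared queue empty for the next collection interval.
pub fn produce(samples: &SharedSamples) -> Vec<UsageSample> {
    let mut q = samples.lock().unwrap();
    q.drain(..).collect()
}
```

The worker would call `record_round` on a timer from its own thread or task, while the collector-facing side calls `produce` whenever it is polled. Note that nothing here bounds the queue, which is the concern raised later in the review.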

@jmcarp jmcarp requested a review from bnaecker May 16, 2026 03:11
@karencfv
Contributor

> This patch adds metrics that the user can monitor and alert on proactively, and that we can use to understand storage use of internal services

Has there been a consensus on what internal data we plan to expose to the customer? If not, it would be a good idea to do so. AFAIK tracking these types of things will fall under the FM umbrella, so this PR or any similar ones should probably get some eyes from someone in the FM project I think 🤔 .

CC @hawkw @rmustacc

Collaborator

@bnaecker bnaecker left a comment

Looks good overall to me. I've a few suggestions and nits, but nothing major. I can't speak to the intersection with the FM subsystem though. Thanks for doing this!

@@ -0,0 +1 @@
pub mod usage;
Collaborator

Need the common preamble here, and other files: // This Source Code Form ...

Comment thread oximeter/instruments/src/zfs/usage.rs Outdated

let mut q = samples.lock().unwrap();
q.extend(new_samples);
if q.len() > MAX_QUEUE_LENGTH {
Collaborator

Kind of a nit, but this will temporarily allow the queue to grow larger than the maximum length, potentially by an arbitrary amount. I think we've solved this a few different times now, most recently here. We should probably make an abstraction for this, or use an existing one that fits our needs if we can find it.

Contributor Author

I lifted your version into a small BoundedQueue abstraction and used it throughout oximeter.

Collaborator

Excellent, thanks. Definitely handy.
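For readers following the thread, the idea of a queue that never exceeds its limit can be sketched like this. It is a minimal illustration and not the `BoundedQueue` that landed in oximeter: the method names and the drop-oldest eviction policy are assumptions made for the example.

```rust
use std::collections::VecDeque;

/// Hypothetical FIFO queue that never holds more than `capacity` items.
/// Assumed policy: when full, the oldest item is evicted to make room.
#[derive(Debug)]
pub struct BoundedQueue<T> {
    items: VecDeque<T>,
    capacity: usize,
}

impl<T> BoundedQueue<T> {
    pub fn new(capacity: usize) -> Self {
        Self { items: VecDeque::new(), capacity }
    }

    /// Push one item. If the queue is full, the oldest item is removed
    /// first and returned, so the length never exceeds `capacity`.
    pub fn push(&mut self, item: T) -> Option<T> {
        if self.capacity == 0 {
            // Nothing can be stored; the new item itself is dropped.
            return Some(item);
        }
        let evicted = if self.items.len() == self.capacity {
            self.items.pop_front()
        } else {
            None
        };
        self.items.push_back(item);
        evicted
    }

    /// Add items one at a time, so the bound holds at every step.
    /// Returns how many items were dropped along the way.
    pub fn extend<I: IntoIterator<Item = T>>(&mut self, iter: I) -> usize {
        let mut dropped = 0;
        for item in iter {
            if self.push(item).is_some() {
                dropped += 1;
            }
        }
        dropped
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }

    pub fn is_empty(&self) -> bool {
        self.items.is_empty()
    }

    /// Remove and return all queued items, oldest first.
    pub fn drain(&mut self) -> Vec<T> {
        self.items.drain(..).collect()
    }
}
```

Compared with the `extend`-then-check pattern in the diff above, the bound is enforced on every push, so a large batch of new samples cannot make the queue overshoot, and the returned drop count could feed a "samples dropped" counter.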


[fields.pool_kind]
type = "string"
description = "Oxide kind of the parent zpool (external or internal; empty for non-Oxide pools)"
Collaborator

I'm not entirely sure what "non-Oxide pools" means. Are those pools not automatically managed by the control plane? I wonder if we ought to only report statistics for pools for which we can positively identify them as "Oxide pools".

Contributor Author

This was an infelicitous phrasing choice. Most pools/datasets follow a standard naming scheme, like ox(i|p)<pool_uuid> for zpools or ox(i|p)<pool_uuid>/crypt/zone/<zone_uuid>, etc., for datasets. If we can parse metadata from that kind of naming convention, using helpers that predate this patch, we include the metadata. If not, e.g. for the rpool pool, we omit the metadata. I added notes on this inline, and dropped the oxide/non-oxide phrasing from the descriptions.
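To make the naming convention concrete, here is a minimal sketch of parsing pool metadata out of a name. It is not the pre-existing helper the comment refers to; `PoolKind`, `parse_pool_name`, and `pool_fields` are hypothetical names. It also assumes that `i` means internal and `p` means external, tolerates an optional underscore after the prefix, and falls back to an empty kind and the nil UUID when nothing can be parsed.

```rust
/// Pool kinds named in the schema description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PoolKind {
    Internal,
    External,
}

/// The nil UUID, used here as the fallback when no UUID can be parsed.
pub const NIL_UUID: &str = "00000000-0000-0000-0000-000000000000";

/// Check for the canonical 8-4-4-4-12 hyphenated hex form.
fn is_uuid(s: &str) -> bool {
    let b = s.as_bytes();
    b.len() == 36
        && b.iter().enumerate().all(|(i, c)| match i {
            8 | 13 | 18 | 23 => *c == b'-',
            _ => c.is_ascii_hexdigit(),
        })
}

/// Parse `ox(i|p)<pool_uuid>` into a kind and the UUID text.
/// Assumption: `i` is internal, `p` is external, and an underscore
/// between the prefix and the UUID is accepted but not required.
pub fn parse_pool_name(name: &str) -> Option<(PoolKind, &str)> {
    let (kind, rest) = if let Some(rest) = name.strip_prefix("oxi") {
        (PoolKind::Internal, rest)
    } else if let Some(rest) = name.strip_prefix("oxp") {
        (PoolKind::External, rest)
    } else {
        return None;
    };
    let uuid = rest.strip_prefix('_').unwrap_or(rest);
    is_uuid(uuid).then_some((kind, uuid))
}

/// Derive the `pool_kind` and `pool_id` field values for a pool or
/// dataset name. Names that do not follow the scheme (e.g. `rpool`)
/// get an empty kind and the nil UUID.
pub fn pool_fields(dataset: &str) -> (&'static str, &str) {
    // The pool is the first path component of a dataset name.
    let pool = dataset.split('/').next().unwrap_or(dataset);
    match parse_pool_name(pool) {
        Some((PoolKind::Internal, id)) => ("internal", id),
        Some((PoolKind::External, id)) => ("external", id),
        None => ("", NIL_UUID),
    }
}
```

With this shape, a pool such as `rpool` still produces a sample, just with the metadata fields left at their fallback values, which matches the "include it if we can parse it, omit it otherwise" behavior described above.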


[fields.pool_id]
type = "uuid"
description = "Oxide UUID of the parent zpool (empty for non-Oxide pools)"
Collaborator

Same q as above here.


[fields.dataset_kind]
type = "string"
description = "Oxide kind of the dataset (e.g. clickhouse, debug, transient_zone; empty for non-Oxide datasets)"
Collaborator

Same q as above here.


[fields.dataset_id]
type = "uuid"
description = "Oxide UUID of the dataset (empty when unset)"
Collaborator

I'd say "nil" rather than "empty". Do you know the situations in which this is actually unset in practice?

Comment thread oximeter/oximeter/schema/zfs_pool.toml Outdated

[fields.pool_kind]
type = "string"
description = "Oxide kind of the zpool (external or internal; empty for non-Oxide pools)"
Collaborator

Same note here about "non-Oxide".

Comment thread oximeter/oximeter/schema/zfs_pool.toml Outdated

[fields.pool_id]
type = "uuid"
description = "Oxide UUID of the zpool (empty for non-Oxide pools)"
Collaborator

Same note here, and I'd use "nil" instead of "empty".

Comment thread oximeter/instruments/src/zfs/usage.rs
@jmcarp jmcarp force-pushed the jmcarp/zfs-stats branch 4 times, most recently from 4582312 to d6b192a Compare May 20, 2026 14:01
@jmcarp
Contributor Author

jmcarp commented May 20, 2026

> tracking these types of things will fall under the FM umbrella

I checked with @hawkw, and I think (but correct me if I'm wrong) that this sort of instrumentation would be complementary to FM, not duplicative.

Collect zpool and dataset usage metrics from sled-agent, using a background
worker to collect stats and a producer to expose them to oximeter. To be used
to monitor and alert on disk use of internal applications.
@jmcarp jmcarp force-pushed the jmcarp/zfs-stats branch from d6b192a to 8a5fc84 Compare May 20, 2026 14:47
@bnaecker
Collaborator

> I think (but correct me if I'm wrong) that this sort of instrumentation would be complementary to FM, not duplicative.

This sounds right to me. We'd use this in FM, maybe, or otherwise consume it when dealing with a fault.

@hawkw
Member

hawkw commented May 20, 2026

> > tracking these types of things will fall under the FM umbrella
>
> I checked with @hawkw, and I think (but correct me if I'm wrong) that this sort of instrumentation would be complementary to FM, not duplicative.

Yeah, I think it was really just the phrase "monitor and alert" that suggested a potential duplication of stuff falling under the FM umbrella, but this PR is just adding metrics collection. I think here you were referring to the customer consuming our metrics and producing their own alerts, which is also not particularly duplicative. We should have the metrics regardless, and when we add automated diagnosis for these issues in the fault management subsystem, I'm sure we will want to have these metrics already there to use in FM!
