Add zfs pool and dataset usage metrics.#10453
Conversation
Has there been a consensus on what internal data we plan to expose to the customer? If not, it would be a good idea to do so. AFAIK tracking these types of things will fall under the FM umbrella, so this PR or any similar ones should probably get some eyes from someone in the FM project I think 🤔 . |
bnaecker
left a comment
There was a problem hiding this comment.
Looks good overall to me. I've a few suggestions and nits, but nothing major. I can't speak to the intersection with the FM subsystem though. Thanks for doing this!
| @@ -0,0 +1 @@ | |||
| pub mod usage; | |||
There was a problem hiding this comment.
Need the common preamble here, and other files: // This Source Code Form ...
|
|
||
| let mut q = samples.lock().unwrap(); | ||
| q.extend(new_samples); | ||
| if q.len() > MAX_QUEUE_LENGTH { |
There was a problem hiding this comment.
Kind of a nit, but this will temporarily allow the queue to be larger than the maximum length, by potentially by an arbitrary amount. I think we've solved this a few different times now, most recently here. We should probably make an abstraction for this, or use an existing one that fits our needs if we can find it.
There was a problem hiding this comment.
I lifted your version for a little BoundedQueue abstraction, and used that throughout oximeter.
There was a problem hiding this comment.
Excellent, thanks. Definitely handy.
|
|
||
| [fields.pool_kind] | ||
| type = "string" | ||
| description = "Oxide kind of the parent zpool (external or internal; empty for non-Oxide pools)" |
There was a problem hiding this comment.
I'm not entirely sure what "non-Oxide pools" means. Are those pools not automatically managed by the control plane? I wonder if we ought to only report statistics for pools for which we can positively identify them as "Oxide pools".
There was a problem hiding this comment.
This was an infelicitous phrasing choice. Most pools/datasets follow a standard naming scheme, like ox(i|p)<pool_uuid> for zpools or ox(i|p)<pool_uuid>/crypt/zone/<zone_uuid>, etc., for datasets. If we can parse metadata from that kind of naming convention, using helpers that predate this patch, we include the metadata. If not, e.g. for the rpool pool, we omit the metadata. I added notes on this inline, and dropped the oxide/non-oxide phrasing from the descriptions.
|
|
||
| [fields.pool_id] | ||
| type = "uuid" | ||
| description = "Oxide UUID of the parent zpool (empty for non-Oxide pools)" |
|
|
||
| [fields.dataset_kind] | ||
| type = "string" | ||
| description = "Oxide kind of the dataset (e.g. clickhouse, debug, transient_zone; empty for non-Oxide datasets)" |
|
|
||
| [fields.dataset_id] | ||
| type = "uuid" | ||
| description = "Oxide UUID of the dataset (empty when unset)" |
There was a problem hiding this comment.
I'd say "nil" rather than "empty". Do you know the situations in which this is actually unset in practice?
|
|
||
| [fields.pool_kind] | ||
| type = "string" | ||
| description = "Oxide kind of the zpool (external or internal; empty for non-Oxide pools)" |
There was a problem hiding this comment.
Same note here about "non-Oxide".
|
|
||
| [fields.pool_id] | ||
| type = "uuid" | ||
| description = "Oxide UUID of the zpool (empty for non-Oxide pools)" |
There was a problem hiding this comment.
Same note here, and I'd use "nil" instead of "empty".
4582312 to
d6b192a
Compare
I checked with @hawkw, and I think (but correct me if I'm wrong) that this sort of instrumentation would be complementary to FM, not duplicative. |
Collect zpool and dataset usage metrics from sled-agent, using a background worker to collect stats and a producer to expose them to oximeter. To be used to monitor and alert on disk use of internal applications.
This sounds right to me. We'd use this in FM, maybe, or otherwise consume it when dealing with a fault. |
Yeah, I think it was really just the phrase "monitor and alert" that suggested a potential duplication of stuff falling under the FM umbrella, but this PR is just adding metrics collection. I think here you were referring to the customer consuming our metrics and producing their own alerts, which is also not particularly duplicative. We should have the metrics regardless, and when we add automated diagnosis for these issues in the fault management subsystem, I'm sure we will want to have these metrics already there to use in FM! |
Collect zpool and dataset usage metrics from sled-agent, using a background worker to collect stats and a producer to expose them to oximeter. To be used to monitor and alert on disk use of internal applications.
Context: we had an escalation related to an internal service filling up its disk. We shipped a mitigation in #10366, but the mitigation still requires the user to take action before the disk fills up. This patch adds metrics that the user can monitor and alert on proactively, and that we can use to understand storage use of internal services (and eventually set quotas).