feat: add sysinfo metrics #1139
Conversation
Pull Request Test Coverage Report for Build 13025834726
💛 - Coveralls
Walkthrough

This set of changes introduces a comprehensive system metrics collection and reporting framework. Disk, memory, and CPU usage metrics are now gathered periodically using a scheduler and exposed via Prometheus-compatible metrics. The codebase is refactored to support these new metrics, with additional data structures and metric types added. Metric collection is integrated into server initialization routines, ensuring that system and cluster metrics are always gathered and available. Some internal APIs are streamlined, such as the removal of the …
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Scheduler
    participant MetricsModule
    participant Prometheus
    participant Disk
    participant System
    loop Every minute
        Scheduler->>MetricsModule: collect_all_metrics()
        MetricsModule->>Disk: collect_disk_metrics()
        Disk-->>MetricsModule: Disk usage stats
        MetricsModule->>System: collect_system_metrics()
        System-->>MetricsModule: Memory & CPU stats
        MetricsModule->>Prometheus: Update metrics (disk, memory, CPU)
    end
```

```mermaid
sequenceDiagram
    participant Server
    participant MetricsModule
    Server->>MetricsModule: init_system_metrics_scheduler().await
    Server->>MetricsModule: init_cluster_metrics_scheduler().await
    MetricsModule-->>Server: Scheduler tasks started
```
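As a rough sketch of what the "Prometheus-compatible" gauges described above could look like with the `prometheus` crate (the metric name `parseable_total_disk` and the `volume` label are taken from the review comments below; everything else is illustrative, not the PR's actual code):

```rust
use prometheus::{IntGaugeVec, Opts, Registry};

fn main() -> Result<(), prometheus::Error> {
    let registry = Registry::new();
    // One gauge per disk stat, labeled by volume (data / staging / hot-tier).
    let total_disk = IntGaugeVec::new(
        Opts::new("parseable_total_disk", "Total disk capacity in bytes"),
        &["volume"],
    )?;
    registry.register(Box::new(total_disk.clone()))?;
    // The scheduler's collection task would refresh these values every minute.
    total_disk
        .with_label_values(&["staging"])
        .set(512 * 1024 * 1024 * 1024);
    Ok(())
}
```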
Actionable comments posted: 1
🧹 Nitpick comments (9)
src/parseable/streams.rs (2)
133-155: Consider returning a more descriptive error instead of panicking.

This block is generally correct and well-structured; however, you call `.expect("File and RecordBatch both are checked")` instead of returning a more descriptive `StagingError`. In production, this might abruptly terminate the application instead of gracefully propagating the error. As a follow-up, you could replace the `.expect(...)` with a result-based error handling approach to facilitate better error diagnosis.

```diff
- let mut writer = DiskWriter::try_new(file_path, &record.schema(), range)
-     .expect("File and RecordBatch both are checked");
+ let mut writer = DiskWriter::try_new(file_path, &record.schema(), range)
+     .map_err(|_| StagingError::Create)?;
```
1047-1047: Use a more descriptive `.expect(...)` in tests.

Using `.unwrap()` in a test is acceptable, but for clarity, a more descriptive message can help diagnose failures:

```diff
- staging.push("abc", &batch, time, &HashMap::new()).unwrap();
+ staging.push("abc", &batch, time, &HashMap::new())
+     .expect("Failed to push record batch to staging during test");
```

src/handlers/http/cluster/mod.rs (1)
874-874: Include error handling for querier metrics if necessary.

Adding `all_metrics.push(Metrics::querier_prometheus_metrics().await);` is a solid enhancement. Consider wrapping it if there's any chance of error or unavailability from `querier_prometheus_metrics()` so that the entire metrics collection doesn't silently fail if the querier is unreachable.

src/cli.rs (3)
442-451: Provide user-friendly fallback or logs on invalid config.

The logic in `get_url` is concise, routing between `get_endpoint` and `build_url`. However, it uses a `panic!` in `get_endpoint` for invalid input. Consider returning a `Result<Url, SomeConfigError>` or logging more details to help users correct their configurations in production deployments.
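A minimal sketch of the `Result`-based alternative suggested here, assuming the `url` crate; `ConfigError` and the exact validation rule are hypothetical, not the PR's code:

```rust
use url::Url;

#[derive(Debug)]
enum ConfigError {
    InvalidEndpoint(String),
}

// Hypothetical Result-returning counterpart to the panicking get_endpoint.
fn get_endpoint(raw: &str) -> Result<Url, ConfigError> {
    if raw.contains("http") {
        // Mirrors the fail-fast check described in this review, minus the panic.
        return Err(ConfigError::InvalidEndpoint(format!(
            "endpoint must not contain a scheme: {raw}"
        )));
    }
    Url::parse(&format!("http://{raw}"))
        .map_err(|e| ConfigError::InvalidEndpoint(e.to_string()))
}

fn main() {
    assert!(get_endpoint("http://localhost:8000").is_err());
    println!("{}", get_endpoint("localhost:8000").unwrap());
}
```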
468-482: Parsing by splitting on ":" alone may limit IPv6 support.

The code splits on ":" to separate hostname from port, which won't handle IPv6 addresses gracefully. If you foresee IPv6 usage, consider an established parser or additional checks to handle bracketed IPv6 addresses.
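For illustration, `std::net::SocketAddr`'s parser already accepts the bracketed IPv6 form that a plain split on ":" mishandles:

```rust
use std::net::SocketAddr;

fn main() {
    // "::1:8080" is ambiguous when splitting on ':' — IPv6 needs brackets.
    let v4: SocketAddr = "127.0.0.1:8080".parse().unwrap();
    let v6: SocketAddr = "[::1]:8080".parse().unwrap();
    assert_eq!((v4.port(), v6.port()), (8080, 8080));
    println!("host: {}, port: {}", v6.ip(), v6.port());
}
```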
507-516: Fail-fast parsing with a helpful panic message is appropriate.

`build_url` clarifies errors for misconfigured addresses. This is fine as a default, but if you anticipate frequently needing dynamic reconfiguration, consider returning a `Result<Url, ConfigError>` to handle misconfigurations more gracefully at runtime.

src/metrics/prom_utils.rs (2)
65-69: Ensure testing of newly introduced fields

The newly introduced fields for disk, memory, and CPU usage in the `Metrics` struct are clear and well-defined. Ensure they're fully covered in unit tests to verify the integration logic, especially verifying default values and changes upon metric updates.

Also applies to: 98-119
96-100: Consider adding doc comments

The `MetricType` enum and its `from_metric` implementation are well-structured. To aid maintainability, consider adding Rust doc comments explaining each variant's purpose and usage, as these mappings play a crucial role in the metrics pipeline.

Also applies to: 134-174
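For example, doc comments along these lines would do (the variant names here are illustrative, not necessarily the PR's exact set):

```rust
/// Classifies a Prometheus sample by the subsystem it reports on.
pub enum MetricType {
    /// Disk usage for a labeled volume (data / staging / hot-tier).
    DiskUsage,
    /// Memory and swap usage of the host.
    MemoryUsage,
    /// Per-core CPU utilization.
    CpuUsage,
}
```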
src/metrics/mod.rs (1)
195-230: Suggest verifying naming consistency

The newly declared disk, memory, and CPU gauges and counters are logically grouped under the same namespace. As a nitpick, ensure consistent naming conventions (e.g., "_disk" vs. "_usage") to minimize confusion in dashboards.

Also applies to: 280-294
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- src/cli.rs (1 hunks)
- src/event/mod.rs (0 hunks)
- src/handlers/http/cluster/mod.rs (2 hunks)
- src/handlers/http/ingest.rs (2 hunks)
- src/handlers/http/modal/ingest_server.rs (2 hunks)
- src/handlers/http/modal/query_server.rs (2 hunks)
- src/handlers/http/modal/server.rs (2 hunks)
- src/metrics/mod.rs (4 hunks)
- src/metrics/prom_utils.rs (4 hunks)
- src/parseable/streams.rs (2 hunks)
💤 Files with no reviewable changes (1)
- src/event/mod.rs
🧰 Additional context used
🧬 Code Definitions (8)
- src/handlers/http/modal/ingest_server.rs (1)
  - src/metrics/mod.rs (1): init_system_metrics_scheduler (386-406)
- src/handlers/http/modal/server.rs (1)
  - src/metrics/mod.rs (1): init_system_metrics_scheduler (386-406)
- src/handlers/http/modal/query_server.rs (2)
  - src/handlers/http/cluster/mod.rs (1): init_cluster_metrics_scheduler (878-927)
  - src/metrics/mod.rs (1): init_system_metrics_scheduler (386-406)
- src/parseable/streams.rs (2)
  - src/utils/time.rs (2): granularity_range (267-282), new (59-61)
  - src/parseable/staging/writer.rs (1): try_new (57-72)
- src/cli.rs (5)
  - src/option.rs (1): mode (127-135)
  - src/storage/object_storage.rs (1): get_endpoint (74-74)
  - src/storage/localfs.rs (1): get_endpoint (83-85)
  - src/storage/azure_blob.rs (1): get_endpoint (192-194)
  - src/storage/s3.rs (1): get_endpoint (322-324)
- src/metrics/prom_utils.rs (1)
  - src/metrics/mod.rs (2): get_system_metrics (509-542), get_volume_disk_usage (454-480)
- src/handlers/http/cluster/mod.rs (2)
  - src/metrics/mod.rs (1): collect_all_metrics (409-417)
  - src/metrics/prom_utils.rs (2): querier_prometheus_metrics (203-252), new (79-121)
- src/metrics/mod.rs (1)
  - src/metrics/prom_utils.rs (1): new (79-121)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: coverage
🔇 Additional comments (21)
src/handlers/http/modal/ingest_server.rs (2)
31-31: LGTM: Added import for system metrics scheduler

Added import for the new system metrics scheduling functionality that will be used in the init method.
114-114: Added system metrics collection to ingest server

This adds the ability to collect and monitor system metrics (CPU, memory, disk usage) on the ingestor node, which aligns with the PR's objective of enhancing monitoring capabilities. This metrics collection is initialized during server startup after metadata is stored.

The implementation properly follows an asynchronous pattern and handles potential errors by propagating them upward with the `?` operator, which ensures the server won't start if metrics initialization fails.

src/handlers/http/modal/server.rs (2)
33-33: LGTM: Added import for system metrics scheduler

Added import for the system metrics scheduling functionality that will be used in the init method.

138-139: Added system metrics collection to the main server

This adds the ability to collect and monitor system metrics (CPU, memory, disk usage) on the server node, which aligns with the PR's objective of enhancing monitoring capabilities. The metrics collection is properly initialized during server startup after analytics initialization.

The implementation follows an asynchronous pattern with proper error handling via the `?` operator, ensuring the server won't start if metrics initialization fails. It's appropriately placed after the analytics initialization but before spawning the server tasks.

src/handlers/http/ingest.rs (2)
31-31: Updated import to include LogSourceEntry

Updated import for format-related types to match their usage in the code.

127-127: Simplified internal stream processing

Replaced the previous schema-based event processing with a more streamlined approach using `flatten_and_push_logs`. This simplification eliminates the need for schema retrieval and additional transformation steps when ingesting internal stream data.

This approach is more consistent with how other stream ingestion is handled throughout the codebase and reduces the complexity of the `ingest_internal_stream` function. Since this is used for internal pmeta streams, which include the system metrics data being added in this PR, the simplified approach should make it easier to reliably capture the metrics.

src/handlers/http/modal/query_server.rs (3)
22-22: Fixed typo in cluster metrics scheduler import

Corrected the function name from "schedular" to "scheduler" in the import statement.

28-28: Added import for system metrics scheduler

Added import for the system metrics scheduling functionality that will be used in the init method.

118-120: Improved metrics initialization sequence

Enhanced the query server initialization by adding system metrics collection and improving the cluster metrics initialization flow. The previous conditional approach has been replaced with a more robust sequential initialization pattern.

This implementation:
- Initializes system metrics collection (CPU, memory, disk usage) on the querier node
- Initializes cluster metrics collection from all ingestors
- Uses proper error handling with the `?` operator to ensure the server won't start if either metrics initialization fails

This change aligns perfectly with the PR's objective of adding sysinfo metrics from the querier node to both pmeta and cluster metrics API, enhancing the overall monitoring capabilities of the system.
src/handlers/http/cluster/mod.rs (3)
42-42: Imports look good.

Introducing `collect_all_metrics` here properly ties together the metric collection logic used later in this file.

878-878: Asynchronous scheduler initialization appears sound.

Converting `init_cluster_metrics_scheduler` into an async function aligns it well with asynchronous tasks. This helps ensure non-blocking operation when scheduling metrics collection.

885-887: Good error logging addition.

Capturing and logging errors from `collect_all_metrics()` provides better visibility. Ensure sensitive data is not included in any custom error messages returned from `collect_all_metrics()` to avoid accidental PII leakage.

src/cli.rs (2)
453-466: Panic-based validation for endpoints is acceptable but strict.

The new `get_endpoint` function panics if the endpoint string includes "http" or if environment variables are malformed. This is a valid fail-fast strategy, but you might consider a more robust error mechanism if there are advanced use cases (e.g., IPv6 or over-sanitized environment variables).
484-505: Environment variable resolution strategy looks suitable.

Resolving the `$VARNAME` pattern is convenient. The panic path is again a valid approach if environment variables are strictly required at startup. If you plan to allow optional environment variables or partial expansions, you might expand the logic accordingly.
src/metrics/prom_utils.rs (3)

19-21: Imports look good

No issues observed with introducing `HashMap` and `Path` here, as they align well with the new disk and memory usage collection logic.
22-45: No immediate concerns

These added imports from `crate::about::current`, along with system metrics and disk usage utilities, seem necessary and correctly scoped.
254-361: Increase test coverage for metric processing

The `build_metrics_from_samples`, `process_gauge_metric`, and related helper functions correctly process Prometheus samples into `Metrics`. However, there's a significant gap in test coverage. Implement targeted tests confirming each metric maps correctly (especially disk and memory usage) to ensure reliability.

Would you like me to generate a specialized test module for these functions?
src/metrics/mod.rs (4)
21-29: Imports and constants are fine

No issues with the newly introduced imports and constants for scheduling intervals and HTTP integration.

Also applies to: 34-35, 39-39
358-383: Struct definitions are clear

`DiskMetrics`, `SystemMetrics`, `MemoryMetrics`, and `CpuMetrics` are well-defined. Using simple numeric fields is efficient for direct usage in Prometheus metrics.
436-451: Robust disk and system metrics collection

The functions for collecting disk usage, memory, and CPU metrics are well-structured and straightforward. Current usage of the `sysinfo` crate is appropriate. Just ensure you handle large disk sizes and high CPU usage gracefully when converting into `i64` or `f64`.

Please confirm large disk or CPU usage values are tested to avoid overflow or unexpected truncation issues; see the sketch below.

Also applies to: 453-480, 482-506, 508-542
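A small demonstration of the two failure modes in question; `try_from`-style checks are one way to surface them instead of silently truncating (the values are illustrative):

```rust
fn main() {
    // sysinfo reports sizes as u64; a pathological value overflows i64.
    let total_bytes: u64 = u64::MAX;
    match i64::try_from(total_bytes) {
        Ok(v) => println!("fits in i64: {v}"),
        Err(_) => println!("would overflow i64 — clamp or keep as u64"),
    }
    // f64 cannot represent every integer above 2^53 exactly.
    let big: u64 = (1u64 << 53) + 1;
    assert_eq!(big as f64 as u64, 1u64 << 53); // the +1 is lost
    println!("2^53 + 1 round-trips through f64 as 2^53");
}
```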
385-406: ❓ Verification inconclusive

Potential concurrency consideration

`init_system_metrics_scheduler` spawns a loop sleeping every 10 seconds but uses an interval of 1 minute. Confirm that the more frequent loop wake-up won't cause concurrency or performance overhead.

Also applies to: 408-417, 419-434

🏁 Script executed:

```bash
#!/bin/bash
# Check for references to concurrency control or repeated scheduling calls
rg "init_system_metrics_scheduler" -A 10
```

Length of output: 5324

Concurrency Consideration – Verify Scheduler Wake-Up Frequency

In `src/metrics/mod.rs`, the `init_system_metrics_scheduler` function spawns an async loop that calls `scheduler.run_pending()` every 10 seconds, while the metrics collection is scheduled using `SYSTEM_METRICS_INTERVAL_SECONDS` (typically a one-minute interval). Please confirm that this more frequent wake-up does not introduce any concurrency-related overhead or performance degradation. The same pattern appears in the code blocks at lines 408-417 and 419-434, so verifying that these intervals remain lightweight under load is advisable.
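For reference, the wake-up pattern under discussion looks roughly like this. This is a sketch assuming the `clokwerk` crate's `AsyncScheduler` (the `run_pending` call quoted above matches its API); the constant and job body are illustrative:

```rust
use std::time::Duration;

use clokwerk::{AsyncScheduler, TimeUnits};

const SYSTEM_METRICS_INTERVAL_SECONDS: u32 = 60;

#[tokio::main]
async fn main() {
    let mut scheduler = AsyncScheduler::new();
    // The job itself only fires once per interval...
    scheduler
        .every(SYSTEM_METRICS_INTERVAL_SECONDS.seconds())
        .run(|| async {
            println!("collecting system metrics");
        });
    // ...but the driver loop wakes every 10s just to check for due jobs.
    tokio::spawn(async move {
        loop {
            scheduler.run_pending().await;
            tokio::time::sleep(Duration::from_secs(10)).await;
        }
    });
    tokio::time::sleep(Duration::from_secs(130)).await; // let it fire twice
}
```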
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/handlers/http/cluster/mod.rs (1)
902-911: Well-implemented async scheduler with proper error handling.

The scheduler function has been properly updated to be asynchronous, and error handling for system metrics collection has been added. However, there's a spelling inconsistency in the log message.

```diff
- info!("Setting up schedular for cluster metrics ingestion");
+ info!("Setting up scheduler for cluster metrics ingestion");
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/handlers/http/cluster/mod.rs (2 hunks)
- src/handlers/http/modal/query_server.rs (2 hunks)
- src/handlers/http/modal/server.rs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/handlers/http/modal/server.rs
- src/handlers/http/modal/query_server.rs
🧰 Additional context used
🧬 Code Graph Analysis (1)
- src/handlers/http/cluster/mod.rs (2)
  - src/metrics/mod.rs (1): collect_all_metrics (409-417)
  - src/metrics/prom_utils.rs (2): querier_prometheus_metrics (203-252), new (79-121)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: coverage
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (2)
src/handlers/http/cluster/mod.rs (2)
43-43: Appropriate import for system metrics collection.

The import of the `collect_all_metrics` function is correctly added to support the new system metrics collection functionality.

898-898: Good enhancement adding querier metrics to cluster metrics.

This addition ensures that system metrics from the querier node are included along with ingestor and indexer metrics, improving the overall monitoring capabilities of the system.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/metrics/prom_utils.rs (1)
209-227: `expect()` will bring the querier down on every I/O failure – propagate errors instead

All three resource-collection calls (`get_system_metrics`, `get_volume_disk_usage` for staging & hot-tier) unwrap with `expect(...)`. Any transient OS error (e.g. `/proc` not readable, volume temporarily unmounted) will panic the whole node and break the metrics endpoint.

Please replace `expect` with graceful handling, e.g.:

```diff
- let system_metrics = get_system_metrics().expect("Failed to get system metrics");
+ let system_metrics = match get_system_metrics() {
+     Ok(m) => m,
+     Err(e) => {
+         warn!("failed to get system metrics: {e}");
+         return metrics; // or keep partial data & return
+     }
+ };
```

Repeat for the two `get_volume_disk_usage` calls.

This comment was previously raised on earlier commits and is still applicable.

Also applies to: 239-246
🧹 Nitpick comments (1)
src/metrics/prom_utils.rs (1)
355-361: Hash-map insert silently discards duplicate CPU samples

`process_cpu_usage` overwrites any existing key with the latest sample:

```rust
metrics.parseable_cpu_usage.insert(cpu_name.to_string(), val);
```

If multiple samples for the same CPU (e.g. different modes) are scraped in one batch, earlier values are lost. Consider aggregating (e.g. max/avg) or using a compound key (`cpu` + `mode`) to retain all series.
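A compound-key variant might look like this (a self-contained sketch; the sample data and the choice of `max` as the aggregation are illustrative):

```rust
use std::collections::HashMap;

fn main() {
    // (cpu, mode, value) samples as they might arrive in one scrape batch.
    let samples = [
        ("cpu0", "user", 10.0_f64),
        ("cpu0", "system", 4.0),
        ("cpu0", "user", 12.0), // duplicate key under the current scheme
    ];
    // Keying by (cpu, mode) keeps per-mode series; max() resolves duplicates.
    let mut cpu_usage: HashMap<(String, String), f64> = HashMap::new();
    for (cpu, mode, val) in samples {
        cpu_usage
            .entry((cpu.to_string(), mode.to_string()))
            .and_modify(|v| *v = v.max(val))
            .or_insert(val);
    }
    assert_eq!(cpu_usage[&("cpu0".into(), "user".into())], 12.0);
    println!("{cpu_usage:?}");
}
```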
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- src/event/mod.rs (0 hunks)
- src/handlers/http/cluster/mod.rs (2 hunks)
- src/handlers/http/ingest.rs (2 hunks)
- src/handlers/http/modal/ingest_server.rs (2 hunks)
- src/handlers/http/modal/query_server.rs (2 hunks)
- src/handlers/http/modal/server.rs (2 hunks)
- src/metrics/mod.rs (4 hunks)
- src/metrics/prom_utils.rs (4 hunks)
- src/parseable/streams.rs (2 hunks)
💤 Files with no reviewable changes (1)
- src/event/mod.rs
🚧 Files skipped from review as they are similar to previous changes (7)
- src/handlers/http/modal/server.rs
- src/handlers/http/modal/ingest_server.rs
- src/handlers/http/modal/query_server.rs
- src/handlers/http/cluster/mod.rs
- src/handlers/http/ingest.rs
- src/parseable/streams.rs
- src/metrics/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: coverage
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
🔇 Additional comments (1)
src/metrics/prom_utils.rs (1)
329-343: Potential precision loss when casting `f64` disk gauges to `u64`

Prometheus exports disk sizes in bytes; large volumes can exceed 2⁵³ (≈9 PiB). An `f64` can't represent every integer above that threshold exactly, so the cast `val as u64` may truncate.

If you control the exporter and values will stay < 9 PiB you're fine; otherwise, store the raw `f64` or round before casting:

```diff
- "parseable_total_disk" => disk_usage.total = val as u64,
+ "parseable_total_disk" => disk_usage.total = val.round() as u64,
```

or keep `f64` in `DiskMetrics`.
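A quick check of the two cast behaviors at play here (the values are illustrative):

```rust
fn main() {
    // `as u64` truncates toward zero and saturates negatives to 0:
    assert_eq!(1023.9_f64 as u64, 1023);
    assert_eq!(-5.0_f64 as u64, 0);
    // Rounding first avoids losing the fractional byte count:
    assert_eq!(1023.9_f64.round() as u64, 1024);
    // Above 2^53 bytes, f64 spacing exceeds 1, so exact counts can shift:
    let nine_pib = 9_u64 * (1u64 << 50);
    println!("9 PiB = {nine_pib} bytes; f64 spacing there is > 1 byte");
}
```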
Actionable comments posted: 0
🔭 Outside diff range comments (1)
src/metrics/prom_utils.rs (1)
318-358: 🛠️ Refactor suggestion

Consider updating existing methods to handle errors gracefully

The `get_daily_stats_from_samples` method uses multiple `.expect()` calls which could cause panics if the expected values are not present. Consider replacing these with more graceful error handling:

```diff
- if sample.labels.get("stream").expect("stream name is present") == stream_name
-     && sample.labels.get("date").expect("date is present") == date
+ if sample.labels.get("stream").unwrap_or(&"".to_string()) == stream_name
+     && sample.labels.get("date").unwrap_or(&"".to_string()) == date
```

Apply this pattern to all similar occurrences in the method.
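As a side note, `map(String::as_str)` would avoid allocating a fresh empty `String` on every comparison; a self-contained illustration of that alternative:

```rust
use std::collections::HashMap;

fn main() {
    let labels: HashMap<String, String> =
        HashMap::from([("stream".to_string(), "app".to_string())]);
    let stream_name = "app";
    // A missing label yields None, which simply compares unequal —
    // no default String needs to be constructed.
    let matches = labels.get("stream").map(String::as_str) == Some(stream_name);
    assert!(matches);
    assert!(labels.get("date").map(String::as_str) != Some("2024-01-01"));
    println!("matched: {matches}");
}
```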
♻️ Duplicate comments (1)
src/metrics/prom_utils.rs (1)
169-193: Consider more robust error handling

The method relies on `.expect(...)` in the called `from_about_api_response` method for retrieving metadata. While the error is mapped to a `PostError`, a failure could still impact server stability.

Consider implementing fallback or graceful degradation when metadata cannot be retrieved, rather than propagating the error upward. This would ensure metrics collection continues even when certain components fail.
🧹 Nitpick comments (2)
src/metrics/prom_utils.rs (2)
284-298: Check for potential precision loss in disk usage metrics

The code casts floating-point values (`f64`) to unsigned integers (`u64`) without handling potential precision loss or negative values. Consider adding validation or using a safer conversion method:

```diff
- "parseable_total_disk" => disk_usage.total = val as u64,
- "parseable_used_disk" => disk_usage.used = val as u64,
- "parseable_available_disk" => disk_usage.available = val as u64,
+ "parseable_total_disk" => disk_usage.total = val.max(0.0) as u64,
+ "parseable_used_disk" => disk_usage.used = val.max(0.0) as u64,
+ "parseable_available_disk" => disk_usage.available = val.max(0.0) as u64,
```

This ensures negative values are handled gracefully, preventing unexpected behavior.
300-308: Apply same safe conversion to memory metrics

Similar to disk metrics, the memory usage metrics also cast floating-point values to unsigned integers without validation. Apply the same safe conversion approach:

```diff
- "total_memory" => metrics.parseable_memory_usage.total = val as u64,
- "used_memory" => metrics.parseable_memory_usage.used = val as u64,
- "total_swap" => metrics.parseable_memory_usage.total_swap = val as u64,
- "used_swap" => metrics.parseable_memory_usage.used_swap = val as u64,
+ "total_memory" => metrics.parseable_memory_usage.total = val.max(0.0) as u64,
+ "used_memory" => metrics.parseable_memory_usage.used = val.max(0.0) as u64,
+ "total_swap" => metrics.parseable_memory_usage.total_swap = val.max(0.0) as u64,
+ "used_swap" => metrics.parseable_memory_usage.used_swap = val.max(0.0) as u64,
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/handlers/http/cluster/mod.rs (3 hunks)
- src/handlers/http/modal/ingest_server.rs (2 hunks)
- src/metrics/prom_utils.rs (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/handlers/http/modal/ingest_server.rs
- src/handlers/http/cluster/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: coverage
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
🔇 Additional comments (6)
src/metrics/prom_utils.rs (6)
57-61: Good addition of system resource metrics

The new fields for disk, memory, and CPU usage metrics align well with the PR objectives to enhance monitoring capabilities. This structured approach will make it easier to track and visualize system resource utilization across different volumes.

90-111: Proper initialization of system resource metrics

The initialization of the new system metrics fields with sensible default values is well-implemented. This ensures these metrics start in a clean state before being populated with actual values.

116-167: Well-designed metric type categorization

The new `MetricType` enum provides a clean approach to categorize different types of metrics. The `from_metric` implementation effectively maps metric names and their labels to the appropriate enum variant, creating a maintainable foundation for metric processing.

230-231: Fixed lifetime & deleted storage size metrics

The implementation now correctly passes the metric name to `process_storage_size`, addressing the previous issue where lifetime and deleted storage metrics were not being populated.

262-282: Properly implemented storage metrics processing

The refactored `process_storage_size` method now correctly targets different storage structures based on the metric name, ensuring all metrics are properly updated.

310-316: Well-structured CPU usage metrics collection

The implementation of CPU usage metrics collection is clean and effective, storing metrics in a HashMap keyed by CPU identifier.
- collect CPU usage, memory usage of the server
- collect disk usage of the volumes: data, staging, hot-tier
- add these metrics to Prometheus Metrics
- export these metrics to cluster metrics API
- add the metrics to pmeta stream
- add the querier node's sysinfo metrics to pmeta and cluster metrics API