[Prometheus] Add ray_cluster_provisioned_duration_seconds metric #3212

Open
wants to merge 1 commit into master

win5923
Contributor

@win5923 win5923 commented Mar 20, 2025

Why are these changes needed?

Add the ray_cluster_provisioned_duration_seconds metric to track the time from RayCluster creation until all Ray Pods are ready for the first time (RayClusterProvisioned).

Manual test:

$ kubectl apply -f config/samples/ray-cluster.sample.yaml

$ echo $(( $(date -d "$(kubectl get raycluster raycluster-kuberay -o=jsonpath='{.status.conditions[?(@.type=="HeadPodReady")].lastTransitionTime}')" +%s) - $(date -d "$(kubectl get raycluster raycluster-kuberay -o=jsonpath='{.metadata.creationTimestamp}')" +%s) )) "seconds"
41 seconds

$ kubectl apply -f config/samples/ray-cluster.embed-grafana.yaml

$ echo $(( $(date -d "$(kubectl get raycluster raycluster-embed-grafana -o=jsonpath='{.status.conditions[?(@.type=="HeadPodReady")].lastTransitionTime}')" +%s) - $(date -d "$(kubectl get raycluster raycluster-embed-grafana -o=jsonpath='{.metadata.creationTimestamp}')" +%s) )) "seconds"
16 seconds

$ kubectl apply -f config/samples/ray-cluster.embed-grafana.yaml -n test

$ echo $(( $(date -d "$(kubectl get raycluster raycluster-embed-grafana -n test -o=jsonpath='{.status.conditions[?(@.type=="HeadPodReady")].lastTransitionTime}')" +%s) - $(date -d "$(kubectl get raycluster raycluster-embed-grafana -n test -o=jsonpath='{.metadata.creationTimestamp}')" +%s) )) "seconds"
15 seconds
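
The shell arithmetic above subtracts the creation timestamp from a condition's lastTransitionTime. A minimal Go sketch of the same calculation follows, assuming both values arrive as RFC 3339 strings (the format Kubernetes uses for these fields). The function name provisionedSeconds and the sample timestamps are hypothetical and are not taken from this PR.

```go
package main

import (
	"fmt"
	"time"
)

// provisionedSeconds returns the whole seconds elapsed between two RFC 3339
// timestamps, mirroring the `date +%s` subtraction in the manual test.
// Hypothetical helper for illustration; it is not part of the PR.
func provisionedSeconds(created, transitioned string) (int64, error) {
	start, err := time.Parse(time.RFC3339, created)
	if err != nil {
		return 0, fmt.Errorf("parse creationTimestamp: %w", err)
	}
	end, err := time.Parse(time.RFC3339, transitioned)
	if err != nil {
		return 0, fmt.Errorf("parse lastTransitionTime: %w", err)
	}
	return int64(end.Sub(start) / time.Second), nil
}

func main() {
	// Made-up timestamps chosen to be 41 seconds apart.
	secs, err := provisionedSeconds("2025-03-20T16:00:00Z", "2025-03-20T16:00:41Z")
	if err != nil {
		panic(err)
	}
	fmt.Println(secs, "seconds") // prints "41 seconds"
}
```

The value recorded by the metric should roughly match this difference when the same pair of timestamps is used.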


Related issue number

Closes #3172

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@win5923
Contributor Author

win5923 commented Mar 20, 2025

@troychiu PTAL, thx.

@win5923 win5923 force-pushed the metrics/ray_cluster_provisioned_duration_seconds branch from a10273a to 3b66a69 Compare March 20, 2025 16:41
Help: "The time from RayClusters created to all ray pods are ready for the first time (RayClusterProvisioned) in seconds",
// It may not be applicable to all users, but default buckets cannot be used either.
// For reference, see: https://github.com/prometheus/client_golang/blob/331dfab0cc853dca0242a0d96a80184087a80c1d/prometheus/histogram.go#L271
Buckets: []float64{30, 60, 120, 180, 240, 300, 600, 900, 1800, 3600},
Contributor Author

I'm not sure which bucket ranges would suit most users.
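
To illustrate why the default buckets do not fit here, the sketch below places the three durations from the manual test (41, 16, and 15 seconds) into both bucket sets. It uses only the standard library: defBuckets is copied from prometheus.DefBuckets in client_golang, proposedBuckets is copied from the diff above, and bucketFor is a hypothetical helper. Real Prometheus histograms are cumulative, so this only reports the smallest bucket an observation falls into.

```go
package main

import (
	"fmt"
	"sort"
)

// defBuckets mirrors prometheus.DefBuckets, which tops out at 10 seconds.
var defBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}

// proposedBuckets are the upper bounds proposed in this PR.
var proposedBuckets = []float64{30, 60, 120, 180, 240, 300, 600, 900, 1800, 3600}

// bucketFor returns the label of the smallest bucket whose inclusive upper
// bound is >= v, or "+Inf" when v exceeds every bound.
// Hypothetical helper for illustration; buckets must be sorted ascending.
func bucketFor(buckets []float64, v float64) string {
	i := sort.SearchFloat64s(buckets, v) // first index with buckets[i] >= v
	if i == len(buckets) {
		return "+Inf"
	}
	return fmt.Sprintf("le=%g", buckets[i])
}

func main() {
	for _, v := range []float64{41, 16, 15} {
		fmt.Printf("%gs: default=%s proposed=%s\n",
			v, bucketFor(defBuckets, v), bucketFor(proposedBuckets, v))
	}
}
```

With the default bounds, every sample lands in +Inf, so the histogram carries no resolution for provisioning times. With the proposed bounds, the samples fall into the first two buckets.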

@win5923 win5923 force-pushed the metrics/ray_cluster_provisioned_duration_seconds branch 2 times, most recently from 5a481f0 to 27d38a5 Compare March 25, 2025 15:32
@win5923 win5923 force-pushed the metrics/ray_cluster_provisioned_duration_seconds branch from 27d38a5 to a009b43 Compare March 25, 2025 15:42
@kevin85421
Copy link
Member

cc @troychiu let's prioritize reviewing this PR.

@kevin85421 kevin85421 self-assigned this Apr 12, 2025
@win5923
Copy link
Contributor Author

win5923 commented Apr 12, 2025

This PR is a follow-up to #3310.

@@ -1336,6 +1333,11 @@ func (r *RayClusterReconciler) calculateStatus(ctx context.Context, instance *ra
Reason: rayv1.AllPodRunningAndReadyFirstTime,
Message: "All Ray Pods are ready for the first time",
})

// Record ray_cluster_provisioned_duration_seconds duration metric
Member

The metric should not be recorded in calculateStatus. It should only be recorded when the status update succeeds, to avoid counting it more than once.
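
One way to read this suggestion is sketched below with standard-library stand-ins: the observation happens only after the status update has succeeded, and only on the transition from not provisioned to provisioned. The names clusterStatus, recorder, reconcileOnce, errConflict, and simulate are hypothetical and are not KubeRay or client_golang APIs; the real reconciler may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// clusterStatus is a hypothetical stand-in for the RayCluster status.
type clusterStatus struct {
	Provisioned bool
}

// recorder is a hypothetical stand-in for a histogram; it keeps observations.
type recorder struct {
	observed []float64
}

func (r *recorder) Observe(seconds float64) {
	r.observed = append(r.observed, seconds)
}

// errConflict simulates a failed status update.
var errConflict = errors.New("status update conflict")

// reconcileOnce persists next via update and records the duration only if
// the update succeeded and the cluster became provisioned in this step.
func reconcileOnce(old, next clusterStatus, created, now time.Time,
	update func(clusterStatus) error, rec *recorder) (clusterStatus, error) {
	if err := update(next); err != nil {
		// Nothing was persisted, so a later reconcile sees the old status
		// again. Recording here would count the same event more than once.
		return old, err
	}
	if !old.Provisioned && next.Provisioned {
		rec.Observe(now.Sub(created).Seconds())
	}
	return next, nil
}

// simulate runs a failed update, a successful one, and a later no-op reconcile.
func simulate() []float64 {
	rec := &recorder{}
	created := time.Date(2025, 3, 20, 16, 0, 0, 0, time.UTC)
	now := created.Add(41 * time.Second)
	ready := clusterStatus{Provisioned: true}
	failing := func(clusterStatus) error { return errConflict }
	succeeding := func(clusterStatus) error { return nil }

	status := clusterStatus{}
	status, _ = reconcileOnce(status, ready, created, now, failing, rec)
	status, _ = reconcileOnce(status, ready, created, now, succeeding, rec)
	_, _ = reconcileOnce(status, ready, created, now.Add(time.Minute), succeeding, rec)
	return rec.observed
}

func main() {
	fmt.Println(simulate()) // prints "[41]"
}
```

In this simulation the failed update and the later reconcile record nothing, so the histogram receives exactly one observation per cluster.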

Development

Successfully merging this pull request may close these issues.

[Feature][metrics] ray_cluster_provisioned_duration_seconds
2 participants