[RayService] Support Incremental Zero-Downtime Upgrades #3166

Open
wants to merge 14 commits into
base: master

ryanaoleary
Contributor

@ryanaoleary ryanaoleary commented Mar 7, 2025

Why are these changes needed?

This PR implements an alpha version of the RayService Incremental Upgrade REP.

The RayService controller logic to reconcile a RayService during an incremental upgrade is as follows:

  1. Validate the IncrementalUpgradeOptions and accept/reject the RayService CR accordingly
  2. Call reconcileGateway. The first call should create a new Gateway CR; subsequent calls update its Listeners as needed to reflect any changes to RayService.Spec.
  3. Call reconcileHTTPRoute. The first call should create an HTTPRoute CR with two backendRefs, one pointing to the old cluster and one to the pending cluster, with weights 100 and 0 respectively. Every subsequent call updates the HTTPRoute by changing the weight of each backendRef by StepSizePercent until the weight associated with each cluster equals that cluster's TargetCapacity. The backendRef weight is exposed through the RayService Status field TrafficRoutedPercent. The weight is only changed if at least IntervalSeconds have elapsed since RayService.Status.LastTrafficMigratedTime; otherwise the controller waits until the next reconciliation and checks again.
  4. The controller then checks whether TrafficRoutedPercent == TargetCapacity. If so, the target_capacity can be updated for one of the clusters.
  5. The controller then calls reconcileServeTargetCapacity. If the total target_capacity of both Serve configs is less than or equal to 100%, the pending cluster's target_capacity can be safely scaled up by MaxSurgePercent. If the total target_capacity is greater than 100%, the active cluster's target_capacity can be decreased by MaxSurgePercent.
  6. The controller then continues with the reconciliation logic as normal. Once the TargetCapacity and TrafficRoutedPercent of the pending RayCluster both equal 100%, the upgrade is complete.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: Ryan O'Leary <[email protected]>
@ryanaoleary ryanaoleary marked this pull request as ready for review March 25, 2025 11:26
2 participants