Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix!: subresouce scale selector changed for traffic routed canary #4074

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

zachaller
Copy link
Collaborator

@zachaller zachaller commented Jan 24, 2025

BREAKING CHANGE

We are changing rollouts .status.selector field to report only the stable replicaset for traffic routed canaries that do not use dynamicStableScale. This is needed in order to be able to use HPA's reliably. A little explanation on how HPA. calculate desired pods. There are two methods and formulas for calculating desired pods counts we will be focusing on Value type metrics.

The formula for Value type metrics is as follows desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )] The currentReplicas count comes from the selector defined in .status.selector for Argo Rollouts. Before we merge this PR this would be all pods under the Rollout. This causes issues because as we shift traffic to the desired state we end up with a whole bunch of pods not taking any traffic. In theory the number of pods taking traffic will always be equal to .spec.replicas which is also the same value as the stable replicaset, therefore we should report to the HPA the number of pods behind the stable replicaset as a proxy to .spec.replicas.

This change does have the possibility to affect your current HPA configurations. You should make sure that you understand the new behavior and that it works with your HPA's metrics. Chances are high that what you currently have configured is actually broken due to this issue.

Two examples before and after of the formula:

Before this PR

.spec.replicas = 10
setWeight = 50
desiredMetricValue = 50
currentMetricValue = 50
canaryPodCount = 5
stablePodCont = 10
 
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
15 = ceil[ (10 + 5) * ( 50 / 50) ]

Here we would want to scale up to 15 pods incorrectly.

After this PR

.spec.replicas = 10
setWeight = 50
desiredMetricValue = 50
currentMetricValue = 50
canaryPodCount = 5
stablePodCont = 10
 
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
10 = ceil[ (10) * ( 50 / 50) ]

Here we keep the pod count at 10 correctly.

This is a simplified example due to the dynamic nature of shifting weights during a rollout and metrics changing during those shifts. However, this new behavior is more correct and allows you to construct a metric that is more reliable.

Copy link
Contributor

github-actions bot commented Jan 24, 2025

Published E2E Test Results

  4 files    4 suites   3h 8m 14s ⏱️
113 tests 102 ✅  7 💤 4 ❌
456 runs  424 ✅ 28 💤 4 ❌

For more details on these failures, see this check.

Results for commit 95c0808.

♻️ This comment has been updated with latest results.

Copy link
Contributor

github-actions bot commented Jan 24, 2025

Published Unit Test Results

2 291 tests   2 291 ✅  2m 58s ⏱️
  128 suites      0 💤
    1 files        0 ❌

Results for commit 95c0808.

♻️ This comment has been updated with latest results.

Signed-off-by: Zach Aller <[email protected]>
Signed-off-by: Zach Aller <[email protected]>
Copy link
Member

@agaudreault agaudreault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

rollout/canary.go Outdated Show resolved Hide resolved
f.run(getKey(r2, t)) // there should be no api calls
f.assertEvents(nil)

patch := f.getPatchedRollout(patchIndex)
assert.Equal(t, "{\"status\":{\"selector\":\"foo=bar,rollouts-pod-template-hash=58c48fdff5\"}}", patch)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert.Equal(t, "{\"status\":{\"selector\":\"foo=bar,rollouts-pod-template-hash=58c48fdff5\"}}", patch)
expectedPatch := `{"status":{"selector":"foo=bar,rollouts-pod-template-hash=58c48fdff5"}}`
assert.JSONEq(t, expectedPatch, patch)

Signed-off-by: Zach Aller <[email protected]>
@zachaller zachaller changed the title fix: subresouce scale selector for traffic routed canary fix: subresouce scale selector change for traffic routed canary Jan 31, 2025
@zachaller zachaller changed the title fix: subresouce scale selector change for traffic routed canary fix: subresouce scale selector changed for traffic routed canary Jan 31, 2025
Signed-off-by: Zach Aller <[email protected]>
@zachaller zachaller changed the title fix: subresouce scale selector changed for traffic routed canary fix!: subresouce scale selector changed for traffic routed canary Jan 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants