
Smarter autoscaling #136

Open
whynowy opened this issue Aug 10, 2022 · 2 comments
Labels
area/controller · enhancement (New feature or request) · opex (Operational Excellence to make it easy to run in production and debug)

Comments

whynowy (Member) commented Aug 10, 2022

Summary

With the current autoscaling strategy, when a high-throughput pipeline (e.g., one with a large backlog to process) hits a processing-rate bottleneck (due to the ISB or anything else), the replica count still keeps going up until it reaches scale.max or settles into some kind of balance, which is not the expected behavior.

We need to make the autoscaler smarter: for example, if a vertex performs about the same with 5 replicas as with 6, we should run only 5.
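One way to express that idea is a marginal-gain guard: before adding another replica, check whether the previous scale-up actually improved throughput. A minimal sketch follows; all names and thresholds here are hypothetical and not the actual controller code.

```go
package scaling

// Hypothetical marginal-gain guard; names and thresholds are illustrative,
// not Numaflow's actual autoscaler API.

// vertexStats is a snapshot of a vertex at a scaling decision point.
type vertexStats struct {
	replicas       int
	processingRate float64 // messages/second across all replicas
}

// shouldScaleUp returns true only if the last replica increase bought a
// meaningful throughput improvement. If going from prev.replicas to
// curr.replicas improved the rate by less than minGainPerReplica per added
// replica, the vertex is bottlenecked elsewhere (e.g., the ISB), so we hold
// at the current replica count instead of climbing toward scale.max.
func shouldScaleUp(prev, curr vertexStats, minGainPerReplica float64) bool {
	added := curr.replicas - prev.replicas
	if added <= 0 {
		return true // no recent scale-up to judge; defer to the normal policy
	}
	gain := (curr.processingRate - prev.processingRate) / float64(added)
	return gain >= minGainPerReplica
}
```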


Message from the maintainers:

If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

whynowy added the enhancement label on Aug 10, 2022
whynowy (Member, Author) commented Sep 15, 2022

We already have a mechanism where, when back pressure is detected in downstream vertices, the current vertex is scaled down (or kept at its current replica count, depending on where the back pressure happens). We need to investigate what the impact would be if there is no way to get the full picture of back pressure across all the vertices (e.g., some vertices were scaled down to 0 but cannot be scaled back up due to restrictions such as running out of quota).
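For reference, a rough sketch of that reaction logic; the buffer-usage inputs and threshold are assumptions for illustration, not Numaflow's actual API.

```go
package scaling

// Hypothetical reaction to downstream back pressure; buffer-usage inputs
// and the threshold are assumptions, not Numaflow's actual API.

type scaleDecision int

const (
	scaleNormally scaleDecision = iota // no downstream back pressure
	holdReplicas                       // congestion further downstream
	scaleDownNow                       // the buffer we write to is full
)

// reactToBackPressure takes buffer usage ratios in [0.0, 1.0]: directUsage
// for the buffer this vertex writes to, worstDownstreamUsage for the most
// congested buffer anywhere further downstream.
func reactToBackPressure(directUsage, worstDownstreamUsage, fullThreshold float64) scaleDecision {
	switch {
	case directUsage >= fullThreshold:
		return scaleDownNow // we are feeding the congested buffer directly
	case worstDownstreamUsage >= fullThreshold:
		return holdReplicas // don't add replicas and make congestion worse
	default:
		return scaleNormally
	}
}
```

The open question above is what happens when worstDownstreamUsage cannot be observed for every vertex, e.g., a vertex scaled to 0 that cannot come back up due to quota.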

vigith added the opex label on Jul 19, 2024
kohlisid (Contributor) commented
Testing a high-TPS pipeline, I see a situation with the default scaling policies where scaling up and down is too variable. The input load is constant: as back pressure increases we scale up, but the scale-down happens soon afterwards, and the cycle keeps repeating. We do have the option to configure cooldowns etc. to optimize this, but from a user's perspective they may not know the best settings for such a situation. We might want to look into this and see how we can take a more stable approach.

We see this ping-pong, with the replicas scaling up and down constantly; one possible damping approach is sketched after the screenshot.

[Screenshot from 2024-07-19: replica count repeatedly scaling up and down]
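A common way to damp this kind of oscillation is hysteresis via asymmetric cooldowns: make scale-down require a much longer quiet period than scale-up. A minimal sketch, with hypothetical names and durations rather than Numaflow's actual configuration surface:

```go
package scaling

import "time"

// Hypothetical damping via asymmetric cooldowns; field names and durations
// are illustrative only.

type damper struct {
	scaleUpCooldown   time.Duration // e.g., 90 * time.Second
	scaleDownCooldown time.Duration // e.g., 5 * time.Minute, longer on purpose
	lastScaleUp       time.Time
	lastScaleDown     time.Time
}

// mayScaleUp and mayScaleDown gate any scaling action on elapsed cooldowns.
// Making the scale-down window much longer than the scale-up window lets a
// vertex ride out short bursts instead of oscillating after each one.
func (d *damper) mayScaleUp(now time.Time) bool {
	return now.Sub(d.lastScaleUp) >= d.scaleUpCooldown
}

func (d *damper) mayScaleDown(now time.Time) bool {
	// Also require quiet time since the last scale-up, so a burst-driven
	// scale-up is not immediately undone when the burst subsides.
	return now.Sub(d.lastScaleDown) >= d.scaleDownCooldown &&
		now.Sub(d.lastScaleUp) >= d.scaleDownCooldown
}
```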
