Distributed scheduler building block and service proposal #44
Conversation
Signed-off-by: Cassandra Coyle <[email protected]>
Overall, I really like this proposal. A scheduler service that can be used to power multiple Dapr building blocks, including actor reminders, makes a ton of sense. I do have a couple questions about the proposal below, primarily from a workflow perspective.
Some comments from me!
0012-BIRS-distributed-scheduler.md (outdated)

Job job = 1;

// The metadata associated with the job.
// This can contain the generated `key` for the optional state store when the daprd sidecar needs to lookup the entire data from a state store.
This proposal was shut down a few months ago on the grounds that it impacts performance too negatively. Has the thinking changed about this?
Data can be saved without limits in the default embedded db (etcd) and for flexibility users can use the state store of their choosing.
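To make that lookup path concrete, here is a minimal Go sketch of how a sidecar could resolve the full payload when the trigger metadata only carries a generated state store key. All names here (`StateStore`, `TriggerMetadata`, `resolvePayload`) are illustrative assumptions, not the proposal's actual types:

```go
package example

import (
	"context"
	"fmt"
)

// StateStore is a minimal stand-in for a Dapr state store client.
type StateStore interface {
	Get(ctx context.Context, key string) ([]byte, error)
}

// TriggerMetadata mirrors the idea of trigger metadata carrying an optional
// state-store key alongside inline job data.
type TriggerMetadata struct {
	StateStoreKey string // set when the payload was offloaded to a state store
	InlineData    []byte // set when the payload fits in the scheduler's embedded etcd
}

// resolvePayload returns the job payload, preferring inline data and falling
// back to a state-store lookup when only a key is present.
func resolvePayload(ctx context.Context, store StateStore, md TriggerMetadata) ([]byte, error) {
	if len(md.InlineData) > 0 {
		return md.InlineData, nil
	}
	if md.StateStoreKey == "" {
		return nil, fmt.Errorf("trigger metadata has neither inline data nor a state store key")
	}
	return store.Get(ctx, md.StateStoreKey)
}
```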
0012-BIRS-distributed-scheduler.md (outdated)

service SchedulerCallback {
  // Callback RPC for job schedule being at 'trigger' time
  rpc TriggerJob(TriggerRequest) returns (google.protobuf.Empty) {}
One thing you may want to consider is some sort of rate-limiting here, for example allowing a maximum number of active jobs.
Currently we don't rate-limit reminders explicitly, but given the perf bottlenecks we have in the current implementation (for example the I/O cost of executing a reminder), there's a "natural" limit there. This new design should have a much lower cost and at this stage the risk of DoS-ing an app may be real.
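For illustration, a minimal Go sketch of one way such a cap could work, bounding the number of in-flight `TriggerJob` deliveries to the app with a simple semaphore; the types and names here are hypothetical, not part of the proposal:

```go
package example

import (
	"context"
)

// triggerLimiter bounds the number of concurrent job deliveries to the app.
type triggerLimiter struct {
	slots chan struct{}
}

func newTriggerLimiter(maxActive int) *triggerLimiter {
	return &triggerLimiter{slots: make(chan struct{}, maxActive)}
}

// Do runs deliver() only once a slot is free, or returns the context error
// if the caller gives up while waiting for a slot.
func (l *triggerLimiter) Do(ctx context.Context, deliver func(context.Context) error) error {
	select {
	case l.slots <- struct{}{}:
		defer func() { <-l.slots }()
		return deliver(ctx)
	case <-ctx.Done():
		return ctx.Err()
	}
}
```

A buffered channel keeps the sketch dependency-free; a token-bucket limiter would work equally well if a per-second rate is preferred over a concurrency cap.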
### Acceptance Criteria

* How will success be measured?
Please also add an acceptance criterion to test what happens in disaster scenarios (when a quorum of instances is lost, i.e. more than n/2 fail, such as 2 out of 3). We also need to test how it recovers from a total cluster shutdown - I know of more than a few users who shut down their development K8s clusters when not in use, for example.
There can be multiple disaster scenarios that are hard to predict, so I will add a backup & recovery scenario to be documented as part of the acceptance criteria.
There can be multiple failure scenarios, but the most critical one is the failure of N/2+1 nodes (loss of quorum) or of every node (so with 3 nodes, failure of 2 or 3). This can be tested easily by shutting down some of the nodes.
I'm assuming the job [...]. Does the [...]?

For example, many building blocks that support multi-tenancy can have a scope of [...]. So let's assume I have a single namespace ([...]). Would the following command performed on App [...]?

This may sound somewhat arbitrary, but it's worth having the discussion, because as it stands, Dapr Workflow IDs are scoped to the {namespace}.{appId} - so IMO it would make sense to be consistent by default, unless there is a reason not to. This would also have the implication that the [...].

Therefore the proposal should try to be more specific around scope. I can see 3 potential, non-mutually exclusive options to be considered:
I assume this proposal is more biased towards durability and horizontal scaling than towards precision, i.e. we can guarantee that a job will never be invoked before its schedule is due, but we can't guarantee a ceiling on how long after the due time it is invoked. If this is true, it's important that any literature is very clear about it, as people will naturally expect perfect precision if it's not explicitly stated.
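As a small illustration of that semantic, and assuming the trigger payload carries the job's due time (an assumption, not something the proposal specifies), user code could observe the at-or-after behaviour by measuring delivery lag:

```go
package example

import (
	"log"
	"time"
)

// onTrigger is an example callback invoked when a job fires.
func onTrigger(jobName string, dueTime time.Time) {
	lag := time.Since(dueTime) // should never be negative if the guarantee holds
	if lag < 0 {
		log.Printf("job %s fired %s before its due time - unexpected", jobName, -lag)
		return
	}
	log.Printf("job %s fired %s after its due time", jobName, lag)
}
```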
All jobs will be namespaced to the app/sidecar namespace. We will not support global jobs, unless all the jobs explicitly fall under the same namespace. There is no broadcast callback to all apps in a namespace with this proposal; that might lean towards using the PubSub building block (as a future implementation detail).
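Purely as an illustration of that namespacing, one possible key layout that scopes jobs per namespace and app might look like the following Go sketch; the actual storage layout used by the Scheduler service may differ:

```go
package example

import "path"

// jobKey builds a storage key that isolates jobs per namespace and app,
// so identically named jobs in different apps or namespaces never collide.
func jobKey(namespace, appID, jobName string) string {
	return path.Join("jobs", namespace, appID, jobName)
}
```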
Signed-off-by: Cassandra Coyle <[email protected]>
You should see this explicitly in the proposal now - specifically in the [...].
Co-authored-by: Josh van Leeuwen <[email protected]> Signed-off-by: Cassie Coyle <[email protected]>
Signed-off-by: Cassandra Coyle <[email protected]>
Does the implementation of this proposal depend upon proposal #50 at all, or are they separate endeavours?
I've been re-reading this proposal and I have a few comments
{
"schedule": "@daily",
"data": {
"task": "db-backup",
"metadata": {
"db_name": "my-prod-db",
"backup_location": "/backup-dir"
}
}
}
This is out of scope for this proposal. The meaning behind the mention of other building blocks is that users can be notified on a given schedule and then perform things like pub/sub, service invocation, and more from within their own code. In the future we might be able to integrate scheduling into other building blocks, but for now this is completely out of scope.
The purpose of this API is both to satisfy the requirements of actor reminders and Dapr Workflows and to provide users with a simple CRON-like API that doesn't require them to embrace a completely new programming model. Since cron jobs and scheduled reminders are a very common primitive, it would be useful for users to be able to schedule these easily without the additional overhead of a full-fledged programming model. Temporal, for example, introduced a distributed scheduler primitive in addition to their orchestration patterns.
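As a rough sketch of what that could look like from an application, here is a hedged Go example that schedules the daily backup job from the JSON above via the sidecar; the endpoint path and payload shape are assumptions for illustration, not the final API surface:

```go
package example

import (
	"bytes"
	"fmt"
	"net/http"
)

// scheduleDailyBackup registers a named job with the sidecar over HTTP.
func scheduleDailyBackup(daprPort int) error {
	body := []byte(`{
		"schedule": "@daily",
		"data": {
			"task": "db-backup",
			"metadata": {
				"db_name": "my-prod-db",
				"backup_location": "/backup-dir"
			}
		}
	}`)
	// Hypothetical endpoint: POST a named job to the sidecar's jobs API.
	url := fmt.Sprintf("http://localhost:%d/v1.0-alpha1/jobs/db-backup", daprPort)
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("scheduling job failed: %s", resp.Status)
	}
	return nil
}
```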
Proposals do not deal with timelines, what goes into which milestone, or priorities. A proposal is specifically a yes/no decision on whether a design can be implemented in Dapr at some point in the future. The actual prioritization of which parts of a proposal to implement in which milestone comes down to the feature definition and triage stage that maintainers undergo at the beginning of each milestone.
Fair. Although I still don't see any mention of how user code is activated. I assume the schedule needs to be given a path (similar to how pub/sub calls a specific HTTP endpoint for a subscription).
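By analogy with pub/sub subscriptions, here is a speculative Go sketch of how such a callback path could look on the app side; the path and payload shape are assumptions, not something the proposal defines:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// jobTrigger is a hypothetical payload delivered by the sidecar when a job fires.
type jobTrigger struct {
	Name string          `json:"name"`
	Data json.RawMessage `json:"data"`
}

func main() {
	// Assumed callback path the sidecar would call on trigger, similar to a
	// pub/sub subscription route.
	http.HandleFunc("/job/db-backup", func(w http.ResponseWriter, r *http.Request) {
		var t jobTrigger
		if err := json.NewDecoder(r.Body).Decode(&t); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		log.Printf("job %s triggered with payload %s", t.Name, t.Data)
		w.WriteHeader(http.StatusOK) // acknowledge the trigger
	})
	log.Fatal(http.ListenAndServe(":6002", nil))
}
```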
They did! And interestingly, it would appear that they chose to dogfood their own workflows programming model to implement the higher-level 'schedule' concept. With this proposal, it appears we are doing the opposite? https://temporal.io/blog/how-we-build-it-building-scheduled-workflows-in-temporal
I would still like some clarity on how this all manifests.
Signed-off-by: Cassandra Coyle <[email protected]>
Note: I left the List jobs endpoint in the proposal; however, for 1.14 we will not implement this API because more work is required for pagination.
Signed-off-by: Cassandra Coyle <[email protected]>
Following are the performance numbers and associated improvements observed in the PoC tracked in the PR: with HA mode (3 scheduler instances), we are able to schedule 50,000 actor reminders with an average trigger QPS of 4,582. This shows at least a 10x improvement while keeping roughly the same QPS. Invoking the Scheduler Job API directly, we observed a QPS of ~35,000 for triggering.
Great work Cassie and team! This all seems very positive and I can't wait to try it out within the context of Workflows - particularly running workflow apps with more than just 1 or 2 instances, as this is a critical milestone/achievement to GA actors. I'm actually on vacation for the next 2 weeks, but I have my Mac with me, so if someone can publish a container image and some instructions on how to integrate it, I will happily run it within my docker compose test harness that I've been using for load testing Workflows - DM me on Discord to discuss further :)
Amazing work! Loving the preliminary numbers.
* Scheduler Building Block API code
* Scheduler Service code
* Tests added (e2e, unit)
Will this be validated in the weekly/release longhauls as well?
Adding new building blocks in alpha does not require longhaul tests; it has not even been mentioned as a requirement for stable before. I suggest we modify the LH tests to use the scheduler (with reminders and jobs) as part of the stable criteria, not in alpha.
@cicoyle what are the perf numbers for executing reminders? The scheduling was always an issue in the current implementation, but more interesting would be measuring how they are executed.
That is a great question. If by "executing reminders" you mean the reminders being invoked by the sidecar, then those numbers are present and referred to as "triggers" in the comment from @cicoyle. If you are referring to something else by "executing reminders", please clarify. Thanks.
I didn't understand those were the "triggers", thanks. In the past we discussed the workflow perf test too, since that's heavily reliant on reminders.
Signed-off-by: Cassandra Coyle <[email protected]>
Signed-off-by: Cassandra Coyle <[email protected]>
When using the new scheduler service, we can confirm that we achieved the ability to increase throughput as scale grows, in contrast to the existing reminder system, which shows performance degradation as scale increases. As expected, a decrease in some performance metrics is observed at the smallest parallelism and scale (details below), which then improves as parallelism and concurrency grow.

* While testing parallel workflows with a max concurrent workflow count of 110 and 440 total workflow runs, we saw performance improvements to the tune of 275% for the rate of iterations. Furthermore, Scheduler achieves a 267% improvement in both data sent and received, proving that it can handle a higher rate of data while significantly decreasing the maximum total time taken to complete the request by 62%. These numbers demonstrate the horizontal scalability of workflows with Scheduler used as the underlying reminder system.
* While testing parallel workflows with a max concurrent count of between 60 and 90, we saw performance improvements of about 71%, whereas the existing reminder system drops by 44% in comparison, being unable to correct the performance drop as the scale of workflows continues to grow.
* At 350 max concurrent workflows and 1,400 iterations, we see a performance improvement 50% higher than the existing reminder system.
* At the smaller scale of 30 max concurrent workflows and 300 total workflow runs, we observed a 27% decrease in the rate of iterations and a 25% decrease in data sent/received, with a 56% improvement in latency (the maximum total time taken to complete the request).

As part of the PoC, we ran several performance tests with 100k and 10k workflows to examine other bottlenecks in the Workflow system unrelated to the scheduling/reminder sub-systems. Several areas were identified as potential bottlenecks for Workflows as a whole, and these should be an area of focus in coming iterations.
Signed-off-by: Cassie Coyle <[email protected]>
+1 binding
Signed-off-by: Artur Souza <[email protected]>
I think we can continue to optimize it for the perf issues. +1 binding
+1 non-binding
Recusing myself from voting.
Recusing myself from voting.
Closing this vote as approved. Thanks to the author, all reviewers, and maintainers for participating.
This design proposes 2 additions:

* a Scheduler building block API, and
* a Scheduler service.

See the images in the proposal for a high-level overview of the architecture involved. Please review and let me know what you think 🎉