Distributed scheduler building block and service proposal #44

Merged · 10 commits · Jun 19, 2024

Conversation

@cicoyle (Contributor) commented Nov 2, 2023

This design proposes 2 additions:

  • A Distributed Scheduler API Building Block
  • A Distributed Scheduler Control Plane Service

See the images for a high-level overview of the architecture involved.

Please review and let me know what you think 🎉

@cicoyle changed the title from "initial distributed scheduler building block and service proposal" to "distributed scheduler building block and service proposal" on Nov 2, 2023
@cgillum left a comment
Overall, I really like this proposal. A scheduler service that can be used to power multiple Dapr building blocks, including actor reminders, makes a ton of sense. I do have a couple questions about the proposal below, primarily from a workflow perspective.

@JoshVanL (Contributor) left a comment

Some comments from me!

@cicoyle (Contributor, Author) commented Nov 6, 2023

linking to release 1.13

@JoshVanL mentioned this pull request Nov 6, 2023
Job job = 1;

// The metadata associated with the job.
// This can contain the generated `key` for the optional state store when the daprd sidecar needs to lookup the entire data from a state store.
Contributor:

This proposal was shut down a few months ago with the reasoning that it impacts performance negatively too much. Has the thinking changed about this?

cicoyle (Author):

Data can be saved without limits in the default embedded DB (etcd), and for flexibility users can use the state store of their choosing.
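As an illustration of that flow (not code from the proposal): a minimal Go sketch, assuming a hypothetical stateStore interface and a metadata field named key, of how a sidecar could resolve a job's full payload from a state store when the job only carries the generated key mentioned in the quoted snippet above.

package main

import (
	"context"
	"fmt"
)

// stateStore is a minimal stand-in for a Dapr state store component; the real
// component interface differs - this is only for illustration.
type stateStore interface {
	Get(ctx context.Context, key string) ([]byte, error)
}

// resolveJobData returns the job payload: either the inline data stored with
// the job, or the full data fetched from the state store via the generated
// `key` carried in the job metadata.
func resolveJobData(ctx context.Context, store stateStore, inline []byte, metadata map[string]string) ([]byte, error) {
	if key, ok := metadata["key"]; ok && store != nil {
		return store.Get(ctx, key) // large payload kept outside the scheduler's etcd
	}
	return inline, nil // payload small enough to live with the job itself
}

type memStore map[string][]byte

func (m memStore) Get(_ context.Context, key string) ([]byte, error) { return m[key], nil }

func main() {
	store := memStore{"job-123": []byte(`{"task":"db-backup"}`)}
	data, _ := resolveJobData(context.Background(), store, nil, map[string]string{"key": "job-123"})
	fmt.Println(string(data)) // {"task":"db-backup"}
}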

service SchedulerCallback {

// Callback RPC for job schedule being at 'trigger' time
rpc TriggerJob(TriggerRequest) returns (google.protobuf.Empty) {}
Contributor:

One thing you may want to consider is some sort of rate-limiting here, for example allowing a maximum number of active jobs.

Currently we don't rate-limit reminders explicitly, but given the perf bottlenecks we have in the current implementation (for example the I/O cost of executing a reminder), there's a "natural" limit there. This new design should have a much lower cost, and at this stage the risk of DoS-ing an app may be real.
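For illustration only - not part of the proposal - a minimal Go sketch of one way such a limit could look on the sidecar side, bounding the number of in-flight TriggerJob callbacks with a counting semaphore; the triggerLimiter type and the deliver callback are hypothetical names.

package main

import (
	"context"
	"fmt"
	"time"
)

// triggerLimiter bounds how many job triggers are delivered to the app at once.
type triggerLimiter struct {
	sem chan struct{} // counting semaphore; capacity = max concurrent triggers
}

func newTriggerLimiter(maxInFlight int) *triggerLimiter {
	return &triggerLimiter{sem: make(chan struct{}, maxInFlight)}
}

// Do delivers one trigger, blocking while the maximum number of triggers is
// already in flight (or until the context is cancelled).
func (l *triggerLimiter) Do(ctx context.Context, jobName string, deliver func(string) error) error {
	select {
	case l.sem <- struct{}{}: // acquire a slot
	case <-ctx.Done():
		return ctx.Err()
	}
	defer func() { <-l.sem }() // release the slot
	return deliver(jobName)
}

func main() {
	limiter := newTriggerLimiter(100) // e.g. at most 100 concurrent callbacks to the app
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	_ = limiter.Do(ctx, "prd-db-backup", func(name string) error {
		fmt.Println("delivering trigger for job:", name)
		return nil
	})
}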


### Acceptance Criteria

* How will success be measured?
Contributor:

Please also add acceptance criteria to test what happens in disaster scenarios, when the cluster loses quorum (for example 2 out of 3 instances failing). We also need to test how it recovers from total cluster shutdowns - I know of more than a few users that shut down their development K8s clusters when not in use, for example.

cicoyle (Author):

There can be multiple disaster scenarios that are hard to predict, so I will add a backup & recovery scenario to be documented as part of the acceptance criteria.

Contributor:

There can be multiple failure scenarios, but the most critical one is failure of N/2+1 nodes or of every node (so with 3 nodes, failure of 2 or 3). This can be tested easily by shutting down some of the nodes.

@msfussell changed the title from "distributed scheduler building block and service proposal" to "Distributed scheduler building block and service proposal" on Nov 10, 2023
@olitomlinson:
I'm assuming the job name is essentially a unique ID within a given namespace?

Does the namespace map directly to the existing dapr namespace concept? Or is it something else?

For example, many building blocks that support multi-tenancy can have a scope of {namespace}.{appId} - State Store, PubSub and Dapr Workflows follow this pattern to ensure complete isolation.

So let's assume I have a single namespace (ns), and the namespace contains two Apps (App X and App Y).

Would the following command performed on App X collide with App Y?
http://localhost:<daprPort>/v1.0/job/schedule/prd-db-backup - Can both Apps have their own isolated job called prd-db-backup, or will both Apps be operating on the same job instance?

This may sound somewhat arbitrary, but it's worth having the discussion, because as it stands, Dapr Workflow IDs are scoped to the {namespace}.{appId} - so IMO it would make sense to be consistent by default, unless there is a reason not to.

This would also have the implication that the /job API operations available on the sidecar can only operate on jobs owned by that given App ID, and consequently are not allowed to operate on Jobs that may be owned by other Apps.

Therefore the proposal should try to be more specific around scope. I can see 3 potential non-mutually exclusive options to be considered:

  • {global} scope - can be operated on by any App in the cluster.
  • {namespace} scope - can be operated on by any App contained in a specific namespace.
  • {namespace}.{appId} - can only be operated on by a specific app contained in a specific namespace.

@olitomlinson:
I assume this proposal is more biased towards durability and horizontal scaling, over precision.

I.e. we can guarantee that your job will never be invoked before the schedule is due; however, we can't guarantee a ceiling time on when the job is invoked after the due time is reached.

If this is true, it's important that any literature is very clear about this, as people will naturally expect perfect precision if it's not explicitly stated.

@cicoyle (Contributor, Author) commented Nov 16, 2023

[Quoting @olitomlinson's question above about job-name scoping and {namespace}.{appId} isolation.]

All jobs will be namespaced to the app/sidecar namespace. We will not support global jobs, unless all the jobs explicitly fall under the same namespace. There is no broadcast callback to all apps in a namespace with this proposal; that might lean towards using the PubSub building block (as a future implementation detail).
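To make the scoping concrete, here is a hedged Go sketch of scheduling the daily backup job through the sidecar endpoint mentioned earlier in this thread (/v1.0/job/schedule/<name>); port 3500 is only the conventional daprd HTTP port, and the payload mirrors the db-backup example discussed later in the thread. Because jobs are scoped to the calling app's namespace/app ID, App X and App Y issuing this same request would each get their own isolated prd-db-backup job.

package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Schedule payload based on the proposal's db-backup example.
	body := []byte(`{
	  "schedule": "@daily",
	  "data": {
	    "task": "db-backup",
	    "metadata": {
	      "db_name": "my-prod-db",
	      "backup_location": "/backup-dir"
	    }
	  }
	}`)

	// The sidecar scopes this job to the calling app's namespace and app ID,
	// so the same job name used by another app refers to a different job.
	resp, err := http.Post(
		"http://localhost:3500/v1.0/job/schedule/prd-db-backup",
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("schedule response:", resp.Status)
}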

Signed-off-by: Cassandra Coyle <[email protected]>
@cicoyle (Contributor, Author) commented Nov 16, 2023

[Quoting @olitomlinson's comment above about favouring durability and horizontal scaling over precision.]

You should see this explicitly in the proposal now - specifically in the Goals section. Thanks 👍🏻

cicoyle and others added 2 commits November 19, 2023 20:31
Co-authored-by: Josh van Leeuwen <[email protected]>
Signed-off-by: Cassie Coyle <[email protected]>
@olitomlinson:
Does the implementation of this proposal depend upon this proposal #50 at all, or are they separate endeavours?

@ItalyPaleAle (Contributor):
Does the implementation of this proposal depend upon this proposal #50 at all, or are they separate endeavours?

The implementation does have a dependency on #50

@olitomlinson commented Jan 2, 2024

I've been re-reading this proposal and I have a few comments.

  1. There don't appear to be any examples of how one may author a job task, i.e. the user's code that actually gets invoked when the job executes. See below, taken from the proposal: a schedule is created which invokes a db-backup task, but how is this task expressed?
{
  "schedule": "@daily",
  "data": {
    "task": "db-backup",
    "metadata": {
      "db_name": "my-prod-db",
      "backup_location": "/backup-dir"
    }
  }
}
  2. It's my understanding that a motivation of this proposal is to also be able to perform scheduled Service Invocation, scheduled PubSub, and scheduled State Store operations, yet I don't see any examples of how this is achieved via an SDK.

    (Side bar: how would scheduled service invocation work? Who would receive the result? This is exactly where you need Dapr Workflows as an orchestration, which already has a perfect method of handling scheduled and time-implicated workloads.)

    Which brings me onto my 3rd point...

  3. Overlap with Workflows

    I see a significant overlap in point 2 with things that are already achievable via Dapr Workflows (albeit not natively possible in the SDK yet, it's just a few lines of code to achieve today).

    Going a step further... Dapr Workflows is the perfect language for defining schedules. This proposal uses the example of a daily backup routine, which is already achievable using a super simple Monitor/Eternal Orchestration pattern - Temporal.io and many others are leading the way here, particularly from a Developer Experience perspective.

    Anecdotally, lots of people use Azure Durable Functions for orchestrating backups & cloud infrastructure, so Dapr Workflows is in good company here for this kind of Ops stuff too.

    My 2c: I feel we should be advocating for Dapr adopters to learn how to express time-implicated workloads via Dapr Workflows, which can grow with complexity, rather than pointing adopters to a declarative job model which will struggle to grow beyond the basic repetition of a static schedule (like cron!).

  4. Prioritisation of effort

    At the time of writing, we have the challenge of a Dapr Workflow implementation that can't GA until the Actor Reminders problem is solved (... and a heck of a lot has already been sunk into Dapr Workflows across the runtime, docs and several SDKs over the past year!)

    Given this challenge, I would like to recommend that the effort related to this proposal is concentrated on doing the absolute bare minimum implementation to solve the Dapr Workflow / Actor Reminder problem first, rather than prioritising the delivery of the Distributed Scheduler Building Block and all the concerns that bringing a new Building Block online actually brings (SDK, docs, testing, etc.).

    I think it's fair to say that the community deserves investment in existing Building Blocks that are trying to reach GA/Stable, such as Workflows, Crypto, and Distributed Lock, before we bring another Building Block to life.

    Just to be clear, I'm not against the Distributed Scheduler Building Block API surface; however, the overlap with what is easily achievable with Dapr Workflows leaves me wondering if it's needed right now...

@yaron2 (Member) commented Jan 3, 2024

It's my understanding that a motivation of this proposal is to be able to also perform scheduled Service Invocation and scheduled PubSub, and scheduled State Store operations yet I don't see any examples of how this is achieved via an SDK?

This is out of scope for this proposal. The meaning behind the mention of other building blocks is that users can be notified on a given schedule and then perform things like pub/sub, service invocation and others from within their code. In the future we might be able to integrate scheduling into other building blocks, but for now this is completely out of scope.

I see a significant overlap in point 2, with things that are already achievable via Dapr Workflows. (albeit, not natively possible in the SDK (yet!), its just a few lines of code to achieve today)

The purpose of this API is both to satisfy the requirements of actor reminders and Dapr Workflows and to provide users with a simple cron-like API that doesn't require them to fully embrace a completely new programming model. Since cron jobs and scheduled reminders are a very common primitive, it would be useful for users to be able to schedule these easily without the additional overhead of a full-fledged programming model. Temporal, for example, introduced a distributed scheduler primitive in addition to their orchestration patterns.

Given this challenge, I would like to recommend that the effort related to this proposal is concentrated on doing the absolute bare minimum implementation to solve the Dapr Workflow / Actor Reminder problem

Proposals do not deal with timelines, milestones, or priorities. Specifically, a proposal is a yes/no decision on whether a design can at some point be implemented in Dapr. The actual priorities for each milestone, when it comes to which parts of a proposal to implement, come down to the feature definition and triage stage that maintainers undergo at the beginning of each milestone.

@olitomlinson:
The meaning behind the mention of other building blocks is that users can be notified on a given schedule and then perform things like pub/sub, service invocation and others from within their code

Fair. Although I still don't see any mention of how user code is activated? I assume the schedule needs to be given a path (similar to how pubsub calls a specific HTTP endpoint for a subscription).

Temporal for example introduced a distributed scheduler primitive in addition to their orchestration patterns

They did!

And interestingly it would appear that they chose to dogfood their own workflows programming model to implement the higher level 'schedule' concept.

With this proposal, it appears we are doing the opposite?

https://temporal.io/blog/how-we-build-it-building-scheduled-workflows-in-temporal

@olitomlinson:
Fair. Although I still don't see any mention of how user-code is activated? I assume the schedule needs to be given a path (similar to how pubsub calls a specific HTTP endpoint for a subscription)

I still would like some clarity on how this all manifests
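One possible shape for that activation - purely an assumption, since the thread leaves it open - is a pub/sub-style HTTP callback from daprd to the app. In the hypothetical Go sketch below, the /job/prd-db-backup route and the payload fields are illustrative, not defined by the proposal.

package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// jobTrigger mirrors the "data" object from the schedule example in this
// thread; the field names are assumptions for illustration.
type jobTrigger struct {
	Task     string            `json:"task"`
	Metadata map[string]string `json:"metadata"`
}

func main() {
	// Hypothetical app endpoint that the sidecar would call at trigger time.
	http.HandleFunc("/job/prd-db-backup", func(w http.ResponseWriter, r *http.Request) {
		var t jobTrigger
		if err := json.NewDecoder(r.Body).Decode(&t); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// User code runs here, e.g. starting the backup described in t.Metadata.
		log.Printf("job triggered: task=%s metadata=%v", t.Task, t.Metadata)
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":6002", nil))
}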

Signed-off-by: Cassandra Coyle <[email protected]>
@cicoyle (Contributor, Author) commented May 28, 2024

Note: I left the List jobs endpoint in the proposal; however, for 1.14 we will not implement this API due to the additional work required for pagination.

Signed-off-by: Cassandra Coyle <[email protected]>
@cicoyle (Contributor, Author) commented Jun 4, 2024

Following are the performance numbers and associated improvements observed in the PoC, tracked in the PR:

  • With HA mode (3 scheduler instances), we are able to schedule 50,000 actor reminders with an average trigger qps of 4,582. This shows at least a 10x improvement while keeping roughly the same qps. Invoking the Scheduler Job API directly, we observed a qps of ~35,000 for triggering.
  • When creating actor reminders with Dapr 1.13, we saw a qps of 50. With Scheduler, we see a qps of 4,000. This shows a drastic improvement (80x) in the scheduling of actor reminders with Scheduler. Overall, Scheduler is at most limited by the etcd storage of registered reminders, not by reminder throughput.

@olitomlinson:
Great work Cassie and team!

This all seems very positive and I can't wait to try it out within the context of Workflows - particularly running workflow apps with more than just 1-2 instances, as this is a critical milestone/achievement towards GA-ing actors.

I'm actually on vacay for the next 2 weeks, but I have my Mac with me, so if someone can publish a container image and some instructions on how to integrate it, I will happily run it within my docker compose test harness that I've been using for load testing Workflows - DM me on Discord to discuss further :)

@mikeee (Member) left a comment

Amazing work! Loving the preliminary numbers.


* Scheduler Building Block API code
* Scheduler Service code
* Tests added (e2e, unit)
Member:

Will this be validated in the weekly/release longhauls as well?

Member:

Adding new building blocks in alpha does not require longhaul tests; that hasn't been a requirement even for reaching stable before. I suggest we modify the LH tests to use Scheduler (with reminders and jobs) as part of the stable criteria, not in alpha.

@mikeee mentioned this pull request Jun 4, 2024
@ItalyPaleAle (Contributor):

@cicoyle what are the perf numbers for executing reminders? The scheduling was always an issue in the current implementation, but more interesting would be measuring how they are executed.

@artursouza (Member):
[Quoting the performance numbers above.]

@cicoyle what are the perf numbers for executing reminders? The scheduling was always an issue in the current implementation, but more interestingly would be measuring how they are executed

That is a great question. If by "executing reminders" you mean the reminders being invoked by the sidecar, then those numbers are present and referred to as "triggers" in the comment from @cicoyle. If you are referring to something else by "executing reminders", please clarify.

Thanks.

@ItalyPaleAle (Contributor):

I didn't understand those were the "triggers", thanks. In the past we discussed the workflow perf test too, since that's heavily reliant on reminders.

@cicoyle (Contributor, Author) commented Jun 17, 2024

When using the new scheduler service, we can confirm that we achieved the ability to increase throughput as the scale grows, in contrast to the existing reminder system that shows a performance degradation as scale increases. As expected, a decrease in some performance metrics is observed with the smallest parallelism and scale (details below), which then increases as parallelism and concurrency grow.

While testing parallel workflows with a max concurrent workflow count of 110 and 440 total workflow runs, we saw performance improvements to the tune of 275% in the rate of iterations. Furthermore, Scheduler achieves a 267% improvement in both the data sent and received, showing that Scheduler can handle a higher rate of data while significantly decreasing the maximum total time taken to complete the request by 62%. These numbers demonstrate the horizontal scalability of workflows with Scheduler used as the underlying reminder system.

While testing parallel workflows with a max concurrent count of between 60 & 90, we saw performance improvements of about 71%, where the existing reminder system drops by 44% in comparison, being unable to correct the performance drop as the scale of workflows continues to grow.

At 350 max concurrent workflows and 1400 iterations, we see a performance improvement of 50% higher than the existing reminder system. At the smaller scale of 30 max concurrent workflows and 300 total workflow count we observed a 27% decrease in the rate of iterations and 25% in data sent/received, with a 56% improvement in latency (being the maximum total time taken to complete the request).

As part of the PoC, we ran several performance tests with 100k and 10k workflows to examine other bottlenecks in the Workflow system, unrelated to the scheduling/reminder sub-systems. Several areas were identified as potential bottlenecks for Workflows as a whole and these should be an area of focus in coming iterations.

@yaron2 (Member) commented Jun 17, 2024

+1 binding

Signed-off-by: Artur Souza <[email protected]>
@daixiang0 (Member):

I think we can continue to optimize it for performance.

+1 binding

@mikeee (Member) commented Jun 19, 2024

+1 non-binding

@JoshVanL (Contributor):

Recusing myself from voting.

@artursouza (Member):

Recusing myself from voting.

@yaron2 (Member) left a comment

Closing this vote as approved. Thanks to the author, all reviewers, and maintainers for participating.

@yaron2 merged commit bfcbc5b into dapr:main on Jun 19, 2024
1 check passed