Distributed scheduler building block and service proposal #44

Merged · 10 commits · Jun 19, 2024

Conversation

@cicoyle (Contributor) commented Nov 2, 2023

This design proposes 2 additions:

  • A Distributed Scheduler API Building Block
  • A Distributed Scheduler Control Plane Service

See the images for a high-level overview of the architecture involved.

Please review and let me know what you think 🎉

@cicoyle changed the title from "initial distributed scheduler building block and service proposal" to "distributed scheduler building block and service proposal" on Nov 2, 2023
@cgillum left a comment
Overall, I really like this proposal. A scheduler service that can be used to power multiple Dapr building blocks, including actor reminders, makes a ton of sense. I do have a couple questions about the proposal below, primarily from a workflow perspective.

@JoshVanL (Contributor) left a comment

Some comments from me!

@cicoyle (Contributor, Author) commented Nov 6, 2023

linking to release 1.13

@JoshVanL mentioned this pull request Nov 6, 2023
Job job = 1;

// The metadata associated with the job.
// This can contain the generated `key` for the optional state store when the daprd sidecar needs to lookup the entire data from a state store.
Contributor:

This proposal was shut down a few months ago with the reasoning that it impacts performance negatively too much. Has the thinking changed about this?

cicoyle (Author):

Data can be saved without limits in the default embedded DB (etcd), and for flexibility users can use the state store of their choosing.
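As an illustration of that flow (not code from the proposal): a minimal Go sketch, assuming a hypothetical stateStore interface and a metadata field named key, of how a sidecar could resolve a job's full payload from a state store when the job only carries the generated key mentioned in the quoted snippet above.

package main

import (
	"context"
	"fmt"
)

// stateStore is a minimal stand-in for a Dapr state store component; the real
// component interface differs - this is only for illustration.
type stateStore interface {
	Get(ctx context.Context, key string) ([]byte, error)
}

// resolveJobData returns the job payload: either the inline data stored with
// the job, or the full data fetched from the state store via the generated
// `key` carried in the job metadata.
func resolveJobData(ctx context.Context, store stateStore, inline []byte, metadata map[string]string) ([]byte, error) {
	if key, ok := metadata["key"]; ok && store != nil {
		return store.Get(ctx, key) // large payload kept outside the scheduler's etcd
	}
	return inline, nil // payload small enough to live with the job itself
}

type memStore map[string][]byte

func (m memStore) Get(_ context.Context, key string) ([]byte, error) { return m[key], nil }

func main() {
	store := memStore{"job-123": []byte(`{"task":"db-backup"}`)}
	data, _ := resolveJobData(context.Background(), store, nil, map[string]string{"key": "job-123"})
	fmt.Println(string(data)) // {"task":"db-backup"}
}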

service SchedulerCallback {

// Callback RPC for job schedule being at 'trigger' time
rpc TriggerJob(TriggerRequest) returns (google.protobuf.Empty) {}
Contributor:

One thing you may want to consider is some sort of rate-limiting here, for example allowing a maximum number of active jobs.

Currently we don't rate-limit reminders explicitly, but given the perf bottlenecks we have in the current implementation (for example the I/O cost of executing a reminder), there's a "natural" limit there. This new design should have a much lower cost, and at this stage the risk of DoS-ing an app may be real.
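For illustration only - not part of the proposal - a minimal Go sketch of one way such a limit could look on the sidecar side, bounding the number of in-flight TriggerJob callbacks with a counting semaphore; the triggerLimiter type and the deliver callback are hypothetical names.

package main

import (
	"context"
	"fmt"
	"time"
)

// triggerLimiter bounds how many job triggers are delivered to the app at once.
type triggerLimiter struct {
	sem chan struct{} // counting semaphore; capacity = max concurrent triggers
}

func newTriggerLimiter(maxInFlight int) *triggerLimiter {
	return &triggerLimiter{sem: make(chan struct{}, maxInFlight)}
}

// Do delivers one trigger, blocking while the maximum number of triggers is
// already in flight (or until the context is cancelled).
func (l *triggerLimiter) Do(ctx context.Context, jobName string, deliver func(string) error) error {
	select {
	case l.sem <- struct{}{}: // acquire a slot
	case <-ctx.Done():
		return ctx.Err()
	}
	defer func() { <-l.sem }() // release the slot
	return deliver(jobName)
}

func main() {
	limiter := newTriggerLimiter(100) // e.g. at most 100 concurrent callbacks to the app
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	_ = limiter.Do(ctx, "prd-db-backup", func(name string) error {
		fmt.Println("delivering trigger for job:", name)
		return nil
	})
}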


### Acceptance Criteria

* How will success be measured?
Contributor:

Please also add acceptance criteria to test what happens in disaster scenarios, when the cluster loses quorum (for example 2 out of 3 instances failing). We also need to test how it recovers from total cluster shutdowns - I know of more than a few users that shut down their development K8s clusters when not in use, for example.

cicoyle (Author):

There can be multiple disaster scenarios that are hard to predict, so I will add a backup & recovery scenario to be documented as part of the acceptance criteria.

Contributor:

There can be multiple failure scenarios, but the most critical one is failure of N/2+1 nodes or of every node (so with 3 nodes, failure of 2 or 3). This can be tested easily by shutting down some of the nodes.

@msfussell changed the title from "distributed scheduler building block and service proposal" to "Distributed scheduler building block and service proposal" on Nov 10, 2023
@olitomlinson:
I'm assuming the job name is essentially a unique ID within a given namespace?

Does the namespace map directly to the existing dapr namespace concept? Or is it something else?

For example, many building blocks that support multi-tenancy can have a scope of {namespace}.{appId} - State Store, PubSub and Dapr Workflows follow this pattern to ensure complete isolation.

So let's assume I have a single namespace (ns), and the namespace contains two Apps (App X and App Y).

Would the following command performed on App X collide with App Y?
http://localhost:<daprPort>/v1.0/job/schedule/prd-db-backup - Can both Apps have their own isolated job called prd-db-backup, or will both Apps be operating on the same job instance?

This may sound somewhat arbitrary, but it's worth having the discussion, because as it stands, Dapr Workflow IDs are scoped to the {namespace}.{appId} - so IMO it would make sense to be consistent by default, unless there is a reason not to.

This would also have the implication that the /job API operations available on the sidecar can only operate on jobs owned by that given App ID, and consequently are not allowed to operate on Jobs that may be owned by other Apps.

Therefore the proposal should try to be more specific around scope. I can see 3 potential non-mutually exclusive options to be considered:

  • {global} scope - can be operated on by any App in the cluster.
  • {namespace} scope - can be operated on by any App contained in a specific namespace.
  • {namespace}.{appId} - can only be operated on by a specific app contained in a specific namespace.

@olitomlinson:
I assume this proposal is more biased towards durability and horizontal scaling, over precision.

I.e. we can guarantee that your job will never be invoked before the schedule is due; however, we can't guarantee a ceiling time on when the job is invoked after the due time is reached.

If this is true, it's important that any literature is very clear about this, as people will naturally expect perfect precision if it's not explicitly stated.

@cicoyle (Contributor, Author) commented Nov 16, 2023

[Quoting @olitomlinson's question above about job-name scoping and {namespace}.{appId} isolation.]

All jobs will be namespaced to the app/sidecar namespace. We will not support global jobs, unless all the jobs explicitly fall under the same namespace. There is no broadcast callback to all apps in a namespace with this proposal; that might lean towards using the PubSub building block (as a future implementation detail).
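To make the scoping concrete, here is a hedged Go sketch of scheduling the daily backup job through the sidecar endpoint mentioned earlier in this thread (/v1.0/job/schedule/<name>); port 3500 is only the conventional daprd HTTP port, and the payload mirrors the db-backup example discussed later in the thread. Because jobs are scoped to the calling app's namespace/app ID, App X and App Y issuing this same request would each get their own isolated prd-db-backup job.

package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Schedule payload based on the proposal's db-backup example.
	body := []byte(`{
	  "schedule": "@daily",
	  "data": {
	    "task": "db-backup",
	    "metadata": {
	      "db_name": "my-prod-db",
	      "backup_location": "/backup-dir"
	    }
	  }
	}`)

	// The sidecar scopes this job to the calling app's namespace and app ID,
	// so the same job name used by another app refers to a different job.
	resp, err := http.Post(
		"http://localhost:3500/v1.0/job/schedule/prd-db-backup",
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("schedule response:", resp.Status)
}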

Signed-off-by: Cassandra Coyle <[email protected]>
@cicoyle (Contributor, Author) commented Nov 16, 2023

[Quoting @olitomlinson's comment above about favouring durability and horizontal scaling over precision.]

You should see this explicitly in the proposal now - specifically in the Goals section. Thanks 👍🏻

cicoyle and others added 2 commits November 19, 2023 20:31
Co-authored-by: Josh van Leeuwen <[email protected]>
Signed-off-by: Cassie Coyle <[email protected]>
@olitomlinson:
Does the implementation of this proposal depend upon this proposal #50 at all, or are they separate endeavours?

@ItalyPaleAle (Contributor):
Does the implementation of this proposal depend upon this proposal #50 at all, or are they separate endeavours?

The implementation does have a dependency on #50

@olitomlinson commented Jan 2, 2024

I've been re-reading this proposal and I have a few comments.

  1. There don't appear to be any examples of how one may author a job task, i.e. the user's code that actually gets invoked when the job executes. See below, taken from the proposal: a schedule is created which invokes a db-backup task, but how is this task expressed?
{
  "schedule": "@daily",
  "data": {
    "task": "db-backup",
    "metadata": {
      "db_name": "my-prod-db",
      "backup_location": "/backup-dir"
    }
  }
}
  2. It's my understanding that a motivation of this proposal is to also be able to perform scheduled Service Invocation, scheduled PubSub, and scheduled State Store operations, yet I don't see any examples of how this is achieved via an SDK.

    (Side bar: how would scheduled service invocation work? Who would receive the result? This is exactly where you need Dapr Workflows as an orchestration, which already has a perfect method of handling scheduled and time-implicated workloads.)

    Which brings me onto my 3rd point...

  3. Overlap with Workflows

    I see a significant overlap in point 2 with things that are already achievable via Dapr Workflows (albeit not natively possible in the SDK yet, it's just a few lines of code to achieve today).

    Going a step further... Dapr Workflows is the perfect language for defining schedules. This proposal uses the example of a daily backup routine, which is already achievable using a super simple Monitor/Eternal Orchestration pattern - Temporal.io and many others are leading the way here, particularly from a Developer Experience perspective.

    Anecdotally, lots of people use Azure Durable Functions for orchestrating backups & cloud infrastructure, so Dapr Workflows is in good company here for this kind of Ops stuff too.

    My 2c: I feel we should be advocating for Dapr adopters to learn how to express time-implicated workloads via Dapr Workflows, which can grow with complexity, rather than pointing adopters to a declarative job model which will struggle to grow beyond the basic repetition of a static schedule (like cron!).

  4. Prioritisation of effort

    At the time of writing, we have the challenge of a Dapr Workflow implementation that can't GA until the Actor Reminders problem is solved (... and a heck of a lot has already been sunk into Dapr Workflows across the runtime, docs and several SDKs over the past year!)

    Given this challenge, I would like to recommend that the effort related to this proposal is concentrated on doing the absolute bare minimum implementation to solve the Dapr Workflow / Actor Reminder problem first, rather than prioritising the delivery of the Distributed Scheduler Building Block and all the concerns that bringing a new Building Block online actually brings (SDK, docs, testing, etc.).

    I think it's fair to say that the community deserves investment in existing Building Blocks that are trying to reach GA/Stable, such as Workflows, Crypto, and Distributed Lock, before we bring another Building Block to life.

    Just to be clear, I'm not against the Distributed Scheduler Building Block API surface; however, the overlap with what is easily achievable with Dapr Workflows leaves me wondering if it's needed right now...

@yaron2 (Member) commented Jan 3, 2024

It's my understanding that a motivation of this proposal is to be able to also perform scheduled Service Invocation and scheduled PubSub, and scheduled State Store operations yet I don't see any examples of how this is achieved via an SDK?

This is out of scope for this proposal. The meaning behind the mention of other building blocks is that users can be notified on a given schedule and then perform things like pub/sub, service invocation and others from within their code. In the future we might be able to integrate scheduling into other building blocks, but for now this is completely out of scope.

I see a significant overlap in point 2, with things that are already achievable via Dapr Workflows. (albeit, not natively possible in the SDK (yet!), its just a few lines of code to achieve today)

The purpose of this API is both to satisfy the requirements of actor reminders and Dapr Workflows and to provide users with a simple cron-like API that doesn't require them to fully embrace a completely new programming model. Since cron jobs and scheduled reminders are a very common primitive, it would be useful for users to be able to schedule these easily without the additional overhead of a full-fledged programming model. Temporal, for example, introduced a distributed scheduler primitive in addition to their orchestration patterns.

Given this challenge, I would like to recommend that the effort related to this proposal is concentrated on doing the absolute bare minimum implementation to solve the Dapr Workflow / Actor Reminder problem

Proposals do not deal with timelines, milestones, or priorities. Specifically, a proposal is a yes/no decision on whether a design can at some point be implemented in Dapr. The actual priorities for each milestone, when it comes to which parts of a proposal to implement, come down to the feature definition and triage stage that maintainers undergo at the beginning of each milestone.

@olitomlinson:
The meaning behind the mention of other building blocks is that users can be notified on a given schedule and then perform things like pub/sub, service invocation and others from within their code

Fair. Although I still don't see any mention of how user code is activated? I assume the schedule needs to be given a path (similar to how pubsub calls a specific HTTP endpoint for a subscription).

Temporal for example introduced a distributed scheduler primitive in addition to their orchestration patterns

They did!

And interestingly it would appear that they chose to dogfood their own workflows programming model to implement the higher level 'schedule' concept.

With this proposal, it appears we are doing the opposite?

https://temporal.io/blog/how-we-build-it-building-scheduled-workflows-in-temporal

@olitomlinson:
Fair. Although I still don't see any mention of how user-code is activated? I assume the schedule needs to be given a path (similar to how pubsub calls a specific HTTP endpoint for a subscription)

I still would like some clarity on how this all manifests
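One possible shape for that activation - purely an assumption, since the thread leaves it open - is a pub/sub-style HTTP callback from daprd to the app. In the hypothetical Go sketch below, the /job/prd-db-backup route and the payload fields are illustrative, not defined by the proposal.

package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// jobTrigger mirrors the "data" object from the schedule example in this
// thread; the field names are assumptions for illustration.
type jobTrigger struct {
	Task     string            `json:"task"`
	Metadata map[string]string `json:"metadata"`
}

func main() {
	// Hypothetical app endpoint that the sidecar would call at trigger time.
	http.HandleFunc("/job/prd-db-backup", func(w http.ResponseWriter, r *http.Request) {
		var t jobTrigger
		if err := json.NewDecoder(r.Body).Decode(&t); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// User code runs here, e.g. starting the backup described in t.Metadata.
		log.Printf("job triggered: task=%s metadata=%v", t.Task, t.Metadata)
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":6002", nil))
}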

Signed-off-by: Cassandra Coyle <[email protected]>
@cicoyle (Contributor, Author) commented May 28, 2024

Note: I left the List jobs endpoint in the proposal; however, for 1.14 we will not implement this API due to the additional work required for pagination.

Signed-off-by: Cassandra Coyle <[email protected]>
@cicoyle (Contributor, Author) commented Jun 4, 2024

Following are the performance numbers and associated improvements observed in the PoC, tracked in the PR:

  • With HA mode (3 scheduler instances), we are able to schedule 50,000 actor reminders with an average trigger qps of 4,582. This shows at least a 10x improvement while keeping roughly the same qps. Invoking the Scheduler Job API directly, we observed a qps of ~35,000 for triggering.
  • When creating actor reminders with Dapr 1.13, we saw a qps of 50. With Scheduler, we see a qps of 4,000. This shows a drastic improvement (80x) in the scheduling of actor reminders with Scheduler. Overall, Scheduler is at most limited by the etcd storage of registered reminders, not by reminder throughput.

@olitomlinson:
Great work Cassie and team!

This all seems very positive and I can't wait to try it out within the context of Workflows - particularly running workflow apps with more than just 1-2 instances, as this is a critical milestone/achievement towards GA-ing actors.

I'm actually on vacay for the next 2 weeks, but I have my Mac with me, so if someone can publish a container image and some instructions on how to integrate it, I will happily run it within my docker compose test harness that I've been using for load testing Workflows - DM me on Discord to discuss further :)

@mikeee (Member) left a comment

Amazing work! Loving the preliminary numbers.


* Scheduler Building Block API code
* Scheduler Service code
* Tests added (e2e, unit)
Member:

Will this be validated in the weekly/release longhauls as well?

Member:

Adding new building blocks in alpha does not require longhaul tests; that hasn't been a requirement even for reaching stable before. I suggest we modify the LH tests to use Scheduler (with reminders and jobs) as part of the stable criteria, not in alpha.

@mikeee mentioned this pull request Jun 4, 2024
@ItalyPaleAle (Contributor):

@cicoyle what are the perf numbers for executing reminders? The scheduling was always an issue in the current implementation, but more interesting would be measuring how they are executed.

@artursouza (Member):
[Quoting the performance numbers above.]

@cicoyle what are the perf numbers for executing reminders? The scheduling was always an issue in the current implementation, but more interestingly would be measuring how they are executed

That is a great question. If by "executing reminders" you mean the reminders being invoked by the sidecar, then those numbers are present and referred to as "triggers" in the comment from @cicoyle. If you are referring to something else by "executing reminders", please clarify.

Thanks.

@ItalyPaleAle (Contributor):

I didn't understand those were the "triggers", thanks. In the past we discussed the workflow perf test too, since that's heavily reliant on reminders.

@cicoyle (Contributor, Author) commented Jun 17, 2024

When using the new scheduler service, we can confirm that we achieved the ability to increase throughput as the scale grows, in contrast to the existing reminder system that shows a performance degradation as scale increases. As expected, a decrease in some performance metrics is observed with the smallest parallelism and scale (details below), which then increases as parallelism and concurrency grow.

While testing parallel workflows with a max concurrent workflow count of 110 and 440 total workflow runs, we saw performance improvements to the tune of 275% in the rate of iterations. Furthermore, Scheduler achieves a 267% improvement in both the data sent and received, showing that Scheduler can handle a higher rate of data while significantly decreasing the maximum total time taken to complete the request by 62%. These numbers demonstrate the horizontal scalability of workflows with Scheduler used as the underlying reminder system.

While testing parallel workflows with a max concurrent count of between 60 & 90, we saw performance improvements of about 71%, where the existing reminder system drops by 44% in comparison, being unable to correct the performance drop as the scale of workflows continues to grow.

At 350 max concurrent workflows and 1400 iterations, we see a performance improvement of 50% higher than the existing reminder system. At the smaller scale of 30 max concurrent workflows and 300 total workflow count we observed a 27% decrease in the rate of iterations and 25% in data sent/received, with a 56% improvement in latency (being the maximum total time taken to complete the request).

As part of the PoC, we ran several performance tests with 100k and 10k workflows to examine other bottlenecks in the Workflow system, unrelated to the scheduling/reminder sub-systems. Several areas were identified as potential bottlenecks for Workflows as a whole and these should be an area of focus in coming iterations.

@yaron2 (Member) commented Jun 17, 2024

+1 binding

Signed-off-by: Artur Souza <[email protected]>
@daixiang0 (Member):

I think we can continue to optimize it for performance.

+1 binding

@mikeee (Member) commented Jun 19, 2024

+1 non-binding

@JoshVanL (Contributor):

Recusing myself from voting.

@artursouza (Member):

Recusing myself from voting.

@yaron2 (Member) left a comment

Closing this vote as approved. Thanks to the author, all reviewers, and maintainers for participating.

@yaron2 merged commit bfcbc5b into dapr:main on Jun 19, 2024
1 check passed