Change task scheduling to give more even flow of orchestrations #220

mol-pensiondk · 2024-05-17T12:54:47Z

When a large number of orchestrations are scheduled, the current in-task-creation-order scheduling of new tasks tends to stall orchestrations, giving an uneven effective throughput.

Consider a simple orchestration of two tasks, T1 followed by T2. The task T1 has an average running time of 3 seconds while the task T2 runs for one second. Now a great number of requests arrive at the same time, say 10000, and these result in 10000 orchestrations being created within a few seconds. The hub workers are running 10 orchestrations and 10 tasks in parallel. As starting an orchestration and scheduling the first task (T1) takes very little time compared to the running time of T1, hundreds of T1s will have been scheduled before the first T1s complete and the first T2s are scheduled. When all 10000 orchestrations have been started (let's say at 100 per second), only a few of the T1s (say 300) will have finished and thus correspondingly few T2s will have been scheduled. From this point on, mainly T1s are executed until, when all the 10000 T1s are done, the remaining 9700 T2s can be started. The whole thing takes about 4000 seconds to execute (10000 × (3 + 1) seconds of work spread over 10 parallel tasks, plus a little overhead), but due to the in-task-creation-order scheduling only 3% of orchestrations have completed in the first 75% of the running time and the remaining 97% complete in the last 1000 seconds. The average waiting time for an orchestration to finish is about 3500 seconds, rather than the roughly 2000 seconds a steady flow would give.

The scenario described above is a simplified version of what happens with our DT-driven process for sending out (email) communications, in which each orchestration has tasks for (1) fetching data to be merged, (2) creating the text, (3) storing the communication in our archive and finally (4) sending the mail. When large batches of communications are sent out, the scheduling means that these tasks become "stratified", putting first a strain on the systems delivering data (while document rendering, the archive server and the mail server are idle), then a strain on document rendering, then on the archive and finally on the mail system. The result is that not only do we have to wait a long time for the first mails to be sent out, but the bursts in load on the individual components in the chain also mean that everything takes longer than if the flow through the orchestrations had been steady instead of stratified.

We would like to have the scheduling of tasks changed to be oldest-orchestration-first rather than oldest-task-first. This would mean that orchestrations would tend to finish "in order", yielding a steady flow through orchestrations and no stratification. It would also make the execution more "fair" in that newly arriving orchestrations would not delay already running ones (as is currently the case).
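To make the difference concrete, the two-task example above can be replayed in a small simulation. This is only a sketch under simplifying assumptions: fixed task durations, no scheduling overhead, one orchestration created every 10 ms, and a single pool of 10 task slots. The function `simulate` and its parameters are made up for illustration and are not part of the DT code base.

```python
import heapq

T1_MS, T2_MS = 3000, 1000  # task durations from the example (3 s and 1 s)

def simulate(n_orch, policy, workers=10, arrival_gap_ms=10):
    """Return the completion time (ms) of each orchestration.

    policy is "oldest-task-first" (queue ordered by task creation time)
    or "oldest-orchestration-first" (ordered by orchestration creation time).
    """
    durations = (T1_MS, T2_MS)
    # Events are (time, kind, orch, stage); kind 0 = task finished,
    # kind 1 = orchestration created.
    events = [(i * arrival_gap_ms, 1, i, 0) for i in range(n_orch)]
    heapq.heapify(events)
    ready = []  # heap of (priority key, tie breaker, orch, stage)
    seq = 0
    free = workers
    done = [0] * n_orch

    def enqueue(now, orch, stage):
        nonlocal seq
        created = orch * arrival_gap_ms  # orchestration creation time
        key = now if policy == "oldest-task-first" else created
        heapq.heappush(ready, (key, seq, orch, stage))
        seq += 1

    while events:
        now = events[0][0]
        # Handle everything that happens at this instant before dispatching.
        while events and events[0][0] == now:
            _, kind, orch, stage = heapq.heappop(events)
            if kind == 1:
                enqueue(now, orch, 0)      # new orchestration schedules T1
            else:
                free += 1
                if stage == 0:
                    enqueue(now, orch, 1)  # T1 finished, schedule T2
                else:
                    done[orch] = now       # T2 finished, orchestration done
        # Hand ready tasks to free slots in priority order.
        while free and ready:
            _, _, orch, stage = heapq.heappop(ready)
            free -= 1
            heapq.heappush(events, (now + durations[stage], 0, orch, stage))
    return done

if __name__ == "__main__":
    n = 10_000
    for policy in ("oldest-task-first", "oldest-orchestration-first"):
        done = simulate(n, policy)
        total = max(done)
        early = sum(1 for t in done if t <= 0.75 * total) / n
        print(f"{policy}: total {total / 1000:.0f} s, "
              f"mean completion {sum(done) / n / 1000:.0f} s, "
              f"{early:.0%} done in first 75% of the run")
```

Both policies keep the slots busy, so the total running time is essentially the same; what changes is when individual orchestrations complete.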

This could be done in several different ways. Keeping the existing database structure, you could modify _LockNextTask to add a join with the Instances table and order on CreatedTime. However, it would probably be more performant to add the instance CreatedTime (or indeed a SequenceNumber added to Instances) to the NewTasks table and its primary key, to get the ordering without a join. This requires changing the procedure(s) that populate NewTasks as well as the table and index.
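The two options can be illustrated with a toy database. The sketch below uses SQLite from Python purely so that it is self-contained; the schema is a heavily simplified, hypothetical stand-in (the `InstanceCreatedTime` column, the index name and the helper functions are assumptions for illustration), and it does not reproduce the locking that _LockNextTask performs.

```python
import sqlite3

def build_db():
    """Create a toy, hypothetical version of the Instances/NewTasks tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE Instances (
            InstanceID  TEXT PRIMARY KEY,
            CreatedTime INTEGER NOT NULL
        );
        CREATE TABLE NewTasks (
            SequenceNumber      INTEGER PRIMARY KEY AUTOINCREMENT,
            InstanceID          TEXT NOT NULL REFERENCES Instances(InstanceID),
            InstanceCreatedTime INTEGER NOT NULL,  -- denormalized copy (second option)
            Name                TEXT NOT NULL
        );
        CREATE INDEX IX_NewTasks_InstanceOrder
            ON NewTasks (InstanceCreatedTime, SequenceNumber);
    """)
    # Instances A, B, C were created in that order.
    db.executemany("INSERT INTO Instances VALUES (?, ?)",
                   [("A", 1), ("B", 2), ("C", 3)])
    # A's T1 has already run, so its T2 was queued after the T1s of B and C.
    db.executemany(
        "INSERT INTO NewTasks (InstanceID, InstanceCreatedTime, Name) VALUES (?, ?, ?)",
        [("B", 2, "T1"), ("C", 3, "T1"), ("A", 1, "T2")])
    return db

# Current behaviour: oldest task first.
OLDEST_TASK = """
    SELECT InstanceID, Name FROM NewTasks
    ORDER BY SequenceNumber LIMIT 1"""

# First option: keep the tables, join with Instances and order on CreatedTime.
JOINED = """
    SELECT t.InstanceID, t.Name
    FROM NewTasks t JOIN Instances i ON i.InstanceID = t.InstanceID
    ORDER BY i.CreatedTime, t.SequenceNumber LIMIT 1"""

# Second option: order on the denormalized column, no join needed.
DENORMALIZED = """
    SELECT InstanceID, Name FROM NewTasks
    ORDER BY InstanceCreatedTime, SequenceNumber LIMIT 1"""

def next_task(db, query):
    """Return (InstanceID, Name) of the task the given ordering would pick."""
    return db.execute(query).fetchone()

if __name__ == "__main__":
    db = build_db()
    for label, query in [("oldest task first", OLDEST_TASK),
                         ("join on Instances", JOINED),
                         ("denormalized column", DENORMALIZED)]:
        print(label, "->", next_task(db, query))
```

With the oldest-task-first ordering the T1 of instance B is picked next, while both oldest-orchestration-first variants pick the T2 of instance A, the oldest orchestration still in flight.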

microsoft-github-policy-service bot added the Needs: Triage 🔍 label May 17, 2024

bachuv added enhancement New feature or request P3 Priority 3 and removed Needs: Triage 🔍 labels May 22, 2024
