-
Notifications
You must be signed in to change notification settings - Fork 275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add target-based scaling support for Azure Storage #2452
Conversation
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskTriggerMetrics.cs
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskTargetScaler.cs
Outdated
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/DurableTaskJobHostConfigurationExtensions.cs
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/AzureStorageDurabilityProvider.cs
Outdated
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskTargetScaler.cs
Outdated
Show resolved
Hide resolved
…der.cs Co-authored-by: Varshitha Bachu <[email protected]>
Question on this point:
I'm worried that updating our documentation won't be enough. Although Functions V3 has been deprecated, I worry that some users may still try to use it for one reason or another. Can we consider making some kind of proactive announcement that this new version of the DF extension no longer supports Functions V3, and perhaps include the expected error message for improved SEO when customers inevitably run into this? |
Also, @davidmrdavid, please fix the merge conflicts so that we can be confident about the changes we're reviewing in WebJobs.Extensions.DurableTask.csproj. |
@cgillum: thanks for the heads up on merge conflicts, I missed those. They've been addressed.
Yep, I agree with this. I'll obtain the exact error message tomorrow-ish and we can look to include that in some proactive announcement, as well as in this PR and maybe even our release notes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some questions/comments.
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskMetricsProvider.cs
Outdated
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskScaleMonitor.cs
Outdated
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskMetricsProvider.cs
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskTriggerMetrics.cs
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/AzureStorageDurabilityProvider.cs
Outdated
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/AzureStorageDurabilityProvider.cs
Outdated
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskTargetScaler.cs
Outdated
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/Listener/DurableTaskTargetScaler.cs
Outdated
Show resolved
Hide resolved
{ | ||
scaleControllerLog = "Tried to request a negative worker count." + scaleControllerLog; | ||
this.logger.LogError(scaleControllerLog); | ||
throw new Exception(scaleControllerLog); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is throwing an exception the behavior we want? Or should we fail more gracefully?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't immediately think of a more graceful solution here. My hunch is that if the scaling logic is requesting a negative number, then something is terribly wrong and we should fail exceptionally so the ScaleController can take over and decide how to deal with it. Do you have any suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you're saying this code is executed by the scale controller and not in the Functions host? Are there some doc comments we can add to make this clear? I think that also helps clarify my question about the direct use of ILogger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that ScaleController V3 calls this directly, although @bachuv just pointed me to a comment that suggests that this code may also be accessed in the runtime driven scale case as well, which would go through the Host. So perhaps there's two cases: one mediated by the Functions Host, and one where it's accessed directly by the Scale Controller.
I'll look to get a conclusive answer on this. For now, I've pushed a series of comments reflecting my current understanding. I won't merge until we have confirmation from the ScaleController team.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spoke with Alexey today to get clarity on this.
The caller of this code depends on whether the user is running with or without a VNET.
If running without a VNET, the ScaleController process calls this directly.
If running with a VNET, then the customer needs to enable runtime-driven scaling and in that case it is the Functions Host which calls this code for us.
I think this may complicate things a little bit, especially around throwing exceptions. We should confirm what's the preferred error handling behavior in the VNET case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alrod: if runtime-driven scaling is enabled, and the DF Extension encounters an error in calculating it's target worker count (for a example, if it wrongly determines that it needs a negative target worker count), is safe to throw an exception for the Host to catch or should we handle the failure more gracefully?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the code even make since to exists?
int numWorkersToRequest = (int)Math.Max(activityWorkers, orchestratorWorkers);
Is possible activityWorkers and orchestratorWorkers are < 0 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes look good to me. My only ask would be to consider adding more comments to make clear the cases where code is run in the Scale Controller host instead of running in the Functions host since that has big implications in terms of what context we're allowed to assume.
@@ -10,6 +10,7 @@ | |||
using DurableTask.AzureStorage.Tracking; | |||
using DurableTask.Core; | |||
using Microsoft.Extensions.Logging; | |||
using Microsoft.WindowsAzure.Storage; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the target-based scaling support still based on the old deprecated AS backend?
This PR
This PR adds target-based scaling support for the AzureStorage backend.
Changes
The requirements were relatively simple: we needed to update our WebJobs SDK dependency and then implement and expose some new interfaces:
ITargetScaler
andITargetScalerProvider
. This PR provides such an implementation for the Azure Storage backend.Backwards compatibility
These new interfaces are not available in older versions of WebJobs, meaning that we need to use the C# preprocessor to prevent older versions of the Functions runtime from trying to load those types. This has worked for Functions V1 and V2. However, our bundles-mandated TFMs cannot differentiate between Functions V3 and V4.
Once we release this change, the Durable Extension will no longer be compatible with Functions V3. Therefore, prior to merge, we should update our documentation to warn users of this change.
Merge pre-requisites
In order to prevent apps w/ extension bundles from auto-upgrading to an incompatible version of the DF Extension, there will be a patch release to Functions V3 to stop automatic bundle upgrades. We need to wait for that patch release to complete.
Testing
This PR adds unit tests to validate that we request the expected number of workers. We use the same inputs as in our internal design document.
We have also done local manual testing validating that the Host recognizes these changes and considers them when issuing scale votes.
Equivalent PRs for other backends
References
Work done in collaboration with @bachuv.
Pull request checklist
pending_docs.md
release_notes.md
/src/Worker.Extensions.DurableTask/AssemblyInfo.cs