Add target-based scaling support for Azure Storage #2452

Merged: 27 commits, Oct 6, 2023

davidmrdavid
Contributor

@davidmrdavid davidmrdavid commented Apr 21, 2023

This PR

This PR adds target-based scaling support for the AzureStorage backend.

Changes

The requirements were relatively simple: we needed to update our WebJobs SDK dependency and then implement and expose some new interfaces: ITargetScaler and ITargetScalerProvider. This PR provides such an implementation for the Azure Storage backend.

Backwards compatibility

These new interfaces are not available in older versions of WebJobs, meaning that we need to use the C# preprocessor to prevent older versions of the Functions runtime from trying to load those types. This has worked for Functions V1 and V2. However, our bundles-mandated TFMs cannot differentiate between Functions V3 and V4.
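
A minimal sketch of the kind of preprocessor guard described above. `TARGET_SCALING_SUPPORTED` is a hypothetical compilation symbol and `ITargetScalerStandIn` is a local stand-in for the WebJobs interface, so neither reflects the names this PR actually uses.

```csharp
// Sketch only. TARGET_SCALING_SUPPORTED is a hypothetical symbol that a build
// would define solely for targets whose WebJobs version ships the new interfaces.
#if TARGET_SCALING_SUPPORTED
// Local stand-in for the real WebJobs interface (assumption, for illustration).
public interface ITargetScalerStandIn
{
    int GetTargetWorkerCount();
}

// This type is only compiled when the symbol is defined, so older runtimes
// never see a type that references the missing interface.
public sealed class StorageTargetScalerSketch : ITargetScalerStandIn
{
    public int GetTargetWorkerCount() => 0;
}
#endif

public static class ScalerRegistration
{
    // Reports whether the guarded types were compiled into this build.
    public static bool IsTargetScalingCompiledIn()
    {
#if TARGET_SCALING_SUPPORTED
        return true;
#else
        return false;
#endif
    }
}
```

The guard only helps when the build can define a different symbol per runtime version, which is the limitation noted for Functions V3 versus V4.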

Once we release this change, the Durable Extension will no longer be compatible with Functions V3. Therefore, prior to merge, we should update our documentation to warn users of this change.

Merge pre-requisites

To prevent apps that use extension bundles from auto-upgrading to an incompatible version of the DF Extension, there will be a patch release to Functions V3 that stops automatic bundle upgrades. We need to wait for that patch release to complete.

Testing

This PR adds unit tests to validate that we request the expected number of workers. We use the same inputs as in our internal design document.

We have also done local manual testing validating that the Host recognizes these changes and considers them when issuing scale votes.

Equivalent PRs for other backends

References

Work done in collaboration with @bachuv.

Pull request checklist

  • My changes do not require documentation changes
    • Otherwise: Documentation PR is ready to merge and referenced in pending_docs.md
  • My changes should not be added to the release notes for the next release
    • Otherwise: I've added my notes to release_notes.md
  • My changes do not need to be backported to a previous version
    • Otherwise: Backport tracked by issue/PR #issue_or_pr
  • I have added all required tests (Unit tests, E2E tests)
  • My changes do not require any extra work to be leveraged by OutOfProc SDKs
    • Otherwise: That work is being tracked here: #issue_or_pr_in_each_sdk
  • My changes do not change the version of the WebJobs.Extensions.DurableTask package
    • Otherwise: major or minor version updates are reflected in /src/Worker.Extensions.DurableTask/AssemblyInfo.cs
  • My changes do not add EventIds to our EventSource logs
    • Otherwise: Ensure the EventIds are within the supported range in our existing Windows infrastructure. You may validate this with a deployed app's telemetry. You may also extend the range by completing a PR such as this one.

@davidmrdavid davidmrdavid changed the title from "[WIP] Target-based scaling support for default durability provider" to "Add target-based scaling support for Azure Storage" Apr 28, 2023
@davidmrdavid davidmrdavid marked this pull request as ready for review April 28, 2023 21:37
@cgillum
Member

cgillum commented Sep 20, 2023

Question on this point:

> Once we release this change, the Durable Extension will no longer be compatible with Functions V3. Therefore, prior to merge, we should update our documentation to warn users of this change.

I'm worried that updating our documentation won't be enough. Although Functions V3 has been deprecated, I worry that some users may still try to use it for one reason or another. Can we consider making some kind of proactive announcement that this new version of the DF extension no longer supports Functions V3, and perhaps include the expected error message for improved SEO when customers inevitably run into this?

@cgillum
Member

cgillum commented Sep 20, 2023

Also, @davidmrdavid, please fix the merge conflicts so that we can be confident about the changes we're reviewing in WebJobs.Extensions.DurableTask.csproj.

@davidmrdavid
Contributor Author

@cgillum: thanks for the heads up on merge conflicts, I missed those. They've been addressed.

> I'm worried that updating our documentation won't be enough. Although Functions V3 has been deprecated, I worry that some users may still try to use it for one reason or another. Can we consider making some kind of proactive announcement that this new version of the DF extension no longer supports Functions V3, and perhaps include the expected error message for improved SEO when customers inevitably run into this?

Yep, I agree with this. I'll obtain the exact error message tomorrow-ish and we can look to include that in some proactive announcement, as well as in this PR and maybe even our release notes.

Member

@cgillum cgillum left a comment

Added some questions/comments.

{
scaleControllerLog = "Tried to request a negative worker count." + scaleControllerLog;
this.logger.LogError(scaleControllerLog);
throw new Exception(scaleControllerLog);
Member

Is throwing an exception the behavior we want? Or should we fail more gracefully?

Contributor Author

I can't immediately think of a more graceful solution here. My hunch is that if the scaling logic is requesting a negative number, then something is terribly wrong and we should fail exceptionally so the ScaleController can take over and decide how to deal with it. Do you have any suggestions?

Member

So you're saying this code is executed by the scale controller and not in the Functions host? Are there some doc comments we can add to make this clear? I think that also helps clarify my question about the direct use of ILogger.

Contributor Author

My understanding is that ScaleController V3 calls this directly, although @bachuv just pointed me to a comment suggesting that this code may also be reached in the runtime-driven scale case, which would go through the Host. So perhaps there are two cases: one mediated by the Functions Host, and one where the Scale Controller accesses it directly.

I'll look to get a conclusive answer on this. For now, I've pushed a series of comments reflecting my current understanding. I won't merge until we have confirmation from the ScaleController team.

Contributor Author

@davidmrdavid davidmrdavid Sep 22, 2023

I spoke with Alexey today to get clarity on this.

The caller of this code depends on whether the user is running with or without a VNET.

  • If running without a VNET, the ScaleController process calls this directly.
  • If running with a VNET, the customer needs to enable runtime-driven scaling, and in that case the Functions Host calls this code for us.

I think this may complicate things a little, especially around throwing exceptions. We should confirm what the preferred error-handling behavior is in the VNET case.

Contributor Author

@alrod: if runtime-driven scaling is enabled and the DF Extension encounters an error while calculating its target worker count (for example, if it wrongly determines that it needs a negative target worker count), is it safe to throw an exception for the Host to catch, or should we handle the failure more gracefully?

Member

Does this code even need to exist?

int numWorkersToRequest = (int)Math.Max(activityWorkers, orchestratorWorkers);
Is it possible for activityWorkers and orchestratorWorkers to be < 0?
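
For reference, a minimal sketch of why the negative-count guard may be unreachable. It assumes (this thread does not confirm it) that each worker count is the ceiling of a non-negative pending-message count divided by a positive concurrency limit. The class, method, and parameter names are hypothetical; only the `Math.Max` line mirrors the snippet quoted above. The final clamp shows one possible graceful alternative to throwing.

```csharp
using System;

public static class WorkerCountSketch
{
    // All names here are hypothetical; the PR's real formula is not shown in this thread.
    public static int ComputeTargetWorkerCount(
        long pendingActivityMessages,
        long pendingOrchestratorMessages,
        int maxConcurrentActivities,
        int maxConcurrentOrchestrators)
    {
        // A non-positive concurrency limit is a configuration error, not a scaling result.
        if (maxConcurrentActivities <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(maxConcurrentActivities));
        }

        if (maxConcurrentOrchestrators <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(maxConcurrentOrchestrators));
        }

        // Ceiling of (non-negative / positive) is never negative.
        double activityWorkers =
            Math.Ceiling(pendingActivityMessages / (double)maxConcurrentActivities);
        double orchestratorWorkers =
            Math.Ceiling(pendingOrchestratorMessages / (double)maxConcurrentOrchestrators);

        int numWorkersToRequest = (int)Math.Max(activityWorkers, orchestratorWorkers);

        // Clamp instead of throwing, in case a caller passes a negative backlog.
        return Math.Max(0, numWorkersToRequest);
    }
}
```

Under these assumptions a negative result can only come from a negative input count, so the clamp and the exception are both defensive measures rather than expected paths.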

Member

@cgillum cgillum left a comment

These changes look good to me. My only ask would be to consider adding more comments to make clear the cases where code is run in the Scale Controller host instead of running in the Functions host since that has big implications in terms of what context we're allowed to assume.

@davidmrdavid davidmrdavid requested a review from alrod September 25, 2023 18:42
@davidmrdavid davidmrdavid merged commit 0220ee5 into dev Oct 6, 2023
@davidmrdavid davidmrdavid deleted the dajusto/tbs branch October 6, 2023 20:59
@@ -10,6 +10,7 @@
using DurableTask.AzureStorage.Tracking;
using DurableTask.Core;
using Microsoft.Extensions.Logging;
using Microsoft.WindowsAzure.Storage;
Collaborator

Is the target-based scaling support still based on the old deprecated AS backend?
