
Investigate possible scaling issues when there are lots of outbound HTTP calls #158

Open
anthonychu opened this issue Feb 10, 2020 · 9 comments
Labels
bug Something isn't working P2 Priority 2 item

Comments

@anthonychu
Member

Now that #152 is completed, we should rerun some scale tests to see if things have improved.

Pasting in some findings from tests I ran in late December:


I did a bit more testing on durable scaling using @brandonh-msft's sample app. I ran into a few issues (some old, some new), but I think I have a better idea of what's happening now. Sharing some initial findings:

TL;DR: I managed to get it working by making these changes:

  • Changed from request to Axios as the HTTP client
  • Limited maxSockets on the HTTPS agent used by Axios
  • Rate-limited the number of outgoing requests per instance using Bottleneck

The code for this is here: https://github.com/anthonychu/durablefunctions-javascript-scaletest/tree/20191229-refactor/javascript-axios

The main issue appears to be port exhaustion. There are two main types of errors: EADDRINUSE and ETIMEDOUT. Limiting maxSockets on the HTTPS agent seems to have eliminated the EADDRINUSE errors.
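For illustration, the Axios setup described above looks roughly like this; a minimal sketch, with an arbitrary maxSockets value rather than the one used in the test app:

```js
const https = require("https");
const axios = require("axios");

// Reuse connections and cap concurrent sockets so the instance doesn't open a
// new ephemeral port for every outbound request (the source of EADDRINUSE).
const httpsAgent = new https.Agent({ keepAlive: true, maxSockets: 50 });
const client = axios.create({ httpsAgent });

async function callApi(url) {
    const response = await client.get(url);
    return response.data;
}
```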

However, there are still occasional ETIMEDOUT errors. They appear with the IP address of the API that the activity function is calling. I don't think the remote API is down or timing out; I think it's a problem with the app itself. (Also see the ETIMEDOUT error below, which occurs when the client calls the host.)

Using Bottleneck to limit outbound API calls to 4 per second appears to eliminate these ETIMEDOUT errors. I haven't tried increasing that number.
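A minimal sketch of the Bottleneck setup, assuming 4 requests per second is expressed as a 250 ms minimum spacing between job starts:

```js
const Bottleneck = require("bottleneck");
const axios = require("axios");

// minTime: 250 => at most one job starts every 250 ms, i.e. ~4 outbound calls
// per second per instance.
const limiter = new Bottleneck({ minTime: 250 });

function rateLimitedGet(url) {
    // Funnel every outbound request through the limiter instead of calling Axios directly.
    return limiter.schedule(() => axios.get(url));
}
```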

I've also tried a couple of other versions of the code that didn't work:

  • Modified Brandon's original app to use Bottleneck to rate-limit outbound calls (code)
  • Created a version that uses callHttp() instead of an activity function (code; sketched after the error below)

Both of these encountered a different ETIMEDOUT error. The calls between the JS worker and the host via the frontend were failing when trying to invoke DurableOrchestrationClient.startNew().

{
    "message": "connect ETIMEDOUT 40.112.243.4:443",
    "name": "Error",
    "stack": "Error: connect ETIMEDOUT 40.112.243.4:443\n at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1106:14)",
    "config": {
        "url": "https://test-javascript-scaletest-request-20191229.azurewebsites.net/runtime/webhooks/durabletask/orchestrators/StartWorkflow?code=***********************************************",
        "method": "post",
        "data": "",
        "headers": {
            "Accept": "application/json, text/plain, */*",
            "Content-Type": "application/json",
            "User-Agent": "axios/0.19.0"
        },
        "transformRequest": [
            null
        ],
        "transformResponse": [
            null
        ],
        "timeout": 0,
        "xsrfCookieName": "XSRF-TOKEN",
        "xsrfHeaderName": "X-XSRF-TOKEN",
        "maxContentLength": null
    },
    "code": "ETIMEDOUT"
}
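For reference, the callHttp() variant mentioned in the second bullet above moves the HTTP call into the orchestrator, where the Durable extension issues it host-side instead of the activity making the call from the Node worker. A rough sketch, assuming the positional callHttp signature of the 1.x JS SDK and a placeholder URL:

```js
const df = require("durable-functions");

module.exports = df.orchestrator(function* (context) {
    // The Durable extension performs this request on the host; the orchestrator
    // just replays the recorded result.
    const response = yield context.df.callHttp("GET", "https://example.com/api/work");
    return response.content;
});
```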

I'm not sure why this error didn't happen in the Axios implementation. I wonder if the maxSockets limit and other Axios settings were also affecting the Axios instance used internally by the Durable Functions JS SDK.
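One way that could happen is if the socket limit was applied to Node's global HTTPS agent rather than to a per-client agent: anything in the process that doesn't supply its own agent, including the SDK's internal HTTP client, would then share the cap. A minimal illustration, with an arbitrary value:

```js
const https = require("https");

// Capping the global agent throttles every HTTPS request in the process that
// doesn't pass its own agent, not just the app's explicit Axios calls.
https.globalAgent.maxSockets = 50;
```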

@brandonh-msft
Member

Internal Tracking CSEF 163249

@gunzip

gunzip commented Dec 26, 2020

I don't know if it's related, but with durable-functions 1.4.1 we get a lot of errors during traffic peaks (while scaling out), e.g.:

Exception while executing function: Functions.HandleNHNotificationCall Result: Failure
Exception: Error: connect EADDRINUSE 127.0.0.1:50708
Stack: Error: connect EADDRINUSE 127.0.0.1:50708
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1113:14) 

It's somehow related to the orchestration client's startNew() call here: https://github.com/pagopa/io-functions-app/blob/master/HandleNHNotificationCall/index.ts#L29
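For context, an HTTP-triggered starter like the one linked typically looks something like the sketch below (generic names, not the actual io-functions-app code). In the 1.x JS SDK, startNew is itself an HTTP call from the Node worker to the host's local Durable webhook endpoint, which is where the EADDRINUSE on 127.0.0.1 surfaces:

```js
const df = require("durable-functions");

// Assumes an orchestrationClient (durable client) input binding in function.json.
module.exports = async function (context, req) {
    const client = df.getClient(context);

    // startNew connects to the host's local webhook endpoint to enqueue the
    // orchestration; under heavy scale-out this connection can fail.
    const instanceId = await client.startNew("MyOrchestrator", undefined, req.body);

    return client.createCheckStatusResponse(context.bindingData.req, instanceId);
};
```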

@davidmrdavid
Collaborator

Hi @gunzip!
Can you clarify what you mean by "it's somehow related to startNew()"? I'm trying to understand whether you mean that you're getting exceptions immediately when calling startNew, or something else. Any further context would help, because startNew is an expected part of almost any Durable Functions app.

Also, just to make sure I understand, is this behavior present only in durable-functions 1.4.1 and not in a previous version? Could you provide an example? Thanks!

@gunzip

gunzip commented Dec 26, 2020

Hi @davidmrdavid, I meant that it must be related to startNew() since it's the only method the function calls. I guess it's an error that the durable orchestration client gets when trying to connect to the orchestrator endpoint on localhost.

I don't know if it's related only to 1.4.1, since that's the latest version we're currently using. This is the typical error pattern we get during traffic spikes:

[screenshot of the error pattern]

And the related error trace:

[screenshots of the error trace]

@san-goyal

@gunzip did you find any solution? I'm also facing exactly the same problem: the durable starter is unable to start the orchestrator function. Restarting the function app resolves the issue for a couple of days, but the root cause is unknown.

@davidmrdavid
Collaborator

Hi @san-goyal, could you please open a new issue describing your situation? Feel free to tag me on it and I'll investigate. Thanks!

@zkbule

zkbule commented Aug 6, 2021

Bundle Id: Microsoft.Azure.Functions.ExtensionBundle
Bundle version: 2.6.1
ExtensionName: DurableTask
ExtensionVersion: 2.0.0.0

@davidmrdavid I've seen a similar error at high-traffic hours as well, when I use the durable client's getStatusAll from my HTTP-triggered function. Here's the full stack:

Full exception:
Exception while executing function: Functions.QueryStatus
Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcException : Result: Failure
Exception: Error: connect ETIMEDOUT 40.112.191.159:443
Stack: Error: connect ETIMEDOUT 40.112.191.159:443
at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1107:14)
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at async Microsoft.Azure.WebJobs.Script.Description.WorkerFunctionInvoker.InvokeCore(Object[] parameters,FunctionInvocationContext context) at D:\a\1\s\src\WebJobs.Script\Description\Workers\WorkerFunctionInvoker.cs : 93
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at async Microsoft.Azure.WebJobs.Script.Description.FunctionInvokerBase.Invoke(Object[] parameters) at D:\a\1\s\src\WebJobs.Script\Description\FunctionInvokerBase.cs : 82
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at async Microsoft.Azure.WebJobs.Script.Description.FunctionGenerator.Coerce[T](Task`1 src) at D:\a\1\s\src\WebJobs.Script\Description\FunctionGenerator.cs : 225
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at async Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.InvokeAsync[TReflected,TReturnValue](Object instance,Object[] arguments) at C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionInvoker.cs : 52
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at async Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.InvokeWithTimeoutAsync(IFunctionInvoker invoker,ParameterHelper parameterHelper,CancellationTokenSource timeoutTokenSource,CancellationTokenSource functionCancellationTokenSource,Boolean throwOnTimeout,TimeSpan timerInterval,IFunctionInstance instance) at C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs : 555
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at async Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstanceEx instance,ParameterHelper parameterHelper,ILogger logger,CancellationTokenSource functionCancellationTokenSource) at C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs : 501
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at async Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance,FunctionStartedMessage message,FunctionInstanceLogEntry instanceLogEntry,ParameterHelper parameterHelper,ILogger logger,CancellationToken cancellationToken) at C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs : 279
End of inner exception
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at async Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance,FunctionStartedMessage message,FunctionInstanceLogEntry instanceLogEntry,ParameterHelper parameterHelper,ILogger logger,CancellationToken cancellationToken) at C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs : 326
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at async Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.TryExecuteAsync(IFunctionInstance functionInstance,CancellationToken cancellationToken) at C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs : 94
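For reference, a status-query function like the one described would look roughly like the sketch below (generic shape, not the actual QueryStatus implementation, and assuming the default HTTP output binding named "res"). In the 1.x JS SDK, getStatusAll is served through the Durable HTTP API, which is consistent with the ETIMEDOUT above pointing at an HTTPS endpoint:

```js
const df = require("durable-functions");

// Assumes an orchestrationClient (durable client) input binding on the HTTP trigger.
module.exports = async function (context, req) {
    const client = df.getClient(context);

    // Queries the status of every orchestration instance in the task hub via the
    // host's HTTP API; this can be an expensive call under high traffic.
    const statuses = await client.getStatusAll();

    context.res = {
        status: 200,
        body: statuses
    };
};
```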

@davidmrdavid
Collaborator

Hi @zkbule, please open a new issue with a description of these errors and, if possible, a minimal repro, and I can look into it. In the new issue, please link back to this one so we can keep the related issues associated with one another. Thanks!

@zkbule

zkbule commented Oct 28, 2021

> Hi @davidmrdavid, I meant that it must be related to startNew() since it's the only method the function calls. I guess it's an error that the durable orchestration client gets when trying to connect to the orchestrator endpoint on localhost.
>
> I don't know if it's related only to 1.4.1, since that's the latest version we're currently using. This is the typical error pattern we get during traffic spikes:
>
> [screenshot of the error pattern]
>
> And the related error trace:
>
> [screenshots of the error trace]

@gunzip I've got exactly the same error. Did your issue get resolved?

@lilyjma lilyjma added bug Something isn't working P2 Priority 2 item labels Sep 21, 2023