Replace Endpoints with Regional Endpoints #39390

tvaron3 · 2025-01-24T17:20:20Z

Design

Currently in the SDK, every read request is retried 3 times in the same region before the SDK marks that region unavailable and fails over to another region. This helps read requests but not write requests, which are not retried at all in the same region and cannot be retried in other regions for single-master accounts.

This feature improves write availability: the SDK now maintains a fallback endpoint alongside the current one. Both the current and the fallback endpoint point to the same write region, so the SDK can retry write requests on the fallback endpoint if the current endpoint is unavailable because of connectivity issues. Below are some of the implementation details and the testing that we have done so far.

The GetDatabaseAccount call, which runs during bootstrapping and every 5 minutes, will now return the following URIs for a region:

<account-name>-<region-name>.documents.azure.com

<account-name>.documents.azure.com

The service will randomly send variations of these endpoints in the GetDatabaseAccount call (which the SDK makes every 5 minutes).

Example: for a Cosmos DB account named testAccount and hosted in region West US, the service will now send the following endpoints in some round-robin fashion:

testAccount-westus.documents.azure.com

testAccount.documents.azure.com

This allows clients to point to two different VIPs for the Gateway service to improve availability in scenarios when one gateway VIP goes down.
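As an illustration of the naming scheme above, here is a minimal sketch that derives both hostnames from an account name and a region name. The helper names are hypothetical and are not the SDK's own functions; the sketch assumes the region slug is simply the region name with spaces removed and lowercased, which matches the West US example.

```python
def regional_endpoint(account_name: str, region_name: str) -> str:
    """Build the per-region hostname (hypothetical helper, not an SDK API)."""
    # Assumption: "West US" becomes "westus" (spaces dropped, lowercased).
    region_slug = region_name.replace(" ", "").lower()
    return f"{account_name}-{region_slug}.documents.azure.com"


def global_endpoint(account_name: str) -> str:
    """Build the account-wide (global) hostname."""
    return f"{account_name}.documents.azure.com"


print(regional_endpoint("testAccount", "West US"))
print(global_endpoint("testAccount"))
```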

The SDK tracks the two endpoints returned by the gateway as its current and previous endpoints.

SDK Updated Retry policy:

The SDK's default retry policy has 3 in-region retries in addition to the original request. The default connection timeout is 5 seconds and the default read timeout is 65 seconds.

ServiceRequestError:

This error happens when the client is trying to connect to the server but for some reason cannot connect. In this case, since the SDK knows that the request has not reached the service, the SDK will retry both read and write requests.

For write requests, the SDK will first retry 3 times on the current endpoint and then 3 times on the fallback endpoint. If the request still fails, the SDK marks the endpoints unavailable for write operations and retries in other regions if more write regions are available (the multi-master case); otherwise, the write request fails.

For reads, the SDK retries 3 times on the current endpoint, then marks the region unavailable for reads and retries in other regions.
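The ordering described above for connection failures can be sketched as a small planning function. This is only an illustration under stated assumptions: `connection_failure_plan` and `IN_REGION_RETRIES` are hypothetical names, and the real retry policies in the SDK are structured differently.

```python
from typing import List, Optional, Tuple

IN_REGION_RETRIES = 3  # per the description: 3 retries per endpoint


def connection_failure_plan(
    is_write: bool, current: str, fallback: Optional[str]
) -> List[Tuple[str, int]]:
    """Return (endpoint, retry_count) pairs to try before leaving the region."""
    plan = [(current, IN_REGION_RETRIES)]
    # Only writes use the fallback endpoint, and only if one is known.
    if is_write and fallback is not None:
        plan.append((fallback, IN_REGION_RETRIES))
    return plan


print(connection_failure_plan(True, "acct-westus", "acct"))
print(connection_failure_plan(False, "acct-westus", "acct"))
```

Once the plan is exhausted, the caller would mark the region unavailable for that operation type and move to the next region, if any.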

ServiceResponseError:

This error happens when the client has already connected to the server but for some reason received an error while reading the response. In this case, the SDK will only retry read operations, since it is not safe to retry writes: the SDK does not know whether the write operation succeeded.
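The difference between the two error types comes down to one decision, sketched below. `ErrorKind` and `should_retry` are hypothetical stand-ins used for illustration; they are not the azure-core exception classes or the SDK's policy classes.

```python
from enum import Enum


class ErrorKind(Enum):
    SERVICE_REQUEST = "ServiceRequestError"    # never connected to the service
    SERVICE_RESPONSE = "ServiceResponseError"  # connected, failed during the response


def should_retry(kind: ErrorKind, is_write: bool) -> bool:
    """Decide whether a failed operation is safe to retry."""
    if kind is ErrorKind.SERVICE_REQUEST:
        # The request never reached the service, so reads and writes are both safe.
        return True
    # The write may or may not have been applied, so only reads are retried.
    return not is_write


print(should_retry(ErrorKind.SERVICE_RESPONSE, is_write=True))
```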

Implementation

The location cache will now have a new RegionalEndpoint object that holds a current and a previous endpoint. The idea is that the previous endpoint can be used to retry in certain scenarios. There will now be a health check on every 5-minute global database account refresh. The health check reaches out to the different endpoints using a database account call because it is quick. We also limited database account calls to 3 seconds and 1 retry. If this health check fails, we set the endpoint as unavailable.
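A rough sketch of the health check described above, assuming the probe is injected as a callable so the example runs without network access. `check_endpoints` and the two constants are hypothetical names; in the SDK the probe would be the database account call.

```python
from typing import Callable, Dict, Iterable

HEALTH_CHECK_TIMEOUT_SECONDS = 3  # per the description
HEALTH_CHECK_RETRIES = 1          # per the description


def check_endpoints(
    endpoints: Iterable[str], probe: Callable[[str, float], None]
) -> Dict[str, bool]:
    """Probe each endpoint; an endpoint is unhealthy if every attempt raises."""
    health = {}
    for endpoint in endpoints:
        healthy = False
        for _ in range(1 + HEALTH_CHECK_RETRIES):
            try:
                probe(endpoint, HEALTH_CHECK_TIMEOUT_SECONDS)
                healthy = True
                break
            except Exception:
                # Any failure counts as an unsuccessful attempt in this sketch.
                continue
        health[endpoint] = healthy
    return health
```

Endpoints reported as unhealthy would then be marked unavailable in the location cache.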

Pseudocode of new current and previous logic

request in progress
on current:
    success:
        no op
    failure:
        use previous

on previous:
    success:
        temp = current
        current = previous
        previous = temp
    failure:
        refresh the cache:
            if current != new value:
                previous = current
            current = new value

on database account refresh - every 5 mins:
    initial:
        current = new value
        if default endpoint != new value:
            previous = default endpoint
    next:
        if current != new value:
            previous = current
        current = new value
    perform health check
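
The pseudocode above could be expressed in Python roughly as follows. This is a simplified sketch, not the class from `_location_cache.py`: the method names `from_first_refresh`, `swap`, and `update` are assumptions chosen for illustration.

```python
from typing import Optional


class RegionalEndpoint:
    """Tracks the current endpoint for a region plus a fallback (previous)."""

    def __init__(self, current: str, previous: Optional[str] = None):
        self.current = current
        self.previous = previous

    @classmethod
    def from_first_refresh(cls, default_endpoint: str, new_value: str) -> "RegionalEndpoint":
        # Initial refresh: keep the default endpoint as the fallback if it differs.
        previous = default_endpoint if default_endpoint != new_value else None
        return cls(new_value, previous)

    def swap(self) -> None:
        # A request failed on current but succeeded on previous: promote previous.
        if self.previous is not None:
            self.current, self.previous = self.previous, self.current

    def update(self, new_value: str) -> None:
        # Later refreshes: the old current becomes the fallback only if it changed.
        if self.current != new_value:
            self.previous = self.current
        self.current = new_value


ep = RegionalEndpoint.from_first_refresh("acct.documents.azure.com",
                                         "acct-westus.documents.azure.com")
ep.swap()  # the regional endpoint failed, the global endpoint worked
print(ep.current, ep.previous)
```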

Testing:

Testing by bringing the federations down in the staging environment:
Account Regions: East US 2, North Central US
SDK Preferred Locations: East US 2, North Central US

Note: Both read and write requests are retried 3 times by default in the current region. By default, the connection / request timeout in the Python SDK is 60 seconds.

Note: We have not yet tested with Envoy Proxy. We are setting it up locally in house to see how the new retry policies react to it. We would like to test with Envoy Proxy before shipping this hotfix to make sure this code change works reliably for customers using a proxy.

Bootstrapping Scenario:

Global endpoint down:

If the global endpoint is down, the initial topology call fails, and because of the default 60-second connection timeout and 3 in-region retries, it currently takes around 4 minutes before this topology call is retried on other endpoints. Customers can fix this by lowering the request timeout; it should be set somewhere between 5 and 10 seconds. Meanwhile, we are also updating the timeout of the getDatabaseAccount call to 5 seconds so that the SDK can recover more quickly.

If the global endpoint is down and the gateway returns the global endpoint for the write region, the SDK will first try the global endpoint. Once that fails, it will construct the regional endpoint (in the bootstrapping case) and try it as the fallback.
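The "around 4 minutes" figure follows from simple arithmetic: one original attempt plus 3 retries, each waiting for the full timeout. The sketch below uses a hypothetical helper and ignores any delay between retries.

```python
def worst_case_seconds(timeout_seconds: float, retries: int = 3) -> float:
    """Upper bound on time spent on one endpoint: original attempt plus retries."""
    return (1 + retries) * timeout_seconds


print(worst_case_seconds(60) / 60)  # minutes with the 60-second default
print(worst_case_seconds(5))        # seconds with a 5-second timeout
```

With a 60-second timeout this gives 240 seconds (4 minutes); with a 5-second timeout it drops to 20 seconds, which is why lowering the timeout speeds up recovery.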

Regional endpoint down:

In this case, the SDK will keep going to the global endpoint and keep refreshing the location cache every 5 minutes (as usual). If it receives the regional endpoint, it will mark it unavailable after 4 attempts (1 original request + 3 retries) and keep falling back to the global endpoint.

Both endpoints healthy:

In this case no issues were observed, and the SDK is able to load balance well.

Runtime Scenario:

Global Endpoint down:

For the GetDatabaseAccount call: because of the default 60-second request timeout and 3 retries, throughput drops to a very low value, eventually reaching almost 0 for a few minutes, because the topology call has connectivity issues with the global endpoint. This can be fixed by lowering the request timeout. (We are planning to reduce the timeout of this getDatabaseAccount call to 5 seconds in the current drop of the SDK.)

For write requests: they will retry on the fallback regional endpoint. If there is another write region in the preferred_regions, writes will then go to those regions after the current region is marked unavailable.

For read requests: they will not be retried on the fallback endpoint and will go to a different region after the current region is marked unavailable for further reads.

Regional Endpoint down:

Almost no effect on throughput for either read or write requests: the global endpoint is used for the topology call and is still working, so there is one less point of failure.

azure-sdk · 2025-01-24T17:41:55Z

API change check

APIView has identified API level changes in this PR and created following API reviews.

azure-cosmos

sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py

sdk/cosmos/azure-cosmos/azure/cosmos/_retry_utility.py

sdk/cosmos/azure-cosmos/azure/cosmos/_location_cache.py

azure-pipelines · 2025-01-25T00:02:11Z

Azure Pipelines successfully started running 1 pipeline(s).

FabianMeiswinkel

LGTM - Thanks!

tvaron3 · 2025-02-05T04:50:33Z

/azp run python - cosmos - tests

azure-pipelines · 2025-02-05T04:50:53Z

Azure Pipelines successfully started running 1 pipeline(s).

kushagraThapar · 2025-02-05T06:43:03Z

/check-enforcer override

eng/pipelines/templates/steps/build-test.yml

weshaggard · 2025-02-05T16:54:08Z

/check-enforcer override

* add new policy, add logic to use policy * added small test file I was using * initial regional endpoint work * groundwork * re-add AzureError logic, refactor, fix tests * Update _retry_utility.py * Updated location_cache with new design * Fixed key error with most_preferred_location * Update test_cosmos_http_logging_policy.py * Update _retry_utility.py * Added logic to refresh cache on previous endpoint usage * Added business logic update the regional endpoint based on success or failures * implementation * Update _retry_utility_async.py * fix some tests * changelog, versions, fixes * fixes * fix some tests * remove fake logic, count fix * fix some tests * Update _service_request_retry_policy.py * Update _retry_utility_async.py * retry utilities fixing * Update _retry_utility.py * additional enhancements * Update setup.py * Update _retry_utility_async.py * add tests, remove previous retry logic for ServiceRequestExceptions * clean up with finally * tests * retry utilities * disable tests * add logging to policies * GetDatabaseAccount Fix * Update _base.py * retry utilities fixes * Update _retry_utility.py * retry utulities part 34 * Update _service_request_retry_policy.py * remove extra logs * policy updates * Update _service_response_retry_policy.py * Update _service_response_retry_policy.py * policies updates and update operation types * trying out fixes * Update sdk/cosmos/azure-cosmos/CHANGELOG.md Co-authored-by: Abhijeet Mohanty <[email protected]> * Update sdk/cosmos/azure-cosmos/CHANGELOG.md Co-authored-by: Abhijeet Mohanty <[email protected]> * Skipped proxy test for debugging * annotation fix * Fixed some tests cases * test fixes * Update test_service_retry_policies_async.py * Fixed some mocking behavior * fixed pylint issues * Added aiohttp minimum dependency * Updated changelog and setup.py * Updated changelog * Add changelog and fix tests. 
* Fix tests * bootstrapping with global endpoint as previous for writes * Add headers and cleanup * cleanup and retry all service request headers * Don't retry on a none previous * Updated the business logic with current and previous, fixed database account refresh and some retry policies * fix client id * Reacting to comments * Added print statements and fixed some retry logic * Revert getDatabase in mark endpoint * Fixed some pylint and changelog issues * Fixed version * fix bug with type check, update tests * Update test_service_retry_policies_async.py * sync tests updates * Reacting to comments and fixing service request retry policy * Code review comments and pylint issues * Fixed tests and pylint * more sync mock tests - missing async copies * Fixed min aiohttp requirements * Update _retry_utility_async.py * Change to check operation type in operations * push initial GEM mock test * Update test_service_retry_policies.py * Fixed extra retries * sync tests * Update test_service_retry_policies_async.py * Fixed extra retries and relevant tests * Only delay retry by one second * async tests - need to split up inheritance ones since endpoint unavailable stops extra retries * Change retry strategy * add sub-class errors tests * change old tests, refactoring, fix mocking bleed * Fix a test * clear last routed location pythonic * Removed aiohttp dependency * catch import errors * Skipped global endpoint manager test for debugging * Fixed tests * Removed skips * fix live tests and print statements for debugging * cleanup of few tests * updated globaldb mock * Moved some of the high offer throughput tests to live tests * Fixed global endpoint retry async test * Tried fixing global endpoint retry async test * no swaps on success test * fix import * Tried fixing global endpoint retry async test * Added separate split live tests * Added live platform matrix * some test fixes * Fixed live test pipeline * Moved test resource id to cosmosLong * Updated live tests * Running 
live tests with proper flag * testing logging experiments * fix tests * honor testmark argument through a safe environment variable, versus accessing the value directly * more test fixes * remove accidental log files * Fixed issues with swapping and retry policies * Fixed issues with swapping and retry policies * Marking endpoint as down fix * more test fixes * Remove print statements * Fixed some minor issues with emulator tests * split change feed tests * Fixed emulator tests * updated changelog * Fixed emulator tests again * Fixed emulator tests and event loop * vector/fts query tests * Fix session token live tests * hybrid search query fixes * Fixed live test name * fallback to regional * fix ci tests * Update conftest.py * Database accounts call will timeout in 5 seconds * Change timeouts and update docs * call updates to endpoint policy and location cache * Health check for endpoitns * database account retry policy * Fix parameter error * Retry on cosmos error fix * Retry on service request error fix * None checks for request in retry utilities * lowercase constructed regional endpoint * fix global endpoint as unhealthy * fix parsing test * Added logic for swapping on health check failed * Fixed log statement * fix pylint, docs, and remove print statements * fix pylint * fix some tests * Prepared for release --------- Co-authored-by: Simon Moreno <[email protected]> Co-authored-by: Kushagra Thapar <[email protected]> Co-authored-by: Abhijeet Mohanty <[email protected]> Co-authored-by: Scott Beddall <[email protected]>

simorenoh and others added 12 commits January 24, 2025 00:56

add new policy, add logic to use policy

b0fe9fd

Merge branch 'main' into service_response_error_policy

2444847

added small test file I was using

83b7a0d

Merge branch 'service_response_error_policy' of https://github.com/simorenoh/azure-sdk-for-python into service_response_error_policy

39d8a86

initial regional endpoint work

3e5b8e5

Merge branch 'service_response_error_policy' of https://github.com/simorenoh/azure-sdk-for-python into tvaron3/regionalEndpoints

41eadb4

groundwork

6df9e49

re-add AzureError logic, refactor, fix tests

c0f9d5f

Update _retry_utility.py

3eec9a1

Updated location_cache with new design

cb7e8a8

Fixed key error with most_preferred_location

3ae5c3f

Merge remote-tracking branch 'simon/service_response_error_policy' into tvaron3/regionalEndpoints

116d8d3

tvaron3 requested review from annatisch and a team as code owners January 24, 2025 17:20

github-actions bot added the Cosmos label Jan 24, 2025

Update test_cosmos_http_logging_policy.py

8fd9f9d

simorenoh and others added 3 commits January 24, 2025 12:43

Update _retry_utility.py

1305358

Added logic to refresh cache on previous endpoint usage

0fbc20c

Added business logic update the regional endpoint based on success or failures

19e38c4

FabianMeiswinkel reviewed Jan 24, 2025

sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py

simorenoh reviewed Jan 24, 2025

sdk/cosmos/azure-cosmos/azure/cosmos/_retry_utility.py

simorenoh reviewed Jan 24, 2025

sdk/cosmos/azure-cosmos/azure/cosmos/_retry_utility.py

implementation

2342d27

jeet1995 reviewed Jan 24, 2025

sdk/cosmos/azure-cosmos/azure/cosmos/_location_cache.py

simorenoh and others added 3 commits January 24, 2025 17:57

Update _retry_utility_async.py

cd010fa

fix some tests

42ecabc

changelog, versions, fixes

61f477b

fixes

794cf18

tvaron3 and others added 14 commits February 4, 2025 13:43

Health check for endpoitns

504bc6c

Merge branch 'tvaron3/regionalEndpoints' of https://github.com/tvaron3/azure-sdk-for-python into tvaron3/regionalEndpoints

d78ef9d

database account retry policy

9507321

Fix parameter error

0421e91

Retry on cosmos error fix

e89bbe9

Retry on service request error fix

94d349b

None checks for request in retry utilities

c552653

lowercase constructed regional endpoint

f595650

fix global endpoint as unhealthy

af9a900

fix parsing test

0f7fd42

Added logic for swapping on health check failed

0579061

Fixed log statement

56c585f

fix pylint, docs, and remove print statements

7e0df0a

Merge branch 'tvaron3/regionalEndpoints' of https://github.com/tvaron3/azure-sdk-for-python into tvaron3/regionalEndpoints

cc5d3da

FabianMeiswinkel approved these changes Feb 5, 2025

fix pylint

bdb491a

tvaron3 mentioned this pull request Feb 5, 2025

Optimize Health Check #39560

Open

tvaron3 and others added 2 commits February 4, 2025 21:58

fix some tests

01575db

Prepared for release

4582bd7

kushagraThapar approved these changes Feb 5, 2025

kushagraThapar enabled auto-merge (squash) February 5, 2025 06:43

weshaggard reviewed Feb 5, 2025

eng/pipelines/templates/steps/build-test.yml

weshaggard approved these changes Feb 5, 2025

kushagraThapar merged commit fe6b4c7 into Azure:main Feb 5, 2025
51 of 61 checks passed
