Replace Endpoints with Regional Endpoints #39390
Merged
kushagraThapar
merged 194 commits into
Azure:main
from
tvaron3:tvaron3/regionalEndpoints
Feb 5, 2025
…morenoh/azure-sdk-for-python into service_response_error_policy
…morenoh/azure-sdk-for-python into tvaron3/regionalEndpoints
…to tvaron3/regionalEndpoints
API change check APIView has identified API level changes in this PR and created following API reviews. |
sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py
simorenoh
reviewed
Jan 24, 2025
simorenoh
reviewed
Jan 24, 2025
jeet1995
reviewed
Jan 24, 2025
Azure Pipelines successfully started running 1 pipeline(s). |
…3/azure-sdk-for-python into tvaron3/regionalEndpoints
…3/azure-sdk-for-python into tvaron3/regionalEndpoints
FabianMeiswinkel
approved these changes
Feb 5, 2025
LGTM - Thanks!
/azp run python - cosmos - tests |
Azure Pipelines successfully started running 1 pipeline(s). |
kushagraThapar
approved these changes
Feb 5, 2025
/check-enforcer override |
weshaggard
reviewed
Feb 5, 2025
weshaggard
approved these changes
Feb 5, 2025
/check-enforcer override |
l0lawrence
pushed a commit
to l0lawrence/azure-sdk-for-python
that referenced
this pull request
Feb 19, 2025
* add new policy, add logic to use policy * added small test file I was using * initial regional endpoint work * groundwork * re-add AzureError logic, refactor, fix tests * Update _retry_utility.py * Updated location_cache with new design * Fixed key error with most_preferred_location * Update test_cosmos_http_logging_policy.py * Update _retry_utility.py * Added logic to refresh cache on previous endpoint usage * Added business logic update the regional endpoint based on success or failures * implementation * Update _retry_utility_async.py * fix some tests * changelog, versions, fixes * fixes * fix some tests * remove fake logic, count fix * fix some tests * Update _service_request_retry_policy.py * Update _retry_utility_async.py * retry utilities fixing * Update _retry_utility.py * additional enhancements * Update setup.py * Update _retry_utility_async.py * add tests, remove previous retry logic for ServiceRequestExceptions * clean up with finally * tests * retry utilities * disable tests * add logging to policies * GetDatabaseAccount Fix * Update _base.py * retry utilities fixes * Update _retry_utility.py * retry utulities part 34 * Update _service_request_retry_policy.py * remove extra logs * policy updates * Update _service_response_retry_policy.py * Update _service_response_retry_policy.py * policies updates and update operation types * trying out fixes * Update sdk/cosmos/azure-cosmos/CHANGELOG.md Co-authored-by: Abhijeet Mohanty <[email protected]> * Update sdk/cosmos/azure-cosmos/CHANGELOG.md Co-authored-by: Abhijeet Mohanty <[email protected]> * Skipped proxy test for debugging * annotation fix * Fixed some tests cases * test fixes * Update test_service_retry_policies_async.py * Fixed some mocking behavior * fixed pylint issues * Added aiohttp minimum dependency * Updated changelog and setup.py * Updated changelog * Add changelog and fix tests. 
* Fix tests * bootstrapping with global endpoint as previous for writes * Add headers and cleanup * cleanup and retry all service request headers * Don't retry on a none previous * Updated the business logic with current and previous, fixed database account refresh and some retry policies * fix client id * Reacting to comments * Added print statements and fixed some retry logic * Revert getDatabase in mark endpoint * Fixed some pylint and changelog issues * Fixed version * fix bug with type check, update tests * Update test_service_retry_policies_async.py * sync tests updates * Reacting to comments and fixing service request retry policy * Code review comments and pylint issues * Fixed tests and pylint * more sync mock tests - missing async copies * Fixed min aiohttp requirements * Update _retry_utility_async.py * Change to check operation type in operations * push initial GEM mock test * Update test_service_retry_policies.py * Fixed extra retries * sync tests * Update test_service_retry_policies_async.py * Fixed extra retries and relevant tests * Only delay retry by one second * async tests - need to split up inheritance ones since endpoint unavailable stops extra retries * Change retry strategy * add sub-class errors tests * change old tests, refactoring, fix mocking bleed * Fix a test * clear last routed location pythonic * Removed aiohttp dependency * catch import errors * Skipped global endpoint manager test for debugging * Fixed tests * Removed skips * fix live tests and print statements for debugging * cleanup of few tests * updated globaldb mock * Moved some of the high offer throughput tests to live tests * Fixed global endpoint retry async test * Tried fixing global endpoint retry async test * no swaps on success test * fix import * Tried fixing global endpoint retry async test * Added separate split live tests * Added live platform matrix * some test fixes * Fixed live test pipeline * Moved test resource id to cosmosLong * Updated live tests * Running 
live tests with proper flag * testing logging experiments * fix tests * honor testmark argument through a safe environment variable, versus accessing the value directly * more test fixes * remove accidental log files * Fixed issues with swapping and retry policies * Fixed issues with swapping and retry policies * Marking endpoint as down fix * more test fixes * Remove print statements * Fixed some minor issues with emulator tests * split change feed tests * Fixed emulator tests * updated changelog * Fixed emulator tests again * Fixed emulator tests and event loop * vector/fts query tests * Fix session token live tests * hybrid search query fixes * Fixed live test name * fallback to regional * fix ci tests * Update conftest.py * Database accounts call will timeout in 5 seconds * Change timeouts and update docs * call updates to endpoint policy and location cache * Health check for endpoitns * database account retry policy * Fix parameter error * Retry on cosmos error fix * Retry on service request error fix * None checks for request in retry utilities * lowercase constructed regional endpoint * fix global endpoint as unhealthy * fix parsing test * Added logic for swapping on health check failed * Fixed log statement * fix pylint, docs, and remove print statements * fix pylint * fix some tests * Prepared for release --------- Co-authored-by: Simon Moreno <[email protected]> Co-authored-by: Kushagra Thapar <[email protected]> Co-authored-by: Abhijeet Mohanty <[email protected]> Co-authored-by: Scott Beddall <[email protected]>
cRui861
pushed a commit
that referenced
this pull request
Feb 27, 2025
Design
Currently in the SDK, every read request is retried 3 times in the same region before the SDK fails over to another region and marks the current region unavailable. This helps read requests but not write requests, which are not retried at all in the same region and, for single-master accounts, cannot be retried in other regions.
With this new feature to improve write request availability, the SDK will now maintain a fallback endpoint. Both the current and fallback endpoints point to the same write region. This allows the SDK to retry write requests on the fallback endpoint if the current endpoint is unavailable because of connectivity issues. Below are some of the feature implementation details and the testing that we have done so far.
The GetDatabaseAccount call, which is made during bootstrapping and every 5 minutes, will now return the following URIs for a region:
<account-name>-<region-name>.documents.azure.com
<account-name>.documents.azure.com
The service will randomly send variations of these endpoints in the getDatabaseAccount call (which the SDK makes every 5 minutes).
Example: for a Cosmos DB account named testAccount hosted in the West US region, the service will now send the following endpoints in some round-robin fashion:
testAccount-westus.documents.azure.com
testAccount.documents.azure.com
This allows clients to point to two different VIPs for the Gateway service to improve availability in scenarios when one gateway VIP goes down.
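To make the naming pattern above concrete, here is a minimal sketch of deriving the regional endpoint from the global one. The helper name build_regional_endpoint is hypothetical and is not taken from the SDK; the sketch assumes the region is written into the host name with spaces removed and in lowercase, as in the testAccount-westus example.

```python
from urllib.parse import urlsplit, urlunsplit


def build_regional_endpoint(global_endpoint: str, region_name: str) -> str:
    """Hypothetical helper: turn <account>.<domain> into <account>-<region>.<domain>."""
    parts = urlsplit(global_endpoint)
    # urlsplit().hostname is already lowercased; DNS names are case-insensitive.
    host = parts.hostname or ""
    account, _, domain = host.partition(".")
    # Assumption: "West US" becomes "westus" (no spaces, lowercase).
    region = region_name.replace(" ", "").lower()
    regional_host = f"{account}-{region}.{domain}"
    if parts.port:
        regional_host = f"{regional_host}:{parts.port}"
    return urlunsplit(
        (parts.scheme, regional_host, parts.path, parts.query, parts.fragment)
    )


print(build_regional_endpoint("https://testAccount.documents.azure.com:443/", "West US"))
# prints: https://testaccount-westus.documents.azure.com:443/
```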
The SDK keeps track of a current and a previous endpoint, taken from these two endpoints returned by the gateway.
SDK Updated Retry policy:
The SDK's default retry policy has 3 in-region retries in addition to the original request. The default connection timeout is 5 seconds. The default read timeout is 65 seconds.
ServiceRequestError:
This error happens when the client tries to connect to the server but, for some reason, cannot. In this case, since the SDK knows that the request has not reached the service, it will retry both read and write requests.
For write requests, the SDK will first retry 3 times on the current endpoint and then 3 times on the fallback endpoint. If the request still fails, it will mark the endpoints unavailable for write operations and retry on other regions if more write regions are available (the multi-master case); otherwise, the write request will fail.
For reads, it will retry 3 times on the current endpoint, then mark the region unavailable for reads and retry on other regions.
ServiceResponseError:
This error happens when the client has already connected to the server but, for some reason, received an error during the response. In this case, the SDK will only retry read operations, since it is not safe to retry writes: the SDK does not know whether the write operation succeeded.
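The two error cases above can be summarized as a small decision function. This is only a sketch of the in-region behavior as described in this section, not the SDK's retry policy code: ErrorKind and retry_targets are hypothetical names, and the number of read retries after a ServiceResponseError is an assumption (the default of 3 in-region retries).

```python
from enum import Enum
from typing import List, Optional

IN_REGION_RETRIES = 3  # default number of in-region retries described above


class ErrorKind(Enum):
    SERVICE_REQUEST = "ServiceRequestError"    # request never reached the service
    SERVICE_RESPONSE = "ServiceResponseError"  # failure while receiving the response


def retry_targets(
    error: ErrorKind, is_write: bool, current: str, fallback: Optional[str]
) -> List[str]:
    """Return the endpoints to retry, in order, after the original attempt failed.

    Only in-region retries are modeled; failing over to another region once
    this list is exhausted is left out of the sketch.
    """
    if error is ErrorKind.SERVICE_RESPONSE:
        # The write may already have been applied, so only reads are retried.
        # Assumption: reads use the default 3 retries on the current endpoint.
        return [] if is_write else [current] * IN_REGION_RETRIES
    # ServiceRequestError: safe to retry both reads and writes.
    targets = [current] * IN_REGION_RETRIES
    if is_write and fallback is not None:
        # Writes get 3 more retries on the fallback endpoint of the same region.
        targets += [fallback] * IN_REGION_RETRIES
    return targets


plan = retry_targets(ErrorKind.SERVICE_REQUEST, True, "regional", "global")
print(plan)  # three retries on "regional", then three on "global"
```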
Implementation
Location cache will now have a new RegionalEndpoint object that holds a current and a previous endpoint. The idea is that the previous endpoint can be used for retries in certain scenarios. There will now be a health check on every 5-minute global database account refresh. The health check reaches out to the different endpoints using a database account call because it is quick. We also limited database account calls to 3 seconds and 1 retry. If this health check fails, we set the endpoint as unavailable.
Pseudocode of new current and previous logic
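As an illustration only, an object holding a current and a previous endpoint might behave roughly as follows. Apart from the class name RegionalEndpoint, every attribute and method here (current, previous, update, swap) is a hypothetical name chosen for this sketch, and the update and swap rules are assumptions rather than the SDK's actual logic.

```python
from typing import Optional


class RegionalEndpoint:
    """Sketch of a current/previous endpoint pair for one region."""

    def __init__(self, current: str, previous: Optional[str] = None) -> None:
        self.current = current
        self.previous = previous

    def update(self, new_endpoint: str) -> None:
        # Assumption: when a refresh reports a different endpoint for the
        # region, the old current endpoint is kept as the previous one.
        if new_endpoint != self.current:
            self.previous = self.current
            self.current = new_endpoint

    def swap(self) -> None:
        # Assumption: after failures (or a failed health check) on the
        # current endpoint, the previous endpoint is promoted.
        if self.previous is not None:
            self.current, self.previous = self.previous, self.current


endpoint = RegionalEndpoint("https://testaccount.documents.azure.com")
endpoint.update("https://testaccount-westus.documents.azure.com")
endpoint.swap()  # current is the global endpoint again, previous is the regional one
```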
Testing:
Testing by bringing the federations down in the staging environment:
Account Regions: East US 2, North Central US
SDK Preferred Locations: East US 2, North Central US
Note: Both read and write requests are retried by default 3 times in the current region. By default, the connection / request timeout in the Python SDK is 60 seconds.
Note: We have not yet tested Envoy Proxy. We are setting it up locally in house to see how the new retry policies react to it, and we would like to complete this testing before shipping this hotfix to make sure the code change works reliably for customers using a proxy.
Bootstrapping Scenario:
Global endpoint down:
If the global endpoint is down, the initial topology call fails, and because of the default 60-second connection timeout and 3 in-region retries, it currently takes around 4 minutes before this topology call is retried on other endpoints. Customers can mitigate this by lowering the request timeout; it should be set somewhere between 5 and 10 seconds. Meanwhile, we are also updating the timeout of the getDatabaseAccount call to 5 seconds so that the SDK can recover more quickly.
If the global endpoint is down and the gateway returns the global endpoint for the write region, the SDK will first try the global endpoint; once that fails, it will construct the regional endpoint (in the bootstrapping case) and try it as the fallback.
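The figure of around 4 minutes for the topology call follows from simple arithmetic on the timeout and retry count. The sketch below uses a hypothetical helper, worst_case_seconds, and ignores any delay between retries, so it gives only a rough estimate.

```python
def worst_case_seconds(timeout_s: float, retries: int = 3) -> float:
    """Time spent if the original attempt and every retry each hit the timeout."""
    attempts = 1 + retries  # 1 original request + in-region retries
    return timeout_s * attempts


print(worst_case_seconds(60))  # 240 seconds, i.e. about 4 minutes
print(worst_case_seconds(5))   # 20 seconds with the planned 5-second timeout
```

With a timeout in the suggested 5 to 10 second range, the same estimate comes to 20 to 40 seconds.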
Regional endpoint down:
In this case, the SDK will keep going to the global endpoint and keep refreshing the location cache every 5 minutes (as usual). If it receives the regional endpoint, it will mark it unavailable after 4 attempts (1 original request + 3 retries) and keep falling back to the global endpoint.
Both endpoints healthy:
In this case no issues were observed, and the SDK is able to load balance well.
Runtime Scenario:
Global Endpoint down:
For the GetDatabaseAccount call: because of the default 60-second request timeout and 3 retries, throughput decreases to a very low value, eventually dropping to almost 0 for a few minutes, due to connectivity issues with the global endpoint for the topology call. This can be mitigated by lowering the request timeout. (We are planning to reduce the timeout of this getDatabaseAccount call to 5 seconds in the current drop of the SDK.)
For write requests -> they will retry on the fallback regional endpoint. If there is another write region in the preferred_regions, writes will then go to those regions after the current region is marked unavailable.
For read requests -> they will not be retried on the fallback endpoint and will go to a different region after the current region is marked unavailable for further reads.
Regional Endpoint down:
Almost no effect on the throughput of either read or write requests: the global endpoint is used for the topology call, and since that is working, there is one less point of failure.