Skip to content

PYTHON-5505 Prototype system overload retry loop for all operations #2497

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Aug 21, 2025

Conversation

ShaneHarvey
Copy link
Member

@ShaneHarvey ShaneHarvey commented Aug 19, 2025

PYTHON-5505 Prototype system overload retry loop for all operations

Description:
All operations that fail with the "Retryable" error label will be retried up to 3 times. When the last failure includes the "SystemOverloaded" error label we apply exponential backoff with jitter before attempting a retry.

PR Tasks:

  • Validate command path
  • Validate retryable write path
  • Validate retryable write path with an otherwise non-retryable operation (eg update_many).
  • Validate retryable read path
  • Validate cursor getMore path

Passing patch build: https://spruce.mongodb.com/version/68a64f28268fea00074a72a4/tasks

@ShaneHarvey ShaneHarvey marked this pull request as ready for review August 19, 2025 23:03
@ShaneHarvey ShaneHarvey requested a review from a team as a code owner August 19, 2025 23:03
…d on overload

The server only adds the "Retryable" error label when it knows it is safe to retry.
RUN_ON_LOAD_BALANCER = True

@async_client_context.require_failCommand_appName
async def test_retry_overload_error_command(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note these scenarios will be later turned into unified spec tests.

@@ -2783,14 +2786,19 @@ async def run(self) -> T:
# most likely be a waste of time.
raise
except PyMongoError as exc:
overload = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own understanding: what's the purpose of this overload value? It's True when the error is an overload error?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's my understanding of it -- and then when overload is true and we've reached max_retries, we raise the error, otherwise we backoff and try again. (I think)

And after this initial attempt, we only ever set overload to exc.error.has_error_label("Retryable"), is there a reason why we cant just set it to that here?

(side note: I like how _overload and _retrying are the same length, but I think _overloaded is a slightly more accurate var? i don't think this matters that much though since I feel like _overload does get the point across for me)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think _overload is only set when the error is a result of the server being overloaded. If so, _overloaded is a clearer variable name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's all correct. I've renamed it overloaded.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: this was renamed "always_retryable". "overloaded" is now based on the "SystemOverloaded" error label.

@ShaneHarvey ShaneHarvey requested a review from NoahStapp August 20, 2025 17:53
@ShaneHarvey
Copy link
Member Author

ShaneHarvey commented Aug 20, 2025

test_retry_overload_error_update_many fails on actions but passes locally. I'm investigating...

____________ TestBackpressure.test_retry_overload_error_update_many ____________

self = <test.asynchronous.test_backpressure.TestBackpressure testMethod=test_retry_overload_error_update_many>

    @async_client_context.require_failCommand_appName
    async def test_retry_overload_error_update_many(self):
        # Even though update_many is not a retryable write operation, it will
        # still be retried via the "Retryable" error label.
        await self.db.t.insert_one({"x": 1})
    
        # Ensure command is retried on overload error.
        fail_once = mock_overload_error.copy()
        fail_once["mode"] = {"times": _MAX_RETRIES}
        async with self.fail_point(fail_once):
>           await self.db.t.update_many({}, {"$set": {"x": 2}})

https://github.com/mongodb/mongo-python-driver/actions/runs/17106265661/job/48515851763?pr=2497

retryable_write_error_exc = isinstance(
exc.error, PyMongoError
) and exc.error.has_error_label("RetryableWriteError")
if isinstance(exc.error, PyMongoError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this if-else block? Both branches are identical.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this can be cleaned up. I this it's a problem with ClientBulkWriteException violating Liskov substitution principle. Perhaps ClientBulkWriteException should override has_error_label() to behave correctly. I suspect the NoWritesPerformed check below is buggy because of this issue too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

@ShaneHarvey ShaneHarvey Aug 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored to remove the duplicate label checks. I decided not to do PYTHON-5511 yet to reduce the scope of this PR.

@@ -2783,14 +2786,21 @@ async def run(self) -> T:
# most likely be a waste of time.
raise
except PyMongoError as exc:
always_retryable = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always_retryable represents what exactly? That an error is retryable under all conditions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it means it's safe to retry even if the operation is otherwise non-retryable, like an update_many. There's probably a better name for it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overloaded still takes precedence over always_retryable, correct?

@ShaneHarvey
Copy link
Member Author

ShaneHarvey commented Aug 20, 2025

Found the bug causing the update_many test to fail. This was only happening on standalone nodes which don't support retryable writes. The fix is to skip the check for sessions_supported when attempting to retry after a system overload error. I believe this is safe.

Passing patch build: https://spruce.mongodb.com/version/68a64f28268fea00074a72a4/tasks

@ShaneHarvey ShaneHarvey requested a review from NoahStapp August 20, 2025 23:21
@NoahStapp
Copy link
Contributor

Found the bug causing the update_many test to fail. This was only happening on standalone nodes which don't support retryable writes. The fix is to skip the check for sessions_supported when attempting to retry after a system overload error. I believe this is safe.

Passing patch build: spruce.mongodb.com/version/68a64f28268fea00074a72a4/tasks

We can't retry writes on standalone anyway, so how does skipping the sessions_supported check fix that? Wouldn't that cause us to attempt a retry when we can't?

@ShaneHarvey
Copy link
Member Author

ShaneHarvey commented Aug 21, 2025

We can't retry writes on standalone anyway, so how does skipping the sessions_supported check fix that? Wouldn't that cause us to attempt a retry when we can't?

This feature adds the ability to retry all commands that fail with the "Retryable" error label, even on standalone. The server only adds this label when it's safe to retry (eg the command was rejected before doing any work).

@ShaneHarvey ShaneHarvey merged commit 75eee91 into mongodb:backpressure Aug 21, 2025
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants