-
Notifications
You must be signed in to change notification settings - Fork 1.1k
PYTHON-5505 Prototype system overload retry loop for all operations #2497
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…d on overload The server only adds the "Retryable" error label when it knows it is safe to retry.
RUN_ON_LOAD_BALANCER = True | ||
|
||
@async_client_context.require_failCommand_appName | ||
async def test_retry_overload_error_command(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note these scenarios will be later turned into unified spec tests.
pymongo/asynchronous/mongo_client.py
Outdated
@@ -2783,14 +2786,19 @@ async def run(self) -> T: | |||
# most likely be a waste of time. | |||
raise | |||
except PyMongoError as exc: | |||
overload = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my own understanding: what's the purpose of this overload
value? It's True
when the error is an overload error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's my understanding of it -- and then when overload
is true and we've reached max_retries
, we raise the error, otherwise we backoff
and try again. (I think)
And after this initial attempt, we only ever set overload
to exc.error.has_error_label("Retryable")
, is there a reason why we cant just set it to that here?
(side note: I like how _overload
and _retrying
are the same length, but I think _overloaded
is a slightly more accurate var? i don't think this matters that much though since I feel like _overload
does get the point across for me)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think _overload
is only set when the error is a result of the server being overloaded. If so, _overloaded
is a clearer variable name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's all correct. I've renamed it overloaded
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: this was renamed "always_retryable". "overloaded" is now based on the "SystemOverloaded" error label.
test_retry_overload_error_update_many fails on actions but passes locally. I'm investigating... ____________ TestBackpressure.test_retry_overload_error_update_many ____________
self = <test.asynchronous.test_backpressure.TestBackpressure testMethod=test_retry_overload_error_update_many>
@async_client_context.require_failCommand_appName
async def test_retry_overload_error_update_many(self):
# Even though update_many is not a retryable write operation, it will
# still be retried via the "Retryable" error label.
await self.db.t.insert_one({"x": 1})
# Ensure command is retried on overload error.
fail_once = mock_overload_error.copy()
fail_once["mode"] = {"times": _MAX_RETRIES}
async with self.fail_point(fail_once):
> await self.db.t.update_many({}, {"$set": {"x": 2}}) https://github.com/mongodb/mongo-python-driver/actions/runs/17106265661/job/48515851763?pr=2497 |
pymongo/asynchronous/mongo_client.py
Outdated
retryable_write_error_exc = isinstance( | ||
exc.error, PyMongoError | ||
) and exc.error.has_error_label("RetryableWriteError") | ||
if isinstance(exc.error, PyMongoError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this if-else
block? Both branches are identical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this can be cleaned up. I this it's a problem with ClientBulkWriteException violating Liskov substitution principle. Perhaps ClientBulkWriteException should override has_error_label() to behave correctly. I suspect the NoWritesPerformed check below is buggy because of this issue too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored to remove the duplicate label checks. I decided not to do PYTHON-5511 yet to reduce the scope of this PR.
@@ -2783,14 +2786,21 @@ async def run(self) -> T: | |||
# most likely be a waste of time. | |||
raise | |||
except PyMongoError as exc: | |||
always_retryable = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
always_retryable
represents what exactly? That an error is retryable under all conditions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it means it's safe to retry even if the operation is otherwise non-retryable, like an update_many. There's probably a better name for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overloaded
still takes precedence over always_retryable
, correct?
Found the bug causing the update_many test to fail. This was only happening on standalone nodes which don't support retryable writes. The fix is to skip the check for Passing patch build: https://spruce.mongodb.com/version/68a64f28268fea00074a72a4/tasks |
We can't retry writes on standalone anyway, so how does skipping the |
This feature adds the ability to retry all commands that fail with the "Retryable" error label, even on standalone. The server only adds this label when it's safe to retry (eg the command was rejected before doing any work). |
PYTHON-5505 Prototype system overload retry loop for all operations
Description:
All operations that fail with the "Retryable" error label will be retried up to 3 times. When the last failure includes the "SystemOverloaded" error label we apply exponential backoff with jitter before attempting a retry.
PR Tasks:
Passing patch build: https://spruce.mongodb.com/version/68a64f28268fea00074a72a4/tasks