PYTHON-5504 Prototype exponential backoff in with_transaction #2492

Merged: 2 commits into mongodb:backpressure, Aug 19, 2025
Conversation

@ShaneHarvey (Member) commented Aug 19, 2025

PYTHON-5504 Prototype exponential backoff in with_transaction.

Running the repro script from the Jira ticket, which launches 200 concurrent transactions in 200 threads all updating the same document, shows a significant reduction in wasted retry attempts and in latency (from p50 through p100). Before this change:

$ python3.13t repro-with_transaction-write-conflict-storm.py
Completed 200 transactions in 200 threads in 4.8626720905303955 seconds
Total retry attempts: 8132
avg latency: 3.04s p50: 3.36s p90: 4.59s p99: 4.83s p100: 4.84s

After (50ms initial backoff, 1000ms max backoff, full jitter, with backoff starting on the second retry attempt):

$ python3.13t repro-with_transaction-write-conflict-storm.py
Completed 200 transactions in 200 threads in 4.251200914382935 seconds
Total retry attempts: 1089
avg latency: 1.53s p50: 1.45s p90: 2.77s p99: 3.69s p100: 4.24s

Backoff starting on the first retry attempt appears to work even better:

$ python3.13t repro-with_transaction-write-conflict-storm.py
Completed 200 transactions in 200 threads in 3.272695779800415 seconds
Total retry attempts: 886
avg latency: 1.33s p50: 1.21s p90: 2.50s p99: 3.22s p100: 3.25s

Note that I'm using free-threaded mode to make this repro more similar to the behavior of other languages and other deployment types (e.g. many single-threaded clients running on different machines).
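
For reference, here is a minimal sketch of the backoff calculation described above (50ms initial backoff, 1000ms cap, full jitter). The helper name, the attempt indexing, and the surrounding retry loop are illustrative assumptions, not the actual with_transaction implementation in this PR:

import random
import time

INITIAL_BACKOFF = 0.050  # 50ms initial backoff
MAX_BACKOFF = 1.0        # 1000ms cap

def backoff_delay(attempt: int) -> float:
    # Exponential growth capped at MAX_BACKOFF, then "full jitter":
    # sleep for a uniformly random duration in [0, capped_delay].
    capped = min(MAX_BACKOFF, INITIAL_BACKOFF * (2 ** attempt))
    return random.uniform(0.0, capped)

# Hypothetical shape of a with_transaction-style retry loop using the delay:
# for attempt in range(max_attempts):
#     try:
#         run_transaction()
#         break
#     except TransientTransactionError:
#         time.sleep(backoff_delay(attempt))

The full jitter spreads the 200 contending threads out across the backoff window rather than letting them retry in lockstep, which is consistent with the drop in total retry attempts from 8132 to roughly 1000 in the numbers above.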

@ShaneHarvey changed the base branch from master to backpressure on August 19, 2025 15:44
@ShaneHarvey marked this pull request as ready for review on August 19, 2025 17:17
@ShaneHarvey requested a review from a team as a code owner on August 19, 2025 17:17
@ShaneHarvey requested a review from NoahStapp on August 19, 2025 17:17
@ShaneHarvey changed the title from "DRIVERS-1934 POC exponential backoff in withTransaction" to "PYTHON-5504 Prototype exponential backoff in with_transaction" on Aug 19, 2025
@NoahStapp (Contributor)

Can you add an async version of the benchmark? Having both APIs be tested before merging this into the backpressure branch would be ideal.

@ShaneHarvey (Member, Author) commented Aug 19, 2025

Done. I added the script to the Jira ticket. The async version still shows a significant reduction in the number of wasted retries and lower p50 through p90 latency, but little to no benefit for p99 and p100 latency.

Before:

$ python3.13 repro-storm-async.py
Completed 200 concurrent async transactions in 3.467694044113159 seconds
Total retry attempts: 5950
avg latency: 1.87s p50: 2.04s p90: 3.22s p99: 3.45s p100: 3.46s

After:

$ python3.13 repro-storm-async.py
Completed 200 concurrent async transactions in 3.5634748935699463 seconds
Total retry attempts: 887
avg latency: 1.48s p50: 1.41s p90: 2.81s p99: 3.49s p100: 3.56s

@NoahStapp (Contributor)


Async sees significantly less improvement with the backoff, but I'd say that's expected. Asyncio's cooperative multitasking structure already prevents a given operation from retrying before the other concurrent async tasks have had a chance to run (assuming the async/await code is written correctly).
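
To illustrate that point, here is a toy sketch (the names and the simulated conflict are hypothetical, not PyMongo code): the await at the retry boundary suspends the coroutine, so the event loop runs the other pending tasks before this one gets to retry.

import asyncio
import random

async def run_with_retry(task_id: int) -> None:
    for attempt in range(10):
        # Simulate a transaction attempt that may hit a transient write conflict.
        if random.random() < 0.5:
            print(f"task {task_id}: committed on attempt {attempt}")
            return
        # Awaiting (even a zero-second sleep) yields control back to the event
        # loop, letting the other concurrent tasks run before this one retries.
        await asyncio.sleep(0)

async def main() -> None:
    await asyncio.gather(*(run_with_retry(i) for i in range(5)))

asyncio.run(main())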

@ShaneHarvey merged commit cf7a1aa into mongodb:backpressure on Aug 19, 2025
31 checks passed