
RUST-1842 Update prose tests for mongos deprioritization during retryable ops #1397


Merged
21 commits merged into mongodb:main on Jul 7, 2025

Conversation

Collaborator

@JamieTsai1024 JamieTsai1024 commented Jun 12, 2025

  • Added assertions to check whether failed events occurred on the same or different mongos for retryable_reads and retryable_writes.
  • Rewrote the implementation of the retryable read/write tests on different mongos hosts. The prose updates introduced flakiness in the MongoDB 4.2 and 4.4 sharded tasks on the macos-14.00 variant, where server discovery was too slow.
    • Solution: instead of creating a client per server, we now use a single client that connects to all servers and targets individual mongos hosts with predicate-based selection criteria (see the sketch below).
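
As a rough illustration of the predicate approach (a sketch assuming the driver's `SelectionCriteria::Predicate` API, not the PR's actual test code; the helper name is made up):

```rust
use std::sync::Arc;

use mongodb::options::{SelectionCriteria, ServerAddress};

// Hypothetical helper: build a selection criteria that makes only the given
// mongos eligible, so a single shared client can direct an operation (such as
// configuring a failpoint) at one specific host.
fn criteria_for_host(target: ServerAddress) -> SelectionCriteria {
    SelectionCriteria::Predicate(Arc::new(move |server| server.address() == &target))
}
```

Because every per-host command goes through the same client, that client's topology state (and therefore its server discovery progress) is shared across the setup steps and the operation under test.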


@JamieTsai1024 JamieTsai1024 marked this pull request as ready for review June 25, 2025 18:30
@JamieTsai1024 JamieTsai1024 requested a review from a team as a code owner June 25, 2025 18:30
Contributor

@isabelatkinson isabelatkinson left a comment


Can you add back the logs in the server selection code that show the server was properly deprioritized? We can leave those in until all the code changes are approved, and then you can remove them as the last step before merging.


let mut guards = Vec::new();
for address in hosts {
Contributor


For future reference, can you add a comment here explaining why we set the failpoints this way rather than with separate clients? And ditto elsewhere.

Collaborator Author

@JamieTsai1024 JamieTsai1024 Jul 1, 2025


Done! Let me know if you have any suggestions on the explanation!

Contributor


Some of these details aren't quite accurate - the important distinction to note is that we're using the same client to set the failpoints on each mongos as we are for the find operation. The fundamental problem that we were encountering was a race between server discovery, which happens in the background after a client is created, and the server selection process for find, which was previously happening right after creating the client. Server discovery goes roughly as follows:

  • The client gets created with two mongos addresses (localhost:27017 and localhost:27018) and stores each of these in its topology with an initial server type of Unknown. (Unknown servers are not eligible to be selected for operations.)
  • The client sends a hello message to each mongos and waits for a reply.
  • Each mongos replies to the hello message with information about itself, and the client uses this information to update that server's type from Unknown to Mongos.

Executing an operation (in this case, enable_fail_point) on each individual mongos forces the client to complete its discovery of that mongos and select it for the operation. This means that once we get to the find operation, the client has a list of two Mongos servers to select from. In contrast, when we were creating a new client for each call to enable_fail_point and then the find operation, each of those clients was restarting the server discovery process from scratch.
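
To make that ordering concrete, here is a minimal sketch of the single-client flow (assuming the current driver's async API; the database and collection names are placeholders, and the per-mongos failpoint step is elided rather than being the PR's actual test code):

```rust
use mongodb::{
    bson::{doc, Document},
    options::ClientOptions,
    Client,
};

async fn single_client_flow() -> mongodb::error::Result<()> {
    // One client, seeded with both mongos addresses. Immediately after creation,
    // both servers sit in the topology as Unknown; discovery runs in the background.
    let options = ClientOptions::parse("mongodb://localhost:27017,localhost:27018").await?;
    let client = Client::with_options(options)?;

    // Configuring the failpoints here (elided), pinned to each mongos via a
    // predicate selection criteria, forces this client to finish discovering that
    // host and flip its server type from Unknown to Mongos.

    // By the time find runs, the same client already knows about two selectable
    // Mongos servers, so selection and retry deprioritization behave predictably.
    let coll = client.database("test").collection::<Document>("coll");
    let _ = coll.find_one(doc! {}).await?;
    Ok(())
}
```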

The details here can be a little tricky to understand, so let me know if you have any questions about this and we can walk through it in more detail!

Collaborator Author


Thanks so much for the detailed explanation, Isabel! I hadn’t fully understood how server discovery works in the background or how using separate clients was restarting that process. I also realize now that some of my original terminology wasn’t quite accurate (e.g., implying it was about a single mongos instead of the client's discovery state), so I appreciate the correction.

I’ve updated the comment to reflect that. Let me know if it looks good now or if I should tweak anything further - would be happy to chat about it more if my understanding is still off!

Contributor


looks great! thanks for making those changes.

isabelatkinson previously approved these changes Jul 3, 2025
Contributor

@isabelatkinson isabelatkinson left a comment


lgtm! you can push a change to remove the logs and then I'll reapprove to merge.


let mut guards = Vec::new();
for address in hosts {
Contributor


looks great! thanks for making those changes.

@JamieTsai1024 JamieTsai1024 merged commit 87e9a59 into mongodb:main Jul 7, 2025
16 of 18 checks passed