
RUST-1842 Update prose tests for mongos deprioritization during retryable ops #1397


Merged
21 commits merged into mongodb:main on Jul 7, 2025

Conversation

Collaborator

@JamieTsai1024 JamieTsai1024 commented Jun 12, 2025

  • Added assertions to check whether failed events occurred on the same or different mongos for retryable_reads and retryable_writes.
  • Rewrote the implementation of the retryable read/write tests on different mongos hosts. The prose updates introduced flakiness in the MongoDB 4.2 and 4.4 sharded tasks on the macos-14.00 variant, where server discovery was too slow.
    • Solution: instead of creating a client per server, we now use a single client that connects to all servers and targets individual mongos hosts with predicate-based selection criteria (see the sketch below).
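
As a rough illustration of the predicate approach (a sketch assuming the driver's `SelectionCriteria::Predicate` API, not the PR's actual test code; the helper name is made up):

```rust
use std::sync::Arc;

use mongodb::options::{SelectionCriteria, ServerAddress};

// Hypothetical helper: build a selection criteria that makes only the given
// mongos eligible, so a single shared client can direct an operation (such as
// configuring a failpoint) at one specific host.
fn criteria_for_host(target: ServerAddress) -> SelectionCriteria {
    SelectionCriteria::Predicate(Arc::new(move |server| server.address() == &target))
}
```

Because every per-host command goes through the same client, that client's topology state (and therefore its server discovery progress) is shared across the setup steps and the operation under test.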


@JamieTsai1024 JamieTsai1024 marked this pull request as ready for review June 25, 2025 18:30
@JamieTsai1024 JamieTsai1024 requested a review from a team as a code owner June 25, 2025 18:30
Contributor

@isabelatkinson isabelatkinson left a comment


Can you add back the logs in the server selection code that show the server was properly deprioritized? We can leave those in until all the code changes are approved, and then you can remove them as the last step before merging.


let mut guards = Vec::new();
for address in hosts {
Contributor


For future reference, can you add a comment here explaining why we set the failpoints this way rather than with separate clients? And ditto elsewhere.

Collaborator Author

@JamieTsai1024 JamieTsai1024 Jul 1, 2025


Done! Let me know if you have any suggestions on the explanation!

Contributor


Some of these details aren't quite accurate - the important distinction to note is that we're using the same client to set the failpoints on each mongos as we are for the find operation. The fundamental problem that we were encountering was a race between server discovery, which happens in the background after a client is created, and the server selection process for find, which was previously happening right after creating the client. Server discovery goes roughly as follows:

  • The client gets created with two mongos addresses (localhost:27017 and localhost:27018) and stores each of these in its topology with an initial server type of Unknown. (Unknown servers are not eligible to be selected for operations.)
  • The client sends a hello message to each mongos and waits for a reply.
  • Each mongos replies to the hello message with information about itself, and the client uses this information to update that server's type from Unknown to Mongos.

Executing an operation (in this case, enable_fail_point) on each individual mongos forces the client to complete its discovery of that mongos and select it for the operation. This means that once we get to the find operation, the client has a list of two Mongos servers to select from. In contrast, when we were creating a new client for each call to enable_fail_point and then the find operation, each of those clients was restarting the server discovery process from scratch.
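
To make that ordering concrete, here is a minimal sketch of the single-client flow (assuming the current driver's async API; the database and collection names are placeholders, and the per-mongos failpoint step is elided rather than being the PR's actual test code):

```rust
use mongodb::{
    bson::{doc, Document},
    options::ClientOptions,
    Client,
};

async fn single_client_flow() -> mongodb::error::Result<()> {
    // One client, seeded with both mongos addresses. Immediately after creation,
    // both servers sit in the topology as Unknown; discovery runs in the background.
    let options = ClientOptions::parse("mongodb://localhost:27017,localhost:27018").await?;
    let client = Client::with_options(options)?;

    // Configuring the failpoints here (elided), pinned to each mongos via a
    // predicate selection criteria, forces this client to finish discovering that
    // host and flip its server type from Unknown to Mongos.

    // By the time find runs, the same client already knows about two selectable
    // Mongos servers, so selection and retry deprioritization behave predictably.
    let coll = client.database("test").collection::<Document>("coll");
    let _ = coll.find_one(doc! {}).await?;
    Ok(())
}
```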

The details here can be a little tricky to understand, so let me know if you have any questions about this and we can walk through it in more detail!

Collaborator Author


Thanks so much for the detailed explanation, Isabel! I hadn’t fully understood how server discovery works in the background or how using separate clients was restarting that process. I also realize now that some of my original terminology wasn’t quite accurate (e.g., implying it was about a single mongos instead of the client's discovery state), so I appreciate the correction.

I’ve updated the comment to reflect that. Let me know if it looks good now or if I should tweak anything further - would be happy to chat about it more if my understanding is still off!

Contributor


looks great! thanks for making those changes.

isabelatkinson previously approved these changes Jul 3, 2025
Contributor

@isabelatkinson isabelatkinson left a comment


lgtm! you can push a change to remove the logs and then I'll reapprove to merge.


let mut guards = Vec::new();
for address in hosts {
Contributor


looks great! thanks for making those changes.

@JamieTsai1024 JamieTsai1024 merged commit 87e9a59 into mongodb:main Jul 7, 2025
16 of 18 checks passed