Retry segment download at the start of snap sync #3538

Draft
wants to merge 1 commit into
base: main
from

Conversation

teor2345
Member

This PR is a temporary workaround for bug #3537, where some nodes can't download segments reliably on Taurus.

There are two places where segment downloads are used during snap sync:

  1. the first segment download, which is only tried once, and is fatal to the entire node if it fails
  2. the DSN sync, where segment downloads are retried

Until we fix the underlying issue, it seems like adding a retry to the first case would be useful. There's no good reason to exit immediately when we could retry a few times.
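A minimal sketch of what a bounded retry around the first segment download could look like. This is not the PR's code: the helper name `download_with_retry`, the constant `MAX_SEGMENT_DOWNLOAD_ATTEMPTS`, and the synchronous closure are all assumptions made for illustration (the real download is presumably async and would need an async equivalent).

```rust
use std::fmt::Debug;

/// Hypothetical cap on attempts, chosen only for this sketch.
const MAX_SEGMENT_DOWNLOAD_ATTEMPTS: u32 = 3;

/// Calls `download` until it succeeds or the attempts run out.
/// Returns the first success, or the error from the final attempt.
fn download_with_retry<T, E: Debug>(
    mut download: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match download() {
            Ok(segment) => return Ok(segment),
            Err(error) if attempt < MAX_SEGMENT_DOWNLOAD_ATTEMPTS => {
                // TODO: temporary workaround for #3537, remove once it is fixed.
                eprintln!("segment download attempt {attempt} failed: {error:?}, retrying");
                attempt += 1;
            }
            // Out of attempts: surface the last error to the caller.
            Err(error) => return Err(error),
        }
    }
}
```

With this shape the caller still decides whether a final failure is fatal; only the "fail on the first error" behaviour changes.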

Code contributor checklist:

@teor2345 teor2345 self-assigned this May 14, 2025
@teor2345 teor2345 requested a review from nazar-pc as a code owner May 14, 2025 23:07
@teor2345 teor2345 added bug Something isn't working networking Subspace networking (DSN) labels May 14, 2025
@teor2345 teor2345 enabled auto-merge May 14, 2025 23:11
Member

@nazar-pc nazar-pc left a comment

My preference would be for temporary workarounds to be marked as such with a TODO comment. I'm still wondering why exactly this happens and whether 3 attempts would make a difference. Also this is a high-level retry that doesn't reuse already downloaded pieces, so the probability of success would be substantially higher if piece reuse was actually happening (but may require more invasive code changes).

}
}

if let Err(segment_error) = segment_error {
Member

Both segment_pieces and segment_error are always in sync with each other, so there should probably be just one variable that stores a Result. A match here would then handle the Err case, and there would be no need to use .expect() below at all.
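One way the single-variable suggestion could look, as a sketch only. The type alias `SegmentPieces`, the `String` error type, and the function `download_first_segment` are hypothetical stand-ins, not names from the PR.

```rust
/// Hypothetical stand-in for the pieces of one downloaded segment.
type SegmentPieces = Vec<Option<Vec<u8>>>;

/// Keeps the outcome of the latest attempt in one `Result`, so the pieces
/// and the error cannot drift apart and no `.expect()` is needed.
fn download_first_segment(
    mut try_download: impl FnMut() -> Result<SegmentPieces, String>,
    attempts: u32,
) -> Result<SegmentPieces, String> {
    let mut segment_result: Result<SegmentPieces, String> =
        Err("no download attempts were made".to_string());

    for _ in 0..attempts {
        segment_result = try_download();
        if segment_result.is_ok() {
            break;
        }
    }

    // A single `match` covers both outcomes.
    match segment_result {
        Ok(segment_pieces) => Ok(segment_pieces),
        Err(segment_error) => Err(format!("first segment download failed: {segment_error}")),
    }
}
```

Because the success value only exists inside the `Ok` arm, the compiler enforces the invariant that the two separate variables had to maintain by convention.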

@teor2345
Member Author

> Also this is a high-level retry that doesn't reuse already downloaded pieces, so the probability of success would be substantially higher if piece reuse was actually happening (but may require more invasive code changes).

Good point, it should be possible to re-use the pieces from the last try by passing a mutable array to the function. I think that's worth doing, I'll try to make time for it tomorrow.
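A rough sketch of the piece-reuse idea, assuming the download function can take a mutable slice of optional pieces and a per-piece fetch callback. Both function names and the `fetch_piece` callback are hypothetical, and the sketch simplifies by requiring every slot to be filled, which may not match how the real segment download decides it has enough pieces.

```rust
/// Fills only the empty slots of `pieces`; pieces from earlier attempts stay.
/// `fetch_piece` is a hypothetical stand-in for a single piece request.
/// Returns `true` once every slot is filled.
fn download_missing_pieces(
    pieces: &mut [Option<Vec<u8>>],
    mut fetch_piece: impl FnMut(usize) -> Option<Vec<u8>>,
) -> bool {
    for (index, slot) in pieces.iter_mut().enumerate() {
        if slot.is_none() {
            *slot = fetch_piece(index);
        }
    }
    pieces.iter().all(Option::is_some)
}

/// Retries up to `max_attempts` times, sharing one buffer across attempts
/// so that each retry requests only the pieces that are still missing.
fn download_segment_reusing_pieces(
    piece_count: usize,
    max_attempts: u32,
    mut fetch_piece: impl FnMut(usize) -> Option<Vec<u8>>,
) -> Option<Vec<Vec<u8>>> {
    let mut pieces: Vec<Option<Vec<u8>>> = vec![None; piece_count];
    for _ in 0..max_attempts {
        if download_missing_pieces(&mut pieces, &mut fetch_piece) {
            // Every slot is `Some`, so this collects into `Some(Vec<_>)`.
            return pieces.into_iter().collect();
        }
    }
    None
}
```

Compared with a retry that restarts from scratch, each attempt here has less work left to do, which is where the higher success probability mentioned in the review would come from.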

@teor2345 teor2345 marked this pull request as draft May 28, 2025 05:59
auto-merge was automatically disabled May 28, 2025 05:59

Pull request was converted to draft

2 participants