Track segment indexes correctly during DSN sync #3521
Conversation
Force-pushed from d3dc389 to 3ee66af.
Force-pushed from 3ee66af to ccb28cb.
From reading the code, I think this is broken in more ways than one, but I didn't try to run it.
we get more segment headers at a later time through the DSN sync loop, but we can't reconstruct the partial block because we've dropped part of it, so there is a gap which causes block imports to hang
The explanation here doesn't look correct to me. It may fail, sure, but the reason is that the tracking of blocks or segments is incorrect; otherwise it would be able to restart, even if that means re-downloading a previously downloaded segment.
This error currently happens on taurus because:
Maybe a nit, but I'd not use "because" here. It may trigger the problem, but the root cause is elsewhere. Neither XDM volume nor segment creation rate is the cause of this.
DSN sync only ends when the node is within the confirmation depth of the chain tip
This is incorrect. It is one of the ways for DSN sync to end, but not the only one. In fact, since blocks are only archived at a depth of 100 blocks, you'll probably never reach that branch during sync.
It is only there for the situation where you shut down the node and start it back up (say, with a new version of the software) shortly afterwards. If fewer than 100 blocks were produced while the node was offline, there is no reason to look for the latest segment headers and do a lengthy DSN sync that we know will not yield any results. The same applies if the node was offline for 10 minutes: DSN sync will kick in, but it will be short-circuited if it turns out 100 blocks haven't been produced yet.
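To make that concrete, here is a minimal sketch of the short-circuit under the assumptions described above (a 100-block confirmation depth, and hypothetical names like `newest_archived_block_number`; this is not the actual code in the repo):

```rust
/// Assumed confirmation depth: blocks are only archived into segments once
/// they are at least this many blocks below the tip.
const CONFIRMATION_DEPTH_K: u32 = 100;

/// If the newest archived block (taken from the latest segment headers found
/// on the DSN) is not ahead of our best block, DSN sync cannot yield anything
/// new and ends immediately instead of downloading segments.
fn dsn_sync_can_short_circuit(best_block_number: u32, newest_archived_block_number: u32) -> bool {
    newest_archived_block_number <= best_block_number
}

fn main() {
    // A restart after a short outage: fewer than CONFIRMATION_DEPTH_K blocks
    // were produced while the node was offline, so nothing new was archived.
    let best_block_number = 1_050;
    let newest_archived_block_number = best_block_number - CONFIRMATION_DEPTH_K / 2;
    assert!(dsn_sync_can_short_circuit(best_block_number, newest_archived_block_number));
}
```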
Overall I think there is still some misunderstanding of how things work and why it was designed that way, and the fact that some interesting scenarios are under-documented doesn't help; sorry about that.
@@ -238,8 +256,6 @@ where
    // Import queue handles verification and importing it into the client
    import_queue_service.import_blocks(BlockOrigin::NetworkInitialSync, blocks_to_import);
}

*last_processed_segment_index = segment_index;
I think the problem is in this line in the main branch:
- when we update the segment index above, we check we've fully processed all blocks from that segment
- but when we update it here, we don't check for a partial block at the end of the segment
// Segments are only fully processed when all their blocks are processed.
if last_archived_block_partial {
*last_processed_segment_index = segment_index.saturating_sub(1);
} else {
*last_processed_segment_index = segment_index;
}
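For context on why the distinction matters, here is a hedged sketch (hypothetical names, not the repo's actual types) of how a tracked `last_processed_segment_index` is typically consumed on the next pass: segments at or below it are skipped, so recording a segment that ends in a partial block as fully processed would permanently drop the rest of that block.

```rust
/// Sketch of the resume logic on a later DSN sync pass: only segments strictly
/// newer than the last fully processed one are considered again. This is why a
/// segment whose last archived block is partial must not be recorded as
/// processed; otherwise the remainder of that block is never re-downloaded and
/// a block gap appears.
fn segments_to_process(all_segment_indexes: &[u64], last_processed_segment_index: u64) -> Vec<u64> {
    all_segment_indexes
        .iter()
        .copied()
        .filter(|&segment_index| segment_index > last_processed_segment_index)
        .collect()
}

fn main() {
    // Segment 7 ended in a partial block, so the tracker was rolled back to 6
    // and segment 7 is picked up again on the next pass.
    assert_eq!(segments_to_process(&[5, 6, 7, 8], 6), vec![7, 8]);
}
```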
Force-pushed from 918288e to 10a7a93.
Finally found the segment index tracking bug
Code reordering doesn't make much sense to me, but the rest looks good. Please try to run DSN sync and ideally trigger the edge-cases to see if it works as expected.
@@ -365,19 +365,24 @@ where
    }
}

debug!(target: LOG_TARGET, "Finished DSN sync");
while notifications.try_next().is_ok() {
Why was it needed to reorder this? It was at the end to logically show that we discard all the extra notifications that were issued while we were processing the previous notification. I don't see how it fixes or breaks anything by being moved here, but I also don't understand why it is needed.
I'm not particularly committed to this part of the change.
The block gap was sometimes being hidden by substrate sync, which was one reason the bug was hard to diagnose and fix.
Reducing the amount of time that the atomic is set to "true" in each loop reduces the chance that substrate sync hides any bugs, particularly if a lot of sync notifications are being sent continuously.
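To make the ordering concrete, here is a simplified sketch (standard-library channel and hypothetical names, rather than the actual futures-based code shown in the diff above) of draining the queued notifications before the atomic is flipped:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::Receiver;

/// Simplified sketch of the end of one DSN sync pass: discard notifications
/// that queued up while the previous one was being processed, and only then
/// allow substrate sync to run while we wait for the next notification.
/// Keeping the "paused" window wide minimises the chance of substrate sync
/// filling block gaps and masking DSN sync bugs.
fn finish_dsn_sync_pass(notifications: &Receiver<()>, substrate_sync_allowed: &AtomicBool) {
    // Drain stale notifications first...
    while notifications.try_recv().is_ok() {}
    // ...then re-enable substrate sync until the next notification arrives.
    substrate_sync_allowed.store(true, Ordering::Release);
}

fn main() {
    let (sender, receiver) = std::sync::mpsc::channel();
    let substrate_sync_allowed = AtomicBool::new(false);
    sender.send(()).unwrap();
    finish_dsn_sync_pass(&receiver, &substrate_sync_allowed);
    assert!(substrate_sync_allowed.load(Ordering::Acquire));
}
```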
I see, so the goal was to move it before the atomic specifically. That wasn't obvious, but it makes sense now, thanks.
I'm running it now, and I'll also ask Jim to run it with his unreliable hardware.
Pull request was converted to draft
I also increased some retries where needed; some recent PRs had unintentionally reduced the number of peers tried.
Those retries seem excessively large. There must be something really, really wrong somewhere for them to need to be that high.
The edge case happened without any custom code (127/128 pieces from a segment, my network is unreliable) and was handled correctly. The new "downloading entire segment for one block" log was also triggered correctly, and the segment download was attempted (again, as it should be). I've kicked off a build and I've asked Jim to test this as well, but since I've seen the expected behaviour, I'm marking this as ready to review (but not merge yet).
We're deliberately running it on some very unreliable or overloaded networks. I have no strong opinions on the retry changes, but we did effectively try that number of peers before some recent fixes accidentally reduced the numbers. Please let me know if you need me to drop those commits (or keep them and open a follow-up ticket) to get the other fixes merged. Randy's testing has shown snap sync works, but his and Jim's testing also revealed a separate networking bug, see #3537.
Overall looks good
@@ -416,7 +416,7 @@ where
{
    let block_number = *header.number();

-   const STATE_SYNC_RETRIES: u32 = 5;
+   const STATE_SYNC_RETRIES: u32 = 20;
Why do we need such high state sync retries? So far 5 was sufficient. Not opposing it, but I want to know the reason behind the change.
It was sufficient for some nodes, but not others. Both Jim and I got failures here during testing, and the state sync is small, so it's cheaper to retry a few more times here than to give up and start the entire process again (or require manual operator intervention).
This is ready for review and merge.
Looks like nothing major has changed since my last review
Track segment indexes correctly during DSN sync
Segment Index tracking bug
This PR fixes a rare "block has an unknown parent" error during DSN sync, which happens under the following circumstances:
We should track segment indexes correctly to fix this bug.
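As a rough illustration of the fix (simplified stand-ins for the real segment header types, not the actual implementation):

```rust
/// Simplified stand-in for the information a segment header carries about its
/// last archived block; in the real code this comes from the segment header.
struct LastArchivedBlock {
    /// `None`: the block is complete within this segment.
    /// `Some(bytes)`: only a prefix of the block was archived in this segment.
    partial_bytes: Option<u32>,
}

/// The highest segment index whose blocks have all been fully processed, which
/// is what DSN sync should record before moving on to the next segment.
fn last_fully_processed_segment(segment_index: u64, last_archived_block: &LastArchivedBlock) -> u64 {
    if last_archived_block.partial_bytes.is_some() {
        // The segment ends mid-block, so it is not fully processed yet.
        segment_index.saturating_sub(1)
    } else {
        segment_index
    }
}

fn main() {
    let partial = LastArchivedBlock { partial_bytes: Some(1024) };
    let complete = LastArchivedBlock { partial_bytes: None };
    assert_eq!(last_fully_processed_segment(7, &partial), 6);
    assert_eq!(last_fully_processed_segment(7, &complete), 7);
}
```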
Substrate sync re-enabling race condition
There is also a race condition where standard substrate sync is quickly enabled then disabled in every DSN sync loop. This can sometimes fill the resulting block gap, but not reliably.
This is relatively harmless, but we shouldn't be re-enabling sync until we're actually waiting for a new notification.
Segment headers batch downloads
PR #3523 accidentally reduced the number of peers we try for segment header batches. Previously we tried up to 20 recently connected peers; that PR reduced it to 10.
Since I found this edge case during testing, it seemed reasonable to restore the previous behaviour. This is consistent with other parts of snap and DSN sync, which try 20-40 peers before giving up.
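Here is a hedged sketch of the restored behaviour (hypothetical constant and callback names; the real code goes through the networking stack):

```rust
/// Assumed limit restoring the previous behaviour: try up to this many
/// recently connected peers for each segment header batch before giving up.
const SEGMENT_HEADER_PEER_RETRIES: usize = 20;

/// Ask one peer after another for a batch of segment headers and stop at the
/// first success; `request_batch` stands in for the real network request.
fn download_segment_header_batch<P, T>(
    peers: impl IntoIterator<Item = P>,
    mut request_batch: impl FnMut(&P) -> Option<T>,
) -> Option<T> {
    peers
        .into_iter()
        .take(SEGMENT_HEADER_PEER_RETRIES)
        .find_map(|peer| request_batch(&peer))
}

fn main() {
    // The first two "peers" fail, the third responds with a batch.
    let peers = ["peer-a", "peer-b", "peer-c"];
    let batch = download_segment_header_batch(peers, |peer| {
        (*peer == "peer-c").then(|| vec!["segment header batch"])
    });
    assert!(batch.is_some());
}
```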
Background
These errors currently happen more frequently on taurus because:
They also seem to happen more on unreliable networks where DSN sync is interrupted, or where we find a consensus of peers with a slightly earlier segment than the rest of the network.
Code contributor checklist: