Track segment indexes correctly during DSN sync #3521


Open
wants to merge 10 commits

Conversation

teor2345
Member

@teor2345 teor2345 commented May 8, 2025

Segment Index tracking bug

This PR fixes a rare "block has an unknown parent" error during DSN sync, which happens under the following circumstances:

  • we get some segment headers the first time through the DSN sync loop, and reconstruct their blocks
  • we return to the caller and implicitly drop the partial block held in the reconstructor state, but record that segment index as fully processed
  • we get more segment headers on a later pass through the DSN sync loop, but we don't reconstruct the partial block, because we've dropped part of it and recorded its segment as fully processed. This leaves a block gap, which causes block imports to hang

We should track segment indexes correctly to fix this bug.
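As a rough illustration, a minimal sketch of the intended tracking rule is shown below, reusing the names from the fix discussed in the inline review thread further down. The wrapper function and the plain u64 segment index are assumptions for illustration only.

    // Sketch only: u64 stands in for the node's segment index type, and the
    // wrapper function is purely illustrative.
    fn record_processed_segment(
        last_processed_segment_index: &mut u64,
        segment_index: u64,
        last_archived_block_partial: bool,
    ) {
        // Segments are only fully processed when all their blocks are processed.
        // If the last archived block is partial, its remainder lives in the next
        // segment, so this segment must be revisited on the next sync pass
        // instead of being recorded as done.
        if last_archived_block_partial {
            *last_processed_segment_index = segment_index.saturating_sub(1);
        } else {
            *last_processed_segment_index = segment_index;
        }
    }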

Substrate sync re-enabling race condition

There is also a race condition where standard substrate sync is briefly enabled and then disabled on every pass through the DSN sync loop. This can sometimes fill the resulting block gap, but not reliably.

This is relatively harmless, but we shouldn't be re-enabling sync until we're actually waiting for a new notification.
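Roughly, the intended ordering is sketched below. The names are assumptions (the real code uses an atomic flag and a notification stream, as in the diff discussed later in this thread); the point is that substrate sync is only enabled for the window where we are actually waiting.

    use std::sync::atomic::{AtomicBool, Ordering};

    use futures::channel::mpsc;
    use futures::StreamExt;

    // Sketch only, with assumed names: drain notifications that arrived while
    // the previous DSN sync pass was running, then enable substrate sync only
    // while waiting for the next notification, and disable it again as soon as
    // a new pass starts.
    async fn wait_for_next_sync_trigger(
        substrate_sync_enabled: &AtomicBool,
        notifications: &mut mpsc::Receiver<()>,
    ) {
        // Discard stale notifications issued during the previous pass
        while let Ok(Some(())) = notifications.try_next() {}

        substrate_sync_enabled.store(true, Ordering::Release);
        let _ = notifications.next().await;
        substrate_sync_enabled.store(false, Ordering::Release);
    }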

Segment headers batch downloads

PR #3523 accidentally reduced the number of peers we try for segment header batches. Previously we tried up to 20 recently connected peers; that PR reduced it to 10.

Since I found this edge case during testing, it seemed reasonable to restore the previous behaviour. This is consistent with other parts of snap and DSN sync, which try 20-40 peers before giving up.
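For illustration, the change is only a larger retry budget over recently connected peers; a generic sketch follows, where every name is hypothetical.

    // Sketch only: all names are hypothetical. The point is trying up to 20
    // recently connected peers for a segment header batch before giving up,
    // matching the retry budgets used elsewhere in snap and DSN sync.
    const SEGMENT_HEADER_PEER_RETRIES: usize = 20;

    async fn download_from_peers<Peer, Batch, Fut>(
        recently_connected_peers: &[Peer],
        mut request_batch: impl FnMut(&Peer) -> Fut,
    ) -> Option<Batch>
    where
        Fut: std::future::Future<Output = Option<Batch>>,
    {
        for peer in recently_connected_peers.iter().take(SEGMENT_HEADER_PEER_RETRIES) {
            // Each peer gets one attempt; move on to the next peer on failure
            if let Some(batch) = request_batch(peer).await {
                return Some(batch);
            }
        }
        None
    }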

Background

These errors currently happen more frequently on taurus because:

  • there is a large volume of XDMs and other transactions, so new segments are produced frequently
  • DSN sync will sometimes end when the node is within the confirmation depth of the chain tip, and fill the block gap. But the large volume of transactions makes that distance from the tip harder to reach the first time around, because the number of (large) blocks keeps increasing while DSN sync is running

They also seem to happen more on unreliable networks where DSN sync is interrupted, or where we find a consensus of peers with a slightly earlier segment than the rest of the network.


@teor2345 teor2345 self-assigned this May 8, 2025
@teor2345 teor2345 requested a review from nazar-pc as a code owner May 8, 2025 01:14
@teor2345 teor2345 added the bug (Something isn't working) and node (Node: service library/node app) labels May 8, 2025
@vedhavyas

This comment was marked as resolved.

@teor2345

This comment was marked as resolved.

@teor2345 teor2345 changed the title from "Keep reconstructor state until snap sync has finished" to "Keep reconstructor state until DSN sync has finished" May 8, 2025
@teor2345

This comment was marked as resolved.

nazar-pc

This comment was marked as resolved.

@teor2345 teor2345 force-pushed the keep-reconstructor-state branch from d3dc389 to 3ee66af May 12, 2025 06:34
@teor2345 teor2345 changed the title from "Keep reconstructor state until DSN sync has finished" to "Only try DSN sync once during snap sync" May 12, 2025
@teor2345 teor2345 requested review from nazar-pc and vedhavyas May 12, 2025 06:36
vedhavyas

This comment was marked as resolved.

@teor2345 teor2345 dismissed nazar-pc’s stale review May 12, 2025 22:25

Requested changes were made

@teor2345 teor2345 force-pushed the keep-reconstructor-state branch from 3ee66af to ccb28cb May 12, 2025 23:17
@teor2345

This comment was marked as resolved.

@teor2345 teor2345 requested a review from vedhavyas May 12, 2025 23:18
vedhavyas
vedhavyas previously approved these changes May 13, 2025
@teor2345

This comment was marked as resolved.

Member

@nazar-pc nazar-pc left a comment


From reading the code I think this is broken in more ways than one, but I didn't try to run it.

we get more segment headers a later time through the DSN sync loop, but we can't reconstruct the partial block because we've dropped part of it, so there is a gap which causes block imports to hang

The explanation here doesn't look correct to me. It may fail, sure, but the reason is that the tracking of blocks or segments is incorrect; otherwise it would be able to restart, even if that means re-downloading a previously downloaded segment.

This error currently happens on taurus because:

Maybe a nit, but I'd not use "because" here. It may trigger the problem, but the root cause is elsewhere. Neither XDM volume nor segment creation rate is the cause of this.

DSN sync only ends when the node is within the confirmation depth of the chain tip

This is incorrect. It is one of the ways for DSN sync to end, but not the only one. In fact, since blocks are only archived at a depth of 100 blocks, you'll probably never reach that branch during sync.

It is only there for the situation where you shut down the node and start it back up (say, with a new version of the software) shortly afterwards. If fewer than 100 blocks were produced while the node was offline, there is no reason to look for the latest segment headers and do a lengthy DSN sync that we know will not yield any results. The same applies to a node that was offline for 10 minutes: DSN sync will kick in, but it will be short-circuited if it turns out fewer than 100 blocks have been produced.
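For illustration, that short-circuit amounts to a check along these lines; the names and exact comparison here are assumptions, only the shape matters.

    // Sketch only, with assumed names: blocks are archived at a depth of 100
    // blocks, so if our best block is already within that distance of the
    // chain tip, there are no archived segments left to download and the
    // lengthy DSN sync can be skipped.
    const ARCHIVING_DEPTH: u64 = 100;

    fn dsn_sync_can_be_skipped(best_block_number: u64, chain_tip_number: u64) -> bool {
        chain_tip_number.saturating_sub(best_block_number) <= ARCHIVING_DEPTH
    }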


Overall I think there is still some misunderstanding of how things work and why it was designed that way, and the fact that some interesting scenarios are under-documented doesn't help, sorry about that.

@@ -238,8 +256,6 @@ where
// Import queue handles verification and importing it into the client
import_queue_service.import_blocks(BlockOrigin::NetworkInitialSync, blocks_to_import);
}

*last_processed_segment_index = segment_index;
Member Author


I think the problem is in this line in the main branch:

  • when we update the segment index above, we check we've fully processed all blocks from that segment
  • but when we update it here, we don't check for a partial block at the end of the segment
        // Segments are only fully processed when all their blocks are processed.
        if last_archived_block_partial {
            *last_processed_segment_index = segment_index.saturating_sub(1);
        } else {
            *last_processed_segment_index = segment_index;
        }

@teor2345 teor2345 changed the title from "Only try DSN sync once during snap sync" to "Track segment indexes correctly during DSN sync" May 14, 2025
@teor2345 teor2345 force-pushed the keep-reconstructor-state branch from 918288e to 10a7a93 May 14, 2025 02:06
@teor2345 teor2345 dismissed nazar-pc’s stale review May 14, 2025 02:09

Finally found the segment index tracking bug

@teor2345 teor2345 requested a review from nazar-pc May 14, 2025 02:09
nazar-pc
nazar-pc previously approved these changes May 14, 2025
Member

@nazar-pc nazar-pc left a comment


Code reordering doesn't make much sense to me, but the rest looks good. Please try to run DSN sync and ideally trigger the edge-cases to see if it works as expected.

@@ -365,19 +365,24 @@ where
}
}

debug!(target: LOG_TARGET, "Finished DSN sync");
while notifications.try_next().is_ok() {
Member


Why was it necessary to reorder this? It was at the end to logically show that we discard all the extra notifications that were issued while we were processing the previous notification. I don't see how it fixes or breaks anything by being moved here, but I also don't understand why it is needed.

Member Author


I'm not particularly committed to this part of the change.

The block gap was sometimes being hidden by substrate sync, which was one reason the bug was hard to diagnose and fix.

Reducing the amount of time that the atomic is set to "true" in each loop reduces the chance that substrate sync hides any bugs, particularly if a lot of sync notifications are being sent continuously.

Member


I see, so the goal was to move it before the atomic specifically. That wasn't obvious, but it makes sense now, thanks.

@teor2345
Member Author

Code reordering doesn't make much sense to me, but the rest looks good. Please try to run DSN sync and ideally trigger the edge-cases to see if it works as expected.

I'm running it now, and I'll also ask Jim to run it with his unreliable hardware.

@teor2345 teor2345 marked this pull request as draft May 14, 2025 02:19
auto-merge was automatically disabled May 14, 2025 02:19

Pull request was converted to draft

@teor2345
Member Author

I also increased some retries where needed; some recent PRs had unintentionally reduced the number of peers tried.

@nazar-pc
Member

Those retries seem to be excessively large. There must be something really-really wrong somewhere for them to be needed that high.

@teor2345
Member Author

Code reordering doesn't make much sense to me, but the rest looks good. Please try to run DSN sync and ideally trigger the edge-cases to see if it works as expected.

I'm running it now, and I'll also ask Jim to run it with his unreliable hardware.

The edge case happened without any custom code (127/128 pieces from a segment, my network is unreliable) and was handled correctly. The new "downloading entire segment for one block" log was also triggered correctly, and the segment download was attempted (again, as it should be).

I've kicked off a build and I've asked Jim to test this as well, but since I've seen the expected behaviour, I'm marking this as ready to review (but not merge yet).

@teor2345 teor2345 marked this pull request as ready for review May 14, 2025 05:19
@teor2345 teor2345 requested a review from NingLin-P as a code owner May 14, 2025 05:19
@teor2345 teor2345 requested a review from nazar-pc May 14, 2025 21:53
@teor2345
Member Author

Those retries seem to be excessively large. There must be something really-really wrong somewhere for them to be needed that high.

We're deliberately running it on some very unreliable or overloaded networks. I have no strong opinions on the retry changes, but we did effectively try that number of peers before some recent fixes accidentally reduced it.

Please let me know if you need me to drop those commits (or keep them and open a follow-up ticket) to get the other fixes merged.

Randy's testing has shown that snap sync works, but his and Jim's testing also revealed a separate networking bug, see #3537.

@teor2345 teor2345 enabled auto-merge May 14, 2025 22:32
Contributor

@vedhavyas vedhavyas left a comment


Overall looks good

@@ -416,7 +416,7 @@ where
{
let block_number = *header.number();

const STATE_SYNC_RETRIES: u32 = 5;
const STATE_SYNC_RETRIES: u32 = 20;
Contributor


Why do we need such high state sync retries? So far 5 was sufficient.

Not opposing it, but I want to know the reason behind the change.

Member Author


It was sufficient for some nodes, but not others. Both Jim and I got failures here during testing, and the state sync is small, so it's cheaper to retry a few more times here than to give up and start the entire process again (or require manual operator intervention).
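For context, the constant just bounds a retry loop along these lines; the try_state_sync closure below is a hypothetical stand-in for the real state sync attempt.

    const STATE_SYNC_RETRIES: u32 = 20;

    // Sketch only: try_state_sync is a hypothetical stand-in. State sync is
    // small, so a few extra bounded retries are much cheaper than restarting
    // the whole snap sync process.
    async fn state_sync_with_retries<Fut, E>(
        mut try_state_sync: impl FnMut() -> Fut,
    ) -> Result<(), E>
    where
        Fut: std::future::Future<Output = Result<(), E>>,
    {
        let mut last_error = None;
        for _attempt in 0..STATE_SYNC_RETRIES {
            match try_state_sync().await {
                Ok(()) => return Ok(()),
                // A later attempt may be served by a different, healthier peer
                Err(error) => last_error = Some(error),
            }
        }
        Err(last_error.expect("STATE_SYNC_RETRIES is non-zero"))
    }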

@teor2345 teor2345 requested a review from vedhavyas May 27, 2025 22:01
@teor2345
Member Author

This is ready for review and merge.

Member

@nazar-pc nazar-pc left a comment


Looks like nothing major has changed since my last review

@teor2345 teor2345 added this pull request to the merge queue May 28, 2025
github-merge-queue bot pushed a commit that referenced this pull request May 28, 2025
Track segment indexes correctly during DSN sync
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 28, 2025
@vedhavyas vedhavyas added the audit-P1 (High audit priority) label May 28, 2025
@teor2345 teor2345 added this pull request to the merge queue May 28, 2025