Track segment indexes correctly during DSN sync #3521


Open
wants to merge 10 commits

Conversation

teor2345
Member

@teor2345 teor2345 commented May 8, 2025

Segment Index tracking bug

This PR fixes a rare "block has an unknown parent" error during DSN sync, which happens under the following circumstances:

  • we get some segment headers the first time through the DSN sync loop, and reconstruct their blocks
  • we return to the caller and implicitly drop the partial block held in the reconstructor state, but record that segment index as fully processed
  • we get more segment headers on a later pass through the DSN sync loop, but we don't reconstruct the partial block, because we've dropped part of it and recorded its segment as fully processed. This leaves a block gap, which causes block imports to hang

We should track segment indexes correctly to fix this bug.
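As a rough illustration, a minimal sketch of the intended tracking rule is shown below, reusing the names from the fix discussed in the inline review thread further down. The wrapper function and the plain u64 segment index are assumptions for illustration only.

    // Sketch only: u64 stands in for the node's segment index type, and the
    // wrapper function is purely illustrative.
    fn record_processed_segment(
        last_processed_segment_index: &mut u64,
        segment_index: u64,
        last_archived_block_partial: bool,
    ) {
        // Segments are only fully processed when all their blocks are processed.
        // If the last archived block is partial, its remainder lives in the next
        // segment, so this segment must be revisited on the next sync pass
        // instead of being recorded as done.
        if last_archived_block_partial {
            *last_processed_segment_index = segment_index.saturating_sub(1);
        } else {
            *last_processed_segment_index = segment_index;
        }
    }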

Substrate sync re-enabling race condition

There is also a race condition where standard substrate sync is briefly enabled and then disabled on every pass through the DSN sync loop. This can sometimes fill the resulting block gap, but not reliably.

This is relatively harmless, but we shouldn't be re-enabling sync until we're actually waiting for a new notification.
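Roughly, the intended ordering is sketched below. The names are assumptions (the real code uses an atomic flag and a notification stream, as in the diff discussed later in this thread); the point is that substrate sync is only enabled for the window where we are actually waiting.

    use std::sync::atomic::{AtomicBool, Ordering};

    use futures::channel::mpsc;
    use futures::StreamExt;

    // Sketch only, with assumed names: drain notifications that arrived while
    // the previous DSN sync pass was running, then enable substrate sync only
    // while waiting for the next notification, and disable it again as soon as
    // a new pass starts.
    async fn wait_for_next_sync_trigger(
        substrate_sync_enabled: &AtomicBool,
        notifications: &mut mpsc::Receiver<()>,
    ) {
        // Discard stale notifications issued during the previous pass
        while let Ok(Some(())) = notifications.try_next() {}

        substrate_sync_enabled.store(true, Ordering::Release);
        let _ = notifications.next().await;
        substrate_sync_enabled.store(false, Ordering::Release);
    }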

Segment headers batch downloads

PR #3523 accidentally reduced the number of peers we try for segment header batches. Previously we tried up to 20 recently connected peers; that PR reduced it to 10.

Since I found this edge case during testing, it seemed reasonable to restore the previous behaviour. This is consistent with other parts of snap and DSN sync, which try 20-40 peers before giving up.
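For illustration, the change is only a larger retry budget over recently connected peers; a generic sketch follows, where every name is hypothetical.

    // Sketch only: all names are hypothetical. The point is trying up to 20
    // recently connected peers for a segment header batch before giving up,
    // matching the retry budgets used elsewhere in snap and DSN sync.
    const SEGMENT_HEADER_PEER_RETRIES: usize = 20;

    async fn download_from_peers<Peer, Batch, Fut>(
        recently_connected_peers: &[Peer],
        mut request_batch: impl FnMut(&Peer) -> Fut,
    ) -> Option<Batch>
    where
        Fut: std::future::Future<Output = Option<Batch>>,
    {
        for peer in recently_connected_peers.iter().take(SEGMENT_HEADER_PEER_RETRIES) {
            // Each peer gets one attempt; move on to the next peer on failure
            if let Some(batch) = request_batch(peer).await {
                return Some(batch);
            }
        }
        None
    }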

Background

These errors currently happen more frequently on taurus because:

  • there is a large volume of XDMs and other transactions, so new segments are produced frequently
  • DSN sync will sometimes end when the node is within the confirmation depth of the chain tip, and fill the block gap. But the large volume of transactions makes that distance from the tip harder to reach the first time around, because the number of (large) blocks keeps increasing while DSN sync is running

They also seem to happen more on unreliable networks where DSN sync is interrupted, or where we find a consensus of peers with a slightly earlier segment than the rest of the network.


@teor2345 teor2345 self-assigned this May 8, 2025
@teor2345 teor2345 requested a review from nazar-pc as a code owner May 8, 2025 01:14
@teor2345 teor2345 added the bug (Something isn't working) and node (Node: service library/node app) labels May 8, 2025
@vedhavyas

This comment was marked as resolved.

@teor2345

This comment was marked as resolved.

@teor2345 teor2345 changed the title from "Keep reconstructor state until snap sync has finished" to "Keep reconstructor state until DSN sync has finished" May 8, 2025
@teor2345

This comment was marked as resolved.

nazar-pc

This comment was marked as resolved.

@teor2345 teor2345 force-pushed the keep-reconstructor-state branch from d3dc389 to 3ee66af May 12, 2025 06:34
@teor2345 teor2345 changed the title from "Keep reconstructor state until DSN sync has finished" to "Only try DSN sync once during snap sync" May 12, 2025
@teor2345 teor2345 requested review from nazar-pc and vedhavyas May 12, 2025 06:36
vedhavyas

This comment was marked as resolved.

@teor2345 teor2345 dismissed nazar-pc’s stale review May 12, 2025 22:25

Requested changes were made

@teor2345 teor2345 force-pushed the keep-reconstructor-state branch from 3ee66af to ccb28cb May 12, 2025 23:17
@teor2345

This comment was marked as resolved.

@teor2345 teor2345 requested a review from vedhavyas May 12, 2025 23:18
vedhavyas
vedhavyas previously approved these changes May 13, 2025
@teor2345

This comment was marked as resolved.

Member

@nazar-pc nazar-pc left a comment


From reading the code I think this is broken in more ways than one, but I didn't try to run it.

we get more segment headers a later time through the DSN sync loop, but we can't reconstruct the partial block because we've dropped part of it, so there is a gap which causes block imports to hang

The explanation here doesn't look correct to me. It may fail, sure, but the reason is that the tracking of blocks or segments is incorrect; otherwise it would be able to restart, even if that means re-downloading a previously downloaded segment.

This error currently happens on taurus because:

Maybe a nit, but I'd not use "because" here. It may trigger the problem, but the root cause is elsewhere. Neither XDM volume nor segment creation rate is the cause of this.

DSN sync only ends when the node is within the confirmation depth of the chain tip

This is incorrect. It is one of the ways for DSN sync to end, but not the only one. In fact, since blocks are only archived at a depth of 100 blocks, you'll probably never reach that branch during sync.

It is only there for the situation where you shut down the node and start it back up (say, with a new version of the software) shortly afterwards. If fewer than 100 blocks were produced while the node was offline, there is no reason to look for the latest segment headers and do a lengthy DSN sync that we know will not yield any results. The same applies to a node that was offline for 10 minutes: DSN sync will kick in, but it will be short-circuited if it turns out fewer than 100 blocks have been produced.
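For illustration, that short-circuit amounts to a check along these lines; the names and exact comparison here are assumptions, only the shape matters.

    // Sketch only, with assumed names: blocks are archived at a depth of 100
    // blocks, so if our best block is already within that distance of the
    // chain tip, there are no archived segments left to download and the
    // lengthy DSN sync can be skipped.
    const ARCHIVING_DEPTH: u64 = 100;

    fn dsn_sync_can_be_skipped(best_block_number: u64, chain_tip_number: u64) -> bool {
        chain_tip_number.saturating_sub(best_block_number) <= ARCHIVING_DEPTH
    }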


Overall I think there is still some misunderstanding of how things work and why it was designed that way, and the fact that some interesting scenarios are under-documented doesn't help, sorry about that.

@@ -238,8 +256,6 @@ where
// Import queue handles verification and importing it into the client
import_queue_service.import_blocks(BlockOrigin::NetworkInitialSync, blocks_to_import);
}

*last_processed_segment_index = segment_index;
Member Author


I think the problem is in this line in the main branch:

  • when we update the segment index above, we check we've fully processed all blocks from that segment
  • but when we update it here, we don't check for a partial block at the end of the segment
        // Segments are only fully processed when all their blocks are processed.
        if last_archived_block_partial {
            *last_processed_segment_index = segment_index.saturating_sub(1);
        } else {
            *last_processed_segment_index = segment_index;
        }

@teor2345 teor2345 changed the title from "Only try DSN sync once during snap sync" to "Track segment indexes correctly during DSN sync" May 14, 2025
@teor2345 teor2345 force-pushed the keep-reconstructor-state branch from 918288e to 10a7a93 May 14, 2025 02:06
@teor2345 teor2345 dismissed nazar-pc’s stale review May 14, 2025 02:09

Finally found the segment index tracking bug

@teor2345 teor2345 requested a review from nazar-pc May 14, 2025 02:09
nazar-pc
nazar-pc previously approved these changes May 14, 2025
Member

@nazar-pc nazar-pc left a comment


Code reordering doesn't make much sense to me, but the rest looks good. Please try to run DSN sync and ideally trigger the edge-cases to see if it works as expected.

@@ -365,19 +365,24 @@ where
}
}

debug!(target: LOG_TARGET, "Finished DSN sync");
while notifications.try_next().is_ok() {
Member


Why was it necessary to reorder this? It was at the end to logically show that we discard all the extra notifications that were issued while we were processing the previous notification. I don't see how it fixes or breaks anything by being moved here, but I also don't understand why it is needed.

Member Author


I'm not particularly committed to this part of the change.

The block gap was sometimes being hidden by substrate sync, which was one reason the bug was hard to diagnose and fix.

Reducing the amount of time that the atomic is set to "true" in each loop reduces the chance that substrate sync hides any bugs, particularly if a lot of sync notifications are being sent continuously.

Member


I see, so the goal was to move it before the atomic specifically. That wasn't obvious, but it makes sense now, thanks.

@teor2345
Member Author

Code reordering doesn't make much sense to me, but the rest looks good. Please try to run DSN sync and ideally trigger the edge-cases to see if it works as expected.

I'm running it now, and I'll also ask Jim to run it with his unreliable hardware.

@teor2345 teor2345 marked this pull request as draft May 14, 2025 02:19
auto-merge was automatically disabled May 14, 2025 02:19

Pull request was converted to draft

@teor2345
Member Author

I also increased some retries where needed; some recent PRs had unintentionally reduced the number of peers tried.

@nazar-pc
Member

Those retries seem to be excessively large. There must be something really-really wrong somewhere for them to be needed that high.

@teor2345
Member Author

Code reordering doesn't make much sense to me, but the rest looks good. Please try to run DSN sync and ideally trigger the edge-cases to see if it works as expected.

I'm running it now, and I'll also ask Jim to run it with his unreliable hardware.

The edge case happened without any custom code (127/128 pieces from a segment, my network is unreliable) and was handled correctly. The new "downloading entire segment for one block" log was also triggered correctly, and the segment download was attempted (again, as it should be).

I've kicked off a build and I've asked Jim to test this as well, but since I've seen the expected behaviour, I'm marking this as ready to review (but not merge yet).

@teor2345 teor2345 marked this pull request as ready for review May 14, 2025 05:19
@teor2345 teor2345 requested a review from NingLin-P as a code owner May 14, 2025 05:19
@teor2345 teor2345 requested a review from nazar-pc May 14, 2025 21:53
@teor2345
Member Author

Those retries seem to be excessively large. There must be something really-really wrong somewhere for them to be needed that high.

We're deliberately running it on some very unreliable or overloaded networks. I have no strong opinions on the retry changes, but we did effectively try that number of peers before some recent fixes accidentally reduced it.

Please let me know if you need me to drop those commits (or keep them and open a follow-up ticket) to get the other fixes merged.

Randy's testing has shown that snap sync works, but his and Jim's testing also revealed a separate networking bug, see #3537.

@teor2345 teor2345 enabled auto-merge May 14, 2025 22:32
Contributor

@vedhavyas vedhavyas left a comment


Overall looks good

@@ -416,7 +416,7 @@ where
{
let block_number = *header.number();

const STATE_SYNC_RETRIES: u32 = 5;
const STATE_SYNC_RETRIES: u32 = 20;
Contributor


Why do we need such high state sync retries? So far 5 was sufficient.

Not opposing it, but I want to know the reason behind the change.

Member Author


It was sufficient for some nodes, but not others. Both Jim and I got failures here during testing, and the state sync is small, so it's cheaper to retry a few more times here than to give up and start the entire process again (or require manual operator intervention).
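For context, the constant just bounds a retry loop along these lines; the try_state_sync closure below is a hypothetical stand-in for the real state sync attempt.

    const STATE_SYNC_RETRIES: u32 = 20;

    // Sketch only: try_state_sync is a hypothetical stand-in. State sync is
    // small, so a few extra bounded retries are much cheaper than restarting
    // the whole snap sync process.
    async fn state_sync_with_retries<Fut, E>(
        mut try_state_sync: impl FnMut() -> Fut,
    ) -> Result<(), E>
    where
        Fut: std::future::Future<Output = Result<(), E>>,
    {
        let mut last_error = None;
        for _attempt in 0..STATE_SYNC_RETRIES {
            match try_state_sync().await {
                Ok(()) => return Ok(()),
                // A later attempt may be served by a different, healthier peer
                Err(error) => last_error = Some(error),
            }
        }
        Err(last_error.expect("STATE_SYNC_RETRIES is non-zero"))
    }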

@teor2345 teor2345 requested a review from vedhavyas May 27, 2025 22:01
@teor2345
Member Author

This is ready for review and merge.

Member

@nazar-pc nazar-pc left a comment


Looks like nothing major has changed since my last review

@teor2345 teor2345 added this pull request to the merge queue May 28, 2025
github-merge-queue bot pushed a commit that referenced this pull request May 28, 2025
Track segment indexes correctly during DSN sync
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 28, 2025
@vedhavyas vedhavyas added the audit-P1 (High audit priority) label May 28, 2025
@teor2345 teor2345 added this pull request to the merge queue May 28, 2025