Forks on Testnet Causing Roots to Stall #30669
Description
Problem
We are occasionally seeing long-lived (~2 minutes) forks on testnet that cause root creation to stall.
One such example started on 3/7 at 23:06:56 UTC:
The last common ancestor is 184353487. It looks like 184353488 was late getting to some nodes, likely because it required a couple dozen repaired shreds on average.
This caused 184353492 to be built off 184353487, starting what ended up being the minority fork (we'll call this Fork B, and the fork containing 184353488 Fork A).
We see 1673 nodes vote for 184353488 around 23:06:56 UTC.
We see 581 nodes vote for 184353492 around the same time.
But it seems that the votes for 184353488 were all sent to a leader building off Fork B, which caused the turbine (id=1) vote transactions to be rejected with blockhash_not_found.
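A minimal sketch of why such a vote would be rejected, assuming a vote transaction references the blockhash of the block it votes on and that a leader only accepts blockhashes from its own fork. `ForkView`, `VoteCheck`, and `check_vote` are hypothetical names for illustration, not validator APIs, and slot numbers stand in for blockhashes:

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the set of recent blockhashes a leader knows on its fork.
/// For illustration, a block's "blockhash" is simply its slot number.
struct ForkView {
    recent_blockhashes: HashSet<u64>,
}

#[derive(Debug, PartialEq)]
enum VoteCheck {
    Accepted,
    BlockhashNotFound,
}

impl ForkView {
    /// Build a fork view from the slots that make up the fork.
    fn from_slots(slots: &[u64]) -> Self {
        ForkView {
            recent_blockhashes: slots.iter().copied().collect(),
        }
    }

    /// A vote is only accepted if its blockhash belongs to this fork.
    fn check_vote(&self, vote_blockhash: u64) -> VoteCheck {
        if self.recent_blockhashes.contains(&vote_blockhash) {
            VoteCheck::Accepted
        } else {
            VoteCheck::BlockhashNotFound
        }
    }
}
```

With Fork B built as 184353487 then 184353492, a vote carrying the blockhash of 184353488 is unknown to a Fork B leader, while a Fork A leader would accept it.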
This caused Fork B to look like the majority fork early on. Nodes that voted on 184353492 (e.g. 7mtKMUgM24GPTiR2krRimUiQgXRRmMPmmPkQBzMZak8a) kept voting. Nodes that voted on 184353488 (e.g. 5D1fNXzvv5NjV1ysLjirC4WY92RNsVH18vjmcszZd8on) stopped voting and presumably were waiting to switch over to the other fork (requires 38% votes observed to switch).
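The switching condition can be sketched roughly as follows. The 0.38 value comes from the 38% figure above; the function name, the strict comparison, and the use of node counts as a proxy for stake are assumptions made only for illustration:

```rust
/// Threshold taken from the 38% figure in the text.
const SWITCH_FORK_THRESHOLD: f64 = 0.38;

/// Returns true if the share of votes observed on the other fork exceeds the threshold.
/// Hypothetical helper: real validators weigh votes by stake, not by node count.
fn can_switch(observed_on_other_fork: u64, total: u64) -> bool {
    if total == 0 {
        return false;
    }
    (observed_on_other_fork as f64) / (total as f64) > SWITCH_FORK_THRESHOLD
}
```

Treating every node as equal stake (an assumption), Fork A voters would observe 581 of 2254 votes on Fork B, about 26%, which is below the threshold. Fork B voters would clear it easily if they could observe the 1673 Fork A votes, but those votes were never landed on their fork.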
It seems Fork B never reached the switching threshold. Eventually (around 23:08:29 UTC), the validators that originally voted on Fork A refreshed their votes for slots 184353488, 184353489, and 184353490. This tipped the scales, allowing Fork B voters to switch, consensus to be achieved, and new roots to be confirmed.
The refresh occurred at this point because the blockhash for the original votes had expired, i.e. MAX_PROCESSING_AGE slots had been created on Fork A.
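As a rough sketch of the refresh trigger, assuming a vote is re-sent once the number of blocks built on its fork since the vote reaches MAX_PROCESSING_AGE. The function name and the exact comparison are hypothetical, and the limit is passed in as a parameter rather than asserting the constant's value:

```rust
/// Returns true once enough blocks have been built on the voter's fork that the
/// blockhash used by the original vote is considered expired.
/// Hypothetical helper; `max_processing_age` stands in for MAX_PROCESSING_AGE.
fn vote_needs_refresh(blocks_since_vote: u64, max_processing_age: u64) -> bool {
    blocks_since_vote >= max_processing_age
}
```

Under this model, the stall lasts until Fork A has produced enough blocks for the original votes to expire, which matches the delay between 23:06:56 UTC and 23:08:29 UTC only if nothing else causes the votes to be re-sent earlier.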
The big question is: why didn't some leader on Fork A ingest the original votes via gossip? Why did it take refreshing these votes?
Proposed Solution
Debug and fix this so that we can come to consensus sooner.