This repository was archived by the owner on Jan 22, 2025. It is now read-only.

votes transmitted over gossip fail to acquire account locks #29690

Closed
mschneider opened this issue Jan 13, 2023 · 8 comments
Labels
community, stale, votes_in_gossip

Comments

@mschneider
Contributor

mschneider commented Jan 13, 2023

Problem

While investigating transaction confirmation issues in 09/22, I noticed that the banking threads assigned to gossip votes barely manage to commit any transactions to the bank. The chart below was generated from chronograph data for slots 150122268 - 150133863 on mainnet-beta by grouping banking_stage-leader_slot_packet_counts by the field id, which represents the banking thread id.

[Screenshot, 2023-01-13: chart of banking_stage-leader_slot_packet_counts grouped by banking thread id]
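
The grouping described above can be sketched roughly as follows. This is only an illustration of the aggregation, not the query used to build the chart: `id` is the only field named in this issue, while `received`, `committed`, and all type and function names here are hypothetical stand-ins for whatever the real datapoint contains.

```rust
use std::collections::BTreeMap;

/// One datapoint of a per-thread packet-count metric.
/// Only `id` is named in the issue; `received` and `committed` are
/// hypothetical field names used for illustration.
#[derive(Debug, Clone, Copy)]
pub struct PacketCountSample {
    pub id: u32,        // banking thread id
    pub received: u64,  // hypothetical: packets handed to this thread
    pub committed: u64, // hypothetical: transactions committed to the bank
}

/// Running totals for one banking thread.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct ThreadTotals {
    pub received: u64,
    pub committed: u64,
}

impl ThreadTotals {
    /// Fraction of received packets that were committed (0.0 if none were received).
    pub fn commit_ratio(&self) -> f64 {
        if self.received == 0 {
            0.0
        } else {
            self.committed as f64 / self.received as f64
        }
    }
}

/// Group samples by banking thread id and sum their counters.
pub fn totals_by_thread(samples: &[PacketCountSample]) -> BTreeMap<u32, ThreadTotals> {
    let mut totals: BTreeMap<u32, ThreadTotals> = BTreeMap::new();
    for s in samples {
        let entry = totals.entry(s.id).or_default();
        entry.received += s.received;
        entry.committed += s.committed;
    }
    totals
}
```

A thread whose `commit_ratio()` stays near zero over many slots would show the pattern reported here.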

Proposed Solution

This might indicate that votes over gossip are an obsolete mechanism that is no longer required thanks to improvements to turbine. We could investigate what happens if we stop sending votes over gossip.

@mschneider added the community label Jan 13, 2023
@mschneider
Contributor Author

mschneider commented Jan 13, 2023

related issues i could find:
#28092
#26819
#24887

@mvines
Contributor

mvines commented Jan 13, 2023

Practically there could be a validator command-line flag added to disable pushing votes to gossip for easy experimentation across clusters

@sakridge
Contributor

cc @behzadnouri

@sakridge
Contributor

I think ideally we would have something that monitors the vote state and starts pushing votes to gossip only after slots of delinquency. Initially this could be a manual option which we then experiment with on testnet.
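
A minimal sketch of that idea, assuming delinquency is measured as the distance between the current slot and the last vote seen to land. Every name here (`GossipVotePolicy`, `delinquency_threshold_slots`, `should_push_to_gossip`) is hypothetical, and the threshold is left as a parameter because the comment above does not specify one.

```rust
pub type Slot = u64;

/// Hypothetical policy: push votes to gossip only once the validator
/// looks delinquent by at least `delinquency_threshold_slots`.
#[derive(Debug, Clone, Copy)]
pub struct GossipVotePolicy {
    /// Assumed tunable; no concrete value is proposed in the discussion.
    pub delinquency_threshold_slots: u64,
}

impl GossipVotePolicy {
    /// Returns true when votes should also be pushed to gossip.
    ///
    /// `last_landed_vote_slot` is the most recent slot for which a vote was
    /// observed in the vote state, or `None` if no vote has landed yet.
    pub fn should_push_to_gossip(
        &self,
        current_slot: Slot,
        last_landed_vote_slot: Option<Slot>,
    ) -> bool {
        match last_landed_vote_slot {
            // Nothing has landed yet: fall back to gossip.
            None => true,
            Some(landed) => {
                current_slot.saturating_sub(landed) >= self.delinquency_threshold_slots
            }
        }
    }
}
```

Starting with a manual option, as suggested, would correspond to setting the threshold from configuration and experimenting with different values on testnet.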

@behzadnouri
Contributor

We have experimented with some patches to reduce gossip votes: #22949
#16245 includes the observations and additional discussion where the constraints and trade-offs are.
I have some thoughts to improve that #22949 experiment. Also once VoteStateUpdate is rolled out across all clusters gossip can be made more efficient w.r.t votes.

@behzadnouri
Contributor

> This might indicate that votes over gossip are an obsolete mechanism that is no longer required thanks to improvements to turbine. We could investigate what happens if we stop sending votes over gossip.

When there is forking, votes won't land in the blocks on the other forks, and so will not get propagated through tvu/turbine path. In that case future leaders will rely on gossip in order to ingest those votes and include them in their blocks. If gossip is turned off then resolving these forks would become harder.

From @carllin discussing recent forks on testnet: #30669

  1. The validators on the eventual major fork on 184353488 saw that the fork descended from 184353492 was heavier at the time, and so they stopped voting on the fork descended from 184353488 while waiting to switch to 184353492.
  2. For some reason the votes for 184353488 did not land, even given the blockhash expiration duration. The initial turbine blast for these votes for 184353488 to the next leaders for slots 184353491 didn't land because they were on the other fork. This means these votes relied on leaders further in the future to ingest these votes into the block, but this didn't happen. The reason for this is probably something wrong with future leaders' ingestion of these votes from gossip.
  3. Validators for 184353488 eventually refreshed their vote and those votes landed in block 184353729, making the fork descended from 184353488 the heaviest fork, so validators on that fork stopped waiting to switch to the fork descended from 184353492 and started voting again, allowing the cluster to continue.
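
The effect described in these steps can be illustrated with a toy model in which a fork's weight only counts stake from votes that have actually landed. This is a deliberately simplified sketch, not the validator's fork-choice implementation: the `Vote` struct, the `landed` flag, and the tie-break rule are assumptions, and any stake values fed into it are made up.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

pub type Slot = u64;

/// Toy vote: stake behind a fork, and whether the vote has landed in a
/// block that the observer can see. All fields are illustrative.
#[derive(Debug, Clone, Copy)]
pub struct Vote {
    pub fork: Slot,
    pub stake: u64,
    pub landed: bool,
}

/// Pick the fork with the most landed stake.
/// Ties go to the lower slot number (an arbitrary choice for determinism).
pub fn heaviest_fork(votes: &[Vote]) -> Option<Slot> {
    let mut weight: HashMap<Slot, u64> = HashMap::new();
    for v in votes.iter().filter(|v| v.landed) {
        *weight.entry(v.fork).or_insert(0) += v.stake;
    }
    weight
        .into_iter()
        .max_by_key(|&(fork, w)| (w, Reverse(fork)))
        .map(|(fork, _)| fork)
}
```

In this model, a fork backed by the majority of stake still looks lighter than its competitor for as long as its votes have not landed, and it becomes the heaviest fork as soon as they do.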

@bw-solana
Contributor

bw-solana commented Mar 17, 2023

+1 to what @behzadnouri said. From my observations, the vast majority (90%+) of gossip vote transactions error out for already_processed because generally turbine votes land faster. However, in the forking case (where turbine votes for Fork A get sent to leader building on Fork B), we potentially need those gossip votes to reach consensus w/o waiting for turbine vote refresh.
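
As a rough illustration of why the later copy of a vote fails, consider a set of already-seen transaction signatures: whichever path delivers the vote second is rejected. The types below are hypothetical and only mimic the already_processed outcome mentioned above; they are not the bank's status cache.

```rust
use std::collections::HashSet;

/// Hypothetical error mirroring the already_processed outcome.
#[derive(Debug, PartialEq, Eq)]
pub enum VoteError {
    AlreadyProcessed,
}

/// Toy stand-in for the record of transactions that have been processed.
#[derive(Default)]
pub struct ProcessedVotes {
    seen: HashSet<[u8; 64]>,
}

impl ProcessedVotes {
    /// Accept a vote transaction the first time its signature is seen and
    /// reject any later copy, regardless of which path delivered it.
    pub fn process(&mut self, signature: [u8; 64]) -> Result<(), VoteError> {
        if self.seen.insert(signature) {
            Ok(())
        } else {
            Err(VoteError::AlreadyProcessed)
        }
    }
}
```

If the turbine copy usually arrives first, the gossip copy of the same vote is the one that hits the error branch.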

@t-nelson
Contributor

> +1 to what @behzadnouri said. From my observations, the vast majority (90%+) of gossip vote transactions error out for already_processed because generally turbine votes land faster. However, in the forking case (where turbine votes for Fork A get sent to leader building on Fork B), we potentially need those gossip votes to reach consensus w/o waiting for turbine vote refresh.

we've tossed around the idea of deferring sending votes down the gossip path unless we don't see them landing promptly via turbine. sticking point ofc is defining "promptly"
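
One way to picture the deferral, assuming "promptly" is expressed as a grace period in slots. That definition is exactly the open question, so `grace_slots` is a placeholder, and the struct and method names are hypothetical.

```rust
pub type Slot = u64;

/// A vote that has been sent but not yet observed to land.
#[derive(Debug, Clone, Copy)]
struct PendingVote {
    voted_slot: Slot,
    sent_at_slot: Slot,
}

/// Hypothetical helper that holds votes back from gossip until they have
/// failed to land within `grace_slots`.
pub struct DeferredGossipSender {
    grace_slots: u64,
    pending: Vec<PendingVote>,
}

impl DeferredGossipSender {
    pub fn new(grace_slots: u64) -> Self {
        Self { grace_slots, pending: Vec::new() }
    }

    /// Record a vote that was just sent down the normal path.
    pub fn on_vote_sent(&mut self, voted_slot: Slot, current_slot: Slot) {
        self.pending.push(PendingVote { voted_slot, sent_at_slot: current_slot });
    }

    /// The vote was seen in a block: it no longer needs gossip.
    pub fn on_vote_landed(&mut self, voted_slot: Slot) {
        self.pending.retain(|v| v.voted_slot != voted_slot);
    }

    /// Remove and return the voted slots whose grace period has expired;
    /// these are the votes that would now be pushed to gossip.
    pub fn take_due(&mut self, current_slot: Slot) -> Vec<Slot> {
        let grace = self.grace_slots;
        let mut due = Vec::new();
        self.pending.retain(|v| {
            let overdue = current_slot.saturating_sub(v.sent_at_slot) >= grace;
            if overdue {
                due.push(v.voted_slot);
            }
            !overdue
        });
        due
    }
}
```

A short grace period behaves almost like today's unconditional gossip push, while a long one delays the fallback that forked validators depend on.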

@github-actions bot added the stale label Mar 20, 2024
@github-actions bot closed this as not planned Mar 27, 2024
7 participants