Skip to content
This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Metrics show 2x more TPU votes than gossip #26819

Closed
sakridge opened this issue Jul 27, 2022 · 16 comments
Closed

Metrics show 2x more TPU votes than gossip #26819

sakridge opened this issue Jul 27, 2022 · 16 comments

Comments

@sakridge
Copy link
Contributor

Problem

The gossip votes seen in banking stage seem to be occuring at a rate almost 2x over the gossip votes, these are the relevant queries:

SELECT mean("receive_and_buffer_packets_count") AS "gossip_only_votes" FROM "mainnet-beta"."autogen"."banking_stage-loop-stats" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND id =0

SELECT (mean("Vote-pull") + mean("Vote-push"))/2 AS total_votes_inserted_to_gossip FROM "mainnet-beta"."autogen"."cluster_info_crds_stats" WHERE time > :dashboardTime: AND time < :upperDashboardTime:

Proposed Solution

Debug and explain why.

@sakridge
Copy link
Contributor Author

@brennanwatt does any of your analysis explain this?

@sakridge
Copy link
Contributor Author

Screen Shot 2022-07-27 at 9 32 43 PM

@carllin
Copy link
Contributor

carllin commented Jul 27, 2022

So the idea is SELECT (mean("Vote-pull") + mean("Vote-push"))/2 AS total_votes_inserted_to_gossip FROM "mainnet-beta"."autogen"."cluster_info_crds_stats" WHERE time > :dashboardTime: AND time < :upperDashboardTime: is supposed to find all the votes that arrived through gossip to the local node? Why do we need to divide by 2?

If I'm reading this correctly:

solana/gossip/src/crds.rs

Lines 665 to 666 in 8d69e8d

GossipRoute::PushMessage => self.push.record_insert(entry),
GossipRoute::PullResponse => self.pull.record_insert(entry),

  1. Vote-push is incremented every time there's a push to us
  2. Vote-pull is incremented every time there's a pull response to our pull request

@behzadnouri
Copy link
Contributor

behzadnouri commented Jul 27, 2022

Why do we need to divide by 2?

Divide by 2 is because gossip metrics are reported every 2 seconds.
Other metrics are apparently reported every 1 second.

@carllin
Copy link
Contributor

carllin commented Jul 27, 2022

Screen Shot 2022-07-27 at 9 32 43 PM

Man this query is taking me forever to run, I'm assuming the blue is the number of votes ingested via the vote gossip thread?

@carllin
Copy link
Contributor

carllin commented Jul 27, 2022

image

Seems like the leader's gossip banking thread is seeing about 3k-4k gossip votes per block, of which only 100 or so are committed. This aligns with the 8-10k spikes from Stephen's query above:
SELECT mean("receive_and_buffer_packets_count") AS "gossip_only_votes" FROM "mainnet-beta"."autogen"."banking_stage-loop-stats" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND id =0

@carllin
Copy link
Contributor

carllin commented Jul 27, 2022

image

From this we can see the leader is attempting execution of almost 20k gossip votes per slot, but only 100 or so are making it. The rest are failing. This part requires some more investigation.

But if this is the behavior, then the amplification makes sense because gossip will retry sending votes to banking threads that didn't make it into the block, so the amplification of the number of total votes seen by banking threads should be equal to number of blocks per second, which is roughly 2.

@carllin
Copy link
Contributor

carllin commented Jul 27, 2022

image

Seems like a lot of the failures are retryable errors
  1. When the votes attempted is over 40k, mostly cost model errors
  2. Otherwise seems to be AccountInUse? Is there some other retryable error other than AccountInUse and the CostModel?

@carllin
Copy link
Contributor

carllin commented Jul 27, 2022

Ah I wonder if this is because the 128 votes from the BankingStage unprocessed heap are being popped off at the same time, 127 of which will be from the same validator, and will conflict with AccountInUse

@bw-solana
Copy link
Contributor

image

From this we can see the leader is attempting execution of almost 20k gossip votes per slot, but only 100 or so are making it. The rest are failing. This part requires some more investigation.

But if this is the behavior, then the amplification makes sense because gossip will retry sending votes to banking threads that didn't make it into the block, so the amplification of the number of total votes seen by banking threads should be equal to number of blocks per second, which is roughly 2.

This aligns with my observations recorded in #24887 . Here's some relevant data from that issue:

Collected some more interesting data to better understand the interaction between turbine and gossip votes. Data was collected for reporting nodes between slots 142426995 and 142428995 (~830 reported slots):

Successfully committed turbine votes = 1,172,497 (89.4% of total successful votes)
Successfully committed gossip votes = 138,804 (10.6% of total successful votes)
Non-retryable failures for turbine votes = 305,401
Non-retryable failures for gossip votes = 252,654
Turbine vote failures can be further broken down as follows:

account_not_found = 618
blockhash_not_found = 235,44
blockhash_too_old = 285
instruction_error = 240,631 --> VoteTooOld
already_processed = 40,323
Gossip vote failures can be further broken down as follows:

account_not_found = 53,934
blockhash_not_found = 25,473
blockhash_too_old = 11,214
instruction_error = 72,183 --> VoteTooOld
already_processed = 89,850

@carllin
Copy link
Contributor

carllin commented Jul 27, 2022

@AshwinSekar has a rework of how banking stage vote threads organize their unprocessed queues #26722. Could be useful for the retries here as well.

@bw-solana
Copy link
Contributor

bw-solana commented Jul 27, 2022

image

From this we can see the leader is attempting execution of almost 20k gossip votes per slot, but only 100 or so are making it. The rest are failing. This part requires some more investigation.

But if this is the behavior, then the amplification makes sense because gossip will retry sending votes to banking threads that didn't make it into the block, so the amplification of the number of total votes seen by banking threads should be equal to number of blocks per second, which is roughly 2.

Not sure I follow this. The banking_stage-loop-stats.receive_and_buffer_packets_count metric shouldn't increment for transactions that are retried. This should only be incremented for new packets coming in from ClusterInfoVoteListener -> bank_send_loop

Is it because we reset previously_sent_to_bank_votes in check_for_leader_bank_and_send_votes on slot transition?

@bw-solana
Copy link
Contributor

Took a look at these stats across the entire cluster (as opposed to limited validator set), and it appears to align with the data cited above:
image
You can also see number of votes sent from cluster_info_vote_listener lines up with what we're seeing received by the banking threads

@carllin
Copy link
Contributor

carllin commented Jul 27, 2022

Not sure I follow this. The banking_stage-loop-stats.receive_and_buffer_packets_count metric shouldn't increment for transactions that are retried. This should only be incremented for new packets coming in from ClusterInfoVoteListener -> bank_send_loop

@brennanwatt "because gossip will retry sending votes to banking", by this I meant the actual cluster_info_vote_listener thread will resend the votes to the banking thread

@bw-solana
Copy link
Contributor

bw-solana commented Jul 27, 2022

Not sure I follow this. The banking_stage-loop-stats.receive_and_buffer_packets_count metric shouldn't increment for transactions that are retried. This should only be incremented for new packets coming in from ClusterInfoVoteListener -> bank_send_loop

@brennanwatt "because gossip will retry sending votes to banking", by this I meant the actual cluster_info_vote_listener thread will resend the votes to the banking thread

Gotchu, yep, makes perfect sense.

Gossip --> ClusterInfoVoteListener --> Banking Stage
        ^                           ^
  Once per vote         Once per slot --> same vote can get re-sent to banking stage for each slot

The amplification factor seems to line up as well. I'm seeing ~1.67x amplification, which would correlate with 600ms slot times

@sakridge
Copy link
Contributor Author

sakridge commented Oct 5, 2022

Looks like this is explained. Thanks!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants