Metrics show 2x more TPU votes than gossip #26819
Comments
@brennanwatt does any of your analysis explain this?
So the idea is, if I'm reading this correctly (lines 665 to 666 in 8d69e8d):
The division by 2 is there because gossip metrics are reported every 2 seconds.
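As a minimal sketch of that normalization: assuming the `Vote-pull` and `Vote-push` counters each cover one 2-second reporting window, dividing their sum by the window length gives a per-second rate. The function name and the sample counts are hypothetical, not values taken from the dashboard.

```python
def gossip_votes_per_second(vote_pull, vote_push, report_interval_s=2.0):
    """Convert per-window gossip vote counts into a per-second rate.

    Assumption: both counters are accumulated over the same reporting
    window of `report_interval_s` seconds (2 s per the comment above).
    """
    if report_interval_s <= 0:
        raise ValueError("report_interval_s must be positive")
    return (vote_pull + vote_push) / report_interval_s


# Hypothetical counts for a single reporting window.
print(gossip_votes_per_second(1500, 2500))
```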
Seems like the leader's gossip banking thread is seeing about 3k-4k gossip votes per block, of which only 100 or so are committed. This aligns with the 8-10k spikes from Stephen's query above.
From this we can see the leader is attempting execution of almost 20k gossip votes per slot, but only 100 or so are making it into the block. The rest are failing, and that part requires some more investigation. But if this is the behavior, then the amplification makes sense, because gossip will retry sending votes to banking threads when those votes didn't make it into the block. The total number of votes seen by banking threads should therefore be amplified by the number of blocks per second, which is roughly 2.
Ah, I wonder if this is because the 128 votes from the BankingStage unprocessed heap are being popped off at the same time, 127 of which will be from the same validator and will fail with AccountInUse.
This aligns with my observations recorded in #24887, which has the relevant data.
@AshwinSekar has a rework of how banking stage vote threads organize their unprocessed queues in #26722. Could be useful for the retries here as well.
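The AccountInUse hypothesis above can be illustrated with a toy model. This is a sketch only, not the actual BankingStage locking code: it assumes that within one batch only the first vote touching a given vote account acquires the lock and every later vote for that account is rejected. The function name and validator labels are hypothetical.

```python
def simulate_batch(vote_accounts):
    """Count votes that lock successfully vs. those rejected in one batch.

    Simplified assumption: one lock per vote account per batch; any
    further vote on an already-locked account is rejected (modelled
    here as the AccountInUse case).
    """
    locked = set()
    rejected = 0
    for account in vote_accounts:
        if account in locked:
            rejected += 1  # account already locked in this batch
        else:
            locked.add(account)
    return len(locked), rejected


# A batch of 128 votes where 127 come from the same validator.
batch = ["validator_A"] * 127 + ["validator_B"]
succeeded, failed = simulate_batch(batch)
print(succeeded, failed)
```

Under this model nearly the whole batch is rejected and has to be retried, which is the kind of low commit rate described in the comments above.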
@brennanwatt "because gossip will retry sending votes to banking", by this I meant the actual |
Gotchu, yep, makes perfect sense.
The amplification factor seems to line up as well. I'm seeing ~1.67x amplification, which would correlate with 600ms slot times.
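The arithmetic behind that figure can be sketched as follows, assuming (as argued earlier in the thread) that each vote left unprocessed is seen again once per slot, so the amplification equals the number of slots per second. The function name is hypothetical.

```python
def amplification_factor(slot_time_ms):
    """Expected ratio of votes seen by banking threads to votes in gossip.

    Assumption: amplification equals the number of slots per second,
    i.e. 1000 ms divided by the slot time in milliseconds.
    """
    if slot_time_ms <= 0:
        raise ValueError("slot_time_ms must be positive")
    return 1000.0 / slot_time_ms


for slot_time_ms in (500, 600):
    print(slot_time_ms, round(amplification_factor(slot_time_ms), 2))
```

With 600ms slots this gives about 1.67, and with 500ms slots it gives exactly 2, which brackets the "roughly 2" estimate.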
Looks like this is explained. Thanks!
Problem
The gossip votes seen in banking stage seem to be occurring at a rate almost 2x that of the votes inserted into gossip. These are the relevant queries:
SELECT mean("receive_and_buffer_packets_count") AS "gossip_only_votes" FROM "mainnet-beta"."autogen"."banking_stage-loop-stats" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND id = 0
SELECT (mean("Vote-pull") + mean("Vote-push"))/2 AS total_votes_inserted_to_gossip FROM "mainnet-beta"."autogen"."cluster_info_crds_stats" WHERE time > :dashboardTime: AND time < :upperDashboardTime:
Proposed Solution
Debug and explain why.