in-progress/13000-tx-availability/design.md
| | |
| -------------------- | --------------------------------------------------------------------------------------------- |
| Owners | @PhilWindle |
| Approvers | @Maddiah @alexghr @spalladino |
| Target Approval Date | 2025-05-23 |


## Executive Summary

This design attempts to solve the problem of ensuring that transactions (txs) are made available to the network.


## Introduction

There are 3 primary actors on the Aztec Network.

1. Full nodes: these participate in gossiping and usually provide RPC endpoints for clients.
2. Validators: full nodes that additionally propose and validate new blocks. Validators are organised into committees of 48. Every slot (currently 3 ETH slots) a committee member proposes a block and sends this proposal to the rest of the committee for attestation.
3. Provers: full nodes that also orchestrate the proving of epochs (batches of previously proposed blocks).

The network needs to ensure that txs are made available to validators and provers. From this we can derive a set of requirements:

1. Proposers are assumed to already have access to the transactions they are including in the block.
2. Validators need to have access to the transactions to re-execute and attest to the block. This needs to be achieved in a short period of time; validators have on the order of 10 seconds to attest to a block.
3. Provers also require transactions for re-executing and proving the epoch. This is less time-critical, as proving is performed over the course of 2 epochs.
4. Additionally, both validators and provers need to verify the proofs that accompany the transactions; these proofs dominate the size of the transaction payload.
5. From a network perspective, we are satisfied with 66% of validators successfully re-executing and attesting to a block. The network only requires 1 prover to submit a proof of an epoch.

When it comes to hardware and bandwidth requirements, it is desirable for people to be able to operate as validators with a single consumer-grade machine and home broadband connectivity with potentially limited upload bandwidth. Provers can be expected to have significantly more resources; they are likely to be professional/institutional organisations.
> **Review comment (Collaborator):** It would be good to specify the exact minimum supported up/down speeds.


## Current Architecture

The principal method by which transactions are distributed around the network is the gossipsub protocol. Those transactions are then stored in a node's local database and are subject to eviction based on their priority fee and the node's configured maximum transaction pool size (a minimal sketch of this eviction follows the list below). There are a number of reasons why, at any given point in time, a node may not have a transaction available, namely:

1. Being offline/disconnected at the time the transaction was gossiped
2. Having limited transaction storage causing transactions to be evicted
3. Imperfect gossiping
4. Client software using varying strategies for transaction eviction
5. Bugs in the implementation of the transaction pool
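
The eviction behaviour mentioned above can be illustrated with a minimal sketch. The `Tx` and `TxPool` shapes here are hypothetical; the real pool implementation is more sophisticated (and would index by fee rather than scan linearly):

```ts
// Hypothetical sketch of a size-bounded tx pool with priority-fee eviction.
interface Tx {
  hash: string;
  priorityFee: bigint;
  sizeInBytes: number;
}

class TxPool {
  private txs = new Map<string, Tx>();
  private currentSize = 0;

  constructor(private readonly maxSizeInBytes: number) {}

  add(tx: Tx): void {
    this.txs.set(tx.hash, tx);
    this.currentSize += tx.sizeInBytes;
    // Evict the lowest-priority-fee transactions until back under the cap.
    while (this.currentSize > this.maxSizeInBytes) {
      const lowest = [...this.txs.values()].reduce((a, b) =>
        a.priorityFee <= b.priorityFee ? a : b,
      );
      this.txs.delete(lowest.hash);
      this.currentSize -= lowest.sizeInBytes;
    }
  }

  has(hash: string): boolean {
    return this.txs.has(hash);
  }
}
```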

In addition to the gossipsub protocol, there are other mechanisms used to retrieve transactions:

1. Request/Response. Validators and provers will randomly select subsets of their total peers and directly request the transactions that they are missing. This is done repeatedly over a period of rounds until the transaction is found or an arbitrary time limit is reached (see the sketch after this list). Nodes are careful when using this so as not to be too aggressive in their requests. Nodes employ peer-scoring techniques to rate limit peers that are requesting too much data. As a result, request/response is not a completely reliable mechanism for transaction retrieval.
2. Prover Coordination. Provers can be configured with any number of HTTP URLs that they can use to retrieve transactions. This enables provers to run a number of additional nodes on the network for the purpose of increasing their chances of receiving everything they require.
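
A minimal sketch of the round-based request/response from point 1 above. The `Peer` interface, round counts, and timeouts are illustrative assumptions, not the actual implementation:

```ts
// Minimal shapes for this sketch.
interface Tx { hash: string }
interface Peer { requestTx(hash: string, timeoutMs: number): Promise<Tx | undefined> }

// Fisher–Yates shuffle, used to sample a random subset of peers per round.
function shuffle<T>(xs: T[]): T[] {
  const a = [...xs];
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

// Repeatedly ask random subsets of peers for a missing tx, round by round,
// until it is found or we give up.
async function requestTxFromPeers(
  txHash: string,
  allPeers: Peer[],
  opts = { peersPerRound: 8, maxRounds: 5, roundTimeoutMs: 1_000 },
): Promise<Tx | undefined> {
  for (let round = 0; round < opts.maxRounds; round++) {
    const sample = shuffle(allPeers).slice(0, opts.peersPerRound);
    const results = await Promise.allSettled(
      sample.map(p => p.requestTx(txHash, opts.roundTimeoutMs)),
    );
    for (const r of results) {
      if (r.status === 'fulfilled' && r.value !== undefined) return r.value;
    }
  }
  return undefined; // Give up after the configured number of rounds.
}
```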

## Proposed Solution

Gossiping is already very effective at transaction propagation, and it is assumed that, with further reductions in proof size and increases in bandwidth availability, this will continue to be the case. However, due to the reasons outlined above, we have to assume that there will be instances of nodes not having the transactions they require.
> **Review comment (Collaborator):** How "effective"?

> **Reply:** From this doc: https://research.protocol.ai/publications/gossipsub-v1.1-evaluation-report/vyzovitis2020.pdf
>
> "The latency distribution shows that most nodes receive messages under the 150 ms threshold in gs-v1.1 with p99 at 165 ms, while in gs-v1.0 the threshold is 200 ms with p99 at 192 ms. This is because [...]"
>
> The setup was 1k nodes: 100 nodes are active publishers, the remaining 900 are passive.

> **Reply (@just-mitch, May 21, 2025):** Thanks! Looks like the messages in the test were 2KB. @PhilWindle How big are our proposals?


The ultimate solution would be to do as Ethereum does and bundle the transactions with the block proposal. This would ensure that everyone who requires the transactions would have them. However, this is not feasible: transactions are approximately 60KB compressed, and a block proposal at a rate of 10 TPS over a 36-second slot (3 ETH slots of 12 seconds) would contain 360 transactions, or ~22MB of data.

Our usage of Gossipsub works such that:

1. The original publisher of a message sends to all peers. This number is configurable but likely to be no less than 50.
   > **Review comment:** Nitpick: I'm not sure this is the case, but please correct me if I'm wrong; flooding is disabled by default. I understand that publishing works similarly to forwarding: we'll send a message to direct and mesh peers. However, I don't see the direct peers being set/used in the code, so, TL;DR, we'll send only to mesh peers. (Source)

   > **Reply (Author):** We currently have this switched on by default. It's configurable on the node so can easily be switched off. Do you see any downside to having it on? It seems like quite a good feature to kickstart propagation around the network.

   > **Reply:**
   > > We currently have this switched on by default.
   >
   > I completely missed this 🤦
   >
   > > Do you see any downside to having it on?
   >
   > Yes. If we implement a solution where peer P' asks for the missing transactions from peer P (proposal sender), we could overburden it by ending up in a situation where a bunch of peers (up to maxPeerCount, which is currently 100) are asking the proposer for the missing txs simultaneously. Furthermore, some of these peers could be physically far away from the proposer, so depending on the network, we could either waste bandwidth because they'll be receiving the proposal from other peers (which I don't think is a huge deal) or, worse, they could receive the proposal from the proposer and ask it for the transactions, effectively losing latency.
   >
   > I feel relying on the mesh is a better choice, and I'd like to ask the validators to increase their mesh count (D) to 12 because it will give us better spread w/o losing much (at all?) latency.
   >
   > Still, this is just my feeling/opinion/reasoning; I don't have any concrete numbers, experiments, or proofs. In practice, either option shouldn't make a huge difference.

2. All other peers propagate to a subset, around 5/6 peers.
   > **Review comment:** Again, nitpick: I would say around 8 peers. This is the target mesh size, with the minimum being 4 and the maximum being 12.

   > **Reply (Author):** Yes, I think you are right. I had the number 6 in my mind. Again, it's configurable with a default of 8.


Therefore, a proposer would need to send 50 * 22 MB = 1.1 GB of data and all other peers would need to send 5 * 22 MB = 110 MB. This would result in validators simply not having time to attest to block proposals.
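
Spelling out the arithmetic from the figures above:

```ts
// Back-of-envelope bandwidth cost per block proposal if txs were bundled.
const txSizeKB = 60;       // ~60KB per compressed transaction
const tps = 10;            // target throughput
const slotSeconds = 36;    // one Aztec slot = 3 ETH slots of 12 seconds
const txsPerBlock = tps * slotSeconds;              // 360 txs
const proposalMB = (txsPerBlock * txSizeKB) / 1024; // ~21 MB, i.e. ~22MB

const publisherPeers = 50; // flood-publish degree assumed in the text
const meshPeers = 5;       // forwarding degree assumed in the text
const publisherGB = (publisherPeers * proposalMB) / 1024; // ~1.1 GB
const forwarderMB = meshPeers * proposalMB;               // ~110 MB
console.log({ txsPerBlock, proposalMB, publisherGB, forwarderMB });
```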

Therefore, we propose a solution that adds additional layers/protocols and enhancements to the current gossipsub + request/response methods, encapsulated within a new TxRetrieval module. TxRetrieval will continuously work to retrieve all transactions specified as part of block proposals until either the transaction has been retrieved or the transaction is known not to be included in a mined block. This latter case is identified by the publishing of a mined block that does not contain the transaction, where the transaction has also not been included in a prior mined block.
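
A sketch of how the proposed TxRetrieval module might be shaped. All interfaces and names here are assumptions for illustration, not the actual design:

```ts
// Minimal shapes for this sketch.
interface Tx { hash: string }

interface TxSource {
  // e.g. the local pool, request/response sampling, or prover HTTP URLs.
  tryFetch(txHash: string): Promise<Tx | undefined>;
}

class TxRetrieval {
  /** Tx hashes still being chased, keyed by the proposal that needs them. */
  private pending = new Map<string, Set<string>>();

  constructor(private readonly sources: TxSource[]) {}

  /** Start chasing every tx in a proposal that we don't already have. */
  onBlockProposal(proposalId: string, missingTxHashes: string[]): void {
    if (missingTxHashes.length > 0) {
      this.pending.set(proposalId, new Set(missingTxHashes));
    }
  }

  /** Called periodically: try every configured source for each missing tx. */
  async tick(): Promise<void> {
    for (const [proposalId, hashes] of this.pending) {
      for (const hash of hashes) {
        for (const source of this.sources) {
          if (await source.tryFetch(hash)) {
            hashes.delete(hash);
            break;
          }
        }
      }
      if (hashes.size === 0) this.pending.delete(proposalId);
    }
  }

  /** A mined block that omits a tx (never mined in a prior block either)
   *  means the proposal that referenced it failed; stop chasing that tx. */
  onBlockMined(proposalId: string, minedTxHashes: Set<string>): void {
    const hashes = this.pending.get(proposalId);
    if (!hashes) return;
    for (const hash of hashes) {
      if (!minedTxHashes.has(hash)) hashes.delete(hash);
    }
    if (hashes.size === 0) this.pending.delete(proposalId);
  }
}
```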

### Request/Response from proposal propagator

Upon receipt of a block proposal for which the receiving peer P does not have all required transactions, P will start a request/response process but will more aggressively target the peer propagating the block proposal. That is to say, each round of peer sampling will always include the propagating peer. The block proposal itself will be propagated further immediately, regardless of whether P is missing transactions. This ensures that the proposal is not delayed.
> **Review comment (Collaborator):** Why wouldn't we just have the proposer send the proposal directly to the validators?
>
> Then it is much easier to mitigate DoS when doing req/resp for transactions with the proposer: the proposer only replies to people it sent directly to.
>
> Everyone still gossips thereafter, so you get eventual full propagation.

> **Reply:**
> > Why wouldn't we just have the proposer send the proposal directly to the validators?
>
> What if the proposer is not connected via P2P to the validators? I don't think it would be easy to guarantee this.

> **Reply (Collaborator):** Committees have an entire Aztec epoch (30 minutes or so) to prepare for their active epoch. That would be plenty of time to trawl through the network and form a decent mesh.
>
> Though yes, we couldn't rely on it entirely: it would be best effort and some validators would still receive proposals on the slow path.
>
> Regardless, if propagation time is truly ~200ms at p99, then this is moot.
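
Returning to the proposed mechanism, here is a minimal sketch of the biased sampling, reusing the `Peer`, `Tx`, and `shuffle` shapes from the earlier request/response sketch (all names hypothetical):

```ts
// Each sampling round always includes the propagating peer, plus a random
// subset of other peers.
async function requestTxFromPropagator(
  txHash: string,
  propagator: Peer,
  otherPeers: Peer[],
  opts = { peersPerRound: 8, maxRounds: 5, roundTimeoutMs: 1_000 },
): Promise<Tx | undefined> {
  for (let round = 0; round < opts.maxRounds; round++) {
    const sample = [
      propagator, // always target the peer we got the proposal from
      ...shuffle(otherPeers).slice(0, opts.peersPerRound - 1),
    ];
    const results = await Promise.allSettled(
      sample.map(p => p.requestTx(txHash, opts.roundTimeoutMs)),
    );
    for (const r of results) {
      if (r.status === 'fulfilled' && r.value !== undefined) return r.value;
    }
  }
  return undefined;
}
```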


To mitigate peers being targeted by DoS attacks, we can attach the block proposal ID to any requests for transactions. The peers being asked can then verify that it is a known, recent proposal. Additionally, a peer can choose to only accept these requests from peers that it gossiped the block to in the first place, and can limit the number of such requests it is willing to service.
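
A sketch of what these server-side checks might look like (the request shape and bookkeeping are assumptions):

```ts
// Hypothetical server-side guard for tx requests tied to a block proposal.
interface TxRequest {
  proposalId: string;
  txHash: string;
  fromPeerId: string;
}

class TxRequestGuard {
  /** proposalId -> peers we forwarded that proposal to. */
  private forwardedTo = new Map<string, Set<string>>();
  /** Requests served per peer, for rate limiting. */
  private served = new Map<string, number>();

  constructor(
    private readonly recentProposalIds: Set<string>,
    private readonly maxRequestsPerPeer = 100,
  ) {}

  recordForward(proposalId: string, peerId: string): void {
    if (!this.forwardedTo.has(proposalId)) {
      this.forwardedTo.set(proposalId, new Set());
    }
    this.forwardedTo.get(proposalId)!.add(peerId);
  }

  shouldServe(req: TxRequest): boolean {
    // 1. Only serve requests tied to a known, recent proposal.
    if (!this.recentProposalIds.has(req.proposalId)) return false;
    // 2. Optionally, only serve peers we gossiped the proposal to.
    if (!this.forwardedTo.get(req.proposalId)?.has(req.fromPeerId)) return false;
    // 3. Cap the number of requests served per peer.
    const count = this.served.get(req.fromPeerId) ?? 0;
    if (count >= this.maxRequestsPerPeer) return false;
    this.served.set(req.fromPeerId, count + 1);
    return true;
  }
}
```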

Targeting the propagating peer in this way ensures that the transactions 'should' propagate throughout the network, albeit at a potentially slower rate than gossiping: every level of propagation requires a round trip of request/response.

We don't wait for each level of propagation to retrieve the transactions, as doing so would delay the block proposal. It is possible (even likely) that a sufficient number of validators already have the transactions and can execute immediately.
> **Review comment:** My 2c: if we waited to propagate the block until the node had all the necessary transactions, we'd have much better guarantees regarding their availability. Node N knows peer P has all the transactions because it got a block proposal from peer P, and once we request missing transactions, we are sure we are going to get them. This is in contrast to the proposed solution, where once we request a transaction from the peer that sent us the proposal, we are only highly likely to get the transaction.
>
> I don't think this will have a substantial negative impact on latency for the following reasons:
>
> 1. We already expect that a sufficient number of nodes will have all transactions.
> 2. It takes approximately log_D(node_count) hops to propagate a block proposal through the whole network. Given D (the mesh size) of 8 and a node_count of 10k, we are looking at 5 hops at most, which is not a lot.

> **Reply (@mralj, May 27, 2025):**
> > If we waited to propagate the block until the node had all the necessary transactions,
>
> After internal discussion, we agreed it would be better not to wait on this, as it is hard to argue the benefits. So my comment above is not relevant anymore :)


### Request/Response directly from proposer

Performing request/response more aggressively against the propagator should further increase the likelihood of transactions being generally available to all peers. However, if this is found to still be insufficient, then we propose a method of requesting directly from the proposer. This would only be an option for committee members.
> **Review comment:** I agree we should ask the propagator for the transactions. If we implement the requirement that the propagator must have all the transactions before broadcasting the proposal, I don't think the following is necessary:
>
> > However, if this is found to still be insufficient then we propose a method of requesting directly from the proposer. This would only be an option for committee members.
> >
> > Validators would be required to post their Ethereum public key as part ...
>
> Simplifying the protocol :)


Validators would be required to post their Ethereum public key as part of joining the validator set. The proposer can then optionally set up an authenticated endpoint through which it will serve requests for transactions in the block proposal. The details of this endpoint would be encrypted with each committee member's public key and included in the block proposal. The proposer has complete control over this endpoint, enabling it to protect itself against potential DoS attacks by malicious committee members.
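
One possible shape for preparing these encrypted endpoint details. The encryption primitive is assumed here (any ECIES-style scheme against the registered Ethereum public keys would do; a library such as eciesjs provides an equivalent encrypt function), and the payload layout is hypothetical:

```ts
// Hypothetical: encrypt the proposer's tx-serving endpoint details to each
// committee member's registered Ethereum public key.
interface CommitteeMember {
  address: string;
  publicKey: Uint8Array; // posted when joining the validator set
}

interface EndpointDetails {
  url: string;       // proposer-controlled, can be rotated per slot
  authToken: string; // lets the proposer reject unauthorised callers
}

// Assumed ECIES-style primitive, stood in for by a real library in practice.
declare function eciesEncrypt(publicKey: Uint8Array, plaintext: Uint8Array): Uint8Array;

function encryptEndpointForCommittee(
  details: EndpointDetails,
  committee: CommitteeMember[],
): Map<string, Uint8Array> {
  const plaintext = new TextEncoder().encode(JSON.stringify(details));
  const out = new Map<string, Uint8Array>();
  for (const member of committee) {
    out.set(member.address, eciesEncrypt(member.publicKey, plaintext));
  }
  return out; // included in the block proposal alongside the tx hashes
}
```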

### Make prover coordination urls universal

As a leftover from the previous method of prover coordination, provers have additional configuration allowing them to request transactions from other nodes via HTTP. This can be made a universal option for all nodes and incorporated into the TxRetrieval workflow.
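
Folded into the TxRetrieval sketch above, such an HTTP source might look like this (the endpoint path and response shape are illustrative assumptions):

```ts
// Hypothetical HTTP fallback source plugged into the TxRetrieval sketch above.
class HttpTxSource implements TxSource {
  constructor(private readonly baseUrls: string[]) {}

  async tryFetch(txHash: string): Promise<Tx | undefined> {
    for (const baseUrl of this.baseUrls) {
      try {
        // Endpoint path and JSON shape are illustrative assumptions.
        const res = await fetch(`${baseUrl}/txs/${txHash}`);
        if (res.ok) return (await res.json()) as Tx;
      } catch {
        // Try the next configured node on any network error.
      }
    }
    return undefined;
  }
}
```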

### Increasing prover transaction pool requirements

Provers are assumed to be organisations with access to considerable hardware resources and bandwidth capabilities. It is not unreasonable for them to:

1. Run multiple full nodes in addition to the prover node.
2. Configure every node with a very large transaction pool size, significantly reducing the transaction eviction rate.
3. Configure every node to have a high peer count, increasing the likelihood of request/response success.
   > **Review comment:** We could increase the mesh size to D = 12 to be more aggressive. AFAIK the values above have no observable improvements.

4. Configure every node to have a higher gossipsub degree, perhaps D = 12, increasing the likelihood of gossip success (illustrated below).
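
By way of illustration, a prover-oriented configuration might look like the following (these keys are hypothetical, not actual Aztec node options):

```ts
// Hypothetical configuration for a prover-operated full node.
const proverNodeConfig = {
  txPoolMaxSizeBytes: 100 * 1024 ** 3, // very large pool (~100 GiB), minimal eviction
  maxPeerCount: 200,                   // high peer count for request/response success
  gossipsubD: 12,                      // higher mesh degree for gossip success
  gossipsubDlo: 8,                     // lower/upper mesh bounds scaled to match
  gossipsubDhi: 16,
};
```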

### Centralised transaction storage for provers

The committee is ultimately responsible for epochs being proven, so it is in their interest for transactions to be made available to provers. This may encourage some or all validators to simply push transaction data to a centralised storage service for a period of time after each block is mined, allowing provers access.