Rust simulation ends prematurely with Error: Missing input block #255

Open
bwbush opened this issue Mar 14, 2025 · 15 comments
Labels: bug (Something isn't working)

@bwbush
Collaborator

bwbush commented Mar 14, 2025

I have two Rust simulations that end prematurely. For example, one ends at slot 422 with the following message:

 INFO sim_cli::events: Slot 421 has begun.
 INFO sim_cli::events: Pool node-41 generated an IB with 1 transaction(s) in slot 421 (164.14 kB).
 INFO sim_cli::events: Pool node-93 generated an IB with 1 transaction(s) in slot 421 (164.14 kB).
 INFO sim_cli::events: Pool node-22 generated an IB with 1 transaction(s) in slot 421 (164.14 kB).
 INFO sim_cli::events: Pool node-89 generated an IB with 1 transaction(s) in slot 421 (164.14 kB).
 INFO sim_cli::events: Slot 422 has begun.
Error: Missing input block 180-57-0

Steps to reproduce

sim-cli --parameters config.json network.json --slots 600 sim.log
@bwbush bwbush added the bug Something isn't working label Mar 14, 2025
@SupernaviX
Contributor

Looking into it, this can happen when there's enough traffic in the system that

  • an IB has reached a certified EB
  • the body of that IB hasn't reached every node
  • a node which doesn't have that body, but does have that certified EB, is allowed to mint an RB

I think the issue will go away if you switch ib-diffusion-strategy to oldest-first or peer-order, since then nodes will request older IBs from their backlog. And you might not be able to reproduce it next week anyway, because the new eb-max-age-slots setting makes us less likely to consider the older EBs which would have this problem. But I think it's a legitimate problem with the system being simulated.

What should the protocol do when a node has an endorsed certificate, but not all of the IBs it references? Should it just not attach a cert to that RB?

@Quantumplation
Contributor

Quantumplation commented Mar 15, 2025

Probably an EB shouldn't be considered endorsed until the node has seen all of the IBs it references?

@SupernaviX
Contributor

oh that makes sense, threw that in.

@Saizan
Contributor

Saizan commented Mar 17, 2025

Possession of the IB body should only be relevant for Voting for the EB that references it, and for reconstructing the ledger state of an RB that references the EB?

I think you should be able to "trust" certificates on their own while generating RBs, the same as when validating them.

@will-break-it
Contributor

I share @Saizan's understanding of the protocol. Isn't the certificate proof that the majority of stake holders have seen & validated these IBs?

@pagio
Contributor

pagio commented Mar 17, 2025

> Isn't the certificate proof that the majority of stake holders have seen & validated these IBs?

A certificate ensures that at least one honest node has seen & validated the referenced IBs.

> Possession of the IB body should only be relevant for Voting for the EB that references it, and for reconstructing the ledger state of an RB that references the EB?

Here things start to get tricky. E.g., in one of the ledger design proposals (reward accounts), the RB producer should include txs that satisfy some property w.r.t. the IBs referenced by the EB in the RB.

Assuming our network assumptions hold, Short Leios should guarantee that all IBs inside certified EBs are delivered by the end of the respective deliver 2 phase. Given that this may not be the case currently, as we are still exploring the networking part of the protocol, I would suggest that an RB not include any EB that references an IB the node is missing. Instead, include an older EB, or no EB if there is no valid EB satisfying the criteria outlined.
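
The suggestion above could be sketched roughly as follows. This is a minimal illustration, not the simulator's code: `CertifiedEb`, `IbId`, and `pick_endorsable_eb` are hypothetical names, and choosing the oldest eligible EB is an assumption based on the oldest-first EB selection mentioned elsewhere in this thread.

```rust
use std::collections::HashSet;

/// Hypothetical identifier type; the simulator's real types may differ.
type IbId = String;

/// Hypothetical view of a certified EB: its slot and the IBs it references.
struct CertifiedEb {
    slot: u64,
    ibs: Vec<IbId>,
}

/// Returns the oldest certified EB whose referenced IB bodies are all held
/// locally, or `None` if there is no such EB (the RB then carries no cert).
fn pick_endorsable_eb<'a>(
    certified: &'a [CertifiedEb],
    local_ibs: &HashSet<IbId>,
) -> Option<&'a CertifiedEb> {
    certified
        .iter()
        // Skip any EB that references an IB body this node has not received.
        .filter(|eb| eb.ibs.iter().all(|ib| local_ibs.contains(ib)))
        // Assumption: among eligible EBs, prefer the oldest one.
        .min_by_key(|eb| eb.slot)
}
```

With this shape, a node missing the body of an IB such as `180-57-0` would skip the EB that references it instead of failing when it mints the RB.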

@SupernaviX
Contributor

SupernaviX commented Mar 17, 2025

I think that the freshest first strategy for IBs, combined with the oldest-first strategy when choosing an endorsed EB, is what leads to this. FF means that with a constant stream of newer IBs, the node will never prioritize downloading older ones. Oldest-EB-first means that these older IBs are needed more urgently than newer IBs.

> Assuming our network assumptions hold, Short Leios should guarantee that all IBs inside certified EBs are delivered by the end of the respective deliver 2 phase.

This assumption will be broken if IB generation is fast enough to overwhelm the network. And the speed/resilience of the network isn't under our control IRL, so it makes sense to define the behavior for when the assumption fails.

The sims are using 1 MB/s bandwidth on each link, and generating ~10 IBs per second where each IB body is 150 kB. I think it's correct for a node's set of IBs to "fall behind" when it's fetching them from a single peer, and we generate more bytes of IB than fit in the link with that peer.
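
As a back-of-the-envelope check of those figures, here is a small sketch. It assumes decimal units (1 MB = 1,000,000 bytes, 150 kB = 150,000 bytes) and the worst case where every IB body arrives over a single link; `backlog_growth` is a hypothetical helper, not part of the simulator.

```rust
/// Net growth of the unfetched IB backlog on one link, in bytes per second.
/// A positive result means the receiver falls further behind every second.
fn backlog_growth(ibs_per_sec: i64, ib_bytes: i64, link_bytes_per_sec: i64) -> i64 {
    ibs_per_sec * ib_bytes - link_bytes_per_sec
}
```

Under those assumptions, `backlog_growth(10, 150_000, 1_000_000)` is `500_000`: the network produces 1.5 MB of IB bodies per second against 1 MB/s of link capacity, so a node relying on one link would fall behind by about 0.5 MB each second.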

@Saizan
Contributor

Saizan commented Mar 17, 2025

> I think it's correct for a node's set of IBs to "fall behind" when it's fetching them from a single peer, and we generate more bytes of IB than fit in the link with that peer.

Should it not be fetching them from more than one peer? Haskell sim interprets RequestFromFirst as requesting the body at the first opportunity and not requesting the same body from other peers. It doesn't force every body to be requested from the same upstream peer, and in practice I would expect some spread, though I haven't measured this.

I don't disagree IB diffusion can fall behind in general.

@SupernaviX
Contributor

> Should it not be fetching them from more than one peer? Haskell sim interprets RequestFromFirst as requesting the body at the first opportunity and not requesting the same body from other peers. It doesn't force every body to be requested from the same upstream peer, and in practice I would expect some spread, though I haven't measured this.

The Rust sim interprets it as requesting the body from the first peer which announces it (and RequestFromAll as requesting from each peer which announces it). I could make that more sophisticated: request from the first peer which announces it and has capacity to send it. Not sure if that's what Haskell is doing.

@Quantumplation
Contributor

> I think that the freshest first strategy for IBs, combined with the oldest-first strategy when choosing an endorsed EB, is what leads to this. FF means that with a constant stream of newer IBs, the node will never prioritize downloading older ones. Oldest-EB-first means that these older IBs are needed more urgently than newer IBs.

Perhaps we need an "endorsed-EB-then-freshest-first" strategy for IBs; i.e. first download IBs for EBs that have reached the vote threshold (in anticipation of including them in an RB), and then fetch freshest first.
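
One way to picture that proposed ordering is the sketch below. It only illustrates the idea as stated in this comment; `PendingIb`, its fields, and `order_requests` are hypothetical names and do not correspond to an existing simulator strategy.

```rust
/// Hypothetical record for an IB body the node still needs to download.
#[derive(Debug, Clone, PartialEq, Eq)]
struct PendingIb {
    id: String,
    slot: u64,
    /// True if an EB that reached the vote threshold references this IB.
    in_endorsed_eb: bool,
}

/// Orders pending downloads: IBs referenced by endorsed EBs come first,
/// then freshest (highest slot) first within each group.
fn order_requests(pending: &mut [PendingIb]) {
    pending.sort_by(|a, b| {
        // `true` sorts after `false`, so compare b to a to put endorsed first.
        b.in_endorsed_eb
            .cmp(&a.in_endorsed_eb)
            // Within a group, a higher slot means fresher, so it goes first.
            .then(b.slot.cmp(&a.slot))
    });
}
```

With plain freshest-first, an IB from slot 180 would stay behind every newer IB for as long as new ones keep arriving; under this ordering it moves to the front once an endorsed EB references it.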

@pagio
Contributor

pagio commented Mar 17, 2025

> Perhaps we need an "endorsed-EB-then-freshest-first" strategy for IBs; i.e. first download IBs for EBs that have reached the vote threshold (in anticipation of including them in an RB), and then fetch freshest first.

Changing the IB download strategy may introduce security issues and should not be done lightly. The idea is that the IB rate is set such that, with the freshest-first policy, IBs are delivered within some known window (most of the time) if released on time. If that is not the case, we should rethink the IB rate.

Of course, since sometimes for probabilistic reasons this may not happen, we should consider how the node should handle this situation.

@Saizan
Contributor

Saizan commented Mar 17, 2025

> > Should it not be fetching them from more than one peer? Haskell sim interprets RequestFromFirst as requesting the body at the first opportunity and not requesting the same body from other peers. It doesn't force every body to be requested from the same upstream peer, and in practice I would expect some spread, though I haven't measured this.

> The Rust sim interprets it as requesting the body from the first peer which announces it (and RequestFromAll as requesting from each peer which announces it). I could make that more sophisticated: request from the first peer which announces it and has capacity to send it. Not sure if that's what Haskell is doing.

The Haskell node does not evaluate the state of the sender. Since a node has different threads consuming IBs (or votes or EBs) from different peers, those threads just race to be the first to request a particular body (there isn't a global view of the received announcements either; each thread only knows about its own peer). As a thread requests a body, it signals to the others not to.

Out of a 200s sim I could collect download counts like so

[15,38,49,50,57,74,76,85,87,111,122,125,131]
[6,29,56,62,67,118,119,231,335]
[8,56,74,81,97,135,162,195,208]
[104,119,165,287,347]
[54,56,75,104,120,127,214,268]
[61,96,114,129,140,196,285]
[10,42,51,52,59,60,61,64,80,83,102,106,121,133]
[25,30,74,86,108,132,169,181,215]
[56,82,126,175,182,195,205]
[16,90,133,162,256,358]
...

Each row is a different node; each element in the array is how many bodies the node got from the peer with that index.

@SupernaviX
Contributor

On closer inspection, that matches what the Rust sim is doing. Each node tracks which IBs a given peer has announced, requests one IB at a time from each peer, and doesn't request the same IB from two peers at once.
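
The three rules in that description could be modelled roughly as below. This is a simplified sketch, not the simulator's actual code: `IbFetcher`, `PeerId`, `IbId`, and the method names are hypothetical, announcements are served in arrival order for simplicity (the real order depends on the diffusion strategy), and a claimed IB is never retried from another peer.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Hypothetical identifier types for this sketch.
type PeerId = u32;
type IbId = String;

#[derive(Default)]
struct IbFetcher {
    /// IBs each peer has announced and that we have not yet considered.
    announced: HashMap<PeerId, VecDeque<IbId>>,
    /// At most one outstanding request per peer.
    in_flight: HashMap<PeerId, IbId>,
    /// IBs already requested from some peer (or already received).
    claimed: HashSet<IbId>,
}

impl IbFetcher {
    /// Records that `peer` announced `ib`.
    fn announce(&mut self, peer: PeerId, ib: IbId) {
        self.announced.entry(peer).or_default().push_back(ib);
    }

    /// Returns the next IB to request from `peer`, or `None` if the peer
    /// already has a request in flight or has nothing we still need.
    fn next_request(&mut self, peer: PeerId) -> Option<IbId> {
        if self.in_flight.contains_key(&peer) {
            return None; // one IB at a time per peer
        }
        let queue = self.announced.get_mut(&peer)?;
        while let Some(ib) = queue.pop_front() {
            // `insert` returns false if another peer already claimed this IB.
            if self.claimed.insert(ib.clone()) {
                self.in_flight.insert(peer, ib.clone());
                return Some(ib);
            }
        }
        None
    }

    /// Marks the outstanding request to `peer` as completed.
    fn received(&mut self, peer: PeerId) {
        self.in_flight.remove(&peer);
    }
}
```

In this model, if two peers announce the same IB, whichever peer is asked first claims it and the other peer moves on to its next announcement, which would produce the kind of per-peer spread shown in the download counts above.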

@bwbush
Collaborator Author

bwbush commented Mar 19, 2025

I haven't encountered this in tag leios-2025w12. @SupernaviX, please close this issue if you think the discussion is complete. Thanks!

@SupernaviX
Contributor

I've fixed the error, but I still want to better understand why IBs aren't propagating quickly enough. I suspect the root cause is going to be similar to another open issue, and I'll close this one when I can narrow it down to one of those.
