Rust simulation ends prematurely with `Error: Missing input block` #255

bwbush · 2025-03-14T17:06:32Z

I have two Rust simulations that end prematurely. For example, one ends at slot 422 with the following message:

 INFO sim_cli::events: Slot 421 has begun.
 INFO sim_cli::events: Pool node-41 generated an IB with 1 transaction(s) in slot 421 (164.14 kB).
 INFO sim_cli::events: Pool node-93 generated an IB with 1 transaction(s) in slot 421 (164.14 kB).
 INFO sim_cli::events: Pool node-22 generated an IB with 1 transaction(s) in slot 421 (164.14 kB).
 INFO sim_cli::events: Pool node-89 generated an IB with 1 transaction(s) in slot 421 (164.14 kB).
 INFO sim_cli::events: Slot 422 has begun.
Error: Missing input block 180-57-0

Steps to reproduce

Tag: leios-2025w11
Configuration: config.json
Network: network.json
Command line:

sim-cli --parameters config.json network.json --slots 600 sim.log

SupernaviX · 2025-03-15T03:30:56Z

Looking into it, this can happen when there's enough traffic in the system that:

- an IB has reached a certified EB
- the body of that IB hasn't reached every node
- a node which doesn't have that body, but does have that certified EB, is allowed to mint an RB

I think the issue will go away if you switch ib-diffusion-strategy to oldest-first or peer-order, since then nodes will request older IBs from their backlog. And you might not be able to reproduce it next week anyway, because the new eb-max-age-slots setting makes us less likely to consider the older EBs which would have this problem. But I think it's a legitimate problem with the system being simulated.

What should the protocol do when a node has an endorsed certificate, but not all of the IBs it references? Should it just not attach a cert to that RB?

Quantumplation · 2025-03-15T03:55:00Z

Probably an EB shouldn't be considered endorsed until the node has seen all of the IBs it references?

SupernaviX · 2025-03-16T14:15:37Z

oh that makes sense, threw that in.

Saizan · 2025-03-17T07:26:15Z

Possession of the IB body should only be relevant for Voting for the EB that references it, and for reconstructing the ledger state of an RB that references the EB?

I think you should be able to "trust" certificates on their own while generating RBs, the same as when validating them.

will-break-it · 2025-03-17T08:57:00Z

I share @Saizan's understanding of the protocol. Isn't the certificate proof that the majority of stake holders have seen & validated these IBs?

pagio · 2025-03-17T13:05:47Z

> Isn't the certificate proof that the majority of stake holders have seen & validated these IBs?

A certificate ensures that at least one honest node has seen & validated the referenced IBs.

> Possession of the IB body should only be relevant for Voting for the EB that references it, and for reconstructing the ledger state of an RB that references the EB?

Here things start to get tricky. E.g., in one of the ledger design proposals (reward accounts), the RB producer should include txs that satisfy some property w.r.t. the IBs referenced by the EB in the RB.

Assuming our network assumptions hold, Short Leios should guarantee that all IBs inside certified EBs are delivered by the end of the respective deliver 2 phase. Given that this may not be the case currently, as we are still exploring the networking part of the protocol, I would suggest not including an EB in an RB if any IB it references is missing. Instead, include an older EB, or no EB if no valid EB satisfies the criteria outlined.
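The fallback suggested above could be sketched roughly as follows. This is a minimal illustration, not code from the simulator: `CertifiedEb`, `IbId`, and `choose_eb_for_rb` are hypothetical names, and the sketch assumes the candidates arrive already sorted in whatever preference order the node uses.

```rust
use std::collections::HashSet;

/// Hypothetical identifier for an input block (not the simulator's real type).
type IbId = String;

/// Hypothetical view of a certified endorser block.
#[derive(Debug, Clone, PartialEq)]
struct CertifiedEb {
    slot: u64,
    /// Ids of the IBs this EB references.
    ibs: Vec<IbId>,
}

/// Walk the candidates in the caller's preference order and return the first
/// EB whose referenced IB bodies are all held locally. Returns `None` when no
/// candidate qualifies, in which case the RB would carry no EB certificate.
fn choose_eb_for_rb<'a>(
    candidates: &'a [CertifiedEb],
    local_ibs: &HashSet<IbId>,
) -> Option<&'a CertifiedEb> {
    candidates
        .iter()
        .find(|eb| eb.ibs.iter().all(|ib| local_ibs.contains(ib)))
}
```

With this shape, a node that is missing one body simply skips that EB instead of failing, which is the behavior being proposed for the case where diffusion has fallen behind.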

SupernaviX · 2025-03-17T13:30:51Z

I think that the freshest first strategy for IBs, combined with the oldest-first strategy when choosing an endorsed EB, is what leads to this. FF means that with a constant stream of newer IBs, the node will never prioritize downloading older ones. Oldest-EB-first means that these older IBs are needed more urgently than newer IBs.
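A toy model of that starvation effect, under loudly stated assumptions: time advances in ticks, a fixed number of IBs is announced per tick, and the node can download a fixed number of bodies per tick. `Strategy` and `tick_when_oldest_fetched` are hypothetical names for illustration and do not correspond to the simulator's code.

```rust
/// Hypothetical download orderings, loosely mirroring the strategies named in the thread.
#[derive(Clone, Copy)]
enum Strategy {
    FreshestFirst,
    OldestFirst,
}

/// Each tick, `arrivals` new IBs are announced (ids increase over time) and the
/// node downloads up to `capacity` bodies. Returns the tick at which IB 0, the
/// oldest one, is downloaded, or `None` if that never happens within `ticks`.
fn tick_when_oldest_fetched(
    strategy: Strategy,
    arrivals: u64,
    capacity: usize,
    ticks: u64,
) -> Option<u64> {
    // Pending ids, kept in ascending order: front is oldest, back is freshest.
    let mut pending: Vec<u64> = Vec::new();
    let mut next_id = 0u64;
    for tick in 0..ticks {
        for _ in 0..arrivals {
            pending.push(next_id);
            next_id += 1;
        }
        for _ in 0..capacity {
            let fetched = match strategy {
                Strategy::FreshestFirst => pending.pop(),
                Strategy::OldestFirst => {
                    if pending.is_empty() {
                        None
                    } else {
                        Some(pending.remove(0))
                    }
                }
            };
            if fetched == Some(0) {
                return Some(tick);
            }
        }
    }
    None
}
```

In this model, with 3 arrivals and capacity 2 per tick, freshest-first never reaches IB 0 no matter how long it runs, while oldest-first fetches it immediately. When capacity exceeds arrivals, freshest-first also reaches it.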

> Assuming our network assumptions hold, Short Leios should guarantee that all IBs inside certified EBs are delivered by the end of the respective deliver 2 phase.

This assumption will be broken if IB generation is fast enough to overwhelm the network. And the speed/resilience of the network isn't under our control IRL, so it makes sense to define the behavior for when the assumption breaks.

The sims are using 1 MB/s bandwidth on each link, and generating ~10 IBs per second where each IB body is ~150 kB. I think it's correct for a node's set of IBs to "fall behind" when it's fetching them from a single peer, and we generate more bytes of IB than fit in the link with that peer.
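The arithmetic behind that claim, as a small sketch. It assumes 1 MB = 1000 kB and that all IB bodies arrive over a single link; the function name is hypothetical.

```rust
/// Net backlog growth on one link, in kB per second.
/// A positive result means bodies are generated faster than the link can carry them.
fn backlog_growth_kb_per_sec(ib_rate_per_sec: f64, ib_size_kb: f64, link_kb_per_sec: f64) -> f64 {
    ib_rate_per_sec * ib_size_kb - link_kb_per_sec
}
```

Plugging in the figures quoted above, `backlog_growth_kb_per_sec(10.0, 150.0, 1000.0)` gives 500 kB/s: roughly 1.5 MB/s of IB bodies offered to a 1 MB/s link, so a node limited to that one link falls behind by about three IBs every second.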

Saizan · 2025-03-17T13:54:33Z

> I think it's correct for a node's set of IBs to "fall behind" when it's fetching them from a single peer, and we generate more bytes of IB than fit in the link with that peer.

Should it not be fetching them from more than one peer? Haskell sim interprets RequestFromFirst as requesting the body at the first opportunity and not requesting the same body from other peers. It doesn't force every body to be requested from the same upstream peer, and in practice I would expect some spread, though I haven't measured this.

I don't disagree IB diffusion can fall behind in general.

SupernaviX · 2025-03-17T14:08:12Z

> Should it not be fetching them from more than one peer? Haskell sim interprets RequestFromFirst as requesting the body at the first opportunity and not requesting the same body from other peers. It doesn't force every body to be requested from the same upstream peer, and in practice I would expect some spread, though I haven't measured this.

The Rust sim interprets it as requesting the body from the first peer which announces it (and RequestFromAll as requesting from each peer which announces it). I could make that more sophisticated: request from the first peer which announces it and has capacity to send it. Not sure if that's what Haskell is doing.

Quantumplation · 2025-03-17T14:53:26Z

> I think that the freshest first strategy for IBs, combined with the oldest-first strategy when choosing an endorsed EB, is what leads to this. FF means that with a constant stream of newer IBs, the node will never prioritize downloading older ones. Oldest-EB-first means that these older IBs are needed more urgently than newer IBs.

Perhaps we need an "endorsed-EB-then-freshest-first" strategy for IBs; i.e. first download IBs for EBs that have reached the vote threshold (in anticipation of including them in an RB), and then fetch freshest first.
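As a purely illustrative sketch of what such an ordering could look like (not an endorsement of the strategy, and not the simulator's code): `PendingIb`, its fields, and `order_downloads` are hypothetical, and freshness is assumed to be measured by slot number.

```rust
use std::cmp::Reverse;

/// Hypothetical record for an IB that has been announced but not yet downloaded.
#[derive(Debug, Clone, PartialEq)]
struct PendingIb {
    id: u32,
    slot: u64,
    /// True if some EB that reached the vote threshold references this IB.
    in_endorsed_eb: bool,
}

/// Order the download queue so that IBs referenced by endorsed EBs come first,
/// and within each group the freshest (highest slot) comes first.
fn order_downloads(pending: &mut [PendingIb]) {
    pending.sort_by_key(|ib| (Reverse(ib.in_endorsed_eb), Reverse(ib.slot)));
}
```

An old IB referenced by an endorsed EB would then jump ahead of newer unreferenced IBs, which is exactly the change in priority that would need a security review before adoption.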

pagio · 2025-03-17T15:01:56Z

> Perhaps we need an "endorsed-EB-then-freshest-first" strategy for IBs; i.e. first download IBs for EBs that have reached the vote threshold (in anticipation of including them in an RB), and then fetch freshest first.

Changing the IB download strategy may introduce security issues and should not be done lightly. The idea is that the IB rate is set so that, with the freshest-first policy, IBs are delivered within some known window (most of the time) if released on time. If that is not the case, we should rethink the IB rate.

Of course, since sometimes for probabilistic reasons this may not happen, we should consider how the node should handle this situation.

Saizan · 2025-03-17T15:53:32Z

> > Should it not be fetching them from more than one peer? Haskell sim interprets RequestFromFirst as requesting the body at the first opportunity and not requesting the same body from other peers. It doesn't force every body to be requested from the same upstream peer, and in practice I would expect some spread, though I haven't measured this.

> The Rust sim interprets it as requesting the body from the first peer which announces it (and RequestFromAll as requesting from each peer which announces it). I could make that more sophisticated: request from the first peer which announces it and has capacity to send it. Not sure if that's what Haskell is doing.

The Haskell node is not evaluating the state of the sender, but since a node has different threads consuming IBs (or votes or EBs) from different peers, those threads just race to be the first to request a particular body (there isn't a global view of the received announcements either; each thread only knows about its own peer). As a thread requests a body, it signals to the others not to.

Out of a 200s sim I could collect download counts like so:

[15,38,49,50,57,74,76,85,87,111,122,125,131]
[6,29,56,62,67,118,119,231,335]
[8,56,74,81,97,135,162,195,208]
[104,119,165,287,347]
[54,56,75,104,120,127,214,268]
[61,96,114,129,140,196,285]
[10,42,51,52,59,60,61,64,80,83,102,106,121,133]
[25,30,74,86,108,132,169,181,215]
[56,82,126,175,182,195,205]
[16,90,133,162,256,358]
...

Each row is a different node; each element in the array is how many bodies the node got from the peer with that index.

SupernaviX · 2025-03-17T19:37:53Z

On closer inspection, that matches what the Rust sim is doing. Each node tracks which IBs a given peer has announced, requests one IB at a time from each peer, and doesn't request the same IB from two peers at once.
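The bookkeeping described above could look something like the sketch below. It is a simplified illustration under stated assumptions, not the simulator's implementation: `IbFetcher` and its methods are hypothetical, the most recently announced IB is tried first, and there is no handling of failed or timed-out requests.

```rust
use std::collections::{HashMap, HashSet};

type PeerId = u32;
type IbId = u32;

/// Hypothetical per-node state for fetching IB bodies.
#[derive(Default)]
struct IbFetcher {
    /// IBs each peer has announced that we have not yet tried to request from it.
    announced: HashMap<PeerId, Vec<IbId>>,
    /// At most one outstanding request per peer.
    in_flight: HashMap<PeerId, IbId>,
    /// IBs already requested from some peer, so no second peer is asked for them.
    claimed: HashSet<IbId>,
}

impl IbFetcher {
    /// Record an announcement; returns the IB to request from `peer` now, if any.
    fn on_announce(&mut self, peer: PeerId, ib: IbId) -> Option<IbId> {
        self.announced.entry(peer).or_default().push(ib);
        self.try_request(peer)
    }

    /// The body requested from `peer` arrived; returns the next IB to request from it.
    fn on_received(&mut self, peer: PeerId) -> Option<IbId> {
        self.in_flight.remove(&peer);
        self.try_request(peer)
    }

    fn try_request(&mut self, peer: PeerId) -> Option<IbId> {
        // One IB at a time from each peer.
        if self.in_flight.contains_key(&peer) {
            return None;
        }
        let queue = self.announced.get_mut(&peer)?;
        // Try the most recently announced IB first, skipping any already claimed.
        while let Some(ib) = queue.pop() {
            if self.claimed.insert(ib) {
                self.in_flight.insert(peer, ib);
                return Some(ib);
            }
        }
        None
    }
}
```

For example, if peers 1 and 2 both announce IB 100, only peer 1 is asked for it; a later IB 101 announced by both then goes to peer 2 while peer 1 is still busy, which produces the kind of spread across peers shown in the download counts.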

bwbush · 2025-03-19T18:29:20Z

I haven't encountered this in tag leios-2025w12. @SupernaviX, please close this issue if you think the discussion is complete. Thanks!

SupernaviX · 2025-03-19T20:57:50Z

I've fixed the error, but I still want to better understand why IBs aren't propagating quickly enough. I suspect the root cause is going to be similar to another open issue, and I'll close this one when I can narrow it down to one of those.

bwbush added the bug label Mar 14, 2025

bwbush assigned SupernaviX Mar 14, 2025
