
TQ: Add support for alarms in the protocol #8753


Open: wants to merge 3 commits into base `tq-reconfigure`

Conversation

andrewjstone
Contributor

This builds on #8741

An alarm represents a protocol invariant violation. It's unclear exactly what should be done about these other than recording them and allowing them to be reported upstack, which is what is done in this PR. An argument could be made for "freezing" the state machine such that trust quorum nodes stop working and the only thing they can do is report alarm status. However, that would block the trust quorum from operating at all, and it's unclear if this should cause an outage on that node.

I'm also somewhat hesitant to put the alarms into the persistent state as that would prevent unlock in the case of a sled/rack reboot.

On the flip side, just recording leaves the possible danger of operating with an invariant violation. This could be risky, and since we shouldn't ever see these violations, perhaps pausing for a support call is the right thing. TBD, once more work is done on the protocol.
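The recording-only approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's actual types: `Alarm`, `Node`, and the variant name are all hypothetical.

```rust
// Minimal sketch (illustrative types): alarms are recorded in volatile
// state only, so a sled/rack reboot does not carry them forward and
// block unlock. The variant below is hypothetical.

#[derive(Debug, Clone, PartialEq)]
enum Alarm {
    // Hypothetical example of a protocol invariant violation.
    MismatchedShareDigest { epoch: u64 },
}

#[derive(Default)]
struct Node {
    // Deliberately not part of the node's persistent state.
    alarms: Vec<Alarm>,
}

impl Node {
    // Record the violation and keep operating, rather than freezing
    // the state machine.
    fn raise_alarm(&mut self, alarm: Alarm) {
        self.alarms.push(alarm);
    }

    // Let alarms be reported upstack.
    fn alarm_status(&self) -> &[Alarm] {
        &self.alarms
    }
}

fn main() {
    let mut node = Node::default();
    node.raise_alarm(Alarm::MismatchedShareDigest { epoch: 3 });
    // The node still operates; the alarm is merely reported.
    assert_eq!(node.alarm_status().len(), 1);
    println!("alarms: {:?}", node.alarm_status());
}
```

The design choice sketched here mirrors the tradeoff in the description: the node never stops serving, and the only observable effect of an alarm is its presence in the reported status.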

It's not actually an error to receive a `CommitAdvance` while
coordinating for the same epoch. The `GetShare` from the coordinator
could have been delayed in the network, and the node that received it
already committed before the coordinator knew it was done preparing. In
essence, the following would happen:

1. The coordinator would send GetShare requests for the prior epoch
2. Enough nodes would reply so that the coordinator would start sending
prepares.
3. Enough nodes would ack prepares to commit
4. Nexus would poll and send commits. Other nodes would get those
commits, but not the coordinator
5. A node that hadn't yet received the `GetShare` would get
a `CommitAdvance` or see the `Commit` from nexus, get its
configuration, recompute its own share, and commit. It may have been
a prior coordinator with delayed deliveries to other nodes of `GetShare`
messages.
6. The node that just committed finally receives the `GetShare` and
sends back a `CommitAdvance` to the coordinator

This is all valid, and is similar to a proptest counterexample that was found.
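The benign case above can be sketched as a small decision in the coordinator: a `CommitAdvance` for the epoch it is coordinating means the configuration already committed elsewhere, so the right response is to stop coordinating and commit, not to raise an alarm. The types and method names here are illustrative assumptions, not the actual implementation.

```rust
// Minimal sketch (illustrative types): receiving `CommitAdvance` while
// coordinating the same epoch just means the configuration already
// committed elsewhere, so the coordinator stops coordinating and
// commits locally instead of raising an alarm.

#[derive(Debug, PartialEq)]
enum Outcome {
    // The epoch committed without us; adopt the commit.
    AdvanceAndCommit,
    // A `CommitAdvance` for an older epoch carries no new information.
    IgnoreStale,
}

struct Coordinator {
    epoch: u64,
}

impl Coordinator {
    fn on_commit_advance(&self, msg_epoch: u64) -> Outcome {
        if msg_epoch >= self.epoch {
            // Valid: a delayed `GetShare` reply can arrive after the
            // replying node already committed this epoch (steps 5-6).
            Outcome::AdvanceAndCommit
        } else {
            Outcome::IgnoreStale
        }
    }
}

fn main() {
    let c = Coordinator { epoch: 5 };
    // Same-epoch `CommitAdvance` is not an error.
    assert_eq!(c.on_commit_advance(5), Outcome::AdvanceAndCommit);
    assert_eq!(c.on_commit_advance(4), Outcome::IgnoreStale);
}
```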