
TQ: Add support for alarms in the protocol #8753


Open: wants to merge 3 commits into base `tq-reconfigure`

Conversation

andrewjstone
Contributor

This builds on #8741

An alarm represents a protocol invariant violation. It's unclear exactly what should be done about these other than recording them and allowing them to be reported upstack, which is what is done in this PR. An argument could be made for "freezing" the state machine such that trust quorum nodes stop working and the only thing they can do is report alarm status. However, that would block the trust quorum from operating at all, and it's unclear if this should cause an outage on that node.

I'm also somewhat hesitant to put the alarms into the persistent state as that would prevent unlock in the case of a sled/rack reboot.

On the flip side, just recording leaves the possible danger of operating with an invariant violation. This could be risky, and since we shouldn't ever see these violations, perhaps pausing for a support call is the right thing. TBD, once more work is done on the protocol.
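The recording-only approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's actual types: `Alarm`, `Node`, and the variant name are all hypothetical.

```rust
// Minimal sketch (illustrative types): alarms are recorded in volatile
// state only, so a sled/rack reboot does not carry them forward and
// block unlock. The variant below is hypothetical.

#[derive(Debug, Clone, PartialEq)]
enum Alarm {
    // Hypothetical example of a protocol invariant violation.
    MismatchedShareDigest { epoch: u64 },
}

#[derive(Default)]
struct Node {
    // Deliberately not part of the node's persistent state.
    alarms: Vec<Alarm>,
}

impl Node {
    // Record the violation and keep operating, rather than freezing
    // the state machine.
    fn raise_alarm(&mut self, alarm: Alarm) {
        self.alarms.push(alarm);
    }

    // Let alarms be reported upstack.
    fn alarm_status(&self) -> &[Alarm] {
        &self.alarms
    }
}

fn main() {
    let mut node = Node::default();
    node.raise_alarm(Alarm::MismatchedShareDigest { epoch: 3 });
    // The node still operates; the alarm is merely reported.
    assert_eq!(node.alarm_status().len(), 1);
    println!("alarms: {:?}", node.alarm_status());
}
```

The design choice sketched here mirrors the tradeoff in the description: the node never stops serving, and the only observable effect of an alarm is its presence in the reported status.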

It's not actually an error to receive a `CommitAdvance` while
coordinating for the same epoch. The `GetShare` from the coordinator
could have been delayed in the network, and the node that received it
already committed before the coordinator knew it was done preparing. In
essence, the following would happen:

1. The coordinator would send GetShare requests for the prior epoch
2. Enough nodes would reply so that the coordinator would start sending
prepares.
3. Enough nodes would ack prepares to commit
4. Nexus would poll and send commits. Other nodes would get those
commits, but not the coordinator
5. A node that hadn't yet received the `GetShare` would get
a `CommitAdvance` or see the `Commit` from nexus, get its
configuration, recompute its own share, and commit. It may have been
a prior coordinator with delayed deliveries to other nodes of `GetShare`
messages.
6. The node that just committed finally receives the `GetShare` and
sends back a `CommitAdvance` to the coordinator

This is all valid, and is similar to a proptest counterexample that was found.
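The benign case above can be sketched as a small decision in the coordinator: a `CommitAdvance` for the epoch it is coordinating means the configuration already committed elsewhere, so the right response is to stop coordinating and commit, not to raise an alarm. The types and method names here are illustrative assumptions, not the actual implementation.

```rust
// Minimal sketch (illustrative types): receiving `CommitAdvance` while
// coordinating the same epoch just means the configuration already
// committed elsewhere, so the coordinator stops coordinating and
// commits locally instead of raising an alarm.

#[derive(Debug, PartialEq)]
enum Outcome {
    // The epoch committed without us; adopt the commit.
    AdvanceAndCommit,
    // A `CommitAdvance` for an older epoch carries no new information.
    IgnoreStale,
}

struct Coordinator {
    epoch: u64,
}

impl Coordinator {
    fn on_commit_advance(&self, msg_epoch: u64) -> Outcome {
        if msg_epoch >= self.epoch {
            // Valid: a delayed `GetShare` reply can arrive after the
            // replying node already committed this epoch (steps 5-6).
            Outcome::AdvanceAndCommit
        } else {
            Outcome::IgnoreStale
        }
    }
}

fn main() {
    let c = Coordinator { epoch: 5 };
    // Same-epoch `CommitAdvance` is not an error.
    assert_eq!(c.on_commit_advance(5), Outcome::AdvanceAndCommit);
    assert_eq!(c.on_commit_advance(4), Outcome::IgnoreStale);
}
```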