TQ: Introduce tqdb #8801
base: tq-alarms
Conversation
This PR introduces a new test tool for the trust-quorum protocol: tqdb. tqdb is a REPL that takes event traces produced by the `cluster` proptest and uses them for deterministic replay of actions against test state. The test state includes a "universe" of real protocol nodes, a fake Nexus, and fake networks. The proptest and the debugger share this state, which lives in the `trust-quorum-test-utils` crate. The debugger supports stepping through individual events, setting breakpoints, snapshotting and diffing states, and viewing the event log itself.

The purpose of tqdb is twofold:

1. Allow debugging of failed proptests. This is non-trivial in some cases, even with shrunken tests, because the generated actions are high-level and are all generated up front. The actual operations, such as reconfigurations, are derived from these high-level random generations in conjunction with the current state of the system. Therefore the set of failing generated actions doesn't really tell you much: you have to look at the logs and the assertion that fired, and reason about it with incomplete information. Now, for each concrete action taken, we record the event in a log. In the case of a failure, an event log can be loaded into tqdb with a breakpoint set right before the failure. A snapshot of the state can be taken, and then the failing event can be applied. The diff will tell you what changed and allow you to inspect the actual state of the system. Full visibility into your failure is now possible.

2. The trust quorum protocol is non-trivial. tqdb allows developers to see in detail how the protocol behaves and to understand what is happening in particular situations. Event logs can be created by hand (or by script) for especially interesting scenarios and then run through tqdb.
In order to get the diff functionality to work as I wanted, I had to implement `Eq` for types that implement `subtle::ConstantTimeEq` in both the `gfss` (our secret sharing library) and `trust-quorum` crates. However, it is unknown whether the compiler could break the constant-time guarantees through these implementations. Therefore, a feature flag was added so that only the `test-utils` and `tqdb` crates can use them; they are not used in the production codebase. Feature unification is not at play here because neither `test-utils` nor `tqdb` is part of the product.
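To make the shape of that feature gate concrete, here is a minimal, hypothetical sketch. It does not use the real `subtle` crate or the PR's actual feature name: `CtEq` is a local stand-in for `subtle::ConstantTimeEq`, and the feature name is invented for illustration.

```rust
/// Stand-in for `subtle::ConstantTimeEq` (NOT the real trait): compares
/// without early exit so every byte is always examined.
trait CtEq {
    fn ct_eq(&self, other: &Self) -> bool;
}

/// A secret share whose comparison must not leak timing information.
#[derive(Clone)]
struct Share(Vec<u8>);

impl CtEq for Share {
    fn ct_eq(&self, other: &Self) -> bool {
        if self.0.len() != other.0.len() {
            return false;
        }
        // Accumulate differences instead of returning on the first mismatch.
        let mut acc = 0u8;
        for (a, b) in self.0.iter().zip(other.0.iter()) {
            acc |= a ^ b;
        }
        acc == 0
    }
}

// Only test tooling enables this feature; production builds never see `Eq`.
// (The feature name here is illustrative, not the one used in the PR.)
#[cfg(feature = "testing-eq-impls")]
impl PartialEq for Share {
    fn eq(&self, other: &Self) -> bool {
        self.ct_eq(other)
    }
}
#[cfg(feature = "testing-eq-impls")]
impl Eq for Share {}

fn main() {
    let a = Share(vec![1, 2, 3]);
    let b = Share(vec![1, 2, 3]);
    assert!(a.ct_eq(&b));
    assert!(!a.ct_eq(&Share(vec![3, 2, 1])));
}
```

Because the gated `impl`s simply delegate to the constant-time comparison, equality semantics stay consistent with the production code path; the open question the PR flags is only whether the optimizer preserves the timing properties through the extra layer.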
Here's a sample of what the tool can do, utilizing the example event log.
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Nexus related types for trust-quorum testing
Everything in this file was moved from the proptest in cluster.rs
// The state of our entire system including the system under test and
// test specific infrastructure.
#[derive(Debug, Clone, Diffable)]
pub struct TqState {
This type was moved from the cluster proptest. No changes were made.
impl TqState {
    pub fn new(log: Logger) -> TqState {
        // We'll fill this in when applying the initial_config
        let sut = Sut::empty();
        let member_universe = vec![];
        TqState {
            log,
            sut,
            bootstrap_network: BTreeMap::new(),
            underlay_network: Vec::new(),
            nexus: NexusState::new(),
            member_universe,
            faults: Faults::default(),
            all_coordinated_configs: IdOrdMap::new(),
            expunged: BTreeSet::new(),
        }
    }
    /// Send the latest `ReconfigureMsg` from `Nexus` to the coordinator node
    ///
    /// If the node is not available, then abort the configuration at nexus
    pub fn send_reconfigure_msg(&mut self) {
        let (coordinator, msg) = self.nexus.reconfigure_msg_for_latest_config();
        let epoch_to_config = msg.epoch;
        if self.faults.crashed_nodes.contains(coordinator) {
            // We must abort the configuration. This mimics a timeout.
            self.nexus.abort_reconfiguration();
        } else {
            let (node, ctx) = self
                .sut
                .nodes
                .get_mut(coordinator)
                .expect("coordinator exists");

            node.coordinate_reconfiguration(ctx, msg)
                .expect("valid configuration");

            // Do we have a `Configuration` for this epoch yet?
            //
            // For most reconfigurations, shares for the last committed
            // configuration must be retrieved before the configuration is
            // generated and saved in the persistent state.
            let latest_persisted_config =
                ctx.persistent_state().latest_config().expect("config exists");
            if latest_persisted_config.epoch == epoch_to_config {
                // Save the configuration for later
                self.all_coordinated_configs
                    .insert_unique(latest_persisted_config.clone())
                    .expect("unique");
            }
        }
    }
This was moved. No changes.
    /// Check postcondition assertions after initial configuration
    pub fn postcondition_initial_configuration(&mut self) {
        let (coordinator, msg) = self.nexus.reconfigure_msg_for_latest_config();

        // The coordinator should have received the `ReconfigureMsg` from Nexus
        if !self.faults.crashed_nodes.contains(coordinator) {
            let (node, ctx) = self
                .sut
                .nodes
                .get_mut(coordinator)
                .expect("coordinator exists");
            let mut connected_members = 0;
            // The coordinator should start preparing by sending a `PrepareMsg`
            // to all connected nodes in the membership set.
            for member in
                msg.members.iter().filter(|&id| id != coordinator).cloned()
            {
                if self.faults.is_connected(coordinator.clone(), member.clone())
                {
                    connected_members += 1;
                    let msg_found = ctx.envelopes().any(|envelope| {
                        envelope.to == member
                            && envelope.from == *coordinator
                            && matches!(
                                envelope.msg.kind,
                                PeerMsgKind::Prepare { .. }
                            )
                    });
                    assert!(msg_found);
                }
            }
            assert_eq!(connected_members, ctx.envelopes().count());

            // The coordinator should be in the prepare phase
            let cs = node.get_coordinator_state().expect("is coordinating");
            assert!(matches!(cs.op(), CoordinatorOperation::Prepare { .. }));

            // The persistent state should have changed
            assert!(ctx.persistent_state_change_check_and_reset());
            assert!(ctx.persistent_state().has_prepared(msg.epoch));
            assert!(ctx.persistent_state().latest_committed_epoch().is_none());
        }
    }
    /// Put any outgoing coordinator messages from the latest configuration
    /// on the wire
    pub fn send_envelopes_from_coordinator(&mut self) {
        let coordinator = {
            let (coordinator, _) =
                self.nexus.reconfigure_msg_for_latest_config();
            coordinator.clone()
        };
        self.send_envelopes_from(&coordinator);
    }

    pub fn send_envelopes_from(&mut self, id: &PlatformId) {
        let (_, ctx) = self.sut.nodes.get_mut(id).expect("node exists");
        for envelope in ctx.drain_envelopes() {
            let msgs =
                self.bootstrap_network.entry(envelope.to.clone()).or_default();
            msgs.push(envelope);
This was all moved as well. GitHub select when scrolling does not seem to be working as I'd hoped.
        }
    }
    pub fn apply_event(&mut self, event: Event) {
The logic for all the apply functions was largely just moved, but it is now called based on events instead of being inlined in the action handlers in the test.
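The event-dispatch-plus-breakpoint pattern described above can be sketched as follows. This is a toy illustration, not the PR's actual `Event` or `TqState` types: the variants, fields, and `replay_until` helper are all invented for this example.

```rust
// Toy sketch of event-driven deterministic replay with a breakpoint,
// in the style a debugger like tqdb might drive it. All names here are
// hypothetical stand-ins for the real trust-quorum test types.

#[derive(Debug, Clone)]
enum Event {
    InitialSetup { num_nodes: usize },
    Reconfigure { epoch: u64 },
    DeliverEnvelopes,
}

#[derive(Debug, Default)]
struct State {
    nodes: usize,
    epoch: u64,
    delivered: u64,
}

impl State {
    // Each event maps to exactly one deterministic mutation, mirroring how
    // the test's action handlers were converted into apply functions.
    fn apply_event(&mut self, event: &Event) {
        match event {
            Event::InitialSetup { num_nodes } => self.nodes = *num_nodes,
            Event::Reconfigure { epoch } => self.epoch = *epoch,
            Event::DeliverEnvelopes => self.delivered += 1,
        }
    }
}

/// Replay events up to (but not including) `breakpoint`, yielding the state
/// just before the interesting event so it can be snapshotted and diffed.
fn replay_until(events: &[Event], breakpoint: usize) -> State {
    let mut state = State::default();
    for event in &events[..breakpoint] {
        state.apply_event(event);
    }
    state
}

fn main() {
    let log = vec![
        Event::InitialSetup { num_nodes: 4 },
        Event::Reconfigure { epoch: 1 },
        Event::DeliverEnvelopes,
    ];
    // Stop just before the third event; snapshot, then apply it and diff.
    let before = replay_until(&log, 2);
    assert_eq!(before.epoch, 1);
    assert_eq!(before.delivered, 0);
}
```

Because every mutation flows through `apply_event`, replaying the same serialized log always reproduces the same state, which is what makes the snapshot-then-diff workflow deterministic.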
/// Broken out of `TqState` to alleviate borrow checker woes
fn send_envelopes(
    ctx: &mut NodeCtx,
    bootstrap_network: &mut BTreeMap<PlatformId, Vec<Envelope>>,
) {
    for envelope in ctx.drain_envelopes() {
        let envelopes =
            bootstrap_network.entry(envelope.to.clone()).or_default();
        envelopes.push(envelope);
    }
}
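The "break it out of the impl" trick above relies on the borrow checker splitting disjoint field borrows at a call site, which it cannot do across a `&mut self` method call. A minimal self-contained illustration (with hypothetical `Ctx`/`TestState` stand-ins, not the PR's types):

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins to show the borrow-splitting pattern; the real
// code uses `NodeCtx` and `TqState`.
struct Ctx {
    outbox: Vec<(String, u32)>, // (destination, payload)
}

struct TestState {
    ctx: Ctx,
    network: BTreeMap<String, Vec<u32>>,
}

// A free function taking the two fields separately: no borrow of the whole
// struct, so both can be mutated simultaneously.
fn send_envelopes(ctx: &mut Ctx, network: &mut BTreeMap<String, Vec<u32>>) {
    for (to, payload) in ctx.outbox.drain(..) {
        network.entry(to).or_default().push(payload);
    }
}

impl TestState {
    fn deliver(&mut self) {
        // `self.ctx` and `self.network` are disjoint fields, so the
        // compiler accepts two simultaneous `&mut` borrows here. Routing
        // this through a `&mut self` helper method instead would borrow
        // all of `self` and fail to compile.
        send_envelopes(&mut self.ctx, &mut self.network);
    }
}

fn main() {
    let mut st = TestState {
        ctx: Ctx { outbox: vec![("a".to_string(), 1), ("a".to_string(), 2)] },
        network: BTreeMap::new(),
    };
    st.deliver();
    assert_eq!(st.network["a"], vec![1, 2]);
    assert!(st.ctx.outbox.is_empty());
}
```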
/// The system under test
///
/// This is our real code.
#[derive(Debug, Clone, Diffable)]
pub struct Sut {
    /// All nodes in the member universe
    pub nodes: BTreeMap<PlatformId, (Node, NodeCtx)>,
}

impl Sut {
    pub fn empty() -> Sut {
        Sut { nodes: BTreeMap::new() }
    }

    pub fn new(log: &Logger, universe: Vec<PlatformId>) -> Sut {
        let nodes = universe
            .into_iter()
            .map(|id| {
                let mut ctx = NodeCtx::new(id.clone());
                let node = Node::new(log, &mut ctx);
                (id, (node, ctx))
            })
            .collect();
        Sut { nodes }
    }
}
/// Faults in our system. It's useful to keep these self contained and not
/// in separate fields in `TestState` so that we can access them all at once
/// independently of other `TestState` fields.
#[derive(Default, Debug, Clone, Diffable)]
pub struct Faults {
    // We allow nodes to crash and restart and therefore track crashed nodes
    // here.
    //
    // A crashed node is implicitly disconnected from every other node. We
    // don't bother storing the pairs in `disconnected_nodes`, but instead
    // check both fields when necessary.
    pub crashed_nodes: BTreeSet<PlatformId>,

    /// The set of disconnected nodes
    pub disconnected_nodes: DisconnectedNodes,
}

impl Faults {
    pub fn is_connected(&self, node1: PlatformId, node2: PlatformId) -> bool {
        !self.crashed_nodes.contains(&node1)
            && !self.crashed_nodes.contains(&node2)
            && !self.disconnected_nodes.contains(node1, node2)
    }
}
All moved.
/// For cardinality purposes, we assume all nodes are connected and
/// explicitly disconnect some of them. This allows us to track and compare
/// much less data.
#[derive(Default, Debug, Clone, Diffable)]
pub struct DisconnectedNodes {
    // We sort each pair on insert for quick lookups
    pairs: BTreeSet<(PlatformId, PlatformId)>,
}

impl DisconnectedNodes {
    // Return true if the pair is newly inserted
    pub fn insert(&mut self, node1: PlatformId, node2: PlatformId) -> bool {
        assert_ne!(node1, node2);

        let pair = if node1 < node2 { (node1, node2) } else { (node2, node1) };
        self.pairs.insert(pair)
    }

    // Return true if the pair of nodes is disconnected, false otherwise
    pub fn contains(&self, node1: PlatformId, node2: PlatformId) -> bool {
        assert_ne!(node1, node2);
        let pair = if node1 < node2 { (node1, node2) } else { (node2, node1) };
        self.pairs.contains(&pair)
    }
}
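The sorted-pair normalization in `DisconnectedNodes` can be shown in isolation. This standalone sketch uses `u32` stand-ins for `PlatformId` and an invented `norm` helper; the point is that canonicalizing the order before touching the set makes `(a, b)` and `(b, a)` hit the same entry.

```rust
use std::collections::BTreeSet;

// Canonicalize an unordered pair so lookups are order-independent.
// (`u32` stands in for `PlatformId`; `norm` is illustrative, not real code.)
fn norm(a: u32, b: u32) -> (u32, u32) {
    assert_ne!(a, b);
    if a < b { (a, b) } else { (b, a) }
}

fn main() {
    let mut pairs: BTreeSet<(u32, u32)> = BTreeSet::new();
    assert!(pairs.insert(norm(3, 1))); // stored canonically as (1, 3)
    assert!(!pairs.insert(norm(1, 3))); // same pair in the other order
    assert!(pairs.contains(&norm(3, 1)));
    assert!(!pairs.contains(&norm(2, 3)));
}
```

Storing only one canonical orientation halves the data tracked and keeps membership checks a single `BTreeSet` lookup, which is what the "track and compare much less data" comment is getting at.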
Moved
    }
}

/*****************************************************************************
This is really the start of the new code for this file. I wouldn't worry much about everything above here.
I tried to make the diff output useful for a human without being overwhelming, and of course it is tedious to translate. I'd love a library that could translate daft diffs into reasonable output, but I have a feeling that's not really possible in an automated way, outside of some LLM hackery in the loop. I had to think judiciously about what should go where.
This PR builds on #8753
This is a hefty PR, but it's not as bad as it looks. Over 4k lines of it is in the example log file in the second commit. There's also some moved and unmodified code that I'll point out.