TQ: Introduce tqdb #8801

Open
wants to merge 2 commits into
base: tq-alarms

@andrewjstone andrewjstone commented Aug 8, 2025

This PR builds on #8753

This is a hefty PR, but it's not as bad as it looks. Over 4k lines of it is in the example log file in the second commit. There's also some moved and unmodified code that I'll point out.

This PR introduces a new test tool for the trust-quorum protocol:
tqdb. tqdb is a REPL that takes event traces produced by the `cluster`
proptest and uses them for deterministic replay of actions against test
state.

The test state includes a "universe" of real protocol nodes, a fake
nexus, and fake networks. The proptest and debugging state is shared and
contained in the `trust-quorum-test-utils` crate.

The debugger supports stepping through individual events, setting
breakpoints, snapshotting and diffing states, and viewing the event log
itself.
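To make the replay model concrete, here is a minimal sketch of a debugger loop with breakpoints and pending snapshots. It is not tqdb's implementation: `Event`, `State`, and `Debugger` are toy stand-ins invented for illustration, and the choice to capture a snapshot just before the event at that index is applied is an assumption.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy event type (hypothetical); the real log holds protocol events.
#[derive(Debug)]
enum Event {
    Add(i64),
    Mul(i64),
}

/// Toy state that events are replayed against (hypothetical).
#[derive(Debug, Clone, Default, PartialEq, Eq)]
struct State {
    value: i64,
}

impl State {
    fn apply(&mut self, event: &Event) {
        match event {
            Event::Add(n) => self.value += n,
            Event::Mul(n) => self.value *= n,
        }
    }
}

struct Debugger {
    events: Vec<Event>,
    /// Index of the next event to apply.
    next: usize,
    state: State,
    breakpoints: BTreeSet<usize>,
    /// Event indexes at which a snapshot is taken once they are reached.
    pending_snapshots: BTreeSet<usize>,
    snapshots: BTreeMap<usize, State>,
}

impl Debugger {
    fn new(events: Vec<Event>) -> Self {
        Debugger {
            events,
            next: 0,
            state: State::default(),
            breakpoints: BTreeSet::new(),
            pending_snapshots: BTreeSet::new(),
            snapshots: BTreeMap::new(),
        }
    }

    /// Apply up to `n` events and return how many were applied.
    fn step(&mut self, n: usize) -> usize {
        let mut applied = 0;
        while applied < n && self.next < self.events.len() {
            // Assumption: a snapshot captures the state just before
            // the event at that index is applied.
            if self.pending_snapshots.remove(&self.next) {
                self.snapshots.insert(self.next, self.state.clone());
            }
            let event = &self.events[self.next];
            self.state.apply(event);
            self.next += 1;
            applied += 1;
        }
        applied
    }

    /// Apply events until the next breakpoint or the end of the log.
    /// A breakpoint at the current position is skipped so that `run`
    /// always makes progress.
    fn run(&mut self) -> usize {
        let mut applied = 0;
        while self.next < self.events.len() {
            if applied > 0 && self.breakpoints.contains(&self.next) {
                break;
            }
            applied += self.step(1);
        }
        applied
    }
}
```

Because replay is deterministic, stopping at a breakpoint and comparing the current state against a stored snapshot always yields the same result for the same log.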

The purpose of tqdb is twofold:

  1. Allow for debugging of failed proptests. This is non-trivial in
     some cases, even with shrunken tests, because the generated
     actions are high-level and are all generated up front. The actual
     operations, such as reconfigurations, are derived from these
     high-level random generations in conjunction with the current state
     of the system. Therefore the set of failing generated actions
     doesn't really tell you much. You have to look at the logs and
     the assertion that fired, and reason about the failure with
     incomplete information. Now, for each concrete action taken, we
     record the event in a log. In the case of a failure, an event log
     can be loaded into tqdb, with a breakpoint set right before the
     failure. A snapshot of the state can be taken, and then the failing
     event can be applied. The diff will tell you what changed and allow
     you to inspect the actual state of the system. Full visibility into
     your failure is now possible.

  2. The trust quorum protocol is non-trivial. tqdb allows developers
     to see in detail how the protocol behaves and understand what is
     happening in certain situations. Event logs can be created by hand
     (or by script) for particularly interesting scenarios and then run
     through tqdb.

In order to get the diff functionality to work as I wanted, I had to
implement `Eq` for types that implement `subtle::ConstantTimeEq` in
both the `gfss` (our secret sharing library) and `trust-quorum` crates.
However, it is unknown whether the compiler could break the
constant-time guarantees of these implementations. Therefore, a feature
flag was added such that only the `test-utils` and `tqdb` crates are
able to use them. They are not used in the production codebase. Feature
unification is not at play here because neither `test-utils` nor `tqdb`
is part of the product.
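A rough sketch of the gating pattern described above, under stated assumptions: `Secret` is a toy type, the feature name `danger-test-eq` is hypothetical (the real flag name is not given here), and the hand-rolled XOR loop only stands in for `subtle::ConstantTimeEq` so the example needs no dependencies. It does not carry the guarantees that `subtle` aims for.

```rust
/// Toy secret wrapper (hypothetical); stands in for types such as key shares.
pub struct Secret(Vec<u8>);

impl Secret {
    /// Compare all bytes without exiting early at the first mismatch.
    /// This is a stand-in for `subtle::ConstantTimeEq`, not a replacement.
    pub fn ct_eq(&self, other: &Secret) -> bool {
        if self.0.len() != other.0.len() {
            return false;
        }
        let mut acc = 0u8;
        for (a, b) in self.0.iter().zip(other.0.iter()) {
            // Accumulate differences; any mismatch leaves a non-zero bit.
            acc |= a ^ b;
        }
        acc == 0
    }
}

// Only compiled when the hypothetical `danger-test-eq` feature is enabled,
// so crates that do not opt in never see `==` on `Secret`.
#[cfg(feature = "danger-test-eq")]
impl PartialEq for Secret {
    fn eq(&self, other: &Secret) -> bool {
        self.ct_eq(other)
    }
}

#[cfg(feature = "danger-test-eq")]
impl Eq for Secret {}
```

With this shape, a test-only crate would enable the feature in its own dependency declaration, while production crates that leave it off cannot compile code that compares secrets with `==`.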
@andrewjstone
Contributor Author

Here's a sample of what the tool can do, utilizing the example event log.

tqdb〉events
error: no events loaded. Please call 'open' on a valid file
tqdb〉open /tmp/cluster-49df2a4b903c778a-test_trust_quorum_protocol.14368.453-events.json
loaded event log: /tmp/cluster-49df2a4b903c778a-test_trust_quorum_protocol.14368.453-events.json
399 events.
tqdb〉events
0  InitialSetup {
    member_universe_size: 40,
    config: NexusConfig {
        op: Preparing,
        epoch: Epoch(
            1,
        ),
        last_committed_epoch: None,
        coordinator: PlatformId {
            part_number: "test",
            serial_number: "1",
        },
        members: {
            PlatformId {
                part_number: "test",
                serial_number: "1",
            },
            PlatformId {
                part_number: "test",
                serial_number: "15",
            },
            PlatformId {
                part_number: "test",
                serial_number: "25",
            },
            PlatformId {
                part_number: "test",
                serial_number: "27",
            },
            PlatformId {
                part_number: "test",
                serial_number: "3",
            },
            PlatformId {
                part_number: "test",
                serial_number: "32",
            },
            PlatformId {
                part_number: "test",
                serial_number: "34",
            },
            PlatformId {
                part_number: "test",
                serial_number: "37",
            },
            PlatformId {
                part_number: "test",
                serial_number: "39",
            },
            PlatformId {
                part_number: "test",
                serial_number: "4",
            },
            PlatformId {
                part_number: "test",
                serial_number: "5",
            },
            PlatformId {
                part_number: "test",
                serial_number: "7",
            },
            PlatformId {
                part_number: "test",
                serial_number: "9",
            },
        },
        threshold: Threshold(
            2,
        ),
        commit_crash_tolerance: 3,
        prepared_members: {},
        committed_members: {},
    },
    crashed_nodes: {
        PlatformId {
            part_number: "test",
            serial_number: "11",
        },
        PlatformId {
            part_number: "test",
            serial_number: "16",
        },
        PlatformId {
            part_number: "test",
            serial_number: "3",
        },
        PlatformId {
            part_number: "test",
            serial_number: "7",
        },
    },
}

tqdb〉s
INFO Starting coordination on uninitialized node, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "1" }, component: tq-coordinator-state, epoch: 1
0  InitialSetup {
    member_universe_size: 40,
    config: NexusConfig {
        op: Preparing,
        epoch: Epoch(
            1,
        ),
        last_committed_epoch: None,
        coordinator: PlatformId {
            part_number: "test",
            serial_number: "1",
        },
        members: {
            PlatformId {
                part_number: "test",
                serial_number: "1",
            },
            PlatformId {
                part_number: "test",
                serial_number: "15",
            },
            PlatformId {
                part_number: "test",
                serial_number: "25",
            },
            PlatformId {
                part_number: "test",
                serial_number: "27",
            },
            PlatformId {
                part_number: "test",
                serial_number: "3",
            },
            PlatformId {
                part_number: "test",
                serial_number: "32",
            },
            PlatformId {
                part_number: "test",
                serial_number: "34",
            },
            PlatformId {
                part_number: "test",
                serial_number: "37",
            },
            PlatformId {
                part_number: "test",
                serial_number: "39",
            },
            PlatformId {
                part_number: "test",
                serial_number: "4",
            },
            PlatformId {
                part_number: "test",
                serial_number: "5",
            },
            PlatformId {
                part_number: "test",
                serial_number: "7",
            },
            PlatformId {
                part_number: "test",
                serial_number: "9",
            },
        },
        threshold: Threshold(
            2,
        ),
        commit_crash_tolerance: 3,
        prepared_members: {},
        committed_members: {},
    },
    crashed_nodes: {
        PlatformId {
            part_number: "test",
            serial_number: "11",
        },
        PlatformId {
            part_number: "test",
            serial_number: "16",
        },
        PlatformId {
            part_number: "test",
            serial_number: "3",
        },
        PlatformId {
            part_number: "test",
            serial_number: "7",
        },
    },
}
done: applied 1 events

tqdb〉events
1  SendNexusReplyOnUnderlay(
    AckedPreparesFromCoordinator {
        epoch: Epoch(
            1,
        ),
        acks: {
            PlatformId {
                part_number: "test",
                serial_number: "1",
            },
        },
    },
)

tqdb〉events next 4
1  SendNexusReplyOnUnderlay(
    AckedPreparesFromCoordinator {
        epoch: Epoch(
            1,
        ),
        acks: {
            PlatformId {
                part_number: "test",
                serial_number: "1",
            },
        },
    },
)
2  DeliverNexusReply
3  DeliverEnvelope {
    destination: PlatformId {
        part_number: "test",
        serial_number: "37",
    },
}
4  DeliverEnvelope {
    destination: PlatformId {
        part_number: "test",
        serial_number: "25",
    },
}

tqdb〉step 4
INFO Prepared configuration, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "37" }, from: test:1, epoch: 1
INFO Prepared configuration, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "25" }, from: test:1, epoch: 1
1  SendNexusReplyOnUnderlay(
    AckedPreparesFromCoordinator {
        epoch: Epoch(
            1,
        ),
        acks: {
            PlatformId {
                part_number: "test",
                serial_number: "1",
            },
        },
    },
)
2  DeliverNexusReply
3  DeliverEnvelope {
    destination: PlatformId {
        part_number: "test",
        serial_number: "37",
    },
}
4  DeliverEnvelope {
    destination: PlatformId {
        part_number: "test",
        serial_number: "25",
    },
}
done: applied 4 events

tqdb〉snapshot 33
Setting pending snapshot
tqdb〉b 99
breakpoint set at event 99
tqdb〉run
INFO Prepared configuration, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "9" }, from: test:1, epoch: 1
INFO Prepared configuration, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "32" }, from: test:1, epoch: 1
INFO Prepared configuration, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "34" }, from: test:1, epoch: 1
INFO Prepared configuration, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "5" }, from: test:1, epoch: 1
INFO Prepared configuration, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "39" }, from: test:1, epoch: 1
INFO Prepared configuration, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "27" }, from: test:1, epoch: 1
INFO Received prepare ack, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "1" }, from: test:27, epoch: 1
INFO Received prepare ack, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "1" }, from: test:39, epoch: 1
INFO Prepared configuration, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "4" }, from: test:1, epoch: 1
INFO Prepared configuration, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "15" }, from: test:1, epoch: 1
INFO Received prepare ack, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "1" }, from: test:15, epoch: 1
INFO Received prepare ack, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "1" }, from: test:4, epoch: 1
INFO Received prepare ack, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "1" }, from: test:5, epoch: 1
INFO Received prepare ack, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "1" }, from: test:34, epoch: 1
INFO Received prepare ack, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "1" }, from: test:32, epoch: 1
INFO Received prepare ack, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "1" }, from: test:9, epoch: 1
INFO Received prepare ack, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "1" }, from: test:25, epoch: 1
INFO Received prepare ack, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "1" }, from: test:37, epoch: 1
INFO Configuration being coordinated changed, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "1" }, previous_epoch: 1, new_epoch: 2
INFO Starting coordination on uninitialized node, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "1" }, component: tq-coordinator-state, epoch: 2
INFO Prepared configuration, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "0" }, from: test:1, epoch: 2
INFO Prepared configuration, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "2" }, from: test:1, epoch: 2
INFO Received prepare ack, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "1" }, from: test:2, epoch: 2
INFO Received prepare ack, component: tqdb, component: trust-quorum, platform_id: PlatformId { part_number: "test", serial_number: "1" }, from: test:0, epoch: 2
stopped at breakpoint 99 after applying 94 events
tqdb〉snapshot-list
Snapshot indexes:
33

tqdb〉diff 33
Node changed: test:0
  config added to persistent state:
    epoch: 2
  our share added to persistent state:
    epoch: 2

Node changed: test:1
  coordinator state changed:
    epoch: 1 -> 2
    added members:
        test:0
        test:2
    removed members:
        test:15
        test:25
        test:27
        test:32
        test:34
        test:37
        test:39
        test:4
        test:5
        test:7
        test:9
    coordinator: test:1 -> test:1
    received prepare acks differ
  config added to persistent state:
    epoch: 2
  our share added to persistent state:
    epoch: 2

Node changed: test:2
  config added to persistent state:
    epoch: 2
  our share added to persistent state:
    epoch: 2

  all messages delivered from bootstrap network:
    destination:  test:1
  1 new nexus replies in flight on underlay network
  0 nexus replies delivered to nexus from underlay network
  nexus state changed:
    config added at epoch 2, op: Preparing
    config added at epoch 3, op: Preparing
    config modified at epoch 1
      new prepare ack received: test:25
      new prepare ack received: test:32
      new prepare ack received: test:37
      new prepare ack received: test:9

tqdb〉summary
event log path: "/tmp/cluster-49df2a4b903c778a-test_trust_quorum_protocol.14368.453-events.json"
total events in log: 399
next event to apply: 99
    AbortConfiguration(
    Epoch(
        3,
    ),
)
total nodes under test: 40
bootstrap network messages in flight: 0
nexus config:
    epoch: 3
    op: Preparing
    coordinator: 3
    total members: 4
    prepared members: 0
    committed members: 0
    threshold: 2
    commit crash tolerance: 1

tqdb〉node-show 2
Node {
    log: Logger(platform_id, component, component),
    coordinator_state: None,
    key_share_computer: None,
}
NodeCtx {
    platform_id: PlatformId {
        part_number: "test",
        serial_number: "2",
    },
    persistent_state: PersistentState {
        lrtq: None,
        configs: {
            Epoch(
                2,
            ): Configuration {
                rack_id: 4ea41e14-9da7-4783-af0a-9cf85f8c46db (rack),
                epoch: Epoch(
                    2,
                ),
                coordinator: PlatformId {
                    part_number: "test",
                    serial_number: "1",
                },
                members: {
                    PlatformId {
                        part_number: "test",
                        serial_number: "0",
                    }: sha3 digest: d1d315a27b395a1be95638c0362042692cbf39911d9b26a531655868e3f5,
                    PlatformId {
                        part_number: "test",
                        serial_number: "1",
                    }: sha3 digest: dc564317181cf569b639adaba0144b9129529af7f82efac7f8eceb2ecd6816,
                    PlatformId {
                        part_number: "test",
                        serial_number: "2",
                    }: sha3 digest: c2696e54ef160b03d877cb7b19b9b886d14f95df2992663a8d843474fbec9c,
                    PlatformId {
                        part_number: "test",
                        serial_number: "3",
                    }: sha3 digest: f4b19a92ff3dd8a20a74b1744f864de858b10a978ea95e2f86c641beb2844b,
                },
                threshold: Threshold(
                    2,
                ),
                encrypted_rack_secrets: None,
            },
        },
        shares: {
            Epoch(
                2,
            ): KeyShareGf256,
        },
        commits: {},
        expunged: None,
    },
    persistent_state_changed: false,
    outgoing: [],
    connected: {
        PlatformId {
            part_number: "test",
            serial_number: "0",
        },
        PlatformId {
            part_number: "test",
            serial_number: "1",
        },
        PlatformId {
            part_number: "test",
            serial_number: "10",
        },
        PlatformId {
            part_number: "test",
            serial_number: "12",
        },
        PlatformId {
            part_number: "test",
            serial_number: "13",
        },
        PlatformId {
            part_number: "test",
            serial_number: "14",
        },
        PlatformId {
            part_number: "test",
            serial_number: "15",
        },
        PlatformId {
            part_number: "test",
            serial_number: "17",
        },
        PlatformId {
            part_number: "test",
            serial_number: "18",
        },
        PlatformId {
            part_number: "test",
            serial_number: "19",
        },
        PlatformId {
            part_number: "test",
            serial_number: "20",
        },
        PlatformId {
            part_number: "test",
            serial_number: "21",
        },
        PlatformId {
            part_number: "test",
            serial_number: "22",
        },
        PlatformId {
            part_number: "test",
            serial_number: "23",
        },
        PlatformId {
            part_number: "test",
            serial_number: "24",
        },
        PlatformId {
            part_number: "test",
            serial_number: "25",
        },
        PlatformId {
            part_number: "test",
            serial_number: "26",
        },
        PlatformId {
            part_number: "test",
            serial_number: "27",
        },
        PlatformId {
            part_number: "test",
            serial_number: "28",
        },
        PlatformId {
            part_number: "test",
            serial_number: "29",
        },
        PlatformId {
            part_number: "test",
            serial_number: "30",
        },
        PlatformId {
            part_number: "test",
            serial_number: "31",
        },
        PlatformId {
            part_number: "test",
            serial_number: "32",
        },
        PlatformId {
            part_number: "test",
            serial_number: "33",
        },
        PlatformId {
            part_number: "test",
            serial_number: "34",
        },
        PlatformId {
            part_number: "test",
            serial_number: "35",
        },
        PlatformId {
            part_number: "test",
            serial_number: "36",
        },
        PlatformId {
            part_number: "test",
            serial_number: "37",
        },
        PlatformId {
            part_number: "test",
            serial_number: "38",
        },
        PlatformId {
            part_number: "test",
            serial_number: "39",
        },
        PlatformId {
            part_number: "test",
            serial_number: "4",
        },
        PlatformId {
            part_number: "test",
            serial_number: "5",
        },
        PlatformId {
            part_number: "test",
            serial_number: "6",
        },
        PlatformId {
            part_number: "test",
            serial_number: "8",
        },
        PlatformId {
            part_number: "test",
            serial_number: "9",
        },
    },
    alarms: {},
}

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Nexus related types for trust-quorum testing
Contributor Author

Everything in this file was moved from the proptest in cluster.rs

// The state of our entire system including the system under test and
// test specific infrastructure.
#[derive(Debug, Clone, Diffable)]
pub struct TqState {
Contributor Author

This type was moved from the cluster proptest. No changes were made.

Comment on lines +69 to +121
impl TqState {
    pub fn new(log: Logger) -> TqState {
        // We'll fill this in when applying the initial_config
        let sut = Sut::empty();
        let member_universe = vec![];
        TqState {
            log,
            sut,
            bootstrap_network: BTreeMap::new(),
            underlay_network: Vec::new(),
            nexus: NexusState::new(),
            member_universe,
            faults: Faults::default(),
            all_coordinated_configs: IdOrdMap::new(),
            expunged: BTreeSet::new(),
        }
    }

    /// Send the latest `ReconfigureMsg` from `Nexus` to the coordinator node
    ///
    /// If the node is not available, then abort the configuration at nexus
    pub fn send_reconfigure_msg(&mut self) {
        let (coordinator, msg) = self.nexus.reconfigure_msg_for_latest_config();
        let epoch_to_config = msg.epoch;
        if self.faults.crashed_nodes.contains(coordinator) {
            // We must abort the configuration. This mimics a timeout.
            self.nexus.abort_reconfiguration();
        } else {
            let (node, ctx) = self
                .sut
                .nodes
                .get_mut(coordinator)
                .expect("coordinator exists");

            node.coordinate_reconfiguration(ctx, msg)
                .expect("valid configuration");

            // Do we have a `Configuration` for this epoch yet?
            //
            // For most reconfigurations, shares for the last committed
            // configuration must be retrieved before the configuration is
            // generated and saved in the persistent state.
            let latest_persisted_config =
                ctx.persistent_state().latest_config().expect("config exists");
            if latest_persisted_config.epoch == epoch_to_config {
                // Save the configuration for later
                self.all_coordinated_configs
                    .insert_unique(latest_persisted_config.clone())
                    .expect("unique");
            }
        }
    }

Contributor Author

This was moved. No changes.

Comment on lines +122 to +180
    /// Check postcondition assertions after initial configuration
    pub fn postcondition_initial_configuration(&mut self) {
        let (coordinator, msg) = self.nexus.reconfigure_msg_for_latest_config();

        // The coordinator should have received the `ReconfigureMsg` from Nexus
        if !self.faults.crashed_nodes.contains(coordinator) {
            let (node, ctx) = self
                .sut
                .nodes
                .get_mut(coordinator)
                .expect("coordinator exists");
            let mut connected_members = 0;
            // The coordinator should start preparing by sending a `PrepareMsg` to all
            // connected nodes in the membership set.
            for member in
                msg.members.iter().filter(|&id| id != coordinator).cloned()
            {
                if self.faults.is_connected(coordinator.clone(), member.clone())
                {
                    connected_members += 1;
                    let msg_found = ctx.envelopes().any(|envelope| {
                        envelope.to == member
                            && envelope.from == *coordinator
                            && matches!(
                                envelope.msg.kind,
                                PeerMsgKind::Prepare { .. }
                            )
                    });
                    assert!(msg_found);
                }
            }
            assert_eq!(connected_members, ctx.envelopes().count());

            // The coordinator should be in the prepare phase
            let cs = node.get_coordinator_state().expect("is coordinating");
            assert!(matches!(cs.op(), CoordinatorOperation::Prepare { .. }));

            // The persistent state should have changed
            assert!(ctx.persistent_state_change_check_and_reset());
            assert!(ctx.persistent_state().has_prepared(msg.epoch));
            assert!(ctx.persistent_state().latest_committed_epoch().is_none());
        }
    }

    /// Put any outgoing coordinator messages from the latest configuration on the wire
    pub fn send_envelopes_from_coordinator(&mut self) {
        let coordinator = {
            let (coordinator, _) =
                self.nexus.reconfigure_msg_for_latest_config();
            coordinator.clone()
        };
        self.send_envelopes_from(&coordinator);
    }

    pub fn send_envelopes_from(&mut self, id: &PlatformId) {
        let (_, ctx) = self.sut.nodes.get_mut(id).expect("node exists");
        for envelope in ctx.drain_envelopes() {
            let msgs =
                self.bootstrap_network.entry(envelope.to.clone()).or_default();
Contributor Author

This was all moved as well. GitHub select when scrolling does not seem to be working as I'd hoped.

        }
    }

    pub fn apply_event(&mut self, event: Event) {
Contributor Author

The logic for all the apply functions was largely just moved, but it is now called based on events instead of being inlined in the action handlers in the test.
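A minimal sketch of the shape this describes, assuming a made-up `Action` and `Event` vocabulary and a toy `TestState` (none of these match the real types): high-level actions are turned into concrete events using the current state, each event is applied and recorded, and the recorded log can later be replayed without the actions.

```rust
use std::collections::{BTreeSet, VecDeque};

/// High-level generated action (hypothetical).
enum Action {
    /// Deliver up to this many in-flight messages.
    DeliverMessages(usize),
    /// Crash the node with this id.
    CrashNode(u32),
}

/// Concrete, replayable event (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
enum Event {
    DeliverMessage { destination: u32 },
    CrashNode(u32),
}

#[derive(Default)]
struct TestState {
    /// Destinations of in-flight messages, oldest first.
    in_flight: VecDeque<u32>,
    delivered: Vec<u32>,
    crashed: BTreeSet<u32>,
}

impl TestState {
    /// Derive concrete events from an action and the current state.
    fn action_to_events(&self, action: &Action) -> Vec<Event> {
        match action {
            Action::DeliverMessages(n) => self
                .in_flight
                .iter()
                .take(*n)
                .map(|&destination| Event::DeliverMessage { destination })
                .collect(),
            Action::CrashNode(id) => {
                if self.crashed.contains(id) {
                    // Nothing to do: the action yields no concrete event.
                    vec![]
                } else {
                    vec![Event::CrashNode(*id)]
                }
            }
        }
    }

    /// All state changes go through here, for both the test and replay.
    fn apply_event(&mut self, event: &Event) {
        match event {
            Event::DeliverMessage { destination } => {
                let pos = self.in_flight.iter().position(|d| d == destination);
                if let Some(pos) = pos {
                    self.in_flight.remove(pos);
                    // Messages to crashed nodes are dropped.
                    if !self.crashed.contains(destination) {
                        self.delivered.push(*destination);
                    }
                }
            }
            Event::CrashNode(id) => {
                self.crashed.insert(*id);
            }
        }
    }
}

/// Run actions against the state and return the log of applied events.
fn run_actions(state: &mut TestState, actions: &[Action]) -> Vec<Event> {
    let mut log = Vec::new();
    for action in actions {
        for event in state.action_to_events(action) {
            state.apply_event(&event);
            log.push(event);
        }
    }
    log
}
```

Replaying the returned log against a fresh state with the same initial contents reproduces the final state, which is the property a replay debugger relies on.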

Comment on lines +352 to +409
/// Broken out of `TqState` to alleviate borrow checker woes
fn send_envelopes(
    ctx: &mut NodeCtx,
    bootstrap_network: &mut BTreeMap<PlatformId, Vec<Envelope>>,
) {
    for envelope in ctx.drain_envelopes() {
        let envelopes =
            bootstrap_network.entry(envelope.to.clone()).or_default();
        envelopes.push(envelope);
    }
}

/// The system under test
///
/// This is our real code.
#[derive(Debug, Clone, Diffable)]
pub struct Sut {
    /// All nodes in the member universe
    pub nodes: BTreeMap<PlatformId, (Node, NodeCtx)>,
}

impl Sut {
    pub fn empty() -> Sut {
        Sut { nodes: BTreeMap::new() }
    }

    pub fn new(log: &Logger, universe: Vec<PlatformId>) -> Sut {
        let nodes = universe
            .into_iter()
            .map(|id| {
                let mut ctx = NodeCtx::new(id.clone());
                let node = Node::new(log, &mut ctx);
                (id, (node, ctx))
            })
            .collect();
        Sut { nodes }
    }
}

/// Faults in our system. It's useful to keep these self contained and not
/// in separate fields in `TestState` so that we can access them all at once
/// independently of other `TestState` fields.
#[derive(Default, Debug, Clone, Diffable)]
pub struct Faults {
    // We allow nodes to crash and restart and therefore track crashed nodes here.
    //
    // A crashed node is implicitly disconnected from every other node. We don't
    // bother storing the pairs in `disconnected_nodes`, but instead check both
    // fields when necessary.
    pub crashed_nodes: BTreeSet<PlatformId>,

    /// The set of disconnected nodes
    pub disconnected_nodes: DisconnectedNodes,
}

impl Faults {
    pub fn is_connected(&self, node1: PlatformId, node2: PlatformId) -> bool {
        !self.crashed_nodes.contains(&node1)
Contributor Author

All moved.

Comment on lines +415 to +439
/// For cardinality purposes, we assume all nodes are connected and explicitly
/// disconnect some of them. This allows us to track and compare much less data.
#[derive(Default, Debug, Clone, Diffable)]
pub struct DisconnectedNodes {
    // We sort each pair on insert for quick lookups
    pairs: BTreeSet<(PlatformId, PlatformId)>,
}

impl DisconnectedNodes {
    // Return true if the pair is newly inserted
    pub fn insert(&mut self, node1: PlatformId, node2: PlatformId) -> bool {
        assert_ne!(node1, node2);

        let pair = if node1 < node2 { (node1, node2) } else { (node2, node1) };
        self.pairs.insert(pair)
    }

    // Return true if the pair of nodes is disconnected, false otherwise
    pub fn contains(&self, node1: PlatformId, node2: PlatformId) -> bool {
        assert_ne!(node1, node2);
        let pair = if node1 < node2 { (node1, node2) } else { (node2, node1) };
        self.pairs.contains(&pair)
    }
}

Contributor Author

Moved
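For readers skimming the moved code, a self-contained sketch of the sorted-pair idea may help. It uses `u32` as a stand-in for `PlatformId`, and the body of `is_connected` is an assumption reconstructed from the field comments (a crashed node counts as disconnected from everyone), since the quoted hunk is truncated.

```rust
use std::collections::BTreeSet;

/// Stand-in for `PlatformId` in this sketch.
type NodeId = u32;

#[derive(Default)]
struct DisconnectedNodes {
    // Each pair is stored with the smaller id first.
    pairs: BTreeSet<(NodeId, NodeId)>,
}

impl DisconnectedNodes {
    /// Order the pair so that (a, b) and (b, a) map to the same key.
    fn normalize(a: NodeId, b: NodeId) -> (NodeId, NodeId) {
        if a < b { (a, b) } else { (b, a) }
    }

    /// Return true if the pair is newly inserted.
    fn insert(&mut self, a: NodeId, b: NodeId) -> bool {
        assert_ne!(a, b);
        self.pairs.insert(Self::normalize(a, b))
    }

    /// Return true if the pair of nodes is disconnected.
    fn contains(&self, a: NodeId, b: NodeId) -> bool {
        assert_ne!(a, b);
        self.pairs.contains(&Self::normalize(a, b))
    }
}

#[derive(Default)]
struct Faults {
    crashed_nodes: BTreeSet<NodeId>,
    disconnected_nodes: DisconnectedNodes,
}

impl Faults {
    /// Assumed logic: two nodes are connected only if neither has crashed
    /// and the pair has not been explicitly disconnected.
    fn is_connected(&self, a: NodeId, b: NodeId) -> bool {
        !self.crashed_nodes.contains(&a)
            && !self.crashed_nodes.contains(&b)
            && !self.disconnected_nodes.contains(a, b)
    }
}
```

Storing one normalized pair per disconnection keeps lookups independent of argument order and halves the data that has to be tracked and compared.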

}
}

/*****************************************************************************
Contributor Author

@andrewjstone andrewjstone Aug 8, 2025

This is really the start of the new code for this file. I wouldn't worry much about everything above here.
I tried to make the diff output useful for a human without being overwhelming. And of course it is tedious to translate. I'd love a library that could translate daft diffs into reasonable output, but I have a feeling that's not really possible in an automated way - outside of some LLM hackery in the loop. I had to think judiciously about what should go where.
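As a rough illustration of the translation step described above, here is a hand-written sketch that turns two snapshots into human-readable lines. It does not use daft; `Snapshot` is a toy map from node name to latest persisted epoch, and the wording of the output lines is invented for the example.

```rust
use std::collections::BTreeMap;

/// Toy snapshot (hypothetical): node name -> latest persisted epoch.
type Snapshot = BTreeMap<String, u64>;

/// Describe what changed between two snapshots, one entry per node.
fn diff_snapshots(before: &Snapshot, after: &Snapshot) -> Vec<String> {
    let mut lines = Vec::new();
    // Nodes present afterwards: either new or possibly changed.
    for (node, new_epoch) in after {
        match before.get(node) {
            None => {
                lines.push(format!("Node added: {} (epoch {})", node, new_epoch))
            }
            Some(old_epoch) if old_epoch != new_epoch => lines.push(format!(
                "Node changed: {}\n  epoch: {} -> {}",
                node, old_epoch, new_epoch
            )),
            // Unchanged nodes are omitted to keep the output short.
            Some(_) => {}
        }
    }
    // Nodes that disappeared.
    for node in before.keys() {
        if !after.contains_key(node) {
            lines.push(format!("Node removed: {}", node));
        }
    }
    lines
}
```

Even in this toy form, deciding what to omit and how to group the output is a manual judgment call, which matches the point about translation being hard to automate.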
