TQ: Introduce tqdb #8801
base: tq-alarms
Conversation
This PR introduces a new test tool for the trust-quorum protocol: tqdb. tqdb is a REPL that takes event traces produced by the `cluster` proptest and uses them for deterministic replay of actions against test state. The test state includes a "universe" of real protocol nodes, a fake Nexus, and fake networks. The proptest and the debugger share this state, which lives in the `trust-quorum-test-utils` crate. The debugger supports stepping through individual events, setting breakpoints, snapshotting and diffing states, and viewing the event log itself.

The purpose of tqdb is twofold:

1. Allow debugging of failed proptests. This is non-trivial in some cases, even with shrunken tests, because the generated actions are high-level and are all generated up front. The actual operations, such as reconfigurations, are derived from these high-level random generations in conjunction with the current state of the system. Therefore the set of failing generated actions doesn't really tell you much: you have to look at the logs and the assertion that fired, and reason about it with incomplete information. Now, for each concrete action taken, we record the event in a log. In the case of a failure, an event log can be loaded into tqdb with a breakpoint set right before the failure. A snapshot of the state can be taken, and then the failing event can be applied. The diff will tell you what changed and allow you to inspect the actual state of the system. Full visibility into your failure is now possible.

2. The trust quorum protocol is non-trivial. tqdb allows developers to see in detail how the protocol behaves and to understand what is happening in particular situations. Event logs can be created by hand (or by script) for especially interesting scenarios and then run through tqdb.
In order to get the diff functionality to work as I wanted, I had to implement `Eq` for types that implement `subtle::ConstantTimeEq` in both the `gfss` (our secret sharing library) and `trust-quorum` crates. However, it is unknown whether the compiler could break the constant-time guarantees through these implementations. Therefore, a feature flag was added so that only the `test-utils` and `tqdb` crates can use them; they are not used in the production codebase. Feature unification is not at play here because neither `test-utils` nor `tqdb` is part of the product.
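To make the shape of that feature gate concrete, here is a minimal, hypothetical sketch. It does not use the real `subtle` crate or the PR's actual feature name: `CtEq` is a local stand-in for `subtle::ConstantTimeEq`, and the feature name is invented for illustration.

```rust
/// Stand-in for `subtle::ConstantTimeEq` (NOT the real trait): compares
/// without early exit so every byte is always examined.
trait CtEq {
    fn ct_eq(&self, other: &Self) -> bool;
}

/// A secret share whose comparison must not leak timing information.
#[derive(Clone)]
struct Share(Vec<u8>);

impl CtEq for Share {
    fn ct_eq(&self, other: &Self) -> bool {
        if self.0.len() != other.0.len() {
            return false;
        }
        // Accumulate differences instead of returning on the first mismatch.
        let mut acc = 0u8;
        for (a, b) in self.0.iter().zip(other.0.iter()) {
            acc |= a ^ b;
        }
        acc == 0
    }
}

// Only test tooling enables this feature; production builds never see `Eq`.
// (The feature name here is illustrative, not the one used in the PR.)
#[cfg(feature = "testing-eq-impls")]
impl PartialEq for Share {
    fn eq(&self, other: &Self) -> bool {
        self.ct_eq(other)
    }
}
#[cfg(feature = "testing-eq-impls")]
impl Eq for Share {}

fn main() {
    let a = Share(vec![1, 2, 3]);
    let b = Share(vec![1, 2, 3]);
    assert!(a.ct_eq(&b));
    assert!(!a.ct_eq(&Share(vec![3, 2, 1])));
}
```

Because the gated `impl`s simply delegate to the constant-time comparison, equality semantics stay consistent with the production code path; the open question the PR flags is only whether the optimizer preserves the timing properties through the extra layer.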
Here's a sample of what the tool can do, utilizing the example event log.
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Nexus related types for trust-quorum testing
Everything in this file was moved from the proptest in cluster.rs
// The state of our entire system including the system under test and
// test specific infrastructure.
#[derive(Debug, Clone, Diffable)]
pub struct TqState {
This type was moved from the cluster proptest. No changes were made.
impl TqState {
    pub fn new(log: Logger) -> TqState {
        // We'll fill this in when applying the initial_config
        let sut = Sut::empty();
        let member_universe = vec![];
        TqState {
            log,
            sut,
            bootstrap_network: BTreeMap::new(),
            underlay_network: Vec::new(),
            nexus: NexusState::new(),
            member_universe,
            faults: Faults::default(),
            all_coordinated_configs: IdOrdMap::new(),
            expunged: BTreeSet::new(),
        }
    }
    /// Send the latest `ReconfigureMsg` from `Nexus` to the coordinator node
    ///
    /// If the node is not available, then abort the configuration at nexus
    pub fn send_reconfigure_msg(&mut self) {
        let (coordinator, msg) = self.nexus.reconfigure_msg_for_latest_config();
        let epoch_to_config = msg.epoch;
        if self.faults.crashed_nodes.contains(coordinator) {
            // We must abort the configuration. This mimics a timeout.
            self.nexus.abort_reconfiguration();
        } else {
            let (node, ctx) = self
                .sut
                .nodes
                .get_mut(coordinator)
                .expect("coordinator exists");

            node.coordinate_reconfiguration(ctx, msg)
                .expect("valid configuration");

            // Do we have a `Configuration` for this epoch yet?
            //
            // For most reconfigurations, shares for the last committed
            // configuration must be retrieved before the configuration is
            // generated and saved in the persistent state.
            let latest_persisted_config =
                ctx.persistent_state().latest_config().expect("config exists");
            if latest_persisted_config.epoch == epoch_to_config {
                // Save the configuration for later
                self.all_coordinated_configs
                    .insert_unique(latest_persisted_config.clone())
                    .expect("unique");
            }
        }
    }
This was moved. No changes.
    /// Check postcondition assertions after initial configuration
    pub fn postcondition_initial_configuration(&mut self) {
        let (coordinator, msg) = self.nexus.reconfigure_msg_for_latest_config();

        // The coordinator should have received the `ReconfigureMsg` from Nexus
        if !self.faults.crashed_nodes.contains(coordinator) {
            let (node, ctx) = self
                .sut
                .nodes
                .get_mut(coordinator)
                .expect("coordinator exists");
            let mut connected_members = 0;
            // The coordinator should start preparing by sending a `PrepareMsg`
            // to all connected nodes in the membership set.
            for member in
                msg.members.iter().filter(|&id| id != coordinator).cloned()
            {
                if self.faults.is_connected(coordinator.clone(), member.clone())
                {
                    connected_members += 1;
                    let msg_found = ctx.envelopes().any(|envelope| {
                        envelope.to == member
                            && envelope.from == *coordinator
                            && matches!(
                                envelope.msg.kind,
                                PeerMsgKind::Prepare { .. }
                            )
                    });
                    assert!(msg_found);
                }
            }
            assert_eq!(connected_members, ctx.envelopes().count());

            // The coordinator should be in the prepare phase
            let cs = node.get_coordinator_state().expect("is coordinating");
            assert!(matches!(cs.op(), CoordinatorOperation::Prepare { .. }));

            // The persistent state should have changed
            assert!(ctx.persistent_state_change_check_and_reset());
            assert!(ctx.persistent_state().has_prepared(msg.epoch));
            assert!(ctx.persistent_state().latest_committed_epoch().is_none());
        }
    }
    /// Put any outgoing coordinator messages from the latest configuration
    /// on the wire
    pub fn send_envelopes_from_coordinator(&mut self) {
        let coordinator = {
            let (coordinator, _) =
                self.nexus.reconfigure_msg_for_latest_config();
            coordinator.clone()
        };
        self.send_envelopes_from(&coordinator);
    }

    pub fn send_envelopes_from(&mut self, id: &PlatformId) {
        let (_, ctx) = self.sut.nodes.get_mut(id).expect("node exists");
        for envelope in ctx.drain_envelopes() {
            let msgs =
                self.bootstrap_network.entry(envelope.to.clone()).or_default();
            msgs.push(envelope);
This was all moved as well. GitHub select when scrolling does not seem to be working as I'd hoped.
        }
    }
    pub fn apply_event(&mut self, event: Event) {
The logic for all the apply functions was largely just moved, but it is now called based on events instead of being inlined in the action handlers in the test.
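The event-dispatch-plus-breakpoint pattern described above can be sketched as follows. This is a toy illustration, not the PR's actual `Event` or `TqState` types: the variants, fields, and `replay_until` helper are all invented for this example.

```rust
// Toy sketch of event-driven deterministic replay with a breakpoint,
// in the style a debugger like tqdb might drive it. All names here are
// hypothetical stand-ins for the real trust-quorum test types.

#[derive(Debug, Clone)]
enum Event {
    InitialSetup { num_nodes: usize },
    Reconfigure { epoch: u64 },
    DeliverEnvelopes,
}

#[derive(Debug, Default)]
struct State {
    nodes: usize,
    epoch: u64,
    delivered: u64,
}

impl State {
    // Each event maps to exactly one deterministic mutation, mirroring how
    // the test's action handlers were converted into apply functions.
    fn apply_event(&mut self, event: &Event) {
        match event {
            Event::InitialSetup { num_nodes } => self.nodes = *num_nodes,
            Event::Reconfigure { epoch } => self.epoch = *epoch,
            Event::DeliverEnvelopes => self.delivered += 1,
        }
    }
}

/// Replay events up to (but not including) `breakpoint`, yielding the state
/// just before the interesting event so it can be snapshotted and diffed.
fn replay_until(events: &[Event], breakpoint: usize) -> State {
    let mut state = State::default();
    for event in &events[..breakpoint] {
        state.apply_event(event);
    }
    state
}

fn main() {
    let log = vec![
        Event::InitialSetup { num_nodes: 4 },
        Event::Reconfigure { epoch: 1 },
        Event::DeliverEnvelopes,
    ];
    // Stop just before the third event; snapshot, then apply it and diff.
    let before = replay_until(&log, 2);
    assert_eq!(before.epoch, 1);
    assert_eq!(before.delivered, 0);
}
```

Because every mutation flows through `apply_event`, replaying the same serialized log always reproduces the same state, which is what makes the snapshot-then-diff workflow deterministic.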
/// Broken out of `TqState` to alleviate borrow checker woes
fn send_envelopes(
    ctx: &mut NodeCtx,
    bootstrap_network: &mut BTreeMap<PlatformId, Vec<Envelope>>,
) {
    for envelope in ctx.drain_envelopes() {
        let envelopes =
            bootstrap_network.entry(envelope.to.clone()).or_default();
        envelopes.push(envelope);
    }
}
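The "break it out of the impl" trick above relies on the borrow checker splitting disjoint field borrows at a call site, which it cannot do across a `&mut self` method call. A minimal self-contained illustration (with hypothetical `Ctx`/`TestState` stand-ins, not the PR's types):

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins to show the borrow-splitting pattern; the real
// code uses `NodeCtx` and `TqState`.
struct Ctx {
    outbox: Vec<(String, u32)>, // (destination, payload)
}

struct TestState {
    ctx: Ctx,
    network: BTreeMap<String, Vec<u32>>,
}

// A free function taking the two fields separately: no borrow of the whole
// struct, so both can be mutated simultaneously.
fn send_envelopes(ctx: &mut Ctx, network: &mut BTreeMap<String, Vec<u32>>) {
    for (to, payload) in ctx.outbox.drain(..) {
        network.entry(to).or_default().push(payload);
    }
}

impl TestState {
    fn deliver(&mut self) {
        // `self.ctx` and `self.network` are disjoint fields, so the
        // compiler accepts two simultaneous `&mut` borrows here. Routing
        // this through a `&mut self` helper method instead would borrow
        // all of `self` and fail to compile.
        send_envelopes(&mut self.ctx, &mut self.network);
    }
}

fn main() {
    let mut st = TestState {
        ctx: Ctx { outbox: vec![("a".to_string(), 1), ("a".to_string(), 2)] },
        network: BTreeMap::new(),
    };
    st.deliver();
    assert_eq!(st.network["a"], vec![1, 2]);
    assert!(st.ctx.outbox.is_empty());
}
```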
/// The system under test
///
/// This is our real code.
#[derive(Debug, Clone, Diffable)]
pub struct Sut {
    /// All nodes in the member universe
    pub nodes: BTreeMap<PlatformId, (Node, NodeCtx)>,
}

impl Sut {
    pub fn empty() -> Sut {
        Sut { nodes: BTreeMap::new() }
    }

    pub fn new(log: &Logger, universe: Vec<PlatformId>) -> Sut {
        let nodes = universe
            .into_iter()
            .map(|id| {
                let mut ctx = NodeCtx::new(id.clone());
                let node = Node::new(log, &mut ctx);
                (id, (node, ctx))
            })
            .collect();
        Sut { nodes }
    }
}
/// Faults in our system. It's useful to keep these self contained and not
/// in separate fields in `TestState` so that we can access them all at once
/// independently of other `TestState` fields.
#[derive(Default, Debug, Clone, Diffable)]
pub struct Faults {
    // We allow nodes to crash and restart and therefore track crashed nodes
    // here.
    //
    // A crashed node is implicitly disconnected from every other node. We
    // don't bother storing the pairs in `disconnected_nodes`, but instead
    // check both fields when necessary.
    pub crashed_nodes: BTreeSet<PlatformId>,

    /// The set of disconnected nodes
    pub disconnected_nodes: DisconnectedNodes,
}

impl Faults {
    pub fn is_connected(&self, node1: PlatformId, node2: PlatformId) -> bool {
        !self.crashed_nodes.contains(&node1)
            && !self.crashed_nodes.contains(&node2)
            && !self.disconnected_nodes.contains(node1, node2)
    }
}
All moved.
/// For cardinality purposes, we assume all nodes are connected and
/// explicitly disconnect some of them. This allows us to track and compare
/// much less data.
#[derive(Default, Debug, Clone, Diffable)]
pub struct DisconnectedNodes {
    // We sort each pair on insert for quick lookups
    pairs: BTreeSet<(PlatformId, PlatformId)>,
}

impl DisconnectedNodes {
    // Return true if the pair is newly inserted
    pub fn insert(&mut self, node1: PlatformId, node2: PlatformId) -> bool {
        assert_ne!(node1, node2);

        let pair = if node1 < node2 { (node1, node2) } else { (node2, node1) };
        self.pairs.insert(pair)
    }

    // Return true if the pair of nodes is disconnected, false otherwise
    pub fn contains(&self, node1: PlatformId, node2: PlatformId) -> bool {
        assert_ne!(node1, node2);
        let pair = if node1 < node2 { (node1, node2) } else { (node2, node1) };
        self.pairs.contains(&pair)
    }
}
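The sorted-pair normalization in `DisconnectedNodes` can be shown in isolation. This standalone sketch uses `u32` stand-ins for `PlatformId` and an invented `norm` helper; the point is that canonicalizing the order before touching the set makes `(a, b)` and `(b, a)` hit the same entry.

```rust
use std::collections::BTreeSet;

// Canonicalize an unordered pair so lookups are order-independent.
// (`u32` stands in for `PlatformId`; `norm` is illustrative, not real code.)
fn norm(a: u32, b: u32) -> (u32, u32) {
    assert_ne!(a, b);
    if a < b { (a, b) } else { (b, a) }
}

fn main() {
    let mut pairs: BTreeSet<(u32, u32)> = BTreeSet::new();
    assert!(pairs.insert(norm(3, 1))); // stored canonically as (1, 3)
    assert!(!pairs.insert(norm(1, 3))); // same pair in the other order
    assert!(pairs.contains(&norm(3, 1)));
    assert!(!pairs.contains(&norm(2, 3)));
}
```

Storing only one canonical orientation halves the data tracked and keeps membership checks a single `BTreeSet` lookup, which is what the "track and compare much less data" comment is getting at.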
Moved
    }
}

/*****************************************************************************
This is really the start of the new code for this file. I wouldn't worry much about everything above here.
I tried to make the diff output useful for a human without being overwhelming, and of course it is tedious to translate. I'd love a library that could translate daft diffs into reasonable output, but I have a feeling that's not really possible in an automated way, outside of some LLM hackery in the loop. I had to think judiciously about what should go where.
This PR builds on #8753
This is a hefty PR, but it's not as bad as it looks. Over 4k lines of it is in the example log file in the second commit. There's also some moved and unmodified code that I'll point out.