Skip to content

node: Add Notary package for verifying MessagePublications #4315

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 34 commits into
base: main
Choose a base branch
from

Conversation

johnsaigle
Copy link
Contributor

@johnsaigle johnsaigle commented Mar 26, 2025

Overview

New Notary package

This PR adds a "Notary" package that can evaluates the verification state of a Message Publication. It provides a Verdict on what should occur, e.g. it reports that an Anomalous message should be delayed.

It stores messages that should be Delayed or Blackholed in a database (BadgerDB)

Details are in #4452

Changes to the Processor

The processor now inspects the Notary's queue of message to delay and skips publishing them until their ReleaseTime has passed. Once they are ready to release, it passes them to the Governor and Accountant as normal (assuming Guardians have them enabled).

Design decisions + supporting tasks

Database

The package uses the BadgerDB key-value store to delay and persist messages, similar to the Governor. This allows the Notary's decisions to persist across restarts.

An implementation is possible that would use only in-memory representations, but this has some drawbacks:

  • We would lose messages that are marked as delayed on a node restart
  • Messages with a Rejected status should never be processed until the end of time. We need a way to track this.

Rather than use the existing Database struct (like Governor and Accountant), a new NotaryDB struct was added that is a wrapper for the BadgerDB connection. The goal here is to isolate the Notary-related API from other modules.

Creation of min-heap data structure PendingMessageQueue

Conceptually, it makes sense to store delayed messages according to their release time. If the collection is always sorted, we can access elements from one side or the other until one message is too recent to process. In that case, all of the other messages will also be too new, and we can exit early.

This also allows us to be flexible with our delay time, compared with storing the time at which the message is stored rather than when it should be delayed. We can delay a message for one minute, a day, or a year. As long as the sorting holds, we don't need to iterate over many items.

In order to achieve this, I've added a min-heap data structure with a safe API that allows for storing messages this way.

Background

Follow-up task on: #4233

Future Work

Add drop/release/reset admin commands

This package should allow a super-minority of Guardians to forcibly drop, release, or reset the delay time of delayed or blackholed messages. This can help when the Notary has delayed an Anomalous message that is in fact fraudulent - in this case, the message can be moved to the 'blackhole' deny-list. Alternatively, if the Transfer Verifier or Notary has a bug, this allows the Guardians to release a harmless message.

Add Guardian p2p feature flags

Similar to: #4317

This change will be done in a follow-up PR.

Use the priority message queue min-heap for the Governor

The new data structure could be made generic or copied for the types used by the Governor and could grant similar benefits. See also #4029.

@johnsaigle johnsaigle changed the title Notary node: Add Notary package for verifying MessagePublications Mar 26, 2025
@johnsaigle johnsaigle added enhancement New feature or request node governor labels Mar 26, 2025
@johnsaigle johnsaigle force-pushed the notary branch 4 times, most recently from cdbdccd to 43b0fa5 Compare April 10, 2025 16:04
@johnsaigle johnsaigle marked this pull request as ready for review April 10, 2025 18:34
Verdict uint8
)

const (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to have the terminology be similar to the transfer verification? I feel having "delay" and "anomalous" have similar purposes and "Rejected" and "Blackhole".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, I think verificaiton state is a status and the verdict is an action. For example, in the future we may want to Blackhole Anomalous messages instead. It's also possible that the Notary could be extended to give different verdicts on different types of messages.

err = n.blackhole(msg)
v = Blackhole
case common.NotVerified, common.NotApplicable, common.Valid:
// NOTE: All other statuses are simply approved for now. In the future, it may be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the message is being processed at this point, it should never be in this state, right? I feel like we should just bail if this happens.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Valid and NotApplicable, this looks correct though.

Copy link
Contributor Author

@johnsaigle johnsaigle Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by bail? Do you want the function to return an error? My idea here was to fail open because ultimately we only care about Anomalous and Rejected right now.

NotVerified will be the status for all messages coming from chains without a transfer verifier so I think we want these to be approved as well.


// Iterate over all ready messages. Hand-off to the Governor or the Accountant
// if they're enabled. If not, publish.
for msg := range slices.Values(p.notary.Ready()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's repeated in three spots:

  • Here
  • msgC channel
  • Below for just the Governor.

I think a generalized function refactor to handle all of these cases would be good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it's worth, I believe that the flow is correct in all of these cases. However, it's complicated and duplicates some code, which indicates that a refactor may be helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm a bit nervous to dig into a big refactor of the processor, might want to do that in a follow-up PR instead.

n.logger.Debug("notary: processing message", msg.ZapFields()...)

// Only token transfers are currently supported.
if !vaa.IsTransfer(msg.Payload) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? I think that the case below for NotApplicable for probably handle this case right now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it does, I was thinking this could be an extra guard in case the Transfer Verifier logic gets changed but the Notary gets forgotten.

)
now := time.Now()
for n.delayed.Len() != 0 {
next := n.delayed.Peek()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like using the heap for processing! It's clever.

Will this be compatible with when we want to move a message from delayed to Approved or Blackhole? I believe that's a future PR on the admin commands to handle these situations but just want to make sure we're thinking through the design of that now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll go from Delayed to Approved automatically when the time runs out, but yeah the admin command will have to handle the case from moving manually from one place to the other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a RemoveItem function to the heap and added unit tests to show that it preserves the ordering of elements even if we remove an arbitrary one.

@johnsaigle johnsaigle marked this pull request as draft May 1, 2025 21:28
@johnsaigle
Copy link
Contributor Author

Converting to draft because the behaviour of the Transfer Verifier is changing to add a new Verification State that will be used for the EVM implementation. Given this, we may want to change how the Processor responds to those messages.

#4362

@johnsaigle
Copy link
Contributor Author

Note: I've moved the MessagePublication changes from this PR into a separate PR: #4428

Hopefully this makes things easier to digest and test. As a result, this PR is blocked but should be easily rebased after the other PR is merged.

@johnsaigle johnsaigle marked this pull request as ready for review July 28, 2025 15:25
Copy link
Collaborator

@djb15 djb15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass

// This is the main message processing loop. It is responsible for handling messages that are
// received on the message channel. Depending on the configuration, a message may be processed
// by the Notary, the Governor, and/or the Accountant.
// This loop effectively causes each of these components to be process messages in a modular
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// This loop effectively causes each of these components to be process messages in a modular
// This loop effectively causes each of these components to process messages in a modular

buf := new(bytes.Buffer)

// Compare with [PendingTransfer.Marshal].
//nolint:gosec // uint64 and int64 have the same number of bytes, and Unix time won't be negative.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nosec pls 🙏 Like the pattern of ignoring as little as possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks yeah this was authored before you left the similar comment on #4428 and i didn't come back and also apply that suggestion here. I'll fix this

// UnmarshalBinary implements BinaryUnmarshaler for [PendingMessage].
func (p *PendingMessage) UnmarshalBinary(data []byte) error {

if len(data) < marshaledMsgLenMin {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add 8 to account for the release time?

len(data) < marshaledMsgLenMin + 8

Comment on lines 79 to 85
// SECURITY: Only the functions labelled Safe should be called directly. The other
// functions must be public to satisfy the heap interface but do not perform
// safety checks.
// This follows the recommendations for using heap: heap.Pop() and heap.Push()
// should be used rather than calling the below functions directly, and the
// Safe functions do this.
// See: https://pkg.go.dev/container/heap#Interface
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find any SafePush functions? Is this out of date?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It must be, I'll take a look and rework this comment. thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so in an earlier phase of this I was going to write a SafePush and SafePop in order to avoid the panics in the normal Push and Pop that you can see here.
Instead the approach I took is to make all the relevant methods private and wrap them in the PendingMessageQueue struct. That struct's Push and Pop methods do the relevant checks before calling the unsafe methods.

I updated the comment.

case common.Rejected:
err = n.blackhole(msg)
v = Blackhole
case common.CouldNotVerify, common.NotVerified, common.NotApplicable, common.Valid:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth splitting these into 2 sets and adding some debug logging for testnet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, let me know what you think

// Update database.
err := n.database.DeleteDelayed(pMsg)
if err != nil {
n.logger.Error("delete pending message from notary database", pMsg.Msg.ZapFields(zap.Error(err))...)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this delete failed, the next time we release ready messages, the same message will be included again. Haven't checked yet myself, but would this have any downstream side effects of double processing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This iterates over and consumes elements from the in-memory queue rather than the database. So, that situation would only happen when the Guardian restarts and it loads the delayed messages from the DB into the queue.

In general, if a message is processed more than once it would need to flow through the Governor and the Accountant successfully in order to be processed into a VAA. If a VAA has already been produced for that message, then making it all the way through those checks should result in a duplicated VAA.

We should definitely think through any new edge-cases that occur here, but with the caveat that the Governor and the Guardians themselves already do not have a permanent record of already-processed messages or VAAs, so I think this new code doesn't change that model.

}

// Store in in-memory slice. This should happen even if a database error occurs.
n.delayed.Push(pMsg)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be checking the database/heap first to see if the message has already been delayed? For example, we don't want the same message to get delayed twice, where one then gets manually blackholed, and then the other gets released as normal after the delay, effectively circumventing the blackhole

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the latest commit I added a uniqueness check in the queue's Push method so that duplicate MessagePublications can't be added.

continue
case guardianNotary.Unknown:
p.logger.Error("notary returned Unknown verdict", k.ZapFields(zap.String("verdict", verdict.String()))...)
case guardianNotary.Approve:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add a default case that logs an error message? We shouldn't hit the default case so would be good to know if that happens

if p.acct != nil {
shouldPub, err := p.acct.SubmitObservation(msg)
if err != nil {
return fmt.Errorf("accountant: failed to process message `%s`: %w", msg.MessageIDString(), err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to return here? It'll mean we never get to the Governor pending messages?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should mimic the existing logic for Accountant errors.
The Governor is already processed just above in this branch. The governor-related code below is what runs if the Notary is not enabled.
I don't like how this code is structured (hence the TODO above) but I didn't want to mess around with the processor logic too much.

…bridge validation

- Replace TxID-based deduplication with MessageID for pending message queue
- Add emitter address validation for token bridge transfers
- Fix test helper functions to generate unique messages with proper sequence numbers
- Add duplicate prevention tests and improve error handling
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request governor node
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants