-
Notifications
You must be signed in to change notification settings - Fork 784
node: Add Notary package for verifying MessagePublications #4315
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
cdbdccd
to
43b0fa5
Compare
Verdict uint8 | ||
) | ||
|
||
const ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to have the terminology be similar to the transfer verification? I feel having "delay" and "anomalous" have similar purposes and "Rejected" and "Blackhole".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree, I think verificaiton state is a status and the verdict is an action. For example, in the future we may want to Blackhole Anomalous messages instead. It's also possible that the Notary could be extended to give different verdicts on different types of messages.
err = n.blackhole(msg) | ||
v = Blackhole | ||
case common.NotVerified, common.NotApplicable, common.Valid: | ||
// NOTE: All other statuses are simply approved for now. In the future, it may be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the message is being processed at this point, it should never be in this state, right? I feel like we should just bail if this happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Valid and NotApplicable, this looks correct though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by bail? Do you want the function to return an error? My idea here was to fail open because ultimately we only care about Anomalous and Rejected right now.
NotVerified
will be the status for all messages coming from chains without a transfer verifier so I think we want these to be approved as well.
node/pkg/processor/processor.go
Outdated
|
||
// Iterate over all ready messages. Hand-off to the Governor or the Accountant | ||
// if they're enabled. If not, publish. | ||
for msg := range slices.Values(p.notary.Ready()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's repeated in three spots:
- Here
- msgC channel
- Below for just the Governor.
I think a generalized function refactor to handle all of these cases would be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what it's worth, I believe that the flow is correct in all of these cases. However, it's complicated and duplicates some code, which indicates that a refactor may be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'm a bit nervous to dig into a big refactor of the processor, might want to do that in a follow-up PR instead.
node/pkg/notary/notary.go
Outdated
n.logger.Debug("notary: processing message", msg.ZapFields()...) | ||
|
||
// Only token transfers are currently supported. | ||
if !vaa.IsTransfer(msg.Payload) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? I think that the case below for NotApplicable for probably handle this case right now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it does, I was thinking this could be an extra guard in case the Transfer Verifier logic gets changed but the Notary gets forgotten.
) | ||
now := time.Now() | ||
for n.delayed.Len() != 0 { | ||
next := n.delayed.Peek() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like using the heap for processing! It's clever.
Will this be compatible with when we want to move a message from delayed to Approved
or Blackhole
? I believe that's a future PR on the admin commands to handle these situations but just want to make sure we're thinking through the design of that now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll go from Delayed to Approved automatically when the time runs out, but yeah the admin command will have to handle the case from moving manually from one place to the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a RemoveItem function to the heap and added unit tests to show that it preserves the ordering of elements even if we remove an arbitrary one.
Converting to draft because the behaviour of the Transfer Verifier is changing to add a new Verification State that will be used for the EVM implementation. Given this, we may want to change how the Processor responds to those messages. |
Note: I've moved the MessagePublication changes from this PR into a separate PR: #4428 Hopefully this makes things easier to digest and test. As a result, this PR is blocked but should be easily rebased after the other PR is merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass
node/pkg/processor/processor.go
Outdated
// This is the main message processing loop. It is responsible for handling messages that are | ||
// received on the message channel. Depending on the configuration, a message may be processed | ||
// by the Notary, the Governor, and/or the Accountant. | ||
// This loop effectively causes each of these components to be process messages in a modular |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// This loop effectively causes each of these components to be process messages in a modular | |
// This loop effectively causes each of these components to process messages in a modular |
node/pkg/common/pendingmessage.go
Outdated
buf := new(bytes.Buffer) | ||
|
||
// Compare with [PendingTransfer.Marshal]. | ||
//nolint:gosec // uint64 and int64 have the same number of bytes, and Unix time won't be negative. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nosec
pls 🙏 Like the pattern of ignoring as little as possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks yeah this was authored before you left the similar comment on #4428 and i didn't come back and also apply that suggestion here. I'll fix this
node/pkg/common/pendingmessage.go
Outdated
// UnmarshalBinary implements BinaryUnmarshaler for [PendingMessage]. | ||
func (p *PendingMessage) UnmarshalBinary(data []byte) error { | ||
|
||
if len(data) < marshaledMsgLenMin { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should add 8 to account for the release time?
len(data) < marshaledMsgLenMin + 8
node/pkg/common/pendingmessage.go
Outdated
// SECURITY: Only the functions labelled Safe should be called directly. The other | ||
// functions must be public to satisfy the heap interface but do not perform | ||
// safety checks. | ||
// This follows the recommendations for using heap: heap.Pop() and heap.Push() | ||
// should be used rather than calling the below functions directly, and the | ||
// Safe functions do this. | ||
// See: https://pkg.go.dev/container/heap#Interface |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find any SafePush
functions? Is this out of date?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It must be, I'll take a look and rework this comment. thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK so in an earlier phase of this I was going to write a SafePush and SafePop in order to avoid the panics in the normal Push and Pop that you can see here.
Instead the approach I took is to make all the relevant methods private and wrap them in the PendingMessageQueue struct. That struct's Push and Pop methods do the relevant checks before calling the unsafe methods.
I updated the comment.
node/pkg/notary/notary.go
Outdated
case common.Rejected: | ||
err = n.blackhole(msg) | ||
v = Blackhole | ||
case common.CouldNotVerify, common.NotVerified, common.NotApplicable, common.Valid: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth splitting these into 2 sets and adding some debug logging for testnet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, let me know what you think
// Update database. | ||
err := n.database.DeleteDelayed(pMsg) | ||
if err != nil { | ||
n.logger.Error("delete pending message from notary database", pMsg.Msg.ZapFields(zap.Error(err))...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this delete failed, the next time we release ready messages, the same message will be included again. Haven't checked yet myself, but would this have any downstream side effects of double processing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This iterates over and consumes elements from the in-memory queue rather than the database. So, that situation would only happen when the Guardian restarts and it loads the delayed messages from the DB into the queue.
In general, if a message is processed more than once it would need to flow through the Governor and the Accountant successfully in order to be processed into a VAA. If a VAA has already been produced for that message, then making it all the way through those checks should result in a duplicated VAA.
We should definitely think through any new edge-cases that occur here, but with the caveat that the Governor and the Guardians themselves already do not have a permanent record of already-processed messages or VAAs, so I think this new code doesn't change that model.
} | ||
|
||
// Store in in-memory slice. This should happen even if a database error occurs. | ||
n.delayed.Push(pMsg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be checking the database/heap first to see if the message has already been delayed? For example, we don't want the same message to get delayed twice, where one then gets manually blackholed, and then the other gets released as normal after the delay, effectively circumventing the blackhole
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the latest commit I added a uniqueness check in the queue's Push method so that duplicate MessagePublications can't be added.
continue | ||
case guardianNotary.Unknown: | ||
p.logger.Error("notary returned Unknown verdict", k.ZapFields(zap.String("verdict", verdict.String()))...) | ||
case guardianNotary.Approve: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add a default case that logs an error message? We shouldn't hit the default case so would be good to know if that happens
if p.acct != nil { | ||
shouldPub, err := p.acct.SubmitObservation(msg) | ||
if err != nil { | ||
return fmt.Errorf("accountant: failed to process message `%s`: %w", msg.MessageIDString(), err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to return here? It'll mean we never get to the Governor pending messages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should mimic the existing logic for Accountant errors.
The Governor is already processed just above in this branch. The governor-related code below is what runs if the Notary is not enabled.
I don't like how this code is structured (hence the TODO above) but I didn't want to mess around with the processor logic too much.
…bridge validation - Replace TxID-based deduplication with MessageID for pending message queue - Add emitter address validation for token bridge transfers - Fix test helper functions to generate unique messages with proper sequence numbers - Add duplicate prevention tests and improve error handling
Overview
New Notary package
This PR adds a "Notary" package that can evaluates the verification state of a Message Publication. It provides a Verdict on what should occur, e.g. it reports that an
Anomalous
message should be delayed.It stores messages that should be Delayed or Blackholed in a database (BadgerDB)
Details are in #4452
Changes to the Processor
The processor now inspects the Notary's queue of message to delay and skips publishing them until their ReleaseTime has passed. Once they are ready to release, it passes them to the Governor and Accountant as normal (assuming Guardians have them enabled).
Design decisions + supporting tasks
Database
The package uses the BadgerDB key-value store to delay and persist messages, similar to the Governor. This allows the Notary's decisions to persist across restarts.
An implementation is possible that would use only in-memory representations, but this has some drawbacks:
Rejected
status should never be processed until the end of time. We need a way to track this.Rather than use the existing
Database
struct (like Governor and Accountant), a newNotaryDB
struct was added that is a wrapper for the BadgerDB connection. The goal here is to isolate the Notary-related API from other modules.Creation of min-heap data structure PendingMessageQueue
Conceptually, it makes sense to store delayed messages according to their release time. If the collection is always sorted, we can access elements from one side or the other until one message is too recent to process. In that case, all of the other messages will also be too new, and we can exit early.
This also allows us to be flexible with our delay time, compared with storing the time at which the message is stored rather than when it should be delayed. We can delay a message for one minute, a day, or a year. As long as the sorting holds, we don't need to iterate over many items.
In order to achieve this, I've added a min-heap data structure with a safe API that allows for storing messages this way.
Background
Follow-up task on: #4233
Future Work
Add drop/release/reset admin commands
This package should allow a super-minority of Guardians to forcibly drop, release, or reset the delay time of delayed or blackholed messages. This can help when the Notary has delayed an Anomalous message that is in fact fraudulent - in this case, the message can be moved to the 'blackhole' deny-list. Alternatively, if the Transfer Verifier or Notary has a bug, this allows the Guardians to release a harmless message.
Add Guardian p2p feature flags
Similar to: #4317
This change will be done in a follow-up PR.
Use the priority message queue min-heap for the Governor
The new data structure could be made generic or copied for the types used by the Governor and could grant similar benefits. See also #4029.