-
Notifications
You must be signed in to change notification settings - Fork 256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for driving multiple EL nodes from a single Nimbus BN #4465
Conversation
2583cc0
to
dcc4458
Compare
all documentation needs to be updated accordingly
ditto, needs an update to the docs
does anything test the plain and simple validators on mainnet?
with config file support available, we should be moving away from these - ie they're crutches at best and make the manual hard to write (because they're not consistently used for all commands such as trusted node sync) |
since it's a new parameter, it's a good opportunity to default it to localhost:8551 so that a default "localhost" setup just works with as few command line parameters as possible (to be further improved with ethereum/execution-apis#302) - it should also be possible to remove the default so that no el connection is attempted |
beacon_chain/eth1/eth1_monitor.nim
Outdated
@@ -266,27 +356,18 @@ when hasGenesisDetection: | |||
withdrawal_credentials: deposit.withdrawal_credentials) | |||
m.genesisValidatorKeyToIndex[pubkey] = idx | |||
|
|||
proc processGenesisDeposit*(m: Eth1Monitor, newDeposit: DepositData) = | |||
m.depositsChain.db.genesisDeposits.add newDeposit | |||
proc processGenesisDeposit*(m: ELManager, newDeposit: DepositData) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the el manager managing multiple EL:s, it makes even more sense to move the deposit processing logic to a separate module that can be unit tested in isolation - ie there should a module whose role is to take as input eth1 blocks and produce a deposit structure / state, similar to the chaindag - all of this is synchronous logic that is orthogonal to the multiplexing of several connections - it's a very natural line along which this code could be simplified and rendered robust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ie "processing deposits" and "managing the live payload interaction between el and CL" are to completely stand-alone tasks - if this was designed like the VC, these would be separate services living on top of a manager whose role is solely to multiplex - a design like this would significantly untangle what right now is a monolith that's hard to work with
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These activities are already independent enough. They run as different "asynchronous threads" of execution which both share access to the Eth1Chain
structure which is indeed a bit like the chaindag.
Otherwise, I've wanted to reduce the required inputs of this particular function further in order to make it more reusable and easier to test, but the main problem is that it has multiple different responsibilities:
- It must download all deposits, while forming a sparse chain of all blocks with deposits
- It must download all recent blocks, even if they don't have deposits
- It must obtain the timestamps for some of the blocks (currently, it's not doing this optimally and I plan to submit a follow-up PR fixing this).
As a result, it has a somewhat awkward monolithic internal design in order to allow these goals to be achieved with the minimal number of requests. I'll add some comments on the top to explain this. Removing the ELManager
input is a worthy goal which is relatively within reach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These activities are already independent enough.
this is a comment about the code complexity - the monolithic design has regularly caused bugs and confusion when trying to approach the code for changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous code has quite a bit more complex control flow because it was relying on things like async events which can be fired from either the websocket header subscriptions or HTTP polling. The current version is quite a bit simpler and introducing a notion of "services" will only bring back some of the complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really - there's a portion of the logic that is not asynchronous and not dependent on EL communication - it checks for consistency of a new block with respect to the others and has an independent API with error handling and logic - it would be exactly the same if it was driven by an internal eth1 engine without async comms, or indeed a feed of blocks read from disk - this is the part that doesn't belong here, but rather in a separate module which receives tests etc - the code in elmanager at that point is a dumb data router, as it should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, sure - the Eth1Chain
object can be moved to a separate module. It's already used in the non-asynchronous way that you describe as part of the block_sim
simulation which injects a random number of deposits and uses them during block production.
beacon_chain/eth1/eth1_monitor.nim
Outdated
elif bestPayloadIdx.isNone: | ||
bestPayloadIdx = some idx | ||
else: | ||
if cmpGetPayloadResponses(req.read, requests[bestPayloadIdx.get].read) > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the payload comparison done in elmanager? it seems to me this should be the responsibility of the same module that compares blocks from all sources, ie blinded blocks included thus concentrating the block selection logic in one place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a simple utility function. I agree that it can be moved to another module once it's applied in more places.
Since you really want the best block, it doesn't matter if comparisons are done in multiple places. It would be silly to return multiple block candidates from the EL manager, just so there could be a single place where all blocks will be compared at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, the point is to concentrate the selection strategy to a single place - at that point it becomes feasible to create other strategies for block selection (ie ensuring that a block contains a particular tx etc) - the elmanager already has too many responsibilities, many of which are not tested - this is one more which could be moved to a place where it could become a regular part of the test suite - since this is a recurring design problem in the elmanager, it's a good opportunity to get it done sooner rather than later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but I still find the requested design silly. If you want to have a single block selection algorithm, one easy way to implement this would be to inject a BlockSelector
dependency in the EL manager that will be called in place of the cmpGetPayloadResponses
function. Since any block selection algorithm is likely to be a simple functions will very little configurable state, it's trivial to test in unit tests without depending on any part of the ELManager
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would be one way to do it, but ideally it would be able to also take into consideration blinded blocks (for example, a popular strategy is to use a blinded block iff the reward is greater by a percentage compared to an unblinded block, so as to give slight preference to censorship resistance, but not at the expense of large profits).
It is the mixing of all these responsiblities that has led to the poor track record of elmanager over time - the tests don't get written because it is too complex and does too many things and the way to solve that is to split its responsiblities into parts that relate to logic and parts that relate to connection management and handling.
turning off the EL puts the the beacon node in a busy loop (see timing), with this PR |
dcc4458
to
2b6f4e2
Compare
Future testnets will, by all appearances, start merged and in Bellatrix or newer. |
4956482
to
6a3dc07
Compare
Full list of changes: * Eth1Monitor has been renamed to ELManager to match its current responsibilities better. * The ELManager is no longer optional in the code (it won't have a nil value under any circumstances). * The support for subscribing for headers was removed as it only worked with WebSockets and contributed significant complexity while bringing only a very minor advantage. * The `--web3-url` parameter has been deprecated in favor of a new `--el` parameter. The new parameter has a reasonable default value and supports specifying a different JWT for each connection. Each connection can also be configured with a different set of responsibilities (e.g. download deposits, validate blocks and/or produce blocks). On the command-line, these properties can be configured through URL properties stored in the #anchor part of the URL. In TOML files, they come with a very natural syntax (althrough the URL scheme is also supported). * The previously scattered EL-related state and logic is now moved to `eth1_monitor.nim` (this module will be renamed to `el_manager.nim` in a follow-up commit). State is assigned properly either to the `ELManager` or the to individual `ELConnection` objects where appropriate. The ELManager executes all Engine API requests against all attached EL nodes, in parallel. It compares their results and if there is a disagreement regarding the validity of a certain payload, this is detected and the beacon node is protected from publishing a block with a potential execution layer consensus bug in it. The BN provides metrics per EL node for the number of successful or failed requests for each type Engine API requests. If an EL node goes offline and connectivity is resoted later, we report the problem and the remedy in edge-triggered fashion. * More progress towards implementing Deneb block production in the VC and comparing the value of blocks produced by the EL and the builder API. * Adds a Makefile target for the zhejiang testnet
6a3dc07
to
98586e1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed a block with Failed to obtain execution payload from EL
, despite Geth logging Stopping work on payload
only 0.679s after Starting work on payload
.
But, Geth also logged Chain head was updated
0.858s before finishing the payload, which is probably the await rpcClient.forkchoiceUpdated
within getPayloadFromSingleEL
.
The 1s timeout is probably too aggressive with the current implementation, as it spans the entire block building process and not just the actual getPayload
call.
raise newException(CatchableError, "Head block is not a valid payload") | ||
|
||
# Give the EL some time to assemble the block | ||
await sleepAsync(chronos.milliseconds 500) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
engine_api.getPayload
further down here may take another 1000 ms, so if we hit this path, the outer await requestsCompleted or deadline
may timeout on spec compliant ELs.
# TODO We should follow the spec and track the timeouts of | ||
# the individual engine API calls inside `getPayloadFromSingleEL`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be done even in the pre-Deneb case. Currently, the 1s timeout includes all of getPayloadFromSingleEL
and doesn't just wrap the getPayload
message.
Furthermore, we should probably start deadline
only after all requests have been sent, so that all ELs have the full 1s available for getPayload
.
Also, @tersec, when using an external VC, we don't seem to eagerly trigger the |
Well, it's not intended. But it seems like it at least should be set up to work, so would have to look at why it might not. nimbus-eth2/beacon_chain/validators/action_tracker.nim Lines 236 to 239 in 187e1a0
nimbus-eth2/beacon_chain/consensus_object_pools/consensus_manager.nim Lines 289 to 299 in 187e1a0
nimbus-eth2/beacon_chain/consensus_object_pools/consensus_manager.nim Lines 342 to 349 in 187e1a0
nimbus-eth2/beacon_chain/gossip_processing/block_processor.nim Lines 630 to 631 in 187e1a0
So this should only be failing to handle VC validators if |
That's correct, so probably something else going on :-/ It's indeed calling the same source as is used in the log |
@tersec Found the culprit: # In Capella and later, computing correct withdrawals would mean creating a
# proposal state. Instead, only do that at proposal time.
if nextWallSlot.is_epoch:
debug "runProposalForkchoiceUpdated: not running early fcU for epoch-aligned proposal slot",
nextWallSlot, validatorIndex, nextProposer
return Actually it has to be this one, as I don't see any fcU sent to Geth in the slot prior to my missed block. Which is consistent with the behaviour when Should we simply fall back to a regular fcu in that case, so that at least the attestations are correct? |
|
Another problem was that I added an additional delay to the slot processing, to wait until shortly before end of slot before calling |
In #4465, a regression was introduced by deleting the periodic call to `engine_exchangeTransitionConfiguration` in the Nimbus light client. This led to the light client never treating the EL as online and, subsequently, not sending `engine_newPayload` requests for new blocks. Strangely, `engine_forkchoiceUpdated` requests still make it through :) Geth still requires both `engine_newPayload` and `fcU` to be called. By restoring the `exchangeTransitionConfiguration` loop, `newPayload` requests are once more issued to the EL.
In #4465, a regression was introduced by deleting the periodic call to `engine_exchangeTransitionConfiguration` in the Nimbus light client. This led to the light client never treating the EL as online and, subsequently, not sending `engine_newPayload` requests for new blocks. Strangely, `engine_forkchoiceUpdated` requests still make it through :) Geth still requires both `engine_newPayload` and `fcU` to be called. By restoring the `exchangeTransitionConfiguration` loop, `newPayload` requests are once more issued to the EL.
Full list of changes:
Eth1Monitor has been renamed to ELManager to match its current responsibilities better.
The ELManager is no longer optional in the code (it won't have a nil value under any circumstances).
The support for subscribing for headers was removed as it only worked with WebSockets and contributed significant complexity while bringing only a very minor advantage.
Transition configuration is now exchanged 2 weeks prior to the Bellatrix fork (at this point, relevant only for future testnets).
The
--web3-url
parameter has been deprecated in favor of a new--el
parameter. The new parameter supports specifying the JWT token per connection. Each connection can also be configured with a different set of responsibilities (e.g. download deposits, validate blocks and/or produce blocks). On the command-line, these properties can be configured through URL properties stored in the #anchor part of the URL. In TOML files, they come with a very natural syntax (althrough the URL scheme is also supported).The previously scattered EL-related state and logic is now moved to
eth1_monitor.nim
(this module will be renamed toel_manager.nim
in a follow-up commit). State is assigned properly either to theELManager
or the to individualELConnection
objects where appropriate.The ELManager executes all Engine API requests against all attached EL nodes, in parallel. It compares their results and if there is a disagreement regarding the validity of a certain payload, this is detected and the beacon node is protected from publishing a block with a potential execution layer consensus bug in it.
The BN provides metrics per EL node for the number of successful or failed requests for each type Engine API requests. If an EL node goes offline and connectivity is resoted later, we report the problem and the remedy in edge-triggered fashion.
The client can now produce Capella execution payloads
The local testnet simulation script now runs Geth and Nimbus by default. The minimal network simulation uses a Nimbus remote signing node for 50% of the validators. The mainnet simulation uses the official web3signer in a 2 out of 3 threshold signing config.
The local testnet simulation can now use a payload builder. This is currently not activated in CI due to lack of automated procedures for installing third-party relays or builders.
You are adviced to use mergemock for now, but for most realistic results, we can create a simple builder based on the nimbus-eth1 codebase that will be able to propose transactions from the regular network mempool.
Updated the
run-geth-el.sh
script to work with the latest Capella and EIP-4844 builds of Geth and added therun-nimbus-eth2-in-withdrawal-testnet.sh
script that can be used to launch Nimbus instances in the available withdrawal devnets.