Sync error handling #133

Merged 25 commits from sync-error-handling into sigp:unstable on Feb 19, 2025

Conversation

@Zacholme7
Member

Issue Addressed

#119

Proposed Changes

This PR introduces a way to handle RPC errors and signal whether sync is stalled.

If a websocket goes down or an RPC endpoint is having issues, OPERATIONAL_STATUS is set to false. The rest of the application can be conditioned on this value to determine whether the execution layer is having sync issues.

If there is an RPC error, there is nothing we can do until the endpoint is operational again. The simplest way to test for recovery is to continuously poll for a block number with exponential backoff. There is no longer a set number of retries; it just keeps retrying until the endpoint responds again.
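A minimal sketch of that recovery loop, assuming a tokio runtime. OPERATIONAL_STATUS and troubleshoot_rpc are the names used in this PR; the Provider trait, delay values, and error type are illustrative stand-ins, not the actual implementation:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

// Global health flag the rest of the application can condition on; set to
// `false` while the execution layer endpoint is unreachable.
static OPERATIONAL_STATUS: AtomicBool = AtomicBool::new(true);

// Minimal stand-in for the real RPC client (e.g. an alloy provider).
trait Provider {
    async fn get_block_number(&self) -> Result<u64, String>;
}

// Keep polling for a block number until the endpoint answers, doubling the
// wait between attempts up to a cap. No retry limit: we only return once
// the endpoint is operational again.
async fn troubleshoot_rpc<P: Provider>(provider: &P) {
    OPERATIONAL_STATUS.store(false, Ordering::Relaxed);
    let mut delay = Duration::from_secs(1);
    while provider.get_block_number().await.is_err() {
        tokio::time::sleep(delay).await;
        // Exponential backoff, capped at 64 seconds.
        delay = (delay * 2).min(Duration::from_secs(64));
    }
    OPERATIONAL_STATUS.store(true, Ordering::Relaxed);
}
```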

@Zacholme7 added the enhancement and execution layer labels Feb 11, 2025
@Zacholme7 marked this pull request as ready for review February 14, 2025 13:22
@dknopik
Member

dknopik commented Feb 14, 2025

I retested it locally. Could not get it to crash now. Nice!

But I noticed that right now, this error condition does not properly apply any backoff:

// If we get here, the stream ended (likely due to disconnect)
error!("WebSocket stream ended, reconnecting...");

This got me thinking once more about the approach. How do you feel about, instead of having to apply troubleshoot_rpc at multiple appropriate locations throughout the file, just returning an error in the live and historical functions and doing the reconnect logic at the top level? This also handles cases where we want to fall back to historical sync logic after being offline for a considerable amount of time, and helps us assert that we really do not crash out anymore if something fails. There might be some disadvantages I am not thinking of right now though.
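Continuing the hypothetical sketch above (reusing its Provider trait and troubleshoot_rpc), the top-level structure being suggested might look roughly like this. The try_sync, historical_sync, and live_sync names match the span names in the logs later in this thread, and tracing's error! macro matches the log format; SyncError and its variants are assumptions (WsError does appear in the logs below):

```rust
use tracing::error;

// Hypothetical error type for the sync module; WsError shows up in the
// logs later in this thread, RpcError is an assumption.
#[derive(Debug)]
enum SyncError {
    WsError(String),
    RpcError(String),
}

// The sync functions return errors instead of reconnecting themselves.
async fn historical_sync<P: Provider>(_provider: &P) -> Result<(), SyncError> {
    // Catch up from the deployment block to the chain tip.
    todo!()
}

async fn live_sync<P: Provider>(_provider: &P) -> Result<(), SyncError> {
    // Follow new blocks over the websocket subscription.
    todo!()
}

async fn try_sync<P: Provider>(provider: &P) -> Result<(), SyncError> {
    historical_sync(provider).await?;
    live_sync(provider).await
}

// All reconnect logic lives in one place at the top level. Re-entering
// try_sync after an outage starts with a historical pass, so blocks missed
// while offline are caught up automatically.
async fn sync<P: Provider>(provider: &P) {
    loop {
        match try_sync(provider).await {
            Ok(()) => break,
            Err(e) => {
                error!("Sync failed, attempting recovery: {:?}", e);
                troubleshoot_rpc(provider).await;
            }
        }
    }
}
```

Centralizing recovery this way also makes it easier to assert that no failure path can crash the process: every error funnels into the same retry loop.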

@Zacholme7
Member Author

@dknopik Threw together a quick POC. I think that approach works very nicely. Take a look when you have a sec, and if that's along the lines of what you were thinking I'll clean it up and make sure it works.

@dknopik
Member

dknopik commented Feb 14, 2025

> Threw together a quick POC. I think that approach works very nicely. Take a look when you have a sec, and if that's along the lines of what you were thinking I'll clean it up and make sure it works.

yea, seems good! This is what I meant. :)

@dknopik
Member

dknopik commented Feb 17, 2025

Something seems a bit funky still: when disconnecting during live sync, backoff works great, but when reconnecting we enter a weird state where we loop like this without backoff:

2025-02-17T08:24:03.006389Z  INFO sync:try_sync{contract_address=0x38a4794cced47d3baf7370ccc43b560d3a1beefa deployment_block=181612}: eth::sync: Starting historical sync
2025-02-17T08:24:03.006411Z DEBUG sync:try_sync{contract_address=0x38a4794cced47d3baf7370ccc43b560d3a1beefa deployment_block=181612}:historical_sync: alloy_rpc_client::call: sending request method=eth_blockNumber id=349
2025-02-17T08:24:03.006536Z DEBUG sync:try_sync{contract_address=0x38a4794cced47d3baf7370ccc43b560d3a1beefa deployment_block=181612}:historical_sync:ReqwestTransport{url=http://127.0.0.1:8545/}: hyper_util::client::legacy::pool: reuse idle connection for ("http", 127.0.0.1:8545)
2025-02-17T08:24:03.026185Z DEBUG sync:try_sync{contract_address=0x38a4794cced47d3baf7370ccc43b560d3a1beefa deployment_block=181612}:historical_sync:ReqwestTransport{url=http://127.0.0.1:8545/}: hyper_util::client::legacy::pool: pooling idle connection for ("http", 127.0.0.1:8545)
2025-02-17T08:24:03.026239Z DEBUG sync:try_sync{contract_address=0x38a4794cced47d3baf7370ccc43b560d3a1beefa deployment_block=181612}:historical_sync:ReqwestTransport{url=http://127.0.0.1:8545/}: alloy_transport_http::reqwest_transport: received response from server status=200 OK
2025-02-17T08:24:03.026283Z DEBUG sync:try_sync{contract_address=0x38a4794cced47d3baf7370ccc43b560d3a1beefa deployment_block=181612}:historical_sync:ReqwestTransport{url=http://127.0.0.1:8545/}: alloy_transport_http::reqwest_transport: retrieved response body. Use `trace` for full body bytes=47
2025-02-17T08:24:03.026347Z  INFO sync:try_sync{contract_address=0x38a4794cced47d3baf7370ccc43b560d3a1beefa deployment_block=181612}:historical_sync: eth::sync: Synced up to the tip of the chain, breaking
2025-02-17T08:24:03.026364Z  INFO sync:try_sync{contract_address=0x38a4794cced47d3baf7370ccc43b560d3a1beefa deployment_block=181612}:historical_sync: eth::sync: Historical sync completed
2025-02-17T08:24:03.026390Z  INFO sync:try_sync{contract_address=0x38a4794cced47d3baf7370ccc43b560d3a1beefa deployment_block=181612}: eth::sync: Starting live sync
2025-02-17T08:24:03.026415Z  INFO sync:try_sync{contract_address=0x38a4794cced47d3baf7370ccc43b560d3a1beefa deployment_block=181612}:live_sync: eth::sync: Network up to sync..
2025-02-17T08:24:03.026430Z  INFO sync:try_sync{contract_address=0x38a4794cced47d3baf7370ccc43b560d3a1beefa deployment_block=181612}:live_sync: eth::sync: Current state
2025-02-17T08:24:03.026444Z  INFO sync:try_sync{contract_address=0x38a4794cced47d3baf7370ccc43b560d3a1beefa deployment_block=181612}:live_sync: eth::sync: Starting live sync contract_address=0x38a4794cced47d3baf7370ccc43b560d3a1beefa
2025-02-17T08:24:03.026489Z ERROR sync: eth::sync: Sync failed, attempting recovery e=WsError("Failed to subscribe to block stream: backend connection task has stopped")

@Zacholme7
Member Author

@dknopik Should be fixed now; I was matching against the wrong error.
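For context on the kind of fix being described, the recovery branch has to match the variant the failing path actually produces; a miss there restarts the loop with no backoff, which is the behaviour visible in the logs above. A sketch, reusing the hypothetical SyncError, Provider, and troubleshoot_rpc from earlier in this thread:

```rust
// If the websocket failure surfaces as SyncError::WsError but the recovery
// code only matched a different variant, the backoff path was never taken.
// Matching every failure variant explicitly ensures each error backs off.
async fn recover<P: Provider>(provider: &P, e: SyncError) {
    match e {
        // "backend connection task has stopped": wait (with backoff) until
        // the endpoint answers before rebuilding the subscription.
        SyncError::WsError(_) => troubleshoot_rpc(provider).await,
        // Plain RPC failures go through the same backoff poll.
        SyncError::RpcError(_) => troubleshoot_rpc(provider).await,
    }
}
```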

@dknopik
Member

dknopik left a comment

LGTM. If you consider it ready, feel free to merge.

@Zacholme7 requested a review from dknopik February 19, 2025 14:20
@Zacholme7 merged commit 1d1991f into sigp:unstable Feb 19, 2025
10 checks passed
@Zacholme7 deleted the sync-error-handling branch February 19, 2025 14:37