-
Notifications
You must be signed in to change notification settings - Fork 4
feat(remote-comms): Add explicit connection management for intentional disconnects #699
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| logger.log(`${channel.peerId}:: remote disconnected`); | ||
| logger.log(`${channel.peerId}:: remote intentionally disconnected`); | ||
| // Mark as intentionally closed and don't trigger reconnection | ||
| intentionallyClosed.add(channel.peerId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Remote Disconnects Permanently Block Connections
When a remote peer initiates a graceful disconnect (SCTP_USER_INITIATED_ABORT), the local kernel incorrectly adds that peer to intentionallyClosed, preventing future outbound connections. The intentionallyClosed set should only track peers that the local kernel explicitly closed via closeConnection(), not peers that remotely disconnected. This causes the local kernel to permanently refuse sending messages to peers that gracefully restarted, even though the local kernel never called closeConnection().
FUDCo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo my question below about what the "re" in "reconnect" actually means. If you are comfortable with whatever your answer to my question is, go ahead and land the PR, but at least consider whether what you've got here is actually what we want.
Also, this PR really drove home the amount of boilerplate involved in adding to the remote comms protocol. I'm not sure there's much if anything to be done about this, as doing something about it would probably be an awful lot of work, and I expect the protocol itself to be quite stable in general and therefor going through the effort of tweaking the boilerplate will (hopefully) be relatively rare, at least once we converge on the One True API. But it did give me pause.
| /** | ||
| * Launch a new worker with a specific vat id. | ||
| * | ||
| * @param vatId - The vat id of the worker to launch. | ||
| * @param vatConfig - The configuration for the worker. | ||
| * @returns A promise for a duplex stream connected to the worker | ||
| * which rejects if a worker with the given vat id already exists. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding these doc comments. It's interesting that our lint rules demand jsdoc comments on functions but not on methods. I wonder if we can change that (or did it actually change and that's what motivated these?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We follow the MetaMask eslint rule which doesn't demand jsdoc comments on methods. We certainly can enable it though
| */ | ||
| async reconnectPeer(peerId: string, hints: string[] = []): Promise<void> { | ||
| await this.#rpcClient.call('reconnectPeer', { peerId, hints }); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it considered reconnection (as one would do after a network outage) or is it just a new connection to the same endpoint as before? In particular, do we expect clist entries to survive a trip through manual disconnect/manual reconnect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is a reconnection not a new connection. It reuses the existing RemoteHandle and remote endpoint. Clist entries and other endpoint state persist across the disconnect/reconnect cycle.
Closes #685
Add
closeConnectionandreconnectPeermethods to allow explicit control over remote peer connections. The system now detects intentional disconnects (SCTP_USER_INITIATED_ABORT) and prevents automatic reconnection for intentionally closed peers. Users can manually reconnect usingreconnectPeerwhen needed. This change propagates through the entire kernel architecture including network layer, RemoteManager, Kernel, PlatformServices, and RPC handlers, with comprehensive test coverage.Note
Introduce
closeConnectionandreconnectPeerAPIs end-to-end and treat intentional disconnects as non-retryable, with manual reconnection support.closeConnection(peerId)andreconnectPeer(peerId, hints)with per-peer "intentionally closed" tracking.closeConnection,reconnectPeerinplatform-services.PlatformServicesto network layer; clear funcs onstopRemoteComms.Kernel.closeConnection()andKernel.reconnectPeer()viaRemoteManagerandRemoteCommstypes.remote-comms.ts, refactors) and adjust coverage thresholds.Written by Cursor Bugbot for commit 252f5d7. This will update automatically on new commits. Configure here.