client-api: Rewrite websocket loop #2906
Conversation
Split the websocket stream into send and receive halves and spawn a new tokio task to handle the sending. Also move message serialization + compression to a blocking task if the message appears to be large. This addresses two issues:

1. The `select!` loop is not blocked on sending messages, and can thus react to auxiliary events. Namely, when a module exits, we want to terminate the connection as soon as possible in order to release any database handles.
2. Large outgoing messages should not occupy tokio worker threads, in particular when there are a large number of clients receiving large initial updates.
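A rough sketch of that split-and-spawn shape, assuming `tokio-tungstenite` and `futures-util`; the `run_connection` signature, the channel capacity, and all names here are illustrative, not code from this PR:

```rust
use futures_util::{SinkExt, StreamExt};
use tokio::sync::mpsc;
use tokio_tungstenite::{tungstenite::Message, WebSocketStream};

async fn run_connection<S>(ws: WebSocketStream<S>)
where
    S: tokio::io::AsyncRead + tokio::io::AsyncWrite + Unpin + Send + 'static,
{
    // Split into independent halves so slow sends never block the receive loop.
    let (mut ws_tx, mut ws_rx) = ws.split();

    // Outgoing messages are queued here and drained by a dedicated send task;
    // `send_queue_tx` would be handed to whatever produces outgoing messages.
    let (send_queue_tx, mut send_queue_rx) = mpsc::channel::<Message>(64);
    let send_task = tokio::spawn(async move {
        while let Some(msg) = send_queue_rx.recv().await {
            if ws_tx.send(msg).await.is_err() {
                break; // peer is gone; stop draining
            }
        }
    });

    // In the real loop this would be a select! over the socket, module-exit
    // signals, etc.; the point is that it never awaits a socket send, so it
    // stays responsive to those auxiliary events.
    while let Some(frame) = ws_rx.next().await {
        match frame {
            Ok(_incoming) => { /* dispatch to the module host */ }
            // Receive errors must terminate the loop, otherwise the
            // connection's resources are never released.
            Err(_) => break,
        }
    }

    // Dropping the sender closes the queue, letting the send task exit.
    drop(send_queue_tx);
    let _ = send_task.await;
}
```

With this shape, tearing down the connection only requires making the receive loop return; the send half follows once the queue is closed.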
I'd like to figure out what's going on with the SerializeBuffer
and fix it before merging, but otherwise this looks good to me.
Also close the messages queue after the close went through. Accordingly, closed and exited are the same -- we can just drop incoming messages when closed.
Updated to:
I think that we should also clear the message queue and cancel outstanding execution futures in the latter case, but that can be left to a future change.
I looked through this for a while, and I'm still not very confident that I understand the error cases. I think we should do some bot testing with this to see what effect it has, but I think I'd like to try writing some tests for this, so we can trigger some of these tricky cases.
Also fixes the actual resource hog, which is that the ws_actor never terminated because all receive errors were ignored.
Pong frames sit in line with previously sent messages, and so may not be received in time if the server is backlogged. We also want to time out the connection in "pulled cable" scenarios, where the kernel doesn't consider the connection gone until `tcp_keepalive_time`.
Also loosen `Unpin` requirements and use long names for type variables denoting futures.
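A hedged sketch of the kind of application-level liveness check described here; the intervals, the `Frame` type, and the channel plumbing are assumptions for illustration, not the PR's actual implementation:

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::{interval, Instant};

const PING_INTERVAL: Duration = Duration::from_secs(15); // assumed value
const IDLE_TIMEOUT: Duration = Duration::from_secs(60); // assumed value

#[allow(dead_code)]
enum Frame {
    Ping,
    Pong,
    Data(Vec<u8>),
}

/// `inbound` stands in for the receive half of the socket; `outbound` for the
/// queue feeding the dedicated send task.
async fn liveness_loop(mut inbound: mpsc::Receiver<Frame>, outbound: mpsc::Sender<Frame>) {
    let mut ping_timer = interval(PING_INTERVAL);
    let mut last_seen = Instant::now();

    loop {
        tokio::select! {
            frame = inbound.recv() => match frame {
                // Any inbound frame (data or pong) proves the peer is alive.
                Some(_) => last_seen = Instant::now(),
                None => break, // socket closed
            },
            _ = ping_timer.tick() => {
                if last_seen.elapsed() >= IDLE_TIMEOUT {
                    // No traffic within the deadline: give up on the connection
                    // instead of waiting out the kernel's tcp_keepalive_time.
                    break;
                }
                // Pings go through the same send queue, so a successful send
                // only means "enqueued", not "delivered"; hence the deadline.
                let _ = outbound.send(Frame::Ping).await;
            }
        }
    }
}
```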
This is a huge improvement, many thanks. The tests are very encouraging. I've left a number of requests for new or expanded comments, as I'm wary that in a few months or years we'll end up with a similarly opaque tangle of logic to the previous version if multiple collaborators don't fully understand what's happening.
Because there is no inbound buffering, the queue length metric will not be updated.
Should we remove this metric? Or is that going to break our dashboards in some frustrating way?
I'd like to put this up for discussion. The effect of not buffering is that we have no or reduced visibility into a situation where the database is lagging behind. We may alternatively insert a bounded queue, where the effect on the metric is likely that the queue is reported as either full or empty. But maybe that's good enough?
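To illustrate the bounded-queue alternative, a minimal sketch might look like this; the capacity, the `QUEUE_LEN` atomic (standing in for the real metric gauge), and all other names are assumptions:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use tokio::sync::mpsc;

// Stand-in for the real queue-length gauge exported by the metrics system.
static QUEUE_LEN: AtomicUsize = AtomicUsize::new(0);

/// Bounded inbound queue: `push` applies backpressure once `capacity` messages
/// are waiting, so the gauge tends to read either 0 or `capacity`.
struct InboundQueue {
    tx: mpsc::Sender<Vec<u8>>,
}

impl InboundQueue {
    fn new(capacity: usize) -> (Self, mpsc::Receiver<Vec<u8>>) {
        let (tx, rx) = mpsc::channel(capacity);
        (Self { tx }, rx)
    }

    async fn push(&self, msg: Vec<u8>) {
        // Waits when the queue is full, i.e. when the database lags behind.
        self.tx.send(msg).await.expect("consumer dropped");
        QUEUE_LEN.fetch_add(1, Ordering::Relaxed);
    }
}

async fn consume(mut rx: mpsc::Receiver<Vec<u8>>) {
    while let Some(_msg) = rx.recv().await {
        QUEUE_LEN.fetch_sub(1, Ordering::Relaxed);
        // hand `_msg` to the module host / database here
    }
}
```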
Co-authored-by: Phoebe Goldman <[email protected]> Signed-off-by: Kim Altintop <[email protected]>
Beautiful, thanks!
Signed-off-by: Kim Altintop <[email protected]> Co-authored-by: Phoebe Goldman <[email protected]>
Split the websocket stream into send and receive halves and spawn a new tokio task to handle the sending. Also move message serialization + compression to a blocking task if the message appears to be large.

This addresses two issues:

1. The `select!` loop is not blocked on sending messages, and can thus react to auxiliary events. Namely, when a module exits, we want to terminate the connection as soon as possible in order to release any database handles.
2. Large outgoing messages should not occupy tokio worker threads, in particular when there are a large number of clients receiving large initial updates.
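A minimal sketch of the serialization/compression offload described above; `OutgoingMessage`, `serialize_and_compress`, and the 4 KiB threshold are placeholders rather than names or values from this PR:

```rust
/// Placeholder for whatever the server sends to a client.
struct OutgoingMessage {
    rows: Vec<Vec<u8>>,
}

impl OutgoingMessage {
    fn estimated_size(&self) -> usize {
        self.rows.iter().map(Vec::len).sum()
    }
}

/// Stand-in for the real encoding plus compression step.
fn serialize_and_compress(msg: &OutgoingMessage) -> Vec<u8> {
    msg.rows.concat()
}

// Assumed cut-off; "appears to be large" is judged from the unencoded size.
const LARGE_MESSAGE_THRESHOLD: usize = 4 * 1024;

async fn encode_outgoing(msg: OutgoingMessage) -> Vec<u8> {
    if msg.estimated_size() >= LARGE_MESSAGE_THRESHOLD {
        // Large initial updates would otherwise pin a tokio worker thread for
        // the whole serialize + compress pass; push that onto the blocking pool.
        tokio::task::spawn_blocking(move || serialize_and_compress(&msg))
            .await
            .expect("encode task panicked")
    } else {
        // Small messages are cheap enough to encode inline.
        serialize_and_compress(&msg)
    }
}
```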
EDIT: This patch started out to address the above issues, but evolved into a rewrite as more issues were discovered. Namely:
API and ABI breaking changes
Because there is no inbound buffering, the queue length metric will not be updated.
Expected complexity level and risk
4 - The state transitions remain hard to follow.
Testing
and observed no hangs / delays (which I did before this patch).
In reconnection scenarios, all clients were disconnected promptly, but could reconnect almost immediately.