[Enhancement]: Migrate away from handling each packet in a dedicated goroutine #71
Comments
I usually don't think of adding buffering as a good thing for performance - you can look to the world of network routing and Bufferbloat to see how hard they've been working to eliminate standard FIFO buffers due to the latency they add - but you've already mentioned some of these effects, so it's good that we're considering it. I think in the networking world the latest strategy is to throttle high-bandwidth connections to keep the latency low for everyone else, but I'm not sure how that maps to our problem space. Reserve some workers for "easy" things like ping and SecureConnection, and queue those separately?

Could we have a goroutine (or even a real process using fork) per user? Then if you're waiting 3 seconds for some db-heavy Friends call, it at least won't affect other users. We would also know when a goroutine is definitely not needed (the user it belongs to has their connection time out or disconnect). I can see this still ballooning out of control if we have a lot of simultaneous users, though.

Dropping packets might honestly be fine; UDP is designed for it, and it gives some backpressure to the client when we're under heavy load (potentially buying us more processing time too). I'd be curious how other programs that handle lots of connections with a worker pool model (e.g. nginx) handle this. I just worry that we might end up actually adding more latency under load if more requests come in than 100 goroutines can handle.
To be clear, the main purpose of this issue is system performance, not necessarily network performance. It's about managing server resources more effectively to prevent issues like what you saw with Splatoon on Discord.
To be clear, here I am talking about buffered channels in the context of Go, which are a core concept in Go's concurrency model. Channels can be either buffered or unbuffered. Buffered channels are better for preserving system resources, since they cap the amount of data queued inside the channel, at the cost of possibly losing additional data when the channel is full. Unbuffered channels are worse at preserving system resources, since every send has to wait for a receiver and the backlog builds up outside the channel (e.g. as blocked goroutines holding onto their data), but with the benefit that no data is actively lost just by handing it to the channel. Each has its pros and cons and is better or worse in different situations, which is why I mentioned both.
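As a rough illustration of the difference (a minimal sketch, assuming a hypothetical Packet type rather than whatever we actually queue):

```go
package main

import "fmt"

// Packet is a stand-in for whatever type the server actually queues.
type Packet struct {
	Data []byte
}

func main() {
	// Buffered: up to 64 packets can sit in the channel before sends block
	// (or before we decide to drop, if we use a non-blocking send).
	buffered := make(chan Packet, 64)

	// Unbuffered: capacity 0, every send blocks until a worker receives.
	unbuffered := make(chan Packet)

	fmt.Println(cap(buffered), cap(unbuffered)) // 64 0

	// Drain both so the sends below don't block forever in this toy example.
	go func() { <-buffered }()
	go func() { <-unbuffered }()

	buffered <- Packet{Data: []byte{0x01}}   // returns immediately while there's room
	unbuffered <- Packet{Data: []byte{0x02}} // waits until the receiver above is ready
}
```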
This is, effectively, what a buffered channel allows you to do. When more bandwidth comes in than we can handle, the data that doesn't fit simply gets shed (if using a non-blocking send, i.e. a select with a default case).
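Something like this non-blocking send is presumably what that would look like (reusing the hypothetical Packet type from the sketch above):

```go
// trySend attempts to queue a packet for the worker pool and sheds it
// when the buffered channel is already full (load shedding).
func trySend(queue chan<- Packet, p Packet) bool {
	select {
	case queue <- p:
		return true // a worker will pick it up
	default:
		return false // queue is full; drop the packet and rely on the client resending
	}
}
```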
This can be done, but it comes at the cost of now processing packet decoding synchronously. We don't know anything about the packet at the time the data comes in from the socket; it has to be decoded to know its type (such as whether it's a cheap ping or something heavier).
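To illustrate the point, a hypothetical read loop (isCheapPacket and the queue names are made up, not our actual API; Packet is the stand-in type from earlier) ends up doing the decode work before it can even decide where a packet should go:

```go
import "net"

// readLoop illustrates why routing by packet type forces decoding onto the
// read path: we can't tell a cheap ping apart from a heavy request until
// we've parsed the raw bytes that came off the socket.
func readLoop(conn *net.UDPConn, fastQueue, slowQueue chan<- Packet) {
	buf := make([]byte, 64*1024)
	for {
		n, _, err := conn.ReadFromUDP(buf)
		if err != nil {
			return
		}
		data := append([]byte(nil), buf[:n]...) // copy, since buf is reused

		// Hypothetical check standing in for the real header decoding.
		if isCheapPacket(data) {
			fastQueue <- Packet{Data: data}
		} else {
			slowQueue <- Packet{Data: data}
		}
	}
}
```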
This also works, but as you mentioned it has some issues. Having a real dedicated process for each user would definitely start to balloon a bit. Having a goroutine per user sort of works, but that only helps once we know the user (unless you mean "user" to mean "physical socket connection"), which has the same issue of processing packets synchronously as mentioned before. Additionally, a connection can send multiple packets at once, so limiting it to just one goroutine per connection (I'm ignoring the issues this would cause for virtual connections, btw) would mean that each packet the client sent would be processed (effectively) synchronously.
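A rough sketch of that per-connection model (hypothetical names again; it assumes dispatch only ever happens from the single socket read loop, since the map isn't locked):

```go
// connQueues maps a connection (keyed by remote address string) to its
// dedicated packet queue. One goroutine drains each queue, so packets from
// the same connection are processed strictly one after another.
var connQueues = make(map[string]chan Packet)

func dispatchToConnection(addr string, p Packet) {
	queue, ok := connQueues[addr]
	if !ok {
		queue = make(chan Packet, 16)
		connQueues[addr] = queue
		go func() {
			for pkt := range queue {
				handlePacket(pkt) // hypothetical handler; runs serially per connection
			}
		}()
	}
	queue <- p
}
```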
I agree this is fine, I just also noticed that Nintendo seems to not really have an issue with dropped packets? Dropped packets seem pretty rare in the dumps I've looked at, so they must have a better system in place than "just shed whatever we can't handle", such as what I mentioned on Discord with regards to splitting the PRUDP and RMC handling systems into their own processes connected with a message queue.
In most trivial cases it's a worker pool + channels (buffered or unbuffered depending on the needs of the application) for balancing data. It's why I brought this issue up in the first place: we're kinda just chucking goroutines at everything, which is typically frowned upon and a sign of less-than-optimal design. For non-trivial cases, this largely depends on the context and requirements of the application. I linked to a couple sources for such cases in the original issue.
To reiterate, this issue is not necessarily about latency. The main concern was about system resources, with the goal being to find a balance in the implementation that produces acceptable levels of latency and resource usage.
I believe the idea I mentioned on Discord could also be a good alternative to this proposal and honestly I'm voting we should move to this design regardless of what happens in this issue, as I think it's just overall a better design for these servers. For context, here's a copy-paste of my Discord message:
What enhancement would you like to see?
Currently we handle each request in its own goroutine. This was fine originally when there were fewer requests, but now that we're handling thousands upon thousands of requests this can create an ungodly number of goroutines. Under heavy load this means a lot of system resource usage. Goroutines are not "free"; even if small, they still take up memory and can create slowdowns during context switching. Under the right circumstances this can create a situation where clients have connection issues or large amounts of memory end up being used.
To account for this, it may be worth moving to a worker pool-based approach with a channel queue. This would mean creating a pool of premade goroutines, say 100 of them, to handle incoming packets. A channel queue would move each packet into a worker as soon as one becomes ready to handle it. This would keep resource usage at a somewhat more consistent, predefined level.
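A minimal sketch of that shape, with Packet and handlePacket as stand-ins for the real types and dispatch logic:

```go
package main

import "sync"

// Packet stands in for whatever the server actually receives per datagram.
type Packet struct {
	Data []byte
}

// handlePacket stands in for the real decode + dispatch logic.
func handlePacket(p Packet) {}

func main() {
	const numWorkers = 100

	// Buffered queue: incoming packets wait here until a worker is free.
	queue := make(chan Packet, 1024)

	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range queue {
				handlePacket(p)
			}
		}()
	}

	// The socket read loop would push into the queue instead of spawning
	// a new goroutine per packet:
	//   queue <- packet
	// Closing the queue lets the workers drain it and exit cleanly.
	close(queue)
	wg.Wait()
}
```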
Any other details to share? (OPTIONAL)
Considerations
Some more accurate testing needs to be done to determine how much of an issue the current design is. We've fixed a number of memory-related issues; however, we are still abusing goroutines pretty heavily, which goes against recommendations, and I do sometimes see the friends server hit some high-ish memory usage.
There are a number of edge cases and caveats to consider when using this sort of design as well. For instance, whether or not to buffer the channel.
Buffering the channel means it has a predefined size; once it's full, new data can be dropped rather than queued (via a non-blocking send), which is called "load shedding". This has the benefit of controlling memory usage better, but it means we will begin to drop packets if the channel becomes overloaded. We already handle instances where packets may be dropped, since that can happen no matter what, so maybe that's not a huge deal, but it does mean that if we are consistently under load then the servers will consistently drop packets.
Channels in Go can also be unbuffered. An unbuffered channel has no capacity of its own, so every sender blocks until a worker picks the packet up, which means the backlog has no real upper bound; under heavy load we will get more memory usage as blocked senders and their data pile up. This also means that clients may actually have a higher risk of timing out. In a system where the channel is buffered, if a packet is dropped then the next time the packet is sent by the client it may have a chance to get handled sooner, as there may be fewer packets in the queue at that time. However, with an unbounded backlog the amount of time before a packet is processed is variable. This means the queue may have many instances of the same packet if it's resent multiple times, and if the queue is very long then it may take a while before it gets processed.
Another thing to consider is the number of workers. Unlike in most languages, we can safely have more workers than CPU cores, since the Go runtime is very good at multiplexing them and context switching. However, since some operations may take a while to complete, we could run into an issue where all workers are busy. Say we have 100 workers and we get 101 requests, all of which take a second or more (this is a real possibility, as we had some friends server requests taking multiple seconds in the past). Suddenly that 101st request must wait an extra second to be processed because there are no workers available.
Alternatives
If we wanted to get really low level, we could do things like what Tailscale does. Tailscale uses some lower-level Linux facilities to improve performance and reliability, such as recvmmsg and netpoll, to handle incoming packets and find free goroutines to process them. This could be incredibly powerful and would likely help us a lot here, but it is also fairly complex and would likely break on platforms other than Linux.
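For reference, golang.org/x/net/ipv4 exposes batched reads that are backed by recvmmsg on Linux. This is only a rough sketch of the mechanism, not how Tailscale actually structures it:

```go
package main

import (
	"net"

	"golang.org/x/net/ipv4"
)

func main() {
	conn, err := net.ListenUDP("udp4", &net.UDPAddr{Port: 60000})
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	pc := ipv4.NewPacketConn(conn)

	// Prepare space for up to 8 datagrams per syscall.
	msgs := make([]ipv4.Message, 8)
	for i := range msgs {
		msgs[i].Buffers = [][]byte{make([]byte, 1500)}
	}

	for {
		// ReadBatch fills multiple messages in one call; on Linux this is
		// backed by recvmmsg, cutting syscall overhead under heavy traffic.
		n, err := pc.ReadBatch(msgs, 0)
		if err != nil {
			return
		}
		for i := 0; i < n; i++ {
			_ = msgs[i].Buffers[0][:msgs[i].N] // hand off to the worker pool here
		}
	}
}
```

Each batch read this way could then feed the worker pool queue described above instead of spawning per-packet goroutines.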