jemalloc benchmarks #635
Hey, I recently benchmarked using jemalloc on my NUMA servers, and the difference was quite large.

Using the system allocator I got 710k req/s, while with jemalloc I got 830k req/s. That's 17% faster just by dynamically linking with a library.

Anyone else tried this?
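jemalloc can be swapped in without any code changes by linking it ahead of the system allocator. As a sanity check that it is actually the active allocator, a minimal sketch (assuming the jemalloc development headers are installed, e.g. from a jemalloc-devel package) can query its version through mallctl:

```cpp
// Build: g++ -O2 check_jemalloc.cpp -o check_jemalloc -ljemalloc
#include <jemalloc/jemalloc.h>
#include <cstdio>

int main() {
    const char *version = nullptr;
    size_t len = sizeof(version);
    // mallctl("version", ...) only succeeds when jemalloc is linked in.
    if (mallctl("version", &version, &len, nullptr, 0) == 0)
        std::printf("jemalloc %s is active\n", version);
    else
        std::printf("jemalloc does not appear to be active\n");
    return 0;
}
```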
Hello and welcome! I have no experience here, as NUMA is a pretty niche architecture and I don't have access to such hardware, but we would certainly welcome a write-up if you have the time and are willing to share your experience.
We have two servers connected with a dual-port 200 Gbit Mellanox interface. One is a dual-socket 26-core Intel Xeon Gold, and the other is a dual-socket 48-core AMD EPYC, so due to hyper-threading you see 104 logical CPUs on the Intel and 192 on the AMD. Both servers are split in two, with two separate but still globally accessible memory banks, each attached to one CPU socket.

https://gist.github.com/fwsGonzo/dc5706ad8002211bad7dd122cbd20e16

I've added two benchmarks with jemalloc disabled for comparison; in both cases the performance was lower. jemalloc was added by just linking to the one installed on the CentOS 8 system. I also updated the main post with the benchmark for jemalloc disabled. I was going from memory last time, and it was clearly not right. Still, jemalloc provided a significant gain.

The benchmarks are done by homing in on the highest stable req/s, which usually means 1-2 ms average latency. Whenever you increase the load too much, one or more CPU cores will struggle and the benchmarks will show extreme spikes. I call this the breaking point, and I've been using it as a measure for a while now.

The current bottleneck is most likely the kernel-side TCP acceptors. It may be possible to get even more speed just by creating more listeners on the same port using SO_REUSEPORT, but we will see. Another potential benefit is being able to map a listener to a CPU and keep each packet on-CPU, and possibly in-cache, all the way. Regardless, the worker threads are under-utilized while the listeners look like they are always overloaded.
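For reference, this is roughly what SO_REUSEPORT means at the socket level (a plain POSIX sketch, not Drogon code, with error handling omitted): every listener sets the option before bind(), and the kernel then load-balances incoming connections across all sockets bound to the same port.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <cstdint>

// Create one of possibly many listeners on the same port.
int make_listener(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    // Must be set on every socket before bind(), otherwise
    // the second bind() fails with EADDRINUSE.
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr));
    listen(fd, SOMAXCONN);
    return fd;
}
```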
@fwsGonzo Thanks for the write-up. That's a detailed and very informative summary. Would it be OK with you if I reused your text and included it in the official documentation?

@an-tao As this is performance related, and there are measurable improvements of >10%, I think it would be good information for people who want to further tune Drogon's performance on their specific bare metal. Or do you think this shouldn't be part of the docs because it's not primarily framework related? What do you think?
I made a feeble attempt at allowing the same port to be bound twice within Drogon, but I have a suspicion that threads are somehow tied to ports as a key, because this is failing:
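(A hypothetical reconstruction of the kind of call that fails, using Drogon's real addListener API but with illustrative values; this is not the author's original snippet:)

```cpp
#include <drogon/drogon.h>

int main() {
    // Hypothetical: registering the same ip:port twice in the hope of
    // getting two independent acceptor sockets.
    drogon::app().addListener("0.0.0.0", 8080);
    drogon::app().addListener("0.0.0.0", 8080); // the second binding is what fails
    drogon::app().run();
}
```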
I ended up just running Drogon twice, and I saw no performance gains at all, so I have no idea what those 4 threads that seem to be the bottleneck are. Since I'm running two instances of Drogon, where I'm assuming Linux will round-robin between them as they are listening on the same port, I can only assume that something else is the bottleneck right now. Image of the 4 threads that bottleneck the system: https://cloud.nwcs.no/index.php/s/NT7spdpRybY2Ha9
@fwsGonzo Thanks so much for sharing your benchmark details. Drogon enables the SO_REUSEPORT option on Linux, which means every IO thread in Drogon listens on the same port, so you don't need to run multiple processes for performance.
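In other words, the intended setup is a single process with several IO threads and a single listener entry; a minimal sketch (the thread count and port here are illustrative):

```cpp
#include <drogon/drogon.h>

int main() {
    drogon::app()
        .setThreadNum(16)              // one event loop per IO thread
        .addListener("0.0.0.0", 8080)  // every IO thread accepts here via SO_REUSEPORT
        .run();
}
```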
We have a new Intel server, and this is a repeatable synthetic benchmark:
The URL returns a simple string; a representative handler is sketched below.
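(A sketch of such a plaintext endpoint in Drogon; the path and the response body are assumptions, not the original code:)

```cpp
#include <drogon/drogon.h>

int main() {
    // Illustrative only: "/plaintext" and the body are assumed values.
    drogon::app().registerHandler(
        "/plaintext",
        [](const drogon::HttpRequestPtr &,
           std::function<void(const drogon::HttpResponsePtr &)> &&callback) {
            auto resp = drogon::HttpResponse::newHttpResponse();
            resp->setContentTypeCode(drogon::CT_TEXT_PLAIN);
            resp->setBody("Hello, World!");
            callback(resp);
        },
        {drogon::Get});
    drogon::app().addListener("0.0.0.0", 8080).run();
}
```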
It's using all the CPUs and the kernel is working very hard! And yes, that is indeed 4M req/s.