jemalloc benchmarks #635
Hey, I recently benchmarked using jemalloc on my NUMA servers, and the difference was quite large.

Using the system allocator I got 710k req/s, while with jemalloc I got 830k req/s. That's 17% faster just by dynamically linking with a library.

Anyone else tried this?
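jemalloc can be swapped in without any code changes by linking it ahead of the system allocator. As a sanity check that it is actually the active allocator, a minimal sketch (assuming the jemalloc development headers are installed, e.g. from a jemalloc-devel package) can query its version through mallctl:

```cpp
// Build: g++ -O2 check_jemalloc.cpp -o check_jemalloc -ljemalloc
#include <jemalloc/jemalloc.h>
#include <cstdio>

int main() {
    const char *version = nullptr;
    size_t len = sizeof(version);
    // mallctl("version", ...) only succeeds when jemalloc is linked in.
    if (mallctl("version", &version, &len, nullptr, 0) == 0)
        std::printf("jemalloc %s is active\n", version);
    else
        std::printf("jemalloc does not appear to be active\n");
    return 0;
}
```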
Hello and welcome! I have no experience here, as NUMA is a pretty niche architecture and I don't have access to such hardware, but we would certainly welcome a write-up if you have the time and are willing to share your experience.
We have two servers connected with a dual-port 200 Gbit Mellanox interface. One is a dual-socket 26-core Intel Xeon Gold, and the other is a dual-socket 48-core AMD EPYC, so due to hyper-threading you see 104 logical CPUs on the Intel and 192 on the AMD. Both servers are split in two, with two separate but still globally accessible memory banks, each attached to one CPU socket.

https://gist.github.com/fwsGonzo/dc5706ad8002211bad7dd122cbd20e16

I've added two benchmarks with jemalloc disabled for comparison; in both cases the performance was lower. jemalloc was added by just linking to the one installed on the CentOS 8 system. I also updated the main post with the benchmark for jemalloc disabled. I was going from memory last time, and it was clearly not right. Still, jemalloc provided a significant gain.

The benchmarks are done by homing in on the highest stable req/s, which usually means 1-2 ms average latency. Whenever you increase the load too much, one or more CPU cores will struggle and the benchmarks will show extreme spikes. I call this the breaking point, and I've been using it as a measure for a while now.

The current bottleneck is most likely the kernel-side TCP acceptors. It may be possible to get even more speed just by creating more listeners on the same port using SO_REUSEPORT, but we will see. Another potential benefit is being able to map a listener to a CPU and keep each packet on-CPU, and possibly in-cache, all the way. Regardless, the worker threads are under-utilized while the listeners look like they are always overloaded.
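For reference, this is roughly what SO_REUSEPORT means at the socket level (a plain POSIX sketch, not Drogon code, with error handling omitted): every listener sets the option before bind(), and the kernel then load-balances incoming connections across all sockets bound to the same port.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <cstdint>

// Create one of possibly many listeners on the same port.
int make_listener(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    // Must be set on every socket before bind(), otherwise
    // the second bind() fails with EADDRINUSE.
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr));
    listen(fd, SOMAXCONN);
    return fd;
}
```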
@fwsGonzo Thanks for the write-up. That's a detailed and very informative summary. Would it be OK with you if I reused your text and included it in the official documentation?

@an-tao As this is performance related, and there are measurable improvements of >10%, I think it would be good information for people who want to further tune Drogon's performance on their specific bare metal. Or do you think this shouldn't be part of the docs because it's not primarily framework related? What do you think?
I made a feeble attempt at allowing the same port to be bound twice within Drogon, but I have a suspicion that threads are somehow tied to ports as a key, because this is failing:
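(A hypothetical reconstruction of the kind of call that fails, using Drogon's real addListener API but with illustrative values; this is not the author's original snippet:)

```cpp
#include <drogon/drogon.h>

int main() {
    // Hypothetical: registering the same ip:port twice in the hope of
    // getting two independent acceptor sockets.
    drogon::app().addListener("0.0.0.0", 8080);
    drogon::app().addListener("0.0.0.0", 8080); // the second binding is what fails
    drogon::app().run();
}
```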
I ended up just running Drogon twice, and I saw no performance gains at all, so I have no idea what those 4 threads that seem to be the bottleneck are. Since I'm running two instances of Drogon, where I'm assuming Linux will round-robin between them as they are listening on the same port, I can only assume that something else is the bottleneck right now. Image of the 4 threads that bottleneck the system: https://cloud.nwcs.no/index.php/s/NT7spdpRybY2Ha9
@fwsGonzo Thanks so much for sharing your benchmark details. Drogon enables the SO_REUSEPORT option on Linux, which means every IO thread in Drogon listens on the same port, so you don't need to run multiple processes for performance.
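In other words, the intended setup is a single process with several IO threads and a single listener entry; a minimal sketch (the thread count and port here are illustrative):

```cpp
#include <drogon/drogon.h>

int main() {
    drogon::app()
        .setThreadNum(16)              // one event loop per IO thread
        .addListener("0.0.0.0", 8080)  // every IO thread accepts here via SO_REUSEPORT
        .run();
}
```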
We have a new Intel server, and this is a repeatable synthetic benchmark:
The URL returns a simple string; a representative handler is sketched below.
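(A sketch of such a plaintext endpoint in Drogon; the path and the response body are assumptions, not the original code:)

```cpp
#include <drogon/drogon.h>

int main() {
    // Illustrative only: "/plaintext" and the body are assumed values.
    drogon::app().registerHandler(
        "/plaintext",
        [](const drogon::HttpRequestPtr &,
           std::function<void(const drogon::HttpResponsePtr &)> &&callback) {
            auto resp = drogon::HttpResponse::newHttpResponse();
            resp->setContentTypeCode(drogon::CT_TEXT_PLAIN);
            resp->setBody("Hello, World!");
            callback(resp);
        },
        {drogon::Get});
    drogon::app().addListener("0.0.0.0", 8080).run();
}
```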
It's using all the CPUs and the kernel is working very hard! And yes, that is indeed 4M req/s.