Optimisations (and cleanups) of Linux backend #68

Merged
merged 33 commits into from
May 4, 2016

Conversation

antrik
Contributor

@antrik antrik commented May 1, 2016

This is a bunch of optimisations to the Linux platform code, along with various cleanups of the related code which the optimisation patches build upon.

Most of the cleanups are on the send() side, as the recv() side is less affected by the optimisation changes, and thus there has been less reason to refactor the code -- some extra cleanup work would probably be in order here.

The optimisations are mostly about avoiding unnecessary copies by using scatter-gather buffers for send and receive; as well as avoiding unnecessary initialisation of receive buffers.

The results are impressive: gains of at least 5x for large transfers of several MiB (a bit more on a modern system); >5x (on an old system) up to >10x (on a modern one) for small transfers of up to a few KiB; and more than 10x for most of the range in between -- peaking at about 12x - 13x on the old system and 20x - 21x on the modern system for medium-sized transfers of about 64 KiB up to a few hundred KiB.

For another interesting data point: with the original variant, CPU usage during benchmark runs (with many iterations, to amortise the setup time) was dominated by user time, at more than two thirds of total time. The optimised variant not only cuts system time to less than half the original value (presumably because of fewer allocations?), but also almost entirely eliminates the user time, making it insignificant in the total picture -- as it should be.

On a less scientific note, Servo built with the optimised ipc-channel doesn't seem to show undue delays any more while rendering a language selection menu. (Which requires lots of fonts to be loaded, and thus triggers heavy ipc-channel activity.)

@jdm
Member

jdm commented May 1, 2016

Build failure is from aster and quasi crates breaking on the current rustc nightly.

@antrik
Contributor Author

antrik commented May 2, 2016

I now submitted a bunch of PRs for this: serde-deprecated/syntex#44 , serde-deprecated/aster#78 , serde-deprecated/quasi#41 , serde-rs/serde#303

Once these are in, we can update the dependencies in ipc-channel...

@mbrubeck mbrubeck mentioned this pull request May 3, 2016
@@ -278,16 +286,10 @@ impl UnixSender {

if result <= 0 {
let error = UnixError::last();
if error.0 == libc::ENOBUFS && bytes_to_send > 2000 {
if error.0 == libc::ENOBUFS
&& downsize(&mut sendbuf_size, bytes_to_send).is_ok() {

This does not handle the Err(()) case from downsize correctly - perhaps this should be:

if error.0 == libc::ENOBUFS {
    // If the kernel failed to allocate a buffer large enough for the packet,
    // retry with a smaller size (if possible).
    try!(downsize(&mut sendbuf_size, bytes_to_send));
    continue;
} else {

Note this may need a tweak for freeing behavior.

There is a similar issue above, but a later commit makes it moot.

Contributor Author

I don't think returning the empty error from downsize() would be correct handling... The idea here is that if the downsize fails, we want to return the original error -- which happens in the else case.

As I said in the commit message, I'm not entirely happy with how this turned out syntactically -- but I really can't think of a better approach. Quite frankly, I considered dropping this size test entirely: it's a case I don't actually expect to happen; I only added it as an extra safeguard in the original ENOBUFS handling because it seemed easy to do -- but it turned out to be a major pain during refactoring, and I'm no longer sure it's worth the trouble...

Maybe we should just turn it into an assert() instead?
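The control flow under discussion can be sketched roughly like this (hypothetical names; `try_send` stands in for the real `sendmsg()` wrapper, and the downsize policy shown is an assumption, not the actual ipc-channel code). The point of contention: on ENOBUFS we retry with a smaller buffer if possible, and otherwise propagate the *original* errno rather than the empty `Err(())` from `downsize()`:

```rust
// Sketch of the ENOBUFS retry loop (hypothetical names and policy).

const ENOBUFS: i32 = 105; // errno value on Linux

/// Halve the send buffer size, unless it would drop below what we need.
fn downsize(sendbuf_size: &mut usize, bytes_to_send: usize) -> Result<(), ()> {
    if *sendbuf_size / 2 >= bytes_to_send {
        *sendbuf_size /= 2;
        Ok(())
    } else {
        Err(())
    }
}

fn send_with_retry(
    mut try_send: impl FnMut(usize) -> Result<(), i32>,
    mut sendbuf_size: usize,
    bytes_to_send: usize,
) -> Result<(), i32> {
    loop {
        match try_send(sendbuf_size) {
            Ok(()) => return Ok(()),
            Err(errno)
                if errno == ENOBUFS
                    && downsize(&mut sendbuf_size, bytes_to_send).is_ok() =>
            {
                // retry the send with the smaller buffer size
            }
            // any other error -- or ENOBUFS with no downsize left --
            // propagates the original errno
            Err(errno) => return Err(errno),
        }
    }
}
```

Note how the `else` case (returning the original error) falls out of the guard failing, which is exactly the syntactic awkwardness discussed above.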


Fair enough. I was looking at the call above that has an else if instead of an else, but that is fixed by a later commit. Thanks for taking a look, either solution works for me.

@antrik
Contributor Author

antrik commented May 3, 2016

Added a patch making msghdr.iovec a *const to avoid confusion. (And did a few other minor cleanups to iovec handling along the way...)

@bors-servo
Contributor

☔ The latest upstream changes (presumably #66) made this pull request unmergeable. Please resolve the merge conflicts.

macro_rules! create_with_n_fds {
($name:ident, $n:expr, $size:expr) => (
#[test]
fn $name() {
Contributor

@pcwalton pcwalton May 4, 2016

Does this really need to be a macro? Can it just be a helper function that the various tests delegate to?

"Macros are for when you have run out of language" —@dherman

But I think there is still language left :)

Contributor Author

Well, I did use a helper function at first; but that meant a lot of redundant wrappers for each individual test. I didn't like it.

(The macro instantiations admittedly still involve some redundancy for now, due to limitations of the current macro system...)

BTW, I introduced a similar macro already in an earlier PR -- you didn't comment on that one...

Contributor

Redundant wrappers? What do you mean?

I don't see how wrapper functions are redundant. There's a tiny amount of boilerplate, but the clarity is worth it IMO. I always have a hard time reading macros.
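The trade-off being debated can be illustrated with a minimal sketch (hypothetical body, not the actual ipc-channel test; the real macro emits `#[test]` functions, dropped here so the generated functions can be called directly):

```rust
// Helper-function style: shared logic lives in one place...
fn check_with_n_fds(n: usize) -> usize {
    // stand-in for the real body creating a channel carrying `n` FDs
    n * 2
}

// ...but every test still needs a hand-written thin wrapper:
fn with_1_fd_manual() -> usize {
    check_with_n_fds(1)
}

// Macro style: the wrappers themselves are generated.
macro_rules! create_with_n_fds {
    ($name:ident, $n:expr) => {
        fn $name() -> usize {
            check_with_n_fds($n)
        }
    };
}

create_with_n_fds!(with_1_fd, 1);
create_with_n_fds!(with_64_fds, 64);
```

Both styles produce the same behaviour; the macro merely removes the one-line-per-test boilerplate at the cost of readability.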

This should better communicate the actual meaning of this value.

Also updated some comments to reflect the true meaning.
@@ -194,14 +194,19 @@ impl UnixSender {
-> Result<(),UnixError> {
Contributor

typo in commit message: "auxiliary"

@pcwalton
Contributor

pcwalton commented May 4, 2016

This looks good to me with the above nits addressed. Thanks!

antrik added 10 commits May 4, 2016 20:57
Lots of corner cases here that can break when changing implementation
details...

As these tests are all platform-specific, put them in a separate module,
to avoid redundant platform conditionals.
`libc::size_t` is an alias for `usize` -- so the casts are unnecessary,
and only bloat the code, thus reducing readability (especially when they
necessitate extra parentheses); and (as @mbrubeck pointed out) they
actually become a liability when the involved types change, as they can
silently turn into real casts, thus obscuring a potential need for code
adaptations.
…cv()`

The function actually processes the result of the system call before
returning it, checking for negative values among other things -- so
there is no need to return a sized type, nor a libc-specific one. The
internal type should be abstracted by this kind of wrapper function, so
callers don't need ugly casts on each invocation.
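The wrapper pattern described above can be sketched as follows (hypothetical names; the raw result and errno are passed in explicitly here, whereas the real wrapper makes the system call itself):

```rust
// The signed, libc-specific return value is checked once, inside the
// wrapper, so callers get an idiomatic Result<usize, i32> and never
// need casts on each invocation.
fn checked_recv(raw_result: isize, errno: i32) -> Result<usize, i32> {
    if raw_result < 0 {
        Err(errno) // failure: report the error code
    } else {
        Ok(raw_result as usize) // cast is safe: non-negative by the check
    }
}
```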
This avoids redundant processing; and will also enable further cleanups.
This variant is not only more compact and elegant, but also type-safe. I
don't think it has any downsides.
When calculating the maximum size of data we can send in a single
fragment, no longer deduct any amount "dynamically" based on the size of
the auxiliary data (FDs) transferred in the control message.

I'm not sure what originally prompted the idea that deducting this from
the main buffer would be necessary -- my testing at least doesn't show
any need for that. The auxiliary data is transferred in a separate
buffer with its own size limitation. (Defaulting to 10 KiB on my
system.)
As we explicitly size all fragments in a fragmented send, the maximum
size of followup packets is always well known -- so there is no need to
allocate any extra just in case.

(Unlike for the first fragment, which is implicitly sized if it comes
from an unfragmented send; and thus might potentially have a larger size
than what we expect, in case our reserved size doesn't match reality...)
I wonder whether there is some obscure documentation I'm not aware of
that suggests we have to deduct 256 bytes? According to my testing with
Linux 4.4 on i386 as well as Linux 3.16 on x86_64, only 32 bytes
actually need to be deducted.
Rather than halving the payload size, halve the total buffer size, and
recalculate the payload size from that.

This way, the buffer size remains a (more or less) round number, rather
than becoming something just slightly above a round number -- which
should improve resource utilisation, and might even reduce the number of
downsizes necessary in some cases.

This will also facilitate further cleanups.
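A numeric sketch of the scheme described above (the 32-byte reserve is taken from an earlier commit message in this series; treat the exact constant as an assumption). The total buffer size stays a round power of two across downsizes, and the slightly-odd payload size is derived from it:

```rust
const RESERVED: usize = 32; // bytes deducted from the buffer per packet

// Payload capacity is recalculated from the (round) total buffer size.
fn payload_size(sendbuf_size: usize) -> usize {
    sendbuf_size - RESERVED
}

// Halving the total keeps it round: 256 KiB -> 128 KiB -> 64 KiB -> ...
fn downsize(sendbuf_size: usize) -> usize {
    sendbuf_size / 2
}
```

Halving the payload size directly would instead yield values like 131,056 after one step, drifting ever further from round buffer sizes.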
antrik added 11 commits May 4, 2016 20:59
While `recvmsg()` mutates the buffers referenced by the `iovec`, the
`iovec` itself is never modified by either `sendmsg()` or `recvmsg()`.

This field is indeed marked `*const` in `libc::msghdr` as well.
When using fragmentation, the `msghdr` structure is only used for the
first packet, which is already sent in a separate conditional arm anyway
-- so we can just as well create (and deallocate) the header within this
conditional too.

This should make the code more robust and much easier to follow.

(Note that this means the header will be recreated when we have to
resend on ENOBUFS. The performance impact should be negligible though;
and it's an exceptional case anyways.)
Moving the actual sending of the message into the `construct_header()`
function, and renaming it to `send_first_fragment()` to reflect that
change.

With the previous changes, we always send the message right after
constructing the header -- so it makes sense to put the common sequence
in one place. Removing the need to pass the header structs through the
caller also gets rid of the ugly `iovec` return hack.

What's more, this helps isolate the unsafe operations: invoking
`send_first_fragment()` is in fact not unsafe at all.
Moving the `send()` call for followup fragments into a sub-function
along the lines of `send_first_fragment()`.

While this one is only invoked once, the common pattern should make the
code easier to read; and just like with `send_first_fragment()`, it also
helps isolate the unsafe code.
Now that all the unsafe code is isolated in `send_*_fragment()`, the
rest of the `send()` method doesn't need to be marked unsafe anymore.
Measure performance of transfers of various sizes, to keep track of the
performance impact of upcoming optimisations.

This makes use of `get_max_fragment_size()` outside of platform code; so
it additionally necessitates adding stub implementations of this method
for all platforms.

The benchmark results are not as consistent as one would hope for -- but
it should be good enough to judge the impact of any major changes. (See
also the code comment in `benches/bench.rs` for an explanation of the
`_invoke_chaos` benchmark pass...)

Below are the numbers from a typical run on my system. (Which is an
Intel Core2 Quad at 2.5 GHz running a 32 Bit GNU/Linux system, i.e. a
pretty old system from 2008 or thereabouts.)

These numbers were obtained with the cpufreq governor set to
`performance` (rather than the default `ondemand`) for more reproducible
results. Unfortunately, this doesn't exclude other external factors,
such as memory pressure -- so it's still tricky to compare different
test runs.

The numbers presented here (and in the following bunch of optimisation
commits) were all obtained in a single large test series; so they should
be comparable -- but it's still tricky to compare results when checking
the impact of any new patches...

The second block of results is with `ITERATIONS` in `benches/bench.rs`
increased to 100. Aside from somewhat reducing randomness in general,
this is important because for fragmented transfers we need to spawn a
thread in the benchmark suite, which is significantly affecting the
results for medium-large transfers (especially 256 KiB and the next few
ones) when using only a single iteration -- in fact dominating them once
some optimisations are applied. (And also introducing lots of
randomness.)

Still presenting the result for one iteration as well: mostly because
the absolute values have the right magnitude in this case, while adding
more iterations shifts them accordingly.

The largest size doesn't actually produce a result with 100 iterations,
because of some integer overflow in `cargo bench` I presume. We will get
results for this one as well though once some optimisations are applied.

test _invoke_chaos ... bench:   4,283,639 ns/iter (+/- 369,532)
test size_00_1     ... bench:      11,893 ns/iter (+/- 233)
test size_01_2     ... bench:      11,786 ns/iter (+/- 111)
test size_02_4     ... bench:      11,766 ns/iter (+/- 103)
test size_03_8     ... bench:      11,731 ns/iter (+/- 71)
test size_04_16    ... bench:      11,777 ns/iter (+/- 67)
test size_05_32    ... bench:      11,823 ns/iter (+/- 87)
test size_06_64    ... bench:      12,085 ns/iter (+/- 94)
test size_07_128   ... bench:      12,358 ns/iter (+/- 108)
test size_08_256   ... bench:      12,710 ns/iter (+/- 110)
test size_09_512   ... bench:      13,674 ns/iter (+/- 151)
test size_10_1k    ... bench:      15,634 ns/iter (+/- 151)
test size_11_2k    ... bench:      19,289 ns/iter (+/- 182)
test size_12_4k    ... bench:      26,931 ns/iter (+/- 92)
test size_13_8k    ... bench:      42,751 ns/iter (+/- 234)
test size_14_16k   ... bench:      74,066 ns/iter (+/- 432)
test size_15_32k   ... bench:     137,961 ns/iter (+/- 694)
test size_16_64k   ... bench:     262,229 ns/iter (+/- 2,664)
test size_17_128k  ... bench:     509,617 ns/iter (+/- 7,176)
test size_18_256k  ... bench:   1,202,057 ns/iter (+/- 261,359)
test size_19_512k  ... bench:   2,267,058 ns/iter (+/- 403,483)
test size_20_1m    ... bench:   6,033,593 ns/iter (+/- 332,782)
test size_21_2m    ... bench:  12,403,937 ns/iter (+/- 626,731)
test size_22_4m    ... bench:  25,218,893 ns/iter (+/- 1,290,866)
test size_23_8m    ... bench:  45,120,983 ns/iter (+/- 1,714,226)

test _invoke_chaos ... bench: 419,861,129 ns/iter (+/- 8,342,705)
test size_00_1     ... bench:   1,172,231 ns/iter (+/- 6,850)
test size_01_2     ... bench:   1,176,073 ns/iter (+/- 6,664)
test size_02_4     ... bench:   1,179,213 ns/iter (+/- 10,001)
test size_03_8     ... bench:   1,179,364 ns/iter (+/- 9,985)
test size_04_16    ... bench:   1,182,618 ns/iter (+/- 9,154)
test size_05_32    ... bench:   1,183,845 ns/iter (+/- 6,272)
test size_06_64    ... bench:   1,192,917 ns/iter (+/- 6,473)
test size_07_128   ... bench:   1,219,179 ns/iter (+/- 9,096)
test size_08_256   ... bench:   1,266,088 ns/iter (+/- 14,919)
test size_09_512   ... bench:   1,349,183 ns/iter (+/- 21,996)
test size_10_1k    ... bench:   1,548,835 ns/iter (+/- 12,133)
test size_11_2k    ... bench:   1,929,276 ns/iter (+/- 17,447)
test size_12_4k    ... bench:   2,649,545 ns/iter (+/- 52,515)
test size_13_8k    ... bench:   4,233,634 ns/iter (+/- 27,626)
test size_14_16k   ... bench:   7,410,534 ns/iter (+/- 16,211)
test size_15_32k   ... bench:  13,733,377 ns/iter (+/- 28,703)
test size_16_64k   ... bench:  26,113,539 ns/iter (+/- 73,558)
test size_17_128k  ... bench:  50,787,086 ns/iter (+/- 96,875)
test size_18_256k  ... bench:  93,566,074 ns/iter (+/- 3,746,932)
test size_19_512k  ... bench: 226,470,336 ns/iter (+/- 24,565,318)
test size_20_1m    ... bench: 519,162,370 ns/iter (+/- 7,462,605)
test size_21_2m    ... bench: 1,099,614,726 ns/iter (+/- 7,175,544)
test size_22_4m    ... bench: 2,108,188,068 ns/iter (+/- 39,768,687)
test size_23_8m    ... bench:           0 ns/iter (+/- 30,262,382)
Now that we no longer have to deal with fragments of different messages
interleaving, there is no need for the complicated fragment ID handling.
The only information we strictly need is whether we have fragmentation
at all. (So the receiver knows whether to retrieve and use the dedicated
channel.)

However, rather than only sending a boolean flag, we can just as well
send a `usize` announcing the total size of payload data in this
message. This way the receiver knows when fragmentation occurs -- and
additionally knows exactly when all fragments have been received. This
avoids the need to check in the receiver for the channel being closed by
the sender; and knowing the total size in advance will also enable
further optimisations/simplifications in the future.

As a side effect, getting rid of the fragment headers removes the need
in the sender to copy the data again when preparing send buffers for the
followup fragments. (One copy operation is still needed for assembling
the initial send buffer from the size header and main data.) This almost
doubles performance (send + receive) of large transfers. Note though
that this is a temporary effect: upcoming, more thorough optimisations
will make this change mostly meaningless. (In terms of performance, that
is -- the simplification is still worthwhile of course!)
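The framing described above can be sketched like this (hypothetical helper names; the real code builds the header into the first `sendmsg()` call rather than a separate `Vec`). From the first fragment alone, the receiver learns both whether followup fragments exist and exactly how many bytes are still outstanding:

```rust
use std::convert::TryInto;

// Sender side: prepend the total payload size as a fixed-width header.
fn frame(payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(8 + payload.len());
    buf.extend_from_slice(&(payload.len() as u64).to_ne_bytes());
    buf.extend_from_slice(payload);
    buf
}

/// Receiver side: returns (total payload size, payload bytes in this
/// fragment). The message is fragmented iff the data is shorter than
/// the announced total.
fn parse_header(first_fragment: &[u8]) -> (usize, &[u8]) {
    let (header, data) = first_fragment.split_at(8);
    let total = u64::from_ne_bytes(header.try_into().unwrap()) as usize;
    (total, data)
}
```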

Regarding the results below, note that the single-iteration numbers for
256k and especially 512k experience a *huge* random fluctuation here
(the latter ranging between 1.2 ms and 1.6 ms from one test run to
another...) -- so they are really useful only as a very rough
orientation, rather than for comparing against other results.

The results for 100 iterations on the other hand are pretty stable, with
fluctuations usually below 2% for all sizes.

test _invoke_chaos ... bench:   2,779,348 ns/iter (+/- 353,869)
test size_00_1     ... bench:      11,958 ns/iter (+/- 131)
test size_01_2     ... bench:      11,895 ns/iter (+/- 89)
test size_02_4     ... bench:      11,646 ns/iter (+/- 99)
test size_03_8     ... bench:      11,648 ns/iter (+/- 57)
test size_04_16    ... bench:      11,675 ns/iter (+/- 64)
test size_05_32    ... bench:      11,715 ns/iter (+/- 47)
test size_06_64    ... bench:      11,885 ns/iter (+/- 84)
test size_07_128   ... bench:      12,159 ns/iter (+/- 103)
test size_08_256   ... bench:      12,536 ns/iter (+/- 155)
test size_09_512   ... bench:      13,544 ns/iter (+/- 156)
test size_10_1k    ... bench:      15,472 ns/iter (+/- 123)
test size_11_2k    ... bench:      19,217 ns/iter (+/- 128)
test size_12_4k    ... bench:      26,548 ns/iter (+/- 401)
test size_13_8k    ... bench:      42,123 ns/iter (+/- 436)
test size_14_16k   ... bench:      73,710 ns/iter (+/- 259)
test size_15_32k   ... bench:     139,332 ns/iter (+/- 1,106)
test size_16_64k   ... bench:     267,651 ns/iter (+/- 2,688)
test size_17_128k  ... bench:     517,987 ns/iter (+/- 7,714)
test size_18_256k  ... bench:     934,841 ns/iter (+/- 272,889)
test size_19_512k  ... bench:   1,327,214 ns/iter (+/- 417,956)
test size_20_1m    ... bench:   3,786,214 ns/iter (+/- 429,365)
test size_21_2m    ... bench:   7,559,035 ns/iter (+/- 738,997)
test size_22_4m    ... bench:  15,069,971 ns/iter (+/- 1,203,609)
test size_23_8m    ... bench:  24,633,162 ns/iter (+/- 1,969,078)

test _invoke_chaos ... bench: 277,756,398 ns/iter (+/- 8,038,558)
test size_00_1     ... bench:   1,187,442 ns/iter (+/- 5,224)
test size_01_2     ... bench:   1,189,368 ns/iter (+/- 7,582)
test size_02_4     ... bench:   1,168,775 ns/iter (+/- 9,093)
test size_03_8     ... bench:   1,173,062 ns/iter (+/- 10,568)
test size_04_16    ... bench:   1,171,706 ns/iter (+/- 8,885)
test size_05_32    ... bench:   1,176,364 ns/iter (+/- 5,833)
test size_06_64    ... bench:   1,190,941 ns/iter (+/- 7,975)
test size_07_128   ... bench:   1,224,023 ns/iter (+/- 7,265)
test size_08_256   ... bench:   1,267,395 ns/iter (+/- 7,898)
test size_09_512   ... bench:   1,360,957 ns/iter (+/- 9,707)
test size_10_1k    ... bench:   1,538,531 ns/iter (+/- 7,787)
test size_11_2k    ... bench:   1,921,562 ns/iter (+/- 19,868)
test size_12_4k    ... bench:   2,655,408 ns/iter (+/- 42,143)
test size_13_8k    ... bench:   4,250,711 ns/iter (+/- 61,767)
test size_14_16k   ... bench:   7,446,110 ns/iter (+/- 246,966)
test size_15_32k   ... bench:  13,971,866 ns/iter (+/- 20,793)
test size_16_64k   ... bench:  26,721,142 ns/iter (+/- 49,591)
test size_17_128k  ... bench:  51,645,773 ns/iter (+/- 109,419)
test size_18_256k  ... bench:  67,393,172 ns/iter (+/- 2,902,754)
test size_19_512k  ... bench: 138,950,061 ns/iter (+/- 27,886,102)
test size_20_1m    ... bench: 344,968,619 ns/iter (+/- 6,519,044)
test size_21_2m    ... bench: 685,842,845 ns/iter (+/- 16,417,853)
test size_22_4m    ... bench: 1,154,950,689 ns/iter (+/- 14,661,441)
test size_23_8m    ... bench: 2,335,253,136 ns/iter (+/- 25,108,022)
Using the scatter-gather functionality of the sendmsg() system call, we
can avoid the need for copying data into a dedicated buffer altogether.

This gets another sizeable performance increase (on top of the fragment
header change) for large transfers, resulting in a total speedup of
about 2x - 3x (depending on size) compared to the original version.
Medium-sized (non-fragmented) transfers also gain a few per cent.

(The gains on the sender side are actually even bigger: we still haven't
optimised the receiver side at all -- so that is now the main
bottleneck...)

Note that comparing against the original variant indeed makes more sense
here than looking at the most recent results, because this new approach
effectively obsoletes the performance gains of the previous change:
using scatter-gather, we could have achieved the same zero-copy effect
even with the old fragment header approach, with only some small
overhead for handling the actual headers. (The code is considerably
simpler without the fragment headers, though.)
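The gather idea can be illustrated with std's vectored-write API, which maps to `writev()` and is an analogue of the `sendmsg()` iovec mechanism used here (this is a sketch, not the actual ipc-channel code): header and payload are handed over as separate slices, with no combined intermediate buffer.

```rust
use std::io::{IoSlice, Write};

fn send_gathered<W: Write>(
    w: &mut W,
    header: &[u8],
    payload: &[u8],
) -> std::io::Result<()> {
    let total = header.len() + payload.len();
    let mut written = 0;
    while written < total {
        // Recompute the remaining slices, so partial writes are handled.
        let bufs = if written < header.len() {
            [IoSlice::new(&header[written..]), IoSlice::new(payload)]
        } else {
            [
                IoSlice::new(&payload[written - header.len()..]),
                IoSlice::new(&[]),
            ]
        };
        written += w.write_vectored(&bufs)?;
    }
    Ok(())
}
```

With a real `UnixStream` the kernel reads both slices in one system call; the `Vec<u8>` writer used in testing merely demonstrates the API shape.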

test _invoke_chaos ... bench:   2,681,822 ns/iter (+/- 219,773)
test size_00_1     ... bench:      11,815 ns/iter (+/- 121)
test size_01_2     ... bench:      11,798 ns/iter (+/- 57)
test size_02_4     ... bench:      11,727 ns/iter (+/- 72)
test size_03_8     ... bench:      11,717 ns/iter (+/- 53)
test size_04_16    ... bench:      11,783 ns/iter (+/- 128)
test size_05_32    ... bench:      11,798 ns/iter (+/- 59)
test size_06_64    ... bench:      11,947 ns/iter (+/- 103)
test size_07_128   ... bench:      12,203 ns/iter (+/- 89)
test size_08_256   ... bench:      12,779 ns/iter (+/- 94)
test size_09_512   ... bench:      13,866 ns/iter (+/- 122)
test size_10_1k    ... bench:      15,288 ns/iter (+/- 231)
test size_11_2k    ... bench:      19,603 ns/iter (+/- 256)
test size_12_4k    ... bench:      28,476 ns/iter (+/- 232)
test size_13_8k    ... bench:      46,229 ns/iter (+/- 200)
test size_14_16k   ... bench:      73,107 ns/iter (+/- 314)
test size_15_32k   ... bench:     133,618 ns/iter (+/- 1,121)
test size_16_64k   ... bench:     254,531 ns/iter (+/- 4,815)
test size_17_128k  ... bench:     495,497 ns/iter (+/- 10,201)
test size_18_256k  ... bench:     886,351 ns/iter (+/- 199,942)
test size_19_512k  ... bench:   1,123,621 ns/iter (+/- 305,389)
test size_20_1m    ... bench:   3,036,919 ns/iter (+/- 285,712)
test size_21_2m    ... bench:   5,222,443 ns/iter (+/- 347,878)
test size_22_4m    ... bench:   7,551,218 ns/iter (+/- 914,100)
test size_23_8m    ... bench:  14,977,983 ns/iter (+/- 1,302,684)

test _invoke_chaos ... bench: 244,491,008 ns/iter (+/- 7,239,693)
test size_00_1     ... bench:   1,183,187 ns/iter (+/- 5,943)
test size_01_2     ... bench:   1,183,225 ns/iter (+/- 9,967)
test size_02_4     ... bench:   1,180,223 ns/iter (+/- 6,720)
test size_03_8     ... bench:   1,181,466 ns/iter (+/- 6,136)
test size_04_16    ... bench:   1,183,006 ns/iter (+/- 7,598)
test size_05_32    ... bench:   1,191,722 ns/iter (+/- 9,888)
test size_06_64    ... bench:   1,198,561 ns/iter (+/- 8,227)
test size_07_128   ... bench:   1,217,393 ns/iter (+/- 5,343)
test size_08_256   ... bench:   1,269,855 ns/iter (+/- 7,813)
test size_09_512   ... bench:   1,393,586 ns/iter (+/- 7,397)
test size_10_1k    ... bench:   1,524,854 ns/iter (+/- 15,853)
test size_11_2k    ... bench:   1,959,964 ns/iter (+/- 24,707)
test size_12_4k    ... bench:   2,858,032 ns/iter (+/- 11,557)
test size_13_8k    ... bench:   4,629,783 ns/iter (+/- 11,677)
test size_14_16k   ... bench:   7,321,471 ns/iter (+/- 13,709)
test size_15_32k   ... bench:  13,397,902 ns/iter (+/- 16,635)
test size_16_64k   ... bench:  25,558,619 ns/iter (+/- 112,384)
test size_17_128k  ... bench:  49,717,629 ns/iter (+/- 1,668,997)
test size_18_256k  ... bench:  67,053,276 ns/iter (+/- 2,203,769)
test size_19_512k  ... bench: 125,389,098 ns/iter (+/- 20,284,576)
test size_20_1m    ... bench: 276,946,251 ns/iter (+/- 8,119,682)
test size_21_2m    ... bench: 487,328,628 ns/iter (+/- 12,586,919)
test size_22_4m    ... bench: 715,961,795 ns/iter (+/- 14,266,485)
test size_23_8m    ... bench: 1,156,209,396 ns/iter (+/- 20,298,143)
The receiver always has to allocate a buffer large enough to fit the
maximal packet size, as it doesn't know how large the next message will
be. Up till now, the entire buffer was being 0-filled on allocation --
which was inflicting considerable overhead for small messages: skipping
the initialisation almost quadruples(!) the performance of small
transfers on my system; and has a noticeable effect on larger transfers
too, if the last fragment is relatively small. (About 10% for 512 KiB
transfers for example, where the last fragment is only about 32 KiB.)

We truncate the length of the receive buffer (vector) to the actual size
of the data received, right after the receive call -- so given that we
don't do anything else between allocating the buffer and receiving,
having it temporarily uninitialised shouldn't be terribly unsafe.
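The pattern can be sketched with today's `spare_capacity_mut` API (not what the 2016-era code literally did, and with the recv call simulated by a closure): allocate without zero-filling, let recv write into the uninitialised capacity, then set the length to the bytes that actually arrived.

```rust
use std::mem::MaybeUninit;

fn recv_into_fresh_buffer(
    recv: impl FnOnce(&mut [MaybeUninit<u8>]) -> usize,
    max_packet_size: usize,
) -> Vec<u8> {
    let mut buf: Vec<u8> = Vec::with_capacity(max_packet_size); // no 0-fill
    let received = recv(buf.spare_capacity_mut());
    // Sound only because exactly `received` bytes were just initialised,
    // and nothing reads the buffer between allocation and the recv.
    unsafe { buf.set_len(received) };
    buf
}
```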

test _invoke_chaos ... bench:   2,646,830 ns/iter (+/- 221,495)
test size_00_1     ... bench:       3,070 ns/iter (+/- 48)
test size_01_2     ... bench:       3,076 ns/iter (+/- 63)
test size_02_4     ... bench:       3,040 ns/iter (+/- 64)
test size_03_8     ... bench:       3,020 ns/iter (+/- 55)
test size_04_16    ... bench:       3,072 ns/iter (+/- 65)
test size_05_32    ... bench:       3,162 ns/iter (+/- 56)
test size_06_64    ... bench:       3,249 ns/iter (+/- 63)
test size_07_128   ... bench:       3,447 ns/iter (+/- 74)
test size_08_256   ... bench:       3,959 ns/iter (+/- 101)
test size_09_512   ... bench:       5,199 ns/iter (+/- 77)
test size_10_1k    ... bench:       6,577 ns/iter (+/- 124)
test size_11_2k    ... bench:      11,090 ns/iter (+/- 157)
test size_12_4k    ... bench:      19,702 ns/iter (+/- 80)
test size_13_8k    ... bench:      37,433 ns/iter (+/- 234)
test size_14_16k   ... bench:      64,489 ns/iter (+/- 461)
test size_15_32k   ... bench:     125,141 ns/iter (+/- 1,305)
test size_16_64k   ... bench:     247,282 ns/iter (+/- 2,844)
test size_17_128k  ... bench:     489,099 ns/iter (+/- 2,951)
test size_18_256k  ... bench:     838,078 ns/iter (+/- 130,191)
test size_19_512k  ... bench:   1,071,451 ns/iter (+/- 193,737)
test size_20_1m    ... bench:   2,935,692 ns/iter (+/- 255,588)
test size_21_2m    ... bench:   5,183,219 ns/iter (+/- 295,405)
test size_22_4m    ... bench:   7,396,433 ns/iter (+/- 808,390)
test size_23_8m    ... bench:  14,341,520 ns/iter (+/- 1,275,331)

test _invoke_chaos ... bench: 238,710,985 ns/iter (+/- 16,477,582)
test size_00_1     ... bench:     312,659 ns/iter (+/- 8,490)
test size_01_2     ... bench:     311,919 ns/iter (+/- 5,871)
test size_02_4     ... bench:     305,455 ns/iter (+/- 3,955)
test size_03_8     ... bench:     310,709 ns/iter (+/- 6,244)
test size_04_16    ... bench:     309,599 ns/iter (+/- 5,053)
test size_05_32    ... bench:     313,935 ns/iter (+/- 5,827)
test size_06_64    ... bench:     325,553 ns/iter (+/- 4,227)
test size_07_128   ... bench:     352,797 ns/iter (+/- 7,614)
test size_08_256   ... bench:     396,614 ns/iter (+/- 11,826)
test size_09_512   ... bench:     482,741 ns/iter (+/- 8,917)
test size_10_1k    ... bench:     664,678 ns/iter (+/- 9,785)
test size_11_2k    ... bench:   1,132,642 ns/iter (+/- 14,503)
test size_12_4k    ... bench:   1,942,667 ns/iter (+/- 22,892)
test size_13_8k    ... bench:   3,706,023 ns/iter (+/- 16,474)
test size_14_16k   ... bench:   6,410,924 ns/iter (+/- 12,465)
test size_15_32k   ... bench:  12,494,806 ns/iter (+/- 23,242)
test size_16_64k   ... bench:  24,591,965 ns/iter (+/- 81,764)
test size_17_128k  ... bench:  48,608,297 ns/iter (+/- 106,295)
test size_18_256k  ... bench:  65,057,222 ns/iter (+/- 1,918,700)
test size_19_512k  ... bench: 114,772,423 ns/iter (+/- 18,717,560)
test size_20_1m    ... bench: 261,632,538 ns/iter (+/- 6,904,648)
test size_21_2m    ... bench: 483,410,889 ns/iter (+/- 10,294,448)
test size_22_4m    ... bench: 708,794,834 ns/iter (+/- 12,841,685)
test size_23_8m    ... bench: 1,139,315,776 ns/iter (+/- 25,330,167)
Rather than receiving each fragment into an individual buffer first, and
concatenating it to the main buffer afterwards, preallocate space in the
main buffer, and receive the followup fragments directly into it. (Only
the initial fragment is still being copied, while separating out the
message size header.)

This results in another huge performance boost for large transfers,
showing improvements of 50% and more at most sizes. (It's hard to assess
an overall number, because of very strong variance between the
individual sizes...)

The biggest gain is at 2 MiB, with performance improving by about 2.5x
(moving the drop-off in performance to the 4 MiB data point) -- probably
because getting rid of the surplus data copies allows everything to fit
in the last-level cache now at this size; only needing to go to slower
main memory for even larger sizes. 512 KiB also gains about 2x, probably
for similar reasons.

The total speedup compared to the original version for transfers of 512
KiB and more now amounts to about 3x, and some 4.5x on average for even
larger ones.

As an interesting side effect, the benchmark results get much more
consistent with this change, avoiding the need for a warmup. There are
still some weird jumps at certain sizes; but these are less severe now
overall -- and no longer affected by random other factors...
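The strategy can be sketched as follows (recv simulated by plain copies; names are hypothetical): the total size from the message header lets us allocate the main buffer once, and every followup fragment lands directly in its final position instead of in a temporary buffer concatenated later.

```rust
fn assemble(total: usize, first: &[u8], followups: &[&[u8]]) -> Vec<u8> {
    let mut buf = vec![0u8; total];
    // Initial fragment: the one copy that remains, separating out the header.
    buf[..first.len()].copy_from_slice(first);
    let mut offset = first.len();
    for &fragment in followups {
        // stand-in for recv() writing straight into the buffer's tail
        buf[offset..offset + fragment.len()].copy_from_slice(fragment);
        offset += fragment.len();
    }
    assert_eq!(offset, total); // the size header tells us when we're done
    buf
}
```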

test size_00_1    ... bench:       3,066 ns/iter (+/- 59)
test size_01_2    ... bench:       3,116 ns/iter (+/- 46)
test size_02_4    ... bench:       3,003 ns/iter (+/- 40)
test size_03_8    ... bench:       3,071 ns/iter (+/- 44)
test size_04_16   ... bench:       3,088 ns/iter (+/- 52)
test size_05_32   ... bench:       3,143 ns/iter (+/- 42)
test size_06_64   ... bench:       3,219 ns/iter (+/- 73)
test size_07_128  ... bench:       3,499 ns/iter (+/- 92)
test size_08_256  ... bench:       3,992 ns/iter (+/- 78)
test size_09_512  ... bench:       5,002 ns/iter (+/- 84)
test size_10_1k   ... bench:       6,691 ns/iter (+/- 102)
test size_11_2k   ... bench:      11,398 ns/iter (+/- 175)
test size_12_4k   ... bench:      19,907 ns/iter (+/- 116)
test size_13_8k   ... bench:      37,206 ns/iter (+/- 158)
test size_14_16k  ... bench:      63,840 ns/iter (+/- 369)
test size_15_32k  ... bench:     124,472 ns/iter (+/- 1,409)
test size_16_64k  ... bench:     246,473 ns/iter (+/- 2,767)
test size_17_128k ... bench:     487,915 ns/iter (+/- 11,440)
test size_18_256k ... bench:     781,964 ns/iter (+/- 59,461)
test size_19_512k ... bench:     984,189 ns/iter (+/- 86,029)
test size_20_1m   ... bench:   2,037,886 ns/iter (+/- 214,774)
test size_21_2m   ... bench:   2,374,924 ns/iter (+/- 596,728)
test size_22_4m   ... bench:   5,573,282 ns/iter (+/- 756,270)
test size_23_8m   ... bench:  10,058,920 ns/iter (+/- 1,767,761)

test size_00_1    ... bench:     307,074 ns/iter (+/- 3,333)
test size_01_2    ... bench:     306,568 ns/iter (+/- 3,736)
test size_02_4    ... bench:     299,714 ns/iter (+/- 4,773)
test size_03_8    ... bench:     310,657 ns/iter (+/- 5,220)
test size_04_16   ... bench:     306,247 ns/iter (+/- 4,279)
test size_05_32   ... bench:     311,436 ns/iter (+/- 4,693)
test size_06_64   ... bench:     321,380 ns/iter (+/- 5,674)
test size_07_128  ... bench:     347,893 ns/iter (+/- 3,311)
test size_08_256  ... bench:     398,704 ns/iter (+/- 4,745)
test size_09_512  ... bench:     483,585 ns/iter (+/- 6,268)
test size_10_1k   ... bench:     664,508 ns/iter (+/- 12,941)
test size_11_2k   ... bench:   1,126,760 ns/iter (+/- 26,853)
test size_12_4k   ... bench:   1,946,401 ns/iter (+/- 34,891)
test size_13_8k   ... bench:   3,655,198 ns/iter (+/- 32,872)
test size_14_16k  ... bench:   6,374,230 ns/iter (+/- 12,787)
test size_15_32k  ... bench:  12,480,807 ns/iter (+/- 44,458)
test size_16_64k  ... bench:  24,633,454 ns/iter (+/- 94,243)
test size_17_128k ... bench:  48,691,887 ns/iter (+/- 228,529)
test size_18_256k ... bench:  60,997,701 ns/iter (+/- 1,925,834)
test size_19_512k ... bench:  74,230,547 ns/iter (+/- 4,206,076)
test size_20_1m   ... bench: 163,075,989 ns/iter (+/- 6,646,674)
test size_21_2m   ... bench: 193,855,222 ns/iter (+/- 9,828,518)
test size_22_4m   ... bench: 511,140,039 ns/iter (+/- 10,467,888)
test size_23_8m   ... bench: 942,365,953 ns/iter (+/- 19,146,340)
Just like on the sender side, we can use the scatter-gather
functionality of recvmsg(), to put the data directly into the final
place -- rather than having to copy it around -- even for the initial
fragment.
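The receive path described above can be sketched in C (the actual ipc-channel code is Rust; the function name and the fixed 8-byte size header here are illustrative only): recvmsg() with two iovecs puts the header into a small local variable and the payload directly into its final destination, so no intermediate copy is needed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Receive one fragment: header and payload land in their final places
 * in a single recvmsg() call, with no bounce buffer in between.
 * Returns the number of payload bytes received, or -1 on error. */
ssize_t recv_fragment(int sock, uint64_t *total_size,
                      void *payload, size_t payload_cap) {
    struct iovec iov[2] = {
        { total_size, sizeof *total_size }, /* size header */
        { payload, payload_cap },           /* payload, final destination */
    };
    struct msghdr msg = {0};
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;
    ssize_t n = recvmsg(sock, &msg, 0);
    return n < 0 ? n : n - (ssize_t)sizeof *total_size;
}
```

The kernel fills the iovecs in order, so as long as the header size is fixed, the payload bytes never touch a temporary buffer.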

This removes the last major piece of unnecessary overhead; and
consequently results in very large gains mostly for medium-sized
transfers, where the speedup exceeds 10x on my system. Larger transfers
up to a few MiB are still affected quite significantly, improving by
about 45% at 2 MiB and some 13% at 4 MiB.

All in all, transfers of all sizes are now several times faster than on
the first measured version, before applying optimisations -- with the
lowest speedup on my system at about 4x for small transfers; about 4.5x
for very large ones; and the largest boost of >11x for medium-sized
ones.

A quick check on a more modern system (64 bit; fairly recent Intel CPU)
showed even larger gains: while very big transfers were similar (about
5x speedup), small ones gained >9x, and medium-sized ones >20x.

One interesting observation is that on the modern system, the optimised
version shows even more strongly pronounced jumps at specific sizes.
(Especially 256 KiB and at 1 MiB.) While I haven't verified whether
these are more like spikes or more like steps, I suspect it's the
latter: with more streamlined memory access patterns in the optimised
version, it becomes pretty obvious that these jumps are simply
successive levels of the cache hierarchy being exhausted...

Below are the numbers of a typical run on my old system, as usual.

(Note that the numbers shown for medium-large transfers of 256 KiB and
above are now pretty much entirely useless, as on this system the thread
launching overhead is larger than the actual benchmark time... On the
newer system on the other hand launching the extra thread doesn't seem
to have a strongly pronounced effect: the 256 KiB data point shows an
equally strong slowdown with one iteration as with 100...)

test size_00_1    ... bench:       3,050 ns/iter (+/- 50)
test size_01_2    ... bench:       3,018 ns/iter (+/- 59)
test size_02_4    ... bench:       3,098 ns/iter (+/- 55)
test size_03_8    ... bench:       3,112 ns/iter (+/- 35)
test size_04_16   ... bench:       3,104 ns/iter (+/- 54)
test size_05_32   ... bench:       3,094 ns/iter (+/- 80)
test size_06_64   ... bench:       3,060 ns/iter (+/- 49)
test size_07_128  ... bench:       3,101 ns/iter (+/- 30)
test size_08_256  ... bench:       3,142 ns/iter (+/- 50)
test size_09_512  ... bench:       3,171 ns/iter (+/- 49)
test size_10_1k   ... bench:       3,322 ns/iter (+/- 58)
test size_11_2k   ... bench:       4,755 ns/iter (+/- 69)
test size_12_4k   ... bench:       6,277 ns/iter (+/- 77)
test size_13_8k   ... bench:      10,070 ns/iter (+/- 96)
test size_14_16k  ... bench:       9,735 ns/iter (+/- 99)
test size_15_32k  ... bench:      14,330 ns/iter (+/- 114)
test size_16_64k  ... bench:      22,990 ns/iter (+/- 857)
test size_17_128k ... bench:      44,556 ns/iter (+/- 1,693)
test size_18_256k ... bench:     222,732 ns/iter (+/- 47,581)
test size_19_512k ... bench:     447,259 ns/iter (+/- 178,507)
test size_20_1m   ... bench:   1,369,545 ns/iter (+/- 239,571)
test size_21_2m   ... bench:   1,737,641 ns/iter (+/- 515,468)
test size_22_4m   ... bench:   4,923,732 ns/iter (+/- 1,204,576)
test size_23_8m   ... bench:   9,373,281 ns/iter (+/- 1,282,587)

test size_00_1    ... bench:     284,262 ns/iter (+/- 5,333)
test size_01_2    ... bench:     287,241 ns/iter (+/- 5,105)
test size_02_4    ... bench:     291,753 ns/iter (+/- 3,840)
test size_03_8    ... bench:     294,802 ns/iter (+/- 8,149)
test size_04_16   ... bench:     291,475 ns/iter (+/- 4,166)
test size_05_32   ... bench:     292,328 ns/iter (+/- 4,728)
test size_06_64   ... bench:     289,378 ns/iter (+/- 5,618)
test size_07_128  ... bench:     293,067 ns/iter (+/- 5,312)
test size_08_256  ... bench:     308,579 ns/iter (+/- 4,699)
test size_09_512  ... bench:     301,040 ns/iter (+/- 5,636)
test size_10_1k   ... bench:     312,662 ns/iter (+/- 10,609)
test size_11_2k   ... bench:     446,448 ns/iter (+/- 5,646)
test size_12_4k   ... bench:     607,197 ns/iter (+/- 7,551)
test size_13_8k   ... bench:     997,677 ns/iter (+/- 10,513)
test size_14_16k  ... bench:     956,131 ns/iter (+/- 7,722)
test size_15_32k  ... bench:   1,437,060 ns/iter (+/- 6,730)
test size_16_64k  ... bench:   2,269,055 ns/iter (+/- 23,130)
test size_17_128k ... bench:   4,413,551 ns/iter (+/- 19,655)
test size_18_256k ... bench:   8,628,218 ns/iter (+/- 2,455,944)
test size_19_512k ... bench:  19,452,160 ns/iter (+/- 3,474,999)
test size_20_1m   ... bench: 103,957,847 ns/iter (+/- 8,015,338)
test size_21_2m   ... bench: 134,416,502 ns/iter (+/- 15,466,457)
test size_22_4m   ... bench: 446,838,335 ns/iter (+/- 31,417,044)
test size_23_8m   ... bench: 895,111,420 ns/iter (+/- 16,740,773)
@antrik antrik force-pushed the optimise-buffers branch from 24c9b4f to aea1630 Compare May 4, 2016 19:22
@antrik
Contributor Author

antrik commented May 4, 2016

Rebased; fixed typos; replaced macros by helper functions + boilerplate. That should cover everything I hope?...

(Let's see whether CI succeeds on non-Linux, now that the Serde issues are sorted out...)

antrik added 8 commits May 4, 2016 21:34
The buffer size shouldn't change from one channel to another; so instead
of using a syscall to check the size each time, just check it once and
store it using a `lazy_static`.

This also means `get_max_fragment_size()` becomes a static method now,
as it no longer fetches the value for a specific channel, but rather
just refers to the stored value.
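A minimal C sketch of the caching described above (the actual Rust code uses a `lazy_static`; this sketch uses a plain static and is not thread-safe, which the lazy_static handles in the real code -- the function name is illustrative): query SO_SNDBUF once on a throwaway socketpair and reuse the value, instead of paying a getsockopt() syscall on every operation.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

static int cached_size; /* 0 = not yet initialised */

/* Return the maximum fragment size, querying the kernel only once.
 * NOTE: not thread-safe as written -- a sketch, not production code. */
int max_fragment_size(void) {
    if (cached_size == 0) {
        int sv[2];
        socklen_t len = sizeof cached_size;
        if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) == 0) {
            getsockopt(sv[0], SOL_SOCKET, SO_SNDBUF, &cached_size, &len);
            close(sv[0]);
            close(sv[1]);
        }
    }
    return cached_size;
}
```

Since the value is per-system rather than per-channel, the same cached number can size both the send fragments and the receive buffer.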

Note that we only check the send size now, and use it for the size of
the receive buffer too. This is indeed more correct than the previous
implementation, as the receive buffer needs to hold exactly as much as
we might send at most. (Normally they are the same anyway; but if for
some reason the maximum receive size happened to be larger, the previous
code would use a larger buffer than necessary. If the receive size
happened to be *smaller*, either version would fail horribly...)

This doesn't have a noticeable performance impact on the sender, as the
present implementation only checks the size *after* failing to send the
whole message in one packet, i.e. only for large transfers, where the
cost of the extra system call is insignificant. The receiver side on the
other hand always does the check -- and thus the saved call actually
yields a significant improvement for small messages: on my system, small
transfers (send + receive) gain more than 20% performance. Along with
the other improvements, they are now almost five times faster than the
original implementation.

test size_00_1    ... bench:       2,289 ns/iter (+/- 38)
test size_01_2    ... bench:       2,346 ns/iter (+/- 22)
test size_02_4    ... bench:       2,357 ns/iter (+/- 38)
test size_03_8    ... bench:       2,374 ns/iter (+/- 42)
test size_04_16   ... bench:       2,471 ns/iter (+/- 40)
test size_05_32   ... bench:       2,371 ns/iter (+/- 45)
test size_06_64   ... bench:       2,422 ns/iter (+/- 44)
test size_07_128  ... bench:       2,385 ns/iter (+/- 30)
test size_08_256  ... bench:       2,406 ns/iter (+/- 28)
test size_09_512  ... bench:       2,499 ns/iter (+/- 56)
test size_10_1k   ... bench:       2,727 ns/iter (+/- 88)
test size_11_2k   ... bench:       3,924 ns/iter (+/- 47)
test size_12_4k   ... bench:       5,555 ns/iter (+/- 60)
test size_13_8k   ... bench:       9,455 ns/iter (+/- 107)
test size_14_16k  ... bench:       8,999 ns/iter (+/- 90)
test size_15_32k  ... bench:      13,647 ns/iter (+/- 105)
test size_16_64k  ... bench:      22,213 ns/iter (+/- 489)
test size_17_128k ... bench:      43,666 ns/iter (+/- 17,217)
test size_18_256k ... bench:     221,851 ns/iter (+/- 69,636)
test size_19_512k ... bench:     451,801 ns/iter (+/- 113,742)
test size_20_1m   ... bench:   1,330,491 ns/iter (+/- 182,352)
test size_21_2m   ... bench:   1,790,956 ns/iter (+/- 489,327)
test size_22_4m   ... bench:   4,989,840 ns/iter (+/- 1,188,114)
test size_23_8m   ... bench:   9,349,559 ns/iter (+/- 1,334,978)

test size_00_1    ... bench:     231,706 ns/iter (+/- 4,892)
test size_01_2    ... bench:     235,017 ns/iter (+/- 6,437)
test size_02_4    ... bench:     240,197 ns/iter (+/- 4,068)
test size_03_8    ... bench:     244,404 ns/iter (+/- 6,090)
test size_04_16   ... bench:     239,248 ns/iter (+/- 4,041)
test size_05_32   ... bench:     243,360 ns/iter (+/- 5,237)
test size_06_64   ... bench:     236,956 ns/iter (+/- 5,098)
test size_07_128  ... bench:     243,579 ns/iter (+/- 7,305)
test size_08_256  ... bench:     247,605 ns/iter (+/- 5,047)
test size_09_512  ... bench:     276,882 ns/iter (+/- 6,950)
test size_10_1k   ... bench:     261,665 ns/iter (+/- 4,985)
test size_11_2k   ... bench:     395,244 ns/iter (+/- 5,495)
test size_12_4k   ... bench:     558,647 ns/iter (+/- 8,908)
test size_13_8k   ... bench:     941,395 ns/iter (+/- 7,215)
test size_14_16k  ... bench:     907,290 ns/iter (+/- 9,087)
test size_15_32k  ... bench:   1,360,839 ns/iter (+/- 9,137)
test size_16_64k  ... bench:   2,224,395 ns/iter (+/- 362,003)
test size_17_128k ... bench:   4,351,960 ns/iter (+/- 1,726,184)
test size_18_256k ... bench:   8,627,702 ns/iter (+/- 2,335,525)
test size_19_512k ... bench:  19,018,116 ns/iter (+/- 2,757,467)
test size_20_1m   ... bench: 102,819,410 ns/iter (+/- 7,050,372)
test size_21_2m   ... bench: 133,774,605 ns/iter (+/- 14,188,872)
test size_22_4m   ... bench: 450,259,095 ns/iter (+/- 12,859,207)
test size_23_8m   ... bench: 875,984,486 ns/iter (+/- 21,168,557)
Don't send a control message if we have no actual auxiliary data
(channels / shared memory regions) to transfer.
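The optimisation can be sketched in C as follows (illustrative, not the actual ipc-channel code; this sketch caps the descriptor count at 16): the SCM_RIGHTS control message is only built when there are descriptors to pass, so in the common no-descriptor case msg_control stays NULL and no ancillary data is processed by the kernel.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send a data buffer, attaching an SCM_RIGHTS control message only
 * when there are file descriptors to pass (nfds <= 16 in this sketch). */
ssize_t send_with_fds(int sock, const void *data, size_t len,
                      const int *fds, size_t nfds) {
    struct iovec iov = { (void *)data, len };
    struct msghdr msg = {0};
    char cbuf[CMSG_SPACE(16 * sizeof(int))];
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    if (nfds > 0) { /* only pay for ancillary data when actually needed */
        memset(cbuf, 0, sizeof cbuf);
        msg.msg_control = cbuf;
        msg.msg_controllen = CMSG_SPACE(nfds * sizeof(int));
        struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
        c->cmsg_level = SOL_SOCKET;
        c->cmsg_type = SCM_RIGHTS;
        c->cmsg_len = CMSG_LEN(nfds * sizeof(int));
        memcpy(CMSG_DATA(c), fds, nfds * sizeof(int));
    }
    return sendmsg(sock, &msg, 0);
}
```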

This shaves off another 6 or 7 per cent from small transfers (without
FDs) on my system.

test size_00_1    ... bench:       2,154 ns/iter (+/- 33)
test size_01_2    ... bench:       2,203 ns/iter (+/- 42)
test size_02_4    ... bench:       2,231 ns/iter (+/- 51)
test size_03_8    ... bench:       2,231 ns/iter (+/- 28)
test size_04_16   ... bench:       2,290 ns/iter (+/- 47)
test size_05_32   ... bench:       2,261 ns/iter (+/- 57)
test size_06_64   ... bench:       2,316 ns/iter (+/- 51)
test size_07_128  ... bench:       2,247 ns/iter (+/- 38)
test size_08_256  ... bench:       2,266 ns/iter (+/- 45)
test size_09_512  ... bench:       2,572 ns/iter (+/- 52)
test size_10_1k   ... bench:       2,488 ns/iter (+/- 45)
test size_11_2k   ... bench:       3,791 ns/iter (+/- 51)
test size_12_4k   ... bench:       5,365 ns/iter (+/- 54)
test size_13_8k   ... bench:       9,235 ns/iter (+/- 84)
test size_14_16k  ... bench:       8,833 ns/iter (+/- 102)
test size_15_32k  ... bench:      13,370 ns/iter (+/- 117)
test size_16_64k  ... bench:      22,004 ns/iter (+/- 717)
test size_17_128k ... bench:      43,369 ns/iter (+/- 976)
test size_18_256k ... bench:     224,096 ns/iter (+/- 75,219)
test size_19_512k ... bench:     458,353 ns/iter (+/- 149,531)
test size_20_1m   ... bench:   1,357,956 ns/iter (+/- 187,198)
test size_21_2m   ... bench:   1,781,991 ns/iter (+/- 512,027)
test size_22_4m   ... bench:   4,940,065 ns/iter (+/- 1,099,861)
test size_23_8m   ... bench:   9,345,216 ns/iter (+/- 1,557,181)

test size_00_1    ... bench:     222,064 ns/iter (+/- 8,292)
test size_01_2    ... bench:     224,589 ns/iter (+/- 4,033)
test size_02_4    ... bench:     226,667 ns/iter (+/- 4,774)
test size_03_8    ... bench:     229,002 ns/iter (+/- 5,107)
test size_04_16   ... bench:     224,895 ns/iter (+/- 3,323)
test size_05_32   ... bench:     230,973 ns/iter (+/- 3,265)
test size_06_64   ... bench:     224,377 ns/iter (+/- 5,778)
test size_07_128  ... bench:     229,364 ns/iter (+/- 8,282)
test size_08_256  ... bench:     235,654 ns/iter (+/- 3,860)
test size_09_512  ... bench:     235,874 ns/iter (+/- 6,021)
test size_10_1k   ... bench:     246,200 ns/iter (+/- 2,626)
test size_11_2k   ... bench:     386,233 ns/iter (+/- 5,313)
test size_12_4k   ... bench:     542,364 ns/iter (+/- 8,671)
test size_13_8k   ... bench:     929,892 ns/iter (+/- 12,989)
test size_14_16k  ... bench:     868,966 ns/iter (+/- 7,768)
test size_15_32k  ... bench:   1,342,400 ns/iter (+/- 6,781)
test size_16_64k  ... bench:   2,202,955 ns/iter (+/- 86,042)
test size_17_128k ... bench:   4,323,643 ns/iter (+/- 41,162)
test size_18_256k ... bench:   8,708,255 ns/iter (+/- 2,162,868)
test size_19_512k ... bench:  19,281,153 ns/iter (+/- 5,449,494)
test size_20_1m   ... bench: 102,875,141 ns/iter (+/- 10,584,288)
test size_21_2m   ... bench: 134,820,099 ns/iter (+/- 13,015,545)
test size_22_4m   ... bench: 446,291,502 ns/iter (+/- 12,073,997)
test size_23_8m   ... bench: 876,656,087 ns/iter (+/- 16,439,303)
On a 32 bit system, the size header is only 4 bytes; so if we fully use
the rest of the available buffer for payload data, the size of the
latter won't be a multiple of 8 bytes -- and consequently, every second
fragment is read from the source data buffer with poor alignment.

Fix this by always aligning the payload data size sent per fragment to
8 byte boundaries.

(On 64 bit systems, this is a no-op. According to my testing, aligning
to more than 8 byte boundaries doesn't benefit either 32 or 64 bit
systems.)
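The fix above is a single rounding step (function name illustrative): take the buffer space left after the size header and round it down to a multiple of 8, so every fragment reads the source buffer at an aligned offset.

```c
#include <stddef.h>

/* Per-fragment payload size: whatever fits after the header,
 * rounded down to an 8-byte boundary. */
size_t aligned_payload_size(size_t buf_size, size_t header_size) {
    return (buf_size - header_size) & ~(size_t)7;
}
```

With a 4-byte header (32-bit systems) a 4096-byte buffer yields 4088 payload bytes instead of the misaligning 4092; with an 8-byte header (64-bit systems) the result is already a multiple of 8, making the rounding a no-op.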

On my 32 bit x86 system, this produces quite a sizeable performance
improvement for large transfers: peaking at 20% or so around 640 KiB,
and staying above 10% for most of the range from 320 KiB to 2 MiB.

test size_00_1    ... bench:       2,250 ns/iter (+/- 42)
test size_01_2    ... bench:       2,279 ns/iter (+/- 46)
test size_02_4    ... bench:       2,335 ns/iter (+/- 53)
test size_03_8    ... bench:       2,351 ns/iter (+/- 67)
test size_04_16   ... bench:       2,357 ns/iter (+/- 46)
test size_05_32   ... bench:       2,327 ns/iter (+/- 64)
test size_06_64   ... bench:       2,308 ns/iter (+/- 39)
test size_07_128  ... bench:       2,277 ns/iter (+/- 43)
test size_08_256  ... bench:       2,453 ns/iter (+/- 35)
test size_09_512  ... bench:       2,402 ns/iter (+/- 57)
test size_10_1k   ... bench:       2,576 ns/iter (+/- 81)
test size_11_2k   ... bench:       3,883 ns/iter (+/- 66)
test size_12_4k   ... bench:       5,492 ns/iter (+/- 75)
test size_13_8k   ... bench:       9,368 ns/iter (+/- 83)
test size_14_16k  ... bench:       8,857 ns/iter (+/- 107)
test size_15_32k  ... bench:      14,132 ns/iter (+/- 140)
test size_16_64k  ... bench:      22,049 ns/iter (+/- 298)
test size_17_128k ... bench:      43,577 ns/iter (+/- 1,947)
test size_18_256k ... bench:     203,597 ns/iter (+/- 34,495)
test size_19_512k ... bench:     407,050 ns/iter (+/- 246,606)
test size_20_1m   ... bench:   1,334,506 ns/iter (+/- 186,923)
test size_21_2m   ... bench:   1,630,275 ns/iter (+/- 481,481)
test size_22_4m   ... bench:   4,826,184 ns/iter (+/- 980,708)
test size_23_8m   ... bench:   9,390,020 ns/iter (+/- 1,655,050)

test size_00_1    ... bench:     223,092 ns/iter (+/- 3,807)
test size_01_2    ... bench:     223,918 ns/iter (+/- 3,535)
test size_02_4    ... bench:     223,102 ns/iter (+/- 4,907)
test size_03_8    ... bench:     230,394 ns/iter (+/- 4,700)
test size_04_16   ... bench:     224,395 ns/iter (+/- 4,482)
test size_05_32   ... bench:     231,436 ns/iter (+/- 4,214)
test size_06_64   ... bench:     225,216 ns/iter (+/- 3,584)
test size_07_128  ... bench:     228,905 ns/iter (+/- 4,260)
test size_08_256  ... bench:     233,108 ns/iter (+/- 2,998)
test size_09_512  ... bench:     236,013 ns/iter (+/- 4,803)
test size_10_1k   ... bench:     248,637 ns/iter (+/- 5,386)
test size_11_2k   ... bench:     383,750 ns/iter (+/- 6,134)
test size_12_4k   ... bench:     542,666 ns/iter (+/- 9,026)
test size_13_8k   ... bench:     934,623 ns/iter (+/- 9,200)
test size_14_16k  ... bench:     894,028 ns/iter (+/- 7,425)
test size_15_32k  ... bench:   1,350,185 ns/iter (+/- 6,316)
test size_16_64k  ... bench:   2,213,693 ns/iter (+/- 24,549)
test size_17_128k ... bench:   4,348,762 ns/iter (+/- 90,784)
test size_18_256k ... bench:   7,825,637 ns/iter (+/- 2,243,873)
test size_19_512k ... bench:  17,792,873 ns/iter (+/- 3,775,904)
test size_20_1m   ... bench:  99,769,568 ns/iter (+/- 8,760,588)
test size_21_2m   ... bench: 126,575,377 ns/iter (+/- 12,528,236)
test size_22_4m   ... bench: 440,368,689 ns/iter (+/- 16,627,208)
test size_23_8m   ... bench: 859,509,896 ns/iter (+/- 21,848,024)
Using `byteorder` has never actually been *necessary* (the serialised
data doesn't ever cross machine boundaries) -- it was only being
(ab-)used as a convenient way to write/read the header information
to/from the shared buffer. Now that the header gets separate
send/receive buffers, this isn't actually a simplification anymore -- on
the contrary: just using the header data's backing storage as the
send/receive buffer directly is indeed simpler now. (And not
significantly more unsafe either.)

The simpler code also improves performance of small transfers by another
two or three per cent.
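The idea reads naturally in C (the actual code is Rust; the struct layout here is illustrative): rather than serialising header fields into a separate byte buffer, point an iovec at the header struct's own storage. Endianness handling is irrelevant because the data never leaves the machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

struct fragment_header {
    uint64_t total_size; /* total payload size of the whole message */
};

/* Send the first fragment, using the header struct's backing storage
 * directly as the send buffer -- no separate serialisation step. */
ssize_t send_first_fragment(int sock, struct fragment_header *hdr,
                            const void *payload, size_t len) {
    struct iovec iov[2] = {
        { hdr, sizeof *hdr },    /* header storage used as-is */
        { (void *)payload, len },
    };
    struct msghdr msg = {0};
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;
    return sendmsg(sock, &msg, 0);
}
```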

test size_00_1    ... bench:       2,154 ns/iter (+/- 83)
test size_01_2    ... bench:       2,154 ns/iter (+/- 75)
test size_02_4    ... bench:       2,234 ns/iter (+/- 34)
test size_03_8    ... bench:       2,212 ns/iter (+/- 21)
test size_04_16   ... bench:       2,298 ns/iter (+/- 49)
test size_05_32   ... bench:       2,211 ns/iter (+/- 32)
test size_06_64   ... bench:       2,225 ns/iter (+/- 76)
test size_07_128  ... bench:       2,202 ns/iter (+/- 40)
test size_08_256  ... bench:       2,239 ns/iter (+/- 48)
test size_09_512  ... bench:       2,316 ns/iter (+/- 30)
test size_10_1k   ... bench:       2,446 ns/iter (+/- 29)
test size_11_2k   ... bench:       3,797 ns/iter (+/- 67)
test size_12_4k   ... bench:       5,381 ns/iter (+/- 62)
test size_13_8k   ... bench:       9,225 ns/iter (+/- 96)
test size_14_16k  ... bench:       8,739 ns/iter (+/- 60)
test size_15_32k  ... bench:      13,243 ns/iter (+/- 76)
test size_16_64k  ... bench:      21,879 ns/iter (+/- 141)
test size_17_128k ... bench:      43,193 ns/iter (+/- 398)
test size_18_256k ... bench:     205,695 ns/iter (+/- 52,004)
test size_19_512k ... bench:     409,146 ns/iter (+/- 68,651)
test size_20_1m   ... bench:   1,341,949 ns/iter (+/- 240,066)
test size_21_2m   ... bench:   1,662,774 ns/iter (+/- 527,172)
test size_22_4m   ... bench:   4,885,677 ns/iter (+/- 1,113,293)
test size_23_8m   ... bench:   9,300,784 ns/iter (+/- 1,806,111)

test size_00_1    ... bench:     211,985 ns/iter (+/- 3,389)
test size_01_2    ... bench:     210,848 ns/iter (+/- 5,040)
test size_02_4    ... bench:     218,757 ns/iter (+/- 3,885)
test size_03_8    ... bench:     218,317 ns/iter (+/- 5,671)
test size_04_16   ... bench:     219,027 ns/iter (+/- 5,073)
test size_05_32   ... bench:     218,795 ns/iter (+/- 4,695)
test size_06_64   ... bench:     218,217 ns/iter (+/- 3,875)
test size_07_128  ... bench:     224,165 ns/iter (+/- 4,367)
test size_08_256  ... bench:     227,112 ns/iter (+/- 3,922)
test size_09_512  ... bench:     225,733 ns/iter (+/- 3,931)
test size_10_1k   ... bench:     239,269 ns/iter (+/- 4,323)
test size_11_2k   ... bench:     371,675 ns/iter (+/- 6,760)
test size_12_4k   ... bench:     529,841 ns/iter (+/- 7,052)
test size_13_8k   ... bench:     910,285 ns/iter (+/- 7,308)
test size_14_16k  ... bench:     860,518 ns/iter (+/- 7,659)
test size_15_32k  ... bench:   1,331,114 ns/iter (+/- 5,774)
test size_16_64k  ... bench:   2,193,192 ns/iter (+/- 22,878)
test size_17_128k ... bench:   4,324,455 ns/iter (+/- 86,997)
test size_18_256k ... bench:   7,973,472 ns/iter (+/- 1,688,190)
test size_19_512k ... bench:  17,325,137 ns/iter (+/- 6,222,526)
test size_20_1m   ... bench: 100,037,281 ns/iter (+/- 8,976,164)
test size_21_2m   ... bench: 127,489,104 ns/iter (+/- 14,776,106)
test size_22_4m   ... bench: 438,418,131 ns/iter (+/- 13,543,391)
test size_23_8m   ... bench: 858,248,355 ns/iter (+/- 14,316,633)
Followup fragments don't have a header; so they can use a few bytes more
for payload.

While this is not likely ever to make a noticeable performance
difference, having exact calculations in each case seems cleaner,
hopefully avoiding potential confusion...
Only try sending the entire message in one packet if it will actually
fit.

This saves a syscall and some processing, but only for large
(fragmented) messages -- so it doesn't have a noticeable performance
impact. However, it should make behaviour clearer and more predictable;
and it is also required in order to enable further cleanups.
Now that we fully control the size of the first packet even in the
non-fragmented case, we can rely on this size on the receiver side as
well, rather than having to allocate a larger buffer just in case.
This requires some shuffling around of declarations, to facilitate
untangling the actually unsafe operations from those not affecting
safety. (As far as reasonably possible...)

Also added a new assertion to make sure that the trimmed `unsafe` blocks
really do not rely on any conditions being upheld outside.
@antrik antrik force-pushed the optimise-buffers branch from aea1630 to 7c2466e Compare May 4, 2016 19:34
@antrik
Contributor Author

antrik commented May 4, 2016

OK, let's try again...

@pcwalton
Contributor

pcwalton commented May 4, 2016

@bors-servo: r+

@bors-servo
Contributor

📌 Commit 7c2466e has been approved by pcwalton

@bors-servo
Contributor

⌛ Testing commit 7c2466e with merge ca96865...

bors-servo pushed a commit that referenced this pull request May 4, 2016
Optimisations (and cleanups) of Linux backend

This is a bunch of optimisations to the Linux platform code, along with various cleanups of the related code which the optimisation patches build upon.

Most of the cleanups are on the `send()` side, as the `recv()` side is less affected by the optimisation changes, and thus there has been less reason to refactor the code -- some extra cleanup work would probably be in order here.

The optimisations are mostly about avoiding unnecessary copies by using scatter-gather buffers for send and receive; as well as avoiding unnecessary initialisation of receive buffers.

The results are impressive: with gains of at least 5x for large transfers of several MiB (a bit more on a modern system); >5x (on an old system) up to >10x (on a modern one) for small transfers of up to a few KiB; and more than 10x for most of the range in between -- peaking at about 12x - 13x on the old system and 20x - 21x on the modern system for medium-sized transfers of about 64 KiB up to a few hundred KiB.

For another interesting data point, the CPU usage during benchmark runs (with many iterations, to amortise the setup time) was dominated by user time (more than two thirds of total time) with the original variant; whereas the optimised variant not only further reduces system time to less then half the original value (presumably because of fewer allocations?), but also almost entirely eliminates the user time, making it pretty insignificant in the total picture now -- as it should be.

On a less scientific note, Servo built with the optimised ipc-channel doesn't seem to show undue delays any more while rendering a language selection menu. (Which requires lots of fonts to be loaded, and thus triggers heavy ipc-channel activity.)
@bors-servo
Contributor

☀️ Test successful - travis

@bors-servo bors-servo merged commit 7c2466e into servo:master May 4, 2016
@bors-servo bors-servo mentioned this pull request May 4, 2016