Optimisations (and cleanups) of Linux backend #68

Merged
merged 33 commits into from
May 4, 2016

Conversation

antrik
Contributor

@antrik antrik commented May 1, 2016

This is a bunch of optimisations to the Linux platform code, along with various cleanups of the related code which the optimisation patches build upon.

Most of the cleanups are on the send() side, as the recv() side is less affected by the optimisation changes, and thus there has been less reason to refactor the code -- some extra cleanup work would probably be in order here.

The optimisations are mostly about avoiding unnecessary copies by using scatter-gather buffers for send and receive; as well as avoiding unnecessary initialisation of receive buffers.

The results are impressive: gains of at least 5x for large transfers of several MiB (a bit more on a modern system); >5x (on an old system) up to >10x (on a modern one) for small transfers of up to a few KiB; and more than 10x for most of the range in between -- peaking at about 12x - 13x on the old system and 20x - 21x on the modern system for medium-sized transfers of about 64 KiB up to a few hundred KiB.

For another interesting data point: with the original variant, CPU usage during benchmark runs (with many iterations, to amortise the setup time) was dominated by user time, at more than two thirds of total time. The optimised variant not only cuts system time to less than half the original value (presumably because of fewer allocations?), but also almost entirely eliminates the user time, making it insignificant in the total picture -- as it should be.

On a less scientific note, Servo built with the optimised ipc-channel doesn't seem to show undue delays any more while rendering a language selection menu. (Which requires lots of fonts to be loaded, and thus triggers heavy ipc-channel activity.)

@jdm
Member

jdm commented May 1, 2016

Build failure is from aster and quasi crates breaking on the current rustc nightly.

@antrik
Contributor Author

antrik commented May 2, 2016

I now submitted a bunch of PRs for this: serde-deprecated/syntex#44 , serde-deprecated/aster#78 , serde-deprecated/quasi#41 , serde-rs/serde#303

Once these are in, we can update the dependencies in ipc-channel...

@mbrubeck mbrubeck mentioned this pull request May 3, 2016
@@ -278,16 +286,10 @@ impl UnixSender {

if result <= 0 {
let error = UnixError::last();
if error.0 == libc::ENOBUFS && bytes_to_send > 2000 {
if error.0 == libc::ENOBUFS
&& downsize(&mut sendbuf_size, bytes_to_send).is_ok() {

This does not handle the Err(()) case from downsize correctly - perhaps this should be:

if error.0 == libc::ENOBUFS {
    // If the kernel failed to allocate a buffer large enough for the packet,
    // retry with a smaller size (if possible).
    try!(downsize(&mut sendbuf_size, bytes_to_send));
    continue;
} else {

Note this may need a tweak for freeing behavior.

There is a similar issue above, but a later commit makes it moot.

Contributor Author

I don't think returning the empty error from downsize() would be correct handling... The idea here is that if the downsize fails, we want to return the original error -- which happens in the else case.

As I said in the commit message, I'm not entirely happy with how this turned out syntactically -- but I really can't think of a better approach. Quite frankly, I considered dropping this size test entirely: it's a case I don't actually expect to happen; I only added it as an extra safeguard in the original ENOBUFS handling because it seemed easy to do -- but it turned out to be a major pain during refactoring, and I'm no longer sure it's worth the trouble...

Maybe we should just turn it into an assert() instead?
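The control flow under discussion can be sketched roughly like this (hypothetical names; `try_send` stands in for the real `sendmsg()` wrapper, and the downsize policy shown is an assumption, not the actual ipc-channel code). The point of contention: on ENOBUFS we retry with a smaller buffer if possible, and otherwise propagate the *original* errno rather than the empty `Err(())` from `downsize()`:

```rust
// Sketch of the ENOBUFS retry loop (hypothetical names and policy).

const ENOBUFS: i32 = 105; // errno value on Linux

/// Halve the send buffer size, unless it would drop below what we need.
fn downsize(sendbuf_size: &mut usize, bytes_to_send: usize) -> Result<(), ()> {
    if *sendbuf_size / 2 >= bytes_to_send {
        *sendbuf_size /= 2;
        Ok(())
    } else {
        Err(())
    }
}

fn send_with_retry(
    mut try_send: impl FnMut(usize) -> Result<(), i32>,
    mut sendbuf_size: usize,
    bytes_to_send: usize,
) -> Result<(), i32> {
    loop {
        match try_send(sendbuf_size) {
            Ok(()) => return Ok(()),
            Err(errno)
                if errno == ENOBUFS
                    && downsize(&mut sendbuf_size, bytes_to_send).is_ok() =>
            {
                // retry the send with the smaller buffer size
            }
            // any other error -- or ENOBUFS with no downsize left --
            // propagates the original errno
            Err(errno) => return Err(errno),
        }
    }
}
```

Note how the `else` case (returning the original error) falls out of the guard failing, which is exactly the syntactic awkwardness discussed above.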


Fair enough. I was looking at the call above that has an else if instead of an else, but that is fixed by a later commit. Thanks for taking a look, either solution works for me.

@antrik
Contributor Author

antrik commented May 3, 2016

Added a patch making msghdr.iovec a *const to avoid confusion. (And did a few other minor cleanups to iovec handling along the way...)

@bors-servo
Contributor

☔ The latest upstream changes (presumably #66) made this pull request unmergeable. Please resolve the merge conflicts.

macro_rules! create_with_n_fds {
($name:ident, $n:expr, $size:expr) => (
#[test]
fn $name() {
Contributor

@pcwalton pcwalton May 4, 2016

Does this really need to be a macro? Can it just be a helper function that the various tests delegate to?

"Macros are for when you have run out of language" —@dherman

But I think there is still language left :)

Contributor Author

Well, I did use a helper function at first; but that meant a lot of redundant wrappers for each individual test. I didn't like it.

(The macro instantiations admittedly still involve some redundancy for now, due to limitations of the current macro system...)

BTW, I introduced a similar macro already in an earlier PR -- you didn't comment on that one...

Contributor

Redundant wrappers? What do you mean?

I don't see how wrapper functions are redundant. There's a tiny amount of boilerplate, but the clarity is worth it IMO. I always have a hard time reading macros.
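The trade-off being debated can be illustrated with a minimal sketch (hypothetical body, not the actual ipc-channel test; the real macro emits `#[test]` functions, dropped here so the generated functions can be called directly):

```rust
// Helper-function style: shared logic lives in one place...
fn check_with_n_fds(n: usize) -> usize {
    // stand-in for the real body creating a channel carrying `n` FDs
    n * 2
}

// ...but every test still needs a hand-written thin wrapper:
fn with_1_fd_manual() -> usize {
    check_with_n_fds(1)
}

// Macro style: the wrappers themselves are generated.
macro_rules! create_with_n_fds {
    ($name:ident, $n:expr) => {
        fn $name() -> usize {
            check_with_n_fds($n)
        }
    };
}

create_with_n_fds!(with_1_fd, 1);
create_with_n_fds!(with_64_fds, 64);
```

Both styles produce the same behaviour; the macro merely removes the one-line-per-test boilerplate at the cost of readability.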

This should better communicate the actual meaning of this value.

Also updated some comments to reflect the true meaning.
@@ -194,14 +194,19 @@ impl UnixSender {
-> Result<(),UnixError> {
Contributor

typo in commit message: "auxiliary"

@pcwalton
Contributor

pcwalton commented May 4, 2016

This looks good to me with the above nits addressed. Thanks!

antrik added 10 commits May 4, 2016 20:57
Lots of corner cases here that can break when changing implementation
details...

As these tests are all platform-specific, put them in a separate module,
to avoid redundant platform conditionals.
`libc::size_t` is an alias for `usize` -- so the casts are unnecessary,
and only bloat the code, thus reducing readability (especially when they
necessitate extra parentheses); and (as @mbrubeck pointed out) they
actually become a liability when the involved types change, as they can
silently turn into real casts, thus obscuring a potential need for code
adaptations.
…cv()`

The function actually processes the result of the system call before
returning it, checking for negative values among other things -- so
there is no need to return a sized type, nor a libc-specific one. The
internal type should be abstracted by this kind of wrapper function, so
callers don't need ugly casts on each invocation.
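The wrapper pattern described above can be sketched as follows (hypothetical names; the raw result and errno are passed in explicitly here, whereas the real wrapper makes the system call itself):

```rust
// The signed, libc-specific return value is checked once, inside the
// wrapper, so callers get an idiomatic Result<usize, i32> and never
// need casts on each invocation.
fn checked_recv(raw_result: isize, errno: i32) -> Result<usize, i32> {
    if raw_result < 0 {
        Err(errno) // failure: report the error code
    } else {
        Ok(raw_result as usize) // cast is safe: non-negative by the check
    }
}
```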
This avoids redundant processing; and will also enable further cleanups.
This variant is not only more compact and elegant, but also type-safe. I
don't think it has any downsides.
When calculating the maximum size of data we can send in a single
fragment, no longer deduct any amount "dynamically" based on the size of
the auxiliary data (FDs) transferred in the control message.

I'm not sure what originally prompted the idea that deducting this from
the main buffer would be necessary -- my testing at least doesn't show
any need for that. The auxiliary data is transferred in a separate
buffer with its own size limitation. (Defaulting to 10 KiB on my
system.)
As we explicitly size all fragments in a fragmented send, the maximum
size of followup packets is always well known -- so there is no need to
allocate any extra just in case.

(Unlike for the first fragment, which is implicitly sized if it comes
from an unfragmented send; and thus might potentially have a larger size
than what we expect, in case our reserved size doesn't match reality...)
I wonder whether there is some obscure documentation I'm not aware of
that suggests we have to deduct 256 bytes? According to my testing with
Linux 4.4 on i386 as well as Linux 3.16 on x86_64, only 32 bytes
actually need to be deducted.
Rather than halving the payload size, halve the total buffer size, and
recalculate the payload size from that.

This way, the buffer size remains a (more or less) round number, rather
than becoming something just slightly above a round number -- which
should improve resource utilisation, and might even reduce the number of
downsizes necessary in some cases.

This will also facilitate further cleanups.
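A numeric sketch of the scheme described above (the 32-byte reserve is taken from an earlier commit message in this series; treat the exact constant as an assumption). The total buffer size stays a round power of two across downsizes, and the slightly-odd payload size is derived from it:

```rust
const RESERVED: usize = 32; // bytes deducted from the buffer per packet

// Payload capacity is recalculated from the (round) total buffer size.
fn payload_size(sendbuf_size: usize) -> usize {
    sendbuf_size - RESERVED
}

// Halving the total keeps it round: 256 KiB -> 128 KiB -> 64 KiB -> ...
fn downsize(sendbuf_size: usize) -> usize {
    sendbuf_size / 2
}
```

Halving the payload size directly would instead yield values like 131,056 after one step, drifting ever further from round buffer sizes.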
antrik added 11 commits May 4, 2016 20:59
While `recvmsg()` mutates the buffers referenced by the `iovec`, the
`iovec` itself is never modified by either `sendmsg()` or `recvmsg()`.

This field is indeed marked `*const` in `libc::msghdr` as well.
When using fragmentation, the `msghdr` structure is only used for the
first packet, which is already sent in a separate conditional arm anyway
-- so we can just as well create (and deallocate) the header within this
conditional too.

This should make the code more robust and much easier to follow.

(Note that this means the header will be recreated when we have to
resend on ENOBUFS. The performance impact should be negligible though;
and it's an exceptional case anyways.)
Moving the actual sending of the message into the `construct_header()`
function, and renaming it to `send_first_fragment()` to reflect that
change.

With the previous changes, we always send the message right after
constructing the header -- so it makes sense to put the common sequence
in one place. Removing the need to pass the header structs through the
caller also gets rid of the ugly `iovec` return hack.

What's more, this helps isolate the unsafe operations: invoking
`send_first_fragment()` is in fact not unsafe at all.
Moving the `send()` call for followup fragments into a sub-function
along the lines of `send_first_fragment()`.

While this one is only invoked once, the common pattern should make the
code easier to read; and just like with `send_first_fragment()`, it also
helps isolate the unsafe code.
Now that all the unsafe code is isolated in `send_*_fragment()`, the
rest of the `send()` method doesn't need to be marked unsafe anymore.
Measure performance of transfers of various sizes, to keep track of the
performance impact of upcoming optimisations.

This makes use of `get_max_fragment_size()` outside of platform code; so
it additionally necessitates adding stub implementations of this method
for all platforms.

The benchmark results are not as consistent as one would hope for -- but
it should be good enough to judge the impact of any major changes. (See
also the code comment in `benches/bench.rs` for an explanation of the
`_invoke_chaos` benchmark pass...)

Below are the numbers from a typical run on my system. (Which is an
Intel Core2 Quad at 2.5 GHz running a 32 Bit GNU/Linux system, i.e. a
pretty old system from 2008 or thereabouts.)

These numbers were obtained with the cpufreq governor set to
`performance` (rather than the default `ondemand`) for more reproducible
results. Unfortunately, this doesn't exclude other external factors,
such as memory pressure -- so it's still tricky to compare different
test runs.

The numbers presented here (and in the following bunch of optimisation
commits) were all obtained in a single large test series; so they should
be comparable -- but it's still tricky to compare results when checking
the impact of any new patches...

The second block of results is with `ITERATIONS` in `benches/bench.rs`
increased to 100. Aside from somewhat reducing randomness in general,
this is important because for fragmented transfers we need to spawn a
thread in the benchmark suite, which is significantly affecting the
results for medium-large transfers (especially 256 KiB and the next few
ones) when using only a single iteration -- in fact dominating them once
some optimisations are applied. (And also introducing lots of
randomness.)

Still presenting the result for one iteration as well: mostly because
the absolute values have the right magnitude in this case, while adding
more iterations shifts them accordingly.

The largest size doesn't actually produce a result with 100 iterations,
because of some integer overflow in `cargo bench` I presume. We will get
results for this one as well though once some optimisations are applied.

test _invoke_chaos ... bench:   4,283,639 ns/iter (+/- 369,532)
test size_00_1     ... bench:      11,893 ns/iter (+/- 233)
test size_01_2     ... bench:      11,786 ns/iter (+/- 111)
test size_02_4     ... bench:      11,766 ns/iter (+/- 103)
test size_03_8     ... bench:      11,731 ns/iter (+/- 71)
test size_04_16    ... bench:      11,777 ns/iter (+/- 67)
test size_05_32    ... bench:      11,823 ns/iter (+/- 87)
test size_06_64    ... bench:      12,085 ns/iter (+/- 94)
test size_07_128   ... bench:      12,358 ns/iter (+/- 108)
test size_08_256   ... bench:      12,710 ns/iter (+/- 110)
test size_09_512   ... bench:      13,674 ns/iter (+/- 151)
test size_10_1k    ... bench:      15,634 ns/iter (+/- 151)
test size_11_2k    ... bench:      19,289 ns/iter (+/- 182)
test size_12_4k    ... bench:      26,931 ns/iter (+/- 92)
test size_13_8k    ... bench:      42,751 ns/iter (+/- 234)
test size_14_16k   ... bench:      74,066 ns/iter (+/- 432)
test size_15_32k   ... bench:     137,961 ns/iter (+/- 694)
test size_16_64k   ... bench:     262,229 ns/iter (+/- 2,664)
test size_17_128k  ... bench:     509,617 ns/iter (+/- 7,176)
test size_18_256k  ... bench:   1,202,057 ns/iter (+/- 261,359)
test size_19_512k  ... bench:   2,267,058 ns/iter (+/- 403,483)
test size_20_1m    ... bench:   6,033,593 ns/iter (+/- 332,782)
test size_21_2m    ... bench:  12,403,937 ns/iter (+/- 626,731)
test size_22_4m    ... bench:  25,218,893 ns/iter (+/- 1,290,866)
test size_23_8m    ... bench:  45,120,983 ns/iter (+/- 1,714,226)

test _invoke_chaos ... bench: 419,861,129 ns/iter (+/- 8,342,705)
test size_00_1     ... bench:   1,172,231 ns/iter (+/- 6,850)
test size_01_2     ... bench:   1,176,073 ns/iter (+/- 6,664)
test size_02_4     ... bench:   1,179,213 ns/iter (+/- 10,001)
test size_03_8     ... bench:   1,179,364 ns/iter (+/- 9,985)
test size_04_16    ... bench:   1,182,618 ns/iter (+/- 9,154)
test size_05_32    ... bench:   1,183,845 ns/iter (+/- 6,272)
test size_06_64    ... bench:   1,192,917 ns/iter (+/- 6,473)
test size_07_128   ... bench:   1,219,179 ns/iter (+/- 9,096)
test size_08_256   ... bench:   1,266,088 ns/iter (+/- 14,919)
test size_09_512   ... bench:   1,349,183 ns/iter (+/- 21,996)
test size_10_1k    ... bench:   1,548,835 ns/iter (+/- 12,133)
test size_11_2k    ... bench:   1,929,276 ns/iter (+/- 17,447)
test size_12_4k    ... bench:   2,649,545 ns/iter (+/- 52,515)
test size_13_8k    ... bench:   4,233,634 ns/iter (+/- 27,626)
test size_14_16k   ... bench:   7,410,534 ns/iter (+/- 16,211)
test size_15_32k   ... bench:  13,733,377 ns/iter (+/- 28,703)
test size_16_64k   ... bench:  26,113,539 ns/iter (+/- 73,558)
test size_17_128k  ... bench:  50,787,086 ns/iter (+/- 96,875)
test size_18_256k  ... bench:  93,566,074 ns/iter (+/- 3,746,932)
test size_19_512k  ... bench: 226,470,336 ns/iter (+/- 24,565,318)
test size_20_1m    ... bench: 519,162,370 ns/iter (+/- 7,462,605)
test size_21_2m    ... bench: 1,099,614,726 ns/iter (+/- 7,175,544)
test size_22_4m    ... bench: 2,108,188,068 ns/iter (+/- 39,768,687)
test size_23_8m    ... bench:           0 ns/iter (+/- 30,262,382)
Now that we no longer have to deal with fragments of different messages
interleaving, there is no need for the complicated fragment ID handling.
The only information we strictly need is whether we have fragmentation
at all. (So the receiver knows whether to retrieve and use the dedicated
channel.)

However, rather than only sending a boolean flag, we can just as well
send a `usize` announcing the total size of payload data in this
message. This way the receiver knows when fragmentation occurs -- and
additionally knows exactly when all fragments have been received. This
avoids the need to check in the receiver for the channel being closed by
the sender; and knowing the total size in advance will also enable
further optimisations/simplifications in the future.

As a side effect, getting rid of the fragment headers removes the need
in the sender to copy the data again when preparing send buffers for the
followup fragments. (One copy operation is still needed for assembling
the initial send buffer from the size header and main data.) This almost
doubles performance (send + receive) of large transfers. Note though
that this is a temporary effect: upcoming, more thorough optimisations
will make this change mostly meaningless. (In terms of performance, that
is -- the simplification is still worthwhile of course!)
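The framing described above can be sketched like this (hypothetical helper names; the real code builds the header into the first `sendmsg()` call rather than a separate `Vec`). From the first fragment alone, the receiver learns both whether followup fragments exist and exactly how many bytes are still outstanding:

```rust
use std::convert::TryInto;

// Sender side: prepend the total payload size as a fixed-width header.
fn frame(payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(8 + payload.len());
    buf.extend_from_slice(&(payload.len() as u64).to_ne_bytes());
    buf.extend_from_slice(payload);
    buf
}

/// Receiver side: returns (total payload size, payload bytes in this
/// fragment). The message is fragmented iff the data is shorter than
/// the announced total.
fn parse_header(first_fragment: &[u8]) -> (usize, &[u8]) {
    let (header, data) = first_fragment.split_at(8);
    let total = u64::from_ne_bytes(header.try_into().unwrap()) as usize;
    (total, data)
}
```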

Regarding the results below, note that the single-iteration numbers for
256k and especially 512k experience a *huge* random fluctuation here
(the latter ranging between 1.2 ms and 1.6 ms from one test run to
another...) -- so they are really useful only as a very rough
orientation, rather than for comparing against other results.

The results for 100 iterations on the other hand are pretty stable, with
fluctuations usually below 2% for all sizes.

test _invoke_chaos ... bench:   2,779,348 ns/iter (+/- 353,869)
test size_00_1     ... bench:      11,958 ns/iter (+/- 131)
test size_01_2     ... bench:      11,895 ns/iter (+/- 89)
test size_02_4     ... bench:      11,646 ns/iter (+/- 99)
test size_03_8     ... bench:      11,648 ns/iter (+/- 57)
test size_04_16    ... bench:      11,675 ns/iter (+/- 64)
test size_05_32    ... bench:      11,715 ns/iter (+/- 47)
test size_06_64    ... bench:      11,885 ns/iter (+/- 84)
test size_07_128   ... bench:      12,159 ns/iter (+/- 103)
test size_08_256   ... bench:      12,536 ns/iter (+/- 155)
test size_09_512   ... bench:      13,544 ns/iter (+/- 156)
test size_10_1k    ... bench:      15,472 ns/iter (+/- 123)
test size_11_2k    ... bench:      19,217 ns/iter (+/- 128)
test size_12_4k    ... bench:      26,548 ns/iter (+/- 401)
test size_13_8k    ... bench:      42,123 ns/iter (+/- 436)
test size_14_16k   ... bench:      73,710 ns/iter (+/- 259)
test size_15_32k   ... bench:     139,332 ns/iter (+/- 1,106)
test size_16_64k   ... bench:     267,651 ns/iter (+/- 2,688)
test size_17_128k  ... bench:     517,987 ns/iter (+/- 7,714)
test size_18_256k  ... bench:     934,841 ns/iter (+/- 272,889)
test size_19_512k  ... bench:   1,327,214 ns/iter (+/- 417,956)
test size_20_1m    ... bench:   3,786,214 ns/iter (+/- 429,365)
test size_21_2m    ... bench:   7,559,035 ns/iter (+/- 738,997)
test size_22_4m    ... bench:  15,069,971 ns/iter (+/- 1,203,609)
test size_23_8m    ... bench:  24,633,162 ns/iter (+/- 1,969,078)

test _invoke_chaos ... bench: 277,756,398 ns/iter (+/- 8,038,558)
test size_00_1     ... bench:   1,187,442 ns/iter (+/- 5,224)
test size_01_2     ... bench:   1,189,368 ns/iter (+/- 7,582)
test size_02_4     ... bench:   1,168,775 ns/iter (+/- 9,093)
test size_03_8     ... bench:   1,173,062 ns/iter (+/- 10,568)
test size_04_16    ... bench:   1,171,706 ns/iter (+/- 8,885)
test size_05_32    ... bench:   1,176,364 ns/iter (+/- 5,833)
test size_06_64    ... bench:   1,190,941 ns/iter (+/- 7,975)
test size_07_128   ... bench:   1,224,023 ns/iter (+/- 7,265)
test size_08_256   ... bench:   1,267,395 ns/iter (+/- 7,898)
test size_09_512   ... bench:   1,360,957 ns/iter (+/- 9,707)
test size_10_1k    ... bench:   1,538,531 ns/iter (+/- 7,787)
test size_11_2k    ... bench:   1,921,562 ns/iter (+/- 19,868)
test size_12_4k    ... bench:   2,655,408 ns/iter (+/- 42,143)
test size_13_8k    ... bench:   4,250,711 ns/iter (+/- 61,767)
test size_14_16k   ... bench:   7,446,110 ns/iter (+/- 246,966)
test size_15_32k   ... bench:  13,971,866 ns/iter (+/- 20,793)
test size_16_64k   ... bench:  26,721,142 ns/iter (+/- 49,591)
test size_17_128k  ... bench:  51,645,773 ns/iter (+/- 109,419)
test size_18_256k  ... bench:  67,393,172 ns/iter (+/- 2,902,754)
test size_19_512k  ... bench: 138,950,061 ns/iter (+/- 27,886,102)
test size_20_1m    ... bench: 344,968,619 ns/iter (+/- 6,519,044)
test size_21_2m    ... bench: 685,842,845 ns/iter (+/- 16,417,853)
test size_22_4m    ... bench: 1,154,950,689 ns/iter (+/- 14,661,441)
test size_23_8m    ... bench: 2,335,253,136 ns/iter (+/- 25,108,022)
Using the scatter-gather functionality of the sendmsg() system call, we
can avoid the need for copying data into a dedicated buffer altogether.

This gets another sizeable performance increase (on top of the fragment
header change) for large transfers, resulting in a total speedup of
about 2x - 3x (depending on size) compared to the original version.
Medium-sized (non-fragmented) transfers also gain a few per cent.

(The gains on the sender side are actually even bigger: we still haven't
optimised the receiver side at all -- so that is now the main
bottleneck...)

Note that comparing against the original variant indeed makes more sense
here than looking at the most recent results, because this new approach
effectively obsoletes the performance gains of the previous change:
using scatter-gather, we could have achieved the same zero-copy effect
even with the old fragment header approach, with only some small
overhead for handling the actual headers. (The code is considerably
simpler without the fragment headers, though.)
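The gather idea can be illustrated with std's vectored-write API, which maps to `writev()` and is an analogue of the `sendmsg()` iovec mechanism used here (this is a sketch, not the actual ipc-channel code): header and payload are handed over as separate slices, with no combined intermediate buffer.

```rust
use std::io::{IoSlice, Write};

fn send_gathered<W: Write>(
    w: &mut W,
    header: &[u8],
    payload: &[u8],
) -> std::io::Result<()> {
    let total = header.len() + payload.len();
    let mut written = 0;
    while written < total {
        // Recompute the remaining slices, so partial writes are handled.
        let bufs = if written < header.len() {
            [IoSlice::new(&header[written..]), IoSlice::new(payload)]
        } else {
            [
                IoSlice::new(&payload[written - header.len()..]),
                IoSlice::new(&[]),
            ]
        };
        written += w.write_vectored(&bufs)?;
    }
    Ok(())
}
```

With a real `UnixStream` the kernel reads both slices in one system call; the `Vec<u8>` writer used in testing merely demonstrates the API shape.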

test _invoke_chaos ... bench:   2,681,822 ns/iter (+/- 219,773)
test size_00_1     ... bench:      11,815 ns/iter (+/- 121)
test size_01_2     ... bench:      11,798 ns/iter (+/- 57)
test size_02_4     ... bench:      11,727 ns/iter (+/- 72)
test size_03_8     ... bench:      11,717 ns/iter (+/- 53)
test size_04_16    ... bench:      11,783 ns/iter (+/- 128)
test size_05_32    ... bench:      11,798 ns/iter (+/- 59)
test size_06_64    ... bench:      11,947 ns/iter (+/- 103)
test size_07_128   ... bench:      12,203 ns/iter (+/- 89)
test size_08_256   ... bench:      12,779 ns/iter (+/- 94)
test size_09_512   ... bench:      13,866 ns/iter (+/- 122)
test size_10_1k    ... bench:      15,288 ns/iter (+/- 231)
test size_11_2k    ... bench:      19,603 ns/iter (+/- 256)
test size_12_4k    ... bench:      28,476 ns/iter (+/- 232)
test size_13_8k    ... bench:      46,229 ns/iter (+/- 200)
test size_14_16k   ... bench:      73,107 ns/iter (+/- 314)
test size_15_32k   ... bench:     133,618 ns/iter (+/- 1,121)
test size_16_64k   ... bench:     254,531 ns/iter (+/- 4,815)
test size_17_128k  ... bench:     495,497 ns/iter (+/- 10,201)
test size_18_256k  ... bench:     886,351 ns/iter (+/- 199,942)
test size_19_512k  ... bench:   1,123,621 ns/iter (+/- 305,389)
test size_20_1m    ... bench:   3,036,919 ns/iter (+/- 285,712)
test size_21_2m    ... bench:   5,222,443 ns/iter (+/- 347,878)
test size_22_4m    ... bench:   7,551,218 ns/iter (+/- 914,100)
test size_23_8m    ... bench:  14,977,983 ns/iter (+/- 1,302,684)

test _invoke_chaos ... bench: 244,491,008 ns/iter (+/- 7,239,693)
test size_00_1     ... bench:   1,183,187 ns/iter (+/- 5,943)
test size_01_2     ... bench:   1,183,225 ns/iter (+/- 9,967)
test size_02_4     ... bench:   1,180,223 ns/iter (+/- 6,720)
test size_03_8     ... bench:   1,181,466 ns/iter (+/- 6,136)
test size_04_16    ... bench:   1,183,006 ns/iter (+/- 7,598)
test size_05_32    ... bench:   1,191,722 ns/iter (+/- 9,888)
test size_06_64    ... bench:   1,198,561 ns/iter (+/- 8,227)
test size_07_128   ... bench:   1,217,393 ns/iter (+/- 5,343)
test size_08_256   ... bench:   1,269,855 ns/iter (+/- 7,813)
test size_09_512   ... bench:   1,393,586 ns/iter (+/- 7,397)
test size_10_1k    ... bench:   1,524,854 ns/iter (+/- 15,853)
test size_11_2k    ... bench:   1,959,964 ns/iter (+/- 24,707)
test size_12_4k    ... bench:   2,858,032 ns/iter (+/- 11,557)
test size_13_8k    ... bench:   4,629,783 ns/iter (+/- 11,677)
test size_14_16k   ... bench:   7,321,471 ns/iter (+/- 13,709)
test size_15_32k   ... bench:  13,397,902 ns/iter (+/- 16,635)
test size_16_64k   ... bench:  25,558,619 ns/iter (+/- 112,384)
test size_17_128k  ... bench:  49,717,629 ns/iter (+/- 1,668,997)
test size_18_256k  ... bench:  67,053,276 ns/iter (+/- 2,203,769)
test size_19_512k  ... bench: 125,389,098 ns/iter (+/- 20,284,576)
test size_20_1m    ... bench: 276,946,251 ns/iter (+/- 8,119,682)
test size_21_2m    ... bench: 487,328,628 ns/iter (+/- 12,586,919)
test size_22_4m    ... bench: 715,961,795 ns/iter (+/- 14,266,485)
test size_23_8m    ... bench: 1,156,209,396 ns/iter (+/- 20,298,143)
The receiver always has to allocate a buffer large enough to fit the
maximal packet size, as it doesn't know how large the next message will
be. Up till now, the entire buffer was being 0-filled on allocation --
which was inflicting considerable overhead for small messages: skipping
the initialisation almost quadruples(!) the performance of small
transfers on my system; and has a noticeable effect on larger transfers
too, if the last fragment is relatively small. (About 10% for 512 KiB
transfers for example, where the last fragment is only about 32 KiB.)

We truncate the length of the receive buffer (vector) to the actual size
of the data received, right after the receive call -- so given that we
don't do anything else between allocating the buffer and receiving,
having it temporarily uninitialised shouldn't be terribly unsafe.
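The pattern can be sketched with today's `spare_capacity_mut` API (not what the 2016-era code literally did, and with the recv call simulated by a closure): allocate without zero-filling, let recv write into the uninitialised capacity, then set the length to the bytes that actually arrived.

```rust
use std::mem::MaybeUninit;

fn recv_into_fresh_buffer(
    recv: impl FnOnce(&mut [MaybeUninit<u8>]) -> usize,
    max_packet_size: usize,
) -> Vec<u8> {
    let mut buf: Vec<u8> = Vec::with_capacity(max_packet_size); // no 0-fill
    let received = recv(buf.spare_capacity_mut());
    // Sound only because exactly `received` bytes were just initialised,
    // and nothing reads the buffer between allocation and the recv.
    unsafe { buf.set_len(received) };
    buf
}
```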

test _invoke_chaos ... bench:   2,646,830 ns/iter (+/- 221,495)
test size_00_1     ... bench:       3,070 ns/iter (+/- 48)
test size_01_2     ... bench:       3,076 ns/iter (+/- 63)
test size_02_4     ... bench:       3,040 ns/iter (+/- 64)
test size_03_8     ... bench:       3,020 ns/iter (+/- 55)
test size_04_16    ... bench:       3,072 ns/iter (+/- 65)
test size_05_32    ... bench:       3,162 ns/iter (+/- 56)
test size_06_64    ... bench:       3,249 ns/iter (+/- 63)
test size_07_128   ... bench:       3,447 ns/iter (+/- 74)
test size_08_256   ... bench:       3,959 ns/iter (+/- 101)
test size_09_512   ... bench:       5,199 ns/iter (+/- 77)
test size_10_1k    ... bench:       6,577 ns/iter (+/- 124)
test size_11_2k    ... bench:      11,090 ns/iter (+/- 157)
test size_12_4k    ... bench:      19,702 ns/iter (+/- 80)
test size_13_8k    ... bench:      37,433 ns/iter (+/- 234)
test size_14_16k   ... bench:      64,489 ns/iter (+/- 461)
test size_15_32k   ... bench:     125,141 ns/iter (+/- 1,305)
test size_16_64k   ... bench:     247,282 ns/iter (+/- 2,844)
test size_17_128k  ... bench:     489,099 ns/iter (+/- 2,951)
test size_18_256k  ... bench:     838,078 ns/iter (+/- 130,191)
test size_19_512k  ... bench:   1,071,451 ns/iter (+/- 193,737)
test size_20_1m    ... bench:   2,935,692 ns/iter (+/- 255,588)
test size_21_2m    ... bench:   5,183,219 ns/iter (+/- 295,405)
test size_22_4m    ... bench:   7,396,433 ns/iter (+/- 808,390)
test size_23_8m    ... bench:  14,341,520 ns/iter (+/- 1,275,331)

test _invoke_chaos ... bench: 238,710,985 ns/iter (+/- 16,477,582)
test size_00_1     ... bench:     312,659 ns/iter (+/- 8,490)
test size_01_2     ... bench:     311,919 ns/iter (+/- 5,871)
test size_02_4     ... bench:     305,455 ns/iter (+/- 3,955)
test size_03_8     ... bench:     310,709 ns/iter (+/- 6,244)
test size_04_16    ... bench:     309,599 ns/iter (+/- 5,053)
test size_05_32    ... bench:     313,935 ns/iter (+/- 5,827)
test size_06_64    ... bench:     325,553 ns/iter (+/- 4,227)
test size_07_128   ... bench:     352,797 ns/iter (+/- 7,614)
test size_08_256   ... bench:     396,614 ns/iter (+/- 11,826)
test size_09_512   ... bench:     482,741 ns/iter (+/- 8,917)
test size_10_1k    ... bench:     664,678 ns/iter (+/- 9,785)
test size_11_2k    ... bench:   1,132,642 ns/iter (+/- 14,503)
test size_12_4k    ... bench:   1,942,667 ns/iter (+/- 22,892)
test size_13_8k    ... bench:   3,706,023 ns/iter (+/- 16,474)
test size_14_16k   ... bench:   6,410,924 ns/iter (+/- 12,465)
test size_15_32k   ... bench:  12,494,806 ns/iter (+/- 23,242)
test size_16_64k   ... bench:  24,591,965 ns/iter (+/- 81,764)
test size_17_128k  ... bench:  48,608,297 ns/iter (+/- 106,295)
test size_18_256k  ... bench:  65,057,222 ns/iter (+/- 1,918,700)
test size_19_512k  ... bench: 114,772,423 ns/iter (+/- 18,717,560)
test size_20_1m    ... bench: 261,632,538 ns/iter (+/- 6,904,648)
test size_21_2m    ... bench: 483,410,889 ns/iter (+/- 10,294,448)
test size_22_4m    ... bench: 708,794,834 ns/iter (+/- 12,841,685)
test size_23_8m    ... bench: 1,139,315,776 ns/iter (+/- 25,330,167)
Rather than receiving each fragment into an individual buffer first, and
concatenating it to the main buffer afterwards, preallocate space in the
main buffer, and receive the followup fragments directly into it. (Only
the initial fragment is still being copied, while separating out the
message size header.)

This results in another huge performance boost for large transfers,
showing improvements of 50% and more at most sizes. (It's hard to assess
an overall number, because of very strong variance between the
individual sizes...)

The biggest gain is at 2 MiB, with performance improving by about 2.5x
(moving the drop-off in performance to the 4 MiB data point) -- probably
because getting rid of the surplus data copies allows everything to fit
in the last-level cache now at this size; only needing to go to slower
main memory for even larger sizes. 512 KiB also gains about 2x, probably
for similar reasons.

The total speedup compared to the original version for transfers of 512
KiB and more now amounts to about 3x, and some 4.5x on average for even
larger ones.

As an interesting side effect, the benchmark results get much more
consistent with this change, avoiding the need for a warmup. There are
still some weird jumps at certain sizes; but these are less severe now
overall -- and no longer affected by random other factors...
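The strategy can be sketched as follows (recv simulated by plain copies; names are hypothetical): the total size from the message header lets us allocate the main buffer once, and every followup fragment lands directly in its final position instead of in a temporary buffer concatenated later.

```rust
fn assemble(total: usize, first: &[u8], followups: &[&[u8]]) -> Vec<u8> {
    let mut buf = vec![0u8; total];
    // Initial fragment: the one copy that remains, separating out the header.
    buf[..first.len()].copy_from_slice(first);
    let mut offset = first.len();
    for &fragment in followups {
        // stand-in for recv() writing straight into the buffer's tail
        buf[offset..offset + fragment.len()].copy_from_slice(fragment);
        offset += fragment.len();
    }
    assert_eq!(offset, total); // the size header tells us when we're done
    buf
}
```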

test size_00_1    ... bench:       3,066 ns/iter (+/- 59)
test size_01_2    ... bench:       3,116 ns/iter (+/- 46)
test size_02_4    ... bench:       3,003 ns/iter (+/- 40)
test size_03_8    ... bench:       3,071 ns/iter (+/- 44)
test size_04_16   ... bench:       3,088 ns/iter (+/- 52)
test size_05_32   ... bench:       3,143 ns/iter (+/- 42)
test size_06_64   ... bench:       3,219 ns/iter (+/- 73)
test size_07_128  ... bench:       3,499 ns/iter (+/- 92)
test size_08_256  ... bench:       3,992 ns/iter (+/- 78)
test size_09_512  ... bench:       5,002 ns/iter (+/- 84)
test size_10_1k   ... bench:       6,691 ns/iter (+/- 102)
test size_11_2k   ... bench:      11,398 ns/iter (+/- 175)
test size_12_4k   ... bench:      19,907 ns/iter (+/- 116)
test size_13_8k   ... bench:      37,206 ns/iter (+/- 158)
test size_14_16k  ... bench:      63,840 ns/iter (+/- 369)
test size_15_32k  ... bench:     124,472 ns/iter (+/- 1,409)
test size_16_64k  ... bench:     246,473 ns/iter (+/- 2,767)
test size_17_128k ... bench:     487,915 ns/iter (+/- 11,440)
test size_18_256k ... bench:     781,964 ns/iter (+/- 59,461)
test size_19_512k ... bench:     984,189 ns/iter (+/- 86,029)
test size_20_1m   ... bench:   2,037,886 ns/iter (+/- 214,774)
test size_21_2m   ... bench:   2,374,924 ns/iter (+/- 596,728)
test size_22_4m   ... bench:   5,573,282 ns/iter (+/- 756,270)
test size_23_8m   ... bench:  10,058,920 ns/iter (+/- 1,767,761)

test size_00_1    ... bench:     307,074 ns/iter (+/- 3,333)
test size_01_2    ... bench:     306,568 ns/iter (+/- 3,736)
test size_02_4    ... bench:     299,714 ns/iter (+/- 4,773)
test size_03_8    ... bench:     310,657 ns/iter (+/- 5,220)
test size_04_16   ... bench:     306,247 ns/iter (+/- 4,279)
test size_05_32   ... bench:     311,436 ns/iter (+/- 4,693)
test size_06_64   ... bench:     321,380 ns/iter (+/- 5,674)
test size_07_128  ... bench:     347,893 ns/iter (+/- 3,311)
test size_08_256  ... bench:     398,704 ns/iter (+/- 4,745)
test size_09_512  ... bench:     483,585 ns/iter (+/- 6,268)
test size_10_1k   ... bench:     664,508 ns/iter (+/- 12,941)
test size_11_2k   ... bench:   1,126,760 ns/iter (+/- 26,853)
test size_12_4k   ... bench:   1,946,401 ns/iter (+/- 34,891)
test size_13_8k   ... bench:   3,655,198 ns/iter (+/- 32,872)
test size_14_16k  ... bench:   6,374,230 ns/iter (+/- 12,787)
test size_15_32k  ... bench:  12,480,807 ns/iter (+/- 44,458)
test size_16_64k  ... bench:  24,633,454 ns/iter (+/- 94,243)
test size_17_128k ... bench:  48,691,887 ns/iter (+/- 228,529)
test size_18_256k ... bench:  60,997,701 ns/iter (+/- 1,925,834)
test size_19_512k ... bench:  74,230,547 ns/iter (+/- 4,206,076)
test size_20_1m   ... bench: 163,075,989 ns/iter (+/- 6,646,674)
test size_21_2m   ... bench: 193,855,222 ns/iter (+/- 9,828,518)
test size_22_4m   ... bench: 511,140,039 ns/iter (+/- 10,467,888)
test size_23_8m   ... bench: 942,365,953 ns/iter (+/- 19,146,340)
Just like on the sender side, we can use the scatter-gather
functionality of recvmsg(), to put the data directly into the final
place -- rather than having to copy it around -- even for the initial
fragment.
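The receive path described above can be sketched in C (the actual ipc-channel code is Rust; the function name and the fixed 8-byte size header here are illustrative only): recvmsg() with two iovecs puts the header into a small local variable and the payload directly into its final destination, so no intermediate copy is needed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Receive one fragment: header and payload land in their final places
 * in a single recvmsg() call, with no bounce buffer in between.
 * Returns the number of payload bytes received, or -1 on error. */
ssize_t recv_fragment(int sock, uint64_t *total_size,
                      void *payload, size_t payload_cap) {
    struct iovec iov[2] = {
        { total_size, sizeof *total_size }, /* size header */
        { payload, payload_cap },           /* payload, final destination */
    };
    struct msghdr msg = {0};
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;
    ssize_t n = recvmsg(sock, &msg, 0);
    return n < 0 ? n : n - (ssize_t)sizeof *total_size;
}
```

The kernel fills the iovecs in order, so as long as the header size is fixed, the payload bytes never touch a temporary buffer.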

This removes the last major piece of unnecessary overhead; and
consequently results in very large gains mostly for medium-sized
transfers, where the speedup exceeds 10x on my system. Larger transfers
up to a few MiB are still affected quite significantly, improving by
about 45% at 2 MiB and some 13% at 4 MiB.

All in all, transfers of all sizes are now several times faster than on
the first measured version, before applying optimisations -- with the
lowest speedup on my system at about 4x for small transfers; about 4.5x
for very large ones; and the largest boost of >11x for medium-sized
ones.

A quick check on a more modern system (64 bit; fairly recent Intel CPU)
showed even larger gains: while very big transfers were similar (about
5x speedup), small ones gained >9x, and medium-sized ones >20x.

One interesting observation is that on the modern system, the optimised
version shows even more strongly pronounced jumps at specific sizes.
(Especially 256 KiB and at 1 MiB.) While I haven't verified whether
these are more like spikes or more like steps, I suspect it's the
latter: with more streamlined memory access patterns in the optimised
version, it becomes pretty obvious that these jumps are simply
successive levels of the cache hierarchy being exhausted...

Below are the numbers of a typical run on my old system, as usual.

(Note that the numbers shown for medium-large transfers of 256 KiB and
above are now pretty much entirely useless, as on this system the thread
launching overhead is larger than the actual benchmark time... On the
newer system on the other hand launching the extra thread doesn't seem
to have a strongly pronounced effect: the 256 KiB data point shows an
equally strong slowdown with one iteration as with 100...)

test size_00_1    ... bench:       3,050 ns/iter (+/- 50)
test size_01_2    ... bench:       3,018 ns/iter (+/- 59)
test size_02_4    ... bench:       3,098 ns/iter (+/- 55)
test size_03_8    ... bench:       3,112 ns/iter (+/- 35)
test size_04_16   ... bench:       3,104 ns/iter (+/- 54)
test size_05_32   ... bench:       3,094 ns/iter (+/- 80)
test size_06_64   ... bench:       3,060 ns/iter (+/- 49)
test size_07_128  ... bench:       3,101 ns/iter (+/- 30)
test size_08_256  ... bench:       3,142 ns/iter (+/- 50)
test size_09_512  ... bench:       3,171 ns/iter (+/- 49)
test size_10_1k   ... bench:       3,322 ns/iter (+/- 58)
test size_11_2k   ... bench:       4,755 ns/iter (+/- 69)
test size_12_4k   ... bench:       6,277 ns/iter (+/- 77)
test size_13_8k   ... bench:      10,070 ns/iter (+/- 96)
test size_14_16k  ... bench:       9,735 ns/iter (+/- 99)
test size_15_32k  ... bench:      14,330 ns/iter (+/- 114)
test size_16_64k  ... bench:      22,990 ns/iter (+/- 857)
test size_17_128k ... bench:      44,556 ns/iter (+/- 1,693)
test size_18_256k ... bench:     222,732 ns/iter (+/- 47,581)
test size_19_512k ... bench:     447,259 ns/iter (+/- 178,507)
test size_20_1m   ... bench:   1,369,545 ns/iter (+/- 239,571)
test size_21_2m   ... bench:   1,737,641 ns/iter (+/- 515,468)
test size_22_4m   ... bench:   4,923,732 ns/iter (+/- 1,204,576)
test size_23_8m   ... bench:   9,373,281 ns/iter (+/- 1,282,587)

test size_00_1    ... bench:     284,262 ns/iter (+/- 5,333)
test size_01_2    ... bench:     287,241 ns/iter (+/- 5,105)
test size_02_4    ... bench:     291,753 ns/iter (+/- 3,840)
test size_03_8    ... bench:     294,802 ns/iter (+/- 8,149)
test size_04_16   ... bench:     291,475 ns/iter (+/- 4,166)
test size_05_32   ... bench:     292,328 ns/iter (+/- 4,728)
test size_06_64   ... bench:     289,378 ns/iter (+/- 5,618)
test size_07_128  ... bench:     293,067 ns/iter (+/- 5,312)
test size_08_256  ... bench:     308,579 ns/iter (+/- 4,699)
test size_09_512  ... bench:     301,040 ns/iter (+/- 5,636)
test size_10_1k   ... bench:     312,662 ns/iter (+/- 10,609)
test size_11_2k   ... bench:     446,448 ns/iter (+/- 5,646)
test size_12_4k   ... bench:     607,197 ns/iter (+/- 7,551)
test size_13_8k   ... bench:     997,677 ns/iter (+/- 10,513)
test size_14_16k  ... bench:     956,131 ns/iter (+/- 7,722)
test size_15_32k  ... bench:   1,437,060 ns/iter (+/- 6,730)
test size_16_64k  ... bench:   2,269,055 ns/iter (+/- 23,130)
test size_17_128k ... bench:   4,413,551 ns/iter (+/- 19,655)
test size_18_256k ... bench:   8,628,218 ns/iter (+/- 2,455,944)
test size_19_512k ... bench:  19,452,160 ns/iter (+/- 3,474,999)
test size_20_1m   ... bench: 103,957,847 ns/iter (+/- 8,015,338)
test size_21_2m   ... bench: 134,416,502 ns/iter (+/- 15,466,457)
test size_22_4m   ... bench: 446,838,335 ns/iter (+/- 31,417,044)
test size_23_8m   ... bench: 895,111,420 ns/iter (+/- 16,740,773)
@antrik antrik force-pushed the optimise-buffers branch from 24c9b4f to aea1630 Compare May 4, 2016 19:22
@antrik
Contributor Author

antrik commented May 4, 2016

Rebased; fixed typos; replaced macros by helper functions + boilerplate. That should cover everything I hope?...

(Let's see whether CI succeeds on non-Linux, now that the Serde issues are sorted out...)

antrik added 8 commits May 4, 2016 21:34
The buffer size shouldn't change from one channel to another; so instead
of using a syscall to check the size each time, just check it once and
store it using a `lazy_static`.

This also means `get_max_fragment_size()` becomes a static method now,
as it no longer fetches the value for a specific channel, but rather
just refers to the stored value.
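A minimal C sketch of the caching described above (the actual Rust code uses a `lazy_static`; this sketch uses a plain static and is not thread-safe, which the lazy_static handles in the real code -- the function name is illustrative): query SO_SNDBUF once on a throwaway socketpair and reuse the value, instead of paying a getsockopt() syscall on every operation.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

static int cached_size; /* 0 = not yet initialised */

/* Return the maximum fragment size, querying the kernel only once.
 * NOTE: not thread-safe as written -- a sketch, not production code. */
int max_fragment_size(void) {
    if (cached_size == 0) {
        int sv[2];
        socklen_t len = sizeof cached_size;
        if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) == 0) {
            getsockopt(sv[0], SOL_SOCKET, SO_SNDBUF, &cached_size, &len);
            close(sv[0]);
            close(sv[1]);
        }
    }
    return cached_size;
}
```

Since the value is per-system rather than per-channel, the same cached number can size both the send fragments and the receive buffer.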

Note that we only check the send size now, and use it for the size of
the receive buffer too. This is indeed more correct than the previous
implementation, as the receive buffer needs to hold exactly as much as
we might send at most. (Normally they are the same anyway; but if for
some reason the maximum receive size happened to be larger, the previous
code would use a larger buffer than necessary. If the receive size
happened to be *smaller*, either version would fail horribly...)

This doesn't have a noticeable performance impact on the sender, as the
present implementation only checks the size *after* failing to send the
whole message in one packet, i.e. only for large transfers, where the
cost of the extra system call is insignificant. The receiver side on the
other hand always does the check -- and thus the saved call actually
yields a significant improvement for small messages: on my system, small
transfers (send + receive) gain more than 20% performance. Along with
the other improvements, they are now almost five times faster than the
original implementation.

test size_00_1    ... bench:       2,289 ns/iter (+/- 38)
test size_01_2    ... bench:       2,346 ns/iter (+/- 22)
test size_02_4    ... bench:       2,357 ns/iter (+/- 38)
test size_03_8    ... bench:       2,374 ns/iter (+/- 42)
test size_04_16   ... bench:       2,471 ns/iter (+/- 40)
test size_05_32   ... bench:       2,371 ns/iter (+/- 45)
test size_06_64   ... bench:       2,422 ns/iter (+/- 44)
test size_07_128  ... bench:       2,385 ns/iter (+/- 30)
test size_08_256  ... bench:       2,406 ns/iter (+/- 28)
test size_09_512  ... bench:       2,499 ns/iter (+/- 56)
test size_10_1k   ... bench:       2,727 ns/iter (+/- 88)
test size_11_2k   ... bench:       3,924 ns/iter (+/- 47)
test size_12_4k   ... bench:       5,555 ns/iter (+/- 60)
test size_13_8k   ... bench:       9,455 ns/iter (+/- 107)
test size_14_16k  ... bench:       8,999 ns/iter (+/- 90)
test size_15_32k  ... bench:      13,647 ns/iter (+/- 105)
test size_16_64k  ... bench:      22,213 ns/iter (+/- 489)
test size_17_128k ... bench:      43,666 ns/iter (+/- 17,217)
test size_18_256k ... bench:     221,851 ns/iter (+/- 69,636)
test size_19_512k ... bench:     451,801 ns/iter (+/- 113,742)
test size_20_1m   ... bench:   1,330,491 ns/iter (+/- 182,352)
test size_21_2m   ... bench:   1,790,956 ns/iter (+/- 489,327)
test size_22_4m   ... bench:   4,989,840 ns/iter (+/- 1,188,114)
test size_23_8m   ... bench:   9,349,559 ns/iter (+/- 1,334,978)

test size_00_1    ... bench:     231,706 ns/iter (+/- 4,892)
test size_01_2    ... bench:     235,017 ns/iter (+/- 6,437)
test size_02_4    ... bench:     240,197 ns/iter (+/- 4,068)
test size_03_8    ... bench:     244,404 ns/iter (+/- 6,090)
test size_04_16   ... bench:     239,248 ns/iter (+/- 4,041)
test size_05_32   ... bench:     243,360 ns/iter (+/- 5,237)
test size_06_64   ... bench:     236,956 ns/iter (+/- 5,098)
test size_07_128  ... bench:     243,579 ns/iter (+/- 7,305)
test size_08_256  ... bench:     247,605 ns/iter (+/- 5,047)
test size_09_512  ... bench:     276,882 ns/iter (+/- 6,950)
test size_10_1k   ... bench:     261,665 ns/iter (+/- 4,985)
test size_11_2k   ... bench:     395,244 ns/iter (+/- 5,495)
test size_12_4k   ... bench:     558,647 ns/iter (+/- 8,908)
test size_13_8k   ... bench:     941,395 ns/iter (+/- 7,215)
test size_14_16k  ... bench:     907,290 ns/iter (+/- 9,087)
test size_15_32k  ... bench:   1,360,839 ns/iter (+/- 9,137)
test size_16_64k  ... bench:   2,224,395 ns/iter (+/- 362,003)
test size_17_128k ... bench:   4,351,960 ns/iter (+/- 1,726,184)
test size_18_256k ... bench:   8,627,702 ns/iter (+/- 2,335,525)
test size_19_512k ... bench:  19,018,116 ns/iter (+/- 2,757,467)
test size_20_1m   ... bench: 102,819,410 ns/iter (+/- 7,050,372)
test size_21_2m   ... bench: 133,774,605 ns/iter (+/- 14,188,872)
test size_22_4m   ... bench: 450,259,095 ns/iter (+/- 12,859,207)
test size_23_8m   ... bench: 875,984,486 ns/iter (+/- 21,168,557)
Don't send a control message if we have no actual auxiliary data
(channels / shared memory regions) to transfer.
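The optimisation can be sketched in C as follows (illustrative, not the actual ipc-channel code; this sketch caps the descriptor count at 16): the SCM_RIGHTS control message is only built when there are descriptors to pass, so in the common no-descriptor case msg_control stays NULL and no ancillary data is processed by the kernel.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send a data buffer, attaching an SCM_RIGHTS control message only
 * when there are file descriptors to pass (nfds <= 16 in this sketch). */
ssize_t send_with_fds(int sock, const void *data, size_t len,
                      const int *fds, size_t nfds) {
    struct iovec iov = { (void *)data, len };
    struct msghdr msg = {0};
    char cbuf[CMSG_SPACE(16 * sizeof(int))];
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    if (nfds > 0) { /* only pay for ancillary data when actually needed */
        memset(cbuf, 0, sizeof cbuf);
        msg.msg_control = cbuf;
        msg.msg_controllen = CMSG_SPACE(nfds * sizeof(int));
        struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
        c->cmsg_level = SOL_SOCKET;
        c->cmsg_type = SCM_RIGHTS;
        c->cmsg_len = CMSG_LEN(nfds * sizeof(int));
        memcpy(CMSG_DATA(c), fds, nfds * sizeof(int));
    }
    return sendmsg(sock, &msg, 0);
}
```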

This shaves off another 6 or 7 per cent from small transfers (without
FDs) on my system.

test size_00_1    ... bench:       2,154 ns/iter (+/- 33)
test size_01_2    ... bench:       2,203 ns/iter (+/- 42)
test size_02_4    ... bench:       2,231 ns/iter (+/- 51)
test size_03_8    ... bench:       2,231 ns/iter (+/- 28)
test size_04_16   ... bench:       2,290 ns/iter (+/- 47)
test size_05_32   ... bench:       2,261 ns/iter (+/- 57)
test size_06_64   ... bench:       2,316 ns/iter (+/- 51)
test size_07_128  ... bench:       2,247 ns/iter (+/- 38)
test size_08_256  ... bench:       2,266 ns/iter (+/- 45)
test size_09_512  ... bench:       2,572 ns/iter (+/- 52)
test size_10_1k   ... bench:       2,488 ns/iter (+/- 45)
test size_11_2k   ... bench:       3,791 ns/iter (+/- 51)
test size_12_4k   ... bench:       5,365 ns/iter (+/- 54)
test size_13_8k   ... bench:       9,235 ns/iter (+/- 84)
test size_14_16k  ... bench:       8,833 ns/iter (+/- 102)
test size_15_32k  ... bench:      13,370 ns/iter (+/- 117)
test size_16_64k  ... bench:      22,004 ns/iter (+/- 717)
test size_17_128k ... bench:      43,369 ns/iter (+/- 976)
test size_18_256k ... bench:     224,096 ns/iter (+/- 75,219)
test size_19_512k ... bench:     458,353 ns/iter (+/- 149,531)
test size_20_1m   ... bench:   1,357,956 ns/iter (+/- 187,198)
test size_21_2m   ... bench:   1,781,991 ns/iter (+/- 512,027)
test size_22_4m   ... bench:   4,940,065 ns/iter (+/- 1,099,861)
test size_23_8m   ... bench:   9,345,216 ns/iter (+/- 1,557,181)

test size_00_1    ... bench:     222,064 ns/iter (+/- 8,292)
test size_01_2    ... bench:     224,589 ns/iter (+/- 4,033)
test size_02_4    ... bench:     226,667 ns/iter (+/- 4,774)
test size_03_8    ... bench:     229,002 ns/iter (+/- 5,107)
test size_04_16   ... bench:     224,895 ns/iter (+/- 3,323)
test size_05_32   ... bench:     230,973 ns/iter (+/- 3,265)
test size_06_64   ... bench:     224,377 ns/iter (+/- 5,778)
test size_07_128  ... bench:     229,364 ns/iter (+/- 8,282)
test size_08_256  ... bench:     235,654 ns/iter (+/- 3,860)
test size_09_512  ... bench:     235,874 ns/iter (+/- 6,021)
test size_10_1k   ... bench:     246,200 ns/iter (+/- 2,626)
test size_11_2k   ... bench:     386,233 ns/iter (+/- 5,313)
test size_12_4k   ... bench:     542,364 ns/iter (+/- 8,671)
test size_13_8k   ... bench:     929,892 ns/iter (+/- 12,989)
test size_14_16k  ... bench:     868,966 ns/iter (+/- 7,768)
test size_15_32k  ... bench:   1,342,400 ns/iter (+/- 6,781)
test size_16_64k  ... bench:   2,202,955 ns/iter (+/- 86,042)
test size_17_128k ... bench:   4,323,643 ns/iter (+/- 41,162)
test size_18_256k ... bench:   8,708,255 ns/iter (+/- 2,162,868)
test size_19_512k ... bench:  19,281,153 ns/iter (+/- 5,449,494)
test size_20_1m   ... bench: 102,875,141 ns/iter (+/- 10,584,288)
test size_21_2m   ... bench: 134,820,099 ns/iter (+/- 13,015,545)
test size_22_4m   ... bench: 446,291,502 ns/iter (+/- 12,073,997)
test size_23_8m   ... bench: 876,656,087 ns/iter (+/- 16,439,303)
On a 32 bit system, the size header is only 4 bytes; so if we fully use
the rest of the available buffer for payload data, the size of the
latter won't be a multiple of 8 bytes -- and consequently, every second
fragment is read from the source data buffer with poor alignment.

Fix this by always aligning the payload data size sent per fragment to
8 byte boundaries.

(On 64 bit systems, this is a no-op. According to my testing, aligning
to more than 8 byte boundaries doesn't benefit either 32 or 64 bit
systems.)
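The fix above is a single rounding step (function name illustrative): take the buffer space left after the size header and round it down to a multiple of 8, so every fragment reads the source buffer at an aligned offset.

```c
#include <stddef.h>

/* Per-fragment payload size: whatever fits after the header,
 * rounded down to an 8-byte boundary. */
size_t aligned_payload_size(size_t buf_size, size_t header_size) {
    return (buf_size - header_size) & ~(size_t)7;
}
```

With a 4-byte header (32-bit systems) a 4096-byte buffer yields 4088 payload bytes instead of the misaligning 4092; with an 8-byte header (64-bit systems) the result is already a multiple of 8, making the rounding a no-op.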

On my 32 bit x86 system, this produces quite a sizeable performance
improvement for large transfers: peaking at 20% or so around 640 KiB,
and staying above 10% for most of the range from 320 KiB to 2 MiB.

test size_00_1    ... bench:       2,250 ns/iter (+/- 42)
test size_01_2    ... bench:       2,279 ns/iter (+/- 46)
test size_02_4    ... bench:       2,335 ns/iter (+/- 53)
test size_03_8    ... bench:       2,351 ns/iter (+/- 67)
test size_04_16   ... bench:       2,357 ns/iter (+/- 46)
test size_05_32   ... bench:       2,327 ns/iter (+/- 64)
test size_06_64   ... bench:       2,308 ns/iter (+/- 39)
test size_07_128  ... bench:       2,277 ns/iter (+/- 43)
test size_08_256  ... bench:       2,453 ns/iter (+/- 35)
test size_09_512  ... bench:       2,402 ns/iter (+/- 57)
test size_10_1k   ... bench:       2,576 ns/iter (+/- 81)
test size_11_2k   ... bench:       3,883 ns/iter (+/- 66)
test size_12_4k   ... bench:       5,492 ns/iter (+/- 75)
test size_13_8k   ... bench:       9,368 ns/iter (+/- 83)
test size_14_16k  ... bench:       8,857 ns/iter (+/- 107)
test size_15_32k  ... bench:      14,132 ns/iter (+/- 140)
test size_16_64k  ... bench:      22,049 ns/iter (+/- 298)
test size_17_128k ... bench:      43,577 ns/iter (+/- 1,947)
test size_18_256k ... bench:     203,597 ns/iter (+/- 34,495)
test size_19_512k ... bench:     407,050 ns/iter (+/- 246,606)
test size_20_1m   ... bench:   1,334,506 ns/iter (+/- 186,923)
test size_21_2m   ... bench:   1,630,275 ns/iter (+/- 481,481)
test size_22_4m   ... bench:   4,826,184 ns/iter (+/- 980,708)
test size_23_8m   ... bench:   9,390,020 ns/iter (+/- 1,655,050)

test size_00_1    ... bench:     223,092 ns/iter (+/- 3,807)
test size_01_2    ... bench:     223,918 ns/iter (+/- 3,535)
test size_02_4    ... bench:     223,102 ns/iter (+/- 4,907)
test size_03_8    ... bench:     230,394 ns/iter (+/- 4,700)
test size_04_16   ... bench:     224,395 ns/iter (+/- 4,482)
test size_05_32   ... bench:     231,436 ns/iter (+/- 4,214)
test size_06_64   ... bench:     225,216 ns/iter (+/- 3,584)
test size_07_128  ... bench:     228,905 ns/iter (+/- 4,260)
test size_08_256  ... bench:     233,108 ns/iter (+/- 2,998)
test size_09_512  ... bench:     236,013 ns/iter (+/- 4,803)
test size_10_1k   ... bench:     248,637 ns/iter (+/- 5,386)
test size_11_2k   ... bench:     383,750 ns/iter (+/- 6,134)
test size_12_4k   ... bench:     542,666 ns/iter (+/- 9,026)
test size_13_8k   ... bench:     934,623 ns/iter (+/- 9,200)
test size_14_16k  ... bench:     894,028 ns/iter (+/- 7,425)
test size_15_32k  ... bench:   1,350,185 ns/iter (+/- 6,316)
test size_16_64k  ... bench:   2,213,693 ns/iter (+/- 24,549)
test size_17_128k ... bench:   4,348,762 ns/iter (+/- 90,784)
test size_18_256k ... bench:   7,825,637 ns/iter (+/- 2,243,873)
test size_19_512k ... bench:  17,792,873 ns/iter (+/- 3,775,904)
test size_20_1m   ... bench:  99,769,568 ns/iter (+/- 8,760,588)
test size_21_2m   ... bench: 126,575,377 ns/iter (+/- 12,528,236)
test size_22_4m   ... bench: 440,368,689 ns/iter (+/- 16,627,208)
test size_23_8m   ... bench: 859,509,896 ns/iter (+/- 21,848,024)
Using `byteorder` has never actually been *necessary* (the serialised
data doesn't ever cross machine boundaries) -- it was only being
(ab-)used as a convenient way to write/read the header information
to/from the shared buffer. Now that the header gets separate
send/receive buffers, this isn't actually a simplification anymore -- on
the contrary: just using the header data's backing storage as the
send/receive buffer directly is indeed simpler now. (And not
significantly more unsafe either.)

The simpler code also improves performance of small transfers by another
two or three per cent.
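The idea reads naturally in C (the actual code is Rust; the struct layout here is illustrative): rather than serialising header fields into a separate byte buffer, point an iovec at the header struct's own storage. Endianness handling is irrelevant because the data never leaves the machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

struct fragment_header {
    uint64_t total_size; /* total payload size of the whole message */
};

/* Send the first fragment, using the header struct's backing storage
 * directly as the send buffer -- no separate serialisation step. */
ssize_t send_first_fragment(int sock, struct fragment_header *hdr,
                            const void *payload, size_t len) {
    struct iovec iov[2] = {
        { hdr, sizeof *hdr },    /* header storage used as-is */
        { (void *)payload, len },
    };
    struct msghdr msg = {0};
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;
    return sendmsg(sock, &msg, 0);
}
```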

test size_00_1    ... bench:       2,154 ns/iter (+/- 83)
test size_01_2    ... bench:       2,154 ns/iter (+/- 75)
test size_02_4    ... bench:       2,234 ns/iter (+/- 34)
test size_03_8    ... bench:       2,212 ns/iter (+/- 21)
test size_04_16   ... bench:       2,298 ns/iter (+/- 49)
test size_05_32   ... bench:       2,211 ns/iter (+/- 32)
test size_06_64   ... bench:       2,225 ns/iter (+/- 76)
test size_07_128  ... bench:       2,202 ns/iter (+/- 40)
test size_08_256  ... bench:       2,239 ns/iter (+/- 48)
test size_09_512  ... bench:       2,316 ns/iter (+/- 30)
test size_10_1k   ... bench:       2,446 ns/iter (+/- 29)
test size_11_2k   ... bench:       3,797 ns/iter (+/- 67)
test size_12_4k   ... bench:       5,381 ns/iter (+/- 62)
test size_13_8k   ... bench:       9,225 ns/iter (+/- 96)
test size_14_16k  ... bench:       8,739 ns/iter (+/- 60)
test size_15_32k  ... bench:      13,243 ns/iter (+/- 76)
test size_16_64k  ... bench:      21,879 ns/iter (+/- 141)
test size_17_128k ... bench:      43,193 ns/iter (+/- 398)
test size_18_256k ... bench:     205,695 ns/iter (+/- 52,004)
test size_19_512k ... bench:     409,146 ns/iter (+/- 68,651)
test size_20_1m   ... bench:   1,341,949 ns/iter (+/- 240,066)
test size_21_2m   ... bench:   1,662,774 ns/iter (+/- 527,172)
test size_22_4m   ... bench:   4,885,677 ns/iter (+/- 1,113,293)
test size_23_8m   ... bench:   9,300,784 ns/iter (+/- 1,806,111)

test size_00_1    ... bench:     211,985 ns/iter (+/- 3,389)
test size_01_2    ... bench:     210,848 ns/iter (+/- 5,040)
test size_02_4    ... bench:     218,757 ns/iter (+/- 3,885)
test size_03_8    ... bench:     218,317 ns/iter (+/- 5,671)
test size_04_16   ... bench:     219,027 ns/iter (+/- 5,073)
test size_05_32   ... bench:     218,795 ns/iter (+/- 4,695)
test size_06_64   ... bench:     218,217 ns/iter (+/- 3,875)
test size_07_128  ... bench:     224,165 ns/iter (+/- 4,367)
test size_08_256  ... bench:     227,112 ns/iter (+/- 3,922)
test size_09_512  ... bench:     225,733 ns/iter (+/- 3,931)
test size_10_1k   ... bench:     239,269 ns/iter (+/- 4,323)
test size_11_2k   ... bench:     371,675 ns/iter (+/- 6,760)
test size_12_4k   ... bench:     529,841 ns/iter (+/- 7,052)
test size_13_8k   ... bench:     910,285 ns/iter (+/- 7,308)
test size_14_16k  ... bench:     860,518 ns/iter (+/- 7,659)
test size_15_32k  ... bench:   1,331,114 ns/iter (+/- 5,774)
test size_16_64k  ... bench:   2,193,192 ns/iter (+/- 22,878)
test size_17_128k ... bench:   4,324,455 ns/iter (+/- 86,997)
test size_18_256k ... bench:   7,973,472 ns/iter (+/- 1,688,190)
test size_19_512k ... bench:  17,325,137 ns/iter (+/- 6,222,526)
test size_20_1m   ... bench: 100,037,281 ns/iter (+/- 8,976,164)
test size_21_2m   ... bench: 127,489,104 ns/iter (+/- 14,776,106)
test size_22_4m   ... bench: 438,418,131 ns/iter (+/- 13,543,391)
test size_23_8m   ... bench: 858,248,355 ns/iter (+/- 14,316,633)
Followup fragments don't have a header; so they can use a few bytes more
for payload.

While this is not likely ever to make a noticeable performance
difference, having exact calculations in each case seems cleaner,
hopefully avoiding potential confusion...
Only try sending the entire message in one packet if it will actually
fit.

This saves a syscall and some processing, but only for large
(fragmented) messages -- so it doesn't have a noticeable performance
impact. However, it should make behaviour clearer and more predictable;
and it is also required in order to enable further cleanups.
Now that we fully control the size of the first packet even in the
non-fragmented case, we can rely on this size on the receiver side as
well, rather than having to allocate a larger buffer just in case.
This requires some shuffling around of declarations, to facilitate
untangling the actually unsafe operations from those not affecting
safety. (As far as reasonably possible...)

Also added a new assertion to make sure that the trimmed `unsafe` blocks
really do not rely on any conditions being upheld outside.
@antrik antrik force-pushed the optimise-buffers branch from aea1630 to 7c2466e Compare May 4, 2016 19:34
@antrik
Contributor Author

antrik commented May 4, 2016

OK, let's try again...

@pcwalton
Contributor

pcwalton commented May 4, 2016

@bors-servo: r+

@bors-servo
Contributor

📌 Commit 7c2466e has been approved by pcwalton

@bors-servo
Contributor

⌛ Testing commit 7c2466e with merge ca96865...

bors-servo pushed a commit that referenced this pull request May 4, 2016
Optimisations (and cleanups) of Linux backend

This is a bunch of optimisations to the Linux platform code, along with various cleanups of the related code which the optimisation patches build upon.

Most of the cleanups are on the `send()` side, as the `recv()` side is less affected by the optimisation changes, and thus there has been less reason to refactor the code -- some extra cleanup work would probably be in order here.

The optimisations are mostly about avoiding unnecessary copies by using scatter-gather buffers for send and receive; as well as avoiding unnecessary initialisation of receive buffers.

The results are impressive: with gains of at least 5x for large transfers of several MiB (a bit more on a modern system); >5x (on an old system) up to >10x (on a modern one) for small transfers of up to a few KiB; and more than 10x for most of the range in between -- peaking at about 12x - 13x on the old system and 20x - 21x on the modern system for medium-sized transfers of about 64 KiB up to a few hundred KiB.

For another interesting data point, the CPU usage during benchmark runs (with many iterations, to amortise the setup time) was dominated by user time (more than two thirds of total time) with the original variant; whereas the optimised variant not only further reduces system time to less then half the original value (presumably because of fewer allocations?), but also almost entirely eliminates the user time, making it pretty insignificant in the total picture now -- as it should be.

On a less scientific note, Servo built with the optimised ipc-channel doesn't seem to show undue delays any more while rendering a language selection menu. (Which requires lots of fonts to be loaded, and thus triggers heavy ipc-channel activity.)
@bors-servo
Contributor

☀️ Test successful - travis

@bors-servo bors-servo merged commit 7c2466e into servo:master May 4, 2016
@bors-servo bors-servo mentioned this pull request May 4, 2016