Add netstacklat tool #125

Open · wants to merge 8 commits into main from netstacklat

Conversation

simosund (Contributor)

Initial implementation of the netstacklat tool, which is based on Jesper Brouer's bpftrace script for measuring latency up to certain points in the Linux network stack.

The eBPF portion of the program is designed to be compatible with ebpf_exporter. In theory one only has to drop the netstack.bpf.c file into the ebpf_exporter/examples directory and add the corresponding yaml config file (which I will probably add to this repository as well later), although I have not yet verified that it actually works as intended there.

In addition to finishing up existing commits, some points I wish to address before attempting to merge this request are:

  • Figure out how to use vmlinux.h properly in this setup (or whether it's better to get the definition of struct sk_buff some other way)
  • Add an option to the userspace program to enable software timestamping on select interface(s) to make it a bit easier to use? Currently you're unlikely to get any useful data out of this unless you enable software timestamping on the relevant interfaces externally (as skb->tstamp will just be 0).
  • Verify that it works with ebpf_exporter and add yaml config file.

@tohojo tohojo requested a review from netoptimizer January 13, 2025 12:23
@simosund simosund force-pushed the netstacklat branch 2 times, most recently from 36dd2e9 to 05b8542 Compare January 14, 2025 10:02
@simosund
Copy link
Contributor Author

So, I now believe I have a full "minimal working example" that quite closely replicates the bpftrace script and prints out its output in a similar format. There are still many missing features that would be nice to have, such as:

  • Parse some command-line arguments, letting the user configure e.g. the reporting interval (currently hard-coded to 10s)
  • Automatically enable software RX timestamping (on all interfaces or interfaces provided as command line arguments)
  • Automatically figure out the TAI offset at startup (currently hard-coded as 37 seconds)
  • Figure out clock basis for skb->tstamp (currently assumes CLOCK_REALTIME, but on at least one machine that did not seem to be the case), and possibly if it's valid (right now it assumes any non-zero timestamps are valid)

But I figure this should be sufficient for an early review at least.

One issue worth highlighting is that it's currently not fully compatible with ebpf-exporter in the sense that you can't just drop in the netstacklat.bpf.c file. First off, there are some shared defines and enums in netstacklat.h that netstacklat.bpf.c depends on, so unless you also copy over that header (which seems to go against the style of the existing examples at ebpf-exporter/examples) you need to first insert those into the netstacklat.bpf.c file. Furthermore, netstacklat.bpf.c does #include "vmlinux_local.h" as expected for this bpf-examples repository, but the ebpf-exporter examples seem to expect #include <vmlinux.h> instead. Not sure if there's any clever pre-processor magic or other tricks you could use to solve this.
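
One option might be a compile-time flag that selects the header, something along these lines (just a sketch of the idea; a BPF_EXAMPLES macro defined by this repository's Makefile and left undefined by ebpf-exporter's build):

#ifdef BPF_EXAMPLES
#include "vmlinux_local.h" /* local header used in bpf-examples */
#else
#include <vmlinux.h>       /* what the ebpf-exporter examples expect */
#endif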

For licensing, I've put my own files under GPL-2.0 as most other examples here. However, I've also included the bits.bpf.h header from bcc/libbpf-tools (the same version as ebpf-exporter uses), which then retains its dual LGPL/BSD license, and maps.bpf.h from ebpf-exporter, which I put under MIT as that seems to be what ebpf-exporter in general uses.

As I mentioned during the meeting, I intentionally opted to use separate BPF maps for each histogram rather than multiplex multiple histograms in the same map as most of the ebpf-exporter examples seem to do, e.g. here. By doing so I can simply use array maps instead of hashmaps and thereby reduce the overhead for map lookups a bit.
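
For illustration, the map layout looks roughly like this (a simplified sketch with hypothetical map names and bucket count; one array map per hook, indexed directly by the histogram bucket, as a fragment of a BPF C file with the usual vmlinux.h/bpf_helpers.h includes):

#define HIST_NBUCKETS 32 /* hypothetical bucket count */

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, HIST_NBUCKETS);
	__type(key, __u32);
	__type(value, __u64);
} tcp_v4_do_rcv_hist SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, HIST_NBUCKETS);
	__type(key, __u32);
	__type(value, __u64);
} tcp_data_queue_hist SEC(".maps");

/* ...one map per hook point, instead of a single hashmap keyed by (hook, bucket) */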

@simosund simosund marked this pull request as ready for review January 14, 2025 10:46
@simosund (Contributor Author)

Oh, and ping @netoptimizer if GitHub hasn't already done so (not intending to rush you, just want to make sure you're aware this PR is ready to be reviewed).

Add the tool netstacklat, which measures latency up to different parts
in the Linux ingress network stack. Base the initial implementation on
a bpftrace script from Jesper Dangaard Brouer, which requires the
kernel to timestamp ingress packets (i.e. set the tstamp member of the
skb). Hence, the latency recorded by the tool is the difference
between the kernel timestamping point and various later points in the
network stack.

In this initial commit, include the eBPF programs for recording the
network stack latency at the start of the kernel functions
tcp_v4_do_rcv(), tcp_data_queue(), and udp_queue_rcv_one_skb(). Use a
structure making it easy to extend the tool with additional hook
points in the future. Make the eBPF programs compatible with
Cloudflare's ebpf_exporter, and use the map helpers (maps.bpf.h) from
ebpf_exporter to ensure maps are used in a compatible way.

Open code the histogram maps for different hook points as entirely
separate maps, instead of encoding the different hook points in the
key of a single shared map as ebpf_exporter often does. This avoids
costly hashmap lookups, as simple array maps can be used instead of
hash maps.

Also include a minimal user space loader, which loads and attaches the
eBPF programs. Later commits will extend this loader to also report
the recorded latencies stored in the BPF maps.

Signed-off-by: Simon Sundberg <[email protected]>
Add functionality to the user space component to periodically fetch
the BPF maps in which netstacklat records its values and print them out.

Base the core program loop on the same setup as pping, where a single
epoll instance is used to support multiple different types of
events. So far it only deals with signal handling (for clean shutdown)
plus a timer (for periodical reporting), but the setup can easily be
extended if the program grows more complex in the future.

Use the (somewhat complicated) bpf_map_lookup_batch to fetch the
entire histogram maps in a single system call (instead of performing a
lookup for each bin index individually).

Signed-off-by: Simon Sundberg <[email protected]>
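
For reference, fetching a whole histogram array map in one bpf_map_lookup_batch() call looks roughly like this (a simplified sketch; the bucket count and error handling are assumptions rather than the loader's exact code, and it assumes libbpf 1.0 error conventions):

#include <bpf/bpf.h>
#include <errno.h>

#define HIST_NBUCKETS 30 /* assumed bucket count, for illustration */

static int fetch_histogram(int map_fd, __u64 *values)
{
	__u32 keys[HIST_NBUCKETS];
	__u32 out_batch;
	__u32 count = HIST_NBUCKETS;
	int err;

	err = bpf_map_lookup_batch(map_fd, NULL, &out_batch, keys, values,
				   &count, NULL);
	/* -ENOENT only signals that the whole map has been traversed */
	if (err && err != -ENOENT)
		return err;

	/* count now holds the number of buckets actually fetched */
	return 0;
}
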
Add a README briefly explaining what this example does.

Signed-off-by: Simon Sundberg <[email protected]>
Add support for parsing arguments from the command line. Add the
--help/-h and --report-interval/-r <n-seconds> options. As part of
this, add some minor documentation about the program and the available
options that are displayed by the --help option.

Base the print_usage() function on the usage() function from the
traffic-pacing-edt example, but change its output format to be more
consistent with most GNU utilities, listing the short option (if any)
before the long option. Furthermore, also show if an argument is
expected based on the has_arg value.

Automatically generate the optstring that getopt_long() expects from
the options array. While needlessly complicated for the two options
currently supported, it simplifies adding additional options in the
future.

To allow for mapping the opt value returned by getopt_long() back to the
corresponding option struct (for example to access the full name of
the option), add the optval_to_longopt() helper that linearly searches
through the options array. This is more reliable than using the
longindex pointer from getopt_long(), as that pointer is only set when
a long option is passed, and not when the corresponding short option
is passed.

Signed-off-by: Simon Sundberg <[email protected]>
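
As an illustration of the approach (a sketch with assumed names, not the exact code from the commit; optional_argument is not handled here):

#include <getopt.h>

/* Build the optstring getopt_long() expects from the long-options array */
static void generate_optstring(char *buf, const struct option *opts)
{
	int i, pos = 0;

	for (i = 0; opts[i].name; i++) {
		if (opts[i].flag || opts[i].val > 0xff)
			continue; /* no corresponding short option */
		buf[pos++] = opts[i].val;
		if (opts[i].has_arg == required_argument)
			buf[pos++] = ':';
	}
	buf[pos] = '\0';
}

/* Map the value returned by getopt_long() back to its struct option */
static const struct option *optval_to_longopt(int val, const struct option *opts)
{
	int i;

	for (i = 0; opts[i].name; i++) {
		if (opts[i].val == val)
			return &opts[i];
	}
	return NULL;
}
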
@simosund (Contributor Author)

While I've done a couple of force pushes, there's been no major change to the initial 3 commits (which was what I originally requested the review for). I've just done some minor cleanup there (renamed an enum value from NETSTACK_HOOK_TCP_V4_DO_RCV to NETSTACKLAT_HOOK_TCP_V4_DO_RCV to be consistent with the others, and prefixed the commit messages with netstacklat:).

I have now also added 3 additional commits implementing most of the features I mentioned were missing in my previous comment. They are all relatively self-contained, so I can move them out into separate PR(s) and/or reorder them if requested.
There are, however, some aspects of those later commits that I would also like to get some feedback on, so I've included them here for the moment (as it seems weird starting up a PR for additions to another PR that hasn't been merged yet).

For the commit about enabling the RX software timestamps, it relies on the user space component to do that, so it won't really work when just using the netstacklat.bpf.c file together with ebpf-exporter. While @tohojo pointed out that it seems like they're perhaps working on some way of allowing you to enable SO_TIMESTAMPING from eBPF programs (a bit unclear if that's what it actually does though, and only seems to be for the TX path for now), I don't think we can rely on that as it will be quite some time before that makes it into most shipped kernels.

An alternative approach would be to add an eBPF program at some early part of the network stack that checks if the skb->tstamp field is set, and if not sets it itself. Although I guess not all hooks are necessarily allowed to write to that field, and I'm not sure if there may be any consequences higher up the stack for setting the tstamp field when it's not expected to be set.
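
A very rough sketch of that idea on a tc ingress hook (whether writing the field is allowed and sensible at any given hook, and which clock basis to use, are exactly the open questions; bpf_ktime_get_tai_ns() also requires a fairly recent kernel):

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int set_missing_tstamp(struct __sk_buff *skb)
{
	if (!skb->tstamp)
		/* CLOCK_TAI rather than CLOCK_REALTIME, so the TAI offset
		 * handling discussed below would still apply */
		skb->tstamp = bpf_ktime_get_tai_ns();

	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";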

For the commit about initializing the TAI-offset, it has a similar limitation in that it relies on the user space program, so won't work for ebpf-exporter. As an alternative, it may be possible to get the TAI offset from an eBPF program instead. E.g. the following bpftrace script can get it:

sudo bpftrace -e '
// On kernels >= 6.13, remove this struct definition (no longer an anonymous struct)
struct tk_data {
	seqcount_raw_spinlock_t	seq;
	struct timekeeper	timekeeper;
};

BEGIN
{
	printf("Getting clock offsets from core...\n");

	$tk = (struct tk_data *)kaddr("tk_core");
	printf("MONOTONIC -> REALTIME (offs_real): %ld\n",
	       $tk->timekeeper.offs_real);
	printf("MONOTONIC -> TAI (offs_tai): %ld\n",
	       $tk->timekeeper.offs_tai);
	printf("REALTIME -> TAI (tai_offset): %ld\n",
	       $tk->timekeeper.tai_offset);

	exit();
}'

However, I'm not really sure how to translate that to C/libbpf in a good way. First, it needs to be attached to some suitable hook that ideally only runs once before all the other hooks trigger, so that we don't have to probe that part of kernel memory every time we need to use the TAI offset. And then I'm not really sure if there's anything corresponding to the bpftrace kaddr() function you can use in C/libbpf. From a brief look it seems to look up the /proc/kallsyms "file" and get the kernel address from there.

Furthermore, the TAI offset is currently only set at startup, so if the program is running when a leap second occurs, the TAI offset will be out of sync, and we will get massively inflated latencies. We could periodically update it, although that still runs the risk of it being out of sync in-between updates. We could also try to hook into whatever part of the kernel that actually updates that offset, so that we change it as soon as the kernel updates it. But in order to update it, it can obviously no longer be a constant, so it will potentially add a small amount of overhead when using the TAI offset.
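
For completeness, reading the current offset from user space is straightforward; something along these lines (one possible way, not necessarily what the commit below ends up doing):

#include <sys/timex.h>

/* Read the kernel's REALTIME -> TAI offset (whole seconds); returns 0 if
 * the system has no offset configured, negative on error. */
static int tai_offset_seconds(void)
{
	struct timex tx = { .modes = 0 }; /* modes = 0: read-only query */

	if (adjtimex(&tx) < 0)
		return -1;

	return tx.tai;
}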

The netstacklat tool depends on the tstamp member of the skbs being
set by the kernel. So make the user space loader enable software RX
timestamping on all packets when the tool is started.

Note that, as the documentation at
https://docs.kernel.org/networking/timestamping.html points out,
enabling software receive timestamping on a socket enables it for all
packets, as the destination socket is not known early in the network
stack. It is therefore sufficient to enable it on a socket created by
the program itself, even if no packets are received on that socket.

Signed-off-by: Simon Sundberg <[email protected]>
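
A minimal sketch of what enabling this can look like (a plain setsockopt() on a throwaway socket; the loader's actual code may differ):

#include <sys/socket.h>
#include <linux/net_tstamp.h>

/* Request software RX timestamps; the setting stays active for as long as
 * the socket is kept open, and timestamping is generated for all packets. */
static int enable_sw_rx_timestamping(void)
{
	int val = SOF_TIMESTAMPING_RX_SOFTWARE | SOF_TIMESTAMPING_SOFTWARE;
	int fd;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)) < 0)
		return -1;

	return fd; /* keep this fd open for the lifetime of the program */
}
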
Make the user space loader calculate the TAI-offset at startup and set
it as a constant for the eBPF programs. Split the open and loading
stages of the eBPF programs apart to enable setting constants in the
eBPF programs.

Note that on some systems (e.g. most Debian systems by default), the
TAI offset may (incorrectly) be 0, so that CLOCK_TAI becomes identical
to CLOCK_REALTIME. While this is incorrect in principle, it does not pose
an issue for netstacklat, as it only needs the TAI offset to translate
CLOCK_TAI to CLOCK_REALTIME (which skb->tstamp is assumed to use as
clock basis). Therefore, netstacklat will (from this commit) work
correctly even if the TAI offset is not correctly set on the system.

Limitation: The TAI offset is only set once the program is first
loaded, and is not dynamically updated in case the TAI offset
changes. So if the program is running while a leap second occurs, the
recorded latencies may be off by one second. Furthermore, as the TAI
offset is set from user space, it will not work when just using the
eBPF portion together with ebpf-exporter.

Signed-off-by: Simon Sundberg <[email protected]>
@simosund (Contributor Author)

When it comes to output format, I have for now approximately replicated the output format from @netoptimizer's bpftrace script. So it may for example look like this:

Mon Jan 27 20:02:01 2025
tcp_v4_do_rcv:
[1.02us, 2.05us)       1054 |                                                |
[2.05us, 4.1us)      181795 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4.1us, 8.19us)      181529 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8.19us, 16.4us)      68458 |@@@@@@@@@@@@@@@@@@                              |
[16.4us, 32.8us)       1187 |                                                |
[32.8us, 65.5us)        990 |                                                |
[65.5us, 131us)         453 |                                                |
[131us, 262us)            8 |                                                |
Sum: 2373204797

tcp_data_queue:
[2.05us, 4.1us)      169968 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |
[4.1us, 8.19us)      179638 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8.19us, 16.4us)      51028 |@@@@@@@@@@@@@@                                  |
[16.4us, 32.8us)       1176 |                                                |
[32.8us, 65.5us)        900 |                                                |
[65.5us, 131us)         433 |                                                |
[131us, 262us)            8 |                                                |
Sum: 2320708288

udp_queue_rcv_one_skb:
[1.02us, 2.05us)          8 |@@@@@@@@@@@@@@@@@                               |
[2.05us, 4.1us)          23 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4.1us, 8.19us)          12 |@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[8.19us, 16.4us)         14 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   |
[16.4us, 32.8us)          7 |@@@@@@@@@@@@@@@                                 |
[32.8us, 65.5us)          2 |@@@@                                            |
Sum: 571657

(For reference, the corresponding output from the bpftrace script is:

20:02:01
@cnt[kfunc:vmlinux:udp_queue_rcv_one_skb]: 66
@cnt[kfunc:vmlinux:tcp_data_queue]: 404016
@cnt[kfunc:vmlinux:tcp_v4_do_rcv]: 436508
@total[kfunc:vmlinux:udp_queue_rcv_one_skb]: 66
@total[kfunc:vmlinux:tcp_data_queue]: 404022
@total[kfunc:vmlinux:tcp_v4_do_rcv]: 436514
@stats[kfunc:vmlinux:tcp_v4_do_rcv]: count 436515, average 5148, total 2247484837
@stats[kfunc:vmlinux:tcp_data_queue]: count 404024, average 5489, total 2217754970
@stats[kfunc:vmlinux:udp_queue_rcv_one_skb]: count 66, average 6507, total 429464

@hist_ns[kfunc:vmlinux:udp_queue_rcv_one_skb]: 
[1K, 2K)              10 |@@@@@@@@@@@@@@@@@@@@                                |
[2K, 4K)              26 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)              11 |@@@@@@@@@@@@@@@@@@@@@@                              |
[8K, 16K)             13 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[16K, 32K)             5 |@@@@@@@@@@                                          |
[32K, 64K)             1 |@@                                                  |

@hist_ns[kfunc:vmlinux:tcp_data_queue]: 
[1K, 2K)              25 |                                                    |
[2K, 4K)          175047 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[4K, 8K)          177977 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)          48539 |@@@@@@@@@@@@@@                                      |
[16K, 32K)          1135 |                                                    |
[32K, 64K)           880 |                                                    |
[64K, 128K)          420 |                                                    |
[128K, 256K)           7 |                                                    |

@hist_ns[kfunc:vmlinux:tcp_v4_do_rcv]: 
[256, 512)             1 |                                                    |
[512, 1K)              4 |                                                    |
[1K, 2K)           28061 |@@@@@@@                                             |
[2K, 4K)          158756 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
[4K, 8K)          185669 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)          61565 |@@@@@@@@@@@@@@@@@                                   |
[16K, 32K)          1072 |                                                    |
[32K, 64K)           946 |                                                    |
[64K, 128K)          442 |                                                    |
[128K, 256K)           8 |                                                    |

)

So it "visualizes" the data in a format that makes it fairly easy to glance at, but not really ideal if you want to parse it and feed it into some other system. But if you want to feed the data into some other system it's perhaps better to use it together with ebpf-exporter in the first place.

@simosund (Contributor Author) commented Jan 27, 2025

I have also verified that it "works" together with ebpf-exporter (with some minor changes to the includes). However, as this issue for ebpf-exporter points out, ebpf-exporter sort of has an off-by-one error in its log2 histogram implementation, so when used with Prometheus you will generally see around half of the actual latency (and this issue affects all of the log2 histogram examples currently included in ebpf-exporter).

Note that the output from netstacklat when used here in bpf-examples is still correct as I print the correct bin labels for the histogram (acknowledging that the way the bin-index is calculated the bins are left-inclusive). But when used with ebpf-exporter the bins will be incorrectly labeled (to pretend to match Prometheus right-inclusive bin format), and therefore give misleading results.

To clarify the issue a bit, I provide the following table:

Bin index   Bin content   Effective label   (Actual Prometheus label)
0           [0, 2)        [0, 1]            le1
1           [2, 4)        (1, 2]            le2
2           [4, 8)        (2, 4]            le4
3           [8, 16)       (4, 8]            le8
4           [16, 32)      (8, 16]           le16
...         ...           ...               ...

Where the bin index is the index in the BPF array map for the bin (and the index that will essentially be calculated by the histogram helpers in maps.bpf.h), the bin content is the range of values that will be mapped to that bin (and the label I provide in my own output), and the effective label is basically what ebpf-exporter will report as stored in that bucket (with the corresponding Prometheus label, where le = less or equal).
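
To make the left-inclusive behaviour concrete, the bucket index is essentially computed as floor(log2(value)), along these lines (an illustrative reimplementation, not the exact code from maps.bpf.h):

/* Bucket i then holds values in [2^i, 2^(i+1)), so labelling bucket i as
 * "le 2^i" (as ebpf-exporter does) undercounts by roughly a factor of two. */
static unsigned int log2_bucket(unsigned long long value)
{
	if (value < 2)
		return 0;

	return 63 - __builtin_clzll(value);
}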

I'm not really sure what the best way to address this problem is. I could use the updated bin calculation from the PR that intends to fix the issue in ebpf-exporter. However, until that fix is merged in ebpf-exporter itself that won't really change anything (as the function for calculating the bin-index will be based on maps.bpf.h in the current repo).

Add a YAML configuration that ebpf-exporter can use to convert the BPF
maps to Prometheus metrics. Convert the histogram map for each hook
into the corresponding Prometheus histogram.

Signed-off-by: Simon Sundberg <[email protected]>
Add a BPF_EXAMPLES macro flag that lets netstack.bpf.c know if it's
being compiled in the bpf-examples repository or ebpf-exporter, and
allows it to adjust the included headers accordingly. This allows
copying the netstack.bpf.c, netstack.h and netstack.yaml files
directly into ebpf_exporter/examples without any modifications.

Signed-off-by: Simon Sundberg <[email protected]>
@simosund (Contributor Author)

I've now added the ebpf-exporter YAML config file.

Furthermore, in the "Make build process compatible with ebpf-exporter" commit I have provided a suggestion for how to make netstacklat.bpf.c compile correctly both here in the bpf-examples repository as well as when copied into the ebpf_exporter/examples directory. You will still have to also copy over the netstacklat.h file, which breaks the style compared to the other examples in ebpf_exporter/examples a bit, but it works without having to actually modify any file.

@tohojo (Member) left a comment


I think generally this looks sane, with the main exception being the macros from ebpf_exporter. Those we should definitely not import!

@@ -0,0 +1,84 @@
/* SPDX-License-Identifier: MIT */
/* From https://github.com/cloudflare/ebpf_exporter/blob/99d2752f9e0a095b57f53e5da6856ad143b0e443/examples/maps.bpf.h */
Member

Okay, so a bunch of macros that don't properly delimit their contents, even going so far as having later macros directly refer to the expanded values of earlier ones? Yikes, no thanks! Please get rid of these and reimplement them using proper inline functions, or at least macros that are sane :)

Contributor Author

Hmm, well, for just using it here in bpf-examples it would be easy to implement these as functions (not sure if I even need to specify them as inlined; I generally leave that decision to the compiler).

However, if the goal is to be compatible with ebpf-exporter, maybe it makes more sense to try and fix this upstream (and then import the fixed version of these)? For that it will probably still have to be macros, as the type of the key is unknown; it's just assumed to be a struct that has the member bucket (which they simply access with .bucket). I'm not very experienced in writing anything but the most trivial macros, so I might have to bounce this back and forth a few times before I get this right if we go down this path.

It could of course still be handled with functions that accept a void pointer to the key, as long as we decide that the bucket member has to be the first member. However, ebpf-exporter currently seems to assume that the bucket is the last member of the key (I ran into some issues when I tried placing something after the bucket member), so that would likely require changes to some internal aspects of ebpf-exporter as well as updating all of their examples.
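
To illustrate the void-pointer variant (just a sketch of the idea, as a fragment of a BPF C file, assuming the bucket is made the first member of every key struct):

/* Works for any key type as long as a __u64 bucket is its first member;
 * for a hashmap you would additionally need the lookup-or-init step that
 * maps.bpf.h provides. */
static __always_inline int increment_bucket(void *map, void *key, __u64 bucket)
{
	__u64 *count;

	*(__u64 *)key = bucket; /* assumes bucket is the first key member */

	count = bpf_map_lookup_elem(map, key);
	if (!count)
		return -1;

	__sync_fetch_and_add(count, 1);
	return 0;
}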

@@ -0,0 +1,100 @@
/* SPDX-License-Identifier: GPL-2.0-or-later */
#ifdef BPF_EXAMPLES
Member

I think this is probably going a bit too far in the compatibility. Let's just drop this commit for now, and circle back to it once we have an idea for how things will look wrt bpf-exporter.

@simosund (Contributor Author) commented Feb 5, 2025

I've now done a few sanity tests to validate that it seems to be able to capture latency increases in the network stack. I simply used 10 parallel iperf flows to generate traffic and did one test without iptables rules, and another test after adding ~25k pointless iptables rules.

These first graphs are without the iptables rules.
[image]

These graphs are with the 25k iptables rules.
[image]

Note that the median and 95th percentile values should likely be around twice as high due to the issue with how ebpf-exporter bins the values that I previously mentioned.

While these graphs may not be the easiest to read, one can note that the latency increases quite a bit after adding the 25k iptables rules, with e.g. the median roughly doubling and the mean for tcp_v4_do_rcv() increasing from around 12us to 200us (the mean for tcp_data_queue() actually drops from around 12us to 6us though). The heatmap also shows that the distribution for tcp_v4_do_rcv() shifts from a rather unimodal distribution to a much more multimodal one. Interestingly enough, the relationship between how frequently tcp_v4_do_rcv() and tcp_data_queue() run also shifts: tcp_v4_do_rcv() only runs around 15% more frequently than tcp_data_queue() without all the iptables rules, whereas it runs about twice as frequently with the 25k rules (where the rate for tcp_data_queue() seems to match the actual packet rate).

These tests, however, also somewhat highlighted that the log2 histograms are indeed very coarse grained, as a latency increase will generally only be noticeable if it roughly doubles the latency (as that's what's required to end up in another histogram bin). And in this case the impact of adding those 25k iptables rules could for example be noticed much more easily by the packet rate dropping from around 260 kpps to 24 kpps. Instead of log2 histograms we could of course use something more fine-grained, but at our current 30 buckets we already exceed what's typically used with Prometheus histograms.

In the last year or so Prometheus has added support for "native histograms" that can handle a large number of bins much more efficiently. However, that is still an experimental feature and not supported by ebpf-exporter.

@simosund (Contributor Author) commented Feb 5, 2025

I have also looked a little bit at the performance. Based on the bpf stats (sysctl -w kernel.bpf_stats_enabled=1), the overhead from the eBPF probes for the current implementation looks as follows:

id   , name                      ,        count, avg runtime (ns), frequency (Hz), est. CPU util (%)
2093 , netstacklat_tcp_v4_do_rcv ,     93309605,           176.03,      310702.88,         5.469346%
2095 , netstacklat_tcp_data_queue,     82011162,           134.08,      273081.26,         3.661600%
2096 , netstacklat_udp_queue_rcv ,          255,           762.80,           0.85,         0.000065%
0    , Total                     ,    175321022,           156.41,      583784.99,         9.131011%

Where count is the number of times the hook ran, avg runtime is the processing time per hook activation, frequency is how frequently the hook was triggered, and estimated CPU utilization is the single-core CPU utilization for the program given that runtime and frequency (i.e. runtime * frequency).
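For example, for the tcp_v4_do_rcv hook above, 176.03 ns * 310,702.88 Hz ≈ 0.0547, i.e. roughly 5.47% of a single core, which matches the reported 5.469346%.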

So the tcp_v4_do_rcv() hook takes around 175ns per activation, while the tcp_data_queue() hook takes around 135ns per activation while under load (the 762ns for udp_queue_rcv() is a bit inflated as it's not really under any load; if I redid the tests with a UDP traffic load I expect it would drop to a similar level).

I also tested with a version where I multiplex all the histograms in a single hashmap (how ebpf-exporter typically does it, and produces a more easily handled Prometheus metric), which gave the following results:

id   , name                      ,        count, avg runtime (ns), frequency (Hz), est. CPU util (%)
2057 , netstacklat_tcp_v4_do_rcv ,     94031163,           266.09,      313093.05,         8.331175%
2059 , netstacklat_tcp_data_queue,     82230810,           192.18,      273801.73,         5.261953%
2060 , netstacklat_udp_queue_rcv ,          131,           832.94,           0.44,         0.000036%
0    , Total                     ,    176262104,           231.61,      586895.21,        13.593164%

So about 60ns (40%) more for tcp_data_queue() and 90ns (50%) more for tcp_v4_do_rcv().

I also tested a dummy version that just picked a fixed value for the delay, instead of getting it by reading the skb timestamp and comparing it to the current time, to get a sort of lower bound on the overhead (only the cost of adding the value to the histogram, using the separate array maps), which gave the following:

id   , name                      ,        count, avg runtime (ns), frequency (Hz), est. CPU util (%)
276  , netstacklat_tcp_v4_do_rcv ,     89714195,            70.09,      297714.38,         2.086633%
278  , netstacklat_tcp_data_queue,     76804520,            55.17,      254873.94,         1.406174%
279  , netstacklat_udp_queue_rcv ,          379,           373.21,           1.26,         0.000047%
0    , Total                     ,    166519094,            63.21,      552589.58,         3.492854%

So just parsing the skb timestamp and getting the current time seems to make up the majority of the overhead in the current implementation.

One interesting observation here is that the overhead from the tcp_v4_do_rcv() hook (per activation) seems to be consistently higher than that for tcp_data_queue(), despite all hooks running essentially the same code and all of them being fentry hooks. Furthermore, it seems that the tcp_v4_do_rcv() hook takes roughly 30% longer than tcp_data_queue() in all these configurations, rather than carrying a fixed additional overhead (e.g. 40ns).

As some reference points, I also ran similar tests with epping and the Kyanos tool; the results are as follows:
epping:

id   , name            ,        count, avg runtime (ns), frequency (Hz), est. CPU util (%)
189  , pping_tc_egress ,     79354117,           583.30,      251995.04,        14.698873%
191  , pping_tc_ingress,     15509659,           829.35,       49252.10,         4.084723%
192  , tsmap_cleanup   ,       966946,           427.65,        3219.48,         0.137680%
193  , flowmap_cleanup ,         3289,           323.04,          10.95,         0.000354%
0    , Total           ,     95834011,           621.54,      304477.57,        18.921630%

kyanos:

id   , name                                      ,        count, avg runtime (ns), frequency (Hz), est. CPU util (%)
2326 , dev_hard_start_xmit                       ,     91032535,           837.27,      302098.44,        25.293666%
2333 , security_socket_recvmsg_enter             ,     37333410,            91.60,      123893.78,         1.134898%
2334 , security_socket_sendmsg_enter             ,     28416279,            87.66,       94301.60,         0.826667%
2337 , sock_alloc_ret                            ,         1737,           713.72,           5.76,         0.000411%
2338 , tcp_destroy_sock                          ,           98,          2661.88,           0.33,         0.000087%
2341 , tcp_v4_do_rcv                             ,     86890360,           658.30,      288352.32,        18.982194%
2343 , tracepoint__netif_receive_skb             ,     91034180,           771.08,      302103.90,        23.294533%
2344 , tracepoint__sched__sched_process_exec     ,           13,         12189.92,           0.04,         0.000053%
2346 , tracepoint__skb_copy_datagram_iovec       ,     94955297,           441.97,      315116.43,        13.927324%
2347 , tracepoint__syscalls__sys_enter_accept4   ,          140,          1562.58,           0.46,         0.000073%
2348 , tracepoint__syscalls__sys_enter_close     ,        43195,           444.89,         143.35,         0.006377%
2349 , tracepoint__syscalls__sys_enter_connect   ,          363,           921.93,           1.20,         0.000111%
2350 , tracepoint__syscalls__sys_enter_read      ,       105619,           660.55,         350.50,         0.023153%
2352 , tracepoint__syscalls__sys_enter_recvfrom  ,     37107108,           350.08,      123142.78,         4.311032%
2353 , tracepoint__syscalls__sys_enter_recvmsg   ,       221419,           779.13,         734.80,         0.057250%
2355 , tracepoint__syscalls__sys_enter_sendmsg   ,        71460,           829.18,         237.15,         0.019664%
2356 , tracepoint__syscalls__sys_enter_sendto    ,         1426,           699.82,           4.73,         0.000331%
2357 , tracepoint__syscalls__sys_enter_write     ,     28387953,           357.42,       94207.60,         3.367143%
2359 , tracepoint__syscalls__sys_exit_accept4    ,          140,          2785.14,           0.46,         0.000129%
2360 , tracepoint__syscalls__sys_exit_close      ,        43195,           375.93,         143.35,         0.005389%
2361 , tracepoint__syscalls__sys_exit_connect    ,          363,          2790.62,           1.20,         0.000336%
2362 , tracepoint__syscalls__sys_exit_read       ,       105620,           474.41,         350.51,         0.016628%
2364 , tracepoint__syscalls__sys_exit_recvfrom   ,     37107108,           850.75,      123142.78,        10.476430%
2365 , tracepoint__syscalls__sys_exit_recvmsg    ,       221419,          1367.78,         734.80,         0.100504%
2367 , tracepoint__syscalls__sys_exit_sendmsg    ,        71460,          1540.79,         237.15,         0.036539%
2368 , tracepoint__syscalls__sys_exit_sendto     ,         1426,          1829.03,           4.73,         0.000866%
2369 , tracepoint__syscalls__sys_exit_write      ,     28387952,           924.71,       94207.59,         8.711466%
0    , Total                                     ,    561541275,           593.46,     1863517.74,       110.593253%

So in summary, the current overhead is fairly low, but still not insignificant. Adding only a few more hook points will likely result in overhead similar to that of epping (in a setup where epping does 3-4 hashmap lookups per packet), although we still have a lot of headroom left until we reach overhead as high as kyanos (but kyanos can also deliver much more detailed data).
