Add netstacklat tool #125
Conversation
Force-pushed from 36dd2e9 to 05b8542.
So, I now believe I have a full "minimal working example" that quite closely replicates the bpftrace script and prints out its output in a similar format. There are still many missing features that would be nice to have, such as:

But I figure that this should be sufficient for an early review at least. One issue worth highlighting is that it's currently not fully compatible with ebpf-exporter, in the sense that you can't just drop in the

For licensing, I've put my own files under GPL-2.0 like most other examples here. However, I've also included the same version of

As I mentioned during the meeting, I intentionally opted to use separate BPF maps for each histogram rather than multiplex multiple histograms in the same map, as most of the ebpf-exporter examples seem to do, e.g. here. By doing so I can simply use array maps instead of hashmaps and thereby reduce the overhead for map lookups a bit.
Oh, and ping @netoptimizer if Github hasn't already done so (not intending to rush you, just want to make sure you're aware this PR is ready to be reviewed).
Add the tool netstacklat, which measures latency up to different parts of the Linux ingress network stack. Base the initial implementation on a bpftrace script from Jesper Dangaard Brouer, which requires the kernel to timestamp ingress packets (i.e. set the tstamp member of the skb). Hence, the latency recorded by the tool is the difference between the kernel timestamping point and various later points in the network stack.

In this initial commit, include the eBPF programs for recording the network stack latency at the start of the kernel functions tcp_v4_do_rcv(), tcp_data_queue(), and udp_queue_rcv_one_skb(). Use a structure making it easy to extend the tool with additional hook points in the future.

Make the eBPF programs compatible with Cloudflare's ebpf_exporter, and use the map helpers (maps.bpf.h) from ebpf_exporter to ensure maps are used in a compatible way. Open code the histogram maps for different hook points as entirely separate maps, instead of encoding the different hook points in the key of a shared map as ebpf_exporter often does. This avoids costly hashmap lookups, as simple array maps can be used instead of hash maps.

Also include a minimal user space loader, which loads and attaches the eBPF programs. Later commits will extend this loader to also report the recorded latencies stored in the BPF maps.

Signed-off-by: Simon Sundberg <[email protected]>
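To illustrate the per-hook map layout described in this commit, here is a minimal sketch (my own, with made-up map and function names, not the actual netstacklat source) of a dedicated histogram array map for one hook and the direct bucket lookup it enables:

```c
/* Hypothetical sketch: one array map per hook point, indexed directly by
 * histogram bucket, instead of a single hashmap keyed on (hook, bucket). */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define HIST_NBUCKETS 30

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, HIST_NBUCKETS);
	__type(key, __u32);
	__type(value, __u64);
} hist_tcp_v4_do_rcv SEC(".maps");

static __always_inline void record_latency(void *hist_map, __u32 bucket)
{
	__u64 *count = bpf_map_lookup_elem(hist_map, &bucket);

	/* An array map lookup cannot fail for a valid index, but the
	 * verifier still requires the NULL check. */
	if (count)
		__sync_fetch_and_add(count, 1);
}
```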
Add functionality to the user space component to periodically fetch the BPF maps netstacklat records the values in and print them out.

Base the core program loop on the same setup as pping, where a single epoll instance is used to support multiple different types of events. So far it only deals with signal handling (for clean shutdown) plus a timer (for periodical reporting), but the setup can easily be extended if the program grows more complex in the future.

Use the (somewhat complicated) bpf_map_lookup_batch to fetch the entire histogram maps in a single system call (instead of performing a lookup for each bin index individually).

Signed-off-by: Simon Sundberg <[email protected]>
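For reference, a minimal sketch (assumed names and bucket count, not the actual loader code) of fetching a whole histogram array map with bpf_map_lookup_batch() in a single system call:

```c
/* Hypothetical sketch: fetch all buckets of a histogram array map with one
 * bpf_map_lookup_batch() call instead of one lookup per bucket. */
#include <bpf/bpf.h>
#include <errno.h>
#include <stdio.h>

#define HIST_NBUCKETS 30

static int fetch_histogram(int map_fd, __u64 *buckets /* HIST_NBUCKETS entries */)
{
	__u32 keys[HIST_NBUCKETS];
	__u32 out_batch, count = HIST_NBUCKETS;
	int err;

	/* A NULL in_batch pointer means "start from the beginning". */
	err = bpf_map_lookup_batch(map_fd, NULL, &out_batch, keys, buckets,
				   &count, NULL);
	/* ENOENT just signals that the whole map has been traversed. */
	if (err && errno != ENOENT)
		return -errno;

	printf("fetched %u buckets\n", count);
	return 0;
}
```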
Add a README briefly explaining what this example does. Signed-off-by: Simon Sundberg <[email protected]>
Add support for parsing arguments from the command line. Add the --help/-h and --report-interval/-r <n-seconds> options. As part of this, add some minor documentation about the program and the available options that is displayed by the --help option.

Base the print_usage() function on the usage() function from the traffic-pacing-edt example, but change its output format to be more consistent with most GNU utilities, listing the short option (if any) before the long option. Furthermore, also show whether an argument is expected based on the has_arg value.

Automatically generate the optstring that getopt_long() expects from the options array. While needlessly complicated for the two options currently supported, it simplifies adding additional options in the future.

To allow mapping the opt value returned by getopt_long() back to the corresponding option struct (for example to access the full name of the option), add the optval_to_longopt() helper that linearly searches through the options array. This is more reliable than using the longindex pointer from getopt_long(), as that pointer is only set when a long option is passed, and not when the corresponding short option is passed.

Signed-off-by: Simon Sundberg <[email protected]>
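A rough sketch of the two helpers described in this commit; the option table matches the options named above, but the implementations are my own illustration rather than the actual netstacklat code:

```c
/* Hypothetical sketch: derive the getopt_long() optstring from the option
 * table, and map a returned short-option value back to its struct option. */
#include <getopt.h>
#include <stddef.h>

static const struct option long_options[] = {
	{ "help",            no_argument,       NULL, 'h' },
	{ "report-interval", required_argument, NULL, 'r' },
	{ 0 }
};

static void generate_optstring(char *buf, size_t size)
{
	const struct option *opt;
	size_t i = 0;

	for (opt = long_options; opt->name && i + 3 < size; opt++) {
		buf[i++] = opt->val;
		if (opt->has_arg == required_argument)
			buf[i++] = ':';
	}
	buf[i] = '\0';
}

static const struct option *optval_to_longopt(int val)
{
	const struct option *opt;

	/* Linear search is fine: the table is tiny and this is not on any
	 * hot path. */
	for (opt = long_options; opt->name; opt++)
		if (opt->val == val)
			return opt;
	return NULL;
}
```

For the two options above, generate_optstring() would produce "hr:".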
While I've done a couple of force pushes, there's been no major change to the initial 3 commits (which was what I originally requested the review for). I've just done some minor cleanup there (renamed an enum value from

I have now also added 3 additional commits implementing most of the features I mentioned were missing in my previous comment. They are all relatively self-contained, so I can move them out into separate PR(s) and/or reorder them if requested.

For the commit about enabling the RX software timestamps, it relies on the user space component to do that, so it won't really work when just using the

An alternative approach would be to add an eBPF program at some early part of the network stack that checks if the skb->tstamp field is set, and if not sets it itself. Although I guess not all hooks are necessarily allowed to write to that field, and I'm not sure if there may be any consequences higher up the stack for setting the tstamp field when it's not expected to be set.

For the commit about initializing the TAI-offset, it has a similar limitation in that it relies on the user space program, so it won't work for ebpf-exporter. As an alternative, it may be possible to get the TAI offset from an eBPF program instead. E.g. the following bpftrace script can get it:

sudo bpftrace -e '
// On kernels >= 6.13, remove this struct definition (no longer an anonymous struct)
struct tk_data {
seqcount_raw_spinlock_t seq;
struct timekeeper timekeeper;
};
BEGIN
{
printf("Getting clock offsets from core...\n");
$tk = (struct tk_data *)kaddr("tk_core");
printf("MONOTONIC -> REALTIME (offs_real): %ld\n",
$tk->timekeeper.offs_real);
printf("MONOTONIC -> TAI (offs_tai): %ld\n",
$tk->timekeeper.offs_tai);
printf("REALTIME -> TAI (tai_offset): %ld\n",
$tk->timekeeper.tai_offset);
exit();
}'

However, I'm not really sure how to translate that to C/libbpf in a good way. First, it needs to be attached to some suitable hook that ideally only runs once before all the other hooks trigger, so that we don't have to probe that part of kernel memory every time we need to use the TAI offset. And then I'm not really sure if there's really anything corresponding to the bpftrace

Furthermore, the TAI offset is currently only set at startup, so if the program is running when a leap second occurs, the TAI offset will be out of sync, and we will get massively inflated latencies. We could periodically update it, although that still runs the risk of it being out of sync in-between updates. We could also try to hook into whatever part of the kernel actually updates that offset, so that we change it as soon as the kernel updates it. But in order to update it, it can obviously no longer be a constant, so it will potentially add a small amount of overhead when using the TAI offset.
The netstacklat tool depends on the tstamp member of the skbs being set by the kernel, so make the user space loader enable software RX timestamping on all packets when the tool is started.

Note that, as the documentation at https://docs.kernel.org/networking/timestamping.html points out, enabling software receive timestamping on a socket enables it for all packets, as the destination socket is not known early in the network stack. It is therefore sufficient to enable it on a socket created by the program itself, even if no packets are received on that socket.

Signed-off-by: Simon Sundberg <[email protected]>
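A minimal sketch of what enabling software RX timestamping from the loader can look like (the function name and error handling are my own assumptions, not the actual loader code):

```c
/* Hypothetical sketch: enable software RX timestamping by setting
 * SO_TIMESTAMPING on a throwaway socket owned by the program. */
#include <linux/net_tstamp.h>
#include <sys/socket.h>
#include <unistd.h>
#include <errno.h>

static int enable_sw_rx_timestamping(void)
{
	int val = SOF_TIMESTAMPING_RX_SOFTWARE | SOF_TIMESTAMPING_SOFTWARE;
	int sock;

	/* The socket itself never receives anything; it only exists so the
	 * kernel starts timestamping incoming packets early in the stack. */
	sock = socket(AF_INET, SOCK_DGRAM, 0);
	if (sock < 0)
		return -errno;

	if (setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val))) {
		close(sock);
		return -errno;
	}

	/* Assumption: keep the socket open for the program's lifetime so the
	 * timestamping request stays active; the caller keeps this fd. */
	return sock;
}
```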
Make the user space loader calculate the TAI offset at startup and set it as a constant for the eBPF programs. Split the open and loading stages of the eBPF programs apart to enable setting constants in the eBPF programs.

Note that on some systems (e.g. most Debian systems by default), the TAI offset may (incorrectly) be 0, so that CLOCK_TAI becomes identical to CLOCK_REALTIME. While this is incorrect in principle, it does not pose an issue for netstacklat, as it only needs the TAI offset to translate CLOCK_TAI to CLOCK_REALTIME (which skb->tstamp is assumed to use as its clock basis). Therefore, netstacklat will (from this commit) work correctly even if the TAI offset is not correctly set on the system.

Limitation: The TAI offset is only set once, when the program is first loaded, and is not dynamically updated in case the TAI offset changes. So if the program is running while a leap second occurs, the recorded latencies may be off by one second. Furthermore, as the TAI offset is set from user space, it will not work when just using the eBPF portion together with ebpf-exporter.

Signed-off-by: Simon Sundberg <[email protected]>
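A minimal sketch of computing the TAI offset at startup and passing it to the eBPF side before loading (the skeleton and constant names below are assumptions for illustration, not the actual netstacklat code):

```c
/* Hypothetical sketch: derive the REALTIME->TAI offset in user space and
 * bake it into the BPF object as a constant before loading. */
#include <time.h>

static long get_tai_offset_seconds(void)
{
	struct timespec rt, tai;
	long long diff_ns;

	clock_gettime(CLOCK_REALTIME, &rt);
	clock_gettime(CLOCK_TAI, &tai);

	/* Both clocks tick at the same rate, so rounding the difference to
	 * whole seconds gives the TAI offset (37s on correctly configured
	 * systems at the time of writing, 0 where it is left unset). */
	diff_ns = (long long)(tai.tv_sec - rt.tv_sec) * 1000000000LL +
		  (tai.tv_nsec - rt.tv_nsec);
	return (long)((diff_ns + 500000000LL) / 1000000000LL);
}

/*
 * Usage with a libbpf skeleton, split into open and load stages
 * (skeleton and rodata variable names are assumptions):
 *
 *   obj = netstacklat_bpf__open();
 *   obj->rodata->user_tai_offset_sec = get_tai_offset_seconds();
 *   err = netstacklat_bpf__load(obj);
 */
```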
When it comes to output format, I have for now approximately replicated the output format from @netoptimizer's bpftrace script. So it may for example look like this:

Mon Jan 27 20:02:01 2025
tcp_v4_do_rcv:
[1.02us, 2.05us) 1054 | |
[2.05us, 4.1us) 181795 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4.1us, 8.19us) 181529 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8.19us, 16.4us) 68458 |@@@@@@@@@@@@@@@@@@ |
[16.4us, 32.8us) 1187 | |
[32.8us, 65.5us) 990 | |
[65.5us, 131us) 453 | |
[131us, 262us) 8 | |
Sum: 2373204797
tcp_data_queue:
[2.05us, 4.1us) 169968 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[4.1us, 8.19us) 179638 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8.19us, 16.4us) 51028 |@@@@@@@@@@@@@@ |
[16.4us, 32.8us) 1176 | |
[32.8us, 65.5us) 900 | |
[65.5us, 131us) 433 | |
[131us, 262us) 8 | |
Sum: 2320708288
udp_queue_rcv_one_skb:
[1.02us, 2.05us) 8 |@@@@@@@@@@@@@@@@@ |
[2.05us, 4.1us) 23 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4.1us, 8.19us) 12 |@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8.19us, 16.4us) 14 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[16.4us, 32.8us) 7 |@@@@@@@@@@@@@@@ |
[32.8us, 65.5us) 2 |@@@@ |
Sum: 571657

(For reference, the corresponding output from the bpftrace script is:

20:02:01
@cnt[kfunc:vmlinux:udp_queue_rcv_one_skb]: 66
@cnt[kfunc:vmlinux:tcp_data_queue]: 404016
@cnt[kfunc:vmlinux:tcp_v4_do_rcv]: 436508
@total[kfunc:vmlinux:udp_queue_rcv_one_skb]: 66
@total[kfunc:vmlinux:tcp_data_queue]: 404022
@total[kfunc:vmlinux:tcp_v4_do_rcv]: 436514
@stats[kfunc:vmlinux:tcp_v4_do_rcv]: count 436515, average 5148, total 2247484837
@stats[kfunc:vmlinux:tcp_data_queue]: count 404024, average 5489, total 2217754970
@stats[kfunc:vmlinux:udp_queue_rcv_one_skb]: count 66, average 6507, total 429464
@hist_ns[kfunc:vmlinux:udp_queue_rcv_one_skb]:
[1K, 2K) 10 |@@@@@@@@@@@@@@@@@@@@ |
[2K, 4K) 26 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K) 11 |@@@@@@@@@@@@@@@@@@@@@@ |
[8K, 16K) 13 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[16K, 32K) 5 |@@@@@@@@@@ |
[32K, 64K) 1 |@@ |
@hist_ns[kfunc:vmlinux:tcp_data_queue]:
[1K, 2K) 25 | |
[2K, 4K) 175047 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[4K, 8K) 177977 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K) 48539 |@@@@@@@@@@@@@@ |
[16K, 32K) 1135 | |
[32K, 64K) 880 | |
[64K, 128K) 420 | |
[128K, 256K) 7 | |
@hist_ns[kfunc:vmlinux:tcp_v4_do_rcv]:
[256, 512) 1 | |
[512, 1K) 4 | |
[1K, 2K) 28061 |@@@@@@@ |
[2K, 4K) 158756 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[4K, 8K) 185669 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K) 61565 |@@@@@@@@@@@@@@@@@ |
[16K, 32K) 1072 | |
[32K, 64K) 946 | |
[64K, 128K) 442 | |
[128K, 256K) 8 | |
)

So it "visualizes" the data in a format that makes it fairly easy to glance at, but it's not really ideal if you want to parse it and feed it into some other system. Then again, if you want to feed the data into some other system, it's perhaps better to use it together with ebpf-exporter in the first place.
I have also verified that it "works" together with ebpf-exporter (with some minor changes to the includes). However, as this issue for ebpf-exporter points out, it seems like ebpf-exporter sort of has an off-by-one error in their log2 histogram implementation, so when using it with Prometheus you will generally get around half of the actual latency (and this issue affects all of the log2 histogram examples currently included in ebpf-exporter). Note that the output from netstacklat when used here in bpf-examples is still correct, as I print the correct bin labels for the histogram (acknowledging that, the way the bin index is calculated, the bins are left-inclusive). But when used with ebpf-exporter the bins will be incorrectly labeled (to pretend to match Prometheus' right-inclusive bin format), and therefore give misleading results. To clarify the issue a bit, I provide the following table:

Where the index is the index in the BPF array map for the bin (and the index that will essentially be calculated by the histogram helpers in

I'm not really sure what the best way to address this problem is. I could use the updated bin calculation from the PR that intends to fix the issue in ebpf-exporter. However, until that fix is merged in ebpf-exporter itself that won't really change anything (as the function for calculating the bin-index will be based on
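To illustrate the labeling issue (this is my own sketch of the convention as I understand it, not code from either project): with a floor(log2) bucket calculation, bucket i holds values in [2^i, 2^(i+1)), so labeling bucket i with an upper bound of 2^i understates the latency by up to a factor of two.

```c
/* Hypothetical sketch of the two labeling conventions for a log2 histogram
 * whose bucket index is computed as floor(log2(value)). */
#include <stdio.h>

static unsigned int log2_bucket(unsigned long long value)
{
	unsigned int bucket = 0;

	while (value >>= 1) /* floor(log2(value)) for value >= 1 */
		bucket++;
	return bucket;
}

int main(void)
{
	unsigned long long latency_ns = 3000;     /* example: 3 us */
	unsigned int i = log2_bucket(latency_ns); /* = 11 */

	/* Left-inclusive labeling (what netstacklat prints): correct. */
	printf("[%llu, %llu) ns\n", 1ULL << i, 1ULL << (i + 1)); /* [2048, 4096) */

	/* Right-inclusive labeling with upper bound 2^i (the problematic
	 * Prometheus-style interpretation): reports "le 2048 ns", roughly
	 * half of the actual latency. */
	printf("le %llu ns\n", 1ULL << i);
	return 0;
}
```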
Add a YAML configuration that ebpf-exporter can use to convert the BPF maps to Prometheus metrics. Convert the histogram map for each hook into the corresponding Prometheus histogram. Signed-off-by: Simon Sundberg <[email protected]>
Add a BPF_EXAMPLES macro flag that lets netstack.bpf.c know whether it's being compiled in the bpf-examples repository or in ebpf-exporter, and allows it to adjust the included headers accordingly. This allows copying the netstack.bpf.c, netstack.h and netstack.yaml files directly into ebpf_exporter/examples without any modifications.

Signed-off-by: Simon Sundberg <[email protected]>
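A sketch of what this include switching might look like (the specific header names in each branch are placeholders, not necessarily the ones the commit uses):

```c
/* Hypothetical sketch: adjust includes depending on whether the file is
 * compiled in bpf-examples (which would define BPF_EXAMPLES in its Makefile)
 * or copied into ebpf_exporter/examples. Header names are placeholders. */
#ifdef BPF_EXAMPLES
#include "vmlinux_local.h"   /* placeholder for the bpf-examples headers */
#else
#include <vmlinux.h>         /* placeholder for the ebpf_exporter headers */
#endif
#include "maps.bpf.h"
#include "netstack.h"
```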
I've now added the ebpf-exporter YAML config file. Furthermore, in the "Make build process compatible with ebpf-exporter" commit I have provided a suggestion for how to make netstacklat.bpf.c compile correctly both here in the bpf-examples repository as well as when copied into the
I think generally this looks sane, with the main exception being the macros from bpf_exporter. Those we should definitely not import!
@@ -0,0 +1,84 @@
/* SPDX-License-Identifier: MIT */
/* From https://github.com/cloudflare/ebpf_exporter/blob/99d2752f9e0a095b57f53e5da6856ad143b0e443/examples/maps.bpf.h */
Okay, so a bunch of macros that don't properly delimit their contents, even going so far as having later macros directly refer to the expanded values of earlier ones? Yikes, no thanks! Please get rid of these and reimplement them using proper inline functions, or at least macros that are sane :)
Hmm, well, for just using it here in bpf-examples it would be easy to implement these as functions (not sure if I even need to specify them as inlined; I generally leave that decision to the compiler).
However, if the goal is to be compatible with ebpf-exporter, maybe it makes more sense to try and fix this upstream (and then import the fixed version of these)? For that it will probably still have to be macros as the type of the key is unknown, it's just assumed to be a struct that has the member bucket
(which they simply access with .bucket). Not very experienced in writing anything but the most trivial macros, so might have to bounce this back and forth a few times before I get this right if we go down this path.
Could of course still be handled with functions that accept a void pointer to the key, as long as we decide that the bucket member has to be the first member. However, ebpf-exporter currently seems to assume that the bucket is the last member of the key (I ran into some issues when I tried placing something after the bucket member), so that would likely require changes to some internal aspects of ebpf-exporter as well as updating all of their examples.
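For illustration, a minimal sketch of the function-based approach discussed above, assuming (as suggested) that the bucket index is the first member of the key struct; this is my own sketch, not ebpf_exporter's actual API:

```c
/* Hypothetical sketch: an increment helper as a function instead of a macro,
 * under the assumption that the bucket index is the *first* member of
 * whatever key struct a histogram map uses. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

static __always_inline int increment_hist_bucket(void *map, void *key,
						 __u32 bucket, __u64 increment)
{
	__u64 *count;

	/* Overwrite the leading bucket field of the caller's key struct. */
	*(__u32 *)key = bucket;

	count = bpf_map_lookup_elem(map, key);
	/* For hash maps, a full helper would also need to insert a zeroed
	 * entry on a lookup miss, as ebpf_exporter's macro does; for array
	 * maps a valid index always exists. */
	if (!count)
		return -1;

	__sync_fetch_and_add(count, increment);
	return 0;
}
```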
@@ -0,0 +1,100 @@
/* SPDX-License-Identifier: GPL-2.0-or-later */
#ifdef BPF_EXAMPLES
I think this is probably going a bit too far in the compatibility. Let's just drop this commit for now, and circle back to it once we have an idea for how things will look wrt bpf-exporter.
I've now done a few sanity tests to validate that it seems to be able to capture latency increases in the network stack. I simply used 10 parallel iperf flows to generate traffic and did a test without iptables rules, and another test after adding ~25k pointless iptables rules.

These first graphs are without the iptables rules.

These graphs are with the 25k iptables rules. Note that the median and 95th percentile values should likely be around twice as high due to the issue with how ebpf-exporter bins the values that I previously mentioned.

While these graphs may not be the easiest to read, one can note that the latency increases quite a bit after adding the 25k iptables rules, with e.g. the median roughly doubling and the mean for tcp_v4_do_rcv() increasing from around 12us to 200us (the mean for tcp_data_queue() actually drops from around 12us to 6us though). The heatmap also shows that the distribution for tcp_v4_do_rcv() shifts from a rather unimodal distribution to a much more multimodal one. Interestingly enough, the relationship between how frequently tcp_v4_do_rcv() and tcp_data_queue() run also shifts: tcp_v4_do_rcv() only runs around 15% more frequently than tcp_data_queue() without all the iptables rules, whereas it runs about twice as frequently with the 25k rules (where the rate for tcp_data_queue() seems to match the actual packet rate).

These tests, however, also somewhat highlighted that the log2 histograms are indeed very coarse grained, as a latency increase will generally only be noticeable if it roughly doubles the latency (as that's what's required to end up in another histogram bin). And in this case the impact of adding those 25k iptables rules could, for example, be noticed much more easily by the packet rate dropping from around 260 kpps to 24 kpps. Instead of log2 histograms we could of course use something more fine-grained, but at our current 30 buckets we already exceed what's typically used with Prometheus histograms. In the last year or so Prometheus has added support for "native histograms" that can handle a large number of bins much more efficiently. However, that is still an experimental feature and not supported by ebpf-exporter.
Have also looked a little bit at the performance. Based on the bpf stats (
Where count is the number of times the hook ran, avg runtime is the processing time per hook activation, frequency is how frequently the hook was triggered, and estimated CPU utilization is the single core CPU utilization for the program given that runtime and frequency (i.e. runtime * frequency).

So the tcp_v4_do_rcv() hook takes up around 175ns per activation, while the tcp_data_queue() hook takes up around 135ns per activation under load (the 762ns for udp_queue_rcv() is a bit inflated as it's not really under any load; if I redid the tests with a UDP traffic load I expect it would drop to a similar level).

I also tested with a version where I multiplex all the histograms in a single hashmap (how ebpf-exporter typically does it, which produces a more easily handled Prometheus metric), which gave the following results:
So about 60ns (40%) more for tcp_data_queue() and 90ns (50%) more for tcp_v4_do_rcv(). I also tested a dummy version that just picked a fixed value for the delay instead of getting it from reading the skb timestamp and comparing to the current time to get a sort of lower floor for what the maximum performance could reasonably be (only overhead for adding the value to the histogram, using the separate array maps), which gave the following:
So just parsing the skb timestamp and getting the current time seems to make up the majority of the overhead in the current implementation.

One interesting observation here is that the overhead from the tcp_v4_do_rcv() hook (per activation) seems to be consistently higher than that for tcp_data_queue(), despite all hooks running essentially the same code and all of them being fentry hooks. Furthermore, it seems that the tcp_v4_do_rcv() hook takes roughly 30% longer than tcp_data_queue() in all these configurations, rather than having a fixed additional overhead (e.g. 40ns).

As some reference points, I also ran similar tests with epping and the Kyanos tool; the results are as follows:
kyanos:
So in summary, the current overhead is fairly low, but still not insignificant. Adding only a few more hook points will likely result in overhead similar to that of epping (in a setup where epping does 3-4 hashmap lookups per packet), although we still have a lot of headroom left until we reach overhead as high as kyanos (but kyanos can also deliver much more detailed data).
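For reference, a sketch (my own, not code from this PR) of how such per-program runtime numbers can be gathered programmatically from the kernel's BPF stats, which requires the sysctl kernel.bpf_stats_enabled=1:

```c
/* Hypothetical sketch: read run_cnt/run_time_ns for a loaded BPF program and
 * derive the average runtime per invocation, as used in the numbers above. */
#include <bpf/bpf.h>
#include <stdio.h>

static int print_prog_overhead(int prog_fd)
{
	struct bpf_prog_info info = {};
	__u32 len = sizeof(info);
	int err;

	err = bpf_obj_get_info_by_fd(prog_fd, &info, &len);
	if (err)
		return err;

	/* run_cnt/run_time_ns stay zero unless kernel.bpf_stats_enabled=1. */
	if (info.run_cnt)
		printf("%s: %llu runs, avg %llu ns/run\n", info.name,
		       (unsigned long long)info.run_cnt,
		       (unsigned long long)(info.run_time_ns / info.run_cnt));
	return 0;
}
```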
Initial implementation of the netstacklat tool, which is based on Jesper Brouer's bpftrace script for measuring latency up to certain points in the Linux network stack.
The eBPF portion of the program is designed to be compatible with ebpf_exporter. In theory one only has to drop the netstack.bpf.c file into the ebpf_exporter/examples directory and add the corresponding yaml config file (which I will probably add to this repository as well later), although I have not yet verified that it actually works as intended there.

In addition to finishing up the existing commits, some points I wish to address before attempting to merge this request are: