Accessing Kernel struct Data in Systemtap - Fri, Dec 29, 2017
About SystemTap
SystemTap (https://sourceware.org/systemtap) is an
instrumentation framework which enables you to write scripts which can
measure and inspect code and data running on a live Linux system. It is
incredibly useful for gathering data and metrics when diagnosing
problems too complex to be understood using the standard kernel metrics
available via /proc, /sys or netlink endpoints. Scripts are based
around the concept of probes, which attach to probe points - primarily
functions in the kernel or userspace. Probes can then be used to
extract a huge amount of insight - things like how many times was this
specific function called? Which kernel module is flushing the disk
cache? This is the kind of data that generic counters can't give you,
leaving you to guess at what's going on by inference. Either static
probe points (tracepoints), which are compiled into modern kernels, or
a full debug symbol table (usually installed separately via your
package manager) can be used as a source of information for SystemTap
to map system calls and memory locations back to something useful.
Using SystemTap we can gain very deep introspection into the kernel: how it works, and the program flow and in-memory structures behind the specific behaviour you want to diagnose. You can also get access to the variables passed to kernel functions as they are called, along with their contents, which allows for all sorts of very powerful analysis. Let's dig into that by examining dropped network packets. Dropped packets are typically difficult to diagnose, and examining them requires quite a lot of kernel introspection, so they make a good topic to step through when talking about breaking down kernel structures whilst tracing.
But first, some context on what exactly we are analysing when we talk about kernel introspection.
Network Structures in Linux
Most things in the Linux kernel are represented by purpose-built
in-memory structures. These are often just referred to as structs, as
struct is the low-level primitive C uses to build representative
structures in memory, and they are widely used to represent key kernel
concepts.
We are going to refer to two key struct types in the Linux kernel.
These are just examples though; you can use the approach I'm describing
to get introspection into anything defined in the kernel and passed as
an argument to a probed function.
Firstly, network devices. In the Linux kernel, every interface is
described by a net_device structure (defined in
include/linux/netdevice.h, if you are interested; I won't reproduce it
here because it's hefty). Network drivers will allocate and update a
struct net_device for each interface, and these are often passed
around as pointers when other kernel structures need to keep track of a
device they are interacting with. For example, when a device is added to
a bond or bridge, the child interfaces are referred to by their
net_device structs. Essentially, each interface you see in ip link has
a net_device struct floating around in kernel memory.
(Side note: the lifetime of this struct is somewhat loosely tracked by
the refcount member, which has to fall to zero before a device can be
freed - or not, as was the case in a few kernel bugs in recent memory.)
Secondly, packets. Every packet which passes through the Linux networking stack is stored in a structure called the SKB ("Socket Buffer"), struct sk_buff in the source. It keeps track of a packet as it is received, handled, and transmitted, along with the devices which handled it, its priority and its TTL, all the way until it has been handled and removed from the kernel's socket buffer.
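To make the relationship between the two structures concrete, here is a minimal userspace sketch in C. The mock_ structs are hypothetical, heavily trimmed stand-ins (the real definitions have many more members), and mock_devname is an illustrative helper, not a kernel function. It only shows the pointer chain from a buffer to its device name, and the fact that a buffer may have no device attached at all.

```c
#include <stddef.h>

#define IFNAMSIZ 16 /* interface name length, same value the kernel uses */

/* Hypothetical, trimmed stand-in for struct net_device. */
struct mock_net_device {
    char name[IFNAMSIZ]; /* interface name, e.g. "eth0" */
};

/* Hypothetical, trimmed stand-in for struct sk_buff. */
struct mock_sk_buff {
    struct mock_net_device *dev; /* device the buffer is tied to, may be NULL */
    unsigned int len;            /* length of the packet data */
};

/* Follow skb -> dev -> name, falling back to "unknown" when there is
 * no buffer or no device to follow. */
static const char *mock_devname(const struct mock_sk_buff *skb)
{
    if (skb == NULL || skb->dev == NULL)
        return "unknown";
    return skb->dev->name;
}
```

Calling mock_devname on a buffer whose dev points at a device named "eth0" returns "eth0"; a buffer with a NULL dev yields "unknown".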
Dropped Packets
Whenever a packet is dropped in the Linux kernel, it is the result of a module, or the net code in the kernel itself, deciding to drop it rather than send it on somewhere else. Sometimes, as with QoS or active/active bonding, dropping a packet is desirable or at least expected - maybe you've hit the bandwidth limit in the qdisc the traffic is assigned to, or maybe a packet came in on the wrong leg of the bond - so the packet is dropped and the protocol compensates.
Other times, it is a move made out of panic, or because no other code
path presents itself. Every time it happens, however, there is a call to
kfree_skb. kfree_skb is the function in the kernel which sends a
packet into the void, never to be seen again.
The standard tooling tells us when packets are dropped, and when packets
are not being transmitted/received, per interface, by way of the
interface statistics shown by ip -s link. But this doesn't help us
understand the state of the kernel, where in the kernel the decision to
drop happened, the contents of a dropped packet, or the code path
leading up to the decision to drop it, all of which can be key to
understanding why a packet is dropped, and whether you're dealing with
a bug or just a plain old configuration issue.
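The per-interface counters mentioned above are also exposed as plain files under /sys/class/net/<interface>/statistics/ (for example rx_dropped). As a rough sketch of how little they tell you, the hypothetical helper below (read_counter is my own name, not part of any library) reads one such file in C. All you get back is a single number, with no hint of where or why the drop happened.

```c
#include <stdio.h>

/* Read a single unsigned counter from a file such as
 * /sys/class/net/eth0/statistics/rx_dropped (the interface name
 * here is only an example).
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
static int read_counter(const char *path, unsigned long long *out)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    /* The file holds one decimal number followed by a newline. */
    int ok = (fscanf(f, "%llu", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Polling this value over time shows that drops are happening on an interface, which is a useful cue to reach for tracing, but nothing more.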
Anatomy of a SystemTap Script
The easiest way to tie all of the above together is with an example
script. The script below has been constructed to identify where
kfree_skb is being called from in the kernel, and to use the passed
structures to identify details about the packet passed to the call. This
allows us to identify useful information such as the interface and
kernel module referenced when a packet was dropped.
# Array to hold the list of kfree_skb call sites we find
global funcs

# Function to resolve interface name from an SKB struct
function devname:string(skb:long) {
    # In SystemTap, data needs to be typecast to the correct structure
    # to be useful. We typecast the pointer to an sk_buff, which is
    # the kernel structure used for storing network buffers
    dev = @cast(skb, "struct sk_buff", "kernel<linux/skbuff.h>")->dev
    # Not every buffer has a device attached, so bail out early rather
    # than follow a NULL pointer
    if (dev == 0) {
        return "unknown"
    }
    # sk_buff links out to a net_device struct as a pointer, so we
    # typecast that too, so we can get at the data inside
    dev_name = @cast(dev, "struct net_device", "kernel<linux/netdevice.h>")->name
    # If that all worked, let's continue
    if (dev_name != 0) {
        # kernel_string reads the string stored at that kernel address
        return kernel_string(dev_name)
    } else {
        # Otherwise, we have no idea which interface the packet was buffered for
        return "unknown"
    }
}

# We are using a tracepoint here, and attaching to the kfree_skb tracepoint.
# Using tracepoints (kernel.trace) rather than kernel function probes (kernel.function) is
# favourable, given tracepoints do not require a debug kernel or debug symbols to be installed.
# Tracepoints, as a downside, are not as plentiful, and have to have been inserted at
# specifically interesting places in the kernel source. Luckily for us, the kfree_skb
# tracepoint lives right alongside the kfree_skb function in the kernel source, so we can
# pick up not only the location of the function call, but also the data passed as the first argument.
probe kernel.trace("kfree_skb") {
    # We pass the pointer to the data passed as the first argument to kfree_skb to our devname
    # function, where the data is typecast and deconstructed per the comments above to get the
    # device name the packet/buffer was dropped from.
    dev_name = devname($skb)
    # We then record the drop in an array keyed on two indices, the call location and the
    # dev name. This allows us to later recall the devname recorded alongside a specific
    # call site of kfree_skb when printing results in the print_drops function.
    # $location contains the location in the kernel the function was called from.
    funcs[$location, dev_name] <<< 1
}

# This function loops through results and prints out a pretty display of packet drops per
# kernel call site, per kernel module, per interface.
function print_drops() {
    # Loop over the (location, devname) pairs recorded in the array
    foreach ([loc, dev] in funcs) {
        # Print out a count for each location and device name pair, along with the module name.
        # @count returns the number of values recorded under a key, in our case the number of
        # drops seen for each unique location and device name pair.
        # We also print out symname, which gets the calling function/symbol name.
        # We print out our device name (interface name) where the packets were dropped.
        # We then print out modname, which is the kernel module containing the calling code.
        printf("%d: %s@%s (%s)\n", @count(funcs[loc, dev]), symname(loc), dev, modname(loc))
    }
    # We then clean up the array so the next call will start with freshly gathered data
    delete funcs
}

# Every 5 seconds report our drop locations by registering a timer probe
probe timer.sec(5) {
    # Print out our combined output for the last 5 seconds
    print_drops()
}
Et voilà. The comments are inline. Any questions, feel free to get in touch!