𖤐 ec0 :: tachibana systems division 𖤐

About SystemTap

SystemTap (https://sourceware.org/systemtap) is an instrumentation framework which lets you write scripts to measure and inspect code and data on a live Linux system. It is incredibly useful for gathering data and metrics when diagnosing problems too complex to be understood using the standard kernel metrics available via /proc, /sys or netlink endpoints. Scripts are based around the concept of probes, which attach to probe points - primarily functions in the kernel or userspace. Probes can then be used to extract a huge amount of insight - things like how many times was this specific function called? Which kernel module is flushing the disk cache? This is the kind of data you can't get from generic counters, where you are left guessing at what's going on by inference. SystemTap maps calls and memory locations back to something useful using either tracepoints, which are compiled into modern kernels, or a full debug symbol table (usually installed separately via your package manager).

Using SystemTap we can gain some very deep introspection of the kernel: how it works, and the program flow and in-memory structures behind the specific behaviour you want to diagnose. You can also get access to the variables passed to kernel functions as they are called, and their contents, which allows for all sorts of very powerful analysis. Let's dig into that by examining dropped network packets. This requires quite a lot of kernel introspection, so it's a good topic to step through when talking about breaking down kernel structures whilst tracing. It's also typically a difficult problem to diagnose.

But first, some context on what exactly we are analysing when we talk about kernel introspection.

Network Structures in Linux

Most things in the Linux kernel are represented by purpose-built in-memory structures. These are often just referred to as structs as this is the low-level code primitive used to build representative structures in memory in C, and are widely used to represent key kernel concepts.

We are going to refer to two key struct types in the Linux kernel. These are just examples, though: you can use the approach I'm describing to get introspection into anything defined in the kernel and passed as an argument to a probed function.

Firstly, network devices. In the Linux kernel, every interface is described by a net_device structure (defined in include/linux/netdevice.h if you are interested; I won't reproduce it here because it's hefty). Network drivers will allocate and update a struct net_device for each interface, and these are often passed around as pointers when other kernel structures need to keep track of a device they are interacting with. For example, when a device is added to a bond or bridge, the child interfaces are referred to by their net_device structs. Essentially, each interface you see in ip link has a net_device struct floating around in kernel memory.

(Side note: the lifetime of each net_device is tracked, somewhat loosely, by the refcount member, which has to fall to zero before a device can be freed. A few kernel bugs in recent memory were cases where that did not go to plan.)

Secondly, packets. Every packet which passes through the Linux networking stack is stored in a struct sk_buff, usually called an SKB ("socket buffer"). It keeps track of a packet as it is received, handled, and transmitted, along with the devices which handled it, its priority, and headers such as the TTL, all the way until the packet is handled and removed from the kernel's socket buffer.
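To make the relationship between the two structures concrete, here is a heavily simplified userspace model in C. Everything named toy_ below is a hypothetical stand-in invented for illustration, not the kernel definition; the real struct sk_buff and struct net_device carry far more members. What it sketches is the pointer chain from buffer, to device, to name, including the case where no device is attached.

```c
#include <stddef.h>

#define TOY_IFNAMSIZ 16  /* matches the size of the kernel's IFNAMSIZ */

/* Hypothetical, cut-down stand-in for struct net_device */
struct toy_net_device {
    char name[TOY_IFNAMSIZ];     /* interface name, e.g. "eth0" */
};

/* Hypothetical, cut-down stand-in for struct sk_buff */
struct toy_sk_buff {
    struct toy_net_device *dev;  /* associated device, NULL if none */
    unsigned int len;            /* length of the packet data */
};

/* Follow skb -> dev -> name, falling back when no device is attached */
const char *toy_devname(const struct toy_sk_buff *skb)
{
    if (skb == NULL || skb->dev == NULL)
        return "unknown";
    return skb->dev->name;
}
```

Calling toy_devname on a buffer whose dev points at a device named "eth0" returns that name; a buffer with a NULL dev yields "unknown" rather than dereferencing a bad pointer.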

Dropped Packets

Whenever a packet is dropped in the Linux kernel, it is the result of a module, or the net code in the kernel itself, deciding to drop it rather than send it on somewhere else. Sometimes, in the case of QoS or active/active bonding, dropping a packet is desirable or at least expected - maybe you've hit the bandwidth limit in the qdisc the traffic is assigned to, or maybe a packet came in on the wrong leg of the bond - so the packet is dropped and the protocol compensates.

Other times, it is a move made out of panic, or because no other code path presents itself. Every time it happens, however, there is a call to kfree_skb. kfree_skb is the function in the kernel which sends a packet into the void, never to be seen again.

The standard tooling tells us when packets are dropped, and when packets are not being transmitted or received, per interface, by way of the interface statistics shown by ip -s link. But this doesn't tell us the state of the kernel, where in the kernel the decision to drop happened, the contents of the dropped packet, or the code path leading up to the decision. All of these can be key to understanding why a packet was dropped, and whether you're dealing with a bug or just plain old configuration issues.
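Because every drop funnels through one function, watching that one function is enough to see them all. As a rough userspace sketch in C of that idea: the names here (toy_buf, toy_free_buf, drops_at) and the string location labels are hypothetical, and the kernel does none of this bookkeeping itself. Each caller hands over the buffer plus a label for where it was called from, and a tally is kept per (location, device) pair.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SITES 32

/* Hypothetical stand-in for a packet buffer; only the device name is kept */
struct toy_buf {
    const char *dev_name;    /* NULL when no device is attached */
};

/* One tally per (call location, device) pair */
struct drop_site {
    const char *location;
    const char *dev_name;
    unsigned long count;
};

static struct drop_site sites[MAX_SITES];
static size_t nsites;

/* Find the entry for a pair, or NULL if it has not been seen yet */
static struct drop_site *find_site(const char *location, const char *dev_name)
{
    for (size_t i = 0; i < nsites; i++) {
        if (strcmp(sites[i].location, location) == 0 &&
            strcmp(sites[i].dev_name, dev_name) == 0)
            return &sites[i];
    }
    return NULL;
}

/* The single choke point: every drop path ends up calling this.
 * The strings passed in must outlive the table, which stores the pointers. */
void toy_free_buf(struct toy_buf *buf, const char *location)
{
    const char *dev = (buf != NULL && buf->dev_name != NULL)
                          ? buf->dev_name : "unknown";
    struct drop_site *site = find_site(location, dev);

    if (site == NULL && nsites < MAX_SITES) {
        site = &sites[nsites++];
        site->location = location;
        site->dev_name = dev;
        site->count = 0;
    }
    if (site != NULL)
        site->count++;
    /* A real free routine would release the buffer's memory here. */
}

/* Number of drops recorded for one pair; 0 if never seen */
unsigned long drops_at(const char *location, const char *dev_name)
{
    const struct drop_site *site = find_site(location, dev_name);
    return site != NULL ? site->count : 0;
}
```

The design point is that the callers do not need to cooperate beyond calling the one free routine: all of the accounting lives at the choke point.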

Anatomy of a SystemTap Script

The easiest way to tie all of the above together is with an example script. The script below has been constructed to identify where in the kernel kfree_skb is being called from, and to use the structures passed to it to identify details about the dropped packet. This gives us useful information such as the interface and kernel module involved when a packet was dropped.

# Array to hold the list of kfree_skb functions we find called
global funcs

# Function to resolve interface name from an SKB struct
function devname:string(skb:long) {
  # In SystemTap, data needs to be typecast to the correct structure
  # to be useful. We typecast the pointer to an sk_buff, which is
  # the kernel structure used for storing network buffers
  dev = @cast(skb, "struct sk_buff", "kernel<linux/skbuff.h>")->dev
  # sk_buff links out to a net_device struct as a pointer, so we
  # typecast that too, so we can get at the data inside
  dev_name = @cast(dev, "struct net_device", "kernel<linux/netdevice.h>")->name
  # If that all worked, let's continue
  if (dev_name != 0) {
    # kernel_string copies the NUL-terminated name out of kernel memory
    # and returns it as a SystemTap string
    return kernel_string(dev_name)
  } else {
    # Otherwise, we have no idea which interface the buffer belonged to
    return "unknown"
  }
}

# We are using a tracepoint here, and attaching to the kfree_skb tracepoint.
# Using tracepoints (kernel.trace) rather than kernel function probes (kernel.function) is
# favourable, given tracepoints do not require a debug kernel or debug symbols to be installed.
# As a downside, tracepoints are not as plentiful, and have to have been inserted at
# specifically interesting places in the kernel source. Luckily for us, there is a kfree_skb
# tracepoint which fires when kfree_skb is called, so we can pick up not only the location
# the call came from, but also the data passed as the first argument.
probe kernel.trace("kfree_skb") {
  # We pass the pointer to the data passed as the first argument to kfree_skb to our devname
  # function, where the data is typecast and deconstructed per above comments to get the
  # device name the packet/buffer was dropped from.
  dev_name = devname($skb)
  # We then record the drop in an array with two indices, the call location and the
  # dev name. This allows us to recall the devname recorded alongside a specific
  # caller of kfree_skb when printing results in the print_drops function.
  # $location contains the location in the kernel the function was called from.
  funcs[$location,dev_name] <<< 1
}

# This function loops through results and prints out a pretty display of packet drops per
# kernel line of code, per kernel module, per interface.
function print_drops() {
  # Loop over the locations in kernel and devname array
  foreach([loc,dev] in funcs) {
    # Print out a count of function location and device name pairs, along with the module name.
    # @count can count occurrences of one or more values in an array, in our case records where
    # unique location and device name pairs exist.
    # We also print out symname, which gets the calling function/symbol name.
    # We print out our device name (interface name) where the packets were dropped.
    # We then print out modname, which is the kernel module containing the calling code.
    printf("%d: %s@%s (%s)\n", @count(funcs[loc,dev]), symname(loc), dev, modname(loc));

  }
  # We then clean up the array so the next call will start with freshly gathered data
  delete funcs
}

# Every 5 seconds report our drop locations by registering a timer probe
probe timer.sec(5) {
  # Print out our combined output for the last 5 seconds
  print_drops()
}

Et voilà. The comments inline explain each step. Any questions, feel free to get in touch!

Accessing Kernel struct Data in Systemtap - Fri, Dec 29, 2017
