~ Introduction
At Delphix we’ve switched the operating system that powers our appliance from illumos to Linux. One of the main factors that contributed to this decision was that Linux is supported by nearly all hypervisors and cloud providers. This is important for us because it means that our System Platforms team doesn’t need to spend time implementing and maintaining drivers for each hypervisor. At the same time, though, the switch had its disadvantages. A big one was that Linux lacked a kernel debugger that supports the workflows of our engineering and support organizations. This post describes these workflows and how they were supported by MDB (the Modular Debugger) in illumos. It then surveys the current state of similar tooling in Linux and introduces SDB, short for the Slick Debugger, a project that we’ve been working on for the past few months.
~ Debugging Systems in Production - The Delphix Use Case and MDB
The Delphix Appliance is a VM that our customers run on-prem or in the public cloud. When something goes wrong on the VM, customers contact support and give our support engineers temporary access to the VM to diagnose and troubleshoot the issue. There are times, though, when things have gone far enough that support needs to reach out to engineering. Examples of such scenarios are kernel panics, slow or hung systems, and misbehaving device drivers. In such cases, if the system is still running, the engineer handling the escalation connects to it and starts to poke around, generally using MDB and DTrace. If the machine experienced something like a kernel panic and had to reboot, a crash dump is generated and collected by support for in-house analysis by engineering.
The actual details of handling such escalations can vary widely depending on the situation, which can sometimes be extreme. For example, the machine may enter a panic loop, in which case the kernel crashes early in boot, leaving no time for the user to do anything (including getting their hands on the crash dump of the previous panic). In such cases illumos can be configured to drop into KMDB exactly when the panic occurs, allowing our engineers to poke around the system through the console (KMDB is a version of MDB specifically designed for kernel execution control). Another way that we’ve dealt with this issue in the past for customers running on VMware ESX is to get a VMware snapshot and analyze that with MDB. For customers running on the cloud where we can’t get console access, our VMs are configured with an alternative boot environment that is activated once the bootloader detects that we’ve rebooted multiple times within a small period of time. That boot environment is a minimal system that our engineers can SSH into, mount the directory where the crash dumps reside, and analyze them. It is also used afterwards to place any patched binaries once we’ve root-caused and fixed the issue.
Reboots in general are used as a last resort. If the nature of the issue results in a stuck system that users can still SSH into (e.g. a deadlock in NFS), it is preferable to analyze the live system first. Only if the engineers cannot root-cause the issue right away and the customer’s productivity is severely affected do we generate a crash dump and potentially reboot. The reason we prefer debugging the live system first is that it is the only point in time when the engineer has access to the whole state of the system (i.e. both in-memory and on-disk state). Also, for issues like a slow-performing VM, the engineer may be able to fix the issue by just changing a kernel tunable or global variable on the live system through MDB. This completely avoids the reboot and does not interrupt the customer’s use of the VM.
All the above use cases show that MDB is a versatile tool that can be used in multiple ways to debug issues in the kernel. This versatility is one of the reasons it has become indispensable to our processes when debugging issues in production. The other reason is that MDB is, overall, a well-designed debugger.
~ What Makes a Good Kernel Debugger?
“The most effective debugging tool is still careful thought, coupled with judiciously placed print statements.” — Brian W. Kernighan
Debugging is an attitude. There is always a logical explanation for why a system is misbehaving, and it generally doesn’t take more than three or four good questions to root-cause the issue. When debugging a kernel panic, for example, the first question that we ask is “what happened and where?”, which can generally be answered by looking at the stack trace of the panicked thread. Then we pause, reflect, use backwards reasoning, and ask more questions to figure out why. A good debugger helps you get the answers to your questions quickly and effectively.
A good debugger gives you access to all the state in the system. It can print the stack traces of all the threads in the system together with their function arguments in each stack frame. It lets you easily walk complex data structures in memory or examine any location in any memory region. I personally agree with Brian Kernighan’s quote at the beginning of this section, but if we were to take the point about print statements literally, I don’t think it is applicable to debugging kernels in production (especially in an appliance setting). It would be a huge waste of time and resources (not to mention very error-prone) to add print statements, recompile, and deploy a whole kernel during an escalation. One could argue for placing smart logging during development and accessing that in production, but experience has shown us that it is hard to foresee what you’ll need in the future until it’s too late. A debugger that has access to all the state of a running system or a crash dump allows you to print anything you want on demand.
A good debugger presents data to you in a precise and readable format. To illustrate this point, I use an example from the ZFS code base. In ZFS, on disk block pointers are described by the following structure:
typedef struct blkptr {
    dva_t       blk_dva[SPA_DVAS_PER_BP]; /* Data Virtual Addresses */
    uint64_t    blk_prop;       /* size, compression, type, etc */
    uint64_t    blk_pad[2];     /* Extra space for the future */
    uint64_t    blk_phys_birth; /* txg when block was allocated */
    uint64_t    blk_birth;      /* transaction group at birth */
    uint64_t    blk_fill;       /* fill count */
    zio_cksum_t blk_cksum;      /* 256-bit checksum */
} blkptr_t;
Most debuggers can pretty-print structures in memory like this:
(blkptr_t){
    .blk_dva = (dva_t [3]){
        {
            .dva_word = (uint64_t [2]){ 4294967297, 8388625, },
        },
        {
            .dva_word = (uint64_t [2]){ 8589934593, 4194305, },
        },
        {
            .dva_word = (uint64_t [2]){ 1, 4194305, },
        },
    },
    .blk_prop = (uint64_t)9226476022604496903,
    .blk_pad = (uint64_t [2]){},
    .blk_phys_birth = (uint64_t)0,
    .blk_birth = (uint64_t)1101,
    .blk_fill = (uint64_t)120,
    .blk_cksum = (zio_cksum_t){
        .zc_word = (uint64_t [4]){ 48532393572, 4958534444355, 258227437375945, 9143840630644146, },
    },
}
This is nice, but the debugger could do better. Each dva_word of the blk_dva member is made up of two uint64_ts that together describe the device the block belongs to, the block’s offset within that device, and the block’s size. In addition, the blk_prop member is really a set of flags and fields packed into its bits. With that in mind, a good debugger can be taught to optionally print these kinds of structures succinctly like this:
DVA[0]=<1:100002200:200> // <--- this is blk_dva[0].dva_word
DVA[1]=<2:80000200:200>
DVA[2]=<0:80000200:200>
[L0 OBJSET] FLETCHER_4 LZ4 LE gang unique triple // <--- this is blk_prop
size=1000L/200P birth=1101L/1101P fill=120
cksum=b4cc18e64:4827faf2543:eadb42ad11c9:207c464cac1db2
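To give a sense of what “teaching” the debugger involves, here is a rough Python sketch of the decoding behind the DVA[n]=&lt;vdev:offset:size&gt; lines above. It follows ZFS’s on-disk DVA layout (the vdev in the upper bits of the first word, the allocated size in its lower 24 bits, the offset in the second word, with sizes counted in 512-byte sectors); the function name is just for illustration.
def decode_dva(word0, word1, shift=9):            # SPA_MINBLOCKSHIFT == 9
    vdev = (word0 >> 32) & ((1 << 24) - 1)        # bits 32..55 of word 0
    asize = (word0 & ((1 << 24) - 1)) << shift    # bits 0..23, in 512-byte sectors
    offset = (word1 & ((1 << 63) - 1)) << shift   # bits 0..62, in 512-byte sectors
    return "<%d:%x:%x>" % (vdev, offset, asize)

# The first DVA from the pretty-printed structure above:
print(decode_dva(4294967297, 8388625))            # prints <1:100002200:200>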
Pretty-printing structures is not the only thing to take away from this point. The concept of pretty-printing data is really helpful in a variety of other contexts. For example, in the MDB output below, a nice report is produced where the stack traces of all the threads in the system are aggregated and printed with a thread count:
> ::stacks
THREAD           STATE    SOBJ                COUNT
---------------------------------------------------
ffffff000d611c40 SLEEP    CV                    577
                 swtch+0x141
                 cv_wait+0x70
                 taskq_thread_wait+0xbe
                 taskq_thread+0x37c
                 thread_start+8

ffffff0382dbb440 SLEEP    CV                     36
                 swtch+0x141
                 cv_timedwait_sig_hires+0x39d
                 cv_timedwait_sig_hrtime+0x2a
                 nanosleep+0x18a
                 _sys_sysenter_post_swapgs+0x149

ffffff0382dbbb80 SLEEP    CV                     33
                 swtch+0x141
                 cv_wait_sig_swap_core+0x1b9
                 cv_wait_sig_swap+0x17
                 cv_waituntil_sig+0xbd
                 lwp_park+0x15e
                 syslwp_park+0x63
                 _sys_sysenter_post_swapgs+0x149

... <cropped> ...

ffffff000d755c40 SLEEP    CV                      1
                 swtch+0x141
                 cv_timedwait_hires+0xec
                 cv_timedwait+0x5c
                 l2arc_feed_thread+0xc6
                 thread_start+8

ffffff000d74fc40 SLEEP    CV                      1
                 swtch+0x141
                 cv_timedwait_hires+0xec
                 dbuf_evict_thread+0x10c
                 thread_start+8
Introspecting data from the system’s state, combining it, and presenting it in a form that helps answer the user’s questions is an essential ingredient of a good debugger. In order for the debugger to be able to print data in various representations though, someone needs to teach it how to gather and interpret the data for each case, which brings us to the next point.
A good debugger is easily extensible. A developer that’s working on a new feature should be able to extend the debugger with new commands to make it aware of the new feature, without having to recompile the actual debugger. MDB didn’t know from its inception how to print ZFS block pointers in the form that we showed above. We had to teach it how to interpret the blkptr struct. MDB has the concept of modules, where a developer can write C code that uses MDB’s API, compile it, and generate an object that can later be loaded by MDB, in effect extending the debugger without recompiling it. GDB developers took this idea slightly further and embedded a Python interpreter in GDB, allowing users to define new commands in a Python file or even on the fly during a debugging session.
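As a taste of what that looks like, here is a minimal sketch of a user-defined command written against GDB’s Python API. The command name and the fields it prints are made up for illustration; a real MDB or GDB module would decode far more (the DVAs, blk_prop, and so on).
import gdb

class BlkptrCmd(gdb.Command):
    """Hypothetical command: dump a few fields of a blkptr_t at an address."""

    def __init__(self):
        super().__init__("blkptr", gdb.COMMAND_DATA)

    def invoke(self, arg, from_tty):
        # Interpret the user-supplied address as a pointer to a blkptr_t.
        bp_type = gdb.lookup_type("blkptr_t").pointer()
        bp = gdb.Value(int(arg, 0)).cast(bp_type).dereference()
        gdb.write("birth = %d\n" % int(bp["blk_birth"]))
        gdb.write("fill  = %d\n" % int(bp["blk_fill"]))

BlkptrCmd()  # registering the command makes "blkptr <addr>" available at the prompt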
A good debugger is flexible and doesn’t get in your way. Before GDB allowed users to extend it on the fly with Python, it was basically just a prompt with a laundry list of commands. The user’s ability to examine their target was limited to whatever the available commands were, effectively limiting the number of questions the user could ask. Now that GDB has a Python API that can be used to access all the state of the target, that limitation is gone, but the experience is still not optimal. You can ask any question you want, as long as you are willing to write Python code to get the answer. Debugging is an activity that requires focus, and writing Python code during that time can break the user’s focus (not to mention frustrate them if they get their whitespace wrong and hit syntax errors). MDB found a sweet spot on that front by allowing you to chain commands together with pipes like a UNIX shell. The next section provides a quick demonstration of how a few simple commands can be combined through pipes to answer one-off questions about the system’s state.
~ A Quick Demonstration of MDB Pipes
To illustrate how powerful the concept of pipes in MDB is, let’s assume that we want to print the offsets and sizes of all the memory segments used by all processes on a live system. The relevant illumos structures, together with the members we care about, are shown below:
typedef struct proc {
    ... <cropped> ...
    struct as *p_as;        /* process address space pointer */
    ... <cropped> ...
} proc_t;

struct as {
    ... <cropped> ...
    avl_tree_t a_segtree;   /* segments in this address space. */
    ... <cropped> ...
}

struct seg {
    caddr_t s_base;         /* base virtual address */
    size_t  s_size;         /* size in bytes */
    ... <cropped> ...
};
Each process in the system is described by a proc_t. Each proc_t has a pointer to a struct as that describes its address space. Each address space has an a_segtree member, an AVL tree whose nodes are the address space’s segments, and each segment is described by a struct seg, which contains the two fields that we care about: s_base and s_size. So with the above in mind we start composing our pipeline. First of all, we need to go through all the proc_t structures, so we use the ::walk command and pass it proc as an argument (there are a lot of details that I skip here - the important part is understanding what’s happening, not the terminology).
> ::walk proc
0xfffffffffbc49080
0xffffff31f3f83020
0xffffff31f3f8c018
0xffffff31f6fd4048
0xffffff31fdc18020
0xffffff31f4b1f050
0xffffff32303d3050
0xffffff32aaece050
... <cropped> ...
So the above output shows the command printing all the pointers to all the proc_t structures on the system. If we wanted to print the contents of all of these structures we could pipe the output of this command to ::print and instruct MDB to print all those pointers as proc_t structures:
> ::walk proc |::print proc_t
{
    p_exec = 0
    p_as = kas
    p_lockp = p0lock
    p_crlock = {
        _opaque = [ 0 ]
    }
    p_cred = 0xffffff0373572db0
    ... <cropped> ...
}
{
    p_exec = 0xffffff03b8f68080
    p_as = 0xffffff0388b04d48
    p_lockp = 0xffffff03735b3440
    p_crlock = {
        _opaque = [ 0 ]
    }
    p_cred = 0xffffff0389781d38
    p_swapcnt = 0
    ... <cropped> ...
To continue our pipeline, we are only interested in the p_as member of each of these proc_t structures, which points to its respective address space structure (struct as). From those we are only interested in the a_segtree member, which is the AVL tree of all the address space segments of each process. To print that field for all the processes we continue our pipeline like this:
> ::walk proc |::print proc_t p_as->a_segtree
p_as->a_segtree = {
    p_as->a_segtree.avl_root = kvseg+0x20
    p_as->a_segtree.avl_compar = as_segcompar
    p_as->a_segtree.avl_offset = 0x20
    p_as->a_segtree.avl_numnodes = 0x9
    p_as->a_segtree.avl_size = 0x60
}
p_as->a_segtree = {
    p_as->a_segtree.avl_root = kvseg+0x20
    p_as->a_segtree.avl_compar = as_segcompar
    p_as->a_segtree.avl_offset = 0x20
    p_as->a_segtree.avl_numnodes = 0x9
    p_as->a_segtree.avl_size = 0x60
}
... <cropped> ...
So now we have access to all the AVL trees that contain all the segments of each process (MDB even pretty-printed them for us from the type info) and we want to go through all the nodes in those trees. We can do so by using the ::walk command again, specifying that we are walking an AVL tree this time:
> ::walk proc |::print proc_t p_as->a_segtree |::walk avl
0xfffffffffbc85280
0xfffffffffbc50160
0xfffffffffbc4e4e0
0xfffffffffbc51280
0xfffffffffbc50080
... <cropped> ...
Now all we need to do is to use ::print again to specify that these pointers are struct seg structures and instruct it to print the two fields that we are interested in - s_base and s_size:
> ::walk proc |::print proc_t p_as->a_segtree |::walk avl |::print struct seg s_base s_size
s_base = 0xfffffe0000000000
s_size = 0x200000000
s_base = 0xffffff0000000000
s_size = 0xd600000
s_base = 0xffffff000d600000
s_size = 0x80000000
s_base = 0xffffff008d600000
s_size = 0x29f400000
s_base = 0xffffff036ca00000
s_size = 0x4000000
... <cropped> ...
Voila! Now to make our initial question more like something that you would ask in the real world, let’s say that we are curious to know the size of the address space segment starting at offset 0xffffff036ca00000. The output of the above command is humongous and it is not ordered by s_base, which doesn’t help. Using the exclamation mark operator (the shell pipe operator in MDB terms) we can pipe the output of an MDB command as input to our UNIX shell, effectively continuing the MDB pipeline as a shell pipeline. In our example, we can use grep to locate 0xffffff036ca00000 and print the line after it, which should be the value of its s_size member:
> ::walk proc |::print proc_t p_as->a_segtree |::walk avl |::print struct seg s_base s_size ! grep 0xffffff036ca00000 -A 1
s_base = 0xffffff036ca00000
s_size = 0x4000000
~ State of Kernel Debuggers in Linux
In order to fill the MDB hole in our transition to Linux we were faced with three options: create a debugger from scratch, port MDB to Linux, or pick up an existing project and implement the functionality that we need on top of it. The trade-offs involved in this decision are numerous and include factors beyond technical ones, such as the deadline for our first release that would require a minimum viable product, or the benefits and risks of creating and maintaining a debugger ourselves. After surveying the landscape of debugging tools in the Linux world we decided that the best way forward for us was the third approach.
We first looked at vanilla GDB and KGDB. The main benefits of GDB are that it is a widely used debugger that works on almost any architecture out there and can be easily extended through its Python API. Unfortunately, plain GDB no longer works with a running kernel or with kernel crash dumps out of the box. The introduction of kernel address space layout randomization (KASLR) and various changes in how initial symbols are set up and accessed in the kernel have made it hard for GDB to correctly attach to any kernel target. KGDB deals with those issues for live systems, but it involves having a second system that uses GDB to attach to the first one in order to debug it. This is impossible given how we currently ship our product to our customers, and overall inflexible for production, especially for on-prem customers.
The next place that we looked was Dave Anderson’s crash utility. This tool consists of a layer that understands the kernel’s structure and an embedded version of GDB 7.6 to provide a familiar debugging experience. It can also debug multiple targets such as kernel crash dumps created by VMware snapshots or KVM. Unfortunately, it is not very easy to extend. It comes with a module system and a C API, but the API is bare and not well designed (it doesn’t give you easy access to things like type information, error handling is tricky, etc.). We did create a prototype that enabled the Python interpreter in the GDB version embedded in crash, but the interaction between the components is mostly one-directional (the crash layer accesses the GDB API but not the opposite), making it hard to request and interpret data from crash in the Python interpreter of GDB.
We then stumbled upon Jeff Mahoney’s crash-python. Jeff’s tool is a patched version of GDB that can read kernel crash dumps by leveraging Petr Tesarik’s libkdumpfile. It is overall well designed and put together. The downstream patch on GDB is a Python wrapper on top of GDB’s Target API, which is written in C, and is used to plug into libkdumpfile, which also exposes a Python API. Unfortunately, that patch has been sitting on the GDB mailing list for years without being merged, even though Jeff has been good at keeping it up to date with upstream changes. There is also the problem that crash-python doesn’t work with live systems. At Delphix, we did create a prototype library in Python that plugs into Jeff’s GDB Python Target API with some results, but there were still a few issues that would require us to change internal parts of GDB in order to implement a proper solution.
Around that time we came across Omar Sandoval’s drgn project (pronounced “dragon”). drgn is a tool that leverages a small C library (libdrgn) to enable the introspection of running kernels and vmcores from Python, either through the REPL or in the form of a script. Its API and object model are well designed, and its command execution speed (including startup) is quick. The tool’s approach and design are similar to Plan 9’s Acid debugger, but a defining difference is the choice of scripting language. Instead of inventing its own language, drgn leverages the Python runtime, which means that drgn scripts can use any third-party library from Python’s ecosystem. For example, a drgn script could import plotly and use it to generate useful visualizations of data introspected from the kernel.
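To give a flavor of what this looks like, here is a small example of the kind of snippet one can run in the drgn REPL against a live kernel (prog is the program object that the drgn CLI provides; the aggregation at the end is just an illustration of mixing drgn with ordinary Python):
# Print the PID and command name of every task on the system,
# then count how many tasks share each command name.
import collections
from drgn.helpers.linux.pid import for_each_task

counts = collections.Counter()
for task in for_each_task(prog):
    comm = task.comm.string_().decode()
    print(task.pid.value_(), comm)
    counts[comm] += 1

for comm, count in counts.most_common(5):
    print(f"{count:5d} {comm}")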
While drgn is still a young project that currently lacks certain features (e.g. function arguments in stack traces), it checks most of the boxes on our checklist. Thus, we decided to use it as our base for kernel debugging on Linux. Omar has been open to accepting our changes, and the community around the project, though small, has been growing.
~ The Slick Debugger
Based on our criteria of what makes a good debugger, drgn is still not an optimal solution in terms of user experience. The user needs to either write one-off Python scripts or define functions and loops in the Python REPL. This is why we decided to add another layer on top of it, which we ended up calling SDB (short for Slick Debugger). SDB is a debugger written in Python that leverages the drgn API to provide a debugging experience similar to MDB. In simple terms, a drgn user can leverage the constructs provided by SDB to make their one-off drgn scripts reusable, plugging the output of one script in as the input to another.
In MDB, the currency moved through pipes between commands was integer values and pointers. In SDB it is drgn objects, and this is what makes it “slick”. drgn objects come with their own context, such as type information, allowing commands to be smarter and do more work for the user. To demonstrate, in the MDB example above we constructed the following pipeline:
> ::walk proc |::print proc_t p_as->a_segtree |::walk avl |::print struct seg s_base s_size ! grep 0xffffff036ca00000 -A 1
In SDB the same pipeline would look something like this:
> procs | member p_as->a_segtree | walk | cast struct seg | filter obj.s_base == 0xffffff036ca00000 | member s_size
Note the following differences:
- procs returns the list of all processes - no need for the ::walk <arg> command.
- In SDB we pass objects together with their type, so we don’t need to type-cast as we do with ::print.
- The walk command in SDB chooses the right walker for the user, given the type of the object it is passed.
- A basic filter command is easy to implement in SDB, allowing the user to avoid hacky text parsing with grep.
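To make the idea concrete, here is a rough sketch of the concept behind SDB commands (this is not SDB’s actual API, just an illustration): each command is a small piece of Python that consumes a stream of drgn objects and yields drgn objects, so commands compose like stages of a shell pipeline.
from typing import Callable, Iterable, Iterator
import drgn

def cast_cmd(objs: Iterable[drgn.Object], type_name: str) -> Iterator[drgn.Object]:
    # Reinterpret every incoming object as the given type.
    for obj in objs:
        yield drgn.cast(type_name, obj)

def filter_cmd(objs: Iterable[drgn.Object],
               pred: Callable[[drgn.Object], bool]) -> Iterator[drgn.Object]:
    # Pass through only the objects for which the predicate holds.
    return (obj for obj in objs if pred(obj))

# Chained together, the stages mirror "... | cast struct seg | filter ... | member s_size":
# segs = cast_cmd(walked_nodes, "struct seg *")
# match = filter_cmd(segs, lambda s: s.s_base == 0xffffff036ca00000)
# for seg in match:
#     print(seg.s_size)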
The code for SDB can be found in our GitHub repo.
Just like drgn, it is a young project and so we are still in the process of rewriting some parts, dealing with rough edges, and adding more features. While still in early stages, it’s been used with some success internally and currently ships with all our new internal VMs. I am planning to post more about the development of SDB, including a tutorial for using it and writing new commands, so stay tuned!
~ Appendix A: Summarized Timeline and Hidden Figures
The journey to finding an MDB replacement in the Linux world has been long, with a lot of twists and dead ends along the way, and there is still a lot of work to be done. That said, it is safe to say that we are at the point where we just need to sit down and do the work. Getting here wouldn’t have been possible without the team effort and coordination of multiple individuals, including a few external to Delphix.
Prakash Surya and Don Brady were instrumental in constructing an MDB prototype in illumos that can read Linux crash dumps. Robert Mustacchi was also helpful by letting us pick his brain when it came to CTF internals. Prakash and Tom Caputi put in the work and created a crash-python prototype that can successfully analyze live systems. Matt Ahrens and Paul Dagnelie implemented pipes in the first SDB prototype, which was based on top of crash-python. Once that was in place, multiple people started contributing commands and submitting feedback for SDB - in alphabetical order: Don Brady (blkptr), Paul Dagnelie (zfs_dbgmsg), John Gallagher (stacks), Sara Hartse (stacks), Prashanth Sreenivasa (arcstat), George Wilson (spa & vdev), and Pavel Zakharov. Pavel was also the illustrator of the first SDB logo displayed at the end of this post. Omar Sandoval, the creator of drgn, was very open and helpful with his feedback when we tried to get our patches upstreamed to drgn. Once the new SDB+drgn prototype proved viable, Prakash single-handedly ported all of our existing code on top of drgn. Once that was done and we were ready to create our open source repo, Karyn Ritter helped us establish a license for SDB through the proper procedures. Matt Ahrens, Sebastian Roy, Eric Shrock, and George Wilson were our advocates within the System Platforms leadership throughout this time, convincing upper management that this project was worth pursuing. To all of the above people, thank you for all your work, advice, and feedback. I’m excited to see the things that we’ll do next!