Core Dump

Libdmem
Posted on January 10, 2020.

At Delphix we deal with userland core dumps as much as kernel crash dumps. The majority of these userland dumps are generated by memory-related issues - a Java process hitting an OutOfMemoryError, for example. In such cases we generally use VisualVM or EclipseMAT to analyze memory consumption by class, object, etc., then look at the number of threads within the process, and if that's not enough we dig into the native heap. My former colleague, Dan Kimmel, wrote an excellent post on these memory-related issues and how we deal with them in illumos/DelphixOS. Most of the insights in his post carry over to Linux as-is, except the analysis of the native heap.

libumem (a library that can be thought of as the kernel slab allocator brought to userland in illumos) was built with debuggability in mind: it can optionally keep extra metadata in memory so that mdb (the illumos debugger) can print useful information when debugging memory problems such as leaks and memory exhaustion. At Delphix, we'll soon be figuring out the procedures and tools we'll use to deal with such issues on Linux, which I believe will look something like this:

  1. Look at what existing memory allocators, like glibc's malloc and jemalloc, offer in terms of debugging infrastructure.

  2. Consider how appropriate the existing Linux ports of illumos libumem, or Bonwick's newer libvmem from DSSD, are for our use case.

  3. Once these have been examined, pick the one that gives the best trade-off in terms of performance, debuggability, community involvement, and maintenance burden. Then write SDB commands for it and potentially attempt to contribute any missing functionality upstream.

While the above is most probably what will end up happening, there is always the option of creating a solution from scratch, an option that I believe should always be considered. To be pragmatic, I don't think that writing a memory allocator from scratch just for our use case is worthwhile given our current resources and constraints, regardless of how fun it may sound at first. On the other hand, attempting to implement just the debugging facilities that we are missing, in an allocator-agnostic way, may not be as far-fetched.

With the above idea in mind, I spent a few hours between Christmas Eve and New Year's writing libdmem. The library is experimental (and will probably stay that way) and provides the bare minimum in under 500 lines of actual C code. It works by interposing its own malloc() and friends on top of the actual implementations (tested on top of glibc and jemalloc so far), and for each allocation it asks for a few more bytes (~112 bytes currently) to store its metadata. The metadata currently consists of a pointer to the thread that made the allocation, 10 pointers that record the program counters at the top of the stack, and 2 pointers that allow the segment to be added to a doubly-linked list.
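
To make that layout concrete, here is a minimal sketch of what such a per-allocation header and an interposed malloc() could look like. None of this is libdmem's actual code: the names (dmem_hdr_t, dmem_head, real_malloc), the field layout, and the locking are illustrative assumptions, and real_malloc is assumed to have been pointed at the underlying allocator's malloc() elsewhere (more on that below).

#include <execinfo.h>   /* backtrace() */
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define DMEM_STACK_DEPTH 10

typedef struct dmem_hdr {
    struct dmem_hdr *dh_next;             /* doubly-linked list of live allocations */
    struct dmem_hdr *dh_prev;
    pthread_t dh_thread;                  /* thread that made the allocation */
    void *dh_stack[DMEM_STACK_DEPTH];     /* program counters at allocation time */
} dmem_hdr_t;                             /* 13 pointers on 64-bit; a real header would
                                             also pad so that the pointer returned to
                                             the caller stays suitably aligned */

static dmem_hdr_t *dmem_head;             /* all outstanding allocations */
static pthread_mutex_t dmem_lock = PTHREAD_MUTEX_INITIALIZER;

/* Assumed to be set up elsewhere, e.g. via dlsym(RTLD_NEXT, "malloc"). */
extern void *(*real_malloc)(size_t);

void *
malloc(size_t size)
{
    /* Ask the real allocator for the caller's bytes plus our header. */
    dmem_hdr_t *hdr = real_malloc(sizeof (dmem_hdr_t) + size);
    if (hdr == NULL)
        return (NULL);

    hdr->dh_thread = pthread_self();
    memset(hdr->dh_stack, 0, sizeof (hdr->dh_stack));
    backtrace(hdr->dh_stack, DMEM_STACK_DEPTH);   /* can itself allocate on first
                                                     use, so real code needs a
                                                     re-entrancy guard */

    /* Link the new segment into the global list. */
    pthread_mutex_lock(&dmem_lock);
    hdr->dh_prev = NULL;
    hdr->dh_next = dmem_head;
    if (dmem_head != NULL)
        dmem_head->dh_prev = hdr;
    dmem_head = hdr;
    pthread_mutex_unlock(&dmem_lock);

    /* The caller's memory starts right after the header. */
    return (hdr + 1);
}

In this scheme, free() would do the inverse, unlinking the header and handing it back to the real allocator, and calloc()/realloc() would follow the same pattern.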

I've successfully run standard Linux utilities (ls, cat, etc.) and a few more complex programs like vim, wget, and python3 with libdmem interposed on top of glibc and jemalloc. I've also generated core dumps from these runs and ensured that I could walk the linked lists and print the stacks. For a quick demonstration, consider the driver program below:

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    char *p = malloc(512);
    for (int i = 0; i < 511; i++)
        p[i] = 'A' + (i % 26);
    p[511] = '\0';
    printf("%s\n", p);
    free(p);
    /* leak! */
    p = malloc(2048);
    for (int i = 0; i < 2047; i++)
        p[i] = 'A' + (i % 26);
    p[2047] = '\0';
    printf("%s\n", p);
    return 0;
}

We run the program and interpose libdmem on it like this:

$ DMEM_OPTS=trace-stderr,log-allocs,abort-shutdown LD_PRELOAD=./libdmem.so ./driver

Because we've enabled trace-stderr, all calls to malloc() and friends are printed as they happen:

malloc(512) = 0x557abe66f2d0  // <-- first malloc() call
malloc(1024) = 0x557abe66f550 // <-- printf() allocation
ABCDEFGH...<cropped>...
free(0x557abe66f2d0)
malloc(2048) = 0x557abe66f9d0 // <-- second malloc() call
ABCDEFGH...<cropped>...

After generating a core dump at the end of the program run, we can use a rudimentary gdb command that I wrote in Python, which goes through all the leaked allocations and shows the stack trace of each:

$ gdb driver /var/crash/core.driver.29714.1577846611
(gdb) source ./libdmem-gdb.py
(gdb) show_alloc_stacks
0x557abe66f960 allocated from 0x7f7a52ec8740 at:
        malloc+287
        main+1656
        __libc_start_main+231
        _start+42
0x557abe66f4e0 allocated from 0x7f7a52ec8740 at:
        malloc+287
        __GI__IO_file_doallocate+140
        __GI__IO_doallocbuf+121
        _IO_new_file_overflow+408
        _IO_new_file_xsputn+189
        _IO_puts+207
        main+1638
        __libc_start_main+231
        _start+42

The first stack trace is from the leak in our C program (the 2048-byte allocation). The second one is not an actual leak: that allocation happened as part of the first printf() call and is actually freed after our program exits, as glibc gets unloaded. Unfortunately, I wasn't able to get a core dump at exactly that point, so the allocation shows up in the above report.
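
Conceptually, all that show_alloc_stacks has to do is start from the list head saved in the core image, walk the headers, and symbolize the recorded program counters. The actual script does this from Python over the core file, but reusing the hypothetical names from the earlier sketch, an in-process version of the same walk could look roughly like this:

#include <execinfo.h>   /* backtrace_symbols_fd() */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

void
dmem_show_alloc_stacks(void)
{
    pthread_mutex_lock(&dmem_lock);
    for (dmem_hdr_t *hdr = dmem_head; hdr != NULL; hdr = hdr->dh_next) {
        /* Count the recorded frames; unused slots were zeroed at allocation. */
        int frames = 0;
        while (frames < DMEM_STACK_DEPTH && hdr->dh_stack[frames] != NULL)
            frames++;

        printf("%p allocated from %#lx at:\n",
            (void *)(hdr + 1), (unsigned long)hdr->dh_thread);
        fflush(stdout);
        backtrace_symbols_fd(hdr->dh_stack, frames, STDOUT_FILENO);
    }
    pthread_mutex_unlock(&dmem_lock);
}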

I've learned a lot writing this library, and I've tried to document many of those lessons in the comments of the code (e.g. interposing malloc() correctly while keeping a reference to the underlying malloc() implementation is not as easy as it sounds, and LD_DEBUG is invaluable for debugging issues with dynamic loading of shared objects). If I ever get the time, I'd be interested to try out libdmem on one of our performance machines at Delphix to see what slowdown and memory overhead it induces for our workloads.
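
On that first point, the tricky part is that the interposer needs a handle on the underlying malloc(), which on Linux you would normally grab with dlsym(RTLD_NEXT, "malloc"); but dlsym() can itself allocate, re-entering the very functions being interposed before that handle exists. Here is a sketch of one common workaround, again hypothetical rather than libdmem's actual bootstrap code: serve those early, re-entrant requests out of a small static buffer.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <stdlib.h>

static void *(*real_malloc)(size_t);

/* Backs allocations made by dlsym() itself during bootstrap. */
static char bootstrap_buf[4096] __attribute__((aligned(16)));
static size_t bootstrap_used;

static void *
bootstrap_alloc(size_t size)
{
    size = (size + 15) & ~(size_t)15;   /* keep returned pointers aligned */
    if (bootstrap_used + size > sizeof (bootstrap_buf))
        abort();
    void *p = bootstrap_buf + bootstrap_used;
    bootstrap_used += size;
    return (p);
}

void *
malloc(size_t size)
{
    if (real_malloc == NULL) {
        static int initializing;
        if (initializing)               /* dlsym() re-entered us */
            return (bootstrap_alloc(size));
        initializing = 1;
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
        initializing = 0;
    }
    /* A real interposer would prepend its metadata header here. */
    return (real_malloc(size));
}

Even this leaves wrinkles to deal with: glibc's dlsym() tends to allocate through calloc() rather than malloc(), free() has to recognize pointers that came out of the bootstrap buffer, and the guard above is not thread-safe as written.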