
ZFS Flamegraph Analysis Example
Posted on August 18, 2018.

Doing sequential reads on a working set that is cached in the ARC should yield the best-case read performance in ZFS, as disk accesses are avoided entirely. An experiment like this can uncover CPU bottlenecks in the read codepath, from the moment the read is issued, down to the ARC access, and back. sequential_reads_arc_cached.ksh, which is used to test for performance regressions, does exactly that.
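
For context, the benchmark's access pattern boils down to sequentially reading a file, whose blocks are already cached in the ARC, in fixed-size chunks. A minimal userspace sketch of that pattern could look like the following (the file path and read size are placeholders for illustration, not the benchmark's actual parameters):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	static char buf[128 * 1024];	/* read size: placeholder value */
	int fd = open("/testpool/fs/file", O_RDONLY);	/* placeholder path */
	ssize_t n;

	if (fd == -1) {
		perror("open");
		return (1);
	}

	/*
	 * Read the whole file sequentially. On a working set that fits
	 * in the ARC, every read should be satisfied from memory.
	 */
	while ((n = read(fd, buf, sizeof (buf))) > 0)
		;

	if (n == -1)
		perror("read");
	(void) close(fd);
	return (0);
}
```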

An interesting detail of this test is that running it on a system that has just booted consistently gives better results the first time than on subsequent runs. On one of my test machines I’ve seen 10950 MB/s on the first run drop to 8689 MB/s on subsequent runs. The reason is that, on the first run on a freshly booted machine, the dbuf cache is being populated for the first time and there is no need to evict anything. Subsequent runs, by contrast, need to evict entries from the dbuf cache in order to cache their own dbufs.
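
The effect is easy to model: inserting into a cache that still has free space is just an insert, while inserting into a full cache has to pay for an eviction first. The toy model below illustrates the extra work subsequent runs take on (the capacity and eviction logic are made up for illustration and bear no relation to the actual dbuf cache implementation):

```c
#include <stdio.h>

#define CACHE_CAPACITY	4	/* made-up capacity for the toy model */

static int cache_entries;

/* Stand-in for the per-entry work done by the dbuf-eviction thread. */
static void
evict_one(void)
{
	cache_entries--;
}

static void
cache_insert(int id)
{
	if (cache_entries == CACHE_CAPACITY) {
		/* Full cache: every insert now also pays for an eviction. */
		evict_one();
		printf("insert %d: had to evict a victim first\n", id);
	} else {
		/* Cache still filling up: inserts need no eviction work. */
		printf("insert %d: no eviction needed\n", id);
	}
	cache_entries++;
}

int
main(void)
{
	/* First run on a fresh machine: populating an empty cache. */
	for (int i = 0; i < CACHE_CAPACITY; i++)
		cache_insert(i);
	/* Subsequent runs: the same inserts, each preceded by an eviction. */
	for (int i = CACHE_CAPACITY; i < 2 * CACHE_CAPACITY; i++)
		cache_insert(i);
	return (0);
}
```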

While root-causing the issue with Matt and analyzing the flamegraphs from the different runs, I decided to take some notes and share them, as I believe this information can be of value to others.

The Flamegraph

Flamegraph of the cached sequential reads run. An interactive version can be found [here](/img/flame/sequential_arc_cached_reads/stock-1m-64.svg).

The bottom of the flamegraph indicates three things:

  • The CPU is never idle.
  • The majority of the time is spent in the read codepath.
  • Some portion of the CPU time is spent on the dbuf-eviction thread. This part indicates how much time ZFS spends evicting dbufs.

zfs_read

Under zfs_read(), time is spent in three places. The majority of it is spent in dmu_buf_hold_array_by_dnode(), where we read the actual blocks from the ARC. Then there is uiomove(), which copies the data that was read into buffers that will eventually be returned to userspace. uiomove() is a good indicator of the work being done (the bigger the percentage of time spent there, the more IOPS we get in general). Finally, a small amount of time is spent releasing the buffers ZFS was holding.
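
The overall shape of this flow can be summarized with a small compilable model (every name and signature below is a hypothetical stand-in; the real OpenZFS routines look different and do far more work):

```c
#include <string.h>

#define NBUFS	4
#define BUFSZ	16

static char arc_blocks[NBUFS][BUFSZ];	/* pretend ARC-cached blocks */

/* Stand-in for dmu_buf_hold_array_by_dnode(): hold the cached blocks. */
static void
hold_bufs_from_arc(char **bufs, int nbufs)
{
	for (int i = 0; i < nbufs; i++)
		bufs[i] = arc_blocks[i];
}

/* Stand-in for uiomove(): copy a held buffer to the caller's memory. */
static void
copy_to_caller(char *dst, const char *src)
{
	memcpy(dst, src, BUFSZ);
}

/* Stand-in for releasing the holds taken above. */
static void
release_bufs(char **bufs, int nbufs)
{
	for (int i = 0; i < nbufs; i++)
		bufs[i] = NULL;
}

int
main(void)
{
	char *held[NBUFS];
	char out[NBUFS][BUFSZ];

	/* 1. The bulk of the time: look the blocks up in the ARC and hold them. */
	hold_bufs_from_arc(held, NBUFS);
	/* 2. The uiomove() phase: copy data to the buffers handed back to userspace. */
	for (int i = 0; i < NBUFS; i++)
		copy_to_caller(out[i], held[i]);
	/* 3. The small remaining slice: release the buffers that were held. */
	release_bufs(held, NBUFS);
	return (0);
}
```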

Zoomed-in view of the read codepath.

Inside dmu_buf_hold_array_by_dnode() there are two paths that call into dbuf_read() to read buffers from the ARC. The one called directly from dmu_buf_hold_array_by_dnode() gets the data that uiomove() will copy to the buffers returned to userspace. This is the data that the application issuing the read cares about.

The other path to dbuf_read() is through dbuf_hold(). In general, these calls indicate time spent reading indirect blocks. Since the experiment recorded by the flamegraph was doing only sequential reads, it is not surprising that the amount of time spent reading indirect blocks is minimal.
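
A toy model makes that concrete: if each indirect block covers many consecutive data blocks, a sequential scan only rarely takes the dbuf_hold() path to a new indirect block, while every data block still goes through the direct dbuf_read() path. The fan-out and names below are made up for illustration, not taken from OpenZFS:

```c
#include <stdio.h>

#define BLKS_PER_INDIRECT	128	/* made-up fan-out */

static long data_reads, indirect_reads;
static long cached_indirect = -1;

/*
 * Stand-in for the dbuf_hold() path: make sure the parent indirect
 * block of data block `blkid` is in memory before the data block
 * itself can be looked up.
 */
static void
hold_indirect(long blkid)
{
	long indirect = blkid / BLKS_PER_INDIRECT;

	if (indirect != cached_indirect) {
		indirect_reads++;	/* dbuf_read() on an indirect block */
		cached_indirect = indirect;
	}
}

/* Stand-in for the direct dbuf_read() call on the data block itself. */
static void
read_data_block(long blkid)
{
	hold_indirect(blkid);
	data_reads++;
}

int
main(void)
{
	/* Sequential scan over 1024 consecutive data blocks. */
	for (long blkid = 0; blkid < 1024; blkid++)
		read_data_block(blkid);

	/* Expect 1024 data reads but only 1024/128 = 8 indirect reads. */
	printf("data reads: %ld, indirect reads: %ld\n",
	    data_reads, indirect_reads);
	return (0);
}
```

A random-read workload would reuse the cached indirect block far less often, which is why the dbuf_hold() slice of a flamegraph tends to grow as the access pattern becomes less sequential.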