Core Dump

ZFS Storage Pool Checkpoint
Posted on April 21, 2017.

Background

During the OpenZFS summit last year (2016), Dan Kimmel and I quickly hacked together the zpool checkpoint command in ZFS, which allows reverting an entire pool to a previous state. Since it was just for a hackathon, our design was bare bones and our implementation far from complete. Around a month later, we had a new and almost complete design within Delphix and I was able to start the implementation on my own. I completed the implementation last month, and we’re now running regression tests, so I decided to write this blog post explaining what a storage pool checkpoint is, why we need it within Delphix, and how to use it.

Motivation

The Delphix product is basically a VM running DelphixOS (a derivative of illumos) with our application stack on top of it. During an upgrade, the VM reboots into the new OS bits and then runs some scripts that update the environment (directories, snapshots, open connections, etc.) for the new version of our app stack. Software being software, failures can happen at different points during the upgrade process. When an upgrade script that makes changes to ZFS fails, we have a corresponding rollback script that attempts to bring ZFS and our app stack back to their previous state. This is very tricky as we need to undo every single modification applied to ZFS (including dataset creation and renaming, or enabling new zpool features).

The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with exactly that. It can be thought of as a “pool-wide snapshot” (or a variation of extreme rewind that doesn’t corrupt your data). It remembers the entire state of the pool at the point that it was taken, and the user can revert back to it later or discard it. Its generic use case is an administrator who is about to perform a set of destructive actions on ZFS as part of a critical procedure. She takes a checkpoint of the pool before performing the actions, then rewinds back to it if one of them fails or puts the pool into an unexpected state. Otherwise, she discards it. With the assumption that no one else is making modifications to ZFS, she basically wraps all these actions into a “high-level transaction”.
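
In shell terms, that pattern looks roughly like the sketch below. Note that run-destructive-actions.sh is a hypothetical placeholder for the risky steps; only the zpool commands are real and they are the same ones demonstrated later in this post:

# Minimal sketch of the “high-level transaction” pattern.
zpool checkpoint testpool                        # remember the current state of the pool
if ./run-destructive-actions.sh; then            # hypothetical script with the risky ZFS changes
        zpool checkpoint --discard testpool      # success: the checkpoint is no longer needed
else
        zpool export testpool                    # failure: rewind the whole pool back
        zpool import --rewind-to-checkpoint testpool
fi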

In the Delphix product, the scenario is a specific case of the generic use case. We take a checkpoint during upgrade before performing any changes to ZFS and rewind back to it if something unexpected happens.

Usage

All the reference material on how to use zpool checkpoint is part of the zpool(1m) man page. That said, this section demonstrates most of its functionality with a simple example.

First we create a pool and some dummy datasets:

$ zpool create testpool c1t1d0
$ zfs create testpool/testfs0
$ zfs create testpool/testfs1
$ zpool list testpool
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
testpool  7.50G   194K  7.50G        -         -     0%     0%  1.00x  ONLINE  -

As you can see, the zpool list command has a new column called CKPOINT that is currently empty. Then we take a checkpoint and perform some destructive actions:

$ zpool checkpoint testpool
$ zfs destroy testpool/testfs0
$ zfs rename testpool/testfs1 testpool/testfs2
$ zfs list -r testpool
NAME               USED  AVAIL  REFER  MOUNTPOINT
testpool           109K  7.27G    23K  /testpool
testpool/testfs2    23K  7.27G    23K  /testpool/testfs2
$ zpool list testpool
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
testpool  7.50G   290K  7.50G    88.5K         -     0%     0%  1.00x  ONLINE  -
$ zpool status testpool
  pool: testpool
 state: ONLINE
  scan: none requested
checkpoint: created Sat Apr 22 08:46:54 2017, consumes 88.5K
config:

        NAME        STATE     READ WRITE CKSUM
        testpool    ONLINE       0     0     0
          c1t1d0    ONLINE       0     0     0

errors: No known data errors

After checkpointing the pool, destroying the first dataset and renaming the second one, we see that zpool list and zpool status have some new information for us. The CKPOINT column is now populated and shows the same number as the checkpoint line of zpool status. This is the amount of space that has been freed in the current state of the pool but is still held on disk because the checkpoint references it (so we can later rewind back to it if we choose to).

To take a look at the checkpointed state of the pool without actually rewinding to it, we can do the following:

$ zpool export testpool
$ zpool import -o readonly=on --rewind-to-checkpoint testpool
$ zfs list -r testpool
NAME               USED  AVAIL  REFER  MOUNTPOINT
testpool           129K  7.27G    23K  /testpool
testpool/testfs0    23K  7.27G    23K  /testpool/testfs0
testpool/testfs1    23K  7.27G    23K  /testpool/testfs1
$ zpool export testpool
$ zpool import testpool
$ zfs list -r testpool
NAME               USED  AVAIL  REFER  MOUNTPOINT
testpool           115K  7.27G    23K  /testpool
testpool/testfs2    23K  7.27G    23K  /testpool/testfs2

To rewind the whole pool back to the checkpointed state, discarding any change that happened after it was taken:

$ zpool export testpool
$ zpool import --rewind-to-checkpoint testpool
$ zfs list -r testpool
NAME               USED  AVAIL  REFER  MOUNTPOINT
testpool           129K  7.27G    23K  /testpool
testpool/testfs0    23K  7.27G    23K  /testpool/testfs0
testpool/testfs1    23K  7.27G    23K  /testpool/testfs1

To discard the checkpoint when it is not needed anymore:

$ zpool checkpoint --discard testpool
$ zpool list testpool
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
testpool  7.50G   289K  7.50G        -         -     0%     0%  1.00x  ONLINE  -

Caveats

Taking a checkpoint of the whole state of the pool and rewinding back to it is a heavy-weight solution, so some limits on its usage were required to ensure that users cannot introduce unrecoverable errors to their storage pools. Pools that have a checkpoint disallow any change to their vdev configuration besides the addition of new vdevs. Specifically, vdev attach/detach/remove, mirror splitting, and reguid are not allowed while a checkpoint exists. Consider, for example, a vdev that is removed after the checkpoint is taken: a later rewind would expect to find data on a device that is no longer part of the pool, and the pool could not be recovered. Similar reasoning applies to the other operations mentioned above. As for vdev add, although it is allowed, a rewind brings back a configuration that does not include the new device, so the user has to add it again afterwards.
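
For example, something along the lines of the following would be expected to be rejected while the checkpoint exists (c1t2d0 and c1t3d0 are made-up device names for illustration, and the exact error output is not shown here):

$ zpool checkpoint testpool
$ zpool attach testpool c1t1d0 c1t2d0        # turn c1t1d0 into a mirror -- expected to fail
$ zpool remove testpool c1t3d0               # remove a top-level vdev -- expected to fail
$ zpool checkpoint --discard testpool        # once the checkpoint is discarded...
$ zpool attach testpool c1t1d0 c1t2d0        # ...the same operation can proceed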

Taking a checkpoint while a device is being removed is also disallowed. Again, the checkpoint references the whole state of the pool, including the state of any ongoing operations. Device removal of a top-level vdev is a vdev config change that can span multiple transactions, so it would be wrong to rewind back to the checkpoint and attempt to finish removing a device that has already been removed.

Another important note for pools that have a checkpoint is that scrubs do not traverse any checkpointed data that has been freed in the current state of the pool. Thus, errors in that data cannot be detected by scrubbing the current state of the pool (zdb(1m) can detect them, but cannot repair them).
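
As a rough illustration of what that means in practice:

$ zpool scrub testpool         # traverses only blocks reachable from the current state of the pool
$ zpool status -v testpool     # errors confined to checkpointed-but-freed blocks will not show up here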

Finally, due to a technical reason that I will go through in a different post, dataset reservations may be unenforceable while a checkpoint exists and the pool is struggling to find space.


I hope that this post was helpful. Another post should be out soon with more technical details about how this feature is implemented, which will hopefully also explain why the above caveats currently exist. Feel free to email me with any questions or feedback.