Background
During the OpenZFS summit last year (2016), Dan Kimmel and I quickly hacked together the zpool checkpoint command in ZFS, which allows reverting an entire pool to a previous state. Since it was just for a hackathon, our design was bare bones and our implementation far from complete. Around a month later, we had a new and almost complete design within Delphix and I was able to start the implementation on my own. I completed the implementation last month, and we’re now running regression tests, so I decided to write this blog post explaining what a storage pool checkpoint is, why we need it within Delphix, and how to use it.
Motivation
The Delphix product is basically a VM running DelphixOS (a derivative of illumos) with our application stack on top of it. During an upgrade, the VM reboots into the new OS bits and then runs some scripts that update the environment (directories, snapshots, open connections, etc.) for the new version of our app stack. Software being software, failures can happen at different points during the upgrade process. When an upgrade script that makes changes to ZFS fails, we have a corresponding rollback script that attempts to bring ZFS and our app stack back to their previous state. This is very tricky as we need to undo every single modification applied to ZFS (including dataset creation and renaming, or enabling new zpool features).
The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with exactly that. It can be thought of as a “pool-wide snapshot” (or a variation of extreme rewind that doesn’t corrupt your data). It remembers the entire state of the pool at the point that it was taken, and the user can later revert back to it or discard it. Its generic use case is an administrator who is about to perform a set of destructive actions to ZFS as part of a critical procedure. She takes a checkpoint of the pool before performing the actions, then rewinds back to it if one of them fails or puts the pool into an unexpected state; otherwise, she discards it. With the assumption that no one else is making modifications to ZFS, she basically wraps all these actions into a “high-level transaction”.
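To make this concrete, here is a minimal sketch of that pattern; the pool name, dataset names, and the particular destructive actions are hypothetical and only there for illustration:
$ zpool checkpoint mypool                      # "open" the high-level transaction
$ zfs destroy mypool/staging                   # ...the risky ZFS changes...
$ zfs rename mypool/data mypool/data-v2
If everything succeeded, “commit” by discarding the checkpoint:
$ zpool checkpoint --discard mypool
If something failed or the pool looks wrong, “abort” by rewinding the whole pool:
$ zpool export mypool
$ zpool import --rewind-to-checkpoint mypool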
In the Delphix product, the scenario is a specific case of the generic use case. We take a checkpoint during upgrade before performing any changes to ZFS and rewind back to it if something unexpected happens.
Usage
All the reference material on how to use zpool checkpoint is part of the zpool(1m) man page. That said, this section demonstrates most of its functionality with a simple example.
First we create a pool and some dummy datasets:
$ zpool create testpool c1t1d0
$ zfs create testpool/testfs0
$ zfs create testpool/testfs1
$ zpool list testpool
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
testpool 7.50G 194K 7.50G - - 0% 0% 1.00x ONLINE -
As you can see, the zpool list command has a new column called CKPOINT that is currently empty. Then we take a checkpoint and perform some destructive actions:
$ zpool checkpoint testpool
$ zfs destroy testpool/testfs0
$ zfs rename testpool/testfs1 testpool/testfs2
$ zfs list -r testpool
NAME USED AVAIL REFER MOUNTPOINT
testpool 109K 7.27G 23K /testpool
testpool/testfs2 23K 7.27G 23K /testpool/testfs2
$ zpool list testpool
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
testpool 7.50G 290K 7.50G 88.5K - 0% 0% 1.00x ONLINE -
$ zpool status testpool
pool: testpool
state: ONLINE
scan: none requested
checkpoint: created Sat Apr 22 08:46:54 2017, consumes 88.5K
config:
NAME STATE READ WRITE CKSUM
testpool ONLINE 0 0 0
c1t1d0 ONLINE 0 0 0
errors: No known data errors
After checkpointing the pool, destroying the first dataset and renaming the second one, we see that zpool list and zpool status have some new information for us. The CKPOINT column has now been populated and shows the same number as the one displayed by zpool status in the checkpoint row. This amount of space has been freed in the current state of the pool, but we keep it around because it is part of the checkpoint (so we can later rewind back to it if we choose to).
To take a look at the checkpointed state of the pool without actually rewinding to it, we can do the following:
$ zpool export testpool
$ zpool import -o readonly=on --rewind-to-checkpoint testpool
$ zfs list -r testpool
NAME USED AVAIL REFER MOUNTPOINT
testpool 129K 7.27G 23K /testpool
testpool/testfs0 23K 7.27G 23K /testpool/testfs0
testpool/testfs1 23K 7.27G 23K /testpool/testfs1
$ zpool export testpool
$ zpool import testpool
$ zfs list -r testpool
NAME USED AVAIL REFER MOUNTPOINT
testpool 115K 7.27G 23K /testpool
testpool/testfs2 23K 7.27G 23K /testpool/testfs2
To rewind the whole pool back to the checkpointed state, discarding any change that happened after it was taken:
$ zpool export testpool
$ zpool import --rewind-to-checkpoint testpool
$ zfs list -r testpool
NAME USED AVAIL REFER MOUNTPOINT
testpool 129K 7.27G 23K /testpool
testpool/testfs0 23K 7.27G 23K /testpool/testfs0
testpool/testfs1 23K 7.27G 23K /testpool/testfs1
To discard the checkpoint when it is not needed anymore:
$ zpool checkpoint --discard testpool
$ zpool list testpool
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
testpool 7.50G 289K 7.50G - - 0% 0% 1.00x ONLINE -
Caveats
Taking a checkpoint of the whole state of the pool and rewinding back to it is a heavy-weight solution, so some limits on its usage were required to ensure that users cannot introduce any unrecoverable errors to their storage pools.
Pools that have a checkpoint disallow any change to their vdev configuration besides the addition of new vdevs. Specifically, vdev attach/detach/remove, mirror splitting, and reguid are not allowed while you have a checkpoint. To see why, consider a pool where a rewind is attempted but the checkpointed config expects to find data on a vdev that was removed after the checkpoint was taken; such a pool could not be recovered. Similar reasoning applies to the other operations mentioned above. As for vdev add, although it is supported, the user has to re-add the device in the case of a rewind.
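As an illustration of these restrictions, trying to attach a mirror to our test pool while it still has a checkpoint would be rejected; a session would look roughly like this (c1t2d0 is a hypothetical second disk, and the exact error wording may differ between platforms and releases):
$ zpool checkpoint testpool
$ zpool attach testpool c1t1d0 c1t2d0
cannot attach c1t2d0 to c1t1d0: checkpoint exists
$ zpool checkpoint --discard testpool
$ zpool attach testpool c1t1d0 c1t2d0          # allowed once the checkpoint is gone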
Taking a checkpoint while a device is being removed is also disallowed. Again, the checkpoint references the whole state of the pool including the state of any ongoing operations. Device removal for top-level vdevs is a vdev config change that can span multiple transactions. It would be wrong to rewind back to the checkpoint and attempt to finish the removal of a device that has been removed already.
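Sketching the rejected direction as well (the device name and error text are, again, only a guess at what you would see):
$ zpool remove testpool c1t3d0
$ zpool checkpoint testpool
cannot checkpoint 'testpool': device removal in progress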
Another important note for pools that have a checkpoint is that scrubs do not traverse any checkpointed data that has been freed in the current state of the pool. Thus, errors in that data cannot be detected from the current state of the pool (zdb(1m) can actually detect them, but cannot repair them).
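If you do want to verify checkpointed blocks that a scrub would skip, zdb is the tool to reach for. Assuming a zdb build that supports the -k option for examining the checkpointed state (check the zdb(1m) man page on your platform), something along these lines could be used, with -cc adding checksum verification of the blocks it traverses:
$ zdb -k testpool           # examine the checkpointed state instead of the current one
$ zdb -kcc testpool         # additionally verify checksums while traversing it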
Finally, due to a technical reason that I will go through in a different post, dataset reservations may be unenforceable while a checkpoint exists and the pool is struggling to find space.
I hope that this post was helpful. Another post should be out soon with more technical details about how this feature is implemented; it will hopefully also explain why the above caveats currently exist. Feel free to email me with any questions or feedback.