Oct 31, 2005 - ZFS: The Last Word in Filesystems
Boo!
Halloween has been a special day for ZFS since its inception.
On 10/31/2001, we got the user-level prototype working.
On 10/31/2002, we got the first in-kernel mount.
And today, 10/31/2005, we integrated into Solaris. ZFS will hit the street in a couple of weeks via Solaris Express.
We (the ZFS team) will have much more to say about ZFS in the coming weeks. But tonight, I just want to tell you what it was like to drive this thing home.
The ZFS team is distributed: we have people working in Menlo Park, California; Broomfield, Colorado; and several other places. This week, we flew everyone in to Menlo Park and took over a giant conference room.
The first thing you notice is the heat. These rooms are made for 100 watt humans, not multi-gigahertz CPUs and associated paraphernalia. And, like any big company, Sun is all about saving money in the dumbest ways – like turning off the A/C at night, and outsourcing the people who could possibly turn it back on.
At first, things went pretty well. We comandeered the Nob Hill conference room, which has a long table and lots of power and network taps. We brought in a bunch of machines and created ‘Camp ZFS’. Each new box amped up the heat level, to the point that it became difficult to think. So we grabbed a 36-inch fan from one of the labs to get the air flowing. That was a huge help, although it sounded like you were wearing a pair of lawn mowers as headphones.
On Sunday, we plugged in one more laptop. That was it – we blew the circuit breaker! So here we are, less than 24 hours from our scheduled integration date, and all of our computers are without power – and the room containing the circuit breakers was locked. (Thanks, OSHA!)
So we took a three-pronged approach: (1) went through the Approved Process to get power restored (ETA: April); (2) hunted down someone from campus security to get us key access to the electrical room (ETA: Sunday night); and (3) sent our manager to Home Depot to buy a bunch of 100-foot extension cords so we could, if necessary, run Nob Hill off of the adjacent lab’s power grid (ETA: 30 minutes).
All three came through. We ran half of the load over extension cords to the lab, the other half on the Nob Hill circuit. It took a bit of experimentation to find a load balance that would stay up without tripping the breaker again. Apparently, we had angered it. (Even now, I’m typing this blog entry courtesy of a shiny new yellow extension cord.)
Meanwhile, the clock was ticking.
At the end of a large project like this, it’s never the technology that kills you – it’s the process, the cleanup of home-grown customizations, the packaging, the late-breaking code review comments, the collapsing of SCCS deltas, stuff like that. With power back up, we slogged on until about 4AM. Everything looked good, so we went home to sleep. Actually, some people just crashed on the couches in my office, and Bill’s office next door.
By 10AM Monday we were back, making sure that all the final tests had run successfully, and working through the last bits of paperwork with the Solaris release team. After five years of effort, it was time to type ‘putback’.
Denied! The permissions on a directory were wrong.
Fix them up, try again.
Denied! One more TeamWare directory with wrong permissions. Fine, fix that too.
Last try… and there it goes! 584 files, 92,000 lines of change, 56 patents, 5 years… and there it is. Just like that.
Fortunately we were prepared for either success or failure. We had brought in a massive array of vodka, tequila, wine… you name it, we had it. And not in small quantities.
As I said at the beginning, we’ll have lots more to say in the coming days. But right now, it’s time to sleep!
Jeff
Dec 12, 2005 - SEEK_HOLE and SEEK_DATA for sparse files
A sparse file is a file that contains much less data than its size would suggest. If you create a new file and then do a 1-byte write to the billionth byte, for example, you’ve just created a 1GB sparse file. By convention, reads from unwritten parts of a file return zeroes.
File-based backup and archiving programs like tar, cpio, and rsync can
detect runs of zeroes and ignore them, so that the archives they produce aren’t
filled with zeroes. Still, this is terribly inefficient.  If you’re a backup
program, what you really want is a list of the non-zero segments in the file.
But historically there’s been no way to get this information without looking at
filesystem-specific metadata.
Desperately seeking segments
As part of the ZFS project, we introduced two general extensions to lseek(2):
SEEK_HOLE and SEEK_DATA. These allow you to quickly discover the non-zero
regions of sparse files. Quoting from the new man page:
* If whence is SEEK_HOLE, the offset of the start of  the next  hole greater
  than or equal to the supplied offset is returned. The definition of a hole is
provided  near the end of the DESCRIPTION.
*  If whence is SEEK_DATA, the file pointer is set to  the start  of the next
   non-hole file region greater than or equal to the supplied offset.  [...]
   A "hole" is defined as a contiguous  range  of  bytes  in  a file,  all
   having the value of zero, but not all zeros in a file are guaranteed to be
   represented as holes returned with SEEK_HOLE. Filesystems are allowed to expose
   ranges of zeros with SEEK_HOLE, but not required to.  Applications  can  use
   SEEK_HOLE  to  optimise  their behavior for ranges of zeros, but must not
   depend on it to find all such ranges in a file.  The  existence  of  a  hole
   at the end of every data region allows for easy programming and implies that a
   virtual  hole exists  at  the  end  of  the  file. Applications should use
   fpathconf(_PC_MIN_HOLE_SIZE) or  pathconf(_PC_MIN_HOLE_SIZE) to  determine if a
   filesystem supports SEEK_HOLE. See fpath- conf(2).  For filesystems that do not
   supply information about  holes, the file will be represented as one entire
   data region.
Any filesystem can support SEEK_HOLE/SEEK_DATA. Even a filesystem like UFS,
which has no special support for sparseness, can walk its block pointers much
faster than it can copy out a bunch of zeroes.  Even if the search algorithm is
linear-time, at least the constant is thousands of times smaller.
Sparse file navigation in ZFS
A file in ZFS is a tree of blocks. To maximize the performance of SEEK_HOLE
and SEEK_DATA, ZFS maintains in each block pointer a fill count indicating
the number of data blocks beneath it in the tree (see below). Fill counts
reveal where holes and data reside, so that ZFS can find both holes and data in
guaranteed logarithmic time.
  L4                           6
              -----------------------------------
  L3          5                0                1
         -----------      -----------      -----------
  L2     3    0    2                       0    0    1
        ---  ---  ---                     ---  ---  ---
  L1    111       101                               001
An indirect block tree for a sparse file, showing the fill count in each block pointer. In this example there are three block pointers per indirect block. The lowest-level (L1) block pointers describe either one block of data or one block-sized hole, indicated by a fill count of 1 or 0, respectively. At L2 and above, each block pointer’s fill count is the sum of the fill counts of its children. At any level, a non-zero fill count indicates that there’s data below. A fill count less than the maximum (3L-1 in this example) indicates that there are holes below. From any offset in the file, ZFS can find the next hole or next data in logarithmic time by following the fill counts in the indirect block tree.
Portability
At this writing, SEEK_HOLE and SEEK_DATA are Solaris-specific.  I encourage
(implore? beg?) other operating systems to adopt these lseek(2) extensions
verbatim (100% tax-free) so that sparse file navigation becomes a ubiquitous
feature that every backup and archiving program can rely on.  It’s long
overdue.
May 2, 2006 - Smokin’ Mirrors
Resilvering – also known as resyncing, rebuilding, or reconstructing – is the process of repairing a damaged device using the contents of healthy devices. This is what every volume manager or RAID array must do when one of its disks dies, gets replaced, or suffers a transient outage.
For a mirror, resilvering can be as simple as a whole-disk copy. For RAID-5 it’s only slightly more complicated: instead of copying one disk to another, all of the other disks in the RAID-5 stripe must be XORed together. But the basic idea is the same.
In a traditional storage system, resilvering happens either in the volume manager or in RAID hardware. Either way, it happens well below the filesystem.
But this is ZFS, so of course we just had to be different.
In a previous post I mentioned that RAID-Z resilvering requires a different
approach, because it needs the filesystem metadata to determine the RAID-Z
geometry. In effect, ZFS does a 'cp -r' of the storage pool’s block tree from
one disk to another.  It sounds less efficient than a straight whole-disk copy,
and traversing a live pool safely is definitely tricky (more on that in a
future post).  But it turns out that there are so many advantages to
metadata-driven resilvering that we’ve chosen to use it even for simple
mirrors.
The most compelling reason is data integrity. With a simple disk copy, there’s no way to know whether the source disk is returning good data.
End-to-end data integrity requires that each data block be verified against an independent checksum – it’s not enough to know that each block is merely consistent with itself, because that doesn’t catch common hardware and firmware bugs like misdirected reads and phantom writes.
By traversing the metadata, ZFS can use its end-to-end checksums to detect and correct silent data corruption, just like it does during normal reads. If a disk returns bad data transiently, ZFS will detect it and retry the read. If it’s a 3-way mirror and one of the two presumed-good disks is damaged, ZFS will use the checksum to determine which one is correct, copy the data to the new disk, and repair the damaged disk.
A simple whole-disk copy would bypass all of this data protection. For this reason alone, metadata-driven resilvering would be desirable even it it came at a significant cost in performance.
Fortunately, in most cases, it doesn’t. In fact, there are several advantages to metadata-driven resilvering:
- 
Live blocks only. ZFS doesn’t waste time and I/O bandwidth copying free disk blocks because they’re not part of the storage pool’s block tree. If your pool is only 10-20% full, that’s a big win. 
- 
Transactional pruning. If a disk suffers a transient outage, it’s not necessary to resilver the entire disk – only the parts that have changed. I’ll describe this in more detail in a future post, but in short: ZFS uses the birth time of each block to determine whether there’s anything lower in the tree that needs resilvering. This allows it to skip over huge branches of the tree and quickly discover the data that has actually changed since the outage began. What this means in practice is that if a disk has a five-second outage, it will only take about five seconds to resilver it. And you don’t pay extra for it – in either dollars or performance – like you do with Veritas change objects. Transactional pruning is an intrinsic architectural capability of ZFS. 
- 
Top-down resilvering. A storage pool is a tree of blocks. The higher up the tree you go, the more disastrous it is to lose a block there, because you lose access to everything beneath it. 
Going through the metadata allows ZFS to do top-down resilvering. That is, the very first thing ZFS resilvers is the uberblock and the disk labels. Then it resilvers the pool-wide metadata; then each filesystem’s metadata; and so on down the tree. Throughout the process ZFS obeys this rule: no block is resilvered until all of its ancestors have been resilvered.
It’s hard to overstate how important this is. With a whole-disk copy, even when it’s 99% done there’s a good chance that one of the top 100 blocks in the tree hasn’t been copied yet. This means that from an MTTR perspective, you haven’t actually made any progress: a second disk failure at this point would still be catastrophic.
With top-down resilvering, every single block copied increases the amount of discoverable data. If you had a second disk failure, everything that had been resilvered up to that point would be available.
Priority-based resilvering. ZFS doesn’t do this one yet, but it’s in the
pipeline. ZFS resilvering follows the logical structure of the data, so it
would be pretty easy to tag individual filesystems or files with a specific
resilver priority. For example, on a file server you might want to resilver
calendars first (they’re important yet very small), then /var/mail, then home
directories, and so on.
What I hope to convey with each of these posts is not just the mechanics of how a particular feature is implemented, but to illustrate how all the parts of ZFS form an integrated whole. It’s not immediately obvious, for example, that transactional semantics would have anything to do with resilvering – yet transactional pruning makes recovery from transient outages literally orders of magnitude faster. More on how that works in the next post.
Nov 4, 2006 ZFS Block Allocation
Block allocation is central to any filesystem. It affects not only performance, but also the administrative model (e.g. stripe configuration) and even some core capabilities like transactional semantics, compression, and block sharing between snapshots. So it’s important to get it right.
There are three components to the block allocation policy in ZFS:
- Device selection (dynamic striping)
- Metaslab selection
- Block selection
By design, these three policies are independent and pluggable. They can be changed at will without altering the on-disk format, which gives us lots of flexibility in the years ahead.
So… let’s go allocate a block!
1. Device selection (aka dynamic striping).
Our first task is device selection. The goal is to spread the load across all devices in the pool so that we get maximum bandwidth without needing any notion of stripe groups. You add more disks, you get more bandwidth. We call this dynamic striping – the point being that it’s done on the fly by the filesystem, rather than at configuration time by the administrator.
There are many ways to select a device. Any policy would work, including just picking one at random. But there are several practical considerations:
If a device was recently added to the pool, it’ll be relatively empty. To address such imbalances, we bias the allocation slightly in favor of underutilized devices. This keeps space usage uniform across all devices.
All else being equal, round-robin is a fine policy, but it’s critical to get the granularity right. If the granularity is too coarse (e.g. 1GB), we’ll only get one device worth of bandwidth when doing sequential reads and writes. If the granularity is too fine (e.g. one block), we’ll waste any read buffering the device can do for us. In practice, we’ve found that switching from one device to the next every 512K works well for the current generation of disk drives.
That said, for intent log blocks, it’s better to round-robin between devices each time we write a log block. That’s because they’re very short-lived, so we don’t expect to ever need to read them; therefore it’s better to optimize for maximum IOPs when writing log blocks. Neil Perrin integrated support for this earlier today.
More generally, we’ll probably want different striping policies for different types of data: large/sequential, small/random, transient (like the intent log), and dnodes (clumping for spatial density might be good). This is fertile ground for investigation.
If one of the devices is slow or degraded for some reason, it should be avoided. This is work in progress.
2. Metaslab selection.
We divide each device into a few hundred regions, called metaslabs, because the overall scheme was inspired by the slab allocator. Having selected a device, which metaslab should we use? Intuitively it seems that we’d always want the one with the most free space, but there are other factors to consider:
Modern disks have uniform bit density and constant angular velocity. Therefore, the outer recording zones are faster (higher bandwidth) than the inner zones by the ratio of outer to inner track diameter, which is typically around 2:1. We account for this by assigning a higher weight to the free space in lower-LBA metaslabs. In effect, this means that we’ll select the metaslab with the most free bandwidth rather than simply the one with the most free space.
When a pool is relatively empty, we want to keep allocations in the outer (faster) regions of the disk; this improves both bandwidth and latency (by minimizing seek distances). Therefore, we assign a higher weight to metaslabs that have been used before.
All of these considerations can be seen in the function metaslab_weight(). Having defined a weighting scheme, the selection algorithm is simple: always select the metaslab with the highest weight.
3. Block selection.
Having selected a metaslab, we must choose a block within that metaslab. The current allocation policy is a simple variation on first-fit; it seems likely that we can do better. In the future I expect that we’ll have not only a better algorithm, but a whole collection of algorithms, each optimized for a specific workload. Anticipating this, the block allocation code is fully vectorized; see space_map_ops_t for details.
The mechanism (as opposed to policy) for keeping track of free space in a metaslab is a new data structure called a space map, which I’ll describe in the next post.
Sep 14, 2007 - Space Maps
Every filesystem must keep track of two basic things: where your data is, and where the free space is.
In principle, keeping track of free space is not strictly necessary: every block is either allocated or free, so the free space can be computed by assuming everything is free and then subtracting out everything that’s allocated; and the allocated space can be found by traversing the entire filesystem from the root. Any block that cannot be found by traversal from the root is, by definition, free.
In practice, finding free space this way would be insufferable because it would take far too long for any filesystem of non-trivial size. To make the allocation and freeing of blocks fast, the filesystem needs an efficient way to keep track of free space. In this post we’ll examine the most common methods, why they don’t scale well, and the new approach we devised for ZFS.
Bitmaps
The most common way to represent free space is by using a bitmap. A bitmap is simply an array of bits, with the Nth bit indicating whether the Nth block is allocated or free. The overhead for a bitmap is quite low: 1 bit per block. For a 4K blocksize, that’s 1/(4096*8) = 0.003%. (The 8 comes from 8 bits per byte.)
For a 1GB filesystem, the bitmap is 32KB – something that easily fits in memory, and can be scanned quickly to find free space. For a 1TB filesystem, the bitmap is 32MB – still stuffable in memory, but no longer trivial in either size or scan time. For a 1PB filesystem, the bitmap is 32GB, and that simply won’t fit in memory on most machines. This means that scanning the bitmap requires reading it from disk, which is slower still.
Clearly, this doesn’t scale.
One seemingly obvious remedy is to break the bitmap into small chunks, and keep track of the number of bits set in each chunk. For example, for a 1PB filesystem using 4K blocks, the free space can be divided into a million bitmaps, each 32KB in size. The summary information (the million integers indicating how much space is in each bitmap) fits in memory, so it’s easy to find a bitmap with free space, and it’s quick to scan that bitmap.
But there’s still a fundamental problem: the bitmap(s) must be updated not only when a new block is allocated, but also when an old block is freed. The filesystem controls the locality of allocations (it decides which blocks to put new data into), but it has no control over the locality of frees. Something as simple as ‘rm -rf’ can cause blocks all over the platter to be freed. With our 1PB filesystem example, in the worst case, removing 4GB of data (a million 4K blocks) could require each of the million bitmaps to be read, modified, and written out again. That’s two million disk I/Os to free a measly 4GB – and that’s just not reasonable, even as worst-case behavior.
More than any other single factor, this is why bitmaps don’t scale: because frees are often random, and bitmaps that don’t fit in memory perform pathologically when they are accessed randomly.
B-trees
Another common way to represent free space is with a B-tree of extents. An extent is a contiguous region of free space described by two integers: offset and length. The B-tree sorts the extents by offset so that contiguous space allocation is efficient. Unfortunately, B-trees of extents suffer the same pathology as bitmaps when confronted with random frees.
What to do?
Deferred frees
One way to mitigate the pathology of random frees is to defer the update of the bitmaps or B-trees, and instead keep a list of recently freed blocks. When this deferred free list reaches a certain size, it can be sorted, in memory, and then freed to the underlying bitmaps or B-trees with somewhat better locality. Not ideal, but it helps.
But what if we went further?
Space maps: log-structured free lists
Recall that log-structured filesystems long ago posed this question: what if, instead of periodically folding a transaction log back into the filesystem, we made the transaction log be the filesystem?
Well, the same question could be asked of our deferred free list: what if, instead of folding it into a bitmap or B-tree, we made the deferred free list be the free space representation?
That is precisely what ZFS does. ZFS divides the space on each virtual device into a few hundred regions called metaslabs. Each metaslab has an associated space map, which describes that metaslab’s free space. The space map is simply a log of allocations and frees, in time order. Space maps make random frees just as efficient as sequential frees, because regardless of which extent is being freed, it’s represented on disk by appending the extent (a couple of integers) to the space map object – and appends have perfect locality. Allocations, similarly, are represented on disk as extents appended to the space map object (with, of course, a bit set indicating that it’s an allocation, not a free).
When ZFS decides to allocate blocks from a particular metaslab, it first reads that metaslab’s space map from disk and replays the allocations and frees into an in-memory AVL tree of free space, sorted by offset. This yields a compact in-memory representation of free space that supports efficient allocation of contiguous space. ZFS also takes this opportunity to condense the space map: if there are many allocation-free pairs that cancel out, ZFS replaces the on-disk space map with the smaller in-memory version.
Space maps have several nice properties:
- 
They don’t require initialization: a space map with no entries indicates that there have been no allocations and no frees, so all space is free. 
- 
They scale: because space maps are append-only, only the last block of the space map object needs to be in memory to ensure excellent performance, no matter how much space is being managed. 
- 
They have no pathologies: space maps are efficient to update regardless of the pattern of allocations and frees. 
- 
They are equally efficient at finding free space whether the pool is empty or full (unlike bitmaps, which take longer to scan as they fill up). 
Finally, note that when a space map is completely full, it is represented by a single extent. Space maps therefore have the appealing property that as your storage pool approaches 100% full, the space maps start to evaporate, thus making every last drop of disk space available to hold useful information.