Sunday, May 06, 2007


ext3cow -- snapshots for ext3

A couple of days ago, I wrote about chunkfs. I also recently ran across ext3cow. Chunkfs has a very specific goal, that of minimizing fsck time. Similarly, ext3cow has a very specific goal -- providing snapshots. The motivation seems to be very business/regulatory-compliance-related, as evidenced by the publication list related to ext3cow (Building Regulatory Compliant Storage Systems, Secure Deletion for a Versioning File System, Verifiable Audit Trails for a Versioning File System, Limiting Liability in a Federally Compliant File System, etc.) But snapshots are a generally useful feature outside the world of compliance, as anyone who's ever accidentally deleted a few days' work on a programming assignment can attest.

(Note that a snapshotting file system is not, in and of itself, a versioning file system. A versioning file system maintains intermediate versions of files. If I make three separate modifications to a file on the same day, I get three copies of the file. A snapshotting file system only maintains the versions of files that existed when a snapshot was taken. If I make three separate modifications to a file on the same day, and a snapshot is taken once a day, I only get the last version of the file. With respect to the relevant regulations, however, the final version of a file as it existed on any given day (or some other specified time period) may be all that matters.)

One of the design features of ext3cow is that the changes needed to support snapshotting were localized to the ext3 code itself, so that no changes were necessary to the VFS data structures or interfaces. This certainly makes life easy, as ext3cow can be used with otherwise stock kernels, but it also places some restrictions on what can be done with this file system.

There are a few parts to how ext3cow works. First, they've added a field to the superblock to contain what they call the epoch counter. This is merely the timestamp of the last snapshot of the file system, in seconds since the epoch.

They've added three fields to the inode itself: an epoch counter, a copy-on-write bitmap, and a next inode pointer. The epoch counter of a file gets updated with the epoch counter of the file system every time a write to the inode occurs. If the epoch counter of an inode is older than the file system epoch, it means that a new snapshot has been taken since the file was last updated, and the copy-on-write mechanism comes into play. This mechanism involves the other two new fields.
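The interplay between the two epoch counters can be sketched in a few lines. This is a toy in-memory model, not ext3cow's actual structures; the class and field names are mine:

```python
# Sketch of the ext3cow epoch check: a write must trigger copy-on-write
# exactly when a snapshot has been taken since the inode was last updated.
# Simplified stand-ins for the superblock and inode; names are illustrative.

import time

class Superblock:
    def __init__(self):
        self.epoch = int(time.time())  # timestamp of the last snapshot

class Inode:
    def __init__(self, fs_epoch):
        self.epoch = fs_epoch          # fs epoch at the time of the last write

def needs_cow(sb, inode):
    """True if a snapshot was taken since this inode was last written."""
    return inode.epoch < sb.epoch

sb = Superblock()
ino = Inode(sb.epoch)
assert not needs_cow(sb, ino)   # no snapshot since the last write

sb.epoch += 10                  # taking a snapshot just advances the fs epoch
assert needs_cow(sb, ino)       # so the next write to this inode must COW
```

Note how cheap the snapshot itself is in this model: it is nothing but the epoch bump, with all of the copying deferred to later writes.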

Multiple versions of a file can share data blocks, as long as those data blocks haven't changed. (This is the efficiency gained by copy-on-write mechanisms in general.) In ext3cow, the copy-on-write bitmap keeps track of which data blocks are shared between versions of a file, essentially indicating whether each block is read-only (a new block needs to be allocated before modifying the data) or read-write (a new block has already been allocated, thus the data can be modified in place.) If an update needs to be made to a read-only block, a new block is allocated, the inode is updated to point to the new block, the new block is written out, and the COW bitmap is updated to mark the block as writable in place from then on. (Note that the old block is likely not copied to the new block, as it was likely first read in before being modified. The copy of the data thus already exists in memory.)
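The per-block decision above can be modeled with a trivial allocator. Again, this is an illustrative sketch under my own naming, not ext3cow's code:

```python
# Toy model of the per-block COW decision: the `writable` list plays the
# role of the copy-on-write bitmap, and `alloc` hands out block numbers.

class VersionedFile:
    def __init__(self, nblocks, alloc):
        self.blocks = [alloc() for _ in range(nblocks)]  # data block pointers
        self.writable = [True] * nblocks                 # the COW bitmap

    def snapshot(self):
        # After a snapshot, every block is shared with the old version,
        # hence read-only until a write copies it.
        self.writable = [False] * len(self.blocks)

    def write_block(self, i, data, alloc, store):
        if not self.writable[i]:
            # Shared block: allocate a fresh one, repoint the inode at it,
            # and mark it writable in place from now on.
            self.blocks[i] = alloc()
            self.writable[i] = True
        store[self.blocks[i]] = data

counter = iter(range(1000))
alloc = lambda: next(counter)
store = {}

f = VersionedFile(2, alloc)            # gets blocks 0 and 1
f.write_block(0, b"v1", alloc, store)
f.snapshot()
f.write_block(0, b"v2", alloc, store)  # triggers allocation of block 2
assert f.blocks[0] == 2                # inode now points at the new block
assert store[0] == b"v1"               # the snapshot's block is untouched
f.write_block(0, b"v3", alloc, store)  # bitmap says writable: in-place update
assert store[2] == b"v3"
```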

The copy-on-write mechanism in ext3cow involves allocating new inodes for the snapshotted versions of files, thus the next-inode pointer in the inode. (Note that the live inode necessarily remains the same so that things like NFS continue to work.) The versions of a file are represented by a linked list via the next-inode pointer, with the current inode at the head of the list to make the common-case access fast. When the inode for the snapshotted version of a file is allocated, all of the data block pointers (and indirect pointers, etc.) are the same for the snapshotted inode and the live inode. The live inode will only have data blocks unique to itself when modifications or additions are made to the file.
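The chain structure can be sketched as follows. This is a simplified model under assumed names; in particular, I'm compressing "first write after a snapshot" into a single helper:

```python
# Toy model of the next-inode chain: the live inode keeps its inode number
# (so NFS file handles stay valid), while the first post-snapshot write
# clones the old version into a newly allocated inode behind it.

class Inode:
    def __init__(self, number, blocks, epoch):
        self.number = number
        self.blocks = blocks      # data block pointers
        self.epoch = epoch
        self.next = None          # next-older version in the chain

def cow_inode(live, new_number, fs_epoch):
    """First write after a snapshot: preserve the old version in a new
    inode, spliced in just behind the live inode at the head."""
    old = Inode(new_number, list(live.blocks), live.epoch)
    old.next = live.next
    live.next = old               # live inode stays at the head of the list
    live.epoch = fs_epoch         # live inode is current again
    return old

live = Inode(number=12, blocks=[0, 1], epoch=100)
cow_inode(live, new_number=900, fs_epoch=200)
live.blocks[0] = 7                     # the live copy diverges

assert live.number == 12               # externally visible number unchanged
assert live.next.blocks == [0, 1]      # snapshot keeps the old block pointers
assert live.next.number == 900
```

Right after the clone, both inodes point at exactly the same data blocks; they only diverge as the COW mechanism above allocates new blocks for the live file.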

The final metadata modification they've made is to the directory entry data structure. For each directory entry, they've added a birth epoch and a death epoch, essentially a range during which that inode is live. This allows multiple different instances of a filename to exist in a directory (as the lifetimes wouldn't overlap), and it avoids muddying the namespace with old versions of files (e.g., an ls of a current directory will show only those directory entries with an unspecified death epoch.)
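A small sketch shows how birth/death epochs scope name lookups. The representation (tuples, `None` meaning "still live") is my own simplification:

```python
# Toy model of birth/death-epoch scoping in directory entries. An entry is
# (name, birth_epoch, death_epoch, inode_number), with death None for live
# entries. Names and layout are illustrative, not ext3cow's dirent format.

LIVE = None

def visible(entries, name, at=LIVE):
    """Inode for `name` as of epoch `at`; at=None means the current view,
    which shows only entries with no death epoch."""
    for (n, birth, death, ino) in entries:
        if n != name:
            continue
        if at is LIVE:
            if death is LIVE:
                return ino
        elif birth <= at and (death is LIVE or at < death):
            return ino
    return None

entries = [
    ("paper.tex", 100, 300, 41),    # deleted at epoch 300...
    ("paper.tex", 350, LIVE, 57),   # ...then recreated: same name, new inode
]

assert visible(entries, "paper.tex") == 57           # current ls sees one file
assert visible(entries, "paper.tex", at=200) == 41   # an old snapshot's view
assert visible(entries, "paper.tex", at=320) is None # the gap: no such file
```

The non-overlapping lifetimes are what let two entries for `paper.tex` coexist in one directory without ambiguity.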

So, given that I'm a proponent of ZFS, how does ext3cow compare? To some extent, that's not really a fair question, as the scope of ZFS is much larger than the scope of ext3cow. The main purpose of ext3cow is to provide snapshots, whereas snapshots in ZFS are almost merely a side-effect of other design goals. Ext3cow is certainly a creative use of what's available in ext3 to provide the ability to take snapshots.

Given that it's not fair to compare the two as file systems per se, I will point out some similarities and differences that are interesting to note.

A similarity between the two is in the amount of work required to take a snapshot. There is a constant amount of work involved in each case. For ext3cow, the work involved is merely updating the snapshot epoch in the superblock. For ZFS, the work involved is merely noting that we don't want to garbage-collect a specific version of the uberblock.

As for differences, the first to note is that ext3cow allocates new inodes for snapshotted files (but only when the current file changes.) Given that ext3 doesn't support dynamic inode allocation, this means that snapshotted files will consume that limited resource. While in general this is unlikely to be a problem, it could affect certain use cases, such as frequently-snapshotted and highly active file systems, or even moderately active file systems with long snapshot-retention requirements. Of course, in these cases, disk space is more likely to be the overriding concern.

In contrast, ZFS does not allocate new inodes for snapshotted files; it uses the inode as it existed when the snapshot was taken. Of course, given that all file system updates are copy-on-write (even in the absence of snapshots), one could argue that ZFS is constantly allocating new inodes for modified files. From this point of view, the only resource ZFS isn't consuming for snapshots is inode numbers.

Another difference is that ext3cow snapshots are essentially only snapshots of the data and not a snapshot of the file system per se. Ext3cow doesn't support snapshotting the metadata itself. The metadata for a snapshotted file can change at some later time when the live file is modified. A new inode is allocated, the old metadata is copied over, and, at a minimum, the inode number and the next-inode pointer are modified.

In contrast, ZFS snapshots are a point-in-time view of the state of the entire file system, including metadata. This is a result of the copy-on-write mechanism in ZFS. File changes involve copy-on-write modifications, not only of the data blocks, but also of the metadata up the tree to the uberblock. When a snapshot is taken, that version of the uberblock is preserved, along with all of the metadata and data blocks referenced via this uberblock. The metadata doesn't need to be modified, and new inodes don't need to be allocated for snapshotted files.

Of course, whether or not the metadata is maintained in a pristine state is mostly of theoretical interest. It likely doesn't matter for regulatory compliance, where the data itself is the primary concern. (Well, I'm sure a highly-paid lawyer could probably use this point to introduce doubt about the validity of snapshotted files, if there were ever a court case involving very large sums of money.) And I'm sure this subtlety doesn't matter to the student who recovers a file representing many hours of work when a deadline is only a very few hours away.

Another difference is the amount of work involved in recovering a snapshotted version of a file. With ext3cow, this involves a linked-list traversal of next-inode pointers while comparing the desired file time to the liveness range of each inode in turn. With ZFS, this involves merely looking up the file in an alternate version of the filesystem, which is no more work than finding the current version of that file. But again, this is likely a difference that won't matter. Looking up the snapshotted version of a file is the rare case, and according to the authors' experimental results, the penalty is negligible.
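The ext3cow side of that lookup is a short loop. Here's a deliberately simplified model in which each version carries a single epoch rather than a full liveness range; the structure of the traversal is the point, not the exact comparison:

```python
# Sketch of finding a file as of a past time: walk the next-inode chain
# from the live inode, newest to oldest, until we reach the version that
# was current at the requested epoch. A simplified model, not ext3cow's code.

class Version:
    def __init__(self, epoch, data, older=None):
        self.epoch = epoch   # epoch at which this version was last written
        self.data = data
        self.older = older   # next-older version in the chain

def version_at(head, when):
    """Newest version whose epoch is <= the requested time."""
    v = head
    while v is not None and v.epoch > when:
        v = v.older
    return v

# Chain: live (epoch 300) -> old (epoch 200) -> oldest (epoch 100)
oldest = Version(100, b"draft")
old = Version(200, b"revised", oldest)
live = Version(300, b"final", old)

assert version_at(live, 300).data == b"final"
assert version_at(live, 250).data == b"revised"  # one link traversed
assert version_at(live, 50) is None              # predates the file
```

The cost is linear in the number of preserved versions of that one file, but since the live version sits at the head of the chain, the common case of current-file access pays nothing.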

One subtle difference between the two file systems is that ext3cow allows for individual file snapshots, which ZFS doesn't support. That ext3cow supports this almost falls out of the implementation -- the epoch counter for the file is updated to the current time rather than the file system epoch, and the copy-on-write mechanism catches this condition. Conversely, the implementation of snapshots in ZFS prevents this. Snapshots are implemented using the uberblock. Taking a snapshot of a single file would involve also snapshotting the metadata up the tree to the uberblock. Guaranteeing that the snapshot is only of a single file would involve pruning away all other directories and files. Given that snapshotting is such a cheap operation in ZFS, it doesn't make sense to do so.
