Friday, May 04, 2007


ChunkFS -- minimizing fsck time

I ran across chunkfs this week. (Another link and the paper.) Actually, I ran across a mention of it a month or so ago, but I only really looked at it this week, having been pointed there by Joerg's blog.

The point of chunkfs is to minimize fsck time. Having spent a sleepless night waiting for a 2TB ext3 file system to fsck, I can understand the desire to minimize this. As the paper points out, the trend in storage growth means that fsck's of normal filesystems will soon take on the order of days or tens of days. This quote from the paper is particularly interesting: "The fraction of time data is unavailable while running fsck may asymptotically approach unity in the absence of basic architectural changes."

The approach of chunkfs is to break up the file system into multiple mostly-independent fault domains, the intent being to limit the need for fsck to just one of these chunks (or possibly a small number of them.) The idea is similar in concept to having many small file systems rather than one large one, but it hides the details (and administrative hassle) from the user, so that the illusion is of a single large file system.

From this page we get this bit of information that isn't mentioned in the paper:
Chunkfs makes these numbers unique by putting the chunk number in the upper eight bits of every inode number. As a result, there is a maximum of 256 chunks in any chunkfs filesystem.
(I haven't actually verified this for myself, so take this with a grain of salt.) Given this, that one-week fsck is reduced to less than an hour, assuming that only a single chunk needs to be checked. That 256x speed-up in fsck time is certainly impressive.
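To make the scheme concrete, here's a minimal sketch of that inode numbering. The eight-bit chunk field comes from the page quoted above; the 32-bit overall inode width is my assumption for illustration, not something the paper specifies.

```python
CHUNK_BITS = 8    # per the chunkfs page: chunk number lives in the top 8 bits
INODE_BITS = 32   # assumed overall inode width, for illustration only
LOCAL_BITS = INODE_BITS - CHUNK_BITS

def make_inode(chunk, local_ino):
    """Pack a chunk number and a per-chunk inode number into one inode number."""
    assert 0 <= chunk < (1 << CHUNK_BITS)       # hence at most 256 chunks
    assert 0 <= local_ino < (1 << LOCAL_BITS)
    return (chunk << LOCAL_BITS) | local_ino

def chunk_of(inode):
    """Recover the chunk number -- the only chunk fsck would need to check."""
    return inode >> LOCAL_BITS

ino = make_inode(5, 1234)
assert chunk_of(ino) == 5
```

The upshot is that every inode number is globally unique across the file system while still identifying its fault domain, at the cost of capping the design at 256 chunks.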

The authors also suggest the possibility of doing on-line fsck's. The feasibility of this suggestion is based on observations that metadata updates tend to be localized, with the prediction that chunks would spend a majority of their time idle with respect to metadata updates. On-line fsck's would eliminate the need for file system down time and could possibly prevent events like server reboots due to file system corruption (i.e., problems could be found in a controlled fashion.)

I would argue, however, that any effort to minimize fsck time is misguided. I'm of the opinion that the time would be better spent working on ways to eliminate the need for fsck.

Actually, let me step back from that statement for a second and extrapolate from the authors' work. They suggest a divide-and-conquer approach to create separate fault domains. These chunks would generally consist of some subset of the files in the file system. We could imagine growing the storage, and/or shrinking the chunk size, to such a degree that chunks are subsets of files, possibly even individual blocks. In this case, consistency checks would be done at the sub-file level. Given that the file system could support on-line fsck's, we could have an almost constant amount of background consistency checking of parts of file systems.

Taken to this extreme, chunkfs starts to look a lot like ZFS. ZFS is constantly verifying the consistency of metadata in its file systems. Admittedly, it's only doing so for the metadata for the files that are actually in use, and it's only doing so as that metadata is being read or written (assuming a scrub is not running), but one could argue that that's a benefit, as it's not wasting cycles checking data that's not in use. (There's the added benefit that ZFS is also checksumming the data, thus giving a correctness guarantee that chunkfs doesn't provide.)
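The mechanism behind that guarantee is worth spelling out. The following is an illustrative sketch, not ZFS code: each "block pointer" carries the expected checksum of the block it points to, so every read verifies the data it returns. The class names and SHA-256 choice here are mine, for illustration.

```python
import hashlib

def checksum(data):
    return hashlib.sha256(data).digest()

class Block:
    """Stand-in for an on-disk block."""
    def __init__(self, data):
        self.data = data

class Pointer:
    """A block pointer that carries the expected checksum of its target."""
    def __init__(self, block):
        self.block = block
        self.cksum = checksum(block.data)

    def read(self):
        # Verification happens on every read, not only during a scrub.
        if checksum(self.block.data) != self.cksum:
            raise IOError("checksum mismatch: on-disk data is corrupt")
        return self.block.data

ptr = Pointer(Block(b"metadata"))
assert ptr.read() == b"metadata"   # checksum verified on the way in
```

If something later flips bits in `ptr.block.data`, the next `read()` raises instead of silently returning bad data. Because the checksum lives in the parent rather than alongside the block, a misdirected or phantom write is caught too.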

I won't go so far as to argue that the ZFS implementation is the only possible way to eliminate the need for fsck, but I would argue that they've done the right thing in doing so.

There's actually a pretty interesting quote on this point, ironically enough from one of the authors of chunkfs:
The on-disk state is always valid. No more fsck, no need to replay a journal. We use a copy-on-write, transactional update system to correctly and safely update data on disk. There is no window where the on-disk state can be corrupted by a power cycle or system panic. This means that you are much less likely to lose data than on a file system that uses fsck or journal replay to repair on-disk data corruption after a crash. Yes, supposedly fsck-free file systems already exist - but then explain to me all the time I've spent waiting for fsck to finish on ext3 or logging UFS - and the resulting file system corruption.
And there's another interesting note from the paper:
Those inclined to dismiss file system bugs as a significant source of file system corruption are invited to consider the recent XFS bug in Linux 2.6.17 requiring repair via the XFS file system repair program.
I had originally intended to point out that a filesystem repair program is another possible entry point for bugs, but this FAQ entry proved my point for me:
To add insult to injury, xfs_repair(8) is currently not correcting these directories on detection of this corrupt state either. This xfs_repair issue is actively being worked on, and a fixed version will be available shortly.
Both of these problems have since been fixed, apparently, but this lends weight to my point. The added complexity of a file system repair program should be taken into consideration in this argument. One could argue that there is very little added complexity because the repair program shares much of its code with the file system itself; but that means, of course, that it shares the same bugs. On the other hand, one could maintain a separate code base for the repair program, but that is a considerable amount of added complexity.


XFS filesystems are divided into a number of equally sized chunks called Allocation Groups. Each AG can almost be thought of as an individual filesystem that maintains its own space usage. Each AG can be up to one terabyte in size (512 bytes × 2^31), regardless of the underlying device's sector size.
Each AG has the following characteristics:
· A super block describing overall filesystem info
· Free space management
· Inode allocation and tracking
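That one-terabyte figure follows directly from the arithmetic quoted above: block offsets within an AG fit in 31 bits, so with 512-byte units an AG tops out at 512 × 2^31 bytes.

```python
# Allocation-group size limit, as stated above: 512 bytes * 2**31.
unit = 512
max_ag_bytes = unit * 2**31

assert max_ag_bytes == 2**40   # exactly one terabyte (2**40 bytes)
```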