Thursday, May 17, 2007


Re-linking an unlinked file with fsdb

Given that you have an unlinked file on a file system, how do you relink it into the file system?

One option is simply to work around the problem: copy /proc/XXX/fd/Y. That gets you the contents of the file, but as a new file rather than the original file itself.
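As a rough illustration of that workaround, here is a minimal Python sketch. It assumes a /proc file system that exposes open descriptors as /proc/PID/fd/N; the helper names copy_open_file and demo are made up for this example. It unlinks a file that is still held open, copies the contents back out, and checks that the copy lands in a different inode.

```python
import os
import shutil
import tempfile

def copy_open_file(pid, fd, dest):
    """Copy the contents behind /proc/<pid>/fd/<fd> into a new file at dest."""
    src = f"/proc/{pid}/fd/{fd}"          # assumes a /proc with per-process fd entries
    with open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)

def demo():
    """Unlink a file we still hold open, then rescue its contents via /proc."""
    workdir = tempfile.mkdtemp()
    victim = os.path.join(workdir, "victim")
    rescued = os.path.join(workdir, "rescued")
    with open(victim, "w+b") as f:
        f.write(b"still here\n")
        f.flush()
        os.unlink(victim)                 # the name is gone, the inode is not
        copy_open_file(os.getpid(), f.fileno(), rescued)
        # The rescued copy is a different inode: contents recovered, file not relinked.
        same_inode = os.fstat(f.fileno()).st_ino == os.stat(rescued).st_ino
    with open(rescued, "rb") as r:
        return r.read(), same_inode
```

The copy has the same bytes but a new inode number, which is exactly why this is only a workaround.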

What you really want to be able to do is "ln /proc/XXX/fd/Y /some/dir/foo". In theory, there's no reason why this shouldn't work, since /proc/XXX/fd/Y gives you the inode number of the unlinked file within its file system, but it does not currently work.

I tried the following with fsdb (on Solaris on a UFS file system), and it worked to relink the file into the file system. Doing something like this isn't generally advisable, as it involves manipulating the disk directly. This bypasses the file-caching mechanism in the kernel, so you'll end up with a disk image that doesn't match what the kernel thinks it should look like. In my case, things were fine after a reboot and fsck, but I may simply have gotten lucky.

(I was prompted to try this after discovering that it's possible to unlink(1M) a non-empty directory. I was curious to see whether or not the inodes in that directory were orphaned by doing so, and it turns out that they are. Given that I had a file system with orphaned inodes, I wanted to see if I could remedy the situation. I was mostly interested to see if I could fix the live file system, which I was unfortunately unable to do.)

# fsdb -o w /dev/md/rdsk/d1
fsdb of /dev/md/rdsk/d1 (Opened for write) -- last mounted on /var
fs_clean is currently set to FSLOG
fs_state consistent (fs_clean CAN be trusted)
/dev/md/rdsk/d1 > :cd tmp
/dev/md/rdsk/d1 > :ls -l
i#: b6          ./
i#: 2           ../
i#: 4d9a        .java/
i#: 1279        SMB--0A40081A-044_0
i#: 4dbd        autosave/
i#: 4970        chadfoo/
i#: 127b        cifs_0
i#: 127c        host_0
i#: 4d9e        mancache/
i#: 1277
i#: 1276
i#: 127a        smb--0a40081a-044_0
/dev/md/rdsk/d1 >

The unlinked file that I'm concerned about existed in the chadfoo directory. (Not that this really matters, given that the scope of the name space for these inode numbers is /var, but I want to link it back to its original location.)

The plan is to create an empty file ("bigfile") in /var/tmp/chadfoo and use fsdb to change the inode number for that particular directory entry to be that of the unlinked file. So that I don't lose the inode of the empty file, I create a second link to it, "bigfile2", and I fix link counts afterwards so that everything's kosher.

Here, I look at the directory entries for inode 4970 (/var/tmp/chadfoo):

/dev/md/rdsk/d1 > 4970:inode; 0:dir?d
i#: 4970        .
/dev/md/rdsk/d1 > 4970:inode; 1:dir?d
i#: b6          ..
/dev/md/rdsk/d1 > 4970:inode; 2:dir?d
i#: 4a62        bigfile
/dev/md/rdsk/d1 > 4970:inode; 3:dir?d
i#: 4a62        bigfile2
/dev/md/rdsk/d1 >

And then I set the inode number for directory entry number 2 to be that of the unlinked file (decimal 19037):

/dev/md/rdsk/d1 > 4970:inode; 2:dir=0t19037
i#: 4a5d        bigfile
/dev/md/rdsk/d1 >
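fsdb prints inode numbers in hex, while the command above supplies the inode number in decimal using the 0t prefix. A small Python sketch of the conversion, useful for double-checking a number before writing it to disk (the helper names are made up for this example):

```python
def fsdb_decimal(hex_inode):
    """Render a hex inode number (as fsdb prints it) as a 0t-prefixed decimal."""
    return "0t%d" % int(hex_inode, 16)

def fsdb_hex(literal):
    """Convert a number typed at fsdb (hex by default, 0t prefix for decimal) to hex text."""
    if literal.lower().startswith("0t"):
        return format(int(literal[2:], 10), "x")
    return format(int(literal, 16), "x")

print(fsdb_decimal("4a5d"))   # the unlinked file's inode, as typed in the session
print(fsdb_hex("0t19037"))    # and back to the form fsdb echoes
```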

What I've done doesn't change the link counts, though, so I need to do this manually:

/dev/md/rdsk/d1 > 4a62:inode?i
i#: 4a62           md: ----rw-r--r--  uid: 2d85          gid: a
ln: 2              bs: 0              sz : c_flags : 0           0

        accessed: Wed May 16 11:12:22 2007
        modified: Wed May 16 11:12:22 2007
        created : Wed May 16 11:13:02 2007
/dev/md/rdsk/d1 > :ln=1
i#: 4a62           md: ----rw-r--r--  uid: 2d85          gid: a
ln: 1              bs: 0              sz : c_flags : 0           0

        accessed: Wed May 16 11:12:22 2007
        modified: Wed May 16 11:12:22 2007
        created : Wed May 16 11:13:02 2007
/dev/md/rdsk/d1 > 4a5d:inode?i
i#: 4a5d           md: ----rw-r--r--  uid: 2d85          gid: a
ln: 0              bs: 200410         sz : c_flags : 0           40000000

db#0: 28e00        db#1: 28e48        db#2: 28e50        db#3: 28e58
db#4: 28e60        db#5: 28e68        db#6: 28e70        db#7: 28e78
db#8: 28e80        db#9: 28e88        db#a: 28e90        db#b: 28e98
ib#0: 204748       ib#1: cc808
        accessed: Wed May 16 10:39:20 2007
        modified: Wed May 16 10:39:44 2007
        created : Wed May 16 10:39:44 2007
/dev/md/rdsk/d1 > :ln=1
i#: 4a5d           md: ----rw-r--r--  uid: 2d85          gid: a
ln: 1              bs: 200410         sz : c_flags : 0           40000000

db#0: 28e00        db#1: 28e48        db#2: 28e50        db#3: 28e58
db#4: 28e60        db#5: 28e68        db#6: 28e70        db#7: 28e78
db#8: 28e80        db#9: 28e88        db#a: 28e90        db#b: 28e98
ib#0: 204748       ib#1: cc808
        accessed: Wed May 16 10:39:20 2007
        modified: Wed May 16 10:39:44 2007
        created : Wed May 16 10:39:44 2007
/dev/md/rdsk/d1 >

And at this point, I reboot into single-user mode, run fsck on /var (which shows a few problems to be corrected), and things are fine.

Sunday, May 06, 2007


ext3cow -- snapshots for ext3

A couple of days ago, I wrote about chunkfs. I also recently ran across ext3cow. Chunkfs has a very specific goal, that of minimizing fsck time. Similarly, ext3cow has a very specific goal -- providing snapshots. The motivation seems to be very business/regulatory-compliance-related, as evidenced by the publication list related to ext3cow ("Building Regulatory Compliant Storage Systems", "Secure Deletion for a Versioning File System", "Verifiable Audit Trails for a Versioning File System", "Limiting Liability in a Federally Compliant File System", etc.). But snapshots are a generally useful feature outside the world of compliance, as anyone who's ever accidentally deleted a few days' work on a programming assignment can attest.

(Note that a snapshotting file system is not, in and of itself, a versioning file system. A versioning file system maintains intermediate versions of files. If I make three separate modifications to a file on the same day, I get three copies of the file. A snapshotting file system only maintains the versions of files that existed when a snapshot was taken. If I make three separate modifications to a file on the same day, and a snapshot is taken once a day, I only get the last version of the file. With respect to the relevant regulations, however, the final version of a file as it existed on any given day (or some other specified time period) may be all that matters.)

One of the design features of ext3cow is that changes to support snapshotting were localized to the ext3 code itself, so that no changes were necessary to the VFS data structures or interfaces. This certainly makes life easy, as it can be used with otherwise stock kernels. This also places some restrictions on what can be done with this filesystem.

There are a few parts to how ext3cow works. First, they've added a field to the superblock to contain what they call the epoch counter. This is merely the timestamp of the last snapshot of the file system, in seconds since the epoch.

They've added three fields to the inode itself: an epoch counter, a copy-on-write bitmap, and a next inode pointer. The epoch counter of a file gets updated with the epoch counter of the file system every time a write to the inode occurs. If the epoch counter of an inode is older than the file system epoch, it means that a new snapshot has been taken since the file was last updated, and the copy-on-write mechanism comes into play. This mechanism involves the other two new fields.

Multiple versions of a file can share data blocks, as long as those data blocks haven't changed. (This is the efficiency gained by copy-on-write mechanisms in general.) In ext3cow, the copy-on-write bitmap keeps track of which data blocks are shared between versions of a file, essentially indicating whether each block is read-only (a new block needs to be allocated before modifying the data) or read-write (a new block has already been allocated, thus the data can be modified in-place.) If an update needs to be made to a read-only block, a new block is allocated, the inode is updated to point to the new block, the new block is written out, and the COW bitmap is updated to reflect that the block has been allocated. (Note that the old block is likely not copied to the new block, as it was likely first read in before being modified. The copy of the data thus already exists in memory.)
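The write path described above can be modeled in a few lines of Python. This is only a toy sketch under stated assumptions: the class and field names are invented, the bitmap polarity (True meaning "already private, writable in place") is a choice made for the example, and the allocation of a separate inode for the snapshotted version is left out. It is not ext3cow's actual on-disk layout or code.

```python
from dataclasses import dataclass, field
from itertools import count

@dataclass
class Inode:
    epoch: int                                     # fs epoch at the time of the last write
    blocks: list = field(default_factory=list)     # block numbers
    writable: list = field(default_factory=list)   # toy COW bitmap, one flag per block

@dataclass
class FileSystem:
    epoch: int = 0                                 # time of the last snapshot
    data: dict = field(default_factory=dict)       # block number -> contents
    _next_block: count = field(default_factory=lambda: count(100))

    def snapshot(self, now):
        # Taking a snapshot is just bumping the file system epoch.
        self.epoch = now

    def write_block(self, inode, index, payload):
        if inode.epoch < self.epoch:
            # A snapshot happened since the last write: every block is now shared.
            inode.writable = [False] * len(inode.blocks)
            inode.epoch = self.epoch
        if not inode.writable[index]:
            # Read-only block: allocate a new one and repoint the live inode.
            inode.blocks[index] = next(self._next_block)
            inode.writable[index] = True
        # The new contents come from memory, so the old block is never copied.
        self.data[inode.blocks[index]] = payload
```

After a snapshot, the first write to a block redirects it to a fresh block and leaves the old contents untouched; later writes in the same epoch go in place.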

The copy-on-write mechanism in ext3cow involves allocating new inodes for the snapshotted versions of files, thus the next-inode pointer in the inode. (Note that the live inode necessarily remains the same so that things like NFS continue to work.) The versions of a file are represented by a linked list via the next-inode pointer, with the current inode at the head of the list to make the common-case access fast. When the inode for the snapshotted version of a file is allocated, all of the data block pointers (and indirect pointers, etc.) are the same for the snapshotted inode and the live inode. The live inode will only have data blocks unique to itself when modifications or additions are made to the file.

The final metadata modification they've made is to the directory entry data structure. For each directory entry, they've added a birth epoch and a death epoch, essentially a range during which that inode is live. This allows multiple different instances of a filename to exist in a directory (as the lifetimes wouldn't overlap), and it avoids muddying the namespace with old versions of files (e.g., an ls of a current directory will show only those directory entries with an unspecified death epoch.)
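To make the birth/death epoch idea concrete, here is a small Python sketch of a directory listing that filters on those ranges. The DirEntry type, the ls helper, and the half-open interval convention (an entry is visible from its birth up to, but not including, its death) are assumptions made for the example, not details taken from ext3cow.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DirEntry:
    name: str
    inode: int
    birth: int
    death: Optional[int] = None    # None stands for an unspecified death epoch

def ls(entries, at=None):
    """Names visible in the live directory (at=None) or at a past epoch."""
    if at is None:
        return sorted(e.name for e in entries if e.death is None)
    return sorted(
        e.name for e in entries
        if e.birth <= at and (e.death is None or at < e.death)
    )

entries = [
    DirEntry("notes.txt", inode=12, birth=5, death=20),   # deleted at epoch 20
    DirEntry("notes.txt", inode=31, birth=25),            # same name, recreated later
    DirEntry("old.log", inode=14, birth=1, death=8),
]
print(ls(entries))          # current view
print(ls(entries, at=6))    # view as of epoch 6
```

Two entries named notes.txt coexist in the same directory because their lifetimes do not overlap.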

So, given that I'm a proponent of ZFS, how does ext3cow compare? To some extent, that's not really a fair question, as the scope of ZFS is much larger than the scope of ext3cow. The main purpose of ext3cow is to provide snapshots, whereas snapshots in ZFS are almost merely a side-effect of other design goals. Ext3cow is certainly a creative use of what's available in ext3 to provide the ability to take snapshots.

Given that it's not fair to compare the two as file systems per se, I will point out some similarities and differences that are interesting to note.

A similarity between the two is in the amount of work required to take a snapshot. There is a constant amount of work involved in each case. For ext3cow, the work involved is merely updating the snapshot epoch in the superblock. For ZFS, the work involved is merely noting that we don't want to garbage-collect a specific version of the uberblock.

As for differences, the first to note is that ext3cow allocates new inodes for snapshotted files (but only when the current file changes.) Given that ext3 doesn't support dynamic inode allocation, this means that snapshotted files will consume that limited resource. While in general this is unlikely to be a problem, it could affect certain use cases such as frequently-snapshotted and highly active file systems or even moderately active file systems with long snapshot retention requirements. Of course, in these cases, disk space is more likely to be the overriding concern.

In contrast, ZFS does not allocate new inodes for snapshotted files; it uses the inode as it existed when the snapshot was taken. Of course, given that all file system updates are copy-on-write (even in the absence of snapshots), one could argue that ZFS is constantly allocating new inodes for modified files. From this point of view, the only resource ZFS isn't consuming for snapshots is inode numbers.

Another difference is that ext3cow snapshots are essentially only snapshots of the data and not a snapshot of the file system per se. Ext3cow doesn't support snapshotting the metadata itself. The metadata for a snapshotted file can change at some later time when the live file is modified. A new inode is allocated, the old metadata is copied over, and, at a minimum, the inode number and the next-inode pointer are modified.

In contrast, ZFS snapshots are a point-in-time view of the state of the entire file system, including metadata. This is a result of the copy-on-write mechanism in ZFS. File changes involve copy-on-write modifications, not only of the data blocks, but also of the metadata up the tree to the uberblock. When a snapshot is taken, that version of the uberblock is preserved, along with all of the metadata and data blocks referenced via this uberblock. The metadata doesn't need to be modified, and new inodes don't need to be allocated for snapshotted files.
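The idea of copying metadata up the tree can be sketched with nested Python dicts standing in for the block tree. This is a toy model, not how ZFS is implemented: cow_update is a made-up helper, and the top-level dict plays the role of the root that a snapshot would hold on to.

```python
def cow_update(node, path, value):
    """Return a new tree with value stored at path; untouched subtrees are shared."""
    if not path:
        return value
    new_node = dict(node)                 # copy only this level of the tree
    new_node[path[0]] = cow_update(node.get(path[0], {}), path[1:], value)
    return new_node

root0 = {"home": {"a.txt": "v1", "b.txt": "keep"}, "etc": {"conf": "x"}}
snapshot = root0                          # "taking a snapshot" = keeping the old root
root1 = cow_update(root0, ["home", "a.txt"], "v2")

print(snapshot["home"]["a.txt"])          # old root still sees the old data
print(root1["home"]["a.txt"])             # new root sees the update
print(root1["etc"] is root0["etc"])       # the untouched subtree is shared, not copied
```

Keeping the old root is all the snapshot needs: nothing reachable from it is ever modified in place.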

Of course, whether or not the metadata is maintained in a pristine state is mostly of theoretical interest. It likely doesn't matter for regulatory compliance, where the data itself is the primary concern. (Well, I'm sure a highly-paid lawyer could probably use this point to introduce doubt about the validity of snapshotted files, if there were ever a court case involving very large sums of money.) And I'm sure this subtlety doesn't matter to the student who recovers a file representing many hours of work when a deadline is only a very few hours away.

Another difference is the amount of work involved in recovering a snapshotted version of a file. With ext3cow, this involves a linked-list traversal of next-inode pointers while comparing the desired file time to the liveness range of each inode in turn. With ZFS, this involves merely looking up the file in an alternate version of the filesystem, which is no more work than finding the current version of that file. But again, this is likely a difference that won't matter. Looking up the snapshotted version of a file is the rare case, and according to the authors' experimental results, the penalty is negligible.
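The ext3cow lookup can be pictured as a walk down the version chain. The Python sketch below simplifies the liveness range to a single epoch per inode (a version is taken to be valid from its epoch until the next newer version's epoch); the VersionedInode type and find_version helper are hypothetical names for this example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VersionedInode:
    number: int
    epoch: int                                  # epoch from which this version is valid
    next: Optional["VersionedInode"] = None     # next-inode pointer to the older version

def find_version(head, when):
    """Walk from the live inode toward older versions; return the one live at `when`."""
    node = head
    while node is not None:
        if node.epoch <= when:
            return node
        node = node.next
    return None                                 # the file did not exist yet at `when`

oldest = VersionedInode(number=40, epoch=10)
middle = VersionedInode(number=41, epoch=20, next=oldest)
live = VersionedInode(number=7, epoch=30, next=middle)   # head of the list
print(find_version(live, 25).number)
```

The live inode is found on the first comparison, which matches the point that the common case stays fast and only snapshot lookups pay for the traversal.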

One subtle difference between the two file systems is that ext3cow allows for individual file snapshots, which ZFS doesn't support. That ext3cow supports this almost falls out of the implementation -- the epoch counter for the file is updated to the current time rather than the file system epoch, and the copy-on-write mechanism catches this condition. Conversely, the implementation of snapshots in ZFS prevents this. Snapshots are implemented using the uberblock. Taking a snapshot of a single file would involve also snapshotting the metadata up the tree to the uberblock. Guaranteeing that the snapshot is only of a single file would involve pruning away all other directories and files. Given that snapshotting is such a cheap operation in ZFS, it doesn't make sense to do so.

Friday, May 04, 2007


ChunkFS -- minimizing fsck time

I ran across chunkfs this week. (Another link and the paper.) Actually, I ran across a mention of it a month or so ago, but I only really looked at it this week, having been pointed there by Joerg's blog.

The point of chunkfs is to minimize fsck time. Having spent a sleepless night waiting for a 2TB ext3 file system to fsck, I can understand the desire to minimize this. As the paper points out, the trend in storage growth means that fsck's of normal filesystems will soon take on the order of days or tens of days. This quote from the paper is particularly interesting: "The fraction of time data is unavailable while running fsck may asymptotically approach unity in the absence of basic architectural changes."

The approach of chunkfs is to break up the file system into multiple mostly-independent fault domains, the intent being to limit the need for fsck to just one of these chunks (or possibly a small number of them.) The idea is similar in concept to having many small file systems rather than one large one, but it hides the details (and administrative hassle) from the user, so that the illusion is of a single large file system.

From this page we get this bit of information that isn't mentioned in the paper:
"Chunkfs makes these numbers unique by putting the chunk number in the upper eight bits of every inode number. As a result, there is a maximum of 256 chunks in any chunkfs filesystem."
(I haven't actually verified this for myself, so take this with a grain of salt.) Given this, that one-week fsck is reduced to less than an hour, assuming that only a single chunk needs to be checked. That 256x speed-up in fsck time is certainly impressive.
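A quick Python sketch of both the inode numbering scheme and the arithmetic behind that estimate. The 32-bit inode number width is an assumption made for the example (the quote only says "upper eight bits"), and the function names are made up.

```python
INODE_BITS = 32                      # assumed width; not stated in the quote
CHUNK_BITS = 8                       # "upper eight bits" -> at most 256 chunks
LOCAL_BITS = INODE_BITS - CHUNK_BITS

def make_inode_number(chunk, local):
    """Pack a chunk number and a per-chunk inode number into one global number."""
    if not 0 <= chunk < (1 << CHUNK_BITS):
        raise ValueError("chunk number out of range")
    if not 0 <= local < (1 << LOCAL_BITS):
        raise ValueError("per-chunk inode number out of range")
    return (chunk << LOCAL_BITS) | local

def split_inode_number(ino):
    """Recover (chunk, per-chunk inode number) from a global inode number."""
    return ino >> LOCAL_BITS, ino & ((1 << LOCAL_BITS) - 1)

def single_chunk_fsck_minutes(full_fsck_days, chunks=1 << CHUNK_BITS):
    """Time to check one chunk, assuming fsck time scales evenly across chunks."""
    return full_fsck_days * 24 * 60 / chunks

print(split_inode_number(make_inode_number(3, 5)))
print(single_chunk_fsck_minutes(7))  # one-week fsck, only one of 256 chunks checked
```

Under the even-scaling assumption, a one-week fsck divided across 256 chunks comes to a bit under 40 minutes for a single chunk.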

The authors also suggest the possibility of doing on-line fsck's. The feasibility of this suggestion is based on observations that metadata updates tend to be localized, with the prediction that chunks would spend a majority of their time idle with respect to metadata updates. On-line fsck's would eliminate the need for file system down time and could possibly prevent events like server reboots due to file system corruption (i.e., problems could be found in a controlled fashion.)

I would argue, however, that any effort to minimize fsck time is misguided. I'm of the opinion that the time would be better spent working on ways to eliminate the need for fsck.

Actually, let me step back from that statement for a second and extrapolate from the authors' work. They suggest a divide-and-conquer approach to create separate fault domains. These chunks would generally consist of some subset of the files in the file system. We could imagine growing the storage, and/or shrinking the chunk size, to such a degree that chunks are subsets of files, possibly even individual blocks. In this case, consistency checks would be done at the sub-file level. Given that the file system could support on-line fsck's, we could have an almost constant amount of background consistency checking of parts of file systems.

Taken to this extreme, chunkfs starts to look a lot like ZFS. ZFS is constantly verifying the consistency of metadata in its file systems. Admittedly, it's only doing so for the metadata for the files that are actually in use, and it's only doing so as that metadata is being read or written (assuming a scrub is not running), but one could argue that that's a benefit, as it's not wasting cycles checking data that's not in use. (There's the added benefit that ZFS is also checksumming the data, thus giving a correctness guarantee that chunkfs doesn't provide.)

I won't go so far as to argue that the ZFS implementation is the only possible way to eliminate the need for fsck, but I would argue that they've done the right thing in doing so.

Here's actually a pretty interesting quote with respect to this, ironically enough from one of the authors of chunkfs:
"The on-disk state is always valid. No more fsck, no need to replay a journal. We use a copy-on-write, transactional update system to correctly and safely update data on disk. There is no window where the on-disk state can be corrupted by a power cycle or system panic. This means that you are much less likely to lose data than on a file system that uses fsck or journal replay to repair on-disk data corruption after a crash. Yes, supposedly fsck-free file systems already exist - but then explain to me all the time I've spent waiting for fsck to finish on ext3 or logging UFS - and the resulting file system corruption."
And there's another interesting note from the paper:
"Those inclined to dismiss file system bugs as a significant source of file system corruption are invited to consider the recent XFS bug in Linux 2.6.17 requiring repair via the XFS file system repair program."
I had originally intended to point out that a filesystem repair program is another possible entry point for bugs, but this FAQ entry proved my point for me:
"To add insult to injury, xfs_repair(8) is currently not correcting these directories on detection of this corrupt state either. This xfs_repair issue is actively being worked on, and a fixed version will be available shortly."
Both of these problems have since been fixed, apparently, but this does give strength to my point. The added complexity of a file system repair program should be taken into consideration in this argument. Of course, one could argue that there is very little added complexity because the repair program shares much of the same code with the file system itself. This means, of course, that it shares the same bugs. On the other hand, one could maintain a separate code base for the repair program, but that is a considerable amount of added complexity.
