Wednesday, December 20, 2006
Unkillable processes
One of the blogs I read religiously is Ben Rockwood's. He has some interesting anecdotes (that's http://www.cuddletech.com/blog/pivot/entry.php?id=780, in case you get the spam warning instead of the blog) about using OpenSolaris in production at Joyent, including one about an unkillable process.
I mailed the link to a couple of former colleagues, mostly because I thought they might be interested in the NFS-over-ZFS anecdote (given that they work at an ISP.) Apparently I jinxed them -- just after getting in to work the next morning, they discovered an unkillable process on one of their Solaris 10 boxes. And it was also a process running in a zone, so it was impossible to reboot the zone to clear it up.
Sorry, guys.
(BTW, this appeared to be a deadlock situation. The process has two threads, one stuck in cv_wait() via exitlwps() and the other stuck in cv_wait() via tcp_close(). Given that I don't work there anymore, I couldn't really go crash-dump diving, but I'd bet that there were no other threads on the system that were going to call cv_signal() or cv_broadcast() on that particular CV.)
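To make that concrete, here's a minimal userland sketch of the same shape of hang, using pthreads rather than the kernel's cv_wait()/cv_signal() -- so it's an analogy, not the actual code path. A thread blocked on a condition variable stays blocked until some other thread signals or broadcasts it, and in the kernel a thread in cv_wait() (as opposed to cv_wait_sig()) isn't woken by signals either, which is why kill gets you nowhere:

#include <pthread.h>

/*
 * Hypothetical illustration only -- not kernel code.  The waiter blocks
 * on a condition variable whose predicate nobody ever makes true, so it
 * never returns from the wait.
 */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int done = 0;                    /* never set by any thread */

static void *
waiter(void *arg)
{
        (void) arg;
        pthread_mutex_lock(&lock);
        while (!done)
                pthread_cond_wait(&cv, &lock);  /* sleeps here forever */
        pthread_mutex_unlock(&lock);
        return (NULL);
}

int
main(void)
{
        pthread_t tid;

        pthread_create(&tid, NULL, waiter, NULL);
        pthread_join(tid, NULL);        /* never returns without a signal/broadcast */
        return (0);
}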
Friday, December 15, 2006
ZFS, DTrace, and fault domains
An interesting similarity between ZFS and DTrace occurred to me last night. One of the things ZFS gives you is the ability to sit outside the fault domains between an application and its data on disk and catch any corruption introduced anywhere in them. You can't rely on a RAID controller to catch corruption introduced between it and the server.
DTrace is similar in that it lets you look at different parts of the fault domain involved in running an application. That is, it lets you look at what's going on in different parts of that fault domain -- the application, the libraries it uses, the system calls it makes, and the function flow inside the kernel that implements those system calls. Traditional instrumentation tools generally let you look at only one part of that domain -- truss lets you watch the system call boundary, instrumented libraries or applications let you watch just that part of the fault domain, and so on.
Or maybe I'm just stretching things a bit in making this comparison.
Thursday, December 07, 2006
Problems with non-redundant ZFS
From watching the zfs-discuss mailing list, it seems that people are starting to experience problems with ZFS where it appears that ZFS is causing them to lose data (or at least rendering their data inaccessible, which may as well be data loss.)
The issue is with using ZFS in a non-redundant fashion on top of HW RAID where corruption is introduced by the hardware itself. What happens in these cases is that ZFS sees the corruption and stops letting you access the filesystem, which effectively means that your data is gone. Technically, there's still a valid ZFS filesystem on the disks, but because the hardware is introducing corruption, the ZFS software won't let you access it.
The complaint people seem to have is that ZFS isn't allowing them to salvage their data, where other filesystems would allow them access to their data. (Although what they don't seem to be considering is that they shouldn't be trusting the data that the other filesystem is giving them.)
From what I gather, these seem to be problems where letting ZFS handle (at least some of) the redundancy would help, given that ZFS would have the chance to apply its self-healing-data magic. But given that ZFS is being presented with a single LUN, it has nowhere to put corrected copies of data.
(And yeah, if something in the hardware chain is corrupting all of the data, ZFS redundancy won't help. But in this case, you're not going to be getting any data via any filesystem.)
There's a fairly lengthy thread about this here.
I recently saw a comparison of ZFS to TCP, in that both give you guarantees about the correctness of data being delivered to the application, and that comparison is pretty relevant to this problem. If TCP sees corruption in a data stream, it will not deliver the data (and because TCP also guarantees in-order delivery, the data after the corrupted data will not be delivered until the corruption is fixed.) ZFS is similar -- it will not deliver data that is corrupt.
Both ZFS and TCP have mechanisms for correcting corrupted data. In the case of TCP, the sender retransmits the data. In the case of ZFS, the software grabs a good copy of the data from its redundant location on disk, either from a mirror or from the parity data stored via raidz. But if the zpool has been created without redundancy, there is no good copy of the data to be found, and the data is effectively lost.
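To put the same policy in pseudo-code -- and this is just a sketch of the idea, not ZFS source; checksum_of() and read_copy() are invented stand-ins -- the read path only ever returns data whose checksum matches, and with a single non-redundant copy there's simply no second copy to retry:

#include <errno.h>
#include <string.h>

extern unsigned long long checksum_of(const void *buf, size_t len);
extern int read_copy(int copy, void *buf, size_t len);

/*
 * Sketch of the policy described above (not ZFS code): try each stored
 * copy of a block, return it only if its checksum matches the one kept
 * in the parent block pointer, and fail the read if no copy checks out.
 * With ncopies == 1 (a non-redundant pool), there is no fallback.
 */
int
read_block(void *buf, size_t len, unsigned long long expected, int ncopies)
{
        int copy;

        for (copy = 0; copy < ncopies; copy++) {
                if (read_copy(copy, buf, len) != 0)
                        continue;               /* I/O error; try the next copy */
                if (checksum_of(buf, len) == expected)
                        return (0);             /* good data (could also rewrite bad copies here) */
        }
        memset(buf, 0, len);                    /* never hand back corrupt data */
        return (EIO);
}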
Wednesday, December 06, 2006
How long has that thread been waiting
Okay, so this is actually pretty simple, but I'll write about it anyway.
I've been looking at a core file with a hung nfsd, and I wondered if I could figure out how long threads had been blocked on the RW lock in question. The lock itself has no concept of when it was grabbed. A reader/writer lock is implemented as a single-word data structure, where all but a few bits of that word are devoted either to the address of the thread holding the write lock or to the count of readers. Reader/writer locks are described on page 836 of the Solaris Internals book and in the kernel's rwlock header files.
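For what it's worth, here's roughly what that one word looks like -- this is from memory of the OpenSolaris sources, so treat the exact bit positions and values as an assumption rather than a quote. The low three bits are the flags that mdb's ::rwlock decodes (the B111 column in the output further down), and the rest of the word is either the owning thread's address or the reader count:

#include <stdio.h>
#include <stdint.h>

/*
 * Approximate layout of the rwlock word (assumed values, for
 * illustration): bit 0 = has waiters, bit 1 = write wanted,
 * bit 2 = write locked.  If write-locked, the remaining bits are the
 * owning kthread_t pointer (8-byte aligned, so its low bits are free);
 * otherwise they hold the count of readers.
 */
#define RW_HAS_WAITERS  0x1
#define RW_WRITE_WANTED 0x2
#define RW_WRITE_LOCKED 0x4

static void
decode(uintptr_t word)
{
        if (word & RW_WRITE_LOCKED)
                printf("write-held by thread %p\n",
                    (void *)(word & ~(uintptr_t)7));
        else
                printf("read-held by %lu reader(s)\n",
                    (unsigned long)(word >> 3));
        printf("waiters=%d write-wanted=%d\n",
            (word & RW_HAS_WAITERS) != 0, (word & RW_WRITE_WANTED) != 0);
}

int
main(void)
{
        /* the write-held lock from the ::rwlock output below: owner | 0x7 */
        decode((uintptr_t)0xfffffe849fb74120ULL | 7);
        return (0);
}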
Maybe some time information is kept in the turnstile itself? While the amount of time a thread spends waiting on a resource is a useful statistic, there's no need to store this information in that data structure[1], as the information is available elsewhere.
The kthread_t data structure contains this:
193 clock_t t_disp_time; /* last time this thread was running */
Given that the last thing the thread did while it was running was to request a resource that it then had to sleep on, the difference between now (stored in lbolt) and t_disp_time tells us how long the thread has been sleeping. So if we pick one of the threads waiting on a RW lock and investigate:
> ::turnstile ! awk '$3 != 0'
            ADDR             SOBJ  WTRS EPRI ITOR PRIOINV
ffffffff818ccac0 fffffe849958a150     5   60    0       0
fffffe93cd2335c8 ffffffff8875ccc0     5    4    0       0
fffffe93eedcc008 fffffe849958ab68     3   59    0       0
> fffffe849958a150::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
fffffe849958a150 fffffe849fb74120  B111 fffffe89fdd5fe40 (W)
                                    ||| fffffe8692f2aae0 (W)
                 WRITE_LOCKED ------+|| ffffffff83127720 (W)
                 WRITE_WANTED -------+| ffffffff817f3540 (W)
                  HAS_WAITERS --------+ ffffffff89f79f20 (R)
> fffffe89fdd5fe40::print kthread_t t_disp_time
t_disp_time = 0x549da3e9
> lbolt/X
lbolt:
lbolt:          54b25d0e
> hz/D
hz:
hz:             100
>
Subtracting t_disp_time from lbolt and dividing by hz, we see that this thread had been waiting a little over 3h46m when I grabbed this core file.
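For the record, the arithmetic (a throwaway snippet using the values from the dump above):

#include <stdio.h>

int
main(void)
{
        long lbolt = 0x54b25d0e;        /* current tick count from the dump */
        long disp = 0x549da3e9;         /* t_disp_time of the waiting thread */
        long hz = 100;
        long secs = (lbolt - disp) / hz;        /* 1358117 ticks -> 13581 seconds */

        printf("%ldh %ldm %lds\n", secs / 3600, (secs / 60) % 60, secs % 60);
        return (0);
}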
[1] I had originally written, "...there's no need to store this information in the data structures associated with a turnstile...", but a kthread_t could be considered one of the data structures associated with a turnstile, as the pointers maintaining the sleep queue are kept in the kthread_t structure itself (see also section 3.10 of the Solaris Internals book):
105 typedef struct _kthread {
106         struct _kthread *t_link;        /* dispq, sleepq, and free queue link */
    [ ... ]
280         struct _kthread *t_priforw;     /* sleepq per-priority sublist */
281         struct _kthread *t_priback;
282
283         struct sleepq   *t_sleepq;      /* sleep queue thread is waiting on */
Monday, December 04, 2006
Fun with fsdb
If you've been around long enough, you'll eventually see something weird like the following:
juser@server:/fs> ls -l
total 4
drwxr-xr-x   3 root     root         512 Dec  4 10:17 proj/
drwxr-xr-x   4 root     root         512 Nov 29 12:34 repl/
juser@server:/fs> cd proj
juser@server:/fs/proj> cd ..
..: Permission denied.
juser@server:/fs/proj>
What's happened is that the filesystem mounted at /fs/proj is mounted on a directory with insufficient permissions. But how can you see what the current permissions are? If you do an 'ls -li', you'll note that the inode number of the underlying directory is listed:
juser@server> ls -li
total 4
         6 drwxr-xr-x   3 root     root         512 Dec  4 10:17 proj/
         7 drwxr-xr-x   4 root     root         512 Nov 29 12:34 repl/
juser@server>
But how can you look at the information contained in that underlying inode? If you try to stat the file, you'll end up with information about the root inode of the filesystem mounted there. If you could make a hard link to the inode, you could access it via that link, but you can't make a hard link to a directory. You can use fsdb, however, to look directly at the filesystem in question:
juser@server> sudo fsdb /dev/rdsk/c0t0d0s0
fsdb of /dev/rdsk/c0t0d0s0 (Read only) -- last mounted on /
fs_clean is currently set to FSLOG
fs_state consistent (fs_clean CAN be trusted)
/dev/rdsk/c0t0d0s0 > :cd /fs
/dev/rdsk/c0t0d0s0 > :ls -l
/fs:
i#: 5           ./
i#: 2           ../
i#: 33371       .rsync/
i#: 16207       .rsync_root
i#: 6           proj/
i#: 7           repl/
/dev/rdsk/c0t0d0s0 >
Inside fsdb, you can move around the filesystem and get some information about what's there. But because fsdb is going directly to the disk and not going through the OS, it has no information about what's mounted there (and resolving a path like /fs/proj won't ever cross filesystem boundaries.) Below, we set the current inode and then get information about it:
/dev/rdsk/c0t0d0s0 > 6:inode
/dev/rdsk/c0t0d0s0 > ?i
i#: 6  md: d---rwx------  uid: 0  gid: 0  ln: 3  bs: 2  sz : c_flags : 0  200
db#0: 2fd
        accessed: Mon Dec  4 11:32:30 2006
        modified: Mon Dec  4 10:13:13 2006
        created : Mon Dec  4 11:32:59 2006
/dev/rdsk/c0t0d0s0 >
And we can see that the permissions on the file are very restrictive. We could unmount the filesystem, change the permissions, and remount, or we could keep playing with fsdb. (Note that this likely isn't the kind of thing you'd want to do on a production server if you like your job. Note also that you need the 'w' option to be able to write to the device.):
juser@server> sudo fsdb -o w /dev/rdsk/c0t0d0s0
fsdb of /dev/rdsk/c0t0d0s0 (Opened for write) -- last mounted on /
fs_clean is currently set to FSLOG
fs_state consistent (fs_clean CAN be trusted)
/dev/rdsk/c0t0d0s0 > 6:inode
/dev/rdsk/c0t0d0s0 > ?i
i#: 6  md: d---rwx------  uid: 0  gid: 0  ln: 3  bs: 2  sz : c_flags : 0  200
db#0: 2fd
        accessed: Mon Dec  4 11:32:30 2006
        modified: Mon Dec  4 10:13:13 2006
        created : Mon Dec  4 11:32:59 2006
/dev/rdsk/c0t0d0s0 > :md=+055
i#: 6  md: d---rwxr-xr-x  uid: 0  gid: 0  ln: 3  bs: 2  sz : c_flags : 0  200
db#0: 2fd
        accessed: Mon Dec  4 11:32:30 2006
        modified: Mon Dec  4 10:13:13 2006
        created : Mon Dec  4 11:32:59 2006
/dev/rdsk/c0t0d0s0 >
Friday, December 01, 2006
Turnstiles and MDB
In Solaris, turnstiles are a data structure used by some of the synchronization primitives in the kernel (mutexes and reader-writer locks, specifically.) They're similar to sleep queues, but they also deal with the priority inversion problem by allowing for priority inheritance.
(Priority inversion occurs when a high-priority thread is waiting for a lower-priority thread to release a resource it needs. Priority inheritance is a mechanism whereby the lower-priority thread gets raised to the higher priority so that it can release the resource more quickly.)
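The kernel's mechanism for this is the turnstile, but the same idea exists in userland: POSIX lets you create a mutex with the priority-inheritance protocol, so its holder gets boosted to the priority of the highest-priority waiter. A minimal sketch (error handling trimmed; whether the protocol is actually supported is platform-dependent):

#include <pthread.h>

/*
 * Userland analogue of priority inheritance: a mutex created with
 * PTHREAD_PRIO_INHERIT boosts its holder to the priority of the
 * highest-priority thread blocked on it.  Just an illustration of the
 * concept -- the kernel implements this for its own locks via turnstiles.
 */
int
make_pi_mutex(pthread_mutex_t *mp)
{
        pthread_mutexattr_t attr;
        int err;

        if ((err = pthread_mutexattr_init(&attr)) != 0)
                return (err);
        if ((err = pthread_mutexattr_setprotocol(&attr,
            PTHREAD_PRIO_INHERIT)) == 0)
                err = pthread_mutex_init(mp, &attr);
        (void) pthread_mutexattr_destroy(&attr);
        return (err);
}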
There's more information in the Solaris Internals book about turnstiles, but I wanted to discuss looking at turnstiles with MDB. The ::turnstile dcmd will list all of the turnstiles on your live system or in your crash dump. For example:
> ::turnstile ! head
            ADDR             SOBJ  WTRS EPRI ITOR PRIOINV
ffffffff81600000                0     0    0    0       0
ffffffff81600040                0     0    0    0       0
ffffffff81600080                0     0    0    0       0
ffffffff816000c0                0     0    0    0       0
ffffffff81600100                0     0    0    0       0
ffffffff81600140                0     0    0    0       0
ffffffff81600180 ffffffff88b3fd48     0  165    0       0
ffffffff816001c0 ffffffff812bad80     0   60    0       0
ffffffff81600200 ffffffff852c5f98     0   60    0       0
>
You get the addresses of the turnstile and the synchronization object associated with it, the number of waiters, and priority information. So, let's look at the turnstiles with waiters:
> ::turnstile ! awk '$3 != 0'
            ADDR             SOBJ  WTRS EPRI ITOR PRIOINV
ffffffff812e3748 ffffffff8c1f9570     2  164    0       0
ffffffff887b8340 ffffffff8e3ea688     1  165    0       0
ffffffff8193a980 ffffffff8d8f86f8     6  164    0       0
fffffe84cd15fe08 ffffffff8e3ea680     2  164    0       0
>
We have the addresses of the synchronization objects, so let's look at one (I happen to know that these are all reader-writer locks):
> ffffffff8c1f9570::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
ffffffff8c1f9570 ffffffff92480380  B111 ffffffff8888f7e0 (W)
                                    ||| ffffffffb0cea1e0 (W)
                 WRITE_LOCKED ------+||
                 WRITE_WANTED -------+|
                  HAS_WAITERS --------+
>
We can see who the owner is (the address of the data structure representing the thread), the value of the flags, and the list of waiters (if any.) We know this is currently being held as a write lock because the WRITE_LOCKED flag is 1, but also because the OWNER/COUNT lists the address of a thread rather than a count of readers.
And given the owner, we can examine the stack:
> ffffffff92480380::findstack
stack pointer for thread ffffffff92480380: fffffe8000dc04b0
[ fffffe8000dc04b0 _resume_from_idle+0xde() ]
  fffffe8000dc04e0 swtch+0x10b()
  fffffe8000dc0500 cv_wait+0x68()
  fffffe8000dc0550 top_end_sync+0xa3()
  fffffe8000dc05f0 ufs_write+0x32d()
  fffffe8000dc0600 fop_write+0xb()
  fffffe8000dc0890 rfs3_write+0x3a3()
  fffffe8000dc0b50 common_dispatch+0x585()
  fffffe8000dc0b60 rfs_dispatch+0x21()
  fffffe8000dc0c30 svc_getreq+0x17c()
  fffffe8000dc0c80 svc_run+0x124()
  fffffe8000dc0cb0 svc_do_run+0x88()
  fffffe8000dc0ed0 nfssys+0x50d()
  fffffe8000dc0f20 sys_syscall32+0xef()
>
So this thread is holding a reader-writer lock, and it appears to be waiting on a condition variable. As it turns out, nothing is ever going to call cv_broadcast() or cv_signal() on that condition variable, which means that the process is never going to release that RW lock, either. Which is, of course, why I'm looking at this crash dump in the first place.