Wednesday, December 20, 2006
Unkillable processes
One of the blogs I read religiously is Ben Rockwood's. He has some interesting anecdotes (that's http://www.cuddletech.com/blog/pivot/entry.php?id=780, in case you get the spam warning instead of the blog) about using OpenSolaris in production at Joyent, including one about an unkillable process.
I mailed the link to a couple of former colleagues, mostly because I thought they might be interested in the NFS-over-ZFS anecdote (they work at an ISP). Apparently I jinxed them: just after getting in to work the next morning, they discovered an unkillable process on one of their Solaris 10 boxes. It was also running in a zone, so it was impossible to reboot the zone to clear it up.
Sorry, guys.
(BTW, this appeared to be a deadlock situation. The process had two threads, one stuck in cv_wait() via exitlwps() and the other stuck in cv_wait() via tcp_close(). Given that I don't work there anymore, I couldn't really go crash-dump diving, but I'd bet that there were no other threads on the system that were going to call cv_signal() or cv_broadcast() on that particular CV.)
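For anyone who hasn't run into this failure mode, here is a rough user-space analogy in Python. It is not the kernel code and not a reconstruction of what happened on that box. The names (wait_for_wakeup, run_demo, exit_path, close_path) are made up for illustration, and the timeout exists only so the demo terminates; as far as I know, the kernel's cv_wait() has no timeout, which is exactly why the real threads never came back.

```python
import threading


def wait_for_wakeup(cond, timeout, results, name):
    """Block on a condition variable, roughly like a thread in cv_wait().

    The timeout is an assumption added for the demo; without it this
    thread would sleep forever, because nobody ever notifies `cond`.
    """
    with cond:
        # wait() returns False when the timeout expired without a notify.
        results[name] = cond.wait(timeout=timeout)


def run_demo(timeout=0.2):
    """Start two waiters and no signaler, then report who was woken."""
    results = {}
    # Hypothetical stand-ins for the two stuck threads; each gets its own
    # condition variable, and no thread ever calls notify() on either.
    waiters = [
        threading.Thread(
            target=wait_for_wakeup,
            args=(threading.Condition(), timeout, results, name),
        )
        for name in ("exit_path", "close_path")
    ]
    for t in waiters:
        t.start()
    for t in waiters:
        t.join()
    return results


if __name__ == "__main__":
    for name, woken in sorted(run_demo().items()):
        print(name, "woken by a signal" if woken else "never signalled")
```

Both waiters report that they were never signalled. Take the timeout away and you have two threads that sleep until something notifies them, which in this sketch is never.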