Thursday, November 30, 2006
Forcing a Solaris x86 kernel core dump
With a SPARC system, it's easy to force a kernel core dump: drop the system to the ok prompt and type 'sync'. Solaris x86 servers don't give you that, so you have to do something else.
The Solaris x86 FAQ suggests booting the server under kadb so that you can run this to get a core dump
$<systemdump

This information is a little bit outdated, as kadb has been replaced by kmdb, which is actually much nicer, 'cause you can load kmdb at run-time instead of having had to boot under kadb in the first place. But the above command will still force a dump of the system.
Of course, you could always do it with DTrace, too, if you like:
dtrace -w -n 'BEGIN{panic();}'
(The panic() function is provided as one of the destructive functions, which is why you need the -w flag.)
(And I was in full paranoid mode while testing this, first to make sure I was on the correct server when I tried it, but also while I had the command in my mouse buffer to copy it here. It's not something you want to accidentally paste into some random window.)
Wednesday, November 29, 2006
Recommended reading: Postmortem Object Type Identification
This paper makes for some interesting reading: Postmortem Object Type Identification by Bryan Cantrill.
The paper presents a method for determining the type of arbitrary memory objects in a core dump. In other words, given some random address in a core dump, what's the type of the object?
(And I guess that introduces the question, why should you care what the type of a random address in memory is? One very good reason is that if you're trying to track down a memory corruption problem, knowing the type of the objects near the memory corruption can help narrow down your search for the bad code.)
Determining the type of statically allocated objects is fairly easy -- you have the symbol table and type information. If the random address happens to match something in the symbol table, you have all the information you need. (I'm oversimplifying a bit, but it's an easy problem.)
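The easy case amounts to a range search over the symbol table. As a toy illustration (all of the symbol names, addresses, sizes, and types below are invented for the example), something like this is the whole job:

```python
import bisect

# Toy illustration of the easy case: a sorted symbol table mapping
# address ranges to names and types. All names, addresses, sizes, and
# types here are made up.
symbols = [(0x1000, 0x40, "kmem_flags", "int"),
           (0x2000, 0x100, "p0", "proc_t")]
addrs = [base for base, _, _, _ in symbols]

def type_of(addr):
    """Return (name, type) of the symbol containing addr, or None."""
    i = bisect.bisect_right(addrs, addr) - 1
    if i >= 0:
        base, size, name, typ = symbols[i]
        if base <= addr < base + size:
            return name, typ
    return None

print(type_of(0x2010))  # lands inside p0, so it's a proc_t
```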
The harder problem is determining the type of a dynamically-allocated object. You don't have a symbol table handy to tell you what all the locations in memory are, because you didn't have this information handy at compile time. You could store information at run-time about the types of objects that you're allocating, but that becomes a hairy problem. The likeliest solution would involve modifying your memory allocation library to store this information, but you would need to pass type information to the memory allocation routine, which might not be feasible. (Although it should be noted that the kernel slab allocator in Solaris provides some of this information, as objects allocated from certain object caches are of known type.)
This paper presents a method for inferring the types of dynamically-allocated objects. At the core of this method is a fairly standard iterative graph-traversal algorithm for propagating information from nodes (i.e., memory objects) of known type to nodes of unknown type. Given that almost all dynamically-allocated objects are rooted in statically-allocated objects, the algorithm can provide very good coverage of dynamically-allocated objects. (And the implementation in MDB makes use of the object-cache type knowledge mentioned above as an optimization during initialization.)
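The propagation idea can be sketched in a few lines. This is a toy version, not the paper's actual implementation: an edge (member, target) says "this object points at target through the named member", and member_types records what type each member points to; types then flow outward from statically-allocated roots of known type. All the names here are invented.

```python
from collections import deque

# Member -> pointed-to type, as the type definitions would tell us.
member_types = {("proc_t", "p_cred"): "cred_t",
                ("cred_t", "cr_zone"): "zone_t"}
# Pointer graph among memory objects: node -> [(member, target), ...]
edges = {"objA": [("p_cred", "objB")],
         "objB": [("cr_zone", "objC")],
         "objC": []}
types = {"objA": "proc_t"}  # the known, statically-allocated root

# Standard worklist traversal: propagate types along pointer edges.
work = deque(types)
while work:
    node = work.popleft()
    for member, target in edges[node]:
        inferred = member_types.get((types[node], member))
        if inferred and target not in types:
            types[target] = inferred
            work.append(target)

print(types)
```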
The C language allows for uses that reduce the effectiveness of the algorithm, but the paper presents some heuristics to handle those. The paper also presents some interesting applications of this method that aren't directly related to debugging memory corruption.
Definitely worth reading.
Monday, November 27, 2006
Unlinked files and MDB
Occasionally the unlinked file problem pops up. For the sysadmin, this generally starts with a log file filling up a file system. Someone removes the original file, generally because they've moved it across a filesystem boundary (or gzipped it to a target file on another file system). But even though the offending file is gone from the directory, the file system is still full, because some process has the file open, and as long as there's an open reference to that file, the file system can't reclaim the space.
There are various ways to find that process. The first time I saw this, some 12 years ago or so on a SunOS 4 server, I used some tool (probably lsof) to generate a list of the inode numbers of all the open files on the system matching the particular device number. I then used some other tool (probably find) to generate a list of all the inodes on that particular device. I compared the sorted lists to find the inodes that didn't exist in the filesystem and then went back to the lsof output to find the process holding open the unlinked file.
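The comparison at the heart of that manual process is just a set difference. A minimal sketch, with made-up sample data standing in for the lsof and find output:

```python
# Sketch of the manual process described above: one list of inodes of
# open files on the device (as lsof would report), one list of inodes
# present in the filesystem (as find would report), and the difference
# between them. The data here is invented for illustration.
open_inodes = {1234: "pid 4291", 5678: "pid 811"}  # inode -> holder, from lsof
fs_inodes = {5678, 9012}                           # inodes from find on that fs

unlinked = [(ino, holder) for ino, holder in open_inodes.items()
            if ino not in fs_inodes]
for ino, holder in unlinked:
    print(f"inode {ino} is held open by {holder} but has no directory entry")
```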
In a later job, I found that someone had essentially scripted the above process (which was logical, 'cause it satisfied Cardinal Rule number 1: automate complex, repetitive processes). But it was still fairly slow. And then one day I noticed something about the entries in /proc/*/fd/ that made this much faster.
Under Solaris, an unlinked file shows up with a link count of 0:
c---------   1 juser    tty       24,  1 Nov 27 15:16 0
c---------   1 juser    tty       24,  1 Nov 27 15:16 1
c---------   1 juser    tty       24,  1 Nov 27 15:16 2
-r--r--r--   0 juser    staff    1048576 Nov 27 10:56 3
Under Linux (or at least the version of RedHat running on a box I have access to), an unlinked file is tagged as deleted:
lrwx------   1 juser    staff     64 Nov 27 20:19 0 -> /dev/pts/0
lrwx------   1 juser    staff     64 Nov 27 20:19 1 -> /dev/pts/0
lrwx------   1 juser    staff     64 Nov 27 20:19 2 -> /dev/pts/0
lr-x------   1 juser    staff     64 Nov 27 20:19 3 -> /var/tmp/bigfile (deleted)
Something interesting to note here is that Linux gives you the filename, which makes sense given that the fd entry is presented as a symbolic link. Solaris doesn't give you the filename, but a quick MDB invocation can get that for you:
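The Linux behavior is easy to demonstrate from a script, assuming a Linux-style /proc. This just recreates the scenario in miniature: create a file, hold it open, unlink it, and read the fd symlink:

```python
import os
import tempfile

# A miniature version of the "full filesystem, no visible file"
# scenario: create a file, keep the descriptor open, and unlink it.
fd, path = tempfile.mkstemp()
os.write(fd, b"some log data")
os.unlink(path)

# On Linux, the /proc/<pid>/fd entry for the still-open descriptor is a
# symlink whose target gets a "(deleted)" suffix once the last
# directory entry is gone.
target = os.readlink(f"/proc/{os.getpid()}/fd/{fd}")
print(target)
os.close(fd)
```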
# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace cpu.AuthenticAMD.15 ufs ip sctp usba random fcp fctl lofs md cpc fcip crypto logindmux ptm nfs ]
> 0t4291::pid2proc | ::fd 3 | ::print file_t
{
    f_tlock = {
        _opaque = [ 0 ]
    }
    f_flag = 0x2001
    f_pad = 0xbadd
    f_vnode = 0xffffffff860fcb40
    f_offset = 0
    f_cred = 0xffffffff84cefc50
    f_audit_data = 0xffffffff8b709638
    f_count = 0x1
}
> 0xffffffff860fcb40::vnode2path
/var/tmp/bigfile
>
And of course, you could just do this as a single command line:
# echo "0t4291::pid2proc | ::fd 3 | ::print file_t" | /usr/bin/mdb -k | grep f_vnode | awk '{printf("%s::vnode2path\n",$NF);}' | /usr/bin/mdb -k
/var/tmp/bigfile
#
There are two steps to this: "0t4291::pid2proc | ::fd 3 | ::print file_t" first turns the pid 4291 into a pointer to a proc structure. Given that, we ask for the pointer to the file_t structure associated with file descriptor 3 of that process, and then we print it based on its type. The second step is to take the pointer to the vnode and print its path (::vnode2path).
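The grep/awk stage of that pipeline is just scraping the f_vnode pointer out of the ::print output so it can be fed back into mdb. The same extraction sketched in Python, run against a fragment of the sample output above:

```python
# Pull the pointer off the f_vnode line and append the ::vnode2path
# dcmd, mimicking what the grep | awk stage does.
sample = """f_flag = 0x2001
f_pad = 0xbadd
f_vnode = 0xffffffff860fcb40
f_offset = 0"""

cmd = next(line.split()[-1] + "::vnode2path"
           for line in sample.splitlines() if "f_vnode" in line)
print(cmd)  # the command line to feed back into mdb -k
```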
And doing a search shows a feature of lsof that I hadn't encountered yet: 'lsof +L1' will show you all of the open files on the system with a link count less than one. This blog is one among a few that mention using lsof this way. Lsof still won't give you the name of the unlinked files, though, so the mdb trick still comes in handy.