To timidly go where many have gone before: August 2006

Tuesday, August 29, 2006

Grabbing the umask of a running process via MDB

Wow, this was actually easier than I thought. I set out to see if I could determine the umask of a running process via MDB. I looked for other ways to get this information, but I didn't come across anything. The umask isn't an environment variable, so something like "pargs -e" won't tell you anything.

So the umask of a process is just a field in the data structure that defines a process, proc_t. Specifically, it's the u_cmask field of the user structure embedded in proc_t as p_user, so you can just do something like this (output slightly modified for formatting):

server# mdb -k
>  ::pgrep sshd
S    PID PPID PGID  SID UID             ADDR NAME
R   6311    1 6311 6311   0 ffffffff90064d48 sshd
> ffffffff90064d48::print -t proc_t ! grep u_cmask
        mode_t u_cmask = 0x12
>

And, of course, there are slightly different ways of doing the same thing. For example:

> 0t6311::pid2proc | ::print -t proc_t p_user.u_cmask
mode_t p_user.u_cmask = 0x12
>

So here we have a umask of 022. (Note that mdb prints it in hex above, not octal: 0x12 is 18 decimal, which is 022 octal.)
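If you'd rather not do the hex-to-octal conversion in your head, it's a one-liner in C. This is just a sketch; format_umask_octal() is a name I made up for illustration, not anything from mdb or libc.

```c
#include <stdio.h>

/*
 * Format a mode mask the way umask(1) prints it: zero-padded octal.
 * format_umask_octal() is a hypothetical helper, not a library function.
 * Returns the number of characters snprintf() wanted to write.
 */
static int
format_umask_octal(unsigned int mask, char *buf, size_t len)
{
        return (snprintf(buf, len, "%03o", mask));
}
```

Passing 0x12 fills the buffer with "022". mdb should also be able to do the conversion itself with the = dcmd and an octal format character (something like 0x12=O), though I haven't pasted that output here.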

(I seem to remember having Googled this sometime late last week and coming up with nothing. My search history doesn't bear witness to this, though, and a quick Google of "umask running process solaris" points to this thread.)

# posted by Chad Mynhier @ 6:45 PM 2 comments

Wednesday, August 23, 2006

Diving into a kernel crash dump and banging my head on the bottom

I previously gave an example of diving into a kernel crash dump with mdb. In that case, I was lucky enough to have in a register the pointer to the data structure I wanted to look at. I'm looking at another crash dump, and I'm not so lucky this time. I have to go hunting the pointer I'm interested in.

Here's the backtrace:

> $C
fffffe800017fc30 strcmp()
fffffe800017fc90 vfs_setmntopt_nolock+0x147()
fffffe800017fce0 vfs_parsemntopts+0x96()
fffffe800017fe10 domount+0xc87()
fffffe800017fe90 mount+0x105()
fffffe800017fed0 syscall_ap+0x97()
fffffe800017ff20 sys_syscall32+0xef()
00000000080b2b80 0xfe45a0cc()
>

The basic problem is that strcmp() is getting passed a NULL pointer. I won't go into the details of that here; what I'm interested in is determining which filesystem was being mounted. domount() is passed a pointer to a vnode, so I'm going to try looking there.

If this were a straight x86 box, I'd be happy. All arguments are passed on the stack, so things are very straightforward. I'd even have the arguments listed in the backtrace, so there'd be no more work than a cut and paste. But this is an x64 box, and arguments are passed in registers, so I have to manually track the value I want as it's moved from the register in which it was passed to the location where it was saved. It may have been saved on the stack, which makes life (relatively) easy, but it may have been saved into a non-volatile register, in which case I need to track it through succeeding stack frames until it gets pushed onto the stack. (Well, okay, this is just basic recursion, with "getting pushed onto the stack" as the base case.)

So here's what domount() looks like:

int
domount(char *fsname, struct mounta *uap, vnode_t *vp, struct cred *credp,
 struct vfs **vfspp)
{

Okay, so the vnode pointer I'm interested in is argument 3. And, as everyone knows (or at least can figure out after looking at a good reference), the third argument is passed in %rdx. What I need to do is track where domount() stores this:
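For anyone without that reference handy, the register order for integer and pointer arguments in the System V AMD64 calling convention can be written down as a small lookup table. This is only an illustration; sysv_arg_register() is a hypothetical helper, and it ignores floating-point arguments and structures passed by value.

```c
#include <stddef.h>

/*
 * Integer/pointer argument registers in the System V AMD64 calling
 * convention, in argument order.
 */
static const char *const sysv_int_arg_regs[] = {
        "%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9"
};

/*
 * Return the register holding 1-based argument n, or NULL if that
 * argument is passed on the stack.  Hypothetical helper for illustration.
 */
static const char *
sysv_arg_register(int n)
{
        int nregs = (int)(sizeof (sysv_int_arg_regs) /
            sizeof (sysv_int_arg_regs[0]));

        if (n < 1 || n > nregs)
                return (NULL);
        return (sysv_int_arg_regs[n - 1]);
}
```

sysv_arg_register(3) gives "%rdx", which is where vp arrives in domount().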

> domount::dis
domount:                        pushq  %rbp
domount+1:                      movq   %rsp,%rbp
domount+4:                      pushq  %r15
domount+6:                      movq   %rdx,%r15
[ ... ]

Okay, so domount() stores %rdx into %r15 (a non-volatile register). This means more work, as I have to go look at vfs_parsemntopts() to see where it stores %r15. But first, let me check that %r15 isn't overwritten anywhere else in domount() before the instruction of interest (domount+0xc87):

> domount::dis ! grep '%r15$'
domount+4:                      pushq  %r15
domount+6:                      movq   %rdx,%r15
domount+0x37f:                  popq   %r15
domount+0x66e:                  popq   %r15
>

Okay, so domount() is overwriting the register in a couple of places, and they're before the instruction of interest. So I check them out:

> domount+0x37f::dis
domount+0x35b:                  movl   %eax,%r14d
domount+0x35e:                  je     -0x29c   
domount+0x364:                  movl   $0x16,%eax
domount+0x369:                  cmpl   $0x4e,%r14d
domount+0x36d:                  cmovl.ne %r14d,%eax
domount+0x371:                  addq   $0xf8,%rsp
domount+0x378:                  popq   %rbx
domount+0x379:                  popq   %r12
domount+0x37b:                  popq   %r13
domount+0x37d:                  popq   %r14
domount+0x37f:                  popq   %r15
domount+0x381:                  leave
domount+0x382:                  ret
[ ... ]

So this looks like it's just an early exit from the function, and the second instance is similar, so I probably don't have to worry about these two cases. So that leaves me looking at vfs_parsemntopts():

> vfs_parsemntopts::dis
vfs_parsemntopts:               pushq  %rbp
vfs_parsemntopts+1:             movq   %rsp,%rbp
vfs_parsemntopts+4:             pushq  %r15
vfs_parsemntopts+6:             movl   $0x1,%r15d
vfs_parsemntopts+0xc:           pushq  %r14
vfs_parsemntopts+0xe:           pushq  %r13
vfs_parsemntopts+0x10:          pushq  %r12
vfs_parsemntopts+0x12:          pushq  %rbx
vfs_parsemntopts+0x13:          movq   %rsi,%rbx
vfs_parsemntopts+0x16:          subq   $0x18,%rsp
[ ... ]

Woohoo! vfs_parsemntopts() pushes %r15 onto the stack, so I'm done looking for the vnode pointer. So, pull it off the stack, dereference it as a vnode_t, and I get the name of the filesystem that was being mounted (or at least a cached guess):

> fffffe800017fce0-8/J
0xfffffe800017fcd8:             ffffffff827afdc0
> ffffffff827afdc0::print -t vnode_t
[ ... ]
char *v_path = 0xffffffff97b11e70 "/netapp/some/filesystem"
[ ... ]
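The "-8" in that address comes straight from the function prologue: after pushq %rbp and movq %rsp,%rbp, each further push lands 8 bytes lower than the last. Here's a sketch of that arithmetic. saved_reg_slot() is a hypothetical helper, and it assumes nothing else adjusts %rsp between the prologue and the push in question (true for the vfs_parsemntopts() disassembly above, where the subq comes after all the pushes).

```c
#include <stdint.h>

/*
 * Address of the Nth register pushed right after the standard prologue
 * (pushq %rbp; movq %rsp,%rbp).  push_index is 1 for the first push.
 * Hypothetical helper; assumes no other stack adjustment happens before
 * the push we care about.
 */
static uint64_t
saved_reg_slot(uint64_t frame_ptr, unsigned int push_index)
{
        return (frame_ptr - 8ULL * push_index);
}
```

With the frame pointer fffffe800017fce0 from the backtrace, the first push (%r15) lands at fffffe800017fcd8, which is the address I dumped with /J.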

# posted by Chad Mynhier @ 7:30 PM 0 comments

Sunday, August 20, 2006

Fun with DTrace -- tracing I/O through a pipe

I was tracking down a problem where one possible avenue of exploring what was going on was to use DTrace to track the data being written to and read from a pipe. I didn't go down that path, as I figured out the problem using other methods. But it seemed like a good exercise, so I followed through on it.

This wasn't just a simple case of grabbing the file descriptors returned by pipe() and tracking read()'s and write()'s made by that pid using those file descriptors. There was an intervening fork() and the dup()'s necessary to use the two ends of the pipe for stdin and stdout. I could have tracked the fork() and the dup()'s, but I chose to try it a different way.
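For context, the pipe/fork/dup pattern I'm describing looks roughly like the following. This is not the program I was debugging, just a minimal stand-in; pipe_roundtrip() is a made-up name, and it only wires up the child's stdout rather than both ends.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical test target: the child makes the write end of a pipe its
 * stdout and writes msg; the parent reads it back from the read end.
 * Returns the number of bytes read, or -1 on error.
 */
static ssize_t
pipe_roundtrip(const char *msg, char *buf, size_t len)
{
        int     fds[2];
        pid_t   pid;
        ssize_t n;

        if (pipe(fds) == -1)
                return (-1);
        if ((pid = fork()) == -1) {
                close(fds[0]);
                close(fds[1]);
                return (-1);
        }
        if (pid == 0) {
                /* Child: after dup2(), fd 1 names the same file as fds[1]. */
                close(fds[0]);
                if (dup2(fds[1], STDOUT_FILENO) == -1)
                        _exit(1);
                if (fds[1] != STDOUT_FILENO)
                        close(fds[1]);
                if (write(STDOUT_FILENO, msg, strlen(msg)) == -1)
                        _exit(1);
                _exit(0);
        }
        /* Parent: drop our copy of the write end, then read. */
        close(fds[1]);
        n = read(fds[0], buf, len);
        close(fds[0]);
        (void) waitpid(pid, NULL, 0);
        return (n);
}
```

The point is that the child's write() goes to fd 1, not to the descriptor number pipe() returned, so matching on the original descriptor numbers alone would miss it.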

A file descriptor is nothing more than an index into the file descriptor table associated with a process. Somewhere between the file descriptor being passed to the read() and write() system calls and the underlying function that actually reads from or writes to that file, the file descriptor must necessarily be mapped to the data structure representing the file. This happens in this file at the top of the read() code (and similarly for write()):

ssize_t
read(int fdes, void *cbuf, size_t count)
{
[ ... ]
  if ((fp = getf(fdes)) == NULL)
   return (set_errno(EBADF));

The return value of the getf() function (a struct file *) is what I'm looking for, the pointer to the process-independent data structure that I want to track. Once I have the file pointer from the pipe() system call, I can trace all the reads and writes using that pointer. Given that that value is only determined after entering read() or write(), I'll need to perform some speculation to determine which read()'s and write()'s I'm interested in.
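To make the descriptor-versus-file-pointer distinction concrete, here's a toy model of a per-process descriptor table. None of this is the kernel's code; toy_getf(), toy_dup2() and the structures are names I invented, and the real getf() does locking and reference counting that I've left out.

```c
#include <stddef.h>

#define TOY_NFDS        16

/* Stand-in for the kernel's process-independent struct file. */
struct toy_file {
        int     refcnt;
};

/* Stand-in for a process: just a table mapping fd -> file pointer. */
struct toy_proc {
        struct toy_file *fdtab[TOY_NFDS];
};

/* Map a descriptor to its file pointer, or NULL for a bad descriptor. */
static struct toy_file *
toy_getf(const struct toy_proc *p, int fd)
{
        if (fd < 0 || fd >= TOY_NFDS)
                return (NULL);
        return (p->fdtab[fd]);
}

/* Make newfd refer to the same file as oldfd (no close of newfd here). */
static int
toy_dup2(struct toy_proc *p, int oldfd, int newfd)
{
        struct toy_file *fp = toy_getf(p, oldfd);

        if (fp == NULL || newfd < 0 || newfd >= TOY_NFDS)
                return (-1);
        p->fdtab[newfd] = fp;
        fp->refcnt++;
        return (newfd);
}
```

After toy_dup2(&p, 3, 1), descriptors 1 and 3 both map to the same toy_file, which is why matching on the file pointer keeps working across dup() where matching on the descriptor number does not.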

So how do I determine the two corresponding values being used in pipe()? After having allocated two file pointers and two file descriptors, pipe() does the following to associate the file pointers to the file descriptors:

 setf(fd1, fp1);
 setf(fd2, fp2);

All very logical and straightforward. Based on the above, I wrote the following DTrace script. Note that it's actually not what I represented above, as this code assumes we know the pid of the process calling pipe(). But it's still reasonably representative, as tracing the reads and writes depends solely on the file pointer.

syscall::pipe:entry
/ pid == $target /
{
        printf("Pid %d called pipe()\n", pid);
        self->trace_pipe = 1;
        self->first_fp = 0;
}

/*
 * Grab the file pointer from the second call to setf() within pipe().  Note 
 * that we want to do this first to avoid hitting this predicate just after 
 * we've set self->first_fp.
 */
fbt::setf:entry
/ self->trace_pipe && self->first_fp != 0 /
{
        printf("fd == %d second fp == 0x%x\n",arg0,arg1);
        self->second_fp = arg1;
}

/*
 * Grab the file pointer from the first call to setf() within pipe().
 */
fbt::setf:entry
/ self->trace_pipe && self->first_fp == 0 /
{
        printf("fd == %d first fp == 0x%x\n",arg0,arg1);
        self->first_fp = arg1;
}

syscall::pipe:return
/ self->trace_pipe /
{
        self->trace_pipe = 0;
}

/* 
 * Note the speculation.  On entry to these system calls, we only have a file 
 * descriptor.  We commit the speculation when we know that the fd maps to the
 * file pointer of interest.
 */
syscall::write:entry,
syscall::read:entry
{
        self->spec = speculation();
        speculate(self->spec);

        printf("%s %d bytes %s fd %d\n",probefunc,arg2,
                (probefunc == "write" ? "to" : "from" ),arg0);
}

fbt::getf:return
/ self->spec && ( arg1 == self->first_fp || arg1 == self->second_fp ) /
{
        commit(self->spec);
        self->spec = 0;
}

fbt::getf:return
/ self->spec && arg1 != self->first_fp && arg1 != self->second_fp /
{
        discard(self->spec);
        self->spec = 0;
}

# posted by Chad Mynhier @ 10:16 AM 0 comments

Tuesday, August 15, 2006

VMware (or Parallels) as an educational tool

I recently installed Solaris x86 under Parallels on a MacBook. As I was rebooting the Solaris VM, I noticed the Parallels BIOS message briefly flash up before starting to boot the OS. This got me thinking about the operating systems course at Princeton. As it existed during my brief stay there, the course involved writing an OS from the ground up on Intel hardware, with no simulator to hide the details away. I wish I'd taken the course while I was there, but I'd taken an OS course while getting my Master's, and I didn't see the need in doing it over again. Unfortunately, the OS course I took involved writing an OS on top of a simulator, so I didn't get the experience of doing the low-level stuff for myself.

Reminiscences aside, it occurred to me that Parallels (or VMware) would be a great tool for an operating systems course. You could do all the low-level programming without needing to constantly reboot the machine you're working on. The fix-compile-test cycle would be much shorter since your working environment would be persistent.

I'm not the first person to think of it, though. This operating systems course at Columbia used VMware, although the programming assignments aren't exactly what I had in mind. My ideal OS course involves writing at least the basics of an OS from the ground up, while the course I link to involves various modifications to the Linux kernel. I won't try to argue that doing so isn't useful; it's just not what I would want out of an OS course.

# posted by Chad Mynhier @ 8:03 PM 0 comments

Monday, August 14, 2006

Fun with MDB

I don't have much experience with mdb, so I thought I'd dig into a kernel crash dump to see what I could figure out. This is a server connected to what might be faulty storage, so what I'm trying to determine is whether the storage might have had anything to do with the kernel panic.

First off, here's the stack backtrace:

server:/var/crash/server> sudo mdb -k unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace 
ufs md ip sctp usba random fcp fctl lofs nfs ptm 
logindmux ipc crypto fcip ]
> $C
fffffe800076d9d0 alloccg+0x48()
fffffe800076da20 hashalloc+0xb4()
fffffe800076da90 alloc+0x10b()
fffffe800076dc40 bmap_write+0xa74()
fffffe800076dd40 wrip+0x759()
fffffe800076dde0 ufs_write+0x211()
fffffe800076ddf0 fop_write+0xb()
fffffe800076de00 lo_write+0x11()
fffffe800076de10 fop_write+0xb()
fffffe800076dec0 write+0x287()
fffffe800076ded0 write32+0xe()
fffffe800076df20 sys_syscall32+0xef()
>

It died in alloccg() (allocate cylinder group), one of the lower-level UFS functions, so the backtrace hasn't ruled out the storage. So, given the above, how do I determine what file was being written to? The top of the backtrace looks like a good place to start, so I check out the definition of alloccg(). The first argument passed in is the inode:

static daddr_t
alloccg(struct inode *ip, int cg, daddr_t bpref, int size)
{

The inode itself doesn't contain the filename, but from the definition of an inode, we have this:

typedef struct inode {
[ ... ]
        struct  vnode *i_vnode; /* vnode associated with this inode */
[ ... ]
}

and we can get the pathname of the file from the vnode:

typedef struct vnode {
[ ... ]
        char            *v_path;        /* cached path */
[ ... ]
}

Mdb is aware of data structures, and can interpret raw memory addresses with respect to those data structures. So, if we have the pointer to that inode, we can figure out the path. In the 32-bit x86 world, arguments are passed on the stack, so the arguments would have shown up in the backtrace. In the x64 world, arguments are passed in registers, which are overwritten by succeeding function calls. There are ways to trace arguments down (functions that need to reference their arguments after making a nested function call need to save the register values somewhere, either on the stack or in non-volatile registers), but in this case, the appropriate register (%rdi for the first argument) contains what we need.

> ::regs
%rax = 0xfffffffffffff000                 %r9  = 0xffffffffaf62ab18
%rbx = 0x0000000000002000                 %r10 = 0xffffffff9e403559
%rcx = 0x0000000000002000                 %r11 = 0xfffffffffbcc2de0 apic_cr8pri
%rdx = 0xffffffff83dd2000                 %r12 = 0xffffffff83c79000
%rsi = 0x00000000ffffff00                 %r13 = 0xffffffffbf4bf500
%rdi = 0xffffffffbf4bf500                 %r14 = 0xfffffffffbad45e0 alloccg
%r8  = 0x0000000000000000                 %r15 = 0xffffffff832eb1c0

%rip = 0xfffffffffbad4628 alloccg+0x48
%rbp = 0xfffffe800076d9d0
%rsp = 0xfffffe800076d960
%rflags = 0x00010297
  id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
  status=

                        %cs = 0x0028    %ds = 0x0043    %es = 0x0043
%trapno = 0xe           %fs = 0x0000    fsbase = 0x00000000fbc22ae0
   %err = 0x0           %gs = 0x01c3    gsbase = 0x0000000000000000
>
> 0xffffffffbf4bf500::print -t "struct inode"
{
[ ... ]
    struct vnode *i_vnode = 0xffffffffbf4bedc0
[ ... ]
>
> 0xffffffffbf4bedc0::print -t "struct vnode"
{
[ ... ]
    char *v_path = 0xffffffff863952f8 "/fs/data/somefile"
[ ... ]
>

And there's the file that was being written when the panic occurred in alloccg().
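The two ::print steps are just mdb following the same pointer chain that C code would. As a sketch, with cut-down stand-ins for the real structures (toy_inode, toy_vnode and toy_inode_path() are hypothetical names, and the real structures have many more fields):

```c
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures; only the fields used here. */
struct toy_vnode {
        char                    *v_path;        /* cached path */
};

struct toy_inode {
        struct toy_vnode        *i_vnode;       /* vnode for this inode */
};

/* Follow inode -> vnode -> v_path, returning NULL if a link is missing. */
static const char *
toy_inode_path(const struct toy_inode *ip)
{
        if (ip == NULL || ip->i_vnode == NULL)
                return (NULL);
        return (ip->i_vnode->v_path);
}
```

Keep in mind that v_path is a cached path, so what you get back is the kernel's best guess at the name rather than a guarantee.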

For reference, I used the new Solaris Performance and Tools book. I found some very detailed information on argument passing in the x64 world (specifically with respect to kernel crash dump analysis) here, an as-yet-unpublished book by Frank Hofmann at Sun.

# posted by Chad Mynhier @ 7:31 PM 2 comments

Tuesday, August 01, 2006

Playing with process contracts in Solaris 10

Even though I knew the general details of how process contracts work in Solaris 10, and had seen the list of functions used to manipulate them, I hadn't actually tried the API until last week. I started playing with it because I wanted to re-implement Sun's process contract handling in sshd so that I could submit it as a patch to OpenSSH.

I first went looking for documentation on the API. There are obviously the man pages for contract(4), libcontract(3CONTRACT), and all of the library calls, but I couldn't find much more than that aside from some threads on the smf-discuss@opensolaris.org forum. The man pages contain all you need; it's just a little harder to dig the information out of them than it would be from other kinds of documentation.

I posed some questions about process contracts and then tried to answer them programmatically. The first question I asked was, how do I determine what process contract I belong to? In retrospect, this may not even be a useful question, but it was the first one to come to mind. I have yet to figure this one out, though. I spent an hour or two trying to figure it out, but I eventually moved on, as it was tangential to what I really wanted to accomplish.

The next obvious question was, how do I create a new process contract? This is fairly simple: you open a new process contract template, set the terms of the contract, and activate the template. Once you've done this, a fork() will create a child in a new process contract. Here's an example:

int
pre_fork_activate_template(void)
{
        int             tmpl_fd;

        if ((tmpl_fd = open64("/system/contract/process/template", O_RDWR)) == -1) {
                perror("Can't open /system/contract/process/template");
                return -1;
        }
        if (ct_pr_tmpl_set_fatal(tmpl_fd, CT_PR_EV_HWERR|CT_PR_EV_SIGNAL) != 0) {
                perror("Can't set process contract fatal events");
                goto fail;
        }
        if (ct_tmpl_set_critical(tmpl_fd, CT_PR_EV_HWERR) != 0) {
                perror("Can't set process contract critical events");
                goto fail;
        }
        if (ct_tmpl_activate(tmpl_fd) != 0) {
                perror("Can't activate process contract template");
                goto fail;
        }

        return tmpl_fd;

fail:
        /* Don't leak the template descriptor if a later step fails. */
        close(tmpl_fd);
        return -1;
}

(The fail path matters for a long-running daemon: without it, you would leak a file descriptor every time the open64() succeeds but one of the following function calls fails.)

With the template active, the parent is the holder of the contract that fork() creates for the child. This may not be what you want, as contracts aren't destroyed automatically when a child exits, so the parent could still be holding the contract long after the child is dead. For a short-lived process, this isn't a problem, but a long-running daemon with this behavior would accumulate dead process contracts that count against certain limits (as discussed in this thread). You can either keep track of all the process contracts and reap them when they're no longer useful, or simply abandon each contract immediately. I've done the latter in this bit of code:

void
post_fork_contract_processing(int tmpl_fd, pid_t pid)
{
        char            ctl_path[PATH_MAX];
        ctid_t          ctid;
        ct_stathdl_t    stathdl;
        int             ctl_fd;
        int             pathlen;
        int             stat_fd;

        /*
         * First clear the active template.
         */
        if (ct_tmpl_clear(tmpl_fd) != 0) {
                perror("Parent can't clear active template");
                return;
        }
        close(tmpl_fd);

        /*
         * If the fork didn't succeed (pid < 0), or if we're the child
         * (pid == 0), we have nothing more to do.
         */
        if (pid <= 0) {
                return;
        }

        /*
         * Now abandon the contract we've created.  This involves the
         * following steps:
         * - Get the contract id (ct_status_read(), ct_status_get_id())
         * - Get an fd for the ctl file for this contract
         *   (/system/contract/process/<ctid>/ctl)
         * - Abandon the contract (ct_ctl_abandon(fd))
         */
        if ((stat_fd = open64(CTFS_ROOT "/process/latest", O_RDONLY)) == -1) {
                perror("Parent can't open latest");
                return;
        }
        if (ct_status_read(stat_fd, CTD_COMMON, &stathdl) != 0) {
                perror("Parent can't read contract status");
                close(stat_fd);
                return;
        }
        if ((ctid = ct_status_get_id(stathdl)) < 0) {
                perror("ct_status_get_id() failed");
                ct_status_free(stathdl);
                close(stat_fd);
                return;
        }
        ct_status_free(stathdl);
        close(stat_fd);

        /* snprintf() returns the length it wanted; >= PATH_MAX means truncation. */
        pathlen = snprintf(ctl_path, PATH_MAX, CTFS_ROOT "/process/%ld/ctl", (long)ctid);
        if (pathlen < 0 || pathlen >= PATH_MAX) {
                fprintf(stderr,"Contract ctl file path exceeds maximum path length\n");
                return;
        }
        if ((ctl_fd = open64(ctl_path, O_WRONLY)) < 0) {
                perror("Parent couldn't open control file for child contract");
                return;
        }
        if (ct_ctl_abandon(ctl_fd) < 0) {
                perror("Parent couldn't abandon contract");
        }
        close(ctl_fd);
}

Note that getting the control file for the process contract for the child process involves getting the id of that contract so that we can construct the path for it under /system/contract/process.
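The path construction is easy to pull out and test on its own. Here's a portable sketch; build_ctl_path() is a hypothetical helper, and I've written the directory out literally instead of using CTFS_ROOT so that it compiles on systems without the contract headers. The thing to watch is that snprintf() returns the length it would have written, so a return value equal to the buffer size already means the path was truncated.

```c
#include <stdio.h>

/*
 * Build the ctl path for a contract id into buf.  Returns 0 on success,
 * or -1 if snprintf() failed or the result would not fit (truncation).
 * Hypothetical helper; the directory is spelled out rather than taken
 * from CTFS_ROOT so the sketch compiles anywhere.
 */
static int
build_ctl_path(char *buf, size_t len, long ctid)
{
        int n = snprintf(buf, len, "/system/contract/process/%ld/ctl", ctid);

        if (n < 0 || (size_t)n >= len)
                return (-1);
        return (0);
}
```

For contract id 42 and a reasonably sized buffer this produces /system/contract/process/42/ctl; with a buffer that's too small it returns -1 instead of handing back a truncated path.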

And then (for completeness), I used this code to test things:

int
main(void)
{
        int             tmpl_fd;
        pid_t           pid;

        if ((tmpl_fd = pre_fork_activate_template()) < 0) {
                exit(1);
        }
        /*
         * Now that we've set the active template, fork a process to see
         * a new contract created.
         */
        if ((pid = fork()) < 0) {
                perror("Can't fork");
        }
        post_fork_contract_processing(tmpl_fd,pid);

        sleep(60);
}

# posted by Chad Mynhier @ 8:08 PM 2 comments
