To timidly go where many have gone before: October 2008

Friday, October 10, 2008

Measuring the length of a linked list

How do you measure the length of a linked list? That's easy enough, you start at the beginning and follow the linked list, counting as you go. Of course, you have to make sure that there's not a cycle in your list, but there's a classic solution to that problem[1].

Okay, so what do you do if this is a linked list in the Solaris kernel, running on a production machine, and you want to know how long it is? Hmm, that's a bit more difficult. The linked list is likely changing, so you can't just pop open mdb, walk the list, and count. It would be convenient to be able to grab the lock on that linked list to guarantee that it doesn't change, and then walk it. Fortunately, there aren't any tools to do that, as you could bring a box to its knees while doing this (performance wise, if not actually crashing the box.) (Theoretically, you might be able to do it with "mdb -kw" if you happen to get lucky setting the right bit at just the right time to make it look as if that lock is held, but I wouldn't be willing to bet on someone actually succeeding at doing this.)

So here's a method that a colleague and I came up with to give an estimate of the maximum and average length of that linked list over a sampling period. (In this case, we're looking at a hash table, the sleep queue used for threads who call cv_block() on a condition variable.):

#!/usr/sbin/dtrace -s

tick-3s
{
       exit(0);
}

fbt::cv_block:entry
{
       /*
        * Hash function used for this hash table.  I'm adding 1 so I can
        * reserve 0 as a special value.
        */
       self->bucket = (((uintptr_t)(arg0) >> 2) + ((uintptr_t)(arg0) >> 9) & 511) + 1;
}

fbt::sleepq_insert:entry
/self->bucket/
{
       length[self->bucket]++;
       @r[self->bucket] = max(length[self->bucket]);
       @q[self->bucket] = avg(length[self->bucket]);
       bucket[arg1] = self->bucket;
       self->bucket = 0;
}

fbt::sleepq_unlink:entry
/ length[bucket[arg1]] > 0 /
{
       length[bucket[arg1]]--;
}

END
{
       trunc(@r,30);
       trunc(@q, 30);
}

Ignoring some of the particulars, we're keeping a length value for each hash bucket. When we insert something, we increment that value. When we delete something, we decrement it. We then keep track of the max and average for that length. Simple enough. We also keep track of which bucket this thread is going into (arg1 to both sleepq_insert() and sleepq_unlink() is a pointer to the thread structure) so that we know which length variable to decrement. (Note that we can't do this as a thread-local variable because sleepq_insert() and sleepq_unlink() won't happen in the same thread.)

Something to point out is that this is only an approximation to the length of any of these linked lists. What this actually tells you is the maximum and average growth in the length of those linked lists. It won't include the length of any of those lists at the time that the script starts sampling. But as an approximation, it's good enough, especially in this case, because the linked lists in a hash table should never really be longer than one or two and certainly never as long as, say, 83.

[1] Okay, if that link has since become broken, it points to this comment from the code for DTrace:

/*
 * We want to have a name for the minor.  In order to do this,
 * we need to walk the minor list from the devinfo.  We want
 * to be sure that we don't infinitely walk a circular list,
 * so we check for circularity by sending a scout pointer
 * ahead two elements for every element that we iterate over;
 * if the list is circular, these will ultimately point to the
 * same element.  You may recognize this little trick as the
 * answer to a stupid interview question -- one that always
 * seems to be asked by those who had to have it laboriously
 * explained to them, and who can't even concisely describe
 * the conditions under which one would be forced to resort to
 * this technique.  Needless to say, those conditions are
 * found here -- and probably only here.  Is this the only use
 * of this infamous trick in shipping, production code?  If it
 * isn't, it probably should be...
 */

# posted by Chad Mynhier @ 6:05 PM 0 comments

Wednesday, October 08, 2008

Ptime modifications

Late last year I made a proposal to make some modifications to ptime(1) (the full text of the proposal can be found in the second message here. The proposal includes links to the original RFEs.) There've been some bumps along the way, but the changes are soon going to be putback into Solaris. (Thanks, Rafael!)

When I made these changes, I was mostly just going through the list of open Solaris bugs looking for things to do, it wasn't something that I needed. I've recently found it to be an exceedingly useful tool, though, especially with the -m and -p options. For example, something's just not right with this particular process:

# ptime -mp 3878

real  3:00:47.519957300
user    12:13.948207800
sys        37.210204400
trap    19:40.638837300
tflt        2.942089500
dflt        0.783120500
kflt        0.000000000
lock  5:29:23.879672500
slp   6:00:46.021626100
lat        17.081155000
stop        0.000081200
#

This is a case of misaligned memory accesses on a SPARC box, which is why all the time is showing up in trap. Here's another process in a bad way:

# ./ptime -mp 2715

real 106:51:30.161266600
user  4:28:49.022048500
sys  19:00:33.201052500
trap        0.047837500
tflt        0.051432200
dflt        0.038230200
kflt        0.000000000
lock 103:08:06.134729400
slp  299:24:41.087893200
lat   1:04:27.602323100
stop        2.913079800

The cause of this aberrant behavior is probably good fodder for a later blog entry.

# posted by Chad Mynhier @ 6:51 PM 0 comments

Sunday, October 05, 2008

Memory alignment on SPARC, or a 300x speedup!

(Search for "Holy crap!" if you want to jump straight to the 300x speedup part.)

I remember first running across SIGBUS in an introductory programming course some years ago. You'll get a SIGBUS when you have a misaligned memory access. For example, if you're on a 32-bit processor, an integer is going to be 4-byte aligned, i.e., the address to access that integer will be evenly divisible by 4. If you try to access an integer starting at an odd address, you'll get a SIGBUS, and your application will crash, probably leaving you a nice core file. SIGBUS is one of those things you don't really understand until you have your head wrapped around pointers. (It probably also helps to understand a little bit about computer architecture, at least if you're curious why they call it a SIGBUS instead of something else like SIGMISALIGN.)

In Solaris on the SPARC architecture, you can gracefully handle a misaligned architecture if you use the right compilation options. (The x86 architecture handles things differently.) The option to cc is "-xmemalign" with two parameters, the assumed byte alignment and how to handle misaligned memory access. For example, "-xmemalign=8s" means that you want to assume that all memory accesses are 8-byte aligned and that you want a SIGBUS on any misaligned accesses. "-xmemalign=4i" means that you want to assume that memory accesses are 4-byte aligned and that you want to handle misaligned memory accesses gracefully.

So what does it mean to handle misaligned accesses gracefully? Darryl Gove, in his book Solaris Application Programming, discusses this compiler optino in a little more detail, but there's not much more to it than that the application will trap into the kernel on a misaligned memory access, and a kernel function will do the right thing.

Okay, so there are really two ways you can handle misaligned memory accesses (this is logical, given that there are two parameters to the -xmemalign compiler option.) If you know ahead of time that you're going to have plenty of misaligned memory accesses, you can set the assumed byte alignment appropriately. For example, if you know that things will frequently be aligned at odd addresses, you can do "-xmemalign=1s". The penalty you'll pay for this is that 8-byte memory accesses will translate into eight separate instructions. Your binary will be bigger, and you'll have a little added runtime, depending on how many memory accesses your program makes.

If you don't think you'll have a lot of misaligned memory accesses, you can set the byte alignment appropriately and let the kernel handle any misaligned accesses. You'll get a smaller binary, your runtime will be proportionately less, but every once in a while you'll pay a big penalty for a misaligned access. But how big is that penalty?

Here's a sample program to measure the penatly we'd pay for misaligned memory accesses:

#include <stdio.h>
#include <stdlib.h>

typedef struct {
        int a;
        int b;
} pair_t;

#define PAIRS 100
#define REPS 10000

int
main()
{
        int i, j;
        char *foo;
        pair_t *pairs;

        if ((foo = (char *) malloc((PAIRS + 1) * sizeof(pair_t))) == NULL) {
                fprintf(stderr, "Unable to allocate memory\n");
                exit(1);
        }

#ifdef ALIGNED
        pairs = (pair_t *) foo;
#else
        pairs = (pair_t *) (foo + 1);
#endif

        for (i = 0; i < PAIRS; i++) {
                pairs[i].a = i;
                pairs[i].b = i+5;
        }

        for (j = 0; j < REPS; j++) {
                int sum;

                for (i = 0; i < PAIRS; i++) {
                        sum += pairs[i].a + pairs[i].b;
                }
        }
}

With this Makefile:

all:    aligned unaligned onebyte-aligned onebyte-unaligned

aligned:        memalign.c
        cc -DALIGNED -xmemalign=4i -o aligned memalign.c

unaligned:      memalign.c
        cc -xmemalign=4i -o unaligned memalign.c

onebyte-aligned:        memalign.c
        cc -DALIGNED -xmemalign=1s -o onebyte-aligned memalign.c

onebyte-unaligned:      memalign.c
        cc -xmemalign=1s -o onebyte-unaligned memalign.c

First, let's look at the impact of handling misaligned access in the kernel. (The version of ptime(1) I'm using here is a modified version that will be putback into Solaris sometime soon, probably Nevada build 101 or 102.):

# ptime ./aligned

real        0.016512200
user        0.008128100
sys         0.004643800
# 
# 
# ptime ./unaligned

real        5.749458300
user        3.343899250
sys         2.339621750
#

So, misaligned accesses causes us to run 300x slower! Holy crap! This is nowhere near what I expected on first glance. Given that we were spending some 40% of our time in sys, I would have expected to get that time back by eliminating the misaligned access, not a 30,000% speedup. The unexpected thing here is that the time spent in userland is so large -- I'd have expected that to be about the same. I'm not sure why this is the case, I'll have to do some digging. (It's likely that we're blowing something out of the cache, which makes sense. But that's just hypothesis for the moment.)

That aside, if we look a bit deeper at this, we'll see where all of our time is spent (using the "microstate accounting" to ptime(1), another part of the modifications being putback soon):

# ptime -m ./unaligned

real        6.263908350
user        3.591816850
sys         0.013817200
trap        2.589204200
tflt        0.000000000
dflt        0.000000000
kflt        0.000000000
lock        0.000000000
slp         0.000000000
lat         0.067424450
stop        0.000162500
#

So the majority of that extra time is not being spent actually doing the useful work of handling the misaligned memory access, it's being spent in trap. This isn't that unexpected, 'cause it's well-known that traps are expensive. But it does demonstrate just how wasteful this is.

So now let's look at the other option, compiling with "-xmemalign=1s". What performance penalty do we pay for this? Here's a comparison with the above aligned version of the program:

# ptime ./aligned

real        0.012762500
user        0.007944150
sys         0.004371950
# 
# ptime ./onebyte-aligned

real        0.030157100
user        0.024818150
sys         0.004310100
# 
# ptime ./onebyte-unaligned

real        0.030376500
user        0.024850600
sys         0.004306500
#

Okay, so that's reasonable, we end up running about 2.5x slower. (Note that the aligned and unaligned versions run in the same time, as we don't technically have any misaligned accesses.) Of course, for any real application, being able to get a 250% performance improvement is probably worth investing some time to debug the misaligned memory accesses (no matter how much it might pale in comparison to a 30,000% performance improvement.)

# posted by Chad Mynhier @ 1:21 PM 2 comments

Saturday, October 04, 2008

Grabbing the value of a local variable with DTrace

I had an interesting DTrace experience the other day. I was working with a developer to try to track down a bug that was causing tens of thousands of unaligned memory accesses per second in an application. (This was on SPARC hardware, where misaligned memory accesses cause a trap into the kernel to handle it in software. Assuming the compiler options are correct, that is, which they were in this case.)

We got to a point in the code where a value in a local variable was being added to a pointer, and I wanted to see what that value was. (I already knew that the pointer was correctly aligned coming into this function.) The developer offered to go compile a version with some debugging print statements to get the value, which would take about fifteen minutes. As he was walking away, I figured out how I could do this with DTrace. By the time he got back to me with the value, I'd already extracted the value from the live instance of the app.

There's nothing terribly complicated about how I did it, but it does involve knowing a few things. The first is that local variables in a function (automatic variables, at least) are stored in the stack frame and are accessed via an offset from the frame pointer. (Although I guess I probably should have put knowing what a stack frame is and what the frame pointer is before knowing this.) The second is knowing that every arithmetic SPARC instruction operates on registers, so any value in memory (e.g., in the stack frame) must be loaded into a register before being used in an arithmetic instruction, The third is that you have the uregs[] array available to you in DTrace, so you can grab the value from one of the registers. And the last one is that the pid provider in DTrace lets you instrument any instruction in an application.

Say we have the following function in a program, and say that I want to be able to determine the value of localvar that's being added to somearg:

typedef struct {
        int a;
        int b;
} pair_t;

int
somefunction(int somearg, pair_t *somepair)
{
        int localvar;

        localvar = somepair->b == 0 ? somepair->a : somepair->b;

        return (somearg + localvar);
}

If we disassemble the function, we'll see this (I chose to use mdb, although dis would have sufficed.):

> somefunction::dis
somefunction:                   save      %sp, -0x70, %sp
somefunction+4:                 or        %i1, %g0, %o0
somefunction+8:                 ld        [%o0 + 0x4], %o1
somefunction+0xc:               cmp       %o1, 0x0
somefunction+0x10:              bne       +0x10         
somefunction+0x14:              nop
somefunction+0x18:              ba        +0xc          
somefunction+0x1c:              ld        [%o0], %i5
somefunction+0x20:              or        %o1, %g0, %i5
somefunction+0x24:              st        %i5, [%fp - 0x8]
somefunction+0x28:              add       %i0, %i5, %o0
somefunction+0x2c:              st        %o0, [%fp - 0x4]
somefunction+0x30:              or        %o0, %g0, %i0
somefunction+0x34:              ret
somefunction+0x38:              restore
somefunction+0x3c:              or        %o0, %g0, %i0
somefunction+0x40:              ret
somefunction+0x44:              restore
>

At somefunction+8, we're sticking the value of somepair->b into register %o1. From somefunction+0xc to somefunction+0x20, we're deciding which value to use for localvar. It's at somefunction+0x1c that we put the value of somepair->a into register %i5, and at somefunction+0x20 we're putting th value of somepair->b into %i5. (We already have the value in %o1, and the or with %g0 is just a way to move a value from one register to another -- %g0 is always a zero value. Note also that somefunction+0x1c is also the branch-delay slot of the preceding instruction, so we're not immediately overwriting %i5 at somefunction+0x20.)

The purpose of the instruction at somefunction+0x24 is to save the value in %i5 into the stack frame. This is the value of localvar that's going to be added to somearg. (The memory location of localvar is %fp - 0x8. It's probably worth pointing out here that, if I'd compiled with certain optimizations, this instruction might not be here. There's no real point in writing this value into the location for localvar in the stack frame, as the value will never be used again. I'm using this particular instruction below, but no matter how optimized the code could get, I'd always still have some instruction to instrument, as the value of localvar has to be in a register in order to perform that addition.)

Given the above, I can use the following bit of D code to get the value of localvar that gets added to somearg (the 24 in the probe identifier indicates that this is at offset 24 (hex) from the beginning of somefunction):

pid$target:a.out:somefunction:24
{
        printf("localvar value is %d\n", uregs[R_I5]);
}

And when I set things up such that somepair->b is 7 (and thus localvar will have the value 7), I get the following:

# dtrace -q -s ./someprogram.d -c ./a.out
localvar value is 7

#

# posted by Chad Mynhier @ 9:39 AM 0 comments

To timidly go where many have gone before