Friday, October 10, 2008
Measuring the length of a linked list
How do you measure the length of a linked list? That's easy enough, you start at the beginning and follow the linked list, counting as you go. Of course, you have to make sure that there's not a cycle in your list, but there's a classic solution to that problem[1].
Okay, so what do you do if this is a linked list in the Solaris kernel, running on a production machine, and you want to know how long it is? Hmm, that's a bit more difficult. The linked list is likely changing, so you can't just pop open mdb, walk the list, and count. It would be convenient to be able to grab the lock on that linked list to guarantee that it doesn't change, and then walk it. Fortunately, there aren't any tools to do that, as you could bring a box to its knees while doing this (performance wise, if not actually crashing the box.) (Theoretically, you might be able to do it with "mdb -kw" if you happen to get lucky setting the right bit at just the right time to make it look as if that lock is held, but I wouldn't be willing to bet on someone actually succeeding at doing this.)
So here's a method that a colleague and I came up with to give an estimate of the maximum and average length of that linked list over a sampling period. (In this case, we're looking at a hash table: the sleep queue used for threads that call cv_block() on a condition variable):
#!/usr/sbin/dtrace -s

tick-3s
{
        exit(0);
}

fbt::cv_block:entry
{
        /*
         * Hash function used for this hash table.  I'm adding 1 so I can
         * reserve 0 as a special value.
         */
        self->bucket = (((uintptr_t)(arg0) >> 2) +
            ((uintptr_t)(arg0) >> 9) & 511) + 1;
}

fbt::sleepq_insert:entry
/self->bucket/
{
        length[self->bucket]++;
        @r[self->bucket] = max(length[self->bucket]);
        @q[self->bucket] = avg(length[self->bucket]);
        bucket[arg1] = self->bucket;
        self->bucket = 0;
}

fbt::sleepq_unlink:entry
/length[bucket[arg1]] > 0/
{
        length[bucket[arg1]]--;
}

END
{
        trunc(@r, 30);
        trunc(@q, 30);
}
Ignoring some of the particulars, we're keeping a length value for each hash bucket. When we insert something, we increment that value. When we delete something, we decrement it. We then keep track of the max and average for that length. Simple enough. We also keep track of which bucket this thread is going into (arg1 to both sleepq_insert() and sleepq_unlink() is a pointer to the thread structure) so that we know which length variable to decrement. (Note that we can't do this as a thread-local variable because sleepq_insert() and sleepq_unlink() won't happen in the same thread.)
Something to point out is that this is only an approximation to the length of any of these linked lists. What this actually tells you is the maximum and average growth in the length of those linked lists. It won't include the length of any of those lists at the time that the script starts sampling. But as an approximation, it's good enough, especially in this case, because the linked lists in a hash table should never really be longer than one or two and certainly never as long as, say, 83.
[1] In case that link has since become broken, here is the comment it points to, from the DTrace code:
/*
 * We want to have a name for the minor.  In order to do this,
 * we need to walk the minor list from the devinfo.  We want
 * to be sure that we don't infinitely walk a circular list,
 * so we check for circularity by sending a scout pointer
 * ahead two elements for every element that we iterate over;
 * if the list is circular, these will ultimately point to the
 * same element.  You may recognize this little trick as the
 * answer to a stupid interview question -- one that always
 * seems to be asked by those who had to have it laboriously
 * explained to them, and who can't even concisely describe
 * the conditions under which one would be forced to resort to
 * this technique.  Needless to say, those conditions are
 * found here -- and probably only here.  Is this the only use
 * of this infamous trick in shipping, production code?  If it
 * isn't, it probably should be...
 */
Wednesday, October 08, 2008
Ptime modifications
Late last year I made a proposal to make some modifications to ptime(1) (the full text of the proposal can be found in the second message here. The proposal includes links to the original RFEs.) There've been some bumps along the way, but the changes are soon going to be putback into Solaris. (Thanks, Rafael!)
When I made these changes, I was mostly just going through the list of open Solaris bugs looking for things to do; it wasn't something that I needed. I've recently found it to be an exceedingly useful tool, though, especially with the -m and -p options. For example, something's just not right with this particular process:
# ptime -mp 3878

real     3:00:47.519957300
user       12:13.948207800
sys            37.210204400
trap        19:40.638837300
tflt            2.942089500
dflt            0.783120500
kflt            0.000000000
lock     5:29:23.879672500
slp      6:00:46.021626100
lat            17.081155000
stop            0.000081200
#
This is a case of misaligned memory accesses on a SPARC box, which is why all the time is showing up in trap. Here's another process in a bad way:
# ./ptime -mp 2715

real   106:51:30.161266600
user     4:28:49.022048500
sys     19:00:33.201052500
trap            0.047837500
tflt            0.051432200
dflt            0.038230200
kflt            0.000000000
lock   103:08:06.134729400
slp    299:24:41.087893200
lat      1:04:27.602323100
stop            2.913079800
The cause of this aberrant behavior is probably good fodder for a later blog entry.
Sunday, October 05, 2008
Memory alignment on SPARC, or a 300x speedup!
(Search for "Holy crap!" if you want to jump straight to the 300x speedup part.)
I remember first running across SIGBUS in an introductory programming course some years ago. You'll get a SIGBUS when you have a misaligned memory access. For example, if you're on a 32-bit processor, an integer is going to be 4-byte aligned, i.e., the address to access that integer will be evenly divisible by 4. If you try to access an integer starting at an odd address, you'll get a SIGBUS, and your application will crash, probably leaving you a nice core file. SIGBUS is one of those things you don't really understand until you have your head wrapped around pointers. (It probably also helps to understand a little bit about computer architecture, at least if you're curious why they call it a SIGBUS instead of something else like SIGMISALIGN.)
In Solaris on the SPARC architecture, you can gracefully handle a misaligned memory access if you use the right compilation options. (The x86 architecture handles things differently.) The option to cc is "-xmemalign" with two parameters, the assumed byte alignment and how to handle misaligned memory access. For example, "-xmemalign=8s" means that you want to assume that all memory accesses are 8-byte aligned and that you want a SIGBUS on any misaligned accesses. "-xmemalign=4i" means that you want to assume that memory accesses are 4-byte aligned and that you want to handle misaligned memory accesses gracefully.
So what does it mean to handle misaligned accesses gracefully? Darryl Gove, in his book Solaris Application Programming, discusses this compiler option in a little more detail, but there's not much more to it than that the application will trap into the kernel on a misaligned memory access, and a kernel function will do the right thing.
Okay, so there are really two ways you can handle misaligned memory accesses (this is logical, given that there are two parameters to the -xmemalign compiler option.) If you know ahead of time that you're going to have plenty of misaligned memory accesses, you can set the assumed byte alignment appropriately. For example, if you know that things will frequently be aligned at odd addresses, you can do "-xmemalign=1s". The penalty you'll pay for this is that 8-byte memory accesses will translate into eight separate instructions. Your binary will be bigger, and you'll have a little added runtime, depending on how many memory accesses your program makes.
If you don't think you'll have a lot of misaligned memory accesses, you can set the byte alignment appropriately and let the kernel handle any misaligned accesses. You'll get a smaller binary, your runtime will be proportionately less, but every once in a while you'll pay a big penalty for a misaligned access. But how big is that penalty?
Here's a sample program to measure the penalty we'd pay for misaligned memory accesses:
#include <stdio.h>
#include <stdlib.h>

typedef struct {
        int a;
        int b;
} pair_t;

#define PAIRS 100
#define REPS 10000

int
main()
{
        int i, j;
        char *foo;
        pair_t *pairs;

        if ((foo = (char *) malloc((PAIRS + 1) * sizeof(pair_t))) == NULL) {
                fprintf(stderr, "Unable to allocate memory\n");
                exit(1);
        }

#ifdef ALIGNED
        pairs = (pair_t *) foo;
#else
        pairs = (pair_t *) (foo + 1);
#endif

        for (i = 0; i < PAIRS; i++) {
                pairs[i].a = i;
                pairs[i].b = i + 5;
        }

        for (j = 0; j < REPS; j++) {
                int sum = 0;

                for (i = 0; i < PAIRS; i++) {
                        sum += pairs[i].a + pairs[i].b;
                }
        }

        return (0);
}
With this Makefile:
all: aligned unaligned onebyte-aligned onebyte-unaligned

aligned: memalign.c
	cc -DALIGNED -xmemalign=4i -o aligned memalign.c

unaligned: memalign.c
	cc -xmemalign=4i -o unaligned memalign.c

onebyte-aligned: memalign.c
	cc -DALIGNED -xmemalign=1s -o onebyte-aligned memalign.c

onebyte-unaligned: memalign.c
	cc -xmemalign=1s -o onebyte-unaligned memalign.c
First, let's look at the impact of handling misaligned access in the kernel. (The version of ptime(1) I'm using here is a modified version that will be putback into Solaris sometime soon, probably Nevada build 101 or 102.):
# ptime ./aligned

real        0.016512200
user        0.008128100
sys         0.004643800
#
#
# ptime ./unaligned

real        5.749458300
user        3.343899250
sys         2.339621750
#
So, misaligned accesses cause us to run 300x slower! Holy crap! This is nowhere near what I expected at first glance. Given that we were spending some 40% of our time in sys, I would have expected to get that time back by eliminating the misaligned access, not a 30,000% speedup. The unexpected thing here is that the time spent in userland is so large -- I'd have expected that to be about the same. I'm not sure why this is the case, I'll have to do some digging. (It's likely that we're blowing something out of the cache, which makes sense. But that's just a hypothesis for the moment.)
That aside, if we look a bit deeper at this, we'll see where all of our time is spent (using the "microstate accounting" to ptime(1), another part of the modifications being putback soon):
# ptime -m ./unaligned

real        6.263908350
user        3.591816850
sys         0.013817200
trap        2.589204200
tflt        0.000000000
dflt        0.000000000
kflt        0.000000000
lock        0.000000000
slp         0.000000000
lat         0.067424450
stop        0.000162500
#
So the majority of that extra time is not being spent actually doing the useful work of handling the misaligned memory access, it's being spent in trap. This isn't that unexpected, 'cause it's well-known that traps are expensive. But it does demonstrate just how wasteful this is.
So now let's look at the other option, compiling with "-xmemalign=1s". What performance penalty do we pay for this? Here's a comparison with the above aligned version of the program:
# ptime ./aligned

real        0.012762500
user        0.007944150
sys         0.004371950
#
# ptime ./onebyte-aligned

real        0.030157100
user        0.024818150
sys         0.004310100
#
# ptime ./onebyte-unaligned

real        0.030376500
user        0.024850600
sys         0.004306500
#
Okay, so that's reasonable, we end up running about 2.5x slower. (Note that the aligned and unaligned versions run in the same time, as we don't technically have any misaligned accesses.) Of course, for any real application, being able to get a 250% performance improvement is probably worth investing some time to debug the misaligned memory accesses (no matter how much it might pale in comparison to a 30,000% performance improvement.)
Saturday, October 04, 2008
Grabbing the value of a local variable with DTrace
I had an interesting DTrace experience the other day. I was working with a developer to try to track down a bug that was causing tens of thousands of unaligned memory accesses per second in an application. (This was on SPARC hardware, where misaligned memory accesses cause a trap into the kernel so they can be handled in software. Assuming the compiler options are correct, that is, which they were in this case.)
We got to a point in the code where a value in a local variable was being added to a pointer, and I wanted to see what that value was. (I already knew that the pointer was correctly aligned coming into this function.) The developer offered to go compile a version with some debugging print statements to get the value, which would take about fifteen minutes. As he was walking away, I figured out how I could do this with DTrace. By the time he got back to me with the value, I'd already extracted the value from the live instance of the app.
There's nothing terribly complicated about how I did it, but it does involve knowing a few things. The first is that local variables in a function (automatic variables, at least) are stored in the stack frame and are accessed via an offset from the frame pointer. (Although I guess I probably should have put knowing what a stack frame is and what the frame pointer is before knowing this.) The second is knowing that every arithmetic SPARC instruction operates on registers, so any value in memory (e.g., in the stack frame) must be loaded into a register before being used in an arithmetic instruction. The third is that you have the uregs[] array available to you in DTrace, so you can grab the value from one of the registers. And the last one is that the pid provider in DTrace lets you instrument any instruction in an application.
Say we have the following function in a program, and say that I want to be able to determine the value of localvar that's being added to somearg:
typedef struct {
        int a;
        int b;
} pair_t;

int
somefunction(int somearg, pair_t *somepair)
{
        int localvar;

        localvar = somepair->b == 0 ? somepair->a : somepair->b;

        return (somearg + localvar);
}
If we disassemble the function, we'll see this (I chose to use mdb, although dis would have sufficed.):
> somefunction::dis
somefunction:           save      %sp, -0x70, %sp
somefunction+4:         or        %i1, %g0, %o0
somefunction+8:         ld        [%o0 + 0x4], %o1
somefunction+0xc:       cmp       %o1, 0x0
somefunction+0x10:      bne       +0x10         <somefunction+0x20>
somefunction+0x14:      nop
somefunction+0x18:      ba        +0xc          <somefunction+0x24>
somefunction+0x1c:      ld        [%o0], %i5
somefunction+0x20:      or        %o1, %g0, %i5
somefunction+0x24:      st        %i5, [%fp - 0x8]
somefunction+0x28:      add       %i0, %i5, %o0
somefunction+0x2c:      st        %o0, [%fp - 0x4]
somefunction+0x30:      or        %o0, %g0, %i0
somefunction+0x34:      ret
somefunction+0x38:      restore
somefunction+0x3c:      or        %o0, %g0, %i0
somefunction+0x40:      ret
somefunction+0x44:      restore
>
At somefunction+8, we're sticking the value of somepair->b into register %o1. From somefunction+0xc to somefunction+0x20, we're deciding which value to use for localvar. It's at somefunction+0x1c that we put the value of somepair->a into register %i5, and at somefunction+0x20 we're putting the value of somepair->b into %i5. (We already have the value in %o1, and the or with %g0 is just a way to move a value from one register to another -- %g0 is always a zero value. Note also that somefunction+0x1c is also the branch-delay slot of the preceding instruction, so we're not immediately overwriting %i5 at somefunction+0x20.)
The purpose of the instruction at somefunction+0x24 is to save the value in %i5 into the stack frame. This is the value of localvar that's going to be added to somearg. (The memory location of localvar is %fp - 0x8. It's probably worth pointing out here that, if I'd compiled with certain optimizations, this instruction might not be here. There's no real point in writing this value into the location for localvar in the stack frame, as the value will never be used again. I'm using this particular instruction below, but no matter how optimized the code could get, I'd always still have some instruction to instrument, as the value of localvar has to be in a register in order to perform that addition.)
Given the above, I can use the following bit of D code to get the value of localvar that gets added to somearg (the 24 in the probe identifier indicates that this is at offset 24 (hex) from the beginning of somefunction):
pid$target:a.out:somefunction:24
{
        printf("localvar value is %d\n", uregs[R_I5]);
}
And when I set things up such that somepair->b is 7 (and thus localvar will have the value 7), I get the following:
# dtrace -q -s ./someprogram.d -c ./a.out
localvar value is 7
#