Sunday, October 05, 2008
Memory alignment on SPARC, or a 300x speedup!
(Search for "Holy crap!" if you want to jump straight to the 300x speedup part.)
I remember first running across SIGBUS in an introductory programming course some years ago. You'll get a SIGBUS when you have a misaligned memory access. For example, if you're on a 32-bit processor, an integer is going to be 4-byte aligned, i.e., the address to access that integer will be evenly divisible by 4. If you try to access an integer starting at an odd address, you'll get a SIGBUS, and your application will crash, probably leaving you a nice core file. SIGBUS is one of those things you don't really understand until you have your head wrapped around pointers. (It probably also helps to understand a little bit about computer architecture, at least if you're curious why they call it a SIGBUS instead of something else like SIGMISALIGN.)
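(Just to make that concrete, here's a little throwaway program of my own -- not from the course, and deliberately broken -- that does exactly this kind of misaligned access. On SPARC, built with strict alignment, it dies with a SIGBUS and leaves a core file; on x86 it will generally just work.)

#include <stdio.h>

int
main(void)
{
	char buf[8] = { 0 };
	int *p;

	/*
	 * buf + 1 is (almost certainly) not 4-byte aligned, so on a
	 * SPARC binary built with strict alignment this load traps:
	 * SIGBUS, core dumped.  On x86 it just works.
	 */
	p = (int *)(buf + 1);
	(void) printf("%d\n", *p);

	return (0);
}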
In Solaris on the SPARC architecture, you can gracefully handle a misaligned memory access if you use the right compilation options. (The x86 architecture handles things differently.) The option to cc is "-xmemalign", which takes two parameters: the assumed byte alignment and how to handle misaligned memory accesses. For example, "-xmemalign=8s" means that you want to assume that all memory accesses are 8-byte aligned and that you want a SIGBUS on any misaligned access. "-xmemalign=4i" means that you want to assume that memory accesses are 4-byte aligned and that you want misaligned accesses handled gracefully.
So what does it mean to handle misaligned accesses gracefully? Darryl Gove, in his book Solaris Application Programming, discusses this compiler option in a little more detail, but there's not much more to it than this: the application traps into the kernel on a misaligned memory access, and a kernel function does the right thing.
Okay, so there are really two ways you can handle misaligned memory accesses (which makes sense, given that the -xmemalign option takes two parameters). If you know ahead of time that you're going to have plenty of misaligned memory accesses, you can set the assumed byte alignment appropriately. For example, if you know that things will frequently be aligned at odd addresses, you can use "-xmemalign=1s". The penalty you'll pay for this is that an 8-byte memory access will translate into eight separate single-byte accesses. Your binary will be bigger, and you'll have a little added runtime, depending on how many memory accesses your program makes.
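(If it helps to picture what "-xmemalign=1s" is buying you, here's a rough C analogy of mine -- the compiler actually does this with byte loads and shifts in SPARC assembly, not with a helper function like this hypothetical load8() -- showing an 8-byte value assembled from eight single-byte accesses, none of which assumes any alignment:)

#include <stdio.h>
#include <stdint.h>

/*
 * Rough analogy only: under -xmemalign=1s an 8-byte load is built
 * from eight 1-byte loads, so no single access assumes more than
 * 1-byte alignment.  SPARC is big-endian, hence the byte order.
 */
static uint64_t
load8(const unsigned char *p)
{
	uint64_t value = 0;
	int i;

	for (i = 0; i < 8; i++)
		value = (value << 8) | p[i];
	return (value);
}

int
main(void)
{
	unsigned char buf[16] = { 0 };

	buf[1] = 0x12;	/* the value lives at an odd (misaligned) offset */
	(void) printf("%llx\n", (unsigned long long)load8(buf + 1));
	return (0);
}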
If you don't think you'll have a lot of misaligned memory accesses, you can set the byte alignment appropriately and let the kernel handle any misaligned accesses that do occur. You'll get a smaller binary and proportionately less runtime, but every once in a while you'll pay a big penalty for a misaligned access. But how big is that penalty?
Here's a sample program to measure the penalty we'd pay for misaligned memory accesses:
#include <stdio.h>
#include <stdlib.h>

typedef struct {
	int a;
	int b;
} pair_t;

#define	PAIRS	100
#define	REPS	10000

int
main()
{
	int i, j;
	char *foo;
	pair_t *pairs;

	/* Allocate one extra pair so the +1 offset below stays in bounds. */
	if ((foo = (char *) malloc((PAIRS + 1) * sizeof(pair_t))) == NULL) {
		fprintf(stderr, "Unable to allocate memory\n");
		exit(1);
	}

#ifdef ALIGNED
	pairs = (pair_t *) foo;
#else
	/* Offset by one byte so every access is misaligned. */
	pairs = (pair_t *) (foo + 1);
#endif

	for (i = 0; i < PAIRS; i++) {
		pairs[i].a = i;
		pairs[i].b = i + 5;
	}

	for (j = 0; j < REPS; j++) {
		int sum = 0;

		for (i = 0; i < PAIRS; i++) {
			sum += pairs[i].a + pairs[i].b;
		}
	}

	return (0);
}
With this Makefile:
all: aligned unaligned onebyte-aligned onebyte-unaligned

aligned: memalign.c
	cc -DALIGNED -xmemalign=4i -o aligned memalign.c

unaligned: memalign.c
	cc -xmemalign=4i -o unaligned memalign.c

onebyte-aligned: memalign.c
	cc -DALIGNED -xmemalign=1s -o onebyte-aligned memalign.c

onebyte-unaligned: memalign.c
	cc -xmemalign=1s -o onebyte-unaligned memalign.c
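(As a quick sanity check -- this snippet is mine, not part of the test program -- you can convince yourself that the foo + 1 trick really does hand back a pointer the hardware considers misaligned:)

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int
main(void)
{
	char *foo;

	if ((foo = malloc(16)) == NULL)
		exit(1);

	/*
	 * malloc() returns suitably aligned memory, so foo itself is
	 * aligned and foo + 1 cannot be 4-byte aligned.
	 */
	(void) printf("foo   %% 4 = %d\n", (int)((uintptr_t)foo % 4));
	(void) printf("foo+1 %% 4 = %d\n", (int)((uintptr_t)(foo + 1) % 4));

	free(foo);
	return (0);
}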
First, let's look at the impact of handling misaligned accesses in the kernel. (The version of ptime(1) I'm using here is a modified version that will be putback into Solaris sometime soon, probably Nevada build 101 or 102):
# ptime ./aligned

real        0.016512200
user        0.008128100
sys         0.004643800
#
#
# ptime ./unaligned

real        5.749458300
user        3.343899250
sys         2.339621750
#
So, misaligned accesses cause us to run 300x slower! Holy crap! This is nowhere near what I expected at first glance. Given that we were spending some 40% of our time in sys, I would have expected to get that time back by eliminating the misaligned accesses, not to see a 30,000% speedup. The unexpected thing here is that the time spent in userland is so large -- I'd have expected that to be about the same. I'm not sure why this is the case; I'll have to do some digging. (It's likely that we're blowing something out of the cache, which would make sense. But that's just a hypothesis for the moment.)
That aside, if we look a bit deeper at this, we'll see where all of our time is spent (using the microstate accounting option to ptime(1), another part of the modifications being putback soon):
# ptime -m ./unaligned

real        6.263908350
user        3.591816850
sys         0.013817200
trap        2.589204200
tflt        0.000000000
dflt        0.000000000
kflt        0.000000000
lock        0.000000000
slp         0.000000000
lat         0.067424450
stop        0.000162500
#
So the majority of that extra time is not being spent actually doing the useful work of handling the misaligned memory access; it's being spent in trap. This isn't that unexpected, 'cause it's well known that traps are expensive. But it does demonstrate just how wasteful this is.
So now let's look at the other option, compiling with "-xmemalign=1s". What performance penalty do we pay for this? Here's a comparison with the above aligned version of the program:
# ptime ./aligned

real        0.012762500
user        0.007944150
sys         0.004371950
#
# ptime ./onebyte-aligned

real        0.030157100
user        0.024818150
sys         0.004310100
#
# ptime ./onebyte-unaligned

real        0.030376500
user        0.024850600
sys         0.004306500
#
Okay, so that's reasonable: we end up running about 2.5x slower. (Note that the onebyte-aligned and onebyte-unaligned versions run in the same time, as with "-xmemalign=1s" we don't technically have any misaligned accesses.) Of course, for any real application, being able to get a 2.5x performance improvement is probably worth investing some time to debug the misaligned memory accesses (no matter how much it might pale in comparison to a 300x improvement).
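One way to hunt those misaligned accesses down: build with an alignment setting that faults, say "-xmemalign=8s", and catch the SIGBUS so you can see where it happens. Here's a generic POSIX sketch of mine (not from Gove's book) that reports the faulting address and then re-raises the signal so you still get your core file:

#include <signal.h>
#include <stdio.h>

/*
 * Debugging aid: report where the misaligned access happened, then
 * re-raise SIGBUS with the default disposition so we still get a
 * core file.  (fprintf() isn't strictly async-signal-safe, but for
 * a debugging hack like this it's good enough.)
 */
static void
bus_handler(int sig, siginfo_t *info, void *ctx)
{
	(void) fprintf(stderr, "SIGBUS: misaligned access at address %p\n",
	    info->si_addr);
	(void) signal(SIGBUS, SIG_DFL);
	(void) raise(SIGBUS);
}

int
main(void)
{
	struct sigaction sa;

	sa.sa_sigaction = bus_handler;
	sa.sa_flags = SA_SIGINFO;
	(void) sigemptyset(&sa.sa_mask);
	(void) sigaction(SIGBUS, &sa, NULL);

	/* ... run the code suspected of making misaligned accesses ... */

	return (0);
}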
Comments:
There were similar issues with the mid-1990s Mac switch from 680x0 to PowerPC -- I had Microsoft Word 6.0 on my first PowerMac and it couldn't keep up with me typing in standard text mode. They released a recompiled version -- same code, just properly compiled for PowerPC -- and it was muuuuuch better.
I remember seeing web pages listing similar performance improvements based on turning on memory alignment enforcement when compiling and linking for the 680x0->PowerPC transition. (I think it was POV-Ray, the ray-tracing package.)