Sunday, October 05, 2008


Memory alignment on SPARC, or a 300x speedup!

(Search for "Holy crap!" if you want to jump straight to the 300x speedup part.)

I remember first running across SIGBUS in an introductory programming course some years ago. You'll get a SIGBUS when you have a misaligned memory access. For example, if you're on a 32-bit processor, an integer is going to be 4-byte aligned, i.e., the address to access that integer will be evenly divisible by 4. If you try to access an integer starting at an odd address, you'll get a SIGBUS, and your application will crash, probably leaving you a nice core file. SIGBUS is one of those things you don't really understand until you have your head wrapped around pointers. (It probably also helps to understand a little bit about computer architecture, at least if you're curious why they call it a SIGBUS instead of something else like SIGMISALIGN.)

In Solaris on the SPARC architecture, you can gracefully handle a misaligned architecture if you use the right compilation options. (The x86 architecture handles things differently.) The option to cc is "-xmemalign" with two parameters, the assumed byte alignment and how to handle misaligned memory access. For example, "-xmemalign=8s" means that you want to assume that all memory accesses are 8-byte aligned and that you want a SIGBUS on any misaligned accesses. "-xmemalign=4i" means that you want to assume that memory accesses are 4-byte aligned and that you want to handle misaligned memory accesses gracefully.

So what does it mean to handle misaligned accesses gracefully? Darryl Gove, in his book Solaris Application Programming, discusses this compiler optino in a little more detail, but there's not much more to it than that the application will trap into the kernel on a misaligned memory access, and a kernel function will do the right thing.

Okay, so there are really two ways you can handle misaligned memory accesses (this is logical, given that there are two parameters to the -xmemalign compiler option.) If you know ahead of time that you're going to have plenty of misaligned memory accesses, you can set the assumed byte alignment appropriately. For example, if you know that things will frequently be aligned at odd addresses, you can do "-xmemalign=1s". The penalty you'll pay for this is that 8-byte memory accesses will translate into eight separate instructions. Your binary will be bigger, and you'll have a little added runtime, depending on how many memory accesses your program makes.

If you don't think you'll have a lot of misaligned memory accesses, you can set the byte alignment appropriately and let the kernel handle any misaligned accesses. You'll get a smaller binary, your runtime will be proportionately less, but every once in a while you'll pay a big penalty for a misaligned access. But how big is that penalty?

Here's a sample program to measure the penatly we'd pay for misaligned memory accesses:

#include <stdio.h>
#include <stdlib.h>

typedef struct {
        int a;
        int b;
} pair_t;

#define PAIRS 100
#define REPS 10000

        int i, j;
        char *foo;
        pair_t *pairs;

        if ((foo = (char *) malloc((PAIRS + 1) * sizeof(pair_t))) == NULL) {
                fprintf(stderr, "Unable to allocate memory\n");

#ifdef ALIGNED
        pairs = (pair_t *) foo;
        pairs = (pair_t *) (foo + 1);

        for (i = 0; i < PAIRS; i++) {
                pairs[i].a = i;
                pairs[i].b = i+5;

        for (j = 0; j < REPS; j++) {
                int sum;

                for (i = 0; i < PAIRS; i++) {
                        sum += pairs[i].a + pairs[i].b;

With this Makefile:

all:    aligned unaligned onebyte-aligned onebyte-unaligned

aligned:        memalign.c
        cc -DALIGNED -xmemalign=4i -o aligned memalign.c

unaligned:      memalign.c
        cc -xmemalign=4i -o unaligned memalign.c

onebyte-aligned:        memalign.c
        cc -DALIGNED -xmemalign=1s -o onebyte-aligned memalign.c

onebyte-unaligned:      memalign.c
        cc -xmemalign=1s -o onebyte-unaligned memalign.c

First, let's look at the impact of handling misaligned access in the kernel. (The version of ptime(1) I'm using here is a modified version that will be putback into Solaris sometime soon, probably Nevada build 101 or 102.):

# ptime ./aligned

real        0.016512200
user        0.008128100
sys         0.004643800
# ptime ./unaligned

real        5.749458300
user        3.343899250
sys         2.339621750

So, misaligned accesses causes us to run 300x slower! Holy crap! This is nowhere near what I expected on first glance. Given that we were spending some 40% of our time in sys, I would have expected to get that time back by eliminating the misaligned access, not a 30,000% speedup. The unexpected thing here is that the time spent in userland is so large -- I'd have expected that to be about the same. I'm not sure why this is the case, I'll have to do some digging. (It's likely that we're blowing something out of the cache, which makes sense. But that's just hypothesis for the moment.)

That aside, if we look a bit deeper at this, we'll see where all of our time is spent (using the "microstate accounting" to ptime(1), another part of the modifications being putback soon):

# ptime -m ./unaligned

real        6.263908350
user        3.591816850
sys         0.013817200
trap        2.589204200
tflt        0.000000000
dflt        0.000000000
kflt        0.000000000
lock        0.000000000
slp         0.000000000
lat         0.067424450
stop        0.000162500

So the majority of that extra time is not being spent actually doing the useful work of handling the misaligned memory access, it's being spent in trap. This isn't that unexpected, 'cause it's well-known that traps are expensive. But it does demonstrate just how wasteful this is.

So now let's look at the other option, compiling with "-xmemalign=1s". What performance penalty do we pay for this? Here's a comparison with the above aligned version of the program:

# ptime ./aligned

real        0.012762500
user        0.007944150
sys         0.004371950
# ptime ./onebyte-aligned

real        0.030157100
user        0.024818150
sys         0.004310100
# ptime ./onebyte-unaligned

real        0.030376500
user        0.024850600
sys         0.004306500

Okay, so that's reasonable, we end up running about 2.5x slower. (Note that the aligned and unaligned versions run in the same time, as we don't technically have any misaligned accesses.) Of course, for any real application, being able to get a 250% performance improvement is probably worth investing some time to debug the misaligned memory accesses (no matter how much it might pale in comparison to a 30,000% performance improvement.)

There were similar issues with the mid-1990s Mac switch from 680x0 to PowerPC -- I had Microsoft Word 6.0 on my first PowerMac and it couldn't keep up with me typing in standard text mode. They released a recompiled version -- same code, just properly compiled for PowerPC -- and it was muuuuuch better.

I remember seeing web pages listing similar performance improvements based on turning on memory alignment enforcement when compiling and linking for the 680x0->PowerPC transition. (I think it was POV-Ray, the ray-tracing package.)
Great article, many thanks. IT's really hard to find good blog posts about memalign issues.
Post a Comment

<< Home

This page is powered by Blogger. Isn't yours?