Friday, May 26, 2006

 

Process contracts in Solaris 10

The Service Management Facility (SMF) is a new feature in Solaris 10 that replaces the old System V /etc/rc*.d way of starting and stopping system processes. Of course, it's not just a new way to start and stop processes; it's a way to manage services. Among other things, SMF can restart daemons that are killed for whatever reason. I was discussing this with a colleague, and he wanted to know how it works. Obviously, SMF could simply use the fact that init(1M) is the parent of all daemons and piggyback on the wait(2) loop that reaps children as they die. But that wouldn't support other features of SMF, such as delegated restarters.

The communication mechanism that allows daemon restart and other SMF features to work is process contracts. A process contract is essentially another relationship between processes, similar to the familiar parent-child relationship in Unix, but also similar to process groups in that contract IDs aren't necessarily unique. Every process has an associated contract ID, and every contract has a contract holder. The contract holder receives information about certain things that happen to processes in that contract: process creation and normal termination (processes added to or removed from the contract), the last process exiting the contract (the contract is now empty), and other fatal events (a fatal signal, a core file being generated, or the process being killed due to an uncorrectable hardware error).

One of the major benefits of process contracts is that they allow one process to define a fault boundary around a set of subprocesses. What this means in practice is that you have finer-grained control over the fault boundaries on your server. Prior to Solaris 10, the fault boundary essentially included every process running on a server. In the case of an uncorrectable memory error, for example, the server would either panic (if the error was in kernel address space) or "gracefully" reboot (if the error was in user space). (See more here.) Given that the fault boundary included all processes, this was the only option.

With process contracts, however, there's finer-grained control over the fault boundaries. The boundary can be defined to include, for example, httpd and any of its child processes. If an uncorrectable memory error kills the parent httpd process, SMF can be notified and restart that process. There's no need to reboot the server. And given that SMF allows us to specify dependencies between services, there's no need to do anything with the sendmail process if there's no dependency on httpd.
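To make that a little more concrete, here's a minimal sketch (assuming the libcontract(3LIB) interfaces, and emphatically not the code SMF itself uses) of how a restarter-like process could start a child in its own process contract and declare which events it cares about:

/*
 * Minimal sketch: start a child process under a new process contract
 * and declare which contract events matter to us.  This is only an
 * illustration of the mechanism, not the code svc.startd uses; the
 * httpd path is just an example.  Compile with -lcontract on Solaris 10.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <libcontract.h>
#include <sys/contract/process.h>

int
main(void)
{
    int tfd;
    pid_t pid;

    /* Open a process contract template. */
    if ((tfd = open("/system/contract/process/template", O_RDWR)) == -1) {
        perror("open template");
        exit(1);
    }

    /*
     * Deliver events for core dumps, fatal signals, hardware errors and
     * the contract becoming empty; treat the first three as fatal to the
     * whole contract.  (Errors ignored here for brevity.)
     */
    (void) ct_tmpl_set_critical(tfd,
        CT_PR_EV_CORE | CT_PR_EV_SIGNAL | CT_PR_EV_HWERR | CT_PR_EV_EMPTY);
    (void) ct_pr_tmpl_set_fatal(tfd,
        CT_PR_EV_CORE | CT_PR_EV_SIGNAL | CT_PR_EV_HWERR);

    /* The next fork() will create the child in a new contract. */
    if (ct_tmpl_activate(tfd) != 0) {
        perror("ct_tmpl_activate");
        exit(1);
    }

    if ((pid = fork()) == 0) {
        (void) execl("/usr/apache/bin/httpd", "httpd", (char *)NULL);
        _exit(127);
    }

    /* Parent: don't create further contracts from this template. */
    (void) ct_tmpl_clear(tfd);
    (void) close(tfd);

    /*
     * A real restarter would now open the contract's event endpoint under
     * /system/contract/process/<ctid> and read events with
     * ct_event_read(3CONTRACT), restarting the service when the contract
     * empties or a fatal event arrives.
     */
    return (0);
}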

(This is by no means an exhaustive look at process contracts. More information can be found in the man pages (contract(4), process(4), et al.) and other documentation.)


Saturday, May 20, 2006

 

ZFS I/O reordering benchmark

I've posted about the write performance of ZFS compared to UFS, ext3fs and reiserfs (different OSes, but identical hardware -- the same server -- and the same disk cylinders). That wasn't the extent of the benchmarking I performed, however. I'll detail another test here, though it's not so much a benchmark as a demonstration of the I/O reordering that ZFS does to give reads preference over writes.

The basic idea is to read a file while there's a heavy write load to the filesystem. The write load I was applying to the filesystem was the same one I detailed in this earlier entry. Again, I started out with a load that was serious overkill for the filesystems and then reran the test with a more reasonable workload.

I first created a data file to be read: 2GB for the overkill load, 512MB for the reasonable load. (The sizes were chosen to guarantee that reading the file always took less time than generating the write load. I could have generated a longer write load so that the two tests could reasonably be compared to each other, but the tests were meant to be compared across filesystems under a similar write load, not across loads for the same filesystem.)

Instead of using cat or dd to read the file, I used md5sum. Why? Well, for one, 'cat file > /dev/null' has interesting behavior under Solaris. The cat utility actually mmaps the file and then relies on demand paging to read it as needed, so 'cat file > /dev/null' essentially translates into a no-op. 'cat file | cat > /dev/null' achieves the goal, but it seems silly. I could have used dd, but I used md5sum in order to slightly handicap ZFS: given that ZFS already does a lot of computation (checksums and whatnot), I wanted to give the CPU some extra work that would intentionally interfere with it. It also makes for a more realistic test; in general, you don't read data from disk just to throw it away, you do something with it.
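If you want to play with this yourself without relying on md5sum, here's a rough stand-in for the reader described above: it reads the file sequentially in large blocks and folds the bytes into a throwaway checksum so there's some CPU work per block. (This is just an illustration of the idea, not the exact command I timed; the block size and the checksum are arbitrary choices.)

/*
 * Read a file sequentially, doing a little CPU work per block, and
 * report elapsed wall-clock time.  A stand-in for 'md5sum bigfile' in
 * the read-under-write-load test described above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

#define BLKSZ   (1024 * 1024)

int
main(int argc, char **argv)
{
    static char buf[BLKSZ];
    unsigned long sum = 0;
    ssize_t n;
    int fd;
    time_t start, end;

    if (argc != 2) {
        (void) fprintf(stderr, "usage: %s file\n", argv[0]);
        return (1);
    }
    if ((fd = open(argv[1], O_RDONLY)) == -1) {
        perror("open");
        return (1);
    }

    start = time(NULL);
    while ((n = read(fd, buf, sizeof (buf))) > 0) {
        ssize_t i;

        for (i = 0; i < n; i++)
            sum = sum * 31 + (unsigned char)buf[i];
    }
    end = time(NULL);
    (void) close(fd);

    /* Print the checksum so the loop can't be optimized away. */
    (void) printf("sum=%lx elapsed=%lds\n", sum, (long)(end - start));
    return (n == -1 ? 1 : 0);
}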

To quote from my earlier post for completeness here: The server I was using was a 2 x 2.8GHz Dell 1850 with a single 73GB SCSI disk and 2GB RAM. I ran the tests using both UFS and ZFS under Solaris x86 and both ext3fs and reiserfs under Linux. To avoid differences in performance between the inside and the outside of the disk, I used the same cylinders on the disk for all tests (plus or minus a cylinder or two.)

So here are the results for the overkill write load:

Filesystem    Time (min:sec, unloaded)    Time (min:sec, loaded)    Ratio loaded:unloaded
UFS           0:50.2                      5:50                      8.2:1
ZFS           0:31.8                      0:36.0                    1.13:1
ext3fs        0:36.3                      54:21                     89.9:1
reiserfs      0:33.4                      69:45                     124.6:1


Wow. So while ZFS performed the read under load in close to the same time it took with no load, ext3fs took 90 times as long, and reiserfs took 125 times as long. Again, all I can say is, "Wow." But I also have to emphasize that this write load was too heavy for the filesystems. (Although one could argue that the write load wasn't too heavy, given that ZFS could handle it gracefully. But it's certainly not a real-world workload, so while the data is interesting, it would be hard to argue that it's useful.)

And here are the results for the reasonable write load (note that I didn't run this test for UFS, mostly due to time constraints at the time, and remember that this was a 512MB file, not 2GB):

Filesystem    Time (min:sec, unloaded)    Time (min:sec, loaded)    Ratio loaded:unloaded
ZFS           0:09.0                      0:10.3                    1.14:1
ext3fs        0:08.8                      5:27                      37.2:1
reiserfs      0:08.7                      3:50                      26.4:1


Okay, so these numbers aren't quite as ludicrous as the first. They're still impressive, though. The ZFS engineers appear to have done a very good job.


Friday, May 19, 2006

 

ZFS benchmarking

ZFS was made publicly available on November 16, 2005. I was doing my usual scan of Sun blogs and saw dozens of entries announcing this. I'd been waiting a year or so to get my hands on ZFS, so I feverishly set about downloading the appropriate bits so that I could install a server and start playing with it. (I'm being very literal when I say "feverishly" -- I wasn't feeling that well that day and measured myself at 102 degrees or thereabouts (39 degrees for anyone wondering why I was above the boiling point of water) when I got home that evening.)

Over the next few weeks, I ran some benchmarks comparing ZFS to UFS, ext3fs and reiserfs. I avoided the standard benchmarks and used a script I'd developed earlier in the year to compare some cheap NAS implementations. That script was originally intended simply to generate a large amount of data in a filesystem hierarchy mirroring what we would be doing with the cheap NAS; a companion script was run across a couple dozen servers to generate read traffic. The company I work for would probably balk if I put the script here, but it essentially creates a filesystem hierarchy that looks like [00-FF]/[00-FF]/[0-F]/[1-64], where the 64 files at each leaf are ~10KB.

I started out by trying to determine the parameters I wanted to use when running the benchmark against different filesystems, as I wanted to get 100% disk utilization. Unfortunately, I used ZFS to determine these parameters. This turned out to be serious overkill for the other filesystems, as the numbers below indicate. Here are the parameters I used, with an explanation for the values:
write-data -c 5 -u 1200 -m 64
I'm running 5 concurrent processes, each creating 1200 leaf directories with 64 files each. So 5 * 1200 * 64 * 10k is about 3.7GB of data in all (plus metadata.)
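I can't post write-data itself, but as a purely hypothetical reconstruction (the directory naming, how leaves are spread across the hierarchy, and the file contents are all guesses on my part, not the real script), a generator taking those flags might look roughly like this:

/*
 * Hypothetical reconstruction of a write-load generator like the
 * write-data script described above -- NOT the actual script.  Each of
 * -c worker processes creates -u leaf directories in a hierarchy of the
 * form [00-FF]/[00-FF]/[0-F] and writes -m files of ~10KB into each.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>

#define FILESIZE    (10 * 1024)

static void
worker(int id, int udirs, int mfiles)
{
    static char data[FILESIZE];
    char dir[64], file[80];
    int u, f, fd;

    (void) memset(data, 'x', sizeof (data));

    for (u = 0; u < udirs; u++) {
        /* Map a globally unique leaf index onto the three levels. */
        int idx = id * udirs + u;
        int c = idx % 16;
        int b = (idx / 16) % 256;
        int a = (idx / (16 * 256)) % 256;

        (void) snprintf(dir, sizeof (dir), "%02x", a);
        (void) mkdir(dir, 0755);
        (void) snprintf(dir, sizeof (dir), "%02x/%02x", a, b);
        (void) mkdir(dir, 0755);
        (void) snprintf(dir, sizeof (dir), "%02x/%02x/%x", a, b, c);
        (void) mkdir(dir, 0755);

        for (f = 1; f <= mfiles; f++) {
            (void) snprintf(file, sizeof (file), "%s/%d", dir, f);
            fd = open(file, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd == -1 ||
                write(fd, data, sizeof (data)) != sizeof (data)) {
                perror(file);
                exit(1);
            }
            (void) close(fd);
        }
    }
    exit(0);
}

int
main(int argc, char **argv)
{
    int opt, nproc = 1, udirs = 1, mfiles = 1, i;

    while ((opt = getopt(argc, argv, "c:u:m:")) != -1) {
        switch (opt) {
        case 'c': nproc = atoi(optarg); break;
        case 'u': udirs = atoi(optarg); break;
        case 'm': mfiles = atoi(optarg); break;
        default:
            (void) fprintf(stderr,
                "usage: write-data -c procs -u dirs -m files\n");
            return (1);
        }
    }

    for (i = 0; i < nproc; i++)
        if (fork() == 0)
            worker(i, udirs, mfiles);

    while (wait(NULL) > 0)      /* wait for all workers */
        ;
    sync();     /* the timed runs also included flushing data to disk */
    return (0);
}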

The server I was using was a 2 x 2.8GHz Dell 1850 with a single 73GB SCSI disk and 2GB RAM. I ran the tests using both UFS and ZFS under Solaris x86 and both ext3fs and reiserfs under Linux. To avoid differences in performance between the inside and the outside of the disk, I used the same cylinders on the disk for all tests (plus or minus a cylinder or two.) The times include syncing the data to disk.

Here are the results of these runs (averaged over several runs each.) The "starting empty" times represent runs with a newly-created filesystem. The "consecutive run" times represent runs when I don't clean up after a "starting empty" run, i.e., the files are being rewritten into the existing filesystem structure.

Filesystem                         UFS      ZFS     ext3fs    reiserfs
Time (min:sec, starting empty)     28:16    2:49    60:39     46:25
Time (min:sec, consecutive run)    59:57    5:34    20:26     50:31


So ZFS is the fastest all around for this particular workload, by a spectacular margin. (And it's probably interesting to note that ext3fs was the only filesystem that was actually faster on "consecutive" runs. Given its asynchronous metadata updates, that might not be surprising.) But, as I stated earlier, this was a workload designed to keep the disk busy when ZFS is being used. So while I'd demonstrated that ZFS can handle a heavy workload better than the other filesystems, I hadn't demonstrated that it's faster under a reasonable workload. So I re-calibrated to keep all the filesystems below 100% disk utilization, which ended up meaning these parameters:
write-data -c 1 -u 1200 -m 64
So instead of 5 concurrent processes, there's just 1. And here are the results:

Filesystem                         UFS      ZFS     ext3fs    reiserfs
Time (min:sec, starting empty)     3:24     0:35    7:28      4:43
Time (min:sec, consecutive run)    11:01    0:38    1:10      2:34


So ZFS still won, but the margin of victory wasn't quite as large as in the first test. And here we see reiserfs doing better on consecutive runs, too. But while the above is informative, it still doesn't tell the full story. It's also interesting to look at the disk utilization during these runs. (Note that this wasn't a rigorous measurement, just eyeballing iostat output during the tests.)

Filesystem                         UFS       ZFS      ext3fs    reiserfs
% Utilization (starting empty)     95-100    45-50    95-100    95-100
% Utilization (consecutive run)    95-100    51-56    95-100    95-100



Wednesday, May 10, 2006

 

Second DTrace success -- readdir_r() considered harmful

My first DTrace success pointed us towards using libumem in our threaded SMTP server. In order to protect the interests of the company I work for, I won't go into too much detail about our email plant architecture, but this application runs in different modes. One of them is the one that I've already described, the mode running on our MX servers that accept connections from the outside world. Once mail is received, we filter it (in ways that should be obvious considering the signal-to-noise ratio) and then hand off what's left to be delivered to user mailboxes. The application that does this work is that same SMTP server running in a different mode.

We've long known that there are performance problems with this mode of the application. We run it mostly on single-CPU servers (V120s) because it doesn't scale well to 2 or 4 CPUs. Even on the single-CPU servers, the CPUs are mostly idle, and the server plant is larger than it ideally needs to be. The developers blamed NAS performance, but the systems guys knew better, as nothing on the filer indicated a problem. And nobody had put much effort into tracking down the real cause.

Based on our success with libumem with this application in its other mode, we naturally assumed that using libumem here would solve all of our problems. We tried it, and we saw some improvement, but not nearly what we had expected. Using a similar analysis (again noting that I hadn't yet discovered plockstat, which would have made the analysis much faster), it looked like all of the remaining lock contention was coming from readdir_r().

But this doesn't make sense, right? Using readdir_r() is the Right Thing To Do(TM) in a threaded application, as it's thread-safe, whereas readdir() isn't. So why were we seeing problems here?

The man page contains this interesting tidbit about readdir():

     struct dirent *readdir(DIR *dirp);

[ ... ]

     The  pointer  returned by readdir() points to data which may
     be overwritten by another call  to  readdir()  on  the  same
     directory  stream.  This  data is not overwritten by another
     call to readdir() on a different directory stream.
Based on that description, one might assume that readdir_r() has been implemented with a mutex per directory stream. Does the code back this up? Let's look at the source:

     57 extern mutex_t _dirent_lock;

[ ... ]

     67 int
     68 readdir_r(DIR *dirp, struct dirent *entry, struct dirent **result)
     69 {
     70  struct dirent *dp; /* -> directory data */
     71  int saveloc = 0;
     72 
     73  lmutex_lock(&_dirent_lock);

[ ... ]

     99  lmutex_unlock(&_dirent_lock);
    100  *result = entry;
    101  return (0);
    102 }
So, no, there isn't a lock per directory stream; there's a single lock (_dirent_lock) per process, which obviously doesn't scale to hundreds of threads. Given that this is a problem, what can we do about it? Do we actually need to use the thread-safe version?

So when does readdir_r() actually make sense? The lock is protecting the contents of the struct dirent. But the man page snippet above indicates that, even with plain readdir(), that data structure wouldn't get overwritten by a readdir() on a different directory stream. So it seems the only time you need that protection is when multiple threads share the same directory stream. Our application certainly isn't doing that, as every thread does its own opendir()/readdir_r()/closedir(). So we should be safe replacing the offending readdir_r() call with readdir(), as in the sketch below.
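To illustrate the pattern (a sketch, not our application's code), each worker thread's directory scan looks conceptually like this, with the DIR stream private to the thread, so plain readdir() is safe:

/*
 * Sketch of the per-thread directory scan pattern described above.
 * Each thread opens its own DIR stream, so plain readdir() is safe
 * here; readdir_r() would only be needed if threads shared a single
 * stream.  (Illustrative only -- not the application's actual code.)
 */
#include <stdio.h>
#include <string.h>
#include <dirent.h>

/* Called by each worker thread with a directory private to it. */
static int
scan_queue_dir(const char *dirname)
{
    DIR *dirp;
    struct dirent *dp;
    int count = 0;

    if ((dirp = opendir(dirname)) == NULL) {
        perror(dirname);
        return (-1);
    }

    while ((dp = readdir(dirp)) != NULL) {
        if (strcmp(dp->d_name, ".") == 0 ||
            strcmp(dp->d_name, "..") == 0)
            continue;
        /* process_message(dirname, dp->d_name) would go here */
        count++;
    }

    (void) closedir(dirp);
    return (count);
}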

And having done so, we saw a dramatic increase in performance (as measured in throughput per server). I wish I could include the MRTG graph of thread count on these servers, but imagine an elevation profile of the United States moving east from the Rocky Mountains onto the plains, and you'll get a good image of what it looked like. The improvement was enough to allow us to cut this server plant in half while still leaving a large buffer to handle spikes.

BTW, the email thread containing this message from Casper Dik is pretty interesting WRT the above problem.


Thursday, May 04, 2006

 

First DTrace success

While we were evaluating the T2000 server, we ran into some performance problems in our application. The software couldn't scale up to fully utilize the hardware.

The traditional approach to debugging this would have been to increase the debugging level and then analyze the logs to try to find the problem. The problem with doing so in this case was that the mutex around writes to the log file was itself one of the candidates for the lock contention. Had the analysis pointed elsewhere, we might have gotten useful information; had it pointed at the log file mutex, it would have told us nothing, since we couldn't have known whether we'd introduced that contention ourselves by logging more information.

This was our first use of Solaris 10 in our environment, and I'd been waiting quite a while to use DTrace to solve a real problem, so I took the opportunity to do so. Since I had only limited experience with DTrace, it took longer to find the problem than it should have, and the path I ended up taking probably wasn't optimal. (In retrospect, the approach I took was embarrassingly brute-force, but I'll detail it here anyway.)

I'll skip some of the dead-end fumbling I did and just present the final successful path. I was curious to see where the application was spending all of its time. I used this DTrace script to determine that:

#!/usr/sbin/dtrace -s

/* Sum, per module and function, the wall-clock time spent in each call. */

pid$1:::entry
{
    self->ts[probemod, probefunc] = timestamp;
}


pid$1:::return
/self->ts[probemod, probefunc] != 0/
{
    /* Clause-local scratch variable; a global here would race. */
    this->elapsed = timestamp - self->ts[probemod, probefunc];
    @functime[probemod, probefunc] = sum(this->elapsed);
    self->ts[probemod, probefunc] = 0;
}

(Note that the above was very expensive, as it enabled tens of thousands of probes. In retrospect, I might have tried a different approach to get this information.)

The results of running this script told me that the application was spending ~50% of its time in ___lwp_mutex_timedlock(). Okay, so what is the code path leading to this function?

#!/usr/sbin/dtrace -s

/* Attribute time spent in ___lwp_mutex_timedlock() to user stack traces. */

pid$1::___lwp_mutex_timedlock:entry
{
    self->ts = timestamp;
}


pid$1::___lwp_mutex_timedlock:return
/self->ts != 0/
{
    this->elapsed = timestamp - self->ts;
    @stacktime[ustack()] = sum(this->elapsed);
    self->ts = 0;
}
Running the above script didn't immediately point to a smoking gun. No single code path accounted for most of the time spent here; it was fairly well distributed through the code. But the code paths did share a common theme that would have been immediately apparent had I limited the stack depth to 5 (ustack(5) instead of ustack()). I would then have seen these top two culprits:

              libc.so.1`___lwp_mutex_timedlock+0x30
              libc.so.1`queue_lock+0x60
              libc.so.1`mutex_lock_queue+0xcc
              libc.so.1`lmutex_lock+0xe0
              libc.so.1`malloc+0x44

              libc.so.1`___lwp_mutex_timedlock+0x30
              libc.so.1`queue_lock+0x60
              libc.so.1`mutex_lock_queue+0xcc
              libc.so.1`lmutex_lock+0xe0
              libc.so.1`free+0x1c
So all of this lock contention was coming from libc's malloc() and free(). This actually turned out to be an ideal situation, as we could simply set LD_PRELOAD=libumem.so.1 to use libumem's allocator. There was no need to make code changes to the application itself. When we tried this, our lock contention problems disappeared.

The key points I want to make about the experience are these:

- Using DTrace allowed us to avoid the traditional method of additional logging, which could have masked the real problem given that there's a mutex involved in logging.

- DTrace let us discover the problem far more quickly than the traditional method likely would have. Even given my inexperience, it was maybe a few hours' worth of work. With some experience (and some knowledge of plockstat), it might have taken ten minutes or less.

- DTrace pointed us to a problem that we likely wouldn't have considered, as it was outside of our code. And given that we hadn't considered this possibility before discovering the problem, we certainly didn't have debugging statements ready to be enabled that would have caught it.

