<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-26802961</id><updated>2011-12-09T00:36:40.265-08:00</updated><title type='text'>To timidly go where many have gone before</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>60</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-26802961.post-8002315089006209324</id><published>2009-02-16T14:00:00.001-08:00</published><updated>2009-02-16T14:03:25.161-08:00</updated><title type='text'>Blog moved</title><content type='html'>For the hundreds of you out there who read this blog, I've moved.  
I'll continue blogging, probably with about the frequency I ever blogged here, but I'll be doing it &lt;a href="http://forsythesunsolutions.com/blog"&gt;in a new location.&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-8002315089006209324?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/8002315089006209324/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=8002315089006209324' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/8002315089006209324'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/8002315089006209324'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2009/02/blog-moved.html' title='Blog moved'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-4194936228495627409</id><published>2008-10-10T18:05:00.000-07:00</published><updated>2008-10-10T19:20:14.007-07:00</updated><title type='text'>Measuring the length of a linked list</title><content type='html'>How do you measure the length of a linked list?  That's easy enough, you start at the beginning and follow the linked list, counting as you go.  
Of course, you have to make sure that there's not a cycle in your list, but there's a &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/dtrace/dtrace.c#3697"&gt;classic solution to that problem&lt;/a&gt;[1].&lt;br/&gt;&lt;br/&gt;
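
For anyone who hasn't run across it, here's a minimal sketch of that scout-pointer trick in C.  The struct node type and the list_has_cycle() name are made up for illustration; this isn't the code from dtrace.c, just the same idea in miniature.&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
```c
/* Hypothetical list node, for illustration only. */
struct node {
        struct node *next;
};

/*
 * Return 1 if the list starting at head loops back on itself, 0 otherwise.
 * The scout advances two elements for every one the walker advances, so on
 * a circular list the two pointers eventually land on the same element.
 */
int
list_has_cycle(const struct node *head)
{
        const struct node *walker = head;
        const struct node *scout = head;

        for (;;) {
                /* If the scout runs off the end, the list terminates. */
                if (!scout || !scout->next)
                        return (0);

                scout = scout->next->next;
                walker = walker->next;

                if (scout == walker)
                        return (1);
        }
}
```
&lt;/pre&gt;&lt;br/&gt;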

Okay, so what do you do if this is a linked list in the Solaris kernel, running on a production machine, and you want to know how long it is?  Hmm, that's a bit more difficult.  The linked list is likely changing, so you can't just pop open mdb, walk the list, and count.  It would be convenient to be able to grab the lock on that linked list to guarantee that it doesn't change, and then walk it.  Fortunately, there aren't any tools to do that, as you could bring a box to its knees while doing this (performance-wise, if not by actually crashing the box).  (Theoretically, you might be able to do it with "mdb -kw" if you happen to get lucky setting the right bit at just the right time to make it look as if that lock is held, but I wouldn't be willing to bet on someone actually succeeding at doing this.)&lt;br/&gt;&lt;br/&gt;

So here's a method that a colleague and I came up with to estimate the maximum and average length of that linked list over a sampling period.  (In this case, we're looking at a hash table, the sleep queue used for threads that call cv_block() on a condition variable.)&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;#!/usr/sbin/dtrace -s

tick-3s
{
       exit(0);
}

fbt::cv_block:entry
{
       /*
        * Hash function used for this hash table.  I'm adding 1 so I can
        * reserve 0 as a special value.
        */
       self-&gt;bucket = (((uintptr_t)(arg0) &gt;&gt; 2) + ((uintptr_t)(arg0) &gt;&gt; 9) &amp; 511) + 1;
}

fbt::sleepq_insert:entry
/self-&gt;bucket/
{
       length[self-&gt;bucket]++;
       @r[self-&gt;bucket] = max(length[self-&gt;bucket]);
       @q[self-&gt;bucket] = avg(length[self-&gt;bucket]);
       bucket[arg1] = self-&gt;bucket;
       self-&gt;bucket = 0;
}

fbt::sleepq_unlink:entry
/ length[bucket[arg1]] &gt; 0 /
{
       length[bucket[arg1]]--;
}

END
{
       trunc(@r,30);
       trunc(@q, 30);
}
&lt;/pre&gt;&lt;br/&gt;

Ignoring some of the particulars, we're keeping a length value for each hash bucket.  When we insert something, we increment that value.  When we delete something, we decrement it.  We then keep track of the max and average for that length.  Simple enough.  We also keep track of which bucket this thread is going into (arg1 to both sleepq_insert() and sleepq_unlink() is a pointer to the thread structure) so that we know which length variable to decrement.  (Note that we can't do this as a thread-local variable because sleepq_insert() and sleepq_unlink() won't happen in the same thread.)&lt;br/&gt;&lt;br/&gt;
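
If the D syntax gets in the way, the same bookkeeping can be sketched in plain C.  This is only a user-space model of what the script tracks, not kernel code; the names bucket_of(), note_insert(), and note_unlink() are invented for the example.  Note that addition binds tighter than the bitwise AND in the script's expression, so the mask applies to the sum of the two shifted values; the sketch uses a remainder by 512, which picks out the same low nine bits for unsigned values.&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
```c
#define NBUCKETS 512

/* Slot 0 is unused because the script reserves 0 as a special value. */
int length[NBUCKETS + 1];       /* growth per bucket since sampling began */
int max_length[NBUCKETS + 1];   /* largest growth seen per bucket */

/* Same arithmetic as the script's hash, shifted up by one. */
unsigned long
bucket_of(unsigned long addr)
{
        return (((addr >> 2) + (addr >> 9)) % NBUCKETS + 1);
}

/* Called where the script fires on sleepq_insert(). */
void
note_insert(unsigned long bucket)
{
        length[bucket]++;
        if (length[bucket] > max_length[bucket])
                max_length[bucket] = length[bucket];
}

/* Called where the script fires on sleepq_unlink(). */
void
note_unlink(unsigned long bucket)
{
        /* Skip entries that were already queued before sampling began. */
        if (length[bucket] > 0)
                length[bucket]--;
}
```
&lt;/pre&gt;&lt;br/&gt;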

Something to point out is that this is only an approximation to the length of any of these linked lists.  What this actually tells you is the maximum and average &lt;b&gt;growth&lt;/b&gt; in the length of those linked lists.  It won't include the length of any of those lists at the time that the script starts sampling.  But as an approximation, it's good enough, especially in this case, because the linked lists in a hash table should never really be longer than one or two and &lt;b&gt;certainly&lt;/b&gt; never as long as, say, 83.&lt;br/&gt;&lt;br/&gt;



&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;[1]  Okay, if that link has since become broken, it points to this comment from the code for DTrace:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;/*
 * We want to have a name for the minor.  In order to do this,
 * we need to walk the minor list from the devinfo.  We want
 * to be sure that we don't infinitely walk a circular list,
 * so we check for circularity by sending a scout pointer
 * ahead two elements for every element that we iterate over;
 * if the list is circular, these will ultimately point to the
 * same element.  You may recognize this little trick as the
 * answer to a stupid interview question -- one that always
 * seems to be asked by those who had to have it laboriously
 * explained to them, and who can't even concisely describe
 * the conditions under which one would be forced to resort to
 * this technique.  Needless to say, those conditions are
 * found here -- and probably only here.  Is this the only use
 * of this infamous trick in shipping, production code?  If it
 * isn't, it probably should be...
 */
&lt;/pre&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-4194936228495627409?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/4194936228495627409/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=4194936228495627409' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/4194936228495627409'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/4194936228495627409'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2008/10/measuring-length-of-linked-list.html' title='Measuring the length of a linked list'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-1874003264815211566</id><published>2008-10-08T18:51:00.000-07:00</published><updated>2008-10-08T19:14:23.390-07:00</updated><title type='text'>Ptime modifications</title><content type='html'>Late last year I made a proposal to make some modifications to ptime(1) (the full text of the proposal can be found in the second message &lt;a href="http://opensolaris.org/os/community/arc/caselog/2007/598/mail"&gt;here&lt;/a&gt;.  The proposal includes links to the original RFEs.)  There've been some bumps along the way, but the changes are soon going to be putback into Solaris.  (Thanks, &lt;a href="http://blogs.sun.com/rv/"&gt;Rafael&lt;/a&gt;!)&lt;br/&gt;&lt;br/&gt;

When I made these changes, I was mostly just going through the list of open Solaris bugs looking for things to do; it wasn't something that I needed.  I've recently found it to be an exceedingly useful tool, though, especially with the -m and -p options.  For example, something's just not right with this particular process:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;# ptime -mp 3878

real  3:00:47.519957300
user    12:13.948207800
sys        37.210204400
trap    19:40.638837300
tflt        2.942089500
dflt        0.783120500
kflt        0.000000000
lock  5:29:23.879672500
slp   6:00:46.021626100
lat        17.081155000
stop        0.000081200
#
&lt;/pre&gt;&lt;br/&gt;

This is a case of misaligned memory accesses on a SPARC box, which is why all the time is showing up in trap.  Here's another process in a bad way:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;# ./ptime -mp 2715

real 106:51:30.161266600
user  4:28:49.022048500
sys  19:00:33.201052500
trap        0.047837500
tflt        0.051432200
dflt        0.038230200
kflt        0.000000000
lock 103:08:06.134729400
slp  299:24:41.087893200
lat   1:04:27.602323100
stop        2.913079800
&lt;/pre&gt;&lt;br/&gt;

The cause of this aberrant behavior is probably good fodder for a later blog entry.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-1874003264815211566?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/1874003264815211566/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=1874003264815211566' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/1874003264815211566'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/1874003264815211566'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2008/10/ptime-modifications.html' title='Ptime modifications'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-3550578794124045224</id><published>2008-10-05T13:21:00.000-07:00</published><updated>2008-10-05T18:37:23.063-07:00</updated><title type='text'>Memory alignment on SPARC, or a 300x speedup!</title><content type='html'>(Search for "Holy crap!" if you want to jump straight to the 300x speedup part.)&lt;br/&gt;&lt;br/&gt;

I remember first running across SIGBUS in an introductory programming course some years ago.  You'll get a SIGBUS when you have a misaligned memory access.  For example, if you're on a 32-bit processor, an integer is going to be 4-byte aligned, i.e., the address to access that integer will be evenly divisible by 4.  If you try to access an integer starting at an odd address, you'll get a SIGBUS, and your application will crash, probably leaving you a nice core file.  SIGBUS is one of those things you don't really understand until you have your head wrapped around pointers.  (It probably also helps to understand a little bit about computer architecture, at least if you're curious why they call it a SIGBUS instead of something else like SIGMISALIGN.)&lt;br/&gt;&lt;br/&gt;
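
To make "evenly divisible" concrete, here's a minimal sketch of the alignment test in C.  The is_aligned() helper is hypothetical, and it takes the address as an unsigned long purely to keep the example free of headers.&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
```c
/*
 * Hypothetical helper: return 1 if addr is a multiple of alignment,
 * 0 otherwise.  The caller must pass a nonzero alignment.
 */
int
is_aligned(unsigned long addr, unsigned long alignment)
{
        return (addr % alignment == 0);
}
```
&lt;/pre&gt;&lt;br/&gt;

With this, is_aligned(0x1000, 4) is true, while is_aligned(0x1001, 4) is false; the second address is the kind that would earn you a SIGBUS on a 4-byte access.&lt;br/&gt;&lt;br/&gt;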

In Solaris on the SPARC architecture, you can gracefully handle a misaligned memory access if you use the right compilation options.  (The x86 architecture handles things differently.)  The option to cc is "-xmemalign" with two parameters, the assumed byte alignment and how to handle misaligned memory access.  For example, "-xmemalign=8s" means that you want to assume that all memory accesses are 8-byte aligned and that you want a SIGBUS on any misaligned accesses.  "-xmemalign=4i" means that you want to assume that memory accesses are 4-byte aligned and that you want to handle misaligned memory accesses gracefully.&lt;br/&gt;&lt;br/&gt;

So what does it mean to handle misaligned accesses gracefully?  Darryl Gove, in his book &lt;a href="http://www.amazon.com/Solaris-Application-Programming-Darryl-Gove/dp/0138134553/ref=pd_bbs_sr_1?ie=UTF8&amp;s=books&amp;qid=1223238702&amp;sr=8-1"&gt;Solaris Application Programming&lt;/a&gt;, discusses this compiler option in a little more detail, but there's not much more to it than that the application will trap into the kernel on a misaligned memory access, and a kernel function will do the right thing.&lt;br/&gt;&lt;br/&gt;

Okay, so there are really two ways you can handle misaligned memory accesses (this is logical, given that there are two parameters to the -xmemalign compiler option.)  If you know ahead of time that you're going to have plenty of misaligned memory accesses, you can set the assumed byte alignment appropriately.  For example, if you know that things will frequently be aligned at odd addresses, you can do "-xmemalign=1s".  The penalty you'll pay for this is that 8-byte memory accesses will translate into eight separate instructions.  Your binary will be bigger, and you'll have a little added runtime, depending on how many memory accesses your program makes.&lt;br/&gt;&lt;br/&gt;
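
As a rough picture of what a one-byte alignment assumption costs, here's a sketch in C of reading a 4-byte value one byte at a time.  This is my own illustration of the idea, not the instruction sequence the compiler actually emits; load_u32_bytewise() is a made-up name, and it assumes big-endian byte order (as on SPARC) and a 32-bit unsigned int.&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
```c
/*
 * Hypothetical helper: assemble a 4-byte big-endian value from four
 * single-byte loads.  Because each load touches only one byte, the
 * pointer may sit at any address, odd or even.
 */
unsigned int
load_u32_bytewise(const unsigned char *p)
{
        unsigned int value = 0;
        int i;

        for (i = 0; i != 4; i++)
                value = value * 256 + p[i];

        return (value);
}
```
&lt;/pre&gt;&lt;br/&gt;

One wide load has turned into four narrow loads plus the arithmetic to glue them together, which is where the extra instructions and the extra runtime come from.&lt;br/&gt;&lt;br/&gt;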

If you don't think you'll have a lot of misaligned memory accesses, you can set the byte alignment appropriately and let the kernel handle any misaligned accesses.  You'll get a smaller binary, your runtime will be proportionately less, but every once in a while you'll pay a big penalty for a misaligned access.  But how big is that penalty?&lt;br/&gt;&lt;br/&gt;

Here's a sample program to measure the penalty we'd pay for misaligned memory accesses:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;#include &amp;lt;stdio.h&gt;
#include &amp;lt;stdlib.h&gt;

typedef struct {
        int a;
        int b;
} pair_t;

#define PAIRS 100
#define REPS 10000

int
main()
{
        int i, j;
        char *foo;
        pair_t *pairs;

        if ((foo = (char *) malloc((PAIRS + 1) * sizeof(pair_t))) == NULL) {
                fprintf(stderr, "Unable to allocate memory\n");
                exit(1);
        }

#ifdef ALIGNED
        pairs = (pair_t *) foo;
#else
        pairs = (pair_t *) (foo + 1);
#endif

        for (i = 0; i &lt; PAIRS; i++) {
                pairs[i].a = i;
                pairs[i].b = i+5;
        }

        for (j = 0; j &lt; REPS; j++) {
                int sum = 0;

                for (i = 0; i &lt; PAIRS; i++) {
                        sum += pairs[i].a + pairs[i].b;
                }
        }
}
&lt;/pre&gt;&lt;br/&gt;

With this Makefile:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;all:    aligned unaligned onebyte-aligned onebyte-unaligned

aligned:        memalign.c
        cc -DALIGNED -xmemalign=4i -o aligned memalign.c

unaligned:      memalign.c
        cc -xmemalign=4i -o unaligned memalign.c

onebyte-aligned:        memalign.c
        cc -DALIGNED -xmemalign=1s -o onebyte-aligned memalign.c

onebyte-unaligned:      memalign.c
        cc -xmemalign=1s -o onebyte-unaligned memalign.c
&lt;/pre&gt;&lt;br/&gt;

First, let's look at the impact of handling misaligned accesses in the kernel.  (The version of ptime(1) I'm using here is a modified version that will be putback into Solaris sometime soon, probably Nevada build 101 or 102.)&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;# ptime ./aligned

real        0.016512200
user        0.008128100
sys         0.004643800
# 
# 
# ptime ./unaligned

real        5.749458300
user        3.343899250
sys         2.339621750
# 
&lt;/pre&gt;&lt;br/&gt;

So, misaligned accesses cause us to run &lt;b&gt;300x&lt;/b&gt; slower!  &lt;b&gt;Holy crap!&lt;/b&gt;  This is nowhere near what I expected on first glance.  Given that we were spending some 40% of our time in sys, I would have expected to get that time back by eliminating the misaligned access, not a &lt;b&gt;30,000%&lt;/b&gt; speedup.  The unexpected thing here is that the time spent in userland is so large -- I'd have expected that to be about the same.  I'm not sure why this is the case; I'll have to do some digging.  (It's likely that we're blowing something out of the cache, which makes sense.  But that's just a hypothesis for the moment.)&lt;br/&gt;&lt;br/&gt;

That aside, if we look a bit deeper at this, we'll see where all of our time is spent (using the "microstate accounting" option to ptime(1), another part of the modifications being putback soon):&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;# ptime -m ./unaligned

real        6.263908350
user        3.591816850
sys         0.013817200
trap        2.589204200
tflt        0.000000000
dflt        0.000000000
kflt        0.000000000
lock        0.000000000
slp         0.000000000
lat         0.067424450
stop        0.000162500
# 
&lt;/pre&gt;&lt;br/&gt;

So the majority of that extra time is &lt;b&gt;not&lt;/b&gt; being spent actually doing the useful work of handling the misaligned memory access; it's being spent in &lt;b&gt;trap&lt;/b&gt;.  This isn't that unexpected, 'cause it's well-known that traps are expensive.  But it does demonstrate just how wasteful this is.&lt;br/&gt;&lt;br/&gt;

So now let's look at the other option, compiling with "-xmemalign=1s".  What performance penalty do we pay for this?  Here's a comparison with the above aligned version of the program:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;# ptime ./aligned

real        0.012762500
user        0.007944150
sys         0.004371950
# 
# ptime ./onebyte-aligned

real        0.030157100
user        0.024818150
sys         0.004310100
# 
# ptime ./onebyte-unaligned

real        0.030376500
user        0.024850600
sys         0.004306500
# 
&lt;/pre&gt;&lt;br/&gt;

Okay, so that's reasonable, we end up running about 2.5x slower.  (Note that the aligned and unaligned versions run in the same time, as we don't technically have any misaligned accesses.)  Of course, for any real application, being able to get a 250% performance improvement is probably worth investing some time to debug the misaligned memory accesses (no matter how much it might pale in comparison to a &lt;b&gt;30,000%&lt;/b&gt; performance improvement.)&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-3550578794124045224?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/3550578794124045224/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=3550578794124045224' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/3550578794124045224'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/3550578794124045224'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2008/10/memory-alignment-on-sparc-or-300x.html' title='Memory alignment on SPARC, or a 300x speedup!'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-6423029909577839258</id><published>2008-10-04T09:39:00.000-07:00</published><updated>2008-10-04T19:12:28.041-07:00</updated><title type='text'>Grabbing the value of a local variable with DTrace</title><content type='html'>I had an 
interesting DTrace experience the other day.  I was working with a developer to try to track down a bug that was causing tens of thousands of unaligned memory accesses per second in an application.  (This was on SPARC hardware, where a misaligned memory access causes a trap into the kernel so that it can be handled in software, assuming the compiler options are correct, which they were in this case.)&lt;br/&gt;&lt;br/&gt;

We got to a point in the code where a value in a local variable was being added to a pointer, and I wanted to see what that value was.  (I already knew that the pointer was correctly aligned coming into this function.)  The developer offered to go compile a version with some debugging print statements to get the value, which would take about fifteen minutes.  As he was walking away, I figured out how I could do this with DTrace.  By the time he got back to me with the value, I'd already extracted the value from the live instance of the app.&lt;br/&gt;&lt;br/&gt;

There's nothing terribly complicated about how I did it, but it does involve knowing a few things.  The first is that local variables in a function (automatic variables, at least) are stored in the stack frame and are accessed via an offset from the frame pointer.  (Although I guess I probably should have put knowing what a stack frame is and what the frame pointer is before knowing this.)  The second is knowing that every arithmetic SPARC instruction operates on registers, so any value in memory (e.g., in the stack frame) must be loaded into a register before being used in an arithmetic instruction.  The third is that you have the uregs[] array available to you in DTrace, so you can grab the value from one of the registers.  And the last one is that the pid provider in DTrace lets you instrument any instruction in an application.&lt;br/&gt;&lt;br/&gt;

Say we have the following function in a program, and say that I want to be able to determine the value of localvar that's being added to somearg:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;typedef struct {
        int a;
        int b;
} pair_t;

int
somefunction(int somearg, pair_t *somepair)
{
        int localvar;

        localvar = somepair-&gt;b == 0 ? somepair-&gt;a : somepair-&gt;b;

        return (somearg + localvar);
}
&lt;/pre&gt;&lt;br/&gt;

If we disassemble the function, we'll see this (I chose to use mdb, although dis would have sufficed.):&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;&gt; somefunction::dis
somefunction:                   save      %sp, -0x70, %sp
somefunction+4:                 or        %i1, %g0, %o0
somefunction+8:                 ld        [%o0 + 0x4], %o1
somefunction+0xc:               cmp       %o1, 0x0
somefunction+0x10:              bne       +0x10         &lt;somefunction+0x20&gt;
somefunction+0x14:              nop
somefunction+0x18:              ba        +0xc          &lt;somefunction+0x24&gt;
somefunction+0x1c:              ld        [%o0], %i5
somefunction+0x20:              or        %o1, %g0, %i5
somefunction+0x24:              st        %i5, [%fp - 0x8]
somefunction+0x28:              add       %i0, %i5, %o0
somefunction+0x2c:              st        %o0, [%fp - 0x4]
somefunction+0x30:              or        %o0, %g0, %i0
somefunction+0x34:              ret
somefunction+0x38:              restore
somefunction+0x3c:              or        %o0, %g0, %i0
somefunction+0x40:              ret
somefunction+0x44:              restore
&gt; 
&lt;/pre&gt;&lt;br/&gt;

At somefunction+8, we're sticking the value of somepair-&gt;b into register %o1.  From somefunction+0xc to somefunction+0x20, we're deciding which value to use for localvar.  It's at somefunction+0x1c that we put the value of somepair-&gt;a into register %i5, and at somefunction+0x20 we're putting the value of somepair-&gt;b into %i5.  (We already have the value in %o1, and the or with %g0 is just a way to move a value from one register to another -- %g0 is always a zero value.  Note also that somefunction+0x1c is the branch-delay slot of the preceding instruction, so we're not immediately overwriting %i5 at somefunction+0x20.)&lt;br/&gt;&lt;br/&gt;

The purpose of the instruction at somefunction+0x24 is to save the value in %i5 into the stack frame.  This is the value of localvar that's going to be added to somearg.  (The memory location of localvar is %fp - 0x8.  It's probably worth pointing out here that, if I'd compiled with certain optimizations, this instruction might not be here.  There's no real point in writing this value into the location for localvar in the stack frame, as the value will never be used again.  I'm using this particular instruction below, but no matter how optimized the code could get, I'd always still have some instruction to instrument, as the value of localvar has to be in a register in order to perform that addition.)&lt;br/&gt;&lt;br/&gt;

Given the above, I can use the following bit of D code to get the value of localvar that gets added to somearg (the 24 in the probe identifier indicates that this is at offset 24 (hex) from the beginning of somefunction):&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;pid$target:a.out:somefunction:24
{
        printf("localvar value is %d\n", uregs[R_I5]);
}
&lt;/pre&gt;&lt;br/&gt;

And when I set things up such that somepair-&gt;b is 7 (and thus localvar will have the value 7), I get the following:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;# dtrace -q -s ./someprogram.d -c ./a.out
localvar value is 7

# 
&lt;/pre&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-6423029909577839258?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/6423029909577839258/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=6423029909577839258' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/6423029909577839258'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/6423029909577839258'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2008/10/grabbing-value-of-local-variable-with.html' title='Grabbing the value of a local variable with DTrace'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-3290894501846491450</id><published>2008-09-13T17:06:00.000-07:00</published><updated>2008-09-13T17:22:10.967-07:00</updated><title type='text'>Suspend/resume on a ThinkPad T61 with OpenSolaris works</title><content type='html'>Woohoo!  Suspend and resume finally work for me!&lt;br/&gt;&lt;br/&gt;

I've been running OpenSolaris on my laptop since one of the 2008.05 release candidates.  I've been doing an image update every time a new one is available, and the first thing I always try is to see whether suspend/resume work for me.  Initially, suspend didn't work at all.  It would go through the process, hit whatever driver didn't yet have suspend support (I've forgotten which one it was), and gracefully back out of the process.&lt;br/&gt;&lt;br/&gt;

A couple of updates ago, whatever driver it was apparently got support for suspend, and my laptop finally did a full suspend, including lighting up the little green moon symbol that indicates that the laptop is asleep.  Woohoo!  I was cooking with gas!  At least I thought I was.  Unfortunately, resume didn't actually work, so being able to suspend didn't help me much.&lt;br/&gt;&lt;br/&gt;

Today I did an update to build 97, and of course I tried suspend/resume.  Suspend worked, but I was expecting that.  I opened the laptop, and nothing happened, as expected.  I hit a few keys, and nothing happened, as expected.  I hit the power button and saw some disk activity.  I'd seen this before, so I didn't get excited.  But then the screen came to life.  And things worked.  I didn't get too excited, as it could simply have been a fluke.  I tried it again, and it worked again.  I still wasn't going to claim victory, though.  Ten times later, it's worked successfully every time.&lt;br/&gt;&lt;br/&gt;

Woohoo!&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-3290894501846491450?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/3290894501846491450/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=3290894501846491450' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/3290894501846491450'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/3290894501846491450'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2008/09/suspendresume-on-thinkpad-t61-with.html' title='Suspend/resume on a ThinkPad T61 with OpenSolaris works'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-3211795995635915611</id><published>2008-07-28T08:42:00.000-07:00</published><updated>2008-07-28T10:36:23.628-07:00</updated><title type='text'>ZFS boot, Live Upgrade, and bfu</title><content type='html'>ZFS boot is now available in Solaris Express for both x86 and SPARC. (It's been available for a while for x86 (OpenSolaris uses a ZFS boot, and the package manager (IPS) makes good use of it), but only recently became available for SPARC.) Having gotten used to it on my laptop (running OpenSolaris), I finally converted my home machines to use a ZFS root file system.&lt;br/&gt;&lt;br/&gt;

Woohoo! Life is much simpler now! Well, okay, certain tasks that I perform frequently have become much simpler and take far less time. Specifically, when I've done a build of Solaris and want to BFU my system (i.e., upgrade my system to the build that has just finished), the process takes much less time than it used to.&lt;br/&gt;&lt;br/&gt;

As background, I use Live Upgrade (LU) on my systems at home. I keep one boot environment (BE) as a "pristine" copy of a recent Solaris Express release so I'll always have something to boot from. I have another BE that I use for BFU purposes so that I have an environment I can trash.&lt;br/&gt;&lt;br/&gt;

So here are the steps I used to perform to BFU my system:&lt;br/&gt;
&lt;ul&gt;
&lt;li&gt; Reboot onto the pristine BE, if I'm not already.
&lt;li&gt; Destroy and recreate the BFU BE as a copy of the pristine version. (This is mostly paranoia, so that I avoid any cross-pollination between two different builds I've done. This is probably unnecessary, but I can't avoid the paranoia.)
&lt;li&gt; BFU the newly-created BE.
&lt;li&gt; Reboot onto the BFU BE.
&lt;/ul&gt;&lt;br/&gt;&lt;br/&gt;

In the above, the BE creation step is the expensive one. I don't think I've ever timed it, but it's on the order of one hour, which pushes the whole process to an hour and a half or so.&lt;br/&gt;&lt;br/&gt;

With a ZFS root file system and a version of Live Upgrade that makes use of the features of ZFS, life becomes simpler. LU uses ZFS snapshots and clones to copy an existing BE. The differences are as follows:&lt;br/&gt;
&lt;ul&gt;
&lt;li&gt; I don't need to reboot onto the pristine BE. I needed the reboot only to get off the BE that I was about to destroy. (Although note that I could have avoided the reboot by simply having three or more BEs.) Because BEs are created as clones, I'm not limited to using existing disk partitions. I can create as many as I want.
&lt;li&gt; I don't need to destroy the existing BFU BE. See the previous item.
&lt;li&gt; Creating a new BFU BE takes about three minutes, not an hour. (There's additional work done by LU, so it's not quite as quick as simply cloning existing ZFS file systems, but it's significantly faster.)
&lt;li&gt; There are no significant differences in the BFU step, and I still have to reboot onto the new BE.
&lt;/ul&gt;&lt;br/&gt;&lt;br/&gt;

So a process that took about an hour and a half now takes less than ten minutes. Woohoo!&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-3211795995635915611?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/3211795995635915611/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=3211795995635915611' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/3211795995635915611'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/3211795995635915611'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2008/07/zfs-boot-live-upgrade-and-bfu.html' title='ZFS boot, Live Upgrade, and bfu'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-4593788865032558812</id><published>2008-07-16T06:54:00.000-07:00</published><updated>2008-07-16T07:26:02.266-07:00</updated><title type='text'>OSDevCon 2008 interview</title><content type='html'>I recently attended the &lt;a href="http://www.osdevcon.org/2008/"&gt;OpenSolaris Developer Conference&lt;/a&gt; in Prague.  
I ate lunch on the first day with a group of people from Sun (&lt;a href="http://blogs.sun.com/jimgris/"&gt;Jim Grisanzio&lt;/a&gt;, &lt;a href="http://blogs.sun.com/mman/"&gt;Martin Man&lt;/a&gt;, &lt;a href="http://blogs.sun.com/dom/"&gt;Dominic Kay&lt;/a&gt;, and &lt;a href="http://blogs.sun.com/deirdre/"&gt;Deirdre Straughan&lt;/a&gt;).  At some point, Dominic asked me if I was presenting and what I was talking about.  This led to a discussion of why I'd gotten involved with OpenSolaris development.  Deirdre later asked me if I'd be willing to talk about that on video.&lt;br/&gt;

&lt;a href="http://blogs.sun.com/storage/en_US/entry/contributing_to_dtrace_an_interview"&gt; Here's the result.&lt;/a&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-4593788865032558812?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/4593788865032558812/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=4593788865032558812' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/4593788865032558812'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/4593788865032558812'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2008/07/osdevcon-2008-interview.html' title='OSDevCon 2008 interview'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-5394168993357222151</id><published>2008-05-08T06:03:00.000-07:00</published><updated>2008-05-08T06:35:14.030-07:00</updated><title type='text'>April Fool's joke gone worse!</title><content type='html'>&lt;a href="http://cmynhier.blogspot.com/2008/04/april-fools-joke-gone-bad.html"&gt;Earlier&lt;/a&gt; I mentioned that my &lt;a href="http://www.opensolaris.org/jive/thread.jspa?threadID=56137&amp;tstart=0"&gt;April Fool's joke&lt;/a&gt; had gone bad, as it was requested that I write it up as a &lt;a 
href="http://wikis.sun.com/display/DTrace/Adding+an+action+to+DTrace"&gt;tutorial&lt;/a&gt;.&lt;br/&gt;&lt;br/&gt;

The joke has taken a turn for the worse.  I've now managed to turn it into a &lt;a href="http://www.osdevcon.org/2008/program_detail.html#chad"&gt;presentation&lt;/a&gt; at the &lt;a href="http://www.osdevcon.org/2008/index.html"&gt;OpenSolaris Developer Conference&lt;/a&gt; in Prague at the end of June.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-5394168993357222151?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/5394168993357222151/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=5394168993357222151' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/5394168993357222151'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/5394168993357222151'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2008/05/april-fools-joke-gone-worse.html' title='April Fool&apos;s joke gone worse!'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-4189610929810568794</id><published>2008-04-28T07:08:00.000-07:00</published><updated>2008-04-28T08:07:37.824-07:00</updated><title type='text'>Snap Upgrade rocks!</title><content type='html'>I've been a &lt;a href="http://docs.sun.com/app/docs/doc/820-4041"&gt;Live Upgrade&lt;/a&gt; convert for about a year and a half.  
I have four boot environments on my workstation at home, two for Solaris Express and two for testing OpenSolaris builds.  (Yes, two would be sufficient, and that's what I used to do until the second time I put an experimental build on top of my "stable" boot environment.  It's impossible to make that same mistake if I have two stable boot environments.)  But as great as Live Upgrade is, it has its shortcomings.  For one, you have to reserve disk partitions for boot environments.  You can't retrofit Live Upgrade onto an existing system.  For another, initializing a boot environment takes time, as it's making a second copy of the OS.  (And I'm sure other people have other complaints, but those are my main two.)&lt;br/&gt;&lt;br/&gt;

With the upcoming &lt;a href="http://opensolaris.org/os/project/indiana/"&gt;Indiana&lt;/a&gt; release, Live Upgrade has been replaced with Snap Upgrade.  (As I understand it, Live Upgrade can't be open-sourced for some reason.)  Snap Upgrade is ZFS-based, which has a direct impact on my two main complaints.  There's no longer a need to reserve partitions for additional boot environments, as boot environments are merely clones of the existing ZFS file systems.  This saves on disk space, as the copy-on-write nature of ZFS means that multiple boot environments will likely point at most of the same disk blocks.  It also means that initializing a boot environment is nearly instantaneous (about seven seconds here):&lt;br/&gt;
&lt;pre&gt;
# date &amp;&amp; beadm create foo &amp;&amp; date
Mon Apr 28 10:39:03 EDT 2008
Mon Apr 28 10:39:10 EDT 2008
#
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;
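To make the disk-space argument concrete, here is a toy model of copy-on-write cloning. It is only a sketch: the ToyDataset class and its methods are invented for illustration and have nothing to do with how ZFS or beadm are actually implemented. It assumes a dataset is just a map from block number to block contents, and that a clone copies the map but not the blocks.

```python
class ToyDataset:
    """A toy copy-on-write dataset: a map from block number to contents."""

    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})

    def clone(self):
        # Copying the map copies references only; the contents are shared
        return ToyDataset(self.blocks)

    def write(self, blkno, data):
        # A write points this dataset at a new block; the old block is
        # untouched and stays visible to any other dataset that uses it
        self.blocks[blkno] = bytes(data)

    def shared_with(self, other):
        # Count the blocks that are the very same object in both datasets
        return sum(1 for n, b in self.blocks.items()
                   if other.blocks.get(n) is b)


# A hypothetical 100-block root file system and one clone of it
root = ToyDataset({n: bytes([n]) * 512 for n in range(100)})
foo = root.clone()
foo.write(3, b"patched kernel module")
print(foo.shared_with(root))
```

After a single write, the clone still shares every other block with the original, which is why a freshly created boot environment shows up as only a few tens of kilobytes.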

The new beadm(1M) command is used to manage boot environments.  The command options are a bit simpler than the Live Upgrade commands and demonstrate their ancestry as being a mix of ZFS and Live Upgrade operations:&lt;br/&gt;

&lt;pre&gt;
NAME
     beadm - utility for managing zfs boot environments

SYNOPSIS
     /usr/sbin/beadm

     beadm create [-a] [-e non-activeBeName | beName@snapshot]
         [-o property=value] ... [-p zpool] beName

     beadm create beName@snapshot

     beadm destroy [-f] beName | beName@snapshot

     beadm list [-a | -ds] [-H] [beName]

     beadm mount beName mountpoint

     beadm unmount beName

     beadm rename beName newBeName

     beadm activate beName
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;

And a simple demonstration:&lt;br/&gt;

&lt;pre&gt;
# beadm list

BE          Active Active on Mountpoint Space
Name               reboot               Used
----        ------ --------- ---------- -----
opensolaris yes    yes       legacy     2.74G
initial     no     no        -          60.5K
# for i in 1 2 3 4 5 ; do
&gt; beadm create foo${i}
&gt; done
# beadm list

BE          Active Active on Mountpoint Space
Name               reboot               Used
----        ------ --------- ---------- -----
foo1        no     no        -          81.5K
opensolaris yes    yes       legacy     2.74G
initial     no     no        -          60.5K
foo5        no     no        -          81.5K
foo2        no     no        -          81.5K
foo3        no     no        -          80.5K
foo4        no     no        -          81.5K
# for i in 1 2 3 4 5 ; do
&gt; beadm destroy -f foo${i}
&gt; done
# beadm list

BE          Active Active on Mountpoint Space
Name               reboot               Used
----        ------ --------- ---------- -----
opensolaris yes    yes       legacy     2.74G
initial     no     no        -          60.5K
#
&lt;/pre&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-4189610929810568794?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/4189610929810568794/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=4189610929810568794' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/4189610929810568794'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/4189610929810568794'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2008/04/snap-upgrade-rocks.html' title='Snap Upgrade rocks!'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-6125288503782518060</id><published>2008-04-09T18:01:00.000-07:00</published><updated>2008-04-10T05:07:25.522-07:00</updated><title type='text'>April Fool's joke gone bad!</title><content type='html'>My April Fool's joke for this year can be found &lt;a href = "http://www.opensolaris.org/jive/thread.jspa?threadID=56137&amp;tstart=0"&gt;here&lt;/a&gt;.  &lt;a href="http://blogs.sun.com/brendan/"&gt;Brendan&lt;/a&gt; had his revenge by making me write it up as a &lt;a href = "http://wikis.sun.com/display/DTrace/Adding+an+action+to+DTrace"&gt;tutorial&lt;/a&gt;.  
(Okay, it was actually &lt;a href="http://blogs.sun.com/ahl/"&gt;Adam&lt;/a&gt; who suggested I flesh it out into a tutorial, but it sounds better to say that Brendan got his revenge.)&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-6125288503782518060?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/6125288503782518060/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=6125288503782518060' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/6125288503782518060'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/6125288503782518060'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2008/04/april-fools-joke-gone-bad.html' title='April Fool&apos;s joke gone bad!'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-8807031253333453162</id><published>2008-03-19T07:09:00.000-07:00</published><updated>2008-03-22T14:52:59.037-07:00</updated><title type='text'>dtrace.conf(08)</title><content type='html'>I attended dtrace.conf(08) last week, which was the first DTrace conference.  
On a personal level, I was glad to finally get to meet the guys behind DTrace (&lt;a href="http://blogs.sun.com/bmc/"&gt;Bryan Cantrill&lt;/a&gt;, &lt;a href="http://blogs.sun.com/ahl/"&gt;Adam Leventhal&lt;/a&gt;, and &lt;a href="http://blogs.sun.com/mws/"&gt;Mike Shapiro&lt;/a&gt;) and the sponsor for my DTrace code submissions, &lt;a href="http://blogs.sun.com/jonh/"&gt;Jon Haslam&lt;/a&gt;, as well as a number of people whose names were familiar from blogs and mailing lists (and &lt;a href="http://www.amazon.com/dp/0131482092?tag=solarisintern-20&amp;camp=14573&amp;creative=327641&amp;linkCode=as1&amp;creativeASIN=0131482092&amp;adid=09JAJHX8CYRKEX2TX3NS&amp;"&gt;books&lt;/a&gt; on &lt;a href="http://www.amazon.com/dp/0131568191?tag=solarisintern-20&amp;camp=14573&amp;creative=327641&amp;linkCode=as1&amp;creativeASIN=0131568191&amp;adid=1WNKBKFJ5VKJPTSVHRYY&amp;"&gt;Solaris&lt;/a&gt;.)  I was also pleasantly surprised to be told that I'd been given core contributor status in the DTrace community (along with &lt;a href="http://www.forsythesunsolutions.com/blog/2"&gt;Jarod Jensen&lt;/a&gt;.)&lt;br/&gt;&lt;br/&gt;

The conference itself was pretty interesting.  &lt;a href="http://www.forsythesunsolutions.com/node/97"&gt;Others&lt;/a&gt; &lt;a href="http://x86vmm.blogspot.com/2008/03/dtraceconf08.html"&gt;have&lt;/a&gt; &lt;a href="http://www.lethargy.org/~jesus/archives/107-dtrace.conf08.html"&gt;blogged&lt;/a&gt; &lt;a href="http://redmonk.com/sogrady/2008/03/16/dtraceconf-and-the-dumbest-guy-in-the-room/"&gt;about&lt;/a&gt; the conference, and I don't have much to add about the quality of the group in attendance or the subject matter presented.  (Unfortunately, the bulk of the attendees missed my Weekend Warrior's Guide to DTrace Development presentation, as I had the misfortune of being scheduled for the slot just before dinner, and the schedule slipped during the day.  I did finally get to talk in front of a group of about a dozen people sometime after a number of beers and a few rounds of &lt;a href="http://www.vmunix.com/mark/blog/archives/2007/08/15/fishpong-and-yes-the-iphone-camera-sucks/"&gt;fishpong&lt;/a&gt; (no, not &lt;a href="http://portal.acm.org/citation.cfm?id=1031669&amp;dl="&gt;that fishpong&lt;/a&gt;), which, I guess, makes for my humorous conference anecdote.)&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-8807031253333453162?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/8807031253333453162/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=8807031253333453162' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/8807031253333453162'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/8807031253333453162'/><link rel='alternate' 
type='text/html' href='http://cmynhier.blogspot.com/2008/03/dtraceconf08.html' title='dtrace.conf(08)'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-8680669434130187856</id><published>2008-02-07T11:32:00.000-08:00</published><updated>2008-02-07T11:40:29.854-08:00</updated><title type='text'>DTrace code putback</title><content type='html'>Woohoo!  The code's officially been putback.  Note that the putback actually incorporates three bugs:&lt;br/&gt;
&lt;pre&gt;
6325485 A stddev() aggregator would be a nice adjunct to avg()
6618705 p*d123 doesn't cause pid probes to be created
6624541 dtrace aggregations should assume signed arguments
&lt;/pre&gt;&lt;br/&gt;

Thanks to Jon Haslam for sponsoring it, Jon and Adam Leventhal for rigorous code reviews, and Bryan Cantrill for approving the request to integrate.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-8680669434130187856?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/8680669434130187856/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=8680669434130187856' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/8680669434130187856'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/8680669434130187856'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2008/02/dtrace-code-putback.html' title='DTrace code putback'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-4892392735183110494</id><published>2008-01-18T21:35:00.000-08:00</published><updated>2008-01-18T18:34:21.253-08:00</updated><title type='text'>The standard deviation aggregating action in DTrace</title><content type='html'>Woohoo!  The standard deviation aggregating action (stddev()) for DTrace will soon be put back into OpenSolaris.  This is my first putback for DTrace.&lt;br/&gt;&lt;br/&gt;

The change to add stddev() itself was actually very simple.  First, note that it uses this approximation to standard deviation:
&lt;pre&gt;
sqrt(avg(x^2) - avg(x)^2)
&lt;/pre&gt;&lt;br/&gt;

(For anyone who knows enough to complain about this, note this passage from the proposal I submitted:
&lt;pre&gt;
        It is recognised that this is an imprecise approximation to
        standard deviation, but it is calculable as an aggregation, and
        it should be sufficient for most of the purposes to which
        DTrace is put.
&lt;/pre&gt;&lt;br/&gt;
)&lt;br/&gt;&lt;br/&gt;

When I was thinking about how to implement this, the natural thing to do was to model it after how avg() was implemented.  When I first saw it, I was a bit surprised at how simple the implementation for avg() was (although in retrospect it was obvious, and I had even been thinking to myself that it must be fairly simple.)  The code that implements the real meat of the avg() aggregating action (at least in the kernel) is just this (to be found &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/dtrace/dtrace.c"&gt;here&lt;/a&gt;):
&lt;pre&gt;
static void
dtrace_aggregate_avg(uint64_t *data, uint64_t nval, uint64_t arg)
{
        data[0]++;
        data[1] += nval;
}
&lt;/pre&gt;&lt;br/&gt;

All it's doing is keeping a count and a sum.  The average itself is calculated in post-processing.  So the implementation for stddev() is really just as simple as this:
&lt;pre&gt;
static void
dtrace_aggregate_stddev(uint64_t *data, uint64_t nval, uint64_t arg)
{
        data[0]++;
        data[1] += nval;
        data[2] += nval * nval;
}
&lt;/pre&gt;&lt;br/&gt;

That is, just keep track of the count and the sum (to calculate avg(x)) and the sum of x^2 (to calculate avg(x^2)).  Of course, a problem creeps in here -- if nval doesn't fit in 32 bits, nval * nval overflows our 64 bits.  (And note that this is technically a problem for the existing implementation of avg(), too, as its sum could silently overflow its 64 bits.  Pretty unlikely, but not impossible.  &lt;a href="http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6624695"&gt;This bug&lt;/a&gt; has been filed to correct this.)&lt;br/&gt;&lt;br/&gt;
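Here is a small user-land sketch of the same bookkeeping, in Python rather than kernel C. It is not the DTrace code: the StddevState class and the wrap_at_64_bits flag are made up for illustration, and the post-processing step is my own rearrangement of the formula above. It shows the three running totals, the calculation done afterwards, and what happens to the sum of squares when the slots behave like uint64_t values.

```python
import math

U64 = 2**64  # a uint64_t wraps modulo 2**64


class StddevState:
    """Three running totals, as in the C above: count, sum, sum of squares."""

    def __init__(self, wrap_at_64_bits=False):
        self.data = [0, 0, 0]
        self.wrap = wrap_at_64_bits

    def aggregate(self, nval):
        self.data[0] += 1
        self.data[1] += nval
        self.data[2] += nval * nval
        if self.wrap:
            # Simulate uint64_t slots that silently overflow
            self.data = [slot % U64 for slot in self.data]

    def stddev(self):
        count, total, squares = self.data
        # sqrt(avg(x^2) - avg(x)^2), rearranged so that the subtraction
        # is done in exact integers before any division
        numerator = count * squares - total * total
        return math.sqrt(numerator / (count * count))


state = StddevState()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    state.aggregate(value)
print(state.stddev())

# One value of 2**32 is enough to wrap a 64-bit sum of squares
exact = StddevState()
wrapped = StddevState(wrap_at_64_bits=True)
exact.aggregate(2**32)
wrapped.aggregate(2**32)
print(exact.data[2], wrapped.data[2])
```

The first print gives 2.0 for that sample. The second shows the overflow: Python's unbounded integers keep 2^64 as the sum of squares, while the simulated 64-bit slot wraps around to zero.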

What we decided to do to get around the problem was to implement 128-bit arithmetic functions to support this.  The obvious question with respect to doing this is why not just use some multi-precision library and be done with it.  Doing so would have introduced a dependency between the kernel and this external library, though, and given how easy it is to implement this, it was better to do so.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-4892392735183110494?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/4892392735183110494/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=4892392735183110494' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/4892392735183110494'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/4892392735183110494'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2008/01/standard-deviation-aggregating-action.html' title='The standard deviation aggregating action in DTrace'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-6094900342799052496</id><published>2008-01-13T15:26:00.000-08:00</published><updated>2008-01-13T12:29:17.630-08:00</updated><title type='text'>Installing the OpenSolaris preview remotely</title><content type='html'>This was fun.  
I wanted to install the &lt;a href="http://opensolaris.org/os/project/indiana/"&gt;OpenSolaris preview&lt;/a&gt; on a ThinkPad.  Unfortunately, the ThinkPad has an NVIDIA video card that isn't supported by the current version of the OpenSolaris LiveCD.  But the GRUB menu shows a text-only option for booting, so I boot into that.&lt;br/&gt;&lt;br/&gt;

It boots and gives me a console prompt, and I try to figure out how to do a text-based install.  There's /usr/bin/gui-install, but nothing obvious for a text-based install.  A little bit of Googling, and I discover that there's no text-based install in the preview release.  (It is the alpha release, so it's not terribly surprising.)&lt;br/&gt;&lt;br/&gt;

Okay, I'll just ssh in, run the GUI installer, and have it display on my desktop.  When I try to ssh, I get this message: "no kex alg" (no key exchange algorithm.)  Some Googling yields &lt;a href="http://opensolaris.org/jive/thread.jspa?messageID=176002"&gt;this thread&lt;/a&gt;, which indicates that it's simply a lack of host keys.  I generate the host keys and successfully log in.&lt;br/&gt;&lt;br/&gt;

Okay, now X forwarding isn't working.  I look at /etc/ssh/sshd_config, but things look fine there.  I get this message on the laptop:  "failed to create a directory for the temporary X authority file: Error 0; will use the default xauth file".  Some Googling shows me &lt;a href="http://bugs.opensolaris.org/view_bug.do?bug_id=6613343"&gt;this bug&lt;/a&gt; (which is actually a duplicate of &lt;a href="http://bugs.opensolaris.org/view_bug.do?bug_id=6496972"&gt;this bug&lt;/a&gt;), which indicates that the error message is misleading.&lt;br/&gt;&lt;br/&gt;

On a hunch, I decide to check if xauth is installed on the LiveCD.  Doh!  No xauth.  Oh, but that's okay, I'll just copy a version over and see if that works.  From "strings sshd", I see that sshd has the hard-coded path "/usr/openwin/bin/xauth".  Hmm, /usr is a read-only file system mounted from the CD.&lt;br/&gt;&lt;br/&gt;

Given that I can't copy xauth to the hard-coded path, what can I do?  Well, there's the other option of changing the hard-coded path.  I could just compile a version of sshd with a different path for xauth, but I'm feeling lazy, so I decide to just change the hard-coded path in the binary.&lt;br/&gt;&lt;br/&gt;

In trying to figure out how to do this with the tools available on the LiveCD, I run across a mention that vim can handle editing binary files.  (I'm sure that emacs would also be able to do it.)  I copy the sshd binary to my desktop, open the file in vim, find the string "/usr/openwin/bin/xauth" and replace it with "/var/tmp/foo/bin/xauth".  (Note that the lengths of the strings are the same -- I don't want to throw off the addresses for everything else in the file.  Even then, the resulting file was one byte larger, although od shows me that that extra byte isn't in the modified string.)&lt;br/&gt;&lt;br/&gt;
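The same edit can be sketched in a few lines of Python, which makes the equal-length requirement explicit. This is an illustration and not what I actually ran: the patch_path function is invented here, and the bytes it operates on are a stand-in, not a real sshd. My guess about the extra byte is that vim appended a trailing newline because it wasn't started in binary mode (vim -b), but I haven't confirmed that.

```python
def patch_path(blob, old, new):
    """Replace one embedded path with another of identical length."""
    if len(old) != len(new):
        raise ValueError("replacement must be the same length as the original")
    if blob.count(old) != 1:
        raise ValueError("expected exactly one occurrence of the original path")
    return blob.replace(old, new)


OLD = b"/usr/openwin/bin/xauth"
NEW = b"/var/tmp/foo/bin/xauth"

# Stand-in bytes for illustration; a real run would read the binary with
# open(path, "rb") and write the result back with open(path, "wb")
fake_binary = b"\x7fELF\x00junk\x00" + OLD + b"\x00more\x00"
patched = patch_path(fake_binary, OLD, NEW)
print(len(fake_binary), len(patched))
```

Because the file is read and written as raw bytes, nothing is added or removed, so the two lengths printed are the same and every offset after the string is unchanged.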

I copy the sshd binary back to the laptop and give it a try.  The binary runs, and X forwarding works (given that I'd already copied xauth to the appropriate location.)  I run /usr/bin/gui-install, and bam! the installer window pops up on my desktop.  A little while later, I have the OpenSolaris preview installed.&lt;br/&gt;&lt;br/&gt;

(And then I follow the &lt;a href="http://opensolaris.org/os/project/indiana/resources/update_guidelines/"&gt;instructions  to update packages&lt;/a&gt; and grab the &lt;a href="http://www.nvidia.com/object/unix.html"&gt;latest NVIDIA driver package&lt;/a&gt;.  One reboot later, and I'm cooking with gas.)&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-6094900342799052496?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/6094900342799052496/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=6094900342799052496' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/6094900342799052496'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/6094900342799052496'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2008/01/installing-opensolaris-preview-remotely.html' title='Installing the OpenSolaris preview remotely'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-640092037253791562</id><published>2007-12-02T08:43:00.000-08:00</published><updated>2007-12-02T09:15:17.283-08:00</updated><title type='text'>Persistent reverse SSH tunnel</title><content type='html'>A little while ago, I found myself with the desire to access my home machine when I'm not at home.  
(I was mostly looking for the ability to do things like set off an OpenSolaris "nightly" build from work so that it would be done before I was at home in front of my computer again.)  I wondered if you could do something like this with ssh, and a little digging turned up the -R flag, which does exactly what I want.&lt;br/&gt;&lt;br/&gt;

While Googling on the topic, I ran across &lt;a href="http://www.brandonhutchinson.com/ssh_tunnelling.html"&gt;this&lt;/a&gt;, in which the author discusses using a cron job to make sure that the link would be re-established in the event that something terminated the process.  This is one way to achieve persistence, but given that I'm running OpenSolaris at home, there's a better way to achieve the same.  If I simply start the process like this:
&lt;pre&gt;
ctrun -r 0 -t -f hwerr,core,signal ssh -R 2069:localhost:22 remote.domain.com
&lt;/pre&gt;
then Solaris will guarantee that my SSH process gets restarted whenever it dies.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-640092037253791562?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/640092037253791562/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=640092037253791562' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/640092037253791562'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/640092037253791562'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2007/12/persistent-reverse-ssh-tunnel.html' title='Persistent reverse SSH tunnel'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-5646088715192652550</id><published>2007-09-20T10:03:00.000-07:00</published><updated>2007-09-20T10:05:58.701-07:00</updated><title type='text'>First OpenSolaris putback</title><content type='html'>Cool, my first OpenSolaris putback:

&lt;pre&gt;
*********  This mail is automatically generated  *******

Your putback for the following fix(es) is complete:

   6314610 audit_syslog(5) plugin module logs IP addresses in host byte order
   Contributed by Chad Mynhier (cmynhier@gmail.com).


These fixes will be in release:

       snv_75

The gate's automated scripts will mark these bugs "8-Fix Available"
momentarily, and the gatekeeper will mark them "10-Fix Delivered"
as soon as the gate has been delivered to the WOS.  You should not
need to update the bug status.

       Your Friendly Gatekeepers
&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-5646088715192652550?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/5646088715192652550/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=5646088715192652550' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/5646088715192652550'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/5646088715192652550'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2007/09/first-opensolaris-putback.html' title='First OpenSolaris putback'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-5943445057658613206</id><published>2007-05-17T06:06:00.000-07:00</published><updated>2007-05-17T08:00:09.504-07:00</updated><title type='text'>Re-linking an unlinked file with fsdb</title><content type='html'>Given that you have an unlinked file on a file system, how do you relink it into the file system?&lt;br/&gt;&lt;br/&gt;

One of the things you can do is just work around the problem.  You could just copy /proc/XXX/fd/Y, which gets you the contents of the file but not the actual file itself.&lt;br/&gt;&lt;br/&gt;

What you really want to be able to do is "ln /proc/XXX/fd/Y /some/dir/foo".  In theory, there's no reason why this shouldn't work, given that /proc/XXX/fd/Y identifies the inode of the unlinked file within its file system, but it doesn't currently work.&lt;br/&gt;&lt;br/&gt;

I tried the following with fsdb (on Solaris on a UFS file system), and it worked to relink the file into the file system.  Doing something  like this isn't generally advisable, as it involves manipulating the disk directly.  This bypasses the file-caching mechanism in the kernel, so you'll end up with a disk image that doesn't match what the kernel thinks it should look like.  In my case, things were fine after a reboot and fsck, but I may simply have gotten lucky.&lt;br/&gt;&lt;br/&gt;

(I was prompted to try this after discovering that it's possible to unlink(1M) a non-empty directory.  I was curious to see whether or not the inodes in that directory were orphaned by doing so, and it turns out that they are.  Given that I had a file system with orphaned inodes, I wanted to see if I could remedy the situation.  I was mostly interested to see if I could fix the live file system, which I was unfortunately unable to do.)&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;# fsdb -o w /dev/md/rdsk/d1
fsdb of /dev/md/rdsk/d1 (Opened for write) -- last mounted on /var
fs_clean is currently set to FSLOG
fs_state consistent (fs_clean CAN be trusted)
/dev/md/rdsk/d1 &gt; :cd tmp
/dev/md/rdsk/d1 &gt; :ls -l
/tmp:
i#: b6          ./
i#: 2           ../
i#: 4d9a        .java/
i#: 1279        SMB--0A40081A-044_0
i#: 4dbd        autosave/
i#: 4970        chadfoo/
i#: 127b        cifs_0
i#: 127c        host_0
i#: 4d9e        mancache/
i#: 1277        named.pid
i#: 1276        named.run
i#: 127a        smb--0a40081a-044_0
/dev/md/rdsk/d1 &gt;
&lt;/pre&gt;&lt;br/&gt;

The unlinked file that I'm concerned about existed in the chadfoo directory.  (Not that this really matters, given that the scope of the name space for these inode numbers is /var, but I want to link it back to its original location.)&lt;br/&gt;&lt;br/&gt;

The plan is to create an empty file ("bigfile") in /var/tmp/chadfoo and use fsdb to change the inode number for that particular directory entry to be that of the unlinked file.  So that I don't lose the inode of the empty file, I create a second link to it, "bigfile2", and I fix link counts afterwards so that everything's kosher.&lt;br/&gt;&lt;br/&gt;

Here, I look at the directory entries for inode 4970 (/var/tmp/chadfoo):&lt;br/&gt;&lt;br/&gt;
&lt;pre&gt;/dev/md/rdsk/d1 &gt; 4970:inode; 0:dir?d
i#: 4970        .
/dev/md/rdsk/d1 &gt; 4970:inode; 1:dir?d
i#: b6          ..
/dev/md/rdsk/d1 &gt; 4970:inode; 2:dir?d
i#: 4a62        bigfile
/dev/md/rdsk/d1 &gt; 4970:inode; 3:dir?d
i#: 4a62        bigfile2
/dev/md/rdsk/d1 &gt;
&lt;/pre&gt;&lt;br/&gt;

And then I set the inode number for directory entry number 2 to be that of the unlinked file (decimal 19037):&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;/dev/md/rdsk/d1 &gt; 4970:inode; 2:dir=0t19037
i#: 4a5d        bigfile
/dev/md/rdsk/d1 &gt;
&lt;/pre&gt;&lt;br/&gt;
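
(A quick sanity check on the numbers in that transcript: fsdb displays inode numbers in hex, and the 0t prefix marks the value I typed as decimal.  A minimal sketch, using nothing but the standard printf utility, confirms that decimal 19037 is the 4a5d that fsdb echoes back.)&lt;br/&gt;
&lt;pre&gt;
```shell
# Convert the decimal inode number given to fsdb into hex.
inode_dec=19037
inode_hex=$(printf '%x' "$inode_dec")
echo "$inode_hex"
```
&lt;/pre&gt;&lt;br/&gt;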

What I've done doesn't change the link counts, though, so I need to do this manually:&lt;br/&gt;&lt;br/&gt;
&lt;pre&gt;/dev/md/rdsk/d1 &gt; 4a62:inode?i
i#: 4a62           md: ----rw-r--r--  uid: 2d85          gid: a
ln: 2              bs: 0              sz : c_flags : 0           0

        accessed: Wed May 16 11:12:22 2007
        modified: Wed May 16 11:12:22 2007
        created : Wed May 16 11:13:02 2007
/dev/md/rdsk/d1 &gt; :ln=1
i#: 4a62           md: ----rw-r--r--  uid: 2d85          gid: a
ln: 1              bs: 0              sz : c_flags : 0           0

        accessed: Wed May 16 11:12:22 2007
        modified: Wed May 16 11:12:22 2007
        created : Wed May 16 11:13:02 2007
/dev/md/rdsk/d1 &gt; 4a5d:inode?i
i#: 4a5d           md: ----rw-r--r--  uid: 2d85          gid: a
ln: 0              bs: 200410         sz : c_flags : 0           40000000

db#0: 28e00        db#1: 28e48        db#2: 28e50        db#3: 28e58
db#4: 28e60        db#5: 28e68        db#6: 28e70        db#7: 28e78
db#8: 28e80        db#9: 28e88        db#a: 28e90        db#b: 28e98
ib#0: 204748       ib#1: cc808
        accessed: Wed May 16 10:39:20 2007
        modified: Wed May 16 10:39:44 2007
        created : Wed May 16 10:39:44 2007
/dev/md/rdsk/d1 &gt; :ln=1
i#: 4a5d           md: ----rw-r--r--  uid: 2d85          gid: a
ln: 1              bs: 200410         sz : c_flags : 0           40000000

db#0: 28e00        db#1: 28e48        db#2: 28e50        db#3: 28e58
db#4: 28e60        db#5: 28e68        db#6: 28e70        db#7: 28e78
db#8: 28e80        db#9: 28e88        db#a: 28e90        db#b: 28e98
ib#0: 204748       ib#1: cc808
        accessed: Wed May 16 10:39:20 2007
        modified: Wed May 16 10:39:44 2007
        created : Wed May 16 10:39:44 2007
/dev/md/rdsk/d1 &gt;
&lt;/pre&gt;&lt;br/&gt;

And at this point, I reboot into single user, fsck /var (which shows a few problems to be corrected), and things are fine.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-5943445057658613206?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/5943445057658613206/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=5943445057658613206' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/5943445057658613206'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/5943445057658613206'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2007/05/re-linking-unlinked-file-with-fsdb.html' title='Re-linking an unlinked file with fsdb'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-1836914396344065118</id><published>2007-05-06T12:10:00.000-07:00</published><updated>2007-05-07T06:44:48.683-07:00</updated><title type='text'>ext3cow -- snapshots for ext3</title><content type='html'>A couple of days ago, I wrote about &lt;a href="http://cmynhier.blogspot.com/2007/05/chunkfs-minimizing-fsck-time.html"&gt;chunkfs&lt;/a&gt;.  I also recently ran across &lt;a href="http://www.ext3cow.com/"&gt;ext3cow&lt;/a&gt;.  Chunkfs has a very specific goal, that of minimizing fsck time.  Similarly, ext3cow has a very specific goal -- providing snapshots.  
The motivation seems to be very business/regulatory-compliance-related, as evidenced by the publication list related to ext3cow (Building Regulatory Compliant Storage Systems, Secure Deletion for a Versioning File System, Verifiable Audit Trails for a Versioning File System, Limiting Liability in a Federally Compliant File System, etc.)  But snapshots are a generally useful feature outside the world of compliance, as anyone who's ever accidentally deleted a few days' work on a programming assignment can attest.&lt;br/&gt;&lt;br/&gt;

(Note that a snapshotting file system is not, in and of itself, a versioning file system.  A versioning file system maintains intermediate versions of files.  If I make three separate modifications to a file on the same day, I get three copies of the file.  A snapshotting file system only maintains the versions of files that existed when a snapshot was taken.  If I make three separate modifications to a file on the same day, and a snapshot is taken once a day, I only get the last version of the file.  With respect to the relevant regulations, however, the final version of a file as it existed on any given day (or some other specified time period) may be all that matters.)&lt;br/&gt;&lt;br/&gt;

One of the design features of ext3cow is that changes to support snapshotting were localized to the ext3 code itself, so that no changes were necessary to the VFS data structures or interfaces.  This certainly makes life easy, as it can be used with otherwise stock kernels.  This also places some restrictions on what can be done with this filesystem.&lt;br/&gt;&lt;br/&gt;

There are a few parts to how ext3cow works.  First, they've added a field to the superblock to contain what they call the epoch counter.  This is merely the timestamp of the last snapshot of the file system, in seconds since the epoch.&lt;br/&gt;&lt;br/&gt;

They've added three fields to the inode itself:  an epoch counter, a copy-on-write bitmap, and a next inode pointer.  The epoch counter of a file gets updated with the epoch counter of the file system every time a write to the inode occurs.  If the epoch counter of an inode is older than the file system epoch, it means that a new snapshot has been taken since the file was last updated, and the copy-on-write mechanism comes into play.  This mechanism involves the other two new fields.&lt;br/&gt;&lt;br/&gt;
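
As a rough sketch of that check (the function name and the sample timestamps are my own invention for illustration, not anything taken from the ext3cow source), the decision comes down to a single comparison of two epochs:&lt;br/&gt;
&lt;pre&gt;
```shell
# Hypothetical sketch: decide whether a write must go through copy-on-write.
# $1 = inode epoch, $2 = file system epoch (both in seconds since the epoch).
needs_cow() {
    if [ "$1" -lt "$2" ]; then
        echo yes    # a snapshot was taken after the last write to this inode
    else
        echo no     # inode was already written in the current epoch
    fi
}

needs_cow 1178000000 1178086400
needs_cow 1178086400 1178086400
```
&lt;/pre&gt;&lt;br/&gt;
The first call stands for an inode last written before the most recent snapshot, the second for one already updated since that snapshot.&lt;br/&gt;&lt;br/&gt;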

Multiple versions of a file can share data blocks, as long as those data blocks haven't changed.  (This is the efficiency gained by copy-on-write mechanisms in general.)  In ext3cow, the copy-on-write bitmap keeps track of which data blocks are shared between versions of a file, essentially indicating whether each block is read-only (a new block needs to be allocated before modifying the data) or read-write (a new block has already been allocated, thus the data can be modified in-place.)  If an update needs to be made to a read-only block, a new block is allocated, the inode is updated to point to the new block, the new block is written out, and the COW bitmap is updated to reflect that the block has been allocated.  (Note that the old block is likely not copied to the new block, as it was likely first read in before being modified.  The copy of the data thus already exists in memory.)&lt;br/&gt;&lt;br/&gt;

The copy-on-write mechanism in ext3cow involves allocating new inodes for the snapshotted versions of files, thus the next-inode pointer in the inode.  (Note that the live inode necessarily remains the same so that things like NFS continue to work.)  The versions of a file are represented by a linked list via the next-inode pointer, with the current inode at the head of the list to make the common-case access fast.  When the inode for the snapshotted version of a file is allocated, all of the data block pointers (and indirect pointers, etc.) are the same for the snapshotted inode and the live inode.  The live inode will only have data blocks unique to itself when modifications or additions are made to the file.&lt;br/&gt;&lt;br/&gt;

The final metadata modification they've made is to the directory entry data structure.  For each directory entry, they've added a birth epoch and a death epoch, essentially a range during which that inode is live.  This allows multiple different instances of a filename to exist in a directory (as the lifetimes wouldn't overlap), and it avoids muddying the namespace with old versions of files (e.g., an ls of a current directory will show only those directory entries with an unspecified death epoch.)&lt;br/&gt;&lt;br/&gt;
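
The liveness test on a directory entry might be sketched like this.  The function name is hypothetical, and treating the range as inclusive of the birth epoch and exclusive of the death epoch is my assumption; the description above doesn't pin that down.&lt;br/&gt;
&lt;pre&gt;
```shell
# Hypothetical sketch: is a directory entry visible at time $3?
# $1 = birth epoch, $2 = death epoch (empty string if unspecified).
# Assumption: the range is birth-inclusive and death-exclusive.
is_live() {
    if [ "$3" -lt "$1" ]; then
        echo no     # entry did not exist yet
    elif [ -z "$2" ]; then
        echo yes    # no death epoch: this is the current entry
    elif [ "$3" -lt "$2" ]; then
        echo yes    # inside the entry's lifetime
    else
        echo no     # entry had already died
    fi
}

is_live 100 200 150    # old version, viewed inside its lifetime
is_live 100 200 250    # same entry, viewed after its death epoch
is_live 200 "" 250     # current version, no death epoch yet
```
&lt;/pre&gt;&lt;br/&gt;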

So, given that I'm a proponent of ZFS, how does ext3cow compare?  To some extent, that's not really a fair question, as the scope of ZFS is much larger than the scope of ext3cow.  The main purpose of ext3cow is to provide snapshots, whereas snapshots in ZFS are almost merely a side-effect of other design goals.  Ext3cow is certainly a creative use of what's available in ext3 to provide the ability to take snapshots.&lt;br/&gt;&lt;br/&gt;

Given that it's not fair to compare the two as file systems per se, I will point out some similarities and differences that are interesting to note.&lt;br/&gt;&lt;br/&gt;

A similarity between the two is in the amount of work required to take a snapshot.  There is a constant amount of work involved in each case.  For ext3cow, the work involved is merely updating the snapshot epoch in the superblock.  For ZFS, the work involved is merely noting that we don't want to garbage-collect a specific version of the uberblock.&lt;br/&gt;&lt;br/&gt;

As for differences, the first to note is that ext3cow allocates new inodes for snapshotted files (but only when the current file changes.)  Given that ext3 doesn't support dynamic inode allocation, this means that snapshotted files will consume that limited resource.  While in general this is unlikely to be a problem, it could affect certain use cases such as frequently-snapshotted and highly active file systems or even moderately active file systems with long snapshot retention requirements.  Of course, in these cases, disk space is more likely to be the overriding concern.&lt;br/&gt;&lt;br/&gt;

In contrast, ZFS does not allocate new inodes for snapshotted files; it uses the inode as it existed when the snapshot was taken.  Of course, given that all file system updates are copy-on-write (even in the absence of snapshots), one could argue that ZFS is constantly allocating new inodes for modified files.  From this point of view, the only resource ZFS isn't consuming for snapshots is inode numbers.&lt;br/&gt;&lt;br/&gt;

Another difference is that ext3cow snapshots are essentially only snapshots of the data and not a snapshot of the file system per se.  Ext3cow doesn't support snapshotting the metadata itself.  The metadata for a snapshotted file can change at some later time when the live file is modified.  A new inode is allocated, the old metadata is copied over, and, at a minimum, the inode number and the next-inode pointer are modified.&lt;br/&gt;&lt;br/&gt;

In contrast, ZFS snapshots are a point-in-time view of the state of the entire file system, including metadata.  This is a result of the copy-on-write mechanism in ZFS.  File changes involve copy-on-write modifications, not only of the data blocks, but also of the metadata up the tree to the uberblock.  When a snapshot is taken, that version of the uberblock is preserved, along with all of the metadata and data blocks referenced via this uberblock.  The metadata doesn't need to be modified, and new inodes don't need to be allocated for snapshotted files.&lt;br/&gt;&lt;br/&gt;

Of course, whether or not the metadata is maintained in a pristine state is mostly of theoretical interest.  It likely doesn't matter for regulatory compliance, where the data itself is the primary concern.  (Well, I'm sure a highly-paid lawyer could probably use this point to introduce doubt about the validity of snapshotted files, if there were ever a court case involving very large sums of money.)  And I'm sure this subtlety doesn't matter to the student who recovers a file representing many hours of work when a deadline is only a very few hours away.&lt;br/&gt;&lt;br/&gt;

Another difference is the amount of work involved in recovering a snapshotted version of a file.  With ext3cow, this involves a linked-list traversal of next-inode pointers while comparing the desired file time to the liveness range of each inode in turn.  With ZFS, this involves merely looking up the file in an alternate version of the filesystem, which is no more work than finding the current version of that file.  But again, this is likely a difference that won't matter.  Looking up the snapshotted version of a file is the rare case, and according to the authors' experimental results, the penalty is negligible.&lt;br/&gt;&lt;br/&gt;

One subtle difference between the two file systems is that ext3cow allows for individual file snapshots, which ZFS doesn't support.  That ext3cow supports this almost falls out of the implementation -- the epoch counter for the file is updated to the current time rather than the file system epoch, and the copy-on-write mechanism catches this condition.  Similarly, the implementation of snapshots in ZFS prevents this.  Snapshots are implemented using the uberblock.  Taking a snapshot of a single file would involve also snapshotting the metadata up the tree to the uberblock.  Guaranteeing that the snapshot is only of a single file would involve pruning away all other directories and files.  Given that snapshotting is such a cheap operation in ZFS, it doesn't make sense to do so.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-1836914396344065118?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/1836914396344065118/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=1836914396344065118' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/1836914396344065118'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/1836914396344065118'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2007/05/ext3cow-snapshots-for-ext3.html' title='ext3cow -- snapshots for ext3'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-4206489316675285560</id><published>2007-05-04T05:07:00.000-07:00</published><updated>2007-05-04T08:18:45.534-07:00</updated><title type='text'>ChunkFS -- minimizing fsck time</title><content type='html'>I ran across &lt;a href="http://www.fenrus.org/chunkfs.txt"&gt;chunkfs&lt;/a&gt; this week. (&lt;a href="http://infohost.nmt.edu/~val/chunkfs/"&gt;Another link&lt;/a&gt; and &lt;a href="http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf"&gt;the paper&lt;/a&gt;.) Actually, I ran across a mention of it a month or so ago, but I only really looked at it this week, having been pointed there by &lt;a href="http://www.c0t0d0s0.org//archives/3102-Solution-or-just-a-kludge.html"&gt;Joerg's blog&lt;/a&gt;.&lt;br/&gt;&lt;br/&gt;

The point of chunkfs is to minimize fsck time. Having spent a sleepless night waiting for a 2TB ext3 file system to fsck, I can understand the desire to minimize this. As the paper points out, the trend in storage growth means that fsck's of normal filesystems will soon take on the order of days or tens of days. This quote from the paper is particularly interesting: &lt;i&gt;The fraction of time data is unavailable while running fsck may asymptotically approach unity in the absence of basic architectural changes.&lt;/i&gt;&lt;br/&gt;&lt;br/&gt;

The approach of chunkfs is to break up the file system into multiple mostly-independent fault domains, the intent being to limit the need for fsck to just one of these chunks (or possibly a small number of them.) The idea is similar in concept to having many small file systems rather than one large one, but it hides the details (and administrative hassle) from the user, so that the illusion is of a single large file system.&lt;br/&gt;&lt;br/&gt;

From &lt;a href="http://www.linuxworld.com/news/2007/050107-kernel.html?fsrc=rss-linux-news"&gt;this page&lt;/a&gt; we get this bit of information that isn't mentioned in the paper:
&lt;blockquote&gt;Chunkfs makes these numbers unique by putting the chunk number in the upper eight bits of every inode number. As a result, there is a maximum of 256 chunks in any chunkfs filesystem.&lt;/blockquote&gt; (I haven't actually verified this for myself, so take this with a grain of salt.) Given this, that one-week fsck is reduced to less than an hour, assuming that only a single chunk needs to be checked. That 256x speed-up in fsck time is certainly impressive.&lt;br/&gt;&lt;br/&gt;
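
The arithmetic behind those two numbers is simple enough to sketch.  This assumes a 32-bit inode number (the quote only says "upper eight bits", so the width is my assumption) and an fsck time that divides evenly across chunks; the function names are made up for illustration.&lt;br/&gt;
&lt;pre&gt;
```shell
# Hypothetical helper: number of chunks addressable with $1 bits.
max_chunks() {
    n=1
    i=0
    while [ "$i" -lt "$1" ]; do
        n=$(( n * 2 ))
        i=$(( i + 1 ))
    done
    echo "$n"
}

# Hypothetical helper: chunk number of an inode, assuming a 32-bit
# inode number whose upper 8 bits hold the chunk (2^24 = 16777216).
chunk_of() {
    echo $(( $1 / 16777216 ))
}

# One week of fsck, in minutes, spread evenly across all chunks.
week_minutes=$(( 7 * 24 * 60 ))
per_chunk_minutes=$(( week_minutes / $(max_chunks 8) ))
echo "chunks: $(max_chunks 8)"
echo "minutes per chunk: $per_chunk_minutes"
```
&lt;/pre&gt;&lt;br/&gt;
With integer division that comes out to 39 minutes per chunk, which is where "less than an hour" comes from.&lt;br/&gt;&lt;br/&gt;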

The authors also suggest the possibility of doing on-line fsck's. The feasibility of this suggestion is based on observations that metadata updates tend to be localized, with the prediction that chunks would spend a majority of their time idle with respect to metadata updates. On-line fsck's would eliminate the need for file system down time and could possibly prevent events like server reboots due to file system corruption (i.e., problems could be found in a controlled fashion.)&lt;br/&gt;&lt;br/&gt;

I would argue, however, that any effort to minimize fsck time is misguided. I'm of the opinion that the time would be better spent working on ways to eliminate the need for fsck.&lt;br/&gt;&lt;br/&gt;

Actually, let me step back from that statement for a second and extrapolate from the authors' work. They suggest a divide-and-conquer approach to create separate fault domains. These chunks would generally consist of some subset of the files in the file system. We could imagine growing the storage, and/or shrinking the chunk size, to such a degree that chunks are subsets of files, possibly even individual blocks. In this case, consistency checks would be done at the sub-file level. Given that the file system could support on-line fsck's, we could have an almost constant amount of background consistency checking of parts of file systems.&lt;br/&gt;&lt;br/&gt;

Taken to this extreme, chunkfs starts to look a lot like ZFS. ZFS is constantly verifying the consistency of metadata in its file systems. Admittedly, it's only doing so for the metadata for the files that are actually in use, and it's only doing so &lt;i&gt;as&lt;/i&gt; that metadata is being read or written (assuming a scrub is not running), but one could argue that that's a benefit, as it's not wasting cycles checking data that's not in use. (There's the added benefit that ZFS is also checksumming the data, thus giving a correctness guarantee that chunkfs doesn't provide.)&lt;br/&gt;&lt;br/&gt;

I won't go so far as to argue that the ZFS implementation is the only possible way to eliminate the need for fsck, but I &lt;i&gt;would&lt;/i&gt; argue that they've done the right thing in doing so.&lt;br/&gt;&lt;br/&gt;

Here's actually a pretty interesting quote with respect to this, ironically enough from &lt;a href="http://blogs.sun.com/val/entry/zfs_faqs_freenix_cfp_new"&gt;one of the authors&lt;/a&gt; of chunkfs:
&lt;blockquote&gt;The on-disk state is always valid. No more fsck, no need to replay a journal. We use a copy-on-write, transactional update system to correctly and safely update data on disk. There is no window where the on-disk state can be corrupted by a power cycle or system panic. This means that you are much less likely to lose data than on a file system that uses fsck or journal replay to repair on-disk data corruption after a crash. Yes, supposedly fsck-free file systems already exist - but then explain to me all the time I've spent waiting for fsck to finish on ext3 or logging UFS - and the resulting file system corruption.&lt;/blockquote&gt;

And there's another interesting note from the paper:
&lt;blockquote&gt;Those inclined to dismiss file system bugs as a significant source of file system corruption are invited to consider the &lt;a href="http://oss.sgi.com/projects/xfs/faq.html#dir2"&gt;recent XFS bug in Linux 2.6.17&lt;/a&gt; requiring repair via the XFS file system repair program.&lt;/blockquote&gt;
I had originally intended to point out that a filesystem repair program is another possible entry point for bugs, but this FAQ entry proved my point for me:
&lt;blockquote&gt;To add insult to injury, xfs_repair(8) is currently not correcting these directories on detection of this corrupt state either. This xfs_repair issue is actively being worked on, and a fixed version will be available shortly.&lt;/blockquote&gt;
Both of these problems have since been fixed, apparently, but this does give strength to my point. The added complexity of a file system repair program should be taken into consideration in this argument. Of course, one could argue that there is very little added complexity because the repair program shares much of the same code with the file system itself. This means, of course, that it shares the same bugs. On the other hand, one could maintain a separate code base for the repair program, but that is a considerable amount of added complexity.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-4206489316675285560?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/4206489316675285560/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=4206489316675285560' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/4206489316675285560'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/4206489316675285560'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2007/05/chunkfs-minimizing-fsck-time.html' title='ChunkFS -- minimizing fsck time'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-8866225048080140947</id><published>2007-04-24T16:40:00.000-07:00</published><updated>2007-04-24T13:37:59.174-07:00</updated><title type='text'>rootdisk in 
Jumpstart</title><content type='html'>A couple of times in the past, I've spent some time tracking down exactly what scripts/programs are called during Jumpstart and where certain bits of information are determined, but that information is scattered around on bits of paper, in email, and in files that I've lost track of.&lt;br/&gt;&lt;br/&gt;

Right now, what I'm looking for is how Jumpstart determines what disk to use when you specify "rootdisk" in a profile.  For example, if you have the following:&lt;br/&gt;
&lt;pre&gt;
partitioning    explicit
filesys         rootdisk.s0      8192   /               logging
filesys         rootdisk.s1      8192   /var            logging
[ ... ]
&lt;/pre&gt;&lt;br/&gt;

Jumpstart will pick a disk to use as the root disk and then use that choice consistently through the rest of the process (via $SI_ROOTDISK.)  (Or at least it's supposed to do so consistently.  I just ran across &lt;a href="http://bugs.opensolaris.org/view_bug.do?bug_id=4892560"&gt;this&lt;/a&gt;.  I've never seen the problem, but apparently someone else has.)&lt;br/&gt;&lt;br/&gt;

The code for determining this is in the chkprobe script (e.g., /export/jumpstart/5.10-sparc-6_06/Solaris_10/Tools/Boot/usr/sbin/install.d/chkprobe.)  This is what should happen in most cases:&lt;br/&gt;
&lt;pre&gt;
        if [ -z "${SI_ROOTDISK}" ] ; then
                for i in /dev/dsk/*s0 ; do
                    SI_CDDEVICE=`basename ${i}`
                    findcd -t ${SI_CDDEVICE} &gt; /dev/null 2&gt;&amp;1
                    if [ $? != 0 ] ; then
                        SI_ROOTDISK=${SI_CDDEVICE}
                        break;
                    fi
                done
        fi
&lt;/pre&gt;&lt;br/&gt;
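
To see which device that loop would pick, here's a minimal sketch that runs the same kind of glob against a scratch directory of made-up device names.  (The names are hypothetical, and the findcd check is left out, so this only demonstrates the ordering.)&lt;br/&gt;
&lt;pre&gt;
```shell
# Build a scratch directory with fake slice names, created out of order.
demo=$(mktemp -d)
for d in c1t0d0s0 c0t1d0s0 c0t0d1s0 c0t0d0s0 c0t0d0s1; do
    touch "$demo/$d"
done

# Same shape as the chkprobe loop: take the first match of the glob.
first=""
for i in "$demo"/*s0; do
    first=$(basename "$i")
    break
done
echo "$first"

# Clean up the scratch directory.
rm -r "$demo"
```
&lt;/pre&gt;&lt;br/&gt;
One caveat: the glob sorts names as strings rather than as numbers, so on a machine with ten or more controllers, c10t0d0s0 would sort ahead of c2t0d0s0.&lt;br/&gt;&lt;br/&gt;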

Given the ordering presented by the glob, this essentially means that the lowest-numbered disk on the lowest-numbered target on the lowest-numbered controller will be used as the root disk.  What's interesting is this bit of code before the above section:&lt;br/&gt;
&lt;pre&gt;
        # see if there is a device c0t3d0s0
        if [ -z "${SI_ROOTDISK}" -a -b /dev/dsk/c0t3d0s0 ] ; then
            findcd -t c0t3d0s0 &gt; /dev/null 2&gt;&amp;1
            if [ $? != 0 ] ; then
                SI_ROOTDISK=c0t3d0s0
            fi
        fi
&lt;/pre&gt;&lt;br/&gt;

So we have a special case for c0t3d0s0.  I assume this was necessary back when Sun hardware would present SCSI target 0 as sd3 (for SunOS 4) or c0t3d0s0 (for Solaris.)  I'd actually forgotten about that little quirk until I ran across this snippet of code.&lt;br/&gt;&lt;br/&gt;


Copyright notice from the above-mentioned chkprobe script:&lt;br/&gt;
&lt;pre&gt;
# Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
# Use is subject to license terms.
&lt;/pre&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-8866225048080140947?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/8866225048080140947/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=8866225048080140947' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/8866225048080140947'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/8866225048080140947'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2007/04/rootdisk-in-jumpstart.html' title='rootdisk in Jumpstart'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-1280585989655591016</id><published>2007-04-14T11:44:00.000-07:00</published><updated>2007-04-14T12:18:49.844-07:00</updated><title type='text'>Undocumented feature in devfsadm</title><content type='html'>I was looking at the source code for &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/devfsadm/"&gt;devfsadm&lt;/a&gt; when I ran across something interesting in &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/devfsadm/devfsadm.c#961"&gt;
devfsadm.c&lt;/a&gt;:&lt;br/&gt;
&lt;pre&gt;
vprint(CHATTY_MID, "walking device tree\n");
&lt;/pre&gt;&lt;br/&gt;
I checked out &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/devfsadm/devfsadm.h#98"&gt;
devfsadm.h&lt;/a&gt; and found this:&lt;br/&gt;
&lt;pre&gt;
#define CHATTY_MID  "chatty"  /* prints with -V chatty */
&lt;/pre&gt;&lt;br/&gt;
I tried the "-V chatty" option and saw something a little more informative than a simple "failed to attach" message:&lt;br/&gt;
&lt;pre&gt;
devfsadm[5355]: chatty: process_devinfo_tree: enter
devfsadm[5355]: chatty: lock_dev(): entered
devfsadm[5355]: chatty: mkdirp(/dev, 0x1ed)
devfsadm[5355]: chatty: process_devinfo_tree: attaching driver (rtls)
devfsadm[5355]: chatty: attempting pre-cleanup
devfsadm[5355]: chatty: devi_tree_walk: root=/, minor=&lt;NULL&gt;, driver=rtls, error=0, flags=68
devfsadm: driver failed to attach: rtls
exit status = 1
&lt;/pre&gt;&lt;br/&gt;

And of course, my first thought was, "Hey, a good starting point for DTrace!"  Unfortunately, devfsadm is distributed as a stripped binary, so the pid provider has no symbol to match against:&lt;br/&gt;
&lt;pre&gt;
# ./devfsadm.d -c "devfsadm -i rtls"
dtrace: failed to compile script ./devfsadm.d: line 5: probe description pid5937::devi_tree_walk:entry does not match any probes
# file `which devfsadm`
/usr/sbin/devfsadm: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), stripped
#
&lt;/pre&gt;&lt;br/&gt;

Guh.  Not that it's the end of the world, but guh.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-1280585989655591016?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/1280585989655591016/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=1280585989655591016' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/1280585989655591016'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/1280585989655591016'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2007/04/undocumented-feature-in-devfsadm.html' title='Undocumented feature in devfsadm'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-5551144250239688038</id><published>2007-04-11T12:04:00.000-07:00</published><updated>2007-04-11T12:46:18.981-07:00</updated><title type='text'>memmove(), memcpy() and DTrace</title><content type='html'>I was trying to use DTrace to watch the flow of control within a specific function in an application, and I saw some weird behavior.  Specifically, what I was seeing was this:&lt;br/&gt;
&lt;pre&gt;
  4        -&gt; memmove
  4        &lt;- memmove
  4      &lt;- memcpy
  4      -&gt; memmove
  4      &lt;- memmove
  4    &lt;- memcpy
&lt;/pre&gt;&lt;br/&gt;

After this, the indenting was out of alignment, tending more towards the left side of the screen than it should.  Very annoying, but I was curious to see why I always seemed to be returning from both memmove() and memcpy() when I should only have been returning from memmove().  (I wasn't seeing any entry points to memcpy() to match the returns.)&lt;br/&gt;&lt;br/&gt;

I went looking through the code at opensolaris.org, and &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/amd64/gen/memcpy.s#85"&gt;here's what I saw&lt;/a&gt;:&lt;br/&gt;
&lt;pre&gt;
        ENTRY(memmove)          /* (void *s1, void *s2, size_t n) */
        cmpq    %rsi,%rdi       / if (source addr &gt; dest addr)
        leaq    -1(%rsi,%rdx),%r9
        jle     .CopyRight      /
        cmpq    %r9,%rdi
        jle     .CopyLeft
        jmp     .CopyRight

        ENTRY(memcpy)           /* (void *, const void*, size_t) */

.CopyRight:
LABEL(1try):
        cmp     $16, %rdx
        mov     %rdi, %rax
        jae     LABEL(1after)

        .p2align 4
&lt;/pre&gt;&lt;br/&gt;

Hmm, so the two functions share the same code, or at least part of it.  (CopyLeft and CopyRight copy in different directions.  memmove() has to handle overlapping segments differently so that it doesn't overwrite memory that it hasn't moved yet, thus CopyLeft.  memcpy() doesn't have this restriction, so it uses the simpler CopyRight.)&lt;br/&gt;&lt;br/&gt;
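To make the direction issue concrete, here's a minimal sketch in Python (not the libc implementation).  The names copy_right, copy_left and memmove_sketch are my own, and the overlap test is just my reading of the comparisons at the top of memmove() above:&lt;br/&gt;
&lt;pre&gt;
```python
def copy_right(buf, dst, src, n):
    """Copy n bytes from low to high addresses (the .CopyRight direction)."""
    for i in range(n):
        buf[dst + i] = buf[src + i]


def copy_left(buf, dst, src, n):
    """Copy n bytes from high to low addresses (the .CopyLeft direction)."""
    for i in reversed(range(n)):
        buf[dst + i] = buf[src + i]


def memmove_sketch(buf, dst, src, n):
    """Pick a copy direction the way the assembly above appears to."""
    # Copy backward only when dst starts inside the source region,
    # i.e. src is below dst and dst is no higher than src + n - 1.
    if (dst - src) in range(1, n):
        copy_left(buf, dst, src, n)
    else:
        copy_right(buf, dst, src, n)


naive = bytearray(b"abcdef")
copy_right(naive, 2, 0, 4)      # overlapping regions, wrong direction
safe = bytearray(b"abcdef")
memmove_sketch(safe, 2, 0, 4)   # overlapping regions, direction chosen first
print(naive, safe)
```
&lt;/pre&gt;&lt;br/&gt;
With the wrong direction, the overlapping copy smears the first two bytes across the buffer (ababab); choosing the direction first gives the expected ababcd.&lt;br/&gt;&lt;br/&gt;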

Given that these two share the same code in this fashion, they must necessarily share the same exit points, at least for code paths going through CopyRight.  And that's why I was seeing two return probes firing, one for memmove() and another for memcpy().  In specifying all function entry and exit points, I'm causing DTrace to instrument both memmove() and memcpy(), and DTrace is doing this in such a way that there are two probes at this same point.  When the executing process hits that point, it fires both probes, and I see a return from two different functions.&lt;br/&gt;&lt;br/&gt;
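The drift to the left is easy to reproduce with a toy model of flowindent.  This is a sketch under my own assumptions (indent two columns after each entry, outdent two columns before each return), not DTrace's actual implementation, and flow_depths is a made-up name:&lt;br/&gt;
&lt;pre&gt;
```python
def flow_depths(events, start=8):
    """Return (depth, kind, name) for each probe firing, flowindent-style."""
    depth = start
    rows = []
    for kind, name in events:
        if kind == "entry":
            rows.append((depth, kind, name))
            depth += 2          # output from the callee is indented further
        else:
            depth -= 2          # every return undoes one level of indent
            rows.append((depth, kind, name))
    return rows


# One real call to memmove() each time, but two return probes fire
# at the shared exit point.
events = [
    ("entry", "memmove"), ("return", "memmove"), ("return", "memcpy"),
    ("entry", "memmove"), ("return", "memmove"), ("return", "memcpy"),
]
for depth, kind, name in flow_depths(events):
    print(" " * depth + kind + " " + name)
```
&lt;/pre&gt;&lt;br/&gt;
The depths come out as 8, 8, 6, 6, 6, 4, which is the same leftward creep as in the output at the top of this post: each unmatched memcpy() return costs two columns that never come back.&lt;br/&gt;&lt;br/&gt;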

For the curious, here's the DTrace script I was running.&lt;br/&gt;
&lt;pre&gt;
#!/usr/sbin/dtrace -s

#pragma D option flowindent

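/* Start tracing when the thread enters the function of interest. */
pid$target:dataserver:tdsrecv_language:entry
{
        self-&gt;traceme = 1;
}

/*
 * Empty clause: with flowindent set, each matching entry and return
 * probe prints one indented line while the thread is being traced.
 */
pid$target:::entry,
pid$target:::return
/self-&gt;traceme/
{
}

/* Stop after the first traced call returns. */
pid$target:dataserver:tdsrecv_language:return
{
        self-&gt;traceme = 0;
        exit(0);
}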
&lt;/pre&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-5551144250239688038?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/5551144250239688038/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=5551144250239688038' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/5551144250239688038'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/5551144250239688038'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2007/04/memmove-memcpy-and-dtrace.html' title='memmove(), memcpy() and DTrace'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-6118490981523020975</id><published>2007-03-30T18:08:00.000-07:00</published><updated>2007-03-30T18:38:20.316-07:00</updated><title type='text'>Bootable ZFS</title><content type='html'>Woohoo!  We can now do a &lt;a href="http://opensolaris.org/os/community/on/flag-days/pages/2007032801/"&gt;native boot from ZFS&lt;/a&gt;, at least for x86 servers.  Live Upgrade doesn't support this yet, and since I use Live Upgrade on both my home system and my laptop, I thought I'd wait a while before giving this a go.  But Live Upgrade with ZFS will mostly just involve taking a snapshot and cloning it, so one could hack something together to get similar functionality.  
Here is someone's &lt;a href="http://blogs.sun.com/timh/entry/friday_fun_with_bfu_and"&gt;experience&lt;/a&gt; of having done just that.  (Note that this involves using BFU and not doing an OS upgrade like you'd do with Live Upgrade, but it's still pretty interesting.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-6118490981523020975?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/6118490981523020975/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=6118490981523020975' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/6118490981523020975'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/6118490981523020975'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2007/03/bootable-zfs.html' title='Bootable ZFS'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-6798678277104770203</id><published>2007-03-07T08:09:00.001-08:00</published><updated>2007-03-07T09:05:09.807-08:00</updated><title type='text'>TCP/IP checksum offload in Solaris</title><content type='html'>I ran across this interesting problem the other day.  There are two snippets of snoop output below, the first from the client side and the second from the server side.  It's the same packet as seen from both sides (part of a connection to the echo  port.)  
The problem is that the IP and TCP checksums don't match.&lt;br/&gt;

&lt;pre&gt;ETHER:  ----- Ether Header -----
ETHER:
ETHER:  Packet 1 arrived at 10:53:46.84469
ETHER:  Packet size = 66 bytes
ETHER:  Destination = 0:14:4f:46:17:32,
ETHER:  Source      = 0:14:4f:3f:e6:e8,
ETHER:  Ethertype = 0800 (IP)
ETHER:
IP:   ----- IP Header -----
IP:
[ ... ] 
IP:   Header checksum = 0000
[ ... ]
TCP:  ----- TCP Header -----
TCP:
TCP:  Source port = 34783
TCP:  Destination port = 7 (ECHO)
TCP:  Sequence number = 2814225515
[ ... ]
TCP:  Checksum = 0x24d7
[ ... ]
ECHO:  ----- ECHO:   -----
ECHO:
ECHO:  ""
ECHO:


           0: 0014 4f46 1732 0014 4f3f e6e8 0800 4500    ..OF.2..O?.&lt;E8&gt;..E.
          16: 0034 2680 4000 4006 0000 0a40 0819 0a40    .4&amp;.@.@....@...@
          32: 0818 87df 0007 a7bd ac6b 0000 0000 8002    .........k......
          48: c1e8 24d7 0000 0204 05b4 0103 0300 0101    .&lt;E8&gt;$.............
          64: 0402                                       ..

&lt;/pre&gt;&lt;br/&gt;

&lt;pre&gt;ETHER:  ----- Ether Header -----
ETHER:
ETHER:  Packet 1 arrived at 10:53:46.84548
ETHER:  Packet size = 70 bytes
ETHER:  Destination = 0:14:4f:46:17:32,
ETHER:  Source      = 0:14:4f:3f:e6:e8,
ETHER:  Ethertype = 0800 (IP)
ETHER:
IP:   ----- IP Header -----
IP:
[ ... ]
IP:   Header checksum = ef93
[ ... ]
TCP:  ----- TCP Header -----
TCP:
TCP:  Source port = 34783
TCP:  Destination port = 7 (ECHO)
TCP:  Sequence number = 2814225515
[ ... ]
TCP:  Checksum = 0xac6f
[ ... ]
ECHO:  ----- ECHO:   -----
ECHO:
ECHO:  ""
ECHO:


           0: 0014 4f46 1732 0014 4f3f e6e8 0800 4500    ..OF.2..O?.&lt;E8&gt;..E.
          16: 0034 2680 4000 4006 ef93 0a40 0819 0a40    .4&amp;.@.@.&lt;EF&gt;..@...@
          32: 0818 87df 0007 a7bd ac6b 0000 0000 8002    .........k......
          48: c1e8 ac6f 0000 0204 05b4 0103 0300 0101    .&lt;E8&gt;.o............
          64: 0402 b42f ebf7                             .../&lt;EB&gt;.

&lt;/pre&gt;

I had sent a very large client-side packet dump to a vendor as part of debugging a problem we were seeing.  The engineer on the other side pointed out to me that all of the client-side checksums were wrong, so I played with it a bit to see if he was right.  And apparently he was.&lt;br/&gt;
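The header checksum is easy enough to check by hand.  Here's a minimal Python sketch (the function name is my own, and it assumes a 20-byte IP header with no options) that runs the standard Internet checksum over the header bytes from the client-side dump:&lt;br/&gt;
&lt;pre&gt;
```python
def ip_header_checksum(header):
    """Internet checksum over an IPv4 header whose checksum field is zero."""
    total = 0
    for i in range(0, len(header), 2):
        total += int.from_bytes(header[i:i + 2], "big")
    while total != total % 0x10000:
        # Fold any carry out of the low 16 bits back in.
        total = total % 0x10000 + total // 0x10000
    return 0xFFFF - total


# Bytes 14-33 of the client-side dump above: the IP header as the
# sending host's snoop saw it, with 0000 in the checksum field.
client_header = bytes.fromhex(
    "4500 0034 2680 4000 4006 0000 0a40 0819 0a40 0818"
)
print(format(ip_header_checksum(client_header), "04x"))
```
&lt;/pre&gt;&lt;br/&gt;
That prints ef93, which is the value in the server-side capture, so the packet on the wire carried a valid header checksum even though the client-side capture shows 0000.&lt;br/&gt;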

After a little bit of digging, I figured out what the problem was.  (It was something I'd known before, but I hadn't given it enough attention to realize that this is what I was seeing.)  The checksum calculation has been offloaded to the network adapter (for those adapters that support it.)  This has been in Solaris for a while, but there have been improvements in the checksum offload in Solaris 10.  There's some interesting information about it (and other improvements) &lt;a href="http://blogs.sun.com/sunay/entry/solaris_networking_the_magic_revealed"&gt;here&lt;/a&gt;.&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-6798678277104770203?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/6798678277104770203/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=6798678277104770203' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/6798678277104770203'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/6798678277104770203'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2007/03/tcpip-checksum-offload-in-solaris.html' title='TCP/IP checksum offload in Solaris'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-117105726377449577</id><published>2007-02-09T13:39:00.000-08:00</published><updated>2007-02-13T13:20:28.476-08:00</updated><title 
type='text'>Cisco VPN client on Windows under QEMU on Solaris</title><content type='html'>Okay, nothing terribly exciting to see here.  It's another addition to the who-knows-how-many blog entries out there about how someone managed to get some OS working in a virtual machine under another OS.&lt;br/&gt;&lt;br/&gt;

So I got Windows working under QEMU under Solaris x86 on my laptop.  The reason for doing this was so that I would be able to VPN into work without actually having to boot into Windows.&lt;br/&gt;&lt;br/&gt;

(BTW, I'm using the SUNWqemu package available &lt;a href="http://opensolaris.org/os/project/qemu/downloads/"&gt;here&lt;/a&gt;.)&lt;br/&gt;&lt;br/&gt;

And here's the obligatory screenshot.  Just to be cute, I took a recursive screenshot:&lt;br/&gt;&lt;br/&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://photos1.blogger.com/x/blogger/2408/2809/1600/740145/Screenshot.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://photos1.blogger.com/x/blogger/2408/2809/320/750752/Screenshot.png" border="0" alt="" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-117105726377449577?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/117105726377449577/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=117105726377449577' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/117105726377449577'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/117105726377449577'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2007/02/cisco-vpn-client-on-windows-under-qemu.html' title='Cisco VPN client on Windows under QEMU on Solaris'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-117070728381743131</id><published>2007-02-05T12:22:00.000-08:00</published><updated>2007-02-05T12:35:45.423-08:00</updated><title type='text'>Snoop and NFS filehandles</title><content type='html'>Okay, so this is a non-original-content blog entry, but I found it interesting and wanted to link to it.  
&lt;a href="http://blogs.sun.com/peteh/entry/understanding_snoop_1m_nfsv3_file"&gt;Here is a blog entry&lt;/a&gt; about associating the NFS (v3) file handle you see in snoop output to the referenced file on the NFS server.  It's another one of those things that's obvious after the fact:  "Hmm, an NFS filehandle encodes certain information so that the NFS server can uniquely identify a file, so there must be some way to uniquely identify a file on an NFS server given the filehandle."  But I still think it's interesting to see it in practice.&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-117070728381743131?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/117070728381743131/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=117070728381743131' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/117070728381743131'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/117070728381743131'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2007/02/snoop-and-nfs-filehandles.html' title='Snoop and NFS filehandles'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-116904729058504142</id><published>2007-01-25T14:45:00.000-08:00</published><updated>2007-01-25T11:43:27.390-08:00</updated><title type='text'>Microstate accounting in Solaris 10 and CPU 
latency</title><content type='html'>Solaris 10 included some pretty major features such as ZFS, Zones and DTrace.  There were also other features that weren't quite as major but that are pretty nifty in and of themselves.  One of these features is that microstate accounting is turned on by default in Solaris 10.  (It was optional in earlier releases of Solaris.)
Eric Schrock blogged about microstate accounting &lt;a href="http://blogs.sun.com/eschrock/date/20041013"&gt;here&lt;/a&gt;.&lt;br/&gt;&lt;br/&gt;

Using 'prstat -m' you can see some of this information.  For example:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
  9559 mynhier   27 3.9 0.0 0.0 0.0 0.0 0.0  69   0 186  3K   0 ube/1
  9562 mynhier   18 6.6 0.0 0.0 0.0 0.0 0.0  76   0  92  7K   0 ube/1
  9555 mynhier   21 2.7 0.0 0.0 0.0 0.0 0.0  76   0 143  2K   0 cc1/1
  9564 mynhier   14 8.8 0.0 0.0 0.0 0.0 0.0  77   0 124 13K   0 ube/1
  9551 mynhier   18 2.2 0.0 0.0 0.0 0.0 0.0  80   0 123  2K   0 cc1/1
  9549 mynhier   17 2.1 0.0 0.0 0.0 0.0 0.0  81   0 117  2K   0 cc1/1
  9560 mynhier   16 2.6 0.0 0.0 0.0 0.0 0.0  82   0 350  6K   0 iropt/1
  9507 mynhier  5.9 1.6 0.0 0.0 0.0 0.0  59  34  14  31  2K   0 dmake/1
  9508 mynhier  3.4 1.0 0.0 0.0 0.0 0.0  75  21  14  17  1K   0 dmake/1
  9489 mynhier  2.9 0.7 0.0 0.0 0.0 0.0  96 0.8  15  13  1K   0 dmake/1
  9490 mynhier  2.9 0.7 0.0 0.0 0.0 0.0  96 0.8  15  13  1K   0 dmake/1
   479 mynhier  0.7 0.0 0.0 0.0 0.0 0.0  98 1.7  28   0 113  13 Xorg/1
  9543 mynhier  0.1 0.4 0.0 0.0 0.0 0.0  67  32   7   1 215   0 cc/1
  9536 mynhier  0.1 0.4 0.0 0.0 0.0 0.0  79  20   6   3 207   0 cc/1
  9527 mynhier  0.1 0.4 0.0 0.0 0.0 0.0  95 4.4   6   5 203   0 cc/1
  9554 mynhier  0.1 0.2 0.0 0.0 0.0 0.0  86  14   2   0 221   0 gcc/1
  9550 mynhier  0.1 0.2 0.0 0.0 0.0 0.0  85  15   2   0 180   0 gcc/1
  9548 mynhier  0.1 0.2 0.0 0.0 0.0 0.0  84  16   1   3 163   0 gcc/1
  9546 mynhier  0.1 0.2 0.0 0.0 0.0 0.0  87  12   3   0 200   0 cc/1
  9544 mynhier  0.1 0.2 0.0 0.0 0.0 0.0  83  16   1   1 239   0 sh/1
Total: 111 processes, 238 lwps, load averages: 5.46, 5.79, 5.68
&lt;/pre&gt;&lt;br/&gt;

The additional columns (TRP through LAT) indicate what percentage of time the process spent handling traps, text page faults, data page faults, waiting for user locks, sleeping, or waiting for CPU.&lt;br/&gt;&lt;br/&gt;
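If you want to pick the worst offenders out of output like this, something along these lines would do.  It's a rough sketch that assumes the fifteen-column layout shown above; top_latency is a hypothetical helper of my own, not part of prstat:&lt;br/&gt;
&lt;pre&gt;
```python
def top_latency(prstat_lines, count=3):
    """Return (LAT, PID, process) tuples for the rows with the highest LAT."""
    rows = []
    for line in prstat_lines:
        fields = line.split()
        # Skip the header, the Total: summary and anything malformed.
        if len(fields) != 15 or not fields[0].isdigit():
            continue
        # Columns: PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT ...
        rows.append((float(fields[9]), int(fields[0]), fields[14]))
    return sorted(rows, reverse=True)[:count]


sample = """\
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
  9559 mynhier   27 3.9 0.0 0.0 0.0 0.0 0.0  69   0 186  3K   0 ube/1
  9560 mynhier   16 2.6 0.0 0.0 0.0 0.0 0.0  82   0 350  6K   0 iropt/1
  9507 mynhier  5.9 1.6 0.0 0.0 0.0 0.0  59  34  14  31  2K   0 dmake/1
   479 mynhier  0.7 0.0 0.0 0.0 0.0 0.0  98 1.7  28   0 113  13 Xorg/1
Total: 111 processes, 238 lwps, load averages: 5.46, 5.79, 5.68
""".splitlines()

for lat, pid, name in top_latency(sample):
    print(pid, name, lat)
```
&lt;/pre&gt;&lt;br/&gt;
On the four sample rows this lists iropt, ube and dmake, in that order, with LAT values of 82, 69 and 34.&lt;br/&gt;&lt;br/&gt;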

Given that this is only a two-processor machine, it's fairly obvious from the load average reported above that the machine is overloaded, but the load average doesn't give quite as detailed a picture of what's going on as does 'prstat -m' output.  In particular, the LAT column (CPU latency) is pretty instructive.  All of the processes in the above example are spending some percentage of their time fully able to run but waiting to be switched onto a CPU.  You could have guessed this from the load average, but load average couldn't have told you that some of these processes were spending over 80% of their time sitting in a run queue.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-116904729058504142?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/116904729058504142/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=116904729058504142' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116904729058504142'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116904729058504142'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2007/01/microstate-accounting-in-solaris-10.html' title='Microstate accounting in Solaris 10 and CPU latency'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-116932987425355006</id><published>2007-01-20T13:46:00.000-08:00</published><updated>2007-01-20T13:51:14.256-08:00</updated><title type='text'>It's the little things</title><content type='html'>Guh.  There's nothing quite as embarrassing as having made a simple formatting mistake in your blog and then having propagated it for a couple of months.&lt;br/&gt;&lt;br/&gt;

I just recently realized that this really only breaks how my blog appears in Firefox.  Given that I use Opera, Safari, and IE, and given that they all displayed what I wanted to see, I never realized the mistake.  So now I just republished two months' worth of entries to fix it.&lt;br/&gt;&lt;br/&gt;

Guh&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-116932987425355006?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/116932987425355006/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=116932987425355006' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116932987425355006'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116932987425355006'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2007/01/its-little-things.html' title='It&apos;s the little things'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-116826570039175698</id><published>2007-01-08T05:40:00.000-08:00</published><updated>2007-01-20T13:16:29.180-08:00</updated><title type='text'>ZFS and committing data to stable storage</title><content type='html'>I ran across a couple of blog entries that might be of interest to anyone using ZFS on an NFS server or using ZFS on a storage array with a battery-backed cache.&lt;br/&gt;&lt;br/&gt;

WRT NFS over ZFS, &lt;a href="http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine"&gt;this entry&lt;/a&gt; discusses why you might see worse performance with NFS over ZFS when compared to NFS over UFS (at least for a single-threaded load).  The article points out that it's actually an apples-to-oranges comparison, as ZFS implements the correct behavior WRT NFS semantics -- NFS COMMITs are guaranteed to be on stable storage (via the ZFS intent log (ZIL)), regardless of whether the disk cache is enabled.  With UFS (or other filesystems), NFS COMMITs are considered successful once they've hit the disk cache, which is decidedly not stable storage.&lt;br/&gt;&lt;br/&gt;

On a similar note (and linked from the above, also), &lt;a href="http://blogs.digitar.com/jjww/?itemid=44"&gt;this entry&lt;/a&gt; discusses ZFS performance problems with storage arrays using battery-backed cache.  In this case, the cache can be considered stable storage, but ZFS still forces a flush of the cache after every write to the ZIL.  The article discusses a couple of ways to handle this problem.  One of these is definitely a bad idea (disabling the ZIL), but the other has merit (instructing the storage array to ignore the flush commands from ZFS.)&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-116826570039175698?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/116826570039175698/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=116826570039175698' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116826570039175698'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116826570039175698'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2007/01/zfs-and-committing-data-to-stable.html' title='ZFS and committing data to stable storage'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-116793733641779787</id><published>2007-01-04T10:36:00.000-08:00</published><updated>2007-01-20T13:17:34.870-08:00</updated><title 
type='text'>Machine That Goes PING!</title><content type='html'>I was looking through the &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/cmd-inet/usr.sbin/ping/ping.c"&gt;source code&lt;/a&gt; for ping (the Solaris version), and I noticed this:&lt;br/&gt;
&lt;pre&gt;if (getenv("MACHINE_THAT_GOES_PING") != NULL)
        stats = _B_TRUE;
&lt;/pre&gt;&lt;br/&gt;
So if you want the "-s" functionality without always having to specify the "-s", set this environment variable.&lt;br/&gt;&lt;br/&gt;

This fix doesn't appear to have been integrated before Solaris 10 06/06 was set in stone, though, and I don't have access to an 11/06 or recent Solaris Express installation to see it in action.  It's not terribly Earth-shaking as code changes go, I just found it amusing.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-116793733641779787?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/116793733641779787/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=116793733641779787' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116793733641779787'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116793733641779787'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2007/01/machine-that-goes-ping.html' title='Machine That Goes PING!'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-116785940728719104</id><published>2007-01-03T12:17:00.000-08:00</published><updated>2007-01-20T13:18:26.043-08:00</updated><title type='text'>ZFS stores file creation time</title><content type='html'>I've probably run across this at some point and forgotten it, but I happened to notice it while looking at something else.  ZFS stores the file creation time (crtime) in addition to the traditional atime, mtime, and ctime:&lt;br/&gt;
&lt;pre&gt;juser@server&gt; sudo zdb -dddd trashme
[ ... ]
    Object  lvl   iblk   dblk  lsize  asize  type
        14    2    16K   128K  3.75M  3.75M  ZFS plain file
                                 264  bonus  ZFS znode
        path    /foo
        atime   Wed Jan  3 16:20:42 2007
        mtime   Wed Jan  3 16:20:42 2007
        ctime   Wed Jan  3 16:20:42 2007
        crtime  Wed Jan  3 15:26:53 2007
[ ... ]
&lt;/pre&gt;&lt;br/&gt;

And, of course, this can be seen in the &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/zfs_znode.h"&gt;source code&lt;/a&gt;:&lt;br/&gt;
&lt;pre&gt;/*
 * This is the persistent portion of the znode.  It is stored
 * in the "bonus buffer" of the file.  Short symbolic links
 * are also stored in the bonus buffer.
 */
typedef struct znode_phys {
        uint64_t zp_atime[2];   /*  0 - last file access time */
        uint64_t zp_mtime[2];   /* 16 - last file modification time */
        uint64_t zp_ctime[2];   /* 32 - last file change time */
        uint64_t zp_crtime[2];  /* 48 - creation time */
&lt;/pre&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-116785940728719104?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/116785940728719104/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=116785940728719104' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116785940728719104'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116785940728719104'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2007/01/zfs-stores-file-creation-time.html' title='ZFS stores file creation time'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-116666522061367119</id><published>2006-12-20T17:06:00.000-08:00</published><updated>2007-01-20T13:19:18.950-08:00</updated><title type='text'>Unkillable processes</title><content type='html'>One of the blogs I read religiously is Ben Rockwood's.  He has some interesting &lt;a href="http://www.cuddletech.com/blog/pivot/entry.php?id=780"&gt;anecdotes&lt;/a&gt; (that's http://www.cuddletech.com/blog/pivot/entry.php?id=780, in case you get the spam warning instead of the blog) about using OpenSolaris in production at Joyent, including one about an unkillable process.&lt;br/&gt;&lt;br/&gt;

I mailed the link to a couple of former colleagues, mostly because I thought they might be interested in the NFS-over-ZFS anecdote (given that they work at an ISP.)  Apparently I jinxed them -- just after getting in to work the next morning, they discovered an unkillable process on one of their Solaris 10 boxes.  And it was also a process running in a zone, so it was impossible to reboot the zone to clear it up.&lt;br/&gt;&lt;br/&gt;

Sorry, guys.&lt;br/&gt;&lt;br/&gt;

(BTW, this appeared to be a deadlock situation.  The process has two threads, one stuck in cv_wait() via exitlwps() and the other stuck in cv_wait() via tcp_close().  Given that I don't work there anymore, I couldn't really go crash-dump diving, but I'd bet that there were no other threads on the system that were going to call cv_signal() or cv_broadcast() on that particular CV.)&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-116666522061367119?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/116666522061367119/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=116666522061367119' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116666522061367119'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116666522061367119'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/12/unkillable-processes.html' title='Unkillable processes'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-116619440761861653</id><published>2006-12-15T06:40:00.000-08:00</published><updated>2007-01-20T13:19:57.390-08:00</updated><title type='text'>ZFS, DTrace, and fault domains</title><content type='html'>An interesting similarity between ZFS and DTrace occurred to me last night.  
One of the things ZFS gives you is the ability to sit outside the fault domain(s) between an application and its data on disk and catch any corruption that's introduced anywhere in that/those fault domain(s).  You can't rely on a RAID controller to catch corruption introduced between it and the server.&lt;br/&gt;&lt;br/&gt;

DTrace is similar in that it lets you look at different parts of the fault domain involved in running an application.  That is, it lets you look at what's going on in different parts of that fault domain -- the application, the libraries it uses, system calls it makes, and the function flow inside the kernel involved in implementing those system calls.  Other traditional instrumentation tools generally allow you to look at one part of that domain -- truss lets you watch the system call boundary, instrumented libraries or applications let you watch just that part of the fault domain, etc.&lt;br/&gt;&lt;br/&gt;

Or maybe I'm just stretching things a bit in making this comparison.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-116619440761861653?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/116619440761861653/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=116619440761861653' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116619440761861653'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116619440761861653'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/12/zfs-dtrace-and-fault-domains.html' title='ZFS, DTrace, and fault domains'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-116550304380458615</id><published>2006-12-07T09:50:00.000-08:00</published><updated>2007-01-20T13:21:13.746-08:00</updated><title type='text'>Problems with non-redundant ZFS</title><content type='html'>From watching the zfs-discuss mailing list, it seems that people are starting to experience problems with ZFS where it &lt;i&gt;appears&lt;/i&gt; that ZFS is causing them to lose data (or at least rendering their data inaccessible, which may as well be data loss.)&lt;br/&gt;&lt;br/&gt;

The issue is with using ZFS in a non-redundant fashion on top of HW
RAID where corruption is introduced by the hardware itself.  What
happens in these cases is that ZFS sees the corruption and stops
letting you access the filesystem, which effectively means that your
data is gone.  Technically, there's still a valid ZFS filesystem on
the disks, but because the hardware is introducing corruption, the ZFS
software won't let you access it.&lt;br/&gt;&lt;br/&gt;

The complaint people seem to have is that ZFS isn't allowing them to
salvage their data, where other filesystems would allow them access to
their data.  (Although what they don't seem to be considering is that
they shouldn't be trusting the data that the other filesystem is
giving them.)&lt;br/&gt;&lt;br/&gt;

From what I gather, these seem to be problems where letting ZFS handle
(at least some of) the redundancy would help, given that ZFS would
have the chance to apply its self-healing-data magic.  But given that
ZFS is being presented with a single LUN, it has nowhere to put
corrected copies of data.&lt;br/&gt;&lt;br/&gt;
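
To make the self-healing idea concrete, here's a toy sketch in Python.  It is not how ZFS is implemented: the function names are made up for illustration, SHA-256 merely stands in for whatever checksum the pool uses, and a "copy" is just a byte string in a list.  The point is only that a reader which verifies checksums can repair a bad copy when a second copy exists, and can do nothing but refuse when handed a single one.&lt;br/&gt;
&lt;pre&gt;
```python
import hashlib


def checksum(block):
    # Stand-in checksum (an assumption for this sketch, not the ZFS algorithm).
    return hashlib.sha256(block).hexdigest()


def read_block(copies, expected):
    """Return verified data, rewriting any bad copies; raise IOError if no copy verifies."""
    good = None
    for data in copies:
        if checksum(data) == expected:
            good = data
            break
    if good is None:
        # Nothing to heal from: refuse to hand back data known to be corrupt.
        raise IOError("checksum mismatch on every copy")
    for i in range(len(copies)):
        if checksum(copies[i]) != expected:
            copies[i] = good  # the "self-healing" step
    return good


original = b"important data"
expected = checksum(original)

mirror = [b"important dat\x00", original]  # one side of the mirror is corrupted
print(read_block(mirror, expected) == original, mirror[0] == original)

single = [b"important dat\x00"]  # one LUN, no redundancy visible to the reader
try:
    read_block(single, expected)
except IOError as err:
    print("single copy:", err)
```
&lt;/pre&gt;&lt;br/&gt;
With two copies the read succeeds and the bad side gets rewritten; with one copy the only honest answer is an error.&lt;br/&gt;&lt;br/&gt;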

(And yeah, if something in the hardware chain is corrupting all of the
data, ZFS redundancy won't help.  But in this case, you're not going
to be getting any data via any filesystem.)&lt;br/&gt;&lt;br/&gt;

There's a fairly lengthy thread about this &lt;a href="http://www.opensolaris.org/jive/thread.jspa?messageID=74995"&gt;here&lt;/a&gt;.&lt;br/&gt;&lt;br/&gt;

I recently saw a comparison of ZFS to TCP, in that both give you guarantees about the correctness of data being delivered to the application, and that comparison is pretty relevant to this problem.  If TCP sees corruption in a data stream, it will not deliver the data (and because TCP also guarantees in-order delivery, the data &lt;i&gt;after&lt;/i&gt; the corrupted data will not be delivered until the corruption is fixed.)  ZFS is similar -- it will not deliver data that is corrupt.&lt;br/&gt;&lt;br/&gt;

Both ZFS and TCP have mechanisms for correcting corrupted data.  In the case of TCP, the sender retransmits the data.  In the case of ZFS, the software grabs a good copy of the data from its redundant location on disk, either from a mirror or from the parity data stored via raidz.  But if the zpool has been created without redundancy, there is no good copy of the data to be found, the data is effectively lost.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-116550304380458615?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/116550304380458615/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=116550304380458615' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116550304380458615'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116550304380458615'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/12/problems-with-non-redundant-zfs.html' title='Problems with non-redundant ZFS'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-116535558876559673</id><published>2006-12-06T08:30:00.000-08:00</published><updated>2007-01-20T13:22:43.516-08:00</updated><title type='text'>How long has that thread been waiting</title><content type='html'>Okay, so this is actually pretty simple, but I'll write about it anyway.&lt;br/&gt;&lt;br/&gt;

I've been looking at a core file with a hung nfsd, and I wondered if I could figure out how long threads had been blocked on the RW lock in question.  The lock itself has no concept of when it was grabbed.  A reader/writer lock is implemented as a single-word data structure, where all but a few bits of that word are devoted either to the address of the thread holding the write lock or to the count of readers.  Reader/writer locks are described on page 836 of the Solaris Internals book and in the &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/rwlock_impl.h"&gt;header file&lt;/a&gt;.&lt;br/&gt;&lt;br/&gt;

Maybe some time information is kept in the turnstile itself?  While the amount of time a thread spends waiting on a resource is a useful statistic, there's no need to store this information in that data structure[1], as the information is available elsewhere.&lt;br/&gt;&lt;br/&gt;

The &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/thread.h"&gt;kthread_t data structure&lt;/a&gt; contains this:&lt;br/&gt;
&lt;pre&gt;
clock_t  t_disp_time; /* last time this thread was running */
&lt;/pre&gt;&lt;br/&gt;

Given that the last thing the thread did while it was running was to request a resource that it then had to sleep on, the difference between now (stored in lbolt) and t_disp_time tells us how long the thread has been sleeping.  So if we pick one of the threads waiting on a RW lock and investigate:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
&gt; ::turnstile ! awk '$3 != 0'
            ADDR             SOBJ  WTRS EPRI             ITOR          PRIOINV
ffffffff818ccac0 fffffe849958a150     5   60                0                0
fffffe93cd2335c8 ffffffff8875ccc0     5    4                0                0
fffffe93eedcc008 fffffe849958ab68     3   59                0                0
&gt; fffffe849958a150::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
fffffe849958a150 fffffe849fb74120  B111 fffffe89fdd5fe40 (W)
                                    ||| fffffe8692f2aae0 (W)
                 WRITE_LOCKED ------+|| ffffffff83127720 (W)
                 WRITE_WANTED -------+| ffffffff817f3540 (W)
                  HAS_WAITERS --------+ ffffffff89f79f20 (R)
&gt; fffffe89fdd5fe40::print kthread_t t_disp_time
t_disp_time = 0x549da3e9
&gt; lbolt/X
lbolt:
lbolt:          54b25d0e
&gt; hz/D
hz:
hz:             100
&gt;
&lt;/pre&gt;&lt;br/&gt;

Subtracting t_disp_time from lbolt and dividing by hz, we see that this thread had been waiting a little over 3h46m when I grabbed this core file.&lt;br/&gt;&lt;br/&gt;
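
For anyone who doesn't want to do hex subtraction in their head, the arithmetic can be sketched in a few lines of Python.  The function name is just something I made up for this example, and it assumes lbolt hasn't wrapped between the two readings.&lt;br/&gt;
&lt;pre&gt;
```python
def wait_time(lbolt, t_disp_time, hz):
    """Return (hours, minutes, seconds) elapsed between two tick counts."""
    seconds = (lbolt - t_disp_time) // hz  # ticks to whole seconds
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return hours, minutes, secs


# Values taken from the mdb session above; hz is 100 ticks per second.
print("%dh%02dm%02ds" % wait_time(0x54b25d0e, 0x549da3e9, 100))  # 3h46m21s
```
&lt;/pre&gt;&lt;br/&gt;
The difference is 0x14b925 ticks, or 1358117 in decimal, which at 100 ticks per second comes to 13581 seconds.&lt;br/&gt;&lt;br/&gt;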

[1] I had originally written, "...there's no need to store this information in the data structures associated with a turnstile...", but a kthread_t could be considered one of the data structures associated with a turnstile, as the pointers maintaining the sleep queue are kept in the &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/thread.h"&gt;kthread_t structure&lt;/a&gt; itself (see also section 3.10 of the Solaris Internals book):&lt;br/&gt;
&lt;pre&gt;
typedef struct _kthread {
 struct _kthread *t_link; /* dispq, sleepq, and free queue link */
[ ... ]
 struct _kthread *t_priforw; /* sleepq per-priority sublist */
 struct _kthread *t_priback;

 struct sleepq *t_sleepq; /* sleep queue thread is waiting on */
&lt;/pre&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-116535558876559673?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/116535558876559673/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=116535558876559673' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116535558876559673'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116535558876559673'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/12/how-long-has-that-thread-been-waiting.html' title='How long has that thread been waiting'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-116525174563678158</id><published>2006-12-04T13:30:00.000-08:00</published><updated>2007-01-20T13:24:05.340-08:00</updated><title type='text'>Fun with fsdb</title><content type='html'>If you've been around long enough, you'll eventually see something weird like the following:&lt;br/&gt;
&lt;pre&gt;juser@server:/fs&gt; ls -l
total 4
drwxr-xr-x   3 root  root   512 Dec  4 10:17 proj/
drwxr-xr-x   4 root  root   512 Nov 29 12:34 repl/
juser@server:/fs&gt; cd proj
juser@server:/fs/proj&gt; cd ..
..: Permission denied.
juser@server:/fs/proj&gt;
&lt;/pre&gt;&lt;br/&gt;

What's happened is that the filesystem mounted at /fs/proj is mounted on a directory with insufficient permissions.  But how can you see what the current permissions are?  If you do an 'ls -li', you'll note that the inode number of the underlying directory is listed:&lt;br/&gt;
&lt;pre&gt;
juser@server&gt; ls -li
total 4
     6 drwxr-xr-x   3 root  root  512 Dec  4 10:17 proj/
     7 drwxr-xr-x   4 root  root  512 Nov 29 12:34 repl/
juser@server&gt;
&lt;/pre&gt;&lt;br/&gt;

But how can you look at the information contained in that underlying inode?  If you try to stat the file, you'll end up with information about the root inode of the filesystem mounted there.  If you could make a hard link to the inode, you could access it via that link, but you can't make a hard link to a directory.  You can use fsdb, however, to look directly at the filesystem in question:&lt;br/&gt;
&lt;pre&gt;juser@server&gt; sudo fsdb /dev/rdsk/c0t0d0s0
fsdb of /dev/rdsk/c0t0d0s0 (Read only) -- last mounted on /
fs_clean is currently set to FSLOG
fs_state consistent (fs_clean CAN be trusted)
/dev/rdsk/c0t0d0s0 &gt; :cd /fs
/dev/rdsk/c0t0d0s0 &gt; :ls -l
/fs:
i#: 5           ./
i#: 2           ../
i#: 33371       .rsync/
i#: 16207       .rsync_root
i#: 6           proj/
i#: 7           repl/
/dev/rdsk/c0t0d0s0 &gt;
&lt;/pre&gt;&lt;br/&gt;

Inside fsdb, you can move around the filesystem and get some information about what's there.  But because fsdb is going directly to the disk and not going through the OS, it has no information about what's mounted there (and resolving a path like /fs/proj won't ever cross filesystem boundaries.)  Below, we set the current inode and then get information about it:&lt;br/&gt;
&lt;pre&gt;
/dev/rdsk/c0t0d0s0 &gt; 6:inode
/dev/rdsk/c0t0d0s0 &gt; ?i
i#: 6              md: d---rwx------  uid: 0             gid: 0
ln: 3              bs: 2              sz : c_flags : 0           200            

db#0: 2fd
        accessed: Mon Dec  4 11:32:30 2006
        modified: Mon Dec  4 10:13:13 2006
        created : Mon Dec  4 11:32:59 2006
/dev/rdsk/c0t0d0s0 &gt;
&lt;/pre&gt;&lt;br/&gt;

And we can see that the permissions on the underlying directory are very restrictive.  We could unmount the filesystem, change the permissions, and remount, or we could keep playing with fsdb.  (Note that this likely isn't the kind of thing you'd want to do on a production server if you like your job.  Note also that you need the 'w' option to be able to write to the device.)&lt;br/&gt;
&lt;pre&gt;
juser@server&gt; sudo fsdb -o w /dev/rdsk/c0t0d0s0
fsdb of /dev/rdsk/c0t0d0s0 (Opened for write) -- last mounted on /
fs_clean is currently set to FSLOG
fs_state consistent (fs_clean CAN be trusted)
/dev/rdsk/c0t0d0s0 &gt; 6:inode
/dev/rdsk/c0t0d0s0 &gt; ?i
i#: 6              md: d---rwx------  uid: 0             gid: 0
ln: 3              bs: 2              sz : c_flags : 0           200            

db#0: 2fd
        accessed: Mon Dec  4 11:32:30 2006
        modified: Mon Dec  4 10:13:13 2006
        created : Mon Dec  4 11:32:59 2006
/dev/rdsk/c0t0d0s0 &gt; :md=+055
i#: 6              md: d---rwxr-xr-x  uid: 0             gid: 0
ln: 3              bs: 2              sz : c_flags : 0           200            

db#0: 2fd
        accessed: Mon Dec  4 11:32:30 2006
        modified: Mon Dec  4 10:13:13 2006
        created : Mon Dec  4 11:32:59 2006
/dev/rdsk/c0t0d0s0 &gt;
&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-116525174563678158?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/116525174563678158/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=116525174563678158' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116525174563678158'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116525174563678158'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/12/fun-with-fsdb.html' title='Fun with fsdb'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-116500407812707821</id><published>2006-12-01T16:00:00.000-08:00</published><updated>2007-01-20T13:25:33.376-08:00</updated><title type='text'>Turnstiles and MDB</title><content type='html'>In Solaris, turnstiles are a data structure used by some of the synchronization primitives in the kernel (mutexes and reader-writer locks, specifically.)  They're similar to sleep queues, but they also deal with the priority inversion problem by allowing for priority inheritance.&lt;br/&gt;&lt;br/&gt;

(Priority inversion occurs when a high-priority thread is waiting for a lower-priority thread to release a resource it needs.  Priority inheritance is a mechanism whereby the lower-priority thread gets raised to the higher priority so that it can release the resource more quickly.)&lt;br/&gt;&lt;br/&gt;
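
As a back-of-the-envelope illustration of priority inheritance (and nothing like the real turnstile code), the holder of a lock can be thought of as running at the highest priority among itself and everything waiting on it.  The function name below is made up, and the sketch assumes that a larger number means a higher priority.&lt;br/&gt;
&lt;pre&gt;
```python
def effective_priority(owner_priority, waiter_priorities):
    # The holder inherits the best priority found among itself and its waiters.
    return max([owner_priority] + list(waiter_priorities))


# A priority-60 holder blocking waiters at 100 and 165 gets boosted to 165.
print(effective_priority(60, [100, 165]))  # 165
# With nobody waiting, the holder keeps its own priority.
print(effective_priority(60, []))  # 60
```
&lt;/pre&gt;&lt;br/&gt;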

There's more information in the Solaris Internals book about turnstiles, but I wanted to discuss looking at turnstiles with MDB.  The ::turnstile dcmd will list all of the turnstiles on your live system or in your crash dump.  For example:&lt;br/&gt;
&lt;pre&gt;&gt; ::turnstile ! head
            ADDR             SOBJ  WTRS EPRI ITOR PRIOINV
ffffffff81600000                0     0    0    0       0
ffffffff81600040                0     0    0    0       0
ffffffff81600080                0     0    0    0       0
ffffffff816000c0                0     0    0    0       0
ffffffff81600100                0     0    0    0       0
ffffffff81600140                0     0    0    0       0
ffffffff81600180 ffffffff88b3fd48     0  165    0       0
ffffffff816001c0 ffffffff812bad80     0   60    0       0
ffffffff81600200 ffffffff852c5f98     0   60    0       0
&gt;
&lt;/pre&gt;&lt;br/&gt;

You get the addresses of the turnstile and the synchronization object associated with it, the number of waiters, and priority information.  So, let's look at the turnstiles with waiters:&lt;br/&gt;
&lt;pre&gt;&gt; ::turnstile ! awk '$3 != 0'
            ADDR             SOBJ  WTRS EPRI ITOR PRIOINV
ffffffff812e3748 ffffffff8c1f9570     2  164    0       0
ffffffff887b8340 ffffffff8e3ea688     1  165    0       0
ffffffff8193a980 ffffffff8d8f86f8     6  164    0       0
fffffe84cd15fe08 ffffffff8e3ea680     2  164    0       0
&gt;
&lt;/pre&gt;&lt;br/&gt;

We have the addresses of the synchronization objects, so let's look at one (I happen to know that these are all reader-writer locks):&lt;br/&gt;
&lt;pre&gt;&gt; ffffffff8c1f9570::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
ffffffff8c1f9570 ffffffff92480380  B111 ffffffff8888f7e0 (W)
                                    ||| ffffffffb0cea1e0 (W)
                 WRITE_LOCKED ------+||
                 WRITE_WANTED -------+|
                  HAS_WAITERS --------+
&gt;
&lt;/pre&gt;&lt;br/&gt;

We can see who the owner is (the address of the data structure representing the thread), the value of the flags, and the list of waiters (if any.)  We know this is currently being held as a write lock because the WRITE_LOCKED flag is 1, but also because the OWNER/COUNT lists the address of a thread rather than a count of readers.&lt;br/&gt;&lt;br/&gt;
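
If you only had the raw lock word rather than the nicely formatted ::rwlock output, the decoding could be sketched like this in Python.  The flag values are my recollection of rwlock_impl.h, so treat them as assumptions and check the header; the raw word 0xffffffff92480387 is likewise reconstructed by me from the owner address and the three set flags shown above, not read from the dump.&lt;br/&gt;
&lt;pre&gt;
```python
# Assumed flag values (verify against rwlock_impl.h before relying on them).
RW_HAS_WAITERS = 1
RW_WRITE_WANTED = 2
RW_WRITE_LOCKED = 4
RW_READ_LOCK = 8  # each reader adds this much to the word


def decode_rwlock(word):
    flags = word % RW_READ_LOCK  # the low three bits
    upper = word - flags  # owner address, or reader count times 8
    info = {
        "has_waiters": flags % 2 == 1,
        "write_wanted": (flags // 2) % 2 == 1,
        "write_locked": (flags // 4) % 2 == 1,
    }
    if info["write_locked"]:
        info["owner"] = upper
    else:
        info["readers"] = upper // RW_READ_LOCK
    return info


lock = decode_rwlock(0xffffffff92480387)
print(hex(lock["owner"]), lock["write_locked"], lock["write_wanted"], lock["has_waiters"])
```
&lt;/pre&gt;&lt;br/&gt;
Under those assumptions the owner comes back as 0xffffffff92480380 with all three flags set, which matches what ::rwlock printed.&lt;br/&gt;&lt;br/&gt;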

And given the owner, we can examine the stack:&lt;br/&gt;
&lt;pre&gt;&gt; ffffffff92480380::findstack
stack pointer for thread ffffffff92480380: fffffe8000dc04b0
[ fffffe8000dc04b0 _resume_from_idle+0xde() ]
  fffffe8000dc04e0 swtch+0x10b()
  fffffe8000dc0500 cv_wait+0x68()
  fffffe8000dc0550 top_end_sync+0xa3()
  fffffe8000dc05f0 ufs_write+0x32d()
  fffffe8000dc0600 fop_write+0xb()
  fffffe8000dc0890 rfs3_write+0x3a3()
  fffffe8000dc0b50 common_dispatch+0x585()
  fffffe8000dc0b60 rfs_dispatch+0x21()
  fffffe8000dc0c30 svc_getreq+0x17c()
  fffffe8000dc0c80 svc_run+0x124()
  fffffe8000dc0cb0 svc_do_run+0x88()
  fffffe8000dc0ed0 nfssys+0x50d()
  fffffe8000dc0f20 sys_syscall32+0xef()
&gt;
&lt;/pre&gt;&lt;br/&gt;

So this thread is holding a reader-writer lock, and it appears to be waiting on a condition variable.  As it turns out, nothing is ever going to call cv_broadcast() or cv_signal() on that condition variable, which means that the process is never going to release that RW lock, either.  Which is, of course, why I'm looking at this crash dump in the first place.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-116500407812707821?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/116500407812707821/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=116500407812707821' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116500407812707821'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116500407812707821'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/12/turnstiles-and-mdb.html' title='Turnstiles and MDB'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-116490330351609767</id><published>2006-11-30T11:21:00.000-08:00</published><updated>2007-01-20T13:26:30.123-08:00</updated><title type='text'>Forcing a Solaris x86 kernel core dump</title><content type='html'>With a SPARC system, it's easy to force a kernel core dump:  drop the system to the ok prompt and type 'sync'.  
Solaris x86 servers don't give you that, so you have to do something else.&lt;br/&gt;&lt;br/&gt;

The &lt;a href="http://www.sun.drydog.com/faq/s86faq.html"&gt;Solaris x86 FAQ&lt;/a&gt; suggests booting the server under kadb so that you can run this to get a core dump:&lt;br/&gt;
&lt;pre&gt;$&amp;lt;systemdump&lt;/pre&gt;

This information is a little bit outdated, as kadb has been replaced by kmdb, which is actually much nicer, 'cause you can load kmdb at run-time instead of having been lucky enough to have booted under kadb.  But the above command will still force a dump of the system.&lt;br/&gt;&lt;br/&gt;

Of course, you could always do it with DTrace, too, if you like:&lt;br/&gt;
&lt;pre&gt;dtrace -w -n 'BEGIN{panic();}'&lt;/pre&gt;&lt;br/&gt;

(The panic() function is provided as one of the destructive functions, which is why you need the -w flag.)&lt;br/&gt;&lt;br/&gt;

(And I was in full paranoid mode while testing this, first to make sure I was on the correct server when I tried it, but also while I had the command in my mouse buffer to copy it here.  It's not something you want to accidentally paste into some random window.)&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-116490330351609767?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/116490330351609767/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=116490330351609767' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116490330351609767'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116490330351609767'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/11/forcing-solaris-x86-kernel-core-dump.html' title='Forcing a Solaris x86 kernel core dump'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-116474114331188484</id><published>2006-11-29T07:50:00.000-08:00</published><updated>2007-01-20T13:28:12.706-08:00</updated><title type='text'>Recommended reading: Postmortem Object Type Identification</title><content type='html'>This paper makes for some interesting reading: &lt;a href="http://arxiv.org/abs/cs/0309037"&gt;Postmortem Object Type Identification&lt;/a&gt; by Bryan Cantrill.&lt;br/&gt;&lt;br/&gt; 

The paper presents a method for determining the type of arbitrary memory objects in a core dump.  In other words, given some random address in a core dump, what's the type of the object?&lt;br/&gt;&lt;br/&gt; 

(And I guess that introduces the question, why should you care what the type of a random address in memory is?  One very good reason is that if you're trying to track down a memory corruption problem, knowing the type of the objects near the memory corruption can help narrow down your search for the bad code.)&lt;br/&gt;&lt;br/&gt; 

Determining the type of statically allocated objects is fairly easy -- you have the symbol table and type information.  If the random address happens to match something in the symbol table, you have all the information you need.  (I'm oversimplifying a bit, but it's an easy problem.)&lt;br/&gt;&lt;br/&gt; 

The harder problem is determining the type of a dynamically-allocated object.  You don't have a symbol table handy to tell you what all the locations in memory are, because you didn't have this information handy at compile time.  You could store information at run-time about the types of objects that you're allocating, but that becomes a hairy problem. The likeliest solution would involve modifying your memory allocation library to store this information, but you would need to pass type information to the memory allocation routine, which might not be feasible.  (Although it should be noted that the kernel slab allocator in Solaris provides some of this information, as objects allocated from certain object caches are of known type.)&lt;br/&gt;&lt;br/&gt; 

This paper presents a method for inferring the types of dynamically-allocated objects.  At the core of this method is a fairly standard iterative graph-traversal algorithm for propagating information from nodes (i.e., memory objects) of known type to nodes of unknown type.  Given that almost all dynamically-allocated objects are rooted in statically-allocated objects, the algorithm can provide very good coverage of dynamically-allocated objects.  (And the implementation in MDB makes use of the object-cache type knowledge mentioned above as an optimization during initialization.)&lt;br/&gt;&lt;br/&gt; 

The C language allows for uses that reduce the effectiveness of the algorithm, but the paper presents some heuristics to handle those.  The paper also presents some interesting applications of this method that aren't directly related to debugging memory corruption.&lt;br/&gt;&lt;br/&gt; 

Definitely worth reading.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-116474114331188484?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/116474114331188484/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=116474114331188484' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116474114331188484'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116474114331188484'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/11/recommended-reading-postmortem-object.html' title='Recommended reading: Postmortem Object Type Identification'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-116465742862807855</id><published>2006-11-27T16:19:00.000-08:00</published><updated>2007-01-20T13:44:06.170-08:00</updated><title type='text'>Unlinked files and MDB</title><content type='html'>Occasionally the unlinked file problem pops up.  For the sysadmin, this generally starts with a log file filling up a file system.  Someone removes the original file, generally because they've moved the file across a filesystem boundary (or gzipped it to a target file on another file system.)  
But even with the offending file gone, the file system is still full: some process still has the file open, and until that last open reference goes away, the file system can't reclaim the space.&lt;br/&gt;&lt;br/&gt;

There are various ways to find that process.  The first time I saw this, some 12 years ago or so on a SunOS 4 server, I used some tool (probably lsof) to generate a list of the inode numbers of all the open files on the system matching the particular device number.  I then used some other tool (probably find) to generate a list of all the inodes on that particular device.  I compared the sorted lists to find the inodes that didn't exist in the filesystem and then went back to the lsof output to find the process holding open the unlinked file.&lt;br/&gt;&lt;br/&gt;
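A minimal sketch of that comparison step, assuming the two lists have already been saved as sorted files with one inode number per line (the function name and the file names are made up for illustration):&lt;br/&gt;
&lt;pre&gt;# Print inode numbers that are in the open-files list ($1) but not in the
# on-disk list ($2).  comm needs both files sorted the same way.
unlinked_inodes() {
    comm -23 "$1" "$2"
}&lt;/pre&gt;&lt;br/&gt;
Something like 'unlinked_inodes open_inodes.sorted fs_inodes.sorted' would then print the candidates to look up in the lsof output.&lt;br/&gt;&lt;br/&gt;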

In a later job, I found that someone had essentially scripted the above process (which was logical, 'cause it satisfied Cardinal Rule number 1:  Automate complex, repetitive processes).  But it was still fairly slow.  And then one day I noticed something about the entries in /proc/*/fd/ that made this much faster.&lt;br/&gt;&lt;br/&gt;

Under Solaris, an unlinked file shows up with a link count of 0:&lt;br/&gt;
&lt;pre&gt;c---------   1 juser tty       24,  1 Nov 27 15:16 0
c---------   1 juser tty       24,  1 Nov 27 15:16 1
c---------   1 juser tty       24,  1 Nov 27 15:16 2
-r--r--r--   0 juser staff    1048576 Nov 27 10:56 3&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;

Under Linux (or at least the version of RedHat running on a box I have access to), an unlinked file is tagged as deleted:&lt;br/&gt;
&lt;pre&gt;lrwx------ 1 juser staff  64 Nov 27 20:19 0 -&gt; /dev/pts/0
lrwx------ 1 juser staff  64 Nov 27 20:19 1 -&gt; /dev/pts/0
lrwx------ 1 juser staff  64 Nov 27 20:19 2 -&gt; /dev/pts/0
lr-x------ 1 juser staff  64 Nov 27 20:19 3 -&gt; /var/tmp/bigfile (deleted)&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;
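Both listings can be filtered the same way.  Here's a rough sketch, assuming the 'ls -l' column layout shown above (link count in the second column on Solaris, a trailing "(deleted)" on Linux); the function name is hypothetical:&lt;br/&gt;
&lt;pre&gt;# Read "ls -l /proc/PID/fd" output on stdin and print only the entries that
# look unlinked: link count 0 (Solaris) or a "(deleted)" suffix (Linux).
unlinked_fd_entries() {
    awk '$1 == "total" { next } $2 == "0" || $NF == "(deleted)"'
}&lt;/pre&gt;&lt;br/&gt;
Usage would be along the lines of 'ls -l /proc/4291/fd | unlinked_fd_entries'.  The "total" line that ls prints for a directory is skipped so it isn't mistaken for a zero link count.&lt;br/&gt;&lt;br/&gt;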

Something interesting to note here is that Linux gives you the filename, which makes sense given that the fd is presented as a symbolic link.  Solaris doesn't give you the filename, but a quick MDB invocation can get that for you:&lt;br/&gt;
&lt;pre&gt;#mdb -k
Loading modules: [ unix krtld genunix specfs dtrace cpu.AuthenticAMD.15 ufs 
ip sctp usba random fcp fctl lofs md cpc fcip crypto logindmux ptm nfs ]
&gt; 0t4291::pid2proc | ::fd 3 | ::print file_t
{
    f_tlock = {
        _opaque = [ 0 ]
    }
    f_flag = 0x2001
    f_pad = 0xbadd
    f_vnode = 0xffffffff860fcb40
    f_offset = 0
    f_cred = 0xffffffff84cefc50
    f_audit_data = 0xffffffff8b709638
    f_count = 0x1
}
&gt; 0xffffffff860fcb40::vnode2path
/var/tmp/bigfile
&gt;&lt;/pre&gt;&lt;br/&gt;

And of course, you could just do this as a single command line:&lt;br/&gt;
&lt;pre&gt;#echo "0t4291::pid2proc | ::fd 3 | ::print file_t" | 
/usr/bin/mdb -k | grep f_vnode | awk '{printf("%s::vnode2path\n",$NF);}' | 
/usr/bin/mdb -k
/var/tmp/bigfile
#&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;

There are two steps to this: "0t4291::pid2proc | ::fd 3 | ::print file_t" first turns the pid 4291 into a pointer to a proc structure.  Given that, we ask for the pointer to the file_t structure associated with file descriptor 3 of that process, and then we print it based on its type.  The second step is to take the pointer to the vnode and print its path (::vnode2path).&lt;br/&gt;&lt;br/&gt;
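If you do this often, the first step can be parameterized.  This is only a sketch: the function name is made up, and all it does is build the same dcmd pipeline shown above for a given pid and file descriptor:&lt;br/&gt;
&lt;pre&gt;# Print the mdb dcmd pipeline that dumps the file_t for fd $2 of pid $1.
# The pid is given in decimal, hence the 0t prefix.
fd_file_dcmd() {
    printf '0t%d::pid2proc | ::fd %d | ::print file_t\n' "$1" "$2"
}&lt;/pre&gt;&lt;br/&gt;
So 'fd_file_dcmd 4291 3 | /usr/bin/mdb -k' should produce the same file_t dump as the interactive session, ready for the grep and awk step.&lt;br/&gt;&lt;br/&gt;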

And doing a search shows a feature of lsof that I hadn't encountered yet:  'lsof +L1' will show you all of the open files on the system with a link count less than one.  &lt;a href="http://0xfe.blogspot.com/2006/03/troubleshooting-unix-systems-with-lsof.html"&gt;This&lt;/a&gt; blog post is one of a few that mention using lsof this way.  Lsof still won't give you the name of the unlinked files, though, so the mdb trick still comes in handy.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-116465742862807855?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/116465742862807855/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=116465742862807855' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116465742862807855'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/116465742862807855'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/11/unlinked-files-and-mdb.html' title='Unlinked files and MDB'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-115878592753541673</id><published>2006-09-20T20:49:00.000-07:00</published><updated>2006-09-21T08:18:14.916-07:00</updated><title type='text'>Nifty MDB tidbits</title><content type='html'>This stuff is actually kind of fun to play with.  
Yesterday, one of my colleagues was trying to figure out the option to ps to get it to display the actual time of day something was started, when it was started long enough ago that it only displays the month and day.  I did some random poking with MDB (output slightly modified for formatting purposes.  The 64-bit address is the proc structure.):&lt;br/&gt;&lt;pre&gt;server# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace 
cpu.AuthenticAMD.15 ufs ip sctp usba fcp fctl qlc 
lofs md cpc fcip random crypto zfs logindmux ptm nfs ]
&gt; ::ps ! grep sshd
R   1298      1   ffffffff8166b150 sshd
R   6701   1298   fffffe82ca7d98f8 sshd
R   6727   6701   fffffe81b3205230 sshd
R  25827   1298   fffffe80eb61a6c8 sshd
R  25857  25827   fffffe82c7be5a20 sshd
R   8479   1298   fffffe82c7bdb988 sshd
R   8485   8479   fffffe80ec1296f8 sshd
&gt; ffffffff8166b150::print -t proc_t ! grep time
 [ ... ]
            time_t tv_sec = 2006 Sep  7 15:00:44
 [ ... ]
&gt;&lt;/pre&gt;
(This was obviously just random poking and not necessarily the most efficient way to get this information.)&lt;br/&gt;
And here was a nifty problem that I ran across.  We'd lost contact with a Solaris 8 server -- ssh gave a connection refused, and there was no response on the console.  Someone dropped it to the ok prompt and ran 'sync' to get a core dump.  I was looking at the core dump and noticed this:&lt;br/&gt;&lt;pre&gt;server# mdb -k unix.0 vmcore.0
Loading modules: [ unix krtld genunix ip usba lofs 
random nfs ptm ]
&gt; ::ps -f ! grep ssh
Z    428      1    0000030004938048 /usr/local/sbin/sshd
&gt;&lt;/pre&gt;
Hmm, sshd is a zombie, so it makes sense that there's no response on port 22.  What about the console?&lt;br/&gt;&lt;pre&gt;&gt; ::ps -f ! grep ttymon
R   1152      1   0000030003f54040 /usr/lib/saf/ttymon
Z   1144      1   0000030004cf1540 /usr/lib/saf/ttymon 
         -g -h -p server console login:  -T sun -d /de
&gt;&lt;/pre&gt;
Hmm, that's odd.  If the console ttymon died, init should have restarted it.  That's what inittab says.  So I look at init:&lt;br/&gt;&lt;pre&gt;&gt; ::ps -f ! grep init
Z      1      0      0000030001b81528 /etc/init -
&gt;&lt;/pre&gt;&lt;br/&gt;
Ayup, that would be a problem.  If init's a zombie itself, it's not likely to be doing its job of restarting ttymon.&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-115878592753541673?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/115878592753541673/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=115878592753541673' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/115878592753541673'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/115878592753541673'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/09/nifty-mdb-tidbits.html' title='Nifty MDB tidbits'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-115687734440566154</id><published>2006-08-29T18:45:00.000-07:00</published><updated>2006-08-29T12:28:11.746-07:00</updated><title type='text'>Grabbing the umask of a running process via MDB</title><content type='html'>Wow, this was actually easier than I thought.  I set out to see if I could figure out the umask of a process via MDB.  I tried to figure out other ways to get this information, but I didn't come across anything.  The umask isn't an environment variable, so something like "pargs -e" won't tell you anything.&lt;br/&gt;&lt;br/&gt;

So the umask of a process is just a field in the data structure that defines a process, proc_t.  Specifically, it's the u_cmask field of the user structure (p_user) embedded in proc_t, so you can just do something like this (output slightly modified for formatting):&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;server# mdb -k
&gt;  ::pgrep sshd
S    PID PPID PGID  SID UID             ADDR NAME
R   6311    1 6311 6311   0 ffffffff90064d48 sshd
&gt; ffffffff90064d48::print -t proc_t ! grep u_cmask
        mode_t u_cmask = 0x12
&gt;
&lt;/pre&gt;

And, of course, there are slightly different ways of doing the same thing.  For example:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;&gt; 0t6311::pid2proc | ::print -t proc_t p_user.u_cmask
mode_t p_user.u_cmask = 0x12
&gt;
&lt;/pre&gt;

So here we have a umask of 022 (note that it's printed in hex above, not octal.)&lt;br/&gt;&lt;br/&gt;
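The conversion is easy enough to do in your head (0x12 is 18 decimal, which is 2*8 + 2, or 022 octal), but here's a small sketch that lets printf do it.  The function name is just for illustration:&lt;br/&gt;
&lt;pre&gt;# Convert the hex value mdb prints for u_cmask into the octal form that
# the umask command uses.
umask_hex_to_octal() {
    printf '%04o\n' "$1"
}&lt;/pre&gt;&lt;br/&gt;
Running 'umask_hex_to_octal 0x12' prints 0022.&lt;br/&gt;&lt;br/&gt;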

(I seem to remember having Googled this sometime late last week and coming up with nothing.  My search history doesn't bear witness to this, though, and a quick Google of "umask running process solaris" points to &lt;a href="http://groups.google.com/group/comp.unix.solaris/browse_thread/thread/405107bdbc661607/fc52dcda145a90b%23fc52dcda145a90b"&gt;this thread&lt;/a&gt;.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-115687734440566154?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/115687734440566154/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=115687734440566154' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/115687734440566154'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/115687734440566154'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/08/grabbing-umask-of-running-process-via.html' title='Grabbing the umask of a running process via MDB'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-115634354186926038</id><published>2006-08-23T19:30:00.000-07:00</published><updated>2006-08-23T13:18:30.976-07:00</updated><title type='text'>Diving into a kernel crash dump and banging my head on the bottom</title><content type='html'>I previously gave an &lt;a href=""&gt;example&lt;/a&gt; of diving into a kernel crash dump with mdb.  
In that case, I was lucky enough to have in a register the pointer to the data structure I wanted to look at.  I'm looking at another crash dump, and I'm not so lucky this time.  I have to go hunting for the pointer I'm interested in.&lt;br/&gt;&lt;br/&gt;

Here's the backtrace:&lt;br/&gt;&lt;br/&gt;
&lt;pre&gt;
&gt; $C
fffffe800017fc30 strcmp()
fffffe800017fc90 vfs_setmntopt_nolock+0x147()
fffffe800017fce0 vfs_parsemntopts+0x96()
fffffe800017fe10 domount+0xc87()
fffffe800017fe90 mount+0x105()
fffffe800017fed0 syscall_ap+0x97()
fffffe800017ff20 sys_syscall32+0xef()
00000000080b2b80 0xfe45a0cc()
&gt;
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;
The basic problem is that strcmp() is getting passed a NULL pointer.  I won't go into the details of that here; what I'm interested in is determining what filesystem is being mounted.  domount() is passed a pointer to a vnode, so I'm going to try looking there.&lt;br/&gt;&lt;br/&gt;

If this were a straight x86 box, I'd be happy.  All arguments are passed on the stack, so things are very straightforward.  I'd even have the arguments listed in the backtrace, so there'd be no more work than a cut and paste.  But this is an x64 box, and arguments are passed in registers, so I have to manually track the value I want as it's moved from the register in which it was passed to the location where it was saved.  It may have been saved on the stack, which makes life (relatively) easy, but it may have been saved into a non-volatile register, in which case I need to track it through succeeding stack frames until it gets pushed onto the stack.  (Well, okay, this is just basic recursion, with "getting pushed onto the stack" as the base case.)&lt;br/&gt;&lt;br/&gt;
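As a quick reminder of which register to start from, here's a small lookup sketch.  It assumes the standard System V AMD64 calling convention for integer and pointer arguments, and the function name is made up:&lt;br/&gt;
&lt;pre&gt;# Map a 1-based integer/pointer argument position to the register that
# carries it on entry to the function.  Only the first six arguments travel
# in registers; the rest go on the stack.
arg_register() {
    case "$1" in
        1) echo '%rdi' ;;
        2) echo '%rsi' ;;
        3) echo '%rdx' ;;
        4) echo '%rcx' ;;
        5) echo '%r8' ;;
        6) echo '%r9' ;;
        *) echo 'stack' ;;
    esac
}&lt;/pre&gt;&lt;br/&gt;
For instance, 'arg_register 3' prints %rdx.  This only tells you where the value was on entry; tracking where the function saved it afterwards is still the manual part.&lt;br/&gt;&lt;br/&gt;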

So here's what &lt;a href="http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/vfs.c#885"&gt;domount()&lt;/a&gt; looks like:&lt;br/&gt;&lt;br/&gt;
&lt;pre&gt;
int
domount(char *fsname, struct mounta *uap, vnode_t *vp, struct cred *credp,
 struct vfs **vfspp)
{
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;

Okay, so the vnode pointer I'm interested in is argument 3.  And, as everyone knows (or at least can figure out after looking at a &lt;a href="http://www.genunix.org/gen/crashdump/book.pdf"&gt;good reference&lt;/a&gt;), the third argument is passed in %rdx.  What I need to do is track where domount() stores this:&lt;br/&gt;&lt;br/&gt;
&lt;pre&gt;
&gt; domount::dis
domount:                        pushq  %rbp
domount+1:                      movq   %rsp,%rbp
domount+4:                      pushq  %r15
domount+6:                      movq   %rdx,%r15
[ ... ]
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;

Okay, so domount() stores %rdx into %r15 (a non-volatile register.)  This means more work, as I have to go look at vfs_parsemntopts() to see where it stores %r15.  But first, let me check that %r15 isn't used anywhere else in domount() before the instruction of interest (domount+0xc87):&lt;br/&gt;&lt;br/&gt;
&lt;pre&gt;
&gt; domount::dis ! grep '%r15$'
domount+4:                      pushq  %r15
domount+6:                      movq   %rdx,%r15
domount+0x37f:                  popq   %r15
domount+0x66e:                  popq   %r15
&gt;
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;

Okay, so domount() is overwriting the register in a couple of places, and they're before the instruction of interest.  So I check them out:&lt;br/&gt;&lt;br/&gt;
&lt;pre&gt;
&gt; domount+0x37f::dis
domount+0x35b:                  movl   %eax,%r14d
domount+0x35e:                  je     -0x29c   &amp;lt;domount+0xc2&amp;gt;
domount+0x364:                  movl   $0x16,%eax
domount+0x369:                  cmpl   $0x4e,%r14d
domount+0x36d:                  cmovl.ne %r14d,%eax
domount+0x371:                  addq   $0xf8,%rsp
domount+0x378:                  popq   %rbx
domount+0x379:                  popq   %r12
domount+0x37b:                  popq   %r13
domount+0x37d:                  popq   %r14
domount+0x37f:                  popq   %r15
domount+0x381:                  leave
domount+0x382:                  ret
[ ... ]
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;

So this looks like it's just an early exit from the function, and the second instance is similar, so I probably don't have to worry about these two cases.  So that leaves me looking at vfs_parsemntopts():&lt;br/&gt;&lt;br/&gt;
&lt;pre&gt;
&gt; vfs_parsemntopts::dis
vfs_parsemntopts:               pushq  %rbp
vfs_parsemntopts+1:             movq   %rsp,%rbp
vfs_parsemntopts+4:             pushq  %r15
vfs_parsemntopts+6:             movl   $0x1,%r15d
vfs_parsemntopts+0xc:           pushq  %r14
vfs_parsemntopts+0xe:           pushq  %r13
vfs_parsemntopts+0x10:          pushq  %r12
vfs_parsemntopts+0x12:          pushq  %rbx
vfs_parsemntopts+0x13:          movq   %rsi,%rbx
vfs_parsemntopts+0x16:          subq   $0x18,%rsp
[ ... ]
&lt;/pre&gt;

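Woohoo!  vfs_parsemntopts() pushes %r15 onto the stack, so I'm done looking for the vnode pointer.  It's the first thing pushed after the frame pointer is set up, so it sits 8 bytes below the frame pointer shown for vfs_parsemntopts() in the backtrace (fffffe800017fce0).  So, pull it off the stack, dereference it as a vnode_t, and I get the name of the filesystem that was being mounted (or at least a cached guess):&lt;br/&gt;&lt;br/&gt;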
&lt;pre&gt;
&gt; fffffe800017fce0-8/J
0xfffffe800017fcd8:             ffffffff827afdc0
&gt; ffffffff827afdc0::print -t vnode_t
[ ... ]
char *v_path = 0xffffffff97b11e70 "/netapp/some/filesystem"
[ ... ]
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-115634354186926038?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/115634354186926038/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=115634354186926038' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/115634354186926038'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/115634354186926038'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/08/diving-into-kernel-crash-dump-and.html' title='Diving into a kernel crash dump and banging my head on the bottom'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-115593262526474469</id><published>2006-08-20T10:16:00.000-07:00</published><updated>2006-08-20T07:23:49.746-07:00</updated><title type='text'>Fun with DTrace -- tracing I/O through a pipe</title><content type='html'>I was tracking down a problem where one possible avenue of exploring what was going on was to use DTrace to track the data being written to and read from a pipe.  I didn't go down that path, as I figured out the problem using other methods.  But it seemed like a good exercise, so I followed through on it.&lt;br/&gt;&lt;br/&gt;

This wasn't just a simple case of grabbing the file descriptors returned by pipe() and tracking read()'s and write()'s made by that pid using those file descriptors.  There was an intervening fork() and the dup()'s necessary to use the two ends of the pipe for stdin and stdout.  I could have tracked the fork() and the dup()'s, but I chose to try it a different way.&lt;br/&gt;&lt;br/&gt;

A file descriptor is nothing more than an index into the file descriptor table associated with a process.  Somewhere between the file descriptor being passed to the read() and write() system calls and the underlying function that actually reads from or writes to that file, the file descriptor must necessarily be mapped to the data structure representing the file.  This happens &lt;a href="http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/syscall/rw.c"&gt;in this file&lt;/a&gt; at the top of the read() code (and similarly for write()):&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
ssize_t
read(int fdes, void *cbuf, size_t count)
{
[ ... ]
  if ((fp = getf(fdes)) == NULL)
   return (set_errno(EBADF));
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;

The return value of the getf() function (a struct file *) is what I'm looking for, the pointer to the process-independent data structure that I want to track.  Once I have the file pointer from the pipe() system call, I can trace all the reads and writes using that pointer.  Given that that value is only determined after entering read() or write(), I'll need to perform some speculation to determine which read()'s and write()'s I'm interested in.&lt;br/&gt;&lt;br/&gt;

So how do I determine the two corresponding values being used in pipe()?  After having allocated two file pointers and two file descriptors, &lt;a href="http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/syscall/pipe.c"&gt;pipe()&lt;/a&gt; does the following to associate the file pointers with the file descriptors:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
 setf(fd1, fp1);
 setf(fd2, fp2); 
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;

All very logical and straightforward.  Based on the above, I wrote the following DTrace script.  Note that it's actually not what I represented above, as this code assumes we know the pid of the process calling pipe().  But it's still reasonably representative, as tracing the reads and writes depends solely on the file pointer.&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
/*
 * The two file pointers are kept in global variables rather than thread-local
 * ones.  Thread-local (self-&gt;) variables are only visible to the thread that
 * set them, so they would never match reads and writes made by the child
 * after the fork().
 */
uint64_t first_fp;
uint64_t second_fp;

syscall::pipe:entry
/ pid == $target /
{
        printf("Pid %d called pipe()\n", pid);
        self-&gt;trace_pipe = 1;
        first_fp = 0;
}

/*
 * Grab the file pointer from the second call to setf() within pipe().  Note 
 * that we want to do this first to avoid hitting this predicate just after 
 * we've set first_fp.
 */
fbt::setf:entry
/ self-&gt;trace_pipe &amp;&amp; first_fp != 0 /
{
        printf("fd == %d second fp == 0x%x\n",arg0,arg1);
        second_fp = arg1;
}

/*
 * Grab the file pointer from the first call to setf() within pipe().
 */
fbt::setf:entry
/ self-&gt;trace_pipe &amp;&amp; first_fp == 0 /
{
        printf("fd == %d first fp == 0x%x\n",arg0,arg1);
        first_fp = arg1;
}

syscall::pipe:return
/ self-&gt;trace_pipe /
{
        self-&gt;trace_pipe = 0;
}

/* 
 * Note the speculation.  On entry to these system calls, we only have a file 
 * descriptor.  We commit the speculation when we know that the fd maps to the
 * file pointer of interest.
 */
syscall::write:entry,
syscall::read:entry
{
        self-&gt;spec = speculation();
        speculate(self-&gt;spec);

        printf("%s %d bytes %s fd %d\n",probefunc,arg2,
                (probefunc == "write" ? "to" : "from" ),arg0);
}

/* arg1 is the return value of getf(), i.e. the file pointer for this fd. */
fbt::getf:return
/ self-&gt;spec &amp;&amp; ( arg1 == first_fp || arg1 == second_fp ) /
{
        commit(self-&gt;spec);
        self-&gt;spec = 0;
}

fbt::getf:return
/ self-&gt;spec &amp;&amp; arg1 != first_fp &amp;&amp; arg1 != second_fp /
{
        discard(self-&gt;spec);
        self-&gt;spec = 0;
}
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-115593262526474469?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/115593262526474469/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=115593262526474469' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/115593262526474469'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/115593262526474469'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/08/fun-with-dtrace-tracing-io-through.html' title='Fun with DTrace -- tracing I/O through a pipe'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-115564355280007863</id><published>2006-08-15T20:03:00.000-07:00</published><updated>2006-08-16T04:55:14.046-07:00</updated><title type='text'>VMWare (or Parallels) as an educational tool</title><content type='html'>I recently installed Solaris x86 under Parallels on a MacBook.  As I was rebooting the Solaris VM, I noticed the Parallels BIOS message briefly flash up before starting to boot the OS.  This got me thinking about the operating systems course at Princeton.  As it existed during my brief stay there, the course involved writing an OS from the ground up on Intel hardware, with no simulator to hide the details away.  
I wish I'd taken the course while I was there, but I'd taken an OS course while getting my Master's, and I didn't see the need in doing it over again.  Unfortunately, the OS course I took involved writing an OS on top of a simulator, so I didn't get the experience of doing the low-level stuff for myself.&lt;br/&gt;&lt;br/&gt;

Reminiscences aside, it occurred to me that Parallels (or VMWare) would be a great tool for an operating systems course.  You could do all the low-level programming without needing to constantly reboot the machine you're working on.  The fix-compile-test cycle would be much shorter since your working environment would be persistent.&lt;br/&gt;&lt;br/&gt;

I'm not the first person to think of it, though.  This &lt;a href="http://www.cs.columbia.edu/~nieh/teaching/w4118/"&gt;operating systems course&lt;/a&gt; at Columbia used VMWare, although the programming assignments aren't exactly what I had in mind.  My ideal OS course involves writing at least the basics of an OS from the ground up.  The course I link to involves various modifications to the Linux kernel.  I won't try to argue that doing so isn't useful, it's just not what I would want out of an OS course.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-115564355280007863?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/115564355280007863/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=115564355280007863' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/115564355280007863'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/115564355280007863'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/08/vmware-or-parallels-as-educational.html' title='VMWare (or Parallels) as an educational tool'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-115556612078656175</id><published>2006-08-14T19:31:00.000-07:00</published><updated>2006-08-14T09:33:47.623-07:00</updated><title type='text'>Fun with MDB</title><content type='html'>I 
don't have much experience with mdb, so I thought I'd dig into a kernel crash dump to see what I could figure out.  This is a server connected to what might be faulty storage, so what I'm trying to determine is whether the storage might have had anything to do with the kernel panic.&lt;br/&gt;&lt;br/&gt;
First off, here's the stack backtrace:&lt;br/&gt;&lt;br/&gt;
&lt;pre&gt;
server:/var/crash/server&gt; sudo mdb -k unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace 
ufs md ip sctp usba random fcp fctl lofs nfs ptm 
logindmux ipc crypto fcip ]
&gt; $C
fffffe800076d9d0 alloccg+0x48()
fffffe800076da20 hashalloc+0xb4()
fffffe800076da90 alloc+0x10b()
fffffe800076dc40 bmap_write+0xa74()
fffffe800076dd40 wrip+0x759()
fffffe800076dde0 ufs_write+0x211()
fffffe800076ddf0 fop_write+0xb()
fffffe800076de00 lo_write+0x11()
fffffe800076de10 fop_write+0xb()
fffffe800076dec0 write+0x287()
fffffe800076ded0 write32+0xe()
fffffe800076df20 sys_syscall32+0xef()
&gt;
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;
It died in alloccg() (allocate cylinder group), one of the lower-level UFS functions, so the backtrace hasn't ruled out the storage.  So, given the above, how do I determine what file was being written to?  The top of the backtrace looks like a good place to start, so I check out the definition of alloccg().  The first argument passed in is the inode:&lt;br/&gt;&lt;br/&gt;
&lt;pre&gt;
static daddr_t
alloccg(struct inode *ip, int cg, daddr_t bpref, int size)
{
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;
The inode itself doesn't contain the filename, but from the definition of an inode, we have this:&lt;br/&gt;&lt;br/&gt;
&lt;pre&gt;
typedef struct inode {
[ ... ]
        struct  vnode *i_vnode; /* vnode associated with this inode */
[ ... ]
}
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;
and we can get the pathname of the file from the vnode:&lt;br/&gt;&lt;br/&gt;
&lt;pre&gt;
typedef struct vnode {
[ ... ]
        char            *v_path;        /* cached path */
[ ... ]
}
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;
Mdb is aware of data structures, and can interpret the raw memory addresses with respect to those data structures.  So, if we have the pointer to that inode, we could figure out the path.  In the 32-bit x86 world, arguments are passed on the stack, so the arguments would have shown up in the backtrace.  In the x64 world, arguments are passed in registers, which are overwritten with succeeding function calls.  There are ways to trace arguments down (functions that need to reference their arguments after making a nested function call need to save the register values somewhere, either on the stack or in non-volatile registers), but in this case, the appropriate register (%rdi for the first argument) contains what we need.&lt;br/&gt;&lt;br/&gt;
&lt;pre&gt;
&gt; ::regs
%rax = 0xfffffffffffff000                 %r9  = 0xffffffffaf62ab18
%rbx = 0x0000000000002000                 %r10 = 0xffffffff9e403559
%rcx = 0x0000000000002000                 %r11 = 0xfffffffffbcc2de0 apic_cr8pri
%rdx = 0xffffffff83dd2000                 %r12 = 0xffffffff83c79000
%rsi = 0x00000000ffffff00                 %r13 = 0xffffffffbf4bf500
%rdi = 0xffffffffbf4bf500                 %r14 = 0xfffffffffbad45e0 alloccg
%r8  = 0x0000000000000000                 %r15 = 0xffffffff832eb1c0

%rip = 0xfffffffffbad4628 alloccg+0x48
%rbp = 0xfffffe800076d9d0
%rsp = 0xfffffe800076d960
%rflags = 0x00010297
  id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
  status=&amp;lt;of,df,IF,tf,SF,zf,AF,PF,CF&amp;gt;

                        %cs = 0x0028    %ds = 0x0043    %es = 0x0043
%trapno = 0xe           %fs = 0x0000    fsbase = 0x00000000fbc22ae0
   %err = 0x0           %gs = 0x01c3    gsbase = 0x0000000000000000
&gt;
&gt; 0xffffffffbf4bf500::print -t "struct inode"
{
[ ... ]
    struct vnode *i_vnode = 0xffffffffbf4bedc0
[ ... ]
&gt;
&gt; 0xffffffffbf4bedc0::print -t "struct vnode"
{
[ ... ]
    char *v_path = 0xffffffff863952f8 "/fs/data/somefile"
[ ... ]
&gt;
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;
And there's the file that was being written when the panic occurred in alloccg().&lt;br/&gt;&lt;br/&gt;
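As a rough sketch of the register convention mentioned above (assuming the standard System V AMD64 calling convention and integer or pointer arguments only; arg_location() is a made-up helper for illustration, not anything mdb provides), the lookup from argument position to register goes like this:&lt;br/&gt;&lt;br/&gt;
&lt;pre&gt;
```c
/*
 * Sketch: where the Nth integer or pointer argument (0-based) lives on
 * entry to a function under the System V AMD64 calling convention.
 * arg_location() is a hypothetical helper, not part of mdb.
 */
const char *
arg_location(unsigned int n)
{
        switch (n) {
        case 0: return "%rdi";
        case 1: return "%rsi";
        case 2: return "%rdx";
        case 3: return "%rcx";
        case 4: return "%r8";
        case 5: return "%r9";
        default: return "stack";        /* seventh and later arguments */
        }
}
```
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;
So for alloccg(), whose first argument is the inode pointer, arg_location(0) gives %rdi, which is the register read out of the ::regs output above.&lt;br/&gt;&lt;br/&gt;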
For reference, I used the new &lt;a href="http://www.amazon.com/gp/product/0131568191/sr=8-1/qid=1155566806/ref=pd_bbs_1/104-3615974-8517550?ie=UTF8"&gt;Solaris Performance and Tools&lt;/a&gt; book.  I found some very detailed information on argument passing in the x64 world (specifically with respect to kernel crash dump analysis) &lt;a href="http://www.genunix.org/gen/crashdump/book.pdf"&gt;here&lt;/a&gt;, an as-yet-unpublished book by Frank Hoffman at Sun.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-115556612078656175?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/115556612078656175/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=115556612078656175' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/115556612078656175'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/115556612078656175'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/08/fun-with-mdb.html' title='Fun with MDB'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-115402734150276356</id><published>2006-08-01T20:08:00.000-07:00</published><updated>2006-08-01T13:56:57.806-07:00</updated><title type='text'>Playing with process contracts in Solaris 10</title><content type='html'>Even though I know the general details about how process contracts work in Solaris 10, and had 
seen the list of functions that are used in manipulating process contracts, I hadn't actually tried playing with the API before last week.  I started playing with it last week because I wanted to re-implement Sun's process contract handling in sshd so that I could submit it as a patch to OpenSSH.&lt;br/&gt;&lt;br/&gt;

I first went looking for documentation on the API.  There are obviously the man pages for contract(4), libcontract(3CONTRACT), and all of the library calls, but I couldn't find much more than that aside from some threads on the smf-discuss@opensolaris.org forum.  The man pages contain all you need; it's just a little bit harder to dig the information out of them than it would be from other kinds of documentation.&lt;br/&gt;&lt;br/&gt;

I posed some questions about process contracts and then tried to answer them programmatically.  The first question I asked was, how do I determine what process contract I belong to?  In retrospect, this may not even be a useful question, but it was the first one to come to mind.  I have yet to figure this one out, though.  I spent an hour or two trying to figure it out, but I eventually moved on, as it was tangential to what I really wanted to accomplish.&lt;br/&gt;&lt;br/&gt;

The next obvious question was, how do I create a new process contract?  This is fairly simple:  you open a new process contract template, set the terms of the contract, and activate the template.  Once you've done this, a fork() will create a child in a new process contract.  Here's an example:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
int
pre_fork_activate_template()
{
        int             tmpl_fd;

        /* Open a fresh process contract template. */
        if ((tmpl_fd = open64("/system/contract/process/template", O_RDWR)) == -1) {
                perror("Can't open /system/contract/process/template");
                return -1;
        }
        /* Set the terms of the contract; on any failure, don't leak tmpl_fd. */
        if (ct_pr_tmpl_set_fatal(tmpl_fd, CT_PR_EV_HWERR|CT_PR_EV_SIGNAL) != 0) {
                perror("Can't set process contract fatal events");
                close(tmpl_fd);
                return -1;
        }
        if (ct_tmpl_set_critical(tmpl_fd, CT_PR_EV_HWERR) != 0) {
                perror("Can't set process contract critical events");
                close(tmpl_fd);
                return -1;
        }
        /* Make this the active template for subsequent fork() calls. */
        if (ct_tmpl_activate(tmpl_fd) != 0) {
                perror("Can't activate process contract template");
                close(tmpl_fd);
                return -1;
        }

        return tmpl_fd;
}
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;

(Each failure path closes the template descriptor before returning.  Without that, a long-running daemon could start leaking file descriptors whenever the open64() succeeds but one of the following function calls fails.)&lt;br/&gt;&lt;br/&gt;

And after this, a fork() will create a child in a new process contract.  The parent will be the holder of this contract.  This may not be what you want to do, as contracts aren't destroyed automatically when a child exits.  This means that the parent could still be holding the contract long after the child is dead.  For a short-lived process, this isn't a problem, but a long-running daemon with this behavior would be accumulating dead process contracts that count against certain limits (as discussed in &lt;a href="http://www.opensolaris.org/jive/thread.jspa?messageID=46857&amp;"&gt;this thread&lt;/a&gt;.)  One way of handling this is to keep track of all the process contracts and reap them when they're no longer useful, or you could simply abandon the contract immediately.  I've done the latter in this bit of code:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
void
post_fork_contract_processing(int tmpl_fd, pid_t pid)
{
        char            ctl_path[PATH_MAX];
        ctid_t          ctid;
        ct_stathdl_t    stathdl;
        int             ctl_fd;
        int             pathlen;
        int             stat_fd;

        /*
         * First clear the active template.
         */
        if (ct_tmpl_clear(tmpl_fd) != 0) {
                perror("Parent can't clear active template");
                close(tmpl_fd);
                return;
        }
        close(tmpl_fd);

        /*
         * If the fork didn't succeed (pid &lt; 0), or if we're the child
         * (pid == 0), we have nothing more to do.
         */
        if (pid &lt;= 0) {
                return;
        }

        /*
         * Now abandon the contract we've created.  This involves the
         * following steps:
         * - Get the contract id (ct_status_read(), ct_status_get_id())
         * - Get an fd for the ctl file for this contract
         *   (/system/contract/process/&lt;ctid&gt;/ctl)
         * - Abandon the contract (ct_ctl_abandon(fd))
         */
        if ((stat_fd = open64(CTFS_ROOT "/process/latest", O_RDONLY)) == -1) {
                perror("Parent can't open latest");
                return;
        }
        if (ct_status_read(stat_fd, CTD_COMMON, &amp;stathdl) != 0) {
                perror("Parent can't read contract status");
                close(stat_fd);
                return;
        }
        if ((ctid = ct_status_get_id(stathdl)) &lt; 0) {
                perror("ct_status_get_id() failed");
                ct_status_free(stathdl);
                close(stat_fd);
                return;
        }
        ct_status_free(stathdl);
        close(stat_fd);

        pathlen = snprintf(ctl_path, sizeof (ctl_path), CTFS_ROOT "/process/%ld/ctl", (long)ctid);
        if (pathlen &lt; 0 || pathlen &gt;= (int)sizeof (ctl_path)) {
                fprintf(stderr,"Contract ctl file path exceeds maximum path length\n");
                return;
        }
        if ((ctl_fd = open64(ctl_path, O_WRONLY)) &lt; 0) {
                perror("Parent couldn't open control file for child contract");
                return;
        }
        if (ct_ctl_abandon(ctl_fd) &lt; 0) {
                perror("Parent couldn't abandon contract");
        }
        close(ctl_fd);
}
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;

Note that getting the control file for the process contract for the child process involves getting the id of that contract so that we can construct the path for it under /system/contract/process.&lt;br/&gt;&lt;br/&gt;

And then (for completeness), I used this code to test things:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
int
main(void)
{
        int             tmpl_fd;
        pid_t           pid;

        if ((tmpl_fd = pre_fork_activate_template()) &lt; 0) {
                exit(1);
        }
        /*
         * Now that we've set the active template, fork a process to see
         * a new contract created.
         */
        if ((pid = fork()) &lt; 0) {
                perror("Can't fork");
        }
        post_fork_contract_processing(tmpl_fd,pid);

        sleep(60);
}
&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-115402734150276356?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/115402734150276356/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=115402734150276356' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/115402734150276356'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/115402734150276356'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/08/playing-with-process-contracts-in.html' title='Playing with process contracts in Solaris 10'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-115289931101335142</id><published>2006-07-14T08:00:00.000-07:00</published><updated>2006-08-13T02:59:04.870-07:00</updated><title type='text'>OpenSSH SMF/contract problem</title><content type='html'>I'm investigating a weird SMF problem with our site ssh service.  (We run OpenSSH rather than the Sun-supplied ssh so that we can have a consistent version across different OS's.)  The symptom is that the ssh process doesn't get restarted after being killed, even though svcs still shows the process as being online:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
server:/var/svc/log&gt; svcs -a | grep ssh
online         Apr_27   svc:/site/ssh:default
server:/var/svc/log&gt; ps -ef | grep sshd
  jruser 28788 28786   0 17:41:11 ?  0:01 
         /usr/local/sbin/sshd -R
    root 13016     1   0 18:08:13 ?  0:00 
         /usr/local/sbin/sshd -R
    root 28786     1   0 17:41:10 ?  0:00 
         /usr/local/sbin/sshd -R
  jruser 13018 13016   0 18:08:14 ?  0:01 
         /usr/local/sbin/sshd -R
    root 24437     1   0 10:36:23 ?  0:00 
         /usr/local/sbin/sshd -R
  jruser 24439 24437   0 10:36:23 ?  0:00 
         /usr/local/sbin/sshd -R
server:/var/svc/log&gt;

server2:/var/tmp&gt; ssh server
ssh: connect to host server port 22: Connection refused
server2:/var/tmp&gt;
&lt;/pre&gt;

Hmm, knowing something about contracts, I check to see if the sshd processes that are hanging around are still part of some contract:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
server:/var/svc/log&gt; ctstat -a -v | grep 24437
        member processes:   1952 9779 9780 9781 10226 
10227 10228 10229 10230 10231 10232 10233 10234 10236 
10238 10241 10242 13016 13018 13020 15881 15882 15883 
15884 17016 17524 17821 17866 17873 17874 22188 22189 
24437 24439 24441 26576 28786 28788 28790 28844 28845
server:/var/svc/log&gt; 
&lt;/pre&gt;

That's odd, there are a lot more processes than just the sshd processes I've listed.  What are all of these?  It turns out that a lot of these are httpd processes:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
server:/var/svc/log&gt; ptree 10226
10226 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10227 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10229 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10230 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10231 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10232 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10233 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10241 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10242 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  9779  /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  9780  /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  9781  /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
server:/var/svc/log&gt;
&lt;/pre&gt;


So at a guess, someone ssh'd in to the server and started httpd, but for some reason, it didn't get put into its own process contract.  And some more poking (and some Googling) verifies this for me.  Apparently, Sun's sshd does the right thing with respect to process contracts, and OpenSSH doesn't.  (Or at least doesn't yet as of 4.3p2.)&lt;br/&gt;&lt;br/&gt;

Sun's sshd will attempt to put children into their own process contracts.  This way, if you do a 'svcadm disable ssh', existing connections aren't killed.  (I've mentioned process contracts &lt;a href="http://cmynhier.blogspot.com/2006/05/process-contracts-in-solaris-10.html"&gt;before&lt;/a&gt;, but you could just as easily read the man pages -- 'man contract' is a good place to start.)&lt;br/&gt;&lt;br/&gt;

OpenSSH (or at least 4.3p2) doesn't appear to do anything at all with process contracts.  What you end up with is every sshd process in the same contract.  What's more, you end up with the entire process tree in the same contract.  This includes the shells and any other processes started from those shells, etc.  (Well, with the exception of anything like ctrun(1) that creates new process contracts.)&lt;br/&gt;&lt;br/&gt;

Why is this a problem?  Well, for one, you get what you see above:  SMF doesn't restart a service that you think it should, nor does it even report that the service is down.  But there's the flipside:  in the above case, had I done a 'svcadm disable ssh', not only would I have killed the existing connection, but I would also have killed that running httpd instance.  That definitely violates the principle of least surprise.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-115289931101335142?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/115289931101335142/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=115289931101335142' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/115289931101335142'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/115289931101335142'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/07/openssh-smfcontract-problem.html' title='OpenSSH SMF/contract problem'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-115168268192874250</id><published>2006-07-03T14:00:00.000-07:00</published><updated>2006-08-13T02:57:07.386-07:00</updated><title type='text'>Solaris/Linux reliability</title><content type='html'>(Note:  the title and subject of this post are not a bald-faced attempt to generate traffic to this 
blog.)&lt;br/&gt;&lt;br/&gt;

I've recently started a new job in an environment that had historically been entirely Solaris but that introduced Linux a few years ago.  The reasons for introducing Linux were perfectly valid -- the developers wanted faster hardware, and at the time, the choice was between bigger Sun hardware (running Solaris) or Intel hardware (running Linux.)  At the time, the cost difference was such that there was no question which direction to take.  Given that Sun had abandoned Solaris x86, there was only one realistic choice for OS, but now that Sun has thrown its full weight behind Solaris x86, we're questioning whether we should go back to being an all-Solaris shop.&lt;br/&gt;&lt;br/&gt;

Other details aside, someone recently expressed the opinion that Solaris is more reliable than Linux.  This opinion was questioned:  was this anything more than just a guess?&lt;br/&gt;&lt;br/&gt;

I've done some Googling, but everything I find that discusses the comparative reliability of the two OS's seems to be opinion.  I can't seem to find any reasonable analysis with hard data.  And in the absence of hard data, I'll simply add another opinion.  I'll try to give some basis for that opinion, but I realize it's only that.&lt;br/&gt;&lt;br/&gt;

I believe that Solaris is the more reliable of the two OS's and is the one more suited to being used in an enterprise environment.  (And by "enterprise environment", I mean an environment in which you care about server uptime, even if it's only during certain periods of the day, like trading hours.  I'm not referring to environments which have been set up to handle server downtime, like web farms behind a load balancer or compute clusters with redundancy designed-in.)&lt;br/&gt;&lt;br/&gt;

What basis do I have for holding this opinion?  I can't claim much experience that's relevant to this issue, as most of my Solaris experience has been on SPARC hardware, and all of my experience with Linux has been on Intel hardware.  The SPARC hardware I've dealt with has been of higher quality than the Intel hardware, so that muddies the issue.  (And I was fortunate enough to be in grad school during the recent USII E-cache unpleasantness.)&lt;br/&gt;&lt;br/&gt;

I claim this opinion based on the following:  Solaris has an integrated fault management architecture.  Others have written about this, so instead of merely repeating what they've said, I'll link directly to what they've written.&lt;br/&gt;&lt;br/&gt;

Gavin Maltby gives a general overview of the fault management architecture &lt;a href="http://blogs.sun.com/roller/page/gavinm?entry=h3_solaris_fault_management_top"&gt;here&lt;/a&gt;.  He also discusses structured error events in &lt;a href="http://blogs.sun.com/roller/page/gavinm?entry=fault_management_top_10_2"&gt;this&lt;/a&gt; blog, including why they're important.  (These were entries in his as-yet-unfinished &lt;a href="http://blogs.sun.com/roller/page/gavinm?entry=solaris_10_top_ten_features"&gt;Top Solaris 10 features for fault management&lt;/a&gt; list.  I hope to see writeups for more of these, but they appear to be pretty time-intensive, given the level of detail he goes into.)  He also goes into quite a bit of detail related to fault management for the Opteron &lt;a href="http://blogs.sun.com/roller/page/gavinm?entry=amd_opteron_athlon64_turion64_fault"&gt;here&lt;/a&gt;.&lt;br/&gt;&lt;br/&gt;

But just having a structured way to report error events doesn't do much good unless you can act on that information.  What you want to do is to correlate errors and take preemptive measures before they become faults.  You also want to be able to isolate faults when they occur.  Gavin mentions diagnosis engines in his blog, and there's more about them &lt;a href="http://blogs.sun.com/roller/page/andy?entry=predictive_self_healing_eft_overview"&gt;here&lt;/a&gt;.  Liane Praza discusses smf(5)'s role in fault isolation &lt;a href="http://blogs.sun.com/roller/page/lianep/20050316"&gt;here&lt;/a&gt;, and Richard Elling discusses one part of fault isolation, memory page retirement, &lt;a href="http://blogs.sun.com/roller/page/relling?entry=analysis_of_memory_page_retirement"&gt;here&lt;/a&gt;.&lt;br/&gt;&lt;br/&gt;

Fault isolation is at the heart of my opinion that Solaris is more reliable than Linux.  Instead of throwing up a single fault boundary around the entire system ("Uncorrectable memory error?  Panic."), Solaris gives us much finer-grained boundaries ("Uncorrectable memory error?  Is it in the kernel?  All we can do is panic.  Is it in user space?  Restart the affected services.")  With respect to memory errors, given that most of a system's memory is in user space, the probability of a panic becomes something much smaller than the 100% you get with a monolithic fault boundary.  (And note that this division between user- and kernel-space is more than just twofold, as smf(5) lets us define the fault boundaries between user processes.  It's certainly feasible that the uncorrectable memory error affects the root of the dependency tree, but if it only affects some leaf in the dependency tree, there's only a single process to be restarted.)&lt;br/&gt;&lt;br/&gt;
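To put a rough number on that, here's a back-of-the-envelope sketch.  It assumes an uncorrectable error is equally likely to hit any part of physical memory, which is my own simplification, and the sizes in the example are made up; panic_probability() is just an illustrative name:&lt;br/&gt;&lt;br/&gt;
&lt;pre&gt;
```c
/*
 * Back-of-the-envelope estimate: if an uncorrectable error is equally
 * likely to land anywhere in physical memory, the chance that it hits
 * kernel memory (and so forces a panic) is the kernel's share of
 * memory.  The uniform-distribution assumption is a simplification.
 */
double
panic_probability(double kernel_mb, double total_mb)
{
        return kernel_mb / total_mb;
}
```
&lt;/pre&gt;&lt;br/&gt;&lt;br/&gt;
With a hypothetical 256MB of kernel memory on a 2GB server, panic_probability(256, 2048) comes out to 0.125, so under that assumption only about one uncorrectable memory error in eight would force a panic.&lt;br/&gt;&lt;br/&gt;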

With respect to the original question, is Solaris more reliable than Linux, it may be the case that Linux has made great advances in fault management, but my simple searches have yet to uncover that work.  If anyone reading this cares to point me in the right direction, I'd greatly appreciate it.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-115168268192874250?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/115168268192874250/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=115168268192874250' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/115168268192874250'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/115168268192874250'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/07/solarislinux-reliability.html' title='Solaris/Linux reliability'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-114865969286101677</id><published>2006-06-04T09:07:00.000-07:00</published><updated>2006-08-13T03:01:01.530-07:00</updated><title type='text'>Non-SMF restarter</title><content type='html'>In Solaris 10, the Service Management Facility (SMF) provides some nice fault-tolerance features.  If a service is killed for whatever reason, SMF will restart it (assuming you've configured SMF to do so for the service in question.)  
There may be cases in which you can't move a service under SMF, but you'd like a service to be restarted when it dies.  For this, we have ctrun(1).&lt;br/&gt;&lt;br/&gt;

At its core, ctrun is very simple:  it creates a process contract and runs a specified command in that process contract.  (I wrote a  little about process contracts &lt;a href="http://cmynhier.blogspot.com/2006/05/process-contracts-in-solaris-10.html"&gt;here&lt;/a&gt;, and there are always the man pages (contract(4), process(4), et al.))  It can also act as a restarter for the process if you tell it to.  For example,
&lt;pre&gt;
ctrun -r 0 -t -f hwerr,core,signal /usr/local/sbin/food
&lt;/pre&gt;
will run the food daemon and restart it if it dies from a hardware error, if it dumps core, or if it receives a fatal signal.  The '-r 0' tells ctrun to attempt to restart it an infinite number of times, and the '-t' tells ctrun to transfer any inherited subcontracts to the new process contract when it restarts food.&lt;br/&gt;&lt;br/&gt;

So why do we need ctrun if we have SMF?  Well, for one, you may not be the administrator of the system you want to run a restarting daemon on.  Or it might not be a daemon; it might simply be a long-running calculation that you want to start on Friday afternoon before you leave for the weekend (and that you've written in such a way that it frequently saves state so that it can pick up close to where it was when it terminated.)  Or you might be the administrator, and you might be working with production daemons, but you might have an in-house written rc system that you're not willing to scrap to move everything under SMF (especially given that you're a mixed Solaris and Linux shop and don't have SMF under Linux.)&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-114865969286101677?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/114865969286101677/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=114865969286101677' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114865969286101677'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114865969286101677'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/06/non-smf-restarter.html' title='Non-SMF restarter'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-114865853766491454</id><published>2006-05-26T08:06:00.000-07:00</published><updated>2006-08-13T03:01:52.346-07:00</updated><title type='text'>Process contracts in Solaris 10</title><content type='html'>The Service Management Facility (SMF) is a new feature in Solaris 10 that's replacing the old System V /etc/rc*.d way of starting and stopping system processes.  Of course, it's not just a new way to start and stop system processes, it's a way to manage &lt;i&gt;services&lt;/i&gt;.  Aside from other features, SMF can restart daemons that are killed for whatever reason.  I was discussing this with a colleague, and he wanted to know how this works.  Obviously, it could simply use the fact that init(1M) is the parent of all daemons and piggyback on the wait(2) loop that reaps children as they die.  But that wouldn't be useful in supporting other features of SMF, such as having delegated restarters.&lt;br/&gt;&lt;br/&gt;

The communication mechanism that allows the daemon restart and other features of SMF to work is &lt;i&gt;process contracts&lt;/i&gt;.  A process contract is essentially another relationship between processes, similar to the parent-child relationship between processes in Unix, but also similar to process groups, as contract IDs aren't necessarily unique.  Every process has an associated contract ID, and every contract has a contract holder.  The contract holder receives information about certain things that happen to processes with that contract ID; namely process creation and normal termination (processes added to or removed from the contract), the last process to exit a contract (the contract is now empty), and other fatal events (a fatal signal, a core file was generated, or the process was killed due to an uncorrectable hardware error.)&lt;br/&gt;&lt;br/&gt;

One of the major benefits of process contracts is that they allow one process to define a fault boundary around a set of subprocesses.  What this means in practice is that you have finer-grained control over the fault boundaries on your server.  Prior to Solaris 10, the fault boundary essentially included every process running on a server.  In the case of an uncorrectable memory error, for example, the server would either panic (if the error was in kernel address space) or "gracefully" reboot (if the error was in user space.)  (See more &lt;a href="http://blogs.sun.com/roller/page/lianep?entry=smf_5_and_fault_isolation"&gt;here&lt;/a&gt;.)  Given that the fault boundary included all processes, this was the only option.&lt;br/&gt;&lt;br/&gt;

With process contracts, however, there's finer-grained control over the fault boundaries.  The boundary can be defined to include, for example, httpd and any of its child processes.  If an uncorrectable memory error kills the parent httpd process, SMF can be notified and restart that process.  There's no need to reboot the server.  And given that SMF allows us to specify dependencies between services, there's no need to do anything with the sendmail process if there's no dependency on httpd.&lt;br/&gt;&lt;br/&gt;

(This is by no means an exhaustive look at process contracts.  More information can be found in the man pages (contract(4), process(4), et al.) and other documentation.)&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-114865853766491454?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/114865853766491454/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=114865853766491454' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114865853766491454'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114865853766491454'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/05/process-contracts-in-solaris-10.html' title='Process contracts in Solaris 10'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-114813870881500036</id><published>2006-05-20T14:15:00.000-07:00</published><updated>2006-08-13T03:03:39.713-07:00</updated><title type='text'>ZFS I/O reordering benchmark</title><content type='html'>I've posted about the write performance of ZFS compared to UFS, ext3fs and reiserfs (different OS's but exactly identical hardware (the same server) on the same disk cylinders.)  That wasn't the extent of the benchmarking I performed, however.  
I'll detail another one that I did, although it's not as much a benchmark as a demonstration of the I/O reordering that ZFS does to give reads preference over writes.&lt;br/&gt;&lt;br/&gt;

The basic idea is to read a file while there's a heavy write load to the filesystem.  The write load I was applying to the filesystem was the same one I detailed in &lt;a href="http://cmynhier.blogspot.com/2006/05/zfs-benchmarking.html"&gt;this earlier entry&lt;/a&gt;.  Again, I started out with a load that was serious overkill for the filesystems and then reran the test with a more reasonable workload.&lt;br/&gt;&lt;br/&gt;

I first created a data file to be read, 2GB for the overkill load, 512MB for the reasonable load.  (The sizes were chosen to guarantee that the read of the file always took less than the time to generate the write load.  I could have generated a longer write load so that I could reasonably compare the two different tests, but the tests were meant to be compared across filesystems with a similar write load, not across loads for the same filesystem.)&lt;br/&gt;&lt;br/&gt;

Instead of using cat or dd to read the file, I used md5sum.  Why?  Well, for one, 'cat file &gt; /dev/null' has interesting behavior under Solaris.  The cat utility actually mmaps the file and then relies on demand paging to read the file as needed.  So 'cat file &gt; /dev/null' essentially translates into a no-op.  'cat file | cat &gt; /dev/null' achieves the goal, but it seems silly.  I could have used dd, but I actually used md5sum in order to slightly handicap ZFS.  Given that ZFS involves a lot of computation (checksums and whatnot), I wanted to give the CPU some more work to do so as to intentionally interfere with that.  I also did it in order to create a more realistic test.  In general, you don't read data from disk just to throw it away, you do something with it.&lt;br/&gt;&lt;br/&gt;

To quote from my earlier post for completeness here:  The server I was using was a 2 x 2.8GHz Dell 1850 with a single 73GB SCSI disk and 2GB RAM. I ran the tests using both UFS and ZFS under Solaris x86 and both ext3fs and reiserfs under Linux. To avoid differences in performance between the inside and the outside of the disk, I used the same cylinders on the disk for all tests (plus or minus a cylinder or two.)&lt;br/&gt;&lt;br/&gt;

So here are the results for the overkill write load:&lt;br/&gt;&lt;br/&gt;

&lt;table&gt;&lt;tr&gt;&lt;td&gt;Filesystem&lt;/td&gt;&lt;td&gt;Time (min:sec, unloaded)&lt;/td&gt;&lt;td&gt;Time (min:sec,loaded)&lt;/td&gt;&lt;td&gt;Ratio loaded:unloaded&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;UFS&lt;/td&gt;&lt;td&gt;0:50.2&lt;/td&gt;&lt;td&gt;5:50&lt;/td&gt;&lt;td&gt;8.2:1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ZFS&lt;/td&gt;&lt;td&gt;0:31.8&lt;/td&gt;&lt;td&gt;0:36.0&lt;/td&gt;&lt;td&gt;1.13:1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ext3fs&lt;/td&gt;&lt;td&gt;0:36.3&lt;/td&gt;&lt;td&gt;54:21&lt;/td&gt;&lt;td&gt;89.9:1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;reiserfs&lt;/td&gt;&lt;td&gt;0:33.4&lt;/td&gt;&lt;td&gt;69:45&lt;/td&gt;&lt;td&gt;124.6:1&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;br/&gt;&lt;br/&gt;

Wow.  So while ZFS performed the read under load in close to the same time it took with no load, ext3fs took 90 times as long, and reiserfs took 125 times as long.  Again, all I can say is, "Wow."  But I also have to emphasize that this write load was too heavy for the filesystems.  (Although one could argue that the write load &lt;i&gt;wasn't&lt;/i&gt; too heavy, given that ZFS could handle it gracefully.  But it's certainly not a real-world workload, so while the data is interesting, it would be hard to argue that it's useful.)&lt;br/&gt;&lt;br/&gt;

And the results for the reasonable write load.  (Note that I didn't run the test for UFS, mostly due to time constraints when I was doing this.  Remember also that this was a 512MB file, not 2GB.):&lt;br/&gt;&lt;br/&gt;

&lt;table&gt;&lt;tr&gt;&lt;td&gt;Filesystem&lt;/td&gt;&lt;td&gt;Time (min:sec, unloaded)&lt;/td&gt;&lt;td&gt;Time (min:sec, loaded)&lt;/td&gt;&lt;td&gt;Ratio loaded:unloaded&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ZFS&lt;/td&gt;&lt;td&gt;0:09.0&lt;/td&gt;&lt;td&gt;0:10.3&lt;/td&gt;&lt;td&gt;1.14:1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ext3fs&lt;/td&gt;&lt;td&gt;0:08.8&lt;/td&gt;&lt;td&gt;5:27&lt;/td&gt;&lt;td&gt;37.2:1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;reiserfs&lt;/td&gt;&lt;td&gt;0:08.7&lt;/td&gt;&lt;td&gt;3:50&lt;/td&gt;&lt;td&gt;26.4:1&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;br/&gt;&lt;br/&gt;
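
The ratio column is just the loaded read time divided by the unloaded read time.  As a quick sanity check, here is a minimal Python sketch that recomputes the ratios for the reasonable load from the times in the table above, rounded to one decimal place.  The helper names are illustrative and were not part of the original test setup.&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
def to_seconds(stamp):
    """Convert a 'min:sec' string such as '5:27' or '0:10.3' to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def load_ratio(unloaded, loaded):
    """Return loaded time divided by unloaded time, rounded to one decimal."""
    return round(to_seconds(loaded) / to_seconds(unloaded), 1)


# (unloaded, loaded) read times copied from the reasonable-load table
times = {
    "ZFS": ("0:09.0", "0:10.3"),
    "ext3fs": ("0:08.8", "5:27"),
    "reiserfs": ("0:08.7", "3:50"),
}

for name, (unloaded, loaded) in times.items():
    print(name, load_ratio(unloaded, loaded))
&lt;/pre&gt;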

Okay, so these numbers aren't quite as ludicrous as the first.  They're still impressive, though.  The ZFS engineers appear to have done a very good job.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-114813870881500036?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/114813870881500036/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=114813870881500036' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114813870881500036'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114813870881500036'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/05/zfs-io-reordering-benchmark.html' title='ZFS I/O reordering benchmark'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-114805171602613667</id><published>2006-05-19T07:43:00.000-07:00</published><updated>2006-08-13T03:04:57.733-07:00</updated><title type='text'>ZFS benchmarking</title><content type='html'>ZFS was made publicly available on November 16, 2005.  I was doing my usual scan of Sun blogs and saw some dozens of different entries announcing this.  I'd been waiting a year or so to get my hands on ZFS, so I feverishly set about downloading the appropriate bits so that I could install a server and start playing with it.  
(I'm being very literal when I say "feverishly" -- I wasn't feeling that well that day and measured myself at 102 degrees or thereabouts (39 degrees for anyone who wonders why I was above the boiling point of water) when I got home that evening.)&lt;br/&gt;&lt;br/&gt;

Over the next few weeks, I ran some benchmarks comparing ZFS to UFS, ext3fs and reiserfs.  I avoided the standard benchmarks and used a script I'd developed earlier in the year to compare some cheap NAS implementations.  This script was originally intended simply to generate a large amount of data in a filesystem hierarchy that mirrored what we would be doing with that cheap NAS; a companion script, run across a couple dozen servers, generated the read traffic.  The company I work for would probably balk if I put that script here, but it essentially created a filesystem hierarchy that looked like [00-FF]/[00-FF]/[0-F]/[1-64], where the 64 files at the leaves are ~10k each.&lt;br/&gt;&lt;br/&gt;
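
To make that layout concrete, here is a hypothetical Python sketch of how a leaf-directory index and a file number could map onto a [00-FF]/[00-FF]/[0-F]/[1-64] path.  This is not the original script: the function name and the way an index is split across the three directory levels are assumptions made purely for illustration.&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
def leaf_file_path(leaf_index, file_number):
    """Map a leaf-directory index and a file number (1-64) to a path.

    Assumed layout: two levels of two-digit hex directories, one level
    of one-digit hex directories, then numbered files in each leaf.
    """
    top = (leaf_index // (256 * 16)) % 256
    middle = (leaf_index // 16) % 256
    leaf = leaf_index % 16
    return f"{top:02X}/{middle:02X}/{leaf:X}/{file_number}"


# First file of the first leaf, and last file of the 1200th leaf.
print(leaf_file_path(0, 1))
print(leaf_file_path(1199, 64))
&lt;/pre&gt;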

I started out by trying to determine the parameters I wanted to use when running the benchmark against different filesystems, as I wanted to get 100% disk utilization.  Unfortunately, I used ZFS to determine these parameters.  This turned out to be serious overkill for the other filesystems, as the numbers below indicate.  Here are the parameters I used, with an explanation for the values:&lt;pre&gt;write-data -c 5 -u 1200 -m 64&lt;/pre&gt;I'm running 5 concurrent processes, each creating 1200 leaf directories with 64 files each.  So 5 * 1200 * 64 * 10k is about 3.7GB of data in all (plus metadata.)&lt;br/&gt;&lt;br/&gt;
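
The arithmetic behind that figure can be checked with a few lines of Python.  This is only a sketch: it assumes "10k" means 10 KiB per file and reports the total in GiB, which is one plausible reading of the numbers above.&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
processes = 5            # -c 5: concurrent writer processes
leaf_dirs = 1200         # -u 1200: leaf directories per process
files_per_leaf = 64      # -m 64: files per leaf directory
file_size = 10 * 1024    # assumption: "~10k" taken as 10 KiB

total_files = processes * leaf_dirs * files_per_leaf
total_bytes = total_files * file_size

print(total_files)
print(round(total_bytes / 2**30, 2))
&lt;/pre&gt;

Under those assumptions the run writes 384,000 files, which comes to a little under 3.7 GiB before metadata.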

The server I was using was a 2 x 2.8GHz Dell 1850 with a single 73GB SCSI disk and 2GB RAM. I ran the tests using both UFS and ZFS under Solaris x86 and both ext3fs and reiserfs under Linux. To avoid differences in performance between the inside and the outside of the disk, I used the same cylinders on the disk for all tests (plus or minus a cylinder or two.)  The times include syncing the data to disk.&lt;br/&gt;&lt;br/&gt;

Here are the results of these runs (averaged over several runs each.)  The "starting empty" times represent runs with a  newly-created filesystem.  The "consecutive run" times represent runs when I don't clean up after a "starting empty" run, i.e., the files are being rewritten into the existing filesystem structure.&lt;br/&gt;&lt;br/&gt;

&lt;table&gt;&lt;tr&gt; &lt;td&gt;Filesystem:&lt;/td&gt; &lt;td&gt;UFS&lt;/td&gt; &lt;td&gt;ZFS&lt;/td&gt; &lt;td&gt;ext3fs&lt;/td&gt; &lt;td&gt;reiserfs&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td&gt;Time (min:sec)(starting empty)&lt;/td&gt; &lt;td&gt;28:16&lt;/td&gt; &lt;td&gt;2:49&lt;/td&gt; &lt;td&gt;60:39&lt;/td&gt; &lt;td&gt;46:25&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td&gt;Time (min:sec)(consecutive run)&lt;/td&gt; &lt;td&gt;59:57&lt;/td&gt; &lt;td&gt;5:34&lt;/td&gt; &lt;td&gt;20:26&lt;/td&gt; &lt;td&gt;50:31&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;br/&gt;&lt;br/&gt;

So ZFS is the fastest all around for this particular workload by a spectacular margin.  (And it's probably interesting to note that ext3fs was the only filesystem that was actually faster on "consecutive" runs.  Given the asynchronous metadata updates, it might not be that surprising.)  But, as I stated earlier, this was a workload designed to keep the disk busy when ZFS is being used.  So while I'd demonstrated that ZFS can handle a heavy workload better than the other filesystems, I hadn't demonstrated that it's faster under a reasonable workload.  So I re-calibrated to keep all the filesystems below 100% disk utilization, which ended up being these parameters:&lt;pre&gt;write-data -c 1 -u 1200 -m 64&lt;/pre&gt;So instead of 5 concurrent processes, there's just 1.  And here are the results:&lt;br/&gt;&lt;br/&gt;

&lt;table&gt;&lt;tr&gt; &lt;td&gt;Filesystem:&lt;/td&gt; &lt;td&gt;UFS&lt;/td&gt; &lt;td&gt;ZFS&lt;/td&gt; &lt;td&gt;ext3fs&lt;/td&gt; &lt;td&gt;reiserfs&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td&gt;Time (min:sec)(starting empty)&lt;/td&gt;&lt;td&gt;3:24&lt;/td&gt; &lt;td&gt;0:35&lt;/td&gt;&lt;td&gt;7:28&lt;/td&gt;&lt;td&gt;4:43&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td&gt;Time (min:sec)(consecutive run)&lt;/td&gt; &lt;td&gt;11:01&lt;/td&gt; &lt;td&gt;0:38&lt;/td&gt;&lt;td&gt;1:10&lt;/td&gt;&lt;td&gt;2:34&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;br/&gt;&lt;br/&gt;

So ZFS still won, but the margin of victory wasn't quite as large as with the first test.  And here we see reiserfs doing better on consecutive runs, too.  But while the above is informative, it still doesn't show the full story.  It's also interesting to note the disk utilization during these runs.  (Note that this wasn't a rigorous measurement, just eyeballing iostat output during the tests.)&lt;br/&gt;&lt;br/&gt;

&lt;table&gt;&lt;tr&gt; &lt;td&gt;Filesystem:&lt;/td&gt; &lt;td&gt;UFS&lt;/td&gt; &lt;td&gt;ZFS&lt;/td&gt; &lt;td&gt;ext3fs&lt;/td&gt; &lt;td&gt;reiserfs&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td&gt;% Utilization (starting empty)&lt;/td&gt; &lt;td&gt;95-100&lt;/td&gt;&lt;td&gt;45-50&lt;/td&gt;&lt;td&gt;95-100&lt;/td&gt;&lt;td&gt;95-100&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td&gt;% Utilization (consecutive run)&lt;/td&gt;&lt;td&gt;95-100&lt;/td&gt;&lt;td&gt;51-56&lt;/td&gt;&lt;td&gt;95-100&lt;/td&gt;&lt;td&gt;95-100&lt;/td&gt; &lt;/tr&gt;&lt;/table&gt;&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-114805171602613667?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/114805171602613667/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=114805171602613667' title='14 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114805171602613667'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114805171602613667'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/05/zfs-benchmarking.html' title='ZFS benchmarking'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>14</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-114727214134901878</id><published>2006-05-10T06:03:00.000-07:00</published><updated>2006-12-22T15:33:33.100-08:00</updated><title type='text'>Second DTrace success -- readdir_r() considered 
harmful</title><content type='html'>My first DTrace success pointed us towards using libumem in our threaded SMTP server.  In order to protect the interests of the company I work for, I won't go into too much detail about our email plant architecture, but this application runs in different modes.  One of them is the one that I've already described, the mode running on our MX servers that accept connections from the outside world.  Once mail is received, we filter it (in ways that should be obvious considering the signal-to-noise ratio) and then hand off what's left to be delivered to user mailboxes.  The application that does this work is that same SMTP server running in a different mode.&lt;br/&gt;&lt;br/&gt;

We've long known that there are performance problems with this mode of the application.  We run it mostly on single-cpu servers (v120's) because it doesn't scale well to 2 or 4 CPUs.  Even on the single-cpu servers, the CPUs are mostly idle, and the server plant is larger than it ideally needs to be.  The developers blamed NAS performance, but the systems guys knew better, as nothing on the filer indicated that there was a problem.  And nobody had put much effort into tracking down the problem.&lt;br/&gt;&lt;br/&gt;

Based on our success with libumem with this application in its other mode, we naturally assumed that using libumem here would solve all of our problems.  We tried that here, and we saw some improvement, but not nearly what we had expected.  Using a similar analysis (again noting that I hadn't yet discovered plockstat, which would have made the analysis much faster), it looked like all of the lock contention was coming from readdir_r().&lt;br/&gt;&lt;br/&gt;

But this doesn't make sense, right?  Using readdir_r() is the Right Thing To Do(TM) in a threaded application, as it's thread-safe, whereas readdir() isn't.  So why were we seeing problems here?&lt;br/&gt;&lt;br/&gt;

The man page contains this interesting tidbit about readdir():&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
     struct dirent *readdir(DIR *dirp);

[ ... ]

     The  pointer  returned by readdir() points to data which may
     be overwritten by another call  to  readdir()  on  the  same
     directory  stream.  This  data is not overwritten by another
     call to readdir() on a different directory stream.
&lt;/pre&gt;

Based on that description, one might assume that readdir_r() has been implemented with a mutex per directory stream.  Does the code back this up?  Let's look at the &lt;a href=http://cvs.opensolaris.org/source/xref/on/usr/src/lib/libc/port/gen/readdir_r.c&gt;source&lt;/a&gt;:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
extern mutex_t _dirent_lock;

[ ... ]

int
readdir_r(DIR *dirp, struct dirent *entry, struct dirent **result)
{
    struct dirent *dp; /* -&gt; directory data */
    int saveloc = 0;

    lmutex_lock(&amp;_dirent_lock);

[ ... ]

    lmutex_unlock(&amp;_dirent_lock);
    *result = entry;
    return (0);
}
&lt;/pre&gt;

So, no, we don't have a lock per directory stream; there's a single lock per process, which obviously doesn't scale to hundreds of threads.  Given that this is a problem, what can we do about it?  Do we actually need to use the thread-safe version?&lt;br/&gt;&lt;br/&gt;

So when does the thread-safe readdir_r() make sense?  It uses the lock to protect the contents of the struct dirent.  But the above man page snippet indicates that, if we were simply using readdir(), that data structure wouldn't get overwritten by a readdir() on a different directory stream.  So it seems that the only time you'd need that protection is if multiple threads are sharing the same directory stream.  Our application certainly isn't doing so, as every thread is doing its own opendir()/readdir_r()/closedir().  So we should be safe to replace the offending readdir_r() call with readdir().&lt;br/&gt;&lt;br/&gt;
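
The access pattern described above, where every thread opens, reads and closes its own directory stream, can be sketched in Python.  This is an analogy rather than the application's code: os.scandir() stands in for opendir()/readdir()/closedir(), the function names are made up, and Python's threading says nothing about libc's locking.  It only shows that no stream is ever shared between threads.&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
import os
import tempfile
import threading


def list_directory(path, results):
    """Open a private directory stream, read it fully, and close it."""
    # The stream object lives only inside this call, so no other
    # thread can ever read from it.
    with os.scandir(path) as stream:
        results[path] = sorted(entry.name for entry in stream)


def list_in_threads(paths):
    """List each directory in its own thread; no stream is shared."""
    results = {}
    threads = [
        threading.Thread(target=list_directory, args=(path, results))
        for path in paths
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        paths = []
        for name in ("a", "b"):
            path = os.path.join(root, name)
            os.mkdir(path)
            open(os.path.join(path, name + ".txt"), "w").close()
            paths.append(path)
        print(list_in_threads(paths))
&lt;/pre&gt;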

And having done so, we saw a dramatic increase in performance (as measured in throughput per server.)  I wish I could include the MRTG graph of thread count on these servers, but imagine a profile of the United States moving from the Rocky Mountains to the plains, and you'll get a good image of what it looked like.  The improvement was enough to allow us to cut this server plant in half while still leaving a large buffer to handle spikes.&lt;br/&gt;&lt;br/&gt;

BTW, the email thread containing &lt;a href=http://cert.uni-stuttgart.de/archive/bugtraq/2005/11/msg00097.html&gt;this message&lt;/a&gt; from Casper Dik is pretty interesting WRT the above problem.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-114727214134901878?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/114727214134901878/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=114727214134901878' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114727214134901878'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114727214134901878'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/05/second-dtrace-success-readdirr.html' title='Second DTrace success -- readdir_r() considered harmful'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-114633801463267970</id><published>2006-05-04T12:13:00.000-07:00</published><updated>2006-08-13T03:09:35.583-07:00</updated><title type='text'>First DTrace success</title><content type='html'>While we were evaluating the T2000 server, we ran into some performance problems in our application.  The software couldn't scale up to fully utilize the hardware.&lt;br/&gt;&lt;br/&gt;

The traditional approach to debugging this would have been to increase the debugging level and then analyze the logs to try to find the problem.  The problem with doing so in this case was that the mutex around writes to the log file was one of the candidates for lock contention.  Whereas we might have gotten useful information had this analysis pointed elsewhere, had it pointed at the log file mutex, the analysis would have told us nothing.  We couldn't have known whether we had introduced that lock contention ourselves by logging more information.&lt;br/&gt;&lt;br/&gt;

This was our first use of Solaris 10 in our environment, and I'd been waiting quite a while to use DTrace to solve a real problem, so I took the opportunity to do so.  Having had limited experience using DTrace, it took longer to find the problem than it should have, and the path I ended up taking probably wasn't optimal.  (In retrospect, the approach I took was embarrassingly brute-force, but I'll detail it here anyway.)&lt;br/&gt;&lt;br/&gt;

I'll skip some of the dead-end fumbling I did and just present the final successful path.  I was curious to see where the application was spending all of its time.  I used this DTrace script to determine that:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
#!/usr/sbin/dtrace -s

pid$1:::entry
{
    self-&gt;ts[probemod,probefunc] = timestamp;
}


pid$1:::return
/ self-&gt;ts[probemod,probefunc] != 0/
{
    /* Clause-local variable, so concurrent threads can't clobber it. */
    this-&gt;elapsed = timestamp - self-&gt;ts[probemod,probefunc];
    @functime[probemod,probefunc] = sum(this-&gt;elapsed);
    self-&gt;ts[probemod,probefunc] = 0;
}

&lt;/pre&gt;

(Note that the above was very expensive, as it enabled tens of thousands of probes.  In retrospect, I might have tried a different approach to get this information.)&lt;br/&gt;&lt;br/&gt;

The results of running this script told me that the application was spending ~50% of its time in ___lwp_mutex_timedlock().  Okay, so what is the code path leading to this function?&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
#!/usr/sbin/dtrace -s

pid$1::___lwp_mutex_timedlock:entry
{
    self-&gt;ts = timestamp;
}


pid$1::___lwp_mutex_timedlock:return
/self-&gt;ts != 0/
{
    /* Clause-local variable, so concurrent threads can't clobber it. */
    this-&gt;elapsed = timestamp - self-&gt;ts;
    @stacktime[ustack()] = sum(this-&gt;elapsed);
    self-&gt;ts = 0;
}
&lt;/pre&gt;

Running the above script didn't immediately point to a smoking gun.  No single code path accounted for most of the time spent here; it was fairly well distributed through the code.  But the code paths did share a common theme that would have been immediately apparent had I limited the stack depth to 5 (e.g., with ustack(5)).  I would then have seen these top two culprits:&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
              libc.so.1`___lwp_mutex_timedlock+0x30
              libc.so.1`queue_lock+0x60
              libc.so.1`mutex_lock_queue+0xcc
              libc.so.1`lmutex_lock+0xe0
              libc.so.1`malloc+0x44

              libc.so.1`___lwp_mutex_timedlock+0x30
              libc.so.1`queue_lock+0x60
              libc.so.1`mutex_lock_queue+0xcc
              libc.so.1`lmutex_lock+0xe0
              libc.so.1`free+0x1c
&lt;/pre&gt;

So all of this lock contention was coming from libc's malloc() and free().  This actually turned out to be an ideal situation, as we could just set LD_PRELOAD to use libumem.  There was no need to make code changes to the application itself.  When we tried this, we saw our lock contention problems disappear.&lt;br/&gt;&lt;br/&gt;

The key points I want to make about the experience are these:&lt;br/&gt;&lt;br/&gt;

- Using DTrace allowed us to avoid the traditional method of additional logging, which could have masked the real problem given that there's a mutex involved in logging.
- DTrace let us discover the problem far more quickly than the traditional method likely would have.  Even given my inexperience, it was maybe a few hours' worth of work.  With some experience (and with some knowledge of plockstat), it might have taken ten minutes or less.
- DTrace pointed us to a problem that we likely wouldn't have considered, as it was outside of our code.  And given that we hadn't considered this possibility prior to discovering the problem, we certainly didn't have debugging statements ready to be enabled that would have caught this problem.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-114633801463267970?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/114633801463267970/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=114633801463267970' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114633801463267970'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114633801463267970'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/05/first-dtrace-success.html' title='First DTrace success'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-114625520395999267</id><published>2006-04-28T12:34:00.000-07:00</published><updated>2006-08-13T03:10:30.006-07:00</updated><title type='text'>T2000 (Niagara) Evaluation</title><content type='html'>We received a Sun T2000 for evaluation in early January.  As I covered in an earlier post, we'd developed a model for determining whether we'd see any cost savings from using this new hardware.  
We now needed to determine how this server would perform with respect to our current hardware choices.&lt;br/&gt;&lt;br/&gt;

The first application we tested on the T2000 was our in-house SMTP daemon running on our MX servers.  (We're a large ISP, so the fact that we receive a lot of mail every day should be a surprise to no one.)  This is a threaded application, so it was an excellent candidate for the Niagara CPU (well, okay, the UltraSPARC T1 CPU.)  The robust nature of the SMTP transaction was a plus, as we wouldn't lose anything if we happened to push the application beyond its breaking point.  (I should note that we were testing this with live traffic in this case.  Some might consider this foolish, but, as I said, this particular application is robust enough that doing so was safe.  We certainly wouldn't have done so with other applications.)&lt;br/&gt;&lt;br/&gt;

We set up the server and started playing with the weights on the load-balancer it sits behind.  We safely pushed it up to taking twice the traffic of our current hardware choice (Sun v210's).  But the server fell over just below 3x.  (Thread count exploded, response time quickly became unacceptable, etc.)  I was very unhappy with this, because we'd expected to do much better.  I'd been hoping for about 6x.  But we weren't maxing out the hardware or the network, so it must have been a software problem.  I'll save the details for a later post, but DTrace pointed us to libc's malloc() and free().  After some quick QA running the application with libumem, we tried again.&lt;br/&gt;&lt;br/&gt;

This time, we did see the performance we were expecting, and then some.  We eventually pushed the T2000 up to taking 8x the traffic of one of our current servers.  Of course, this was running in four zones, each bound to one of the physical interfaces, so that we wouldn't max out the network.  (We only had FastE available at the time.  We could have gotten GigE, or we could have aggregated the four interfaces, but using zones was the quickest approach.)&lt;br/&gt;&lt;br/&gt;

Of course, that 8x figure is an apples-to-oranges comparison, as it compares the app running with libumem to the app running without libumem.  On the v210, libumem gave us an 80% performance improvement, so that 8x actually dropped down to a 4.5x.  Again, not what I'd been hoping for, but above the threshold where using the T2000 became cost-wise advantageous.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-114625520395999267?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/114625520395999267/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=114625520395999267' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114625520395999267'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114625520395999267'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/04/t2000-niagara-evaluation.html' title='T2000 (Niagara) Evaluation'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-114623231453373374</id><published>2006-04-28T05:31:00.000-07:00</published><updated>2006-08-13T03:11:31.350-07:00</updated><title type='text'>T2000 (Niagara) Evaluation (Prelude)</title><content type='html'>(I started out to write about my experience evaluating the T2000.  
It turns out that I had a few things to say as a prelude, so I've broken this into multiple posts.)&lt;br/&gt;&lt;br/&gt;

It appears that at least one person has read my blog, as I have a request to post details about the T2000 evaluation I recently performed.  I won't be able to say as much as I'd like to, as some of the information might be considered "material" for SEC purposes, and I'd rather err on the safe side.  For example, I'd love to say how much we'd save in operating costs over three years by using the Niagara servers, but that's probably saying too much.  But I should be able to say enough to make writing this worthwhile.&lt;br/&gt;&lt;br/&gt;

Before we received the T2000, there was some discussion of breakeven ratio, i.e., how many of our current servers (e.g., v210's)  would we need to replace with a T2000 for it to be worth doing so.  The initial conversations took into account nothing more than the price of the hardware, but after a quick whiteboard estimate of space and power savings, I worked up a spreadsheet to determine the breakeven ratio (or to calculate the savings based on the measured ratio, depending on how you look at it.)  (And I'll state here that my first attempt at this spreadsheet was a freshman effort.  A colleague with more accounting experience reworked it to be what it should be.)  &lt;br/&gt;&lt;br/&gt;

(I'll add here that I'm a little bit embarrassed to admit that we hadn't been taking space and power costs into consideration for our earlier hardware purchases.  OTOH, there still appear to be quite a few people out there who make the assumption that the cheapest white box they can get is the way to go.)&lt;br/&gt;&lt;br/&gt;

Once we started looking at the space and power costs for servers, the breakeven ratios for the T2000 vs. our current servers dropped quite a bit.  As a purely theoretical example, if we assume that we'll end up paying $16,000 for a T2000, and we're comparing this to an x86 server that we'd pay $2,000 for, the breakeven ratio based purely on hardware cost is 8:1.  But if we factor in space and power costs, that ratio falls to 4.2:1.  (This example is using real space and power costs, but it assumes that the application currently running on those x86 servers could be moved to a SPARC server with no cost.)&lt;br/&gt;&lt;br/&gt;
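
The shape of that calculation can be sketched in a few lines of Python.  The function below is an assumption about how such a ratio might be computed, not the actual spreadsheet: it divides the total cost of one new server by the total cost of one old server.  The hardware prices come from the example above, while the running costs are hypothetical placeholders chosen only to show the direction of the effect; they are not the real space and power figures behind the 4.2:1 ratio.&lt;br/&gt;&lt;br/&gt;

&lt;pre&gt;
def breakeven_ratio(new_price, old_price, new_running=0.0, old_running=0.0):
    """Old servers that one new server must replace to break even.

    Running costs (space and power over the comparison period) default
    to zero, which gives the hardware-only ratio.
    """
    return (new_price + new_running) / (old_price + old_running)


# Hardware prices from the example: $16,000 versus $2,000.
print(breakeven_ratio(16000, 2000))

# Hypothetical running costs, for illustration only.
print(breakeven_ratio(16000, 2000, new_running=2000, old_running=2000))
&lt;/pre&gt;

Because the running costs are added to both sides, they weigh far more heavily on the cheap server, which is why the ratio drops once they are included.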

To sum up the above into an obvious statement:  It's important to look at more than just the cost of hardware when deciding what hardware to purchase.  See more on this &lt;a href="http://www.sun.com/software/solaris/ning.jsp"&gt;here&lt;/a&gt;.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-114623231453373374?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/114623231453373374/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=114623231453373374' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114623231453373374'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114623231453373374'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/04/t2000-niagara-evaluation-prelude.html' title='T2000 (Niagara) Evaluation (Prelude)'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-114605512767157465</id><published>2006-04-26T05:25:00.000-07:00</published><updated>2006-08-13T03:12:18.206-07:00</updated><title type='text'>DTrace</title><content type='html'>I first heard about DTrace almost two years ago now.  I was at a Sun event here in New York, and they'd gotten Jarod Jenson to come in and talk about DTrace.  This was the only time I've ever seen him, so I don't know if he's like this all the time, but he was unbelievably energetic.  
He somehow managed to fit three hours' worth of presentation into an hour and a half or less.  It was obvious that he was very excited about this, and he did an excellent job of demonstrating just how useful DTrace could be.&lt;br/&gt;&lt;br/&gt;

This was before Solaris 10 was officially released and even before DTrace was available in Solaris Express, as I discovered to my disappointment the next day when I installed the currently-available Solaris Express.  I played with it some when it did become available, but I was unable to do anything really useful with it until we started using it in production, which we first did when we were evaluating a T2000.  I'll detail my successes with DTrace in later posts.  I'm hesitant to do so, as the analyses I did with DTrace were close to the simplest things one can do with it.  In the end, however, I think that the simplicity itself will speak to the power of DTrace.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-114605512767157465?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/114605512767157465/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=114605512767157465' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114605512767157465'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114605512767157465'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/04/dtrace.html' title='DTrace'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-26802961.post-114588960358263040</id><published>2006-04-24T07:38:00.000-07:00</published><updated>2006-08-13T03:13:37.520-07:00</updated><title type='text'>First post</title><content type='html'>In the grand tradition of blogging, I'll state that this is my first blog and then ask myself the question, why have I decided to blog.&lt;br/&gt;&lt;br/&gt;

So why have I decided to blog? I've considered it before, but I've always decided not to. As I see it, nobody's interested in what I have to say. OTOH, that doesn't seem to have stopped millions of others from blogging, so I figured I'd give it a shot. After all, if Dylan could put out records with a voice like that, why couldn't Hendrix?&lt;br/&gt;&lt;br/&gt;

What am I likely to blog about, assuming I blog at all after this first entry? Mostly technical stuff. I'm not that terribly interested in talking about my personal life in a public forum, nor is the public interested in hearing about my personal life. No matter how cute the new twins are.&lt;br/&gt;&lt;br/&gt;

Who am I? Given that I'd prefer to focus on technical stuff, I'll answer the part of that question that asks, "What do I do?" I'm a Unix sysadmin. I've been a Unix admin for over ten years now (not counting the two years I spent at Princeton working on my Ph.D. in computer science before deciding it wasn't for me). I started in the Computer Science Department at the University of Tennessee, where I worked (nominally) part-time while getting my Master's degree and full-time for a couple of years after that before my essay at an academic career. Since then, I've been working the same job for almost six years, first for Juno Online Services, then for United Online, the company formed from the merger of NetZero and Juno.&lt;br/&gt;&lt;br/&gt;

That's likely enough for now. If I don't follow up on this, it's likely not a big loss.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/26802961-114588960358263040?l=cmynhier.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cmynhier.blogspot.com/feeds/114588960358263040/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=26802961&amp;postID=114588960358263040' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114588960358263040'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/26802961/posts/default/114588960358263040'/><link rel='alternate' type='text/html' href='http://cmynhier.blogspot.com/2006/04/first-post.html' title='First post'/><author><name>Chad Mynhier</name><uri>http://www.blogger.com/profile/10160030726742219585</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
