Thursday, January 25, 2007
Microstate accounting in Solaris 10 and CPU latency
Solaris 10 included some pretty major features such as ZFS, Zones, and DTrace. There were also other features that weren't quite as major but are pretty nifty in their own right. One of these is that microstate accounting is turned on by default in Solaris 10. (It was optional in earlier releases of Solaris.)
Eric Schrock blogged about microstate accounting here.
Using 'prstat -m' you can see some of this information. For example:
   PID USERNAME  USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
  9559 mynhier    27 3.9 0.0 0.0 0.0 0.0 0.0  69   0 186  3K   0 ube/1
  9562 mynhier    18 6.6 0.0 0.0 0.0 0.0 0.0  76   0  92  7K   0 ube/1
  9555 mynhier    21 2.7 0.0 0.0 0.0 0.0 0.0  76   0 143  2K   0 cc1/1
  9564 mynhier    14 8.8 0.0 0.0 0.0 0.0 0.0  77   0 124 13K   0 ube/1
  9551 mynhier    18 2.2 0.0 0.0 0.0 0.0 0.0  80   0 123  2K   0 cc1/1
  9549 mynhier    17 2.1 0.0 0.0 0.0 0.0 0.0  81   0 117  2K   0 cc1/1
  9560 mynhier    16 2.6 0.0 0.0 0.0 0.0 0.0  82   0 350  6K   0 iropt/1
  9507 mynhier   5.9 1.6 0.0 0.0 0.0 0.0  59  34  14  31  2K   0 dmake/1
  9508 mynhier   3.4 1.0 0.0 0.0 0.0 0.0  75  21  14  17  1K   0 dmake/1
  9489 mynhier   2.9 0.7 0.0 0.0 0.0 0.0  96 0.8  15  13  1K   0 dmake/1
  9490 mynhier   2.9 0.7 0.0 0.0 0.0 0.0  96 0.8  15  13  1K   0 dmake/1
   479 mynhier   0.7 0.0 0.0 0.0 0.0 0.0  98 1.7  28   0 113  13 Xorg/1
  9543 mynhier   0.1 0.4 0.0 0.0 0.0 0.0  67  32   7   1 215   0 cc/1
  9536 mynhier   0.1 0.4 0.0 0.0 0.0 0.0  79  20   6   3 207   0 cc/1
  9527 mynhier   0.1 0.4 0.0 0.0 0.0 0.0  95 4.4   6   5 203   0 cc/1
  9554 mynhier   0.1 0.2 0.0 0.0 0.0 0.0  86  14   2   0 221   0 gcc/1
  9550 mynhier   0.1 0.2 0.0 0.0 0.0 0.0  85  15   2   0 180   0 gcc/1
  9548 mynhier   0.1 0.2 0.0 0.0 0.0 0.0  84  16   1   3 163   0 gcc/1
  9546 mynhier   0.1 0.2 0.0 0.0 0.0 0.0  87  12   3   0 200   0 cc/1
  9544 mynhier   0.1 0.2 0.0 0.0 0.0 0.0  83  16   1   1 239   0 sh/1
 Total: 111 processes, 238 lwps, load averages: 5.46, 5.79, 5.68
The additional columns (TRP through LAT) indicate what percentage of time the process spent handling traps, servicing text page faults, servicing data page faults, waiting on user locks, sleeping, and waiting for CPU, respectively.
Given that this is only a two-processor machine, it's fairly obvious from the load average reported above that the machine is overloaded, but the load average doesn't give quite as detailed a picture of what's going on as the 'prstat -m' output does. In particular, the LAT column (CPU latency) is pretty instructive. All of the processes in the above example are spending some percentage of their time fully able to run but waiting to be switched onto a CPU. You could have guessed this from the load average, but the load average couldn't have told you that some of these processes were spending over 80% of their time sitting in a run queue.
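If you want to dig into a single offender, prstat can narrow the view to one process and break it out per thread. Something like this (the PID is just one lifted from the output above):

  $ prstat -mL -p 9559 5

The '-L' reports one line per lwp instead of per process, and the trailing 5 is the sampling interval in seconds, so you can watch LAT change as the run queue drains.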
Saturday, January 20, 2007
It's the little things
Guh. There's nothing quite as embarrassing as having made a simple formatting mistake in your blog and then having propagated it for a couple of months.
I just recently realized that this really only breaks how my blog appears in Firefox. Given that I use Opera, Safari, and IE, and given that they all displayed what I wanted to see, I never realized the mistake. So now I just republished two months' worth of entries to fix it.
Guh.
Monday, January 08, 2007
ZFS and committing data to stable storage
I ran across a couple of blog entries that might be of interest to anyone using ZFS on an NFS server or using ZFS on a storage array with a battery-backed cache.
WRT NFS over ZFS, this entry discusses why you might see worse performance with NFS over ZFS than with NFS over UFS (at least for a single-threaded load). The article points out that it's actually an apples-to-oranges comparison, as ZFS implements the correct behavior WRT NFS semantics -- NFS COMMITs are guaranteed to be on stable storage (via the ZFS intent log, or ZIL), regardless of whether the disk write cache is enabled. With UFS (or other filesystems), NFS COMMITs are considered successful once they've hit the disk cache, which is decidedly not stable storage.
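The local analogue of an NFS COMMIT is fsync(2), which likewise isn't supposed to return until the data is on stable storage. A minimal sketch of what that guarantee looks like from a C program (the path here is made up):

#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
        const char buf[] = "some data\n";

        /* Hypothetical path; any file on a ZFS filesystem will do. */
        int fd = open("/tank/datafile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd == -1) {
                perror("open");
                exit(1);
        }
        if (write(fd, buf, sizeof (buf) - 1) == -1) {
                perror("write");
                exit(1);
        }
        /*
         * The stable-storage barrier. On ZFS the ZIL keeps this honest
         * even with the disk write cache enabled; with the cache enabled,
         * UFS can return with the data still sitting in volatile cache.
         */
        if (fsync(fd) == -1) {
                perror("fsync");
                exit(1);
        }
        (void) close(fd);
        return (0);
}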
On a similar note (and also linked from the entry above), this entry discusses ZFS performance problems with storage arrays that have battery-backed cache. In this case the cache can legitimately be considered stable storage, but ZFS still forces a flush of the cache after every write to the ZIL. The article discusses a couple of ways to handle this problem. One of them is definitely a bad idea (disabling the ZIL), but the other has merit (instructing the storage array to ignore the flush commands from ZFS).
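For the record, the ZIL-disabling approach that gets passed around is an /etc/system tunable, something like the following (if I'm remembering the tunable's name correctly). I'm showing it only so you know what to avoid:

* /etc/system fragment -- the bad idea. Don't do this on an NFS server:
* with the ZIL disabled, synchronous semantics (fsync, NFS COMMIT)
* are no longer honored across a crash or power loss.
set zfs:zil_disable = 1

The array-side alternative (having the array ignore SCSI cache-flush requests) is configured on the array itself and is vendor-specific, so there's no single incantation to show for it.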
WRT NFS over ZFS, this entry discusses why you might see worse performance with NFS over ZFS when compared to NFS over UFS (at least for a singly-threaded load.) The article points out that it's actually an apples-to-oranges comparison, as ZFS implements the correct behavior WRT NFS semantics -- NFS COMMITs are guaranteed to be on stable storage (via the ZFS intent log (ZIL)), regardless of whether the disk cache is enabled. With UFS (or other filesystems), NFS COMMITs are considered successful once they've hit the disk cache, which is decidedly not stable storage.
On a similar note (and linked from the above, also), this entry discusses ZFS performance problems with storage arrays using battery-backed cache. In this case, the cache can be considered stable storage, but ZFS still forces a flush of the cache after every write to the ZIL. The article discusses a couple of ways to handle this problem. One of these is definitely a bad idea (disabling the ZIL), but the other has merit (instructing the storage array to ignore the flush commands from ZFS.)
Thursday, January 04, 2007
Machine That Goes PING!
I was looking through the source code for ping (the Solaris version), and I noticed this:
if (getenv("MACHINE_THAT_GOES_PING") != NULL)
        stats = _B_TRUE;
So if you want the "-s" functionality without always having to specify the "-s", set this environment variable.
This fix doesn't appear to have been integrated before Solaris 10 6/06 was set in stone, though, and I don't have access to an 11/06 or recent Solaris Express installation to see it in action. It's not terribly earth-shaking as code changes go; I just found it amusing.
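Assuming you're on a build with the fix, using it would look something like this (host name made up):

  $ MACHINE_THAT_GOES_PING=1 ping somehost

which should behave like 'ping -s somehost', reporting per-packet statistics instead of just "somehost is alive".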
Wednesday, January 03, 2007
ZFS stores file creation time
I've probably run across this at some point and forgotten it, but I happened to notice it while looking at something else. ZFS stores the file creation time (crtime) in addition to the traditional atime, mtime, and ctime:
juser@server> sudo zdb -dddd trashme
[ ... ]
    Object  lvl   iblk   dblk  lsize  asize  type
        14    2    16K   128K  3.75M  3.75M  ZFS plain file
                                        264   bonus  ZFS znode
        path    /foo
        atime   Wed Jan  3 16:20:42 2007
        mtime   Wed Jan  3 16:20:42 2007
        ctime   Wed Jan  3 16:20:42 2007
        crtime  Wed Jan  3 15:26:53 2007
[ ... ]
And, of course, this can be seen in the source code:
/*
 * This is the persistent portion of the znode. It is stored
 * in the "bonus buffer" of the file. Short symbolic links
 * are also stored in the bonus buffer.
 */
typedef struct znode_phys {
        uint64_t zp_atime[2];   /*  0 - last file access time */
        uint64_t zp_mtime[2];   /* 16 - last file modification time */
        uint64_t zp_ctime[2];   /* 32 - last file change time */
        uint64_t zp_crtime[2];  /* 48 - creation time */
        /* ... */
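Each timestamp is a pair of 64-bit words. Going by the ZFS_TIME_DECODE macro in the same source, the first word holds seconds and the second nanoseconds, so turning zp_crtime into an ordinary timespec is roughly this (my own sketch, not code from the source):

#include <sys/types.h>
#include <sys/time.h>

/*
 * Sketch of what ZFS_TIME_DECODE does: word 0 is seconds,
 * word 1 is nanoseconds.
 */
static void
decode_zfs_time(const uint64_t ztime[2], timespec_t *tsp)
{
        tsp->tv_sec = (time_t)ztime[0];
        tsp->tv_nsec = (long)ztime[1];
}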