Thursday, January 25, 2007
Microstate accounting in Solaris 10 and CPU latency
Solaris 10 included some pretty major features such as ZFS, Zones and DTrace. There were also other features that weren't quite as major but that are pretty nifty in and of themselves. One of these features is that microstate accounting is turned on by default in Solaris 10. (It was optional in earlier releases of Solaris.)
Eric Schrock blogged about microstate accounting here.
Using 'prstat -m' you can see some of this information. For example:
The additional columns (TRP through LAT) indicate what percentage of time the process spent handling traps, text page faults, data page faults, sleeping or waiting for CPU.
Given that this is only a two-processor machine, it's fairly obvious from the load average reported above that the machine is overloaded, but the load average doesn't give quite as detailed picture of what's going as does 'prstat -m' output. In particular, the LAT column (CPU latency) is pretty instructive. All of the processes in the above example are spending some percentage of their time fully able to run but waiting to be switched onto a CPU. You could have guessed this from the load average, but load average couldn't have told you that some of these processes were spending over 80% of their time sitting in a run queue.
Using 'prstat -m' you can see some of this information. For example:
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP 9559 mynhier 27 3.9 0.0 0.0 0.0 0.0 0.0 69 0 186 3K 0 ube/1 9562 mynhier 18 6.6 0.0 0.0 0.0 0.0 0.0 76 0 92 7K 0 ube/1 9555 mynhier 21 2.7 0.0 0.0 0.0 0.0 0.0 76 0 143 2K 0 cc1/1 9564 mynhier 14 8.8 0.0 0.0 0.0 0.0 0.0 77 0 124 13K 0 ube/1 9551 mynhier 18 2.2 0.0 0.0 0.0 0.0 0.0 80 0 123 2K 0 cc1/1 9549 mynhier 17 2.1 0.0 0.0 0.0 0.0 0.0 81 0 117 2K 0 cc1/1 9560 mynhier 16 2.6 0.0 0.0 0.0 0.0 0.0 82 0 350 6K 0 iropt/1 9507 mynhier 5.9 1.6 0.0 0.0 0.0 0.0 59 34 14 31 2K 0 dmake/1 9508 mynhier 3.4 1.0 0.0 0.0 0.0 0.0 75 21 14 17 1K 0 dmake/1 9489 mynhier 2.9 0.7 0.0 0.0 0.0 0.0 96 0.8 15 13 1K 0 dmake/1 9490 mynhier 2.9 0.7 0.0 0.0 0.0 0.0 96 0.8 15 13 1K 0 dmake/1 479 mynhier 0.7 0.0 0.0 0.0 0.0 0.0 98 1.7 28 0 113 13 Xorg/1 9543 mynhier 0.1 0.4 0.0 0.0 0.0 0.0 67 32 7 1 215 0 cc/1 9536 mynhier 0.1 0.4 0.0 0.0 0.0 0.0 79 20 6 3 207 0 cc/1 9527 mynhier 0.1 0.4 0.0 0.0 0.0 0.0 95 4.4 6 5 203 0 cc/1 9554 mynhier 0.1 0.2 0.0 0.0 0.0 0.0 86 14 2 0 221 0 gcc/1 9550 mynhier 0.1 0.2 0.0 0.0 0.0 0.0 85 15 2 0 180 0 gcc/1 9548 mynhier 0.1 0.2 0.0 0.0 0.0 0.0 84 16 1 3 163 0 gcc/1 9546 mynhier 0.1 0.2 0.0 0.0 0.0 0.0 87 12 3 0 200 0 cc/1 9544 mynhier 0.1 0.2 0.0 0.0 0.0 0.0 83 16 1 1 239 0 sh/1 Total: 111 processes, 238 lwps, load averages: 5.46, 5.79, 5.68 >
The additional columns (TRP through LAT) indicate what percentage of time the process spent handling traps, text page faults, data page faults, sleeping or waiting for CPU.
Given that this is only a two-processor machine, it's fairly obvious from the load average reported above that the machine is overloaded, but the load average doesn't give quite as detailed picture of what's going as does 'prstat -m' output. In particular, the LAT column (CPU latency) is pretty instructive. All of the processes in the above example are spending some percentage of their time fully able to run but waiting to be switched onto a CPU. You could have guessed this from the load average, but load average couldn't have told you that some of these processes were spending over 80% of their time sitting in a run queue.