Friday, July 14, 2006

 

OpenSSH SMF/contract problem

I'm investigating a weird SMF problem with our site ssh service. (We run OpenSSH rather than the Sun-supplied ssh so that we can have a consistent version across different OS's.) The symptom is that the sshd process doesn't get restarted after being killed, even though svcs still shows the service as online:

server:/var/svc/log> svcs -a | grep ssh
online         Apr_27   svc:/site/ssh:default
server:/var/svc/log> ps -ef | grep sshd
  jruser 28788 28786   0 17:41:11 ?  0:01 /usr/local/sbin/sshd -R
    root 13016     1   0 18:08:13 ?  0:00 /usr/local/sbin/sshd -R
    root 28786     1   0 17:41:10 ?  0:00 /usr/local/sbin/sshd -R
  jruser 13018 13016   0 18:08:14 ?  0:01 /usr/local/sbin/sshd -R
    root 24437     1   0 10:36:23 ?  0:00 /usr/local/sbin/sshd -R
  jruser 24439 24437   0 10:36:23 ?  0:00 /usr/local/sbin/sshd -R
server:/var/svc/log>

server2:/var/tmp> ssh server
ssh: connect to host server port 22: Connection refused
server2:/var/tmp>
Hmm, knowing something about contracts, I check to see if the sshd processes that are hanging around are still part of some contract:

server:/var/svc/log> ctstat -a -v | grep 24437
        member processes:   1952 9779 9780 9781 10226 
10227 10228 10229 10230 10231 10232 10233 10234 10236 
10238 10241 10242 13016 13018 13020 15881 15882 15883 
15884 17016 17524 17821 17866 17873 17874 22188 22189 
24437 24439 24441 26576 28786 28788 28790 28844 28845
server:/var/svc/log> 
That's odd: there are a lot more processes here than just the sshd processes listed above. What are all of these? It turns out that a lot of them are httpd processes:

server:/var/svc/log> ptree 10226
10226 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10227 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10229 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10230 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10231 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10232 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10233 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10241 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10242 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  9779  /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  9780  /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  9781  /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
server:/var/svc/log>
So at a guess, someone ssh'd into the server and started httpd, but for some reason, it didn't get put into its own process contract. And some more poking (and some Googling) verifies this for me. Apparently, Sun's sshd does the right thing WRT process contracts, and OpenSSH doesn't. (Or at least doesn't yet as of 4.3p2.)

Sun's sshd will attempt to put children into their own process contracts. This way, if you do a 'svcadm disable ssh', existing connections aren't killed. (I've mentioned process contracts before, but you could just as easily read the man pages -- contract(4) is a good place to start.)
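
Out of curiosity, I sketched what this looks like with libcontract(3LIB). To be clear, this is not Sun's actual sshd code -- it's an untested sketch of the calls I'd expect, pieced together from the contract(4) and libcontract(3LIB) man pages, and the function names are my own. The idea: activate a process-contract template just before forking the connection handler, then clear the template and abandon the newly created contract after the fork.

/*
 * Rough sketch: put each forked child into its own process contract.
 * This is NOT Sun's sshd code; it's just the libcontract(3LIB) calls
 * I'd expect, pieced together from the man pages.
 * Compile with: cc -o ctfork ctfork.c -lcontract
 */
#include <sys/types.h>
#include <sys/contract/process.h>
#include <fcntl.h>
#include <libcontract.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* Activate a process contract template; call just before fork(). */
static int
contract_pre_fork(void)
{
	int fd = open("/system/contract/process/template", O_RDWR);

	if (fd == -1)
		return (-1);

	/*
	 * No critical events for us; a hardware error is informative to
	 * us, but fatal to the child's contract.
	 */
	if (ct_tmpl_set_critical(fd, 0) != 0 ||
	    ct_tmpl_set_informative(fd, CT_PR_EV_HWERR) != 0 ||
	    ct_pr_tmpl_set_fatal(fd, CT_PR_EV_HWERR) != 0 ||
	    ct_tmpl_activate(fd) != 0) {
		(void) close(fd);
		return (-1);
	}
	return (fd);
}

/* Clear the template and (in the parent) abandon the new contract. */
static void
contract_post_fork(int tmpl_fd, pid_t pid)
{
	char path[PATH_MAX];
	int stat_fd, ctl_fd;
	ct_stathdl_t st;
	ctid_t ctid;

	(void) ct_tmpl_clear(tmpl_fd);	/* both parent and child do this */
	(void) close(tmpl_fd);
	if (pid == 0)
		return;

	/* Parent: look up the contract the fork just created ... */
	if ((stat_fd = open("/system/contract/process/latest", O_RDONLY)) == -1)
		return;
	if (ct_status_read(stat_fd, CTD_COMMON, &st) != 0) {
		(void) close(stat_fd);
		return;
	}
	ctid = ct_status_get_id(st);
	ct_status_free(st);
	(void) close(stat_fd);

	/* ... and abandon it, so the child is off on its own. */
	(void) snprintf(path, sizeof (path),
	    "/system/contract/process/%ld/ctl", (long)ctid);
	if ((ctl_fd = open(path, O_WRONLY)) != -1) {
		(void) ct_ctl_abandon(ctl_fd);
		(void) close(ctl_fd);
	}
}

int
main(void)
{
	int tmpl_fd = contract_pre_fork();
	pid_t pid;

	if ((pid = fork()) == -1) {
		perror("fork");
		return (1);
	}
	if (tmpl_fd != -1)
		contract_post_fork(tmpl_fd, pid);

	if (pid == 0) {
		/* The child (and anything it forks) now lives in its own contract. */
		(void) execl("/bin/sleep", "sleep", "300", (char *)NULL);
		_exit(127);
	}
	return (0);
}

With something like that in place, disabling or killing the parent daemon leaves each child's contract (and the processes in it) alone, which is the behavior described above.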

OpenSSH (or at least 4.3p2) doesn't appear to do anything at all with process contracts. What you end up with is every sshd process in the same contract. What's more, you end up with the entire process tree in the same contract. This includes the shells and any other processes started from those shells, etc. (Well, with the exception of anything like ctrun(1) that creates new process contracts.)

Why is this a problem? Well, for one, you get what you see above: SMF doesn't restart a service that you think it should, nor does it even report that the service is down. But there's the flip side: in the above case, had I done a 'svcadm disable ssh', not only would I have killed the existing connections, but I would also have killed that running httpd instance. That definitely violates the principle of least surprise.
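
One footnote on the diagnosis itself: the membership information that ctstat printed above is also available programmatically through libcontract(3LIB). Here's a rough, untested sketch -- my own code, based only on the contract(4) and libcontract(3LIB) man pages, and you'll likely need to run it as root -- that dumps the member processes of a given process contract:

/*
 * Rough sketch (untested): print the member processes of a process
 * contract, given its ctid -- roughly what 'ctstat -v' showed above.
 * Compile with: cc -o ctmembers ctmembers.c -lcontract
 */
#include <sys/types.h>
#include <sys/contract/process.h>
#include <fcntl.h>
#include <libcontract.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	char path[PATH_MAX];
	int fd;
	ct_stathdl_t st;
	pid_t *pids;
	uint_t npids, i;

	if (argc != 2) {
		(void) fprintf(stderr, "usage: %s ctid\n", argv[0]);
		return (1);
	}

	/* Every contract shows up as a directory under /system/contract. */
	(void) snprintf(path, sizeof (path),
	    "/system/contract/process/%s/status", argv[1]);
	if ((fd = open(path, O_RDONLY)) == -1) {
		perror(path);
		return (1);
	}

	/* CTD_ALL asks for full detail, which includes the member list. */
	if (ct_status_read(fd, CTD_ALL, &st) != 0 ||
	    ct_pr_status_get_members(st, &pids, &npids) != 0) {
		(void) fprintf(stderr, "can't read contract status\n");
		return (1);
	}

	(void) printf("contract %s: %u member processes\n", argv[1], npids);
	for (i = 0; i < npids; i++)
		(void) printf("  %ld\n", (long)pids[i]);

	ct_status_free(st);	/* also releases the pid array */
	(void) close(fd);
	return (0);
}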


Monday, July 03, 2006

 

Solaris/Linux reliability

(Note: the title and subject of this post are not a bald-faced attempt to generate traffic to this blog.)

I've recently started a new job in an environment that had historically been entirely Solaris but that introduced Linux a few years ago. The reasons for introducing Linux were perfectly valid -- the developers wanted faster hardware, and at the time, the choice was between bigger Sun hardware (running Solaris) and Intel hardware (running Linux). The cost difference was such that there was no question which direction to take. Given that Sun had abandoned Solaris x86, there was only one realistic choice of OS, but now that Sun has thrown its full weight behind Solaris x86, we're questioning whether we should go back to being an all-Solaris shop.

Other details aside, someone recently expressed the opinion that Solaris is more reliable than Linux. This opinion was questioned: was this anything more than just a guess?

I've done some Googling, but everything I find that discusses the comparative reliability of the two OS's seems to be opinion. I can't seem to find any reasonable analysis with hard data. And in the absence of hard data, I'll simply add another opinion. I'll try to give some basis for that opinion, but I realize it's only that.

I believe that Solaris is the more reliable of the two OS's and is the one more suited to being used in an enterprise environment. (And by "enterprise environment", I mean an environment in which you care about server uptime, even if it's only during certain periods of the day, like trading hours. I'm not referring to environments that have been set up to handle server downtime, like web farms behind a load balancer or compute clusters with redundancy designed in.)

What basis do I have for holding this opinion? I can't claim much experience that's relevant to this issue, as most of my Solaris experience has been on SPARC hardware, and all of my experience with Linux has been on Intel hardware. The SPARC hardware I've dealt with has been of higher quality than the Intel hardware, so that muddies the issue. (And I was fortunate enough to be in grad school during the recent USII E-cache unpleasantness.)

I base this opinion on the following: Solaris has an integrated fault management architecture. Others have written about this, so instead of merely repeating what they've said, I'll link directly to what they've written.

Gavin Maltby gives a general overview of the fault management architecture here. He also discusses structured error events in this blog, including why they're important. (These were entries in his as-yet-unfinished Top Solaris 10 features for fault management list. I hope to see writeups for more of these, but they appear to be pretty time-intensive, given the level of detail he goes into.) He also goes into quite a bit of detail related to fault management for the Opteron here.

But just having a structured way to report error events doesn't do much good unless you can act on that information. What you want to do is to correlate errors and take preemptive measures before they become faults. You also want to be able to isolate faults when they occur. Gavin mentions diagnosis engines in his blog, and there's more about them here. Liane Praza discusses smf(5)'s role in fault isolation here, and Richard Elling discusses one part of fault isolation, memory page retirement, here.

Fault isolation is at the heart of my opinion that Solaris is more reliable than Linux. Instead of throwing up a single fault boundary around the entire system ("Uncorrectable memory error? Panic."), Solaris gives us much finer-grained boundaries ("Uncorrectable memory error? Is it in the kernel? All we can do is panic. Is it in user space? Restart the affected services."). With respect to memory errors, given that most of a system's memory is in user space, the probability of a panic becomes much smaller than the 100% you get with a monolithic fault boundary. (And note that this division between user and kernel space is more than just twofold, as smf(5) lets us define the fault boundaries between user processes. It's certainly possible that the uncorrectable memory error affects the root of the dependency tree, but if it only affects some leaf in the dependency tree, there's only a single process to be restarted.)

With respect to the original question (is Solaris more reliable than Linux?), it may be the case that Linux has made great advances in fault management, but my simple searches have yet to uncover that work. If anyone reading this cares to point me in the right direction, I'd greatly appreciate it.

