Friday, July 14, 2006

 

OpenSSH SMF/contract problem

I'm investigating a weird SMF problem with our site ssh service. (We run OpenSSH rather than the Sun-supplied ssh so that we can have a consistent version across different operating systems.) The symptom is that the sshd listener doesn't get restarted after being killed, even though svcs still shows the service as being online:

server:/var/svc/log> svcs -a | grep ssh
online         Apr_27   svc:/site/ssh:default
server:/var/svc/log> ps -ef | grep sshd
  jruser 28788 28786   0 17:41:11 ?  0:01 /usr/local/sbin/sshd -R
    root 13016     1   0 18:08:13 ?  0:00 /usr/local/sbin/sshd -R
    root 28786     1   0 17:41:10 ?  0:00 /usr/local/sbin/sshd -R
  jruser 13018 13016   0 18:08:14 ?  0:01 /usr/local/sbin/sshd -R
    root 24437     1   0 10:36:23 ?  0:00 /usr/local/sbin/sshd -R
  jruser 24439 24437   0 10:36:23 ?  0:00 /usr/local/sbin/sshd -R
server:/var/svc/log>

server2:/var/tmp> ssh server
ssh: connect to host server port 22: Connection refused
server2:/var/tmp>
Hmm, knowing something about contracts, I check to see if the sshd processes that are hanging around are still part of some contract:

server:/var/svc/log> ctstat -a -v | grep 24437
        member processes:   1952 9779 9780 9781 10226 10227 10228 10229 10230 10231 10232 10233 10234 10236 10238 10241 10242 13016 13018 13020 15881 15882 15883 15884 17016 17524 17821 17866 17873 17874 22188 22189 24437 24439 24441 26576 28786 28788 28790 28844 28845
server:/var/svc/log> 
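
Grepping for a PID only shows the member list, not which contract it belongs to. A rough sketch of a helper that maps a PID to its contract ID follows. The function name contract_of_pid is made up for illustration, and the parsing assumes that 'ctstat -a -v' prints one summary line per contract (CTID, zone ID, type, ...) followed by indented "label: value" lines. That layout is an assumption and may differ between releases.

```shell
# Hypothetical helper: print the ID of each process contract whose
# "member processes" list contains the given PID.
# Reads `ctstat -a -v` style output on stdin (layout assumed, see above).
# Needs an awk that supports -v (nawk or /usr/xpg4/bin/awk on Solaris).
contract_of_pid() {
    awk -v pid="$1" '
        # Summary line of a process contract: numeric CTID, type in column 3.
        /^[0-9]/ && $3 == "process" { ctid = $1; in_members = 0; next }
        # Start of the member list; strip the label so only PIDs remain.
        /member processes:/ { in_members = 1; sub(/.*member processes:/, "") }
        # Any other "label:" line ends the member list.
        in_members && /:/ { in_members = 0 }
        # Scan the PIDs on this line (also covers wrapped continuation lines).
        in_members { for (i = 1; i <= NF; i++) if ($i == pid) print ctid }
    '
}
```

It would be used as 'ctstat -a -v | contract_of_pid 24437'. If your ps supports the ctid output format, 'ps -o pid,ctid,args -p 24437' should give the same answer more directly.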
That's odd, there are a lot more processes than just the sshd processes I've listed. What are all of these? It turns out that a lot of these are httpd processes:

server:/var/svc/log> ptree 10226
10226 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10227 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10229 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10230 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10231 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10232 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10233 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10241 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  10242 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  9779  /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  9780  /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
  9781  /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
server:/var/svc/log>
So at a guess, someone ssh'd in to the server and started httpd, but for some reason it didn't get put into its own process contract. Some more poking (and some Googling) confirms this for me. Apparently, Sun's sshd does the right thing with respect to process contracts, and OpenSSH doesn't. (Or at least doesn't yet as of 4.3p2.)

Sun's sshd will attempt to put children into their own process contracts. This way, if you do a 'svcadm disable ssh', existing connections aren't killed. (I've mentioned process contracts before, but you could just as easily read the man pages -- 'man contract' is a good place to start.)

OpenSSH (or at least 4.3p2) doesn't appear to do anything at all with process contracts. What you end up with is every sshd process in the same contract. What's more, you end up with the entire process tree in the same contract. This includes the shells and any other processes started from those shells, etc. (Well, with the exception of anything like ctrun(1) that creates new process contracts.)

Why is this a problem? Well, for one, you get what you see above: because the contract still has live members (the leftover sshd sessions and everything started from them), SMF never sees it go empty, so it doesn't restart a service that you think it should, nor does it even report that the service is down. But there's the flip side: in the above case, had I done a 'svcadm disable ssh', not only would I have killed the existing connections, but I would also have killed that running httpd instance. That definitely violates the principle of least surprise.
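
For what it's worth, a long-lived daemon started from such a login can be moved out of the shared contract by hand with ctrun(1). A minimal sketch, assuming the '-l none' lifetime option behaves as described in ctrun(1) (ctrun returns immediately and leaves the new contract orphaned). The wrapper name start_in_own_contract is hypothetical, not something SMF or OpenSSH provides.

```shell
# Hypothetical helper: run a command in a brand-new process contract so that
# it is no longer a member of the contract it was launched from.
# "-l none" asks ctrun to exit right away instead of waiting on the contract.
start_in_own_contract() {
    ctrun -l none "$@"
}

# Example (paths taken from the ptree output above):
#   start_in_own_contract /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
```

This only helps for processes you remember to wrap; it does nothing about the shells and sshd children themselves sharing the service's contract.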


Comments:
Great post. I disabled OpenSSH today while doing a make install of a newer version (temporarily using telnet) and found that all database processes had disappeared, with nothing in the alert log or anywhere else. It took a while to figure out, but it was completely reproducible. Yes, the ossh service, per svcs -xv, had gone into maintenance mode, and the log said it killed a contract. I am new to SMF/contracts, so if I can't find an alternative, say changing the service type in its definition, then I'll just use init.d for ssh.
 
I found other posts where you had submitted Solaris contracts support patches for OpenSSH back in '06.

Tested the --with-solaris-contracts compile switch and it worked perfectly.

Thanks for sorting this out.
 