Friday, July 14, 2006
OpenSSH SMF/contract problem
I'm investigating a weird SMF problem with our site ssh service. (We run OpenSSH rather than the Sun-supplied ssh so that we can have a consistent version across different OSes.) The symptom is that sshd doesn't get restarted after being killed, even though svcs still shows the service as being online:
server:/var/svc/log> svcs -a | grep ssh
online Apr_27 svc:/site/ssh:default
server:/var/svc/log> ps -ef | grep sshd
jruser 28788 28786 0 17:41:11 ? 0:01 /usr/local/sbin/sshd -R
root 13016 1 0 18:08:13 ? 0:00 /usr/local/sbin/sshd -R
root 28786 1 0 17:41:10 ? 0:00 /usr/local/sbin/sshd -R
jruser 13018 13016 0 18:08:14 ? 0:01 /usr/local/sbin/sshd -R
root 24437 1 0 10:36:23 ? 0:00 /usr/local/sbin/sshd -R
jruser 24439 24437 0 10:36:23 ? 0:00 /usr/local/sbin/sshd -R
server:/var/svc/log>

server2:/var/tmp> ssh server
ssh: connect to host server port 22: Connection refused
server2:/var/tmp>

Hmm, knowing something about contracts, I check to see if the sshd processes that are hanging around are still part of some contract:
server:/var/svc/log> ctstat -a -v | grep 24437
member processes: 1952 9779 9780 9781 10226 10227 10228 10229 10230 10231 10232 10233 10234 10236 10238 10241 10242 13016 13018 13020 15881 15882 15883 15884 17016 17524 17821 17866 17873 17874 22188 22189 24437 24439 24441 26576 28786 28788 28790 28844 28845
server:/var/svc/log>

That's odd, there are a lot more processes than just the sshd processes I've listed. What are all of these? It turns out that a lot of these are httpd processes:
server:/var/svc/log> ptree 10226
10226 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
10227 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
10229 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
10230 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
10231 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
10232 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
10233 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
10241 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
10242 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
9779 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
9780 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
9781 /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf
server:/var/svc/log>

So at a guess, someone ssh'd in to the server and started httpd, but for some reason, it didn't get put into its own process contract. And some more poking (and some Googling) verifies this for me. Apparently, Sun's sshd does the right thing WRT process contracts, and OpenSSH doesn't. (Or at least doesn't yet as of 4.3p2.)
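Grepping the ctstat output for a PID shows the member list but not which contract it belongs to. As a rough sketch, the lookup can be wrapped in a small shell function that also prints the contract ID. The function name contract_of_pid is made up for illustration, and the sketch assumes that each contract in `ctstat -a -v` output starts with a summary line whose first column is the numeric contract ID, followed by a single indented "member processes:" line; check that against the output on your own system before relying on it.

```shell
# Print the ID of every contract whose member list contains the given PID.
# Reads `ctstat -a -v` style output on stdin (layout is an assumption).
# contract_of_pid is a hypothetical helper name, not an existing tool.
contract_of_pid() {
    # AWK can be overridden; the awk used must support the -v option.
    "${AWK:-awk}" -v pid="$1" '
        # Summary line: first field is the numeric contract ID.
        $1 ~ /^[0-9]+$/ { ctid = $1 }
        # Verbose detail line: fields 3..NF are the member PIDs.
        /member processes:/ {
            for (i = 3; i <= NF; i++)
                if ($i == pid) { print ctid; break }
        }
    '
}
```

Usage would look like `ctstat -a -v | contract_of_pid 24437`. If the default awk on your system does not accept -v, point AWK at one that does (for example `AWK=nawk`).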
Sun's sshd will attempt to put children into their own process contracts. This way, if you do a 'svcadm disable ssh', existing connections aren't killed. (I've mentioned process contracts before, but you could just as easily read the man pages -- 'man contracts' is a good place to start.)
OpenSSH (or at least 4.3p2) doesn't appear to do anything at all with process contracts. What you end up with is every sshd process in the same contract. What's more, you end up with the entire process tree in the same contract. This includes the shells and any other processes started from those shells, etc. (Well, with the exception of anything like ctrun(1) that creates new process contracts.)
Why is this a problem? Well, for one, you get what you see above: SMF doesn't restart a service that you think it should, nor does it even report that the service is down. But there's the flipside: in the above case, had I done a 'svcadm disable ssh', not only would I have killed the existing connection, but I would also have killed that running httpd instance. That definitely violates the principle of least surprise.
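Until sshd itself handles contracts, one workaround is to start long-running daemons through ctrun(1) so they land in their own contract instead of the ssh service's. The wrapper below is only a sketch: run_in_own_contract is a made-up name, the `-l child` lifetime option is taken from my reading of the ctrun(1) man page, and the fallback branch exists only so the function still works on systems without ctrun.

```shell
# Run a command in a fresh process contract when ctrun(1) is available.
# run_in_own_contract is a hypothetical helper name, not part of any tool.
run_in_own_contract() {
    if command -v ctrun >/dev/null 2>&1; then
        # -l child: ctrun exits as soon as the command itself exits,
        # leaving any daemonized children behind in the new contract.
        ctrun -l child "$@"
    else
        # No ctrun on this system: just run the command directly.
        "$@"
    fi
}
```

From an ssh session this might be invoked as `run_in_own_contract /usr/local/httpd-2.2.0/bin/httpd -f /var/httpd/httpd.conf`, after which ctstat should list the httpd processes under a contract separate from sshd's.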
Comments:
Great post. I disabled openssh today while I was doing a make install of a newer version (temporarily in telnet) and found all database processes disappeared with no info in the alert log or anything. It took a while to figure out, but it was completely reproducible. Yes, the ossh service, per svcs -xv, had gone into maintenance mode, and the log said it killed a contract. I am new to SMF/contracts, so if I can't find an alternative, say changing the service type in its definition, then I'll just use init.d for ssh.
I found other posts where you had submitted Solaris contracts support patches for OpenSSH back in '06.
Tested the --with-solaris-contracts compile switch and it worked perfectly.
Thanks for sorting this out.