Monday, July 03, 2006


Solaris/Linux reliability

(Note: the title and subject of this post are not a bald-faced attempt to generate traffic to this blog.)

I've recently started a new job in an environment that was historically all Solaris but that introduced Linux a few years ago. The reasons for introducing Linux were perfectly valid -- the developers wanted faster hardware, and at the time the choice was between bigger Sun hardware (running Solaris) and Intel hardware (running Linux). The cost difference was such that there was no question which direction to take. Given that Sun had abandoned Solaris x86, there was only one realistic choice of OS for the Intel boxes, but now that Sun has thrown its full weight behind Solaris x86, we're questioning whether we should go back to being an all-Solaris shop.

Other details aside, someone recently expressed the opinion that Solaris is more reliable than Linux. This opinion was questioned: was this anything more than just a guess?

I've done some Googling, but everything I find that discusses the comparative reliability of the two OSes seems to be opinion; I can't find any reasonable analysis backed by hard data. In the absence of hard data, I'll simply add another opinion. I'll try to give some basis for it, but I realize it's only that.

I believe that Solaris is the more reliable of the two OSes and the better suited to an enterprise environment. (And by "enterprise environment", I mean one in which you care about the uptime of individual servers, even if only during certain periods of the day, like trading hours. I'm not referring to environments designed to tolerate server downtime, like web farms behind a load balancer or compute clusters with redundancy designed in.)

What basis do I have for holding this opinion? I can't claim much experience that's relevant to this issue, as most of my Solaris experience has been on SPARC hardware, and all of my experience with Linux has been on Intel hardware. The SPARC hardware I've dealt with has been of higher quality than the Intel hardware, so that muddies the issue. (And I was fortunate enough to be in grad school during the recent USII E-cache unpleasantness.)

I base this opinion on one thing in particular: Solaris has an integrated fault management architecture. Others have written about it at length, so instead of merely repeating what they've said, I'll link directly to their write-ups.

Gavin Maltby gives a general overview of the fault management architecture here. He also discusses structured error events in this blog, including why they're important. (These were entries in his as-yet-unfinished Top Solaris 10 features for fault management list. I hope to see writeups for more of these, but they appear to be pretty time-intensive, given the level of detail he goes into.) He also goes into quite a bit of detail related to fault management for the Opteron here.
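If you want to see what "structured" means in practice, the error telemetry the fault manager collects can be inspected from the command line on a Solaris 10 box. (This is just a quick sketch; the output obviously depends on the machine and on what has actually gone wrong.)

    # fmdump -e     # one-line summary of each error event (ereport) logged by fmd
    # fmdump -eV    # the same events in full, as named name-value pairs

Each ereport carries an event class name plus a set of name-value pairs describing the detector and the error, and that structure is what makes automated diagnosis of the error stream possible.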

But just having a structured way to report error events doesn't do much good unless you can act on that information. What you want to do is to correlate errors and take preemptive measures before they become faults. You also want to be able to isolate faults when they occur. Gavin mentions diagnosis engines in his blog, and there's more about them here. Liane Praza discusses smf(5)'s role in fault isolation here, and Richard Elling discusses one part of fault isolation, memory page retirement, here.
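As a rough sketch on a Solaris 10 system, the pieces Gavin, Liane, and Richard describe are all visible from the command line:

    # fmadm config    # the diagnosis engines and agents currently loaded into fmd
    # fmdump          # faults that fmd has diagnosed from the error stream
    # fmadm faulty    # resources currently marked faulty
    # svcs -x         # smf's explanation of any services that aren't running

None of this is a substitute for reading their write-ups, but it makes the architecture feel a lot less abstract.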

Fault isolation is at the heart of my opinion that Solaris is more reliable than Linux. Instead of throwing up a single fault boundary around the entire system ("Uncorrectable memory error? Panic."), Solaris gives us much finer-grained boundaries ("Uncorrectable memory error? Is it in the kernel? All we can do is panic. Is it in user space? Restart the affected services."). With respect to memory errors, given that most of a system's memory is in user space, the probability of a panic becomes much smaller than the 100% you get with a monolithic fault boundary. (And note that this division between user and kernel space is more than just twofold, since smf(5) lets us define fault boundaries between user processes. It's certainly possible for an uncorrectable memory error to affect the root of the dependency tree, but if it only affects some leaf, there's only a single process to restart.)
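To put illustrative numbers on that (these are made up for the sake of the arithmetic, not measurements): if the kernel's pages account for, say, 1 GB of a 16 GB machine, then a randomly placed uncorrectable memory error lands in kernel memory only about 1 time in 16, call it 6%, and the system panics. The other roughly 94% of the time the damage is confined to user pages, and the worst case is restarting whatever services were using them. You can get a feel for how wide those per-service fault boundaries are by asking smf(5) about the dependency graph (the FMRI here is just an example):

    # svcs -d svc:/network/smtp:sendmail    # what this service depends on
    # svcs -D svc:/network/smtp:sendmail    # what depends on this service

A fault in a leaf of that graph costs you one restart; a fault at the root can mean restarting everything that depends on it, which is still a long way from panicking the whole box.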

As for the original question (is Solaris more reliable than Linux?), it may well be that Linux has made great advances in fault management, but my simple searches have yet to turn up that work. If anyone reading this cares to point me in the right direction, I'd greatly appreciate it.

