Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!decwrl!uunet!isis!ico!rcd
From: rcd@ico.isc.com (Dick Dunn)
Newsgroups: comp.unix.i386
Subject: Re: Altos 5000
Summary: disk mirroring, points of failure, redundancy
Message-ID: <1990Aug27.183821.13518@ico.isc.com>
Date: 27 Aug 90 18:38:21 GMT
References: <1990Aug16.174514.2646@NCoast.ORG> <15759@bfmny0.BFM.COM> <81@maxx.UUCP>
Organization: Interactive Systems Corporation, Boulder, CO
Lines: 121

tyager@maxx.UUCP (Tom Yager) challenges some of my points about the 5000. At the end, I've got some observations on unintended flames, but first let's take the disk-mirroring issue; he hit some shortcomings in my posting...

I had said:
> > ...disk mirroring is mostly a pawn in the
> > feature game. It takes substantially more I/O bandwidth to do the double
> > output, and it doubles the cost of disk storage. Why not spend only a few
> > bucks extra and buy reliable disks?
>
> Even "reliable" disks eventually die.

True. So do reliable controllers. What I want to get at--and it's something I didn't say at all in my previous posting--is that if you're looking for a certain level of reliability, it's a lot harder than just tossing on extra disks and mirroring. Let's have a look at the overall reliability issues...although I'd remind everyone that fault-tolerant system design is complicated, so we're bound to gloss over things.

> You're running a service business, say, a distribution house. Your order entry,
> warehouse control, customer service--everything--is on the computer. You've
> got everything backed up like a good doobie. One of your drives gets smoked.
> Now, if you're mirrored, your system squawks at you but keeps running...

OK, this covers one failure mode: a disk dies. There are two questions I think we need to ask:
	- How likely is this failure mode relative to other failure modes?
	- Is there another way to get comparable recovery capability?
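
One rough way to frame the first question is a back-of-the-envelope survival calculation. The Python sketch below is only an illustration: the function names are hypothetical, the probabilities are invented rather than measured, and failures are assumed to be independent, which real hardware does not guarantee. It compares a chain of motherboard, controller, and one disk against the same chain with the disk mirrored behind a single controller.

```python
def series(*probs):
    """Probability that every component in a chain survives."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def mirrored(p):
    """Probability that at least one of two independent copies survives."""
    return 1.0 - (1.0 - p) ** 2


# Hypothetical one-year survival probabilities, made up for illustration.
# They are set equal only to model "other parts fail about as often as disks".
MOTHERBOARD = 0.95
CONTROLLER = 0.95
DISK = 0.95

# One disk: everything is a single point of failure.
plain = series(MOTHERBOARD, CONTROLLER, DISK)

# Mirrored disks, but still one motherboard and one controller in front.
with_mirror = series(MOTHERBOARD, CONTROLLER, mirrored(DISK))

# Upper bound on what any amount of disk redundancy can achieve here.
ceiling = series(MOTHERBOARD, CONTROLLER)

print(f"single disk:    {plain:.4f}")
print(f"mirrored disks: {with_mirror:.4f}")
print(f"ceiling:        {ceiling:.4f}")
```

With these made-up figures, mirroring makes the disk stage nearly certain to survive, yet the system as a whole can never do better than the unmirrored motherboard and controller in front of it.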

To the second question, I'll suggest "journaling" as providing a lot of what you need, possibly at much less cost. I'm more interested in the first question.

I had pointed out that it takes extra I/O bandwidth to handle mirroring; someone responded that if you have the right sort of controller, it will write both disks at once for you. OK, fine, now you've made the controller a single-point-of-failure. I've seen as many motherboard and controller failures as disk failures. I don't pretend my experience is typical, but suppose that it might be. The disks are not the only failure points in the system.

Oh, and what about that controller, and the writing: Are you doing read-after-write on both disks to be sure you've got good copies of everything? Are you actually using both disks (always writing both, but reading from whichever is free)? There's a Catch-22 in the failure-mode analysis here: If you're using both disks (to improve performance) you risk having an undetected failure on one disk give you inconsistent data between them...which could quickly make a hash of the data on *both* disks. If you're essentially running on one disk and just writing the other as a backup mirror, you're not getting the ongoing check that you really need for reliability.

One of the troublesome aspects of mirroring is that it's just a redundancy at one fairly low level. This is inherently subject to a certain class of error--if there's something just-plain-wrong somewhere, you may end up with nothing more than two wrong copies of your data. It's why I hinted at journaling, because that at least gives you a second copy of the data in a different format, constructed with a different algorithm. (Recovery using a journal takes longer, of course...but it recovers from a different class of problems.)

> There are a lot of copies of Netware SFT in the hands of businesspeople who
> agree with me, and a large part of the fuss over the Systempro is for its
> mirroring and data guarding features.

People who can afford extra hardware and software, and who can't afford to be down, will buy solutions that promise greater reliability. That's as it should be. Does the solution make sense? I'm not convinced. I would admonish everyone to bear in mind that being able to sell something is not an inherent indicator of its worth. In this case, I'm not arguing that mirroring is worthless, but I do argue that it's inordinately expensive and only addresses one small part of the overall reliability problem. A single system with mirrored disks on one controller has only one element of redundancy.

_ _ _ _ _

Now, let's clean up something about the "insults":

> > > See the review of the System 5000 in the July issue of _UNIX WORLD_...
> > _UNIX_World_?? Oh, yeah...isn't that the magazine that just carried an
> > article about UNIX-based BBSes without a single word about either USENET or
> > ARPANET? I think you need a stronger source of review than that.
>
> Thank you for the insult. I can handle criticism as well as any writer, but I
> am not keen on those who badmouth my work without even having read it.

Tom: I was *not* criticizing your article. The complete truth of the matter is this: I saw the _UNIX_World_ which contains your article, but what caught my eye first was the BBS article. In my "carefully considered professional (but personal) opinion" the BBS article was so poorly written that it puts the magazine in a bad light. In fact, my comments were directed entirely at _UNIX_World_; at that point I hadn't seen your name on the article. Surely you realize that regardless of how well-researched and well-written your article is, it will be judged in the context of the magazine in which it appears..."birds of a feather" and all that. The BBS article was not the first of its ilk I've seen in _UNIX_World_, either.

Having now skimmed quickly through *your* article, I think it's a reasonably good one--concise, to-the-point.
I have some criticisms (primarily the matter of using Neal Nelson benchmarks) but they're specific, and not general dissatisfaction with the article. I've also seen your writing in Byte. I think you're generally reasonably on-target, and in any case you make an obvious effort to get it right. Perhaps you can help _UNIX_World_ improve its reputation within the UNIX community.

> ...In any case, I'm objective: The original
> posting complained about the 5000's inability to run a certain OS product with
> which you have a mild involvement.

Come on, come right out and say it. If you think I failed to be objective, point out *where* I failed to be objective. If you think I have some vested interest in whether Interactive's UNIX runs on the Altos 5000, tell me about it. (Honestly, I don't care. They're playing a different game than Interactive is.)

> Regardless, I think your swipe at UNIX World, and my review, was out of line.

I think your complaint is out of line. I did not take a swipe at your review. I didn't say *anything* about your review. I made a comment about the wisdom of using _UNIX_World_ as a reference.
-- 
Dick Dunn     rcd@ico.isc.com -or- ico!rcd       Boulder, CO   (303)449-2870
   ...I'm not cynical - just experienced.