Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!uwm.edu!lll-winken!sun-barr!newstop!sun!amdcad!mozart.amd.com!nucleus!davec
From: davec@nucleus.amd.com (Dave Christie)
Newsgroups: comp.arch
Subject: Re: Workstation Data Integrity
Message-ID: <1990Aug6.172146.10614@mozart.amd.com>
Date: 6 Aug 90 17:21:46 GMT
References: <1990Aug3.204358.330@portia.Stanford.EDU> <1990Aug4.231129.1358@zoo.toronto.edu>
Sender: usenet@mozart.amd.com (Usenet News)
Reply-To: davec@nucleus.amd.com (Dave Christie)
Organization: Advanced Micro Devices, Inc., Austin, Texas
Lines: 62

In article <1990Aug4.231129.1358@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <1990Aug3.204358.330@portia.Stanford.EDU> jackk@shasta.stanford.edu (Jack Kouloheris) writes:
>>I'm a bit puzzled by the lack of any type of memory error detection/
>>correction on many workstations and high-end PCs. These workstations
>>are beginning to have memories that rival or exceed those of
>>the previous generation of minicomputers, which almost always used
>>some sort of ECC protection...
>
[some valid points about current dram quality and the temptation
 to not bother with the extra hardware deleted]

>>Some SUNs have parity checking on the memory system, but what does
>>the OS do when a parity error occurs, since correction is not
>>possible ?
>
>Depends on the situation. A parity error in a code page is harmless --
>just bring in a fresh copy from disk. A parity error in data in an
>ordinary user program can be dealt with by killing that program. You

Spoken like a true sysadmin :-).

>get into difficulties only when the error hits the kernel or some vital
>system daemon. If errors are rare enough, parity is adequate.

"Rare enough" is pretty relative - one has to consider the run time
of one's programs. (John McCalpin was recently talking of runtimes
on the order of months!) And since most cycles are spent running
user programs (hopefully!) I think they deserve a little more
consideration.

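To put a rough shape on "rare enough", here is a minimal sketch of how the chance of hitting at least one memory error grows with run time. It assumes errors arrive independently at a constant rate (a Poisson model), which is a simplification. The rate used is purely hypothetical and not a measurement of any real machine; the function name is also just illustrative.

```python
import math

def p_at_least_one_error(errors_per_hour: float, run_hours: float) -> float:
    """Probability of one or more errors during a run.

    Assumes a constant, independent error rate (Poisson arrivals),
    so P = 1 - exp(-rate * time). expm1 keeps precision for tiny rates.
    """
    return -math.expm1(-errors_per_hour * run_hours)

# HYPOTHETICAL rate, chosen only for illustration:
# one memory error per 5000 machine-hours.
RATE = 1.0 / 5000.0

if __name__ == "__main__":
    runs = [("1 hour", 1), ("1 day", 24), ("1 month", 30 * 24), ("3 months", 90 * 24)]
    for label, hours in runs:
        print(f"{label:>9}: {p_at_least_one_error(RATE, hours):.4f}")
```

Under this model the same error rate that is negligible for an hour-long job can become a real concern for a job that runs for months, which is the point about run time being relative.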
But the workstation market is pretty cutthroat and cost/performance
is critical - fault tolerance hardware tends to push that ratio in
the wrong direction, so there's some incentive to leave it out. When
comparing the current workstations with previous systems, one has to
consider that those systems consisted of many more parts, with a lot
more interconnections. Interconnections are a significant cause of
failure (especially unsoldered ones); today's increased densities
have improved this. And such systems were more often used in
enterprise situations, such as maintaining critical company records,
rather than for single users.

Certain segments of the market certainly do require more fault
tolerance than one finds in unix/workstation systems, and if such
systems want to penetrate those segments, they are going to have to
learn a few lessons from the mainframe hardware and software world.
(Gee, I can almost hear some people who think unix on a workstation
is the be-all and end-all in computer systems gagging.) And of
course it doesn't come for free (I've heard that the fault tolerance
aspects of the 3081/3090 were as big a project as the rest of the
system!).

The RS/6000 has been mentioned: ECC on memory, with an extra bit
which is used as a last resort to replace a hard failure that can't
be scrubbed. This is what one would expect from a company such as
IBM - fault tolerance is a way of life for all mainframe/mini
manufacturers. And I bet the associated software is the larger part
of the work - I wouldn't be overly surprised if it isn't all
supported yet.

But all in all, the overall error rate for workstations, relative to
the runtime of most applications people are running, must be
satisfactory; it doesn't seem to be a big issue. I know that's true
in my environment (uP design) - a few problems now and then, but not
enough to push me over the edge and demand better hardware.

---------------------------------
Dave Christie          My opinions only.
All purpose comp.arch disclaimer: It depends.