Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!uwm.edu!lll-winken!sun-barr!newstop!sun!amdcad!mozart.amd.com!nucleus!davec
From: davec@nucleus.amd.com (Dave Christie)
Newsgroups: comp.arch
Subject: Re: Workstation Data Integrity
Message-ID: <1990Aug6.172146.10614@mozart.amd.com>
Date: 6 Aug 90 17:21:46 GMT
References: <1990Aug3.204358.330@portia.Stanford.EDU> <1990Aug4.231129.1358@zoo.toronto.edu>
Sender: usenet@mozart.amd.com (Usenet News)
Reply-To: davec@nucleus.amd.com (Dave Christie)
Organization: Advanced Micro Devices, Inc., Austin, Texas
Lines: 62

In article <1990Aug4.231129.1358@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <1990Aug3.204358.330@portia.Stanford.EDU> jackk@shasta.stanford.edu (Jack Kouloheris) writes:
>>I'm a bit puzzled by the lack of any type of memory error detection/
>>correction on many workstations and high-end PCs. These workstations
>>are beginning to have memories that rival or exceed those of
>>the previous generation of minicomputers, which almost always used
>>some sort of ECC protection...
>

 [some valid points about current DRAM quality and the temptation to not
  bother with the extra hardware deleted]

>>Some SUNs have parity checking on the memory system, but what does
>>the OS do when a parity error occurs, since correction is not
>>possible ?
>
>Depends on the situation.  A parity error in a code page is harmless --
>just bring in a fresh copy from disk.  A parity error in data in an
>ordinary user program can be dealt with by killing that program.  You

Spoken like a true sysadmin :-).

>get into difficulties only when the error hits the kernel or some vital
>system daemon.  If errors are rare enough, parity is adequate.
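
For concreteness, the triage described above might be sketched like this.
It is only an illustration of the policy as stated, not what any real
kernel does: the Region names and the action strings are made up, and
"panic" for the kernel case is my assumption (the text only says that is
where you "get into difficulties").

```python
from enum import Enum, auto


class Region(Enum):
    """Hypothetical classification of the page that took the parity error."""
    CODE_PAGE = auto()   # clean text page with a good copy on disk
    USER_DATA = auto()   # writable data of an ordinary user process
    KERNEL = auto()      # kernel memory or a vital system daemon


def on_parity_error(region):
    """Return the recovery action for a parity error in `region`."""
    if region is Region.CODE_PAGE:
        # Nothing is lost: discard the page and fetch it again.
        return "reload page from disk"
    if region is Region.USER_DATA:
        # The data cannot be trusted or repaired, so the job dies.
        return "kill the process"
    # Assumed behavior: no safe way to continue.
    return "panic"
```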

"Rare enough" is pretty relative - one has to consider the run time of
one's programs.  (John McCalpin was recently talking of runtimes on the
order of months!)  And since most cycles are spent running user programs 
(hopefully!) I think they deserve a little more consideration.  But the 
workstation market is pretty cutthroat and cost/performance is critical - 
fault tolerance hardware tends to push that ratio in the wrong direction
so there's some incentive to leave it out.  When comparing the current
workstations with previous systems, one has to consider that those
systems consisted of many more parts, with a lot more interconnections -
a significant cause of failure (especially unsoldered ones); today's
increased densities have improved this.  And such systems were more often
used in enterprise situations, such as maintaining critical company
records, rather than for single users.
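
To put a rough number on "rare enough": if one assumes errors arrive
independently at a constant average rate (a Poisson model, which is my
simplification), the chance that a job of length T sees at least one error
is 1 - exp(-rate * T).  The rate below is a made-up placeholder, not a
measurement of any real machine.

```python
import math

# Hypothetical figure chosen only for illustration.
ASSUMED_ERRORS_PER_HOUR = 1e-4


def p_at_least_one_error(errors_per_hour, run_hours):
    """Probability of one or more errors during a run (Poisson assumption)."""
    # -expm1(-x) equals 1 - exp(-x) but keeps precision when x is tiny.
    return -math.expm1(-errors_per_hour * run_hours)


if __name__ == "__main__":
    for label, hours in [("1 hour", 1), ("1 day", 24), ("3 months", 24 * 90)]:
        p = p_at_least_one_error(ASSUMED_ERRORS_PER_HOUR, hours)
        print(f"{label:>9}: {p:.4%}")
```

The point is only the shape of the curve: the same error rate that is
invisible to an hour-long job can be a real risk to one that runs for
months.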

Certain segments of the market certainly do require more fault tolerance 
than one finds in unix/workstation systems, and if such systems want to
penetrate those segments, they are going to have to learn a few lessons
from the mainframe hardware and software world.  (Gee, I can almost hear
some people who think unix on a workstation is the be-all and end-all in
computer systems gagging.)  And of course it doesn't come for free (I've
heard that the fault tolerance aspects of the 3081/3090 were as big a
project as the rest of the system!).  The RS/6000 has been mentioned: ECC 
on memory, with an extra bit which is used as a last resort to replace a 
hard failure that can't be scrubbed.  This is what one would expect from a
company such as IBM - fault tolerance is a way of life for all mainframe/mini 
manufacturers.  And I bet the associated software is the larger part of 
the work - I wouldn't be overly surprised if it wasn't all supported yet.
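
For anyone who hasn't looked at what ECC buys you over parity, here is a
toy single-error-correcting Hamming(7,4) code.  It is not the RS/6000's
actual code (I don't know its details), and real memory ECC works on much
wider words; the function names are mine.  It does show the key
difference: parity can only say "something is wrong", while the syndrome
here says which bit is wrong, so it can be flipped back.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit word laid out in positions 1..7."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # check bit over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # check bit over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # check bit over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(word):
    """Return (data_bits, syndrome); syndrome 0 means no error was seen."""
    syndrome = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= position   # XOR of the positions of all set bits
    fixed = list(word)
    if syndrome:
        # A nonzero syndrome is the position of the single flipped bit.
        fixed[syndrome - 1] ^= 1
    return [fixed[2], fixed[4], fixed[5], fixed[6]], syndrome
```

Note that this toy handles only one bad bit per word; two flipped bits
would be silently "corrected" to the wrong value.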

But all in all, the overall error rate for workstations, relative to the
runtime of most applications that people are running, must be satisfactory;
it doesn't seem to be a big issue.  I know that's true in my environment 
(uP design) - a few problems now and then, but not enough to push me over 
the edge and demand better hardware.

---------------------------------
Dave Christie             My opinions only.
All purpose comp.arch disclaimer: It depends.