Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!cs.utexas.edu!usc!apple!spies!zorch!xanthian
From: xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan)
Newsgroups: comp.sys.amiga.tech
Subject: Re: Parity Checking / ECC RAM on the A3000
Keywords: parity error detection and correction, marketability
Message-ID: <1990May29.204550.27961@zorch.SF-Bay.ORG>
Date: 29 May 90 20:45:50 GMT
References: <756@bilver.UUCP> <1990May27.101258.24470@zorch.SF-Bay.ORG> <321@tlvx.UUCP>
Organization: SF Bay Public-Access Unix
Lines: 98

In article <321@tlvx.UUCP> sysop@tlvx.UUCP (SysOp) writes:
>kent> [...] the size of gates in modern memory is much smaller and
>kent> thus they are more susceptible to alpha radiation induced parity
>kent> errors than were the gates of the Cray memory.
>
>Even "more susceptible"? Ok, why? How? If it were really that bad, then
>people using the A3000 now should be at least occasionally noticing weird
>things happening, right? But are they? What about my A1000 with 2.5 megs?
>(Of course, I don't have parity, so how can I tell? Sigh.)

Sorry; it's amazing how susceptible I am to thinking that just because
I have known something for a decade or more, it therefore is common
knowledge. Herewith the (stupidly) omitted explanation:

Alpha radiation (a fast moving, stripped helium nucleus) originates
within the naturally occurring radioactive impurities of the memory chip
itself. By their nature (big, bumbling and slow compared to other kinds
of radioactivity) alpha particles have very limited penetrating power;
they do all their mischief near their point of origin. For our purposes,
the thing of interest about their action is that, being extremely
positively charged particles among atoms essentially in neutral balance,
they have a large effect on the outer shell (conduction) electrons,
pulling large numbers after them in their wake.
They cause a parity error when these entrained electrons are deposited
in a spot that causes a gate to shift its state from 0 to 1 or vice
versa, for instance on one of the control lines.

The older memory chips had very little susceptibility to alpha radiation
induced parity errors. Although the alpha radiation exists constantly at
a low level in every chip, "large numbers of electrons" above must be
considered relative to the number of electrons required to switch a
gate. The older memory chips, with larger "wiring" and component sizes,
used larger switching currents, significantly larger than the amount of
charge moved by one alpha particle. Since dynamic RAM means the memory
is refreshed repeatedly by renewing the control charges holding the
state (0 or 1) of each gate, there is not usually time for the charges
carried by individual alpha particles to accumulate over several events
and switch a gate before the refresh cycle sets the charge back to its
nominal value.

In contrast, in denser, newer memory chips with smaller "wiring" and
components, the charge delivered by a rogue alpha particle is of
comparable size to the holding charge on a gate, and so the gate may be
switched before a refresh cycle can correct the problem. Making the
refresh cycles faster (than they are, not than the old circuits) is not
an option, because most computer chips these days are heat limited, and
more refreshes means more heat.

So for an individual gate, denser memory means a larger chance of a bit
being flipped _from_this_one_cause_. Still, chips are not highly
radioactive, so for a single bit, this is a very low probability. The
problem comes when you accumulate megabytes of these bits together; the
chance of all of them avoiding errors tails off rapidly as their number
increases, by math similar to that of the birthday paradox.
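
A minimal sketch of that arithmetic, assuming every bit flips
independently with the same small probability p over some fixed
interval: the chance that at least one of N bits flips is
1 - (1 - p)^N. The per-bit value P_BIT below is a made-up placeholder
chosen only to show the shape of the curve, not measured data, and the
function name p_any_flip is hypothetical.

```python
import math


def p_any_flip(n_bits: int, p_bit: float) -> float:
    """Probability that at least one of n_bits independent bits flips.

    Computes 1 - (1 - p_bit)**n_bits via log1p/expm1 so that very small
    per-bit probabilities do not get lost to floating point rounding.
    """
    return -math.expm1(n_bits * math.log1p(-p_bit))


# ASSUMPTION: placeholder per-bit flip probability for one interval.
P_BIT = 1e-9

if __name__ == "__main__":
    for megabytes in (1, 10, 100):
        n_bits = megabytes * 1024 * 1024 * 8
        print(f"{megabytes:>4} MB: {p_any_flip(n_bits, P_BIT):.4f}")
```

Whatever the real per-bit figure is, the point of the sketch is only
that the chance of a completely clean memory falls off exponentially as
the bit count grows.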

I'm a bit shaky on the numbers here, since I was last a hardware
practitioner in 1972 and things have changed a trifle, but to the best
of my understanding, with today's component sizes, speeds, and numbers
of megabytes, you can expect to get in trouble somewhere between 1 and
100 megabytes. I defer to today's hardware practitioners for better
data.

As to why you don't see problems in your 3Meg AT, well, for one thing,
as you mentioned, you don't have parity checking, so errors could slip
by unnoticed. Next, most of the memory (or at least most of it when I
was using a 5 Meg '386 box) is unused by most applications, which are
still stuck at the 640K limit. Again, at least in my Amiga, about 5
megabytes is loaded with software I may not use from boot to boot, but
keep around because it is convenient. More, in running code lots of the
code space is never touched (use a file zapper; lots of it is huge
blocks of zeros). Again, with stuff such as screen memory, if a bit gets
flipped you may never notice before you switch screens, or windows in a
screen, and overwrite the soft parity error with good data. Similarly,
would you really be likely to notice a one bit error in a sampled sound
data block? Besides the above, your machine may sit idle 20 hours a day,
not even powered up.

In summary, there are lots of reasons why alpha induced parity errors
would not be a big enough problem to become noticeable. Yet. But as
with the birthday paradox, you don't have too far to go in terms of
bigger applications exercising more of the machine, full time unattended
operation (e.g. raytracing, doing accounts), more memory, more critical
applications, and so on, before you run into Seymour Cray's problem.

Parity checking is a necessity in large machines, just to be able to
rely on the results the machine gives you. Error correcting circuitry is
a necessity in large machines, to get the kind of uptime and throughput
the machine's raw speed and memory size seem to promise.
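
To make the difference between detecting and correcting concrete, here
is a toy sketch in Python. It shows a single even-parity bit, which can
tell you that one bit flipped but not which one, next to a textbook
Hamming(7,4) code, which can locate and repair a single flipped bit.
This is an illustration only: real ECC memory does this in hardware,
typically with wider codes, and all the function names below are
hypothetical.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word contains an odd number of 1 bits."""
    return bin(word).count("1") & 1


def parity_ok(word: int, stored_parity: int) -> bool:
    """True if the word still matches the parity bit stored with it."""
    return parity_bit(word) == stored_parity


def hamming74_encode(nibble: int) -> list:
    """Encode 4 data bits as a 7-bit codeword, positions 1..7."""
    d1, d2, d3, d4 = [(nibble >> i) & 1 for i in range(4)]
    p1 = d1 ^ d2 ^ d4  # checks positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # checks positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # checks positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(code: list) -> tuple:
    """Return (nibble, error_position); position 0 means no error seen."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s4  # syndrome spells out the bad position
    if pos:
        c[pos - 1] ^= 1  # flip the offending bit back
    nibble = c[2] | (c[4] << 1) | (c[5] << 2) | (c[6] << 3)
    return nibble, pos


if __name__ == "__main__":
    word = 0b1011
    stored = parity_bit(word)
    # Parity notices the flip but cannot say which bit it was.
    print(parity_ok(word ^ 0b0100, stored))

    code = hamming74_encode(word)
    code[4] ^= 1  # simulate one alpha-induced flip
    print(hamming74_correct(code))
```

Note that this toy code only handles one flipped bit per codeword; two
flips in the same word would be miscorrected.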

That's probably more than you wanted to know, and please excuse any
details that might not be "just so". Since I stopped doing this stuff
for a living, I'm a fairly casual student of the art. More is available
in IEEE pubs, Scientific American, and so on.

Kent, the man from xanth.
(xanthian@zorch.sf-bay.org)