Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!uwm.edu!ux1.cso.uiuc.edu!ux1.cso.uiuc.edu!aglew
From: aglew@oberon.csg.uiuc.edu (Andy Glew)
Newsgroups: comp.arch
Subject: Re: Reliability
Message-ID:
Date: 17 Jan 90 20:22:56 GMT
References: <34030@mips.mips.COM> <4322@nttmhs.ntt.JP> <39807@ames.arc.nasa.gov> <3101@umn-d-ub.D.UMN.EDU> <28674@amdcad.AMD.COM> <7566@pt.cs.cmu.edu> <34469@mips.mips.COM> <7608@pt.cs.cmu.edu> <15679@haddock.ima.isc.com>
Sender: news@ux1.cso.uiuc.edu (News)
Organization: University of Illinois, Computer Systems Group
Lines: 48
In-Reply-To: news@haddock.ima.isc.com's message of 17 Jan 90 18:14:35 GMT

>At least around here, the UNIX based workstations run around the
>clock. If you want reliability, why not periodically run
>diagnostics? A 'cron' job could run at 3 AM and report any
>errors. The CPU, FPU, RAM, DISK, and some other parts could be
>checked reasonably well without consuming too much resources.
>I'd prefer this to parity, where the only thing the machine does
>is crash. I still say provide real error correction, or don't bother.

Good idea... Except that most of the diagnostic programs written
concurrently with hardware development (the diagnostics that may
consume most of the hardware development budget) assume that they
have exclusive control of the CPU. They can do things like turning
the cache on and off, deliberately writing bad data and then waiting
for the trap, etc. This is why most of these diagnostic programs
only run when the system is booting, or otherwise not running UNIX.

Some sorts of stress diagnostics can be run on a normal UNIX system.
But normal multiuser activity, such as mandatory interrupts every
1/60th of a second, can mask the very sort of errors that you are
looking for. Note that many of these diagnostic activities also
require hardware privilege (not just root) to do things like turning
off the TLB.

I have heard of kernels that have diagnostics integrated with them,
but the kernel is large enough already.
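
For concreteness, here is a rough sketch of the sort of stress check
that can run as an ordinary user process under UNIX. It is only an
illustration, not any vendor's diagnostic: the function name
pattern_check and the choice of byte patterns are my own assumptions.
It writes a few patterns into a buffer and reads them back. Note what
it cannot do: it has no control over the cache or TLB, it cannot
choose which physical memory it touches, and it is subject to exactly
the interrupt and multiuser masking described above.

```python
def pattern_check(size_bytes, patterns=(0x55, 0xAA, 0x00, 0xFF)):
    """Write each byte pattern across a buffer, read it back,
    and return the number of bytes that did not match.

    Hypothetical helper for illustration only. It runs entirely in
    user space, so it sees virtual memory through the cache and
    cannot force accesses to go to specific physical RAM.
    """
    buf = bytearray(size_bytes)
    mismatches = 0
    for pattern in patterns:
        # Fill the whole buffer with the current pattern.
        for i in range(size_bytes):
            buf[i] = pattern
        # Read every byte back and compare against the pattern.
        for i in range(size_bytes):
            if buf[i] != pattern:
                mismatches += 1
    return mismatches


if __name__ == "__main__":
    errors = pattern_check(1 << 16)  # 64 KiB, an arbitrary small size
    if errors == 0:
        print("memory pattern check: OK")
    else:
        print("memory pattern check:", errors, "mismatches")
```

Saved as a script, something like this could be run from cron at
3 AM with an entry along the lines of

    0 3 * * * /usr/local/diag/memcheck.py

where the path is made up for the example. A clean run proves very
little, for the reasons given above.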

When we were placing Gould's Real-Time UNIX on the Gould NPL, the
diagnostics engineers started thinking about putting up diagnostics
that would run under UNIX. Real-Time UNIX gave (privileged) user
processes the ability to acquire any sort of hardware privilege and,
in effect, take over the entire system. This was particularly
attractive for multiple-CPU systems, where one CPU could be isolated,
diagnostics run on it, and the CPU then released back to normal UNIX
operations.

Even on a single-CPU system a privileged diagnostic process could
take over the system, run diagnostics, and then return to UNIX. It
would probably have to be more careful about starting UNIX back up,
though, than in the multiple-CPU case.

SUMMARY: Regularly running diagnostics benefits from real-time UNIX
features and multiple CPUs.

aglew@uiuc.edu
--
Andy Glew, aglew@uiuc.edu