Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cs.utexas.edu!usc!zaphod.mps.ohio-state.edu!sol.ctr.columbia.edu!sdsu!ucsd!ucsdhub!hp-sdd!apollo!vinoski
From: vinoski@apollo.HP.COM (Stephen Vinoski)
Newsgroups: comp.arch
Subject: Re: Reliability
Message-ID: <481c13a1.20b6d@apollo.HP.COM>
Date: 18 Jan 90 14:56:00 GMT
References: <34030@mips.mips.COM> <4322@nttmhs.ntt.JP> <39807@ames.arc.nasa.gov> <3101@umn-d-ub.D.UMN.EDU> <28674@amdcad.AMD.COM> <7566@pt.cs.cmu.edu> <34469@mips.mips.COM> <7608@pt.cs.cmu.edu> <15679@haddock.ima.isc.com>
Sender: root@apollo.HP.COM
Reply-To: vinoski@zep.UUCP (Stephen Vinoski)
Organization: Hewlett-Packard Apollo Division, Chelmsford, MA
Lines: 40

In article aglew@oberon.csg.uiuc.edu (Andy Glew) writes:
>>At least around here, the UNIX based workstations run around the
>>clock. If you want reliability, why not periodically run
>>diagnostics? A 'cron' job could run at 3 AM and report any
>>errors. The CPU, FPU, RAM, DISK, and some other parts could be
>>checked reasonably well without consuming too much resources.
>
>Good idea...
>
>Except that most of the diagnostic programs written concurrent with
>hardware development (the diagnostics that may consume most of the
>hardware development budget) assume that they have exclusive control
>of the CPU. They can do things like turning cache on and off,
>deliberately writing bad data and then waiting for the trap, etc.
>This is why most of these diagnostic programs only run when the
>system is booting, or otherwise not running UNIX.
>
>Some sorts of stress diagnostics can be run on a normal UNIX system.

Stress diagnostics become very applicable when machines are
single-user, such as in the (ideal) workstation world. The
Testability and Diagnostics Department here at Apollo has a system
called SAX (System Acceptance EXercisor) which does just that.
It doesn't assume that it has exclusive control of the CPU, but it
"beats up" the system so much when it is running that no other useful
work can be done. It runs on top of the operating system and, to my
knowledge, uses no special system calls.

It can be configured so that it runs automatically in a chosen time
slot; it then notifies the user of any problems via email. It is
usually run overnight, and it can and regularly does catch problems
well before they become critical. Because it runs in a multitasking
environment, it also catches problems that cannot be detected by most
boot diagnostics and stand-alone diagnostics, such as bus arbitration
problems and cache coherency troubles.

-steve

| Steve Vinoski |
| Hewlett-Packard Apollo Division, Testability and Diagnostics Dept. |
| Chelmsford, MA 01824 (508)256-6600 x5904 |
| Internet: vinoski@apollo.com UUCP: {mit-eddie,yale,uw-beaver}!apollo!vinoski |
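
For readers who want something concrete, here is a minimal sketch of the kind of user-level check discussed above: a program that runs on top of the operating system, uses no special system calls, and produces a report that a scheduled job could mail out. It is not SAX and says nothing about how SAX works; the names check_buffer and report and the choice of bit patterns are assumptions made for illustration. It also only exercises whatever memory the OS hands the process, through the caches, so it is far weaker than a boot-time RAM test.

```python
# Hypothetical user-level memory pattern check (illustration only, not SAX).
# It writes a few bit patterns into an ordinary buffer and reads them back.

PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # assumed patterns: all zeros, all ones, alternating bits


def check_buffer(size, patterns=PATTERNS):
    """Fill a buffer of `size` bytes with each pattern in turn and verify it.

    Returns a list of (pattern, offset) tuples, one per mismatching byte.
    """
    buf = bytearray(size)
    failures = []
    for pattern in patterns:
        buf[:] = bytes([pattern]) * size      # write pass
        for offset, value in enumerate(buf):  # read-back pass
            if value != pattern:
                failures.append((pattern, offset))
    return failures


def report(failures):
    """Format the result as plain text suitable for a mail body."""
    if not failures:
        return "memory pattern check: no errors"
    lines = ["memory pattern check: %d mismatch(es)" % len(failures)]
    lines += ["  pattern 0x%02X at offset %d" % f for f in failures]
    return "\n".join(lines)


if __name__ == "__main__":
    print(report(check_buffer(1 << 20)))  # 1 MB buffer, an arbitrary size
```

A crontab entry such as "0 3 * * * python3 /path/to/memcheck.py" (the path is a placeholder) would run it at 3 AM; standard cron mails any output of a job to the crontab's owner, which gives the "report any errors" behaviour with no extra code.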