Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!wuarchive!uunet!auspex!guy
From: guy@auspex.auspex.com (Guy Harris)
Newsgroups: comp.arch
Subject: Re: I crashed our MIPS machine today
Message-ID: <3899@auspex.auspex.com>
Date: 15 Aug 90 19:54:04 GMT
References: <1990Aug10.155458.2237@hod.uit.no>
Organization: Auspex Systems, Santa Clara
Lines: 64

>I thought this should be impossible!

Yeah, it *should* be impossible to crash the OS from non-privileged
user-mode code, but sometimes there are bugs in the OS.

>Would anyone out there like to comment the strange behaviour I have
>observed on our machines?

From a quick look at a crash dump on an SS1 running 4.0.3c, my suspicion
is that the kernel code for handling the floating point unit isn't being
careful enough in looking at the floating point state; it appears to be
handing a bad pointer to another routine that calls the procedure to
which that pointer is supposed to point, only it points into the nether
reaches of Hell instead.

In other words, it doesn't appear to bear out the conclusions the
original poster of the program, in "comp.os.vms", drew:

    OK. Here is a quick summary of the HOW TO CRASH A RISC machine
    from a USER-MODE program test. Reports have arrived that all of
    these machines can be crashed using CRASHME.C: IBM RT, MIPS,
    DECSTATION 5000, SPARC. On the two CISC architectures tried,
    VAX/VMS and SUN-3, the program either completed or exited with a
    core or register dump, as expected.

    Some background/motivation. My experience with microcode
    programming taught me that some sequences of MICROINSTRUCTIONS
    could wedge or jam the hardware in such a way that recovery was
    impossible without a reboot of some kind. The RISC architectures
    have some of the same properties of MICROCODE in that certain
    instruction sequences have UNDEFINED behavior. Now one of the
    great costs in a CISC machine is usually the trouble the designers
    go through to make sure that every instruction returns the MACHINE
    to a KNOWN STATE.
    That way the behavior of every instruction can be well defined,
    tested, and documented, individually verified and tested, and by
    simple induction be valid for arbitrary SEQUENCES of instructions.
    (In general). Engineers of RISC machines don't bother to do this,
    which is one of the reasons they are CHEAPER (the hardware, not
    the engineers). The problem of proving that an arbitrary sequence
    of instructions "N" long will not crash the machine is much more
    costly if N > 1. (To say the least, if you know anything about
    mathematical logic). If there are M instructions (and M is
    probably around 1 BILLION) then there may be about M^N cases to
    check. And what is N? For a classic CISC machine a price is paid
    to make N = 1, or at least small. But for a RISC machine, might N
    be 10 or more?

    Anyway, no need to make too big a deal about this. Probably all
    the vendors can fix things in software alone, and certainly CISC
    chips with bugs in them have been shipped in the past too. Just a
    reminder though. There is no free lunch. There really is a
    trade-off between ROBUSTNESS-PRICE/PERFORMANCE-TIME_TO_MARKET.

The *only* way in which you *might* be able to agree with this as being
the source of the problem - at least in the SPARC case, and maybe in the
MIPS case as well - would be to claim that the floating-point support
software was part of the implementation of the architecture, and that
the checks he alleges are made for CISC but not RISC machines weren't
made in the software part of the architecture. It certainly doesn't
seem to be the case that the *processor* gets stuck in some state "in
such a way that recovery was impossible without a reboot of some kind."
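
For a sense of scale, the M^N figure in the quoted text can be worked out directly. What follows is only a minimal sketch in Python: M = 10^9 is the quoted poster's own guess, the rate of one million checks per second is an assumption picked purely for illustration, and the function names are hypothetical. It shows only how fast the count grows with N; it says nothing about whether the count is the right explanation for the crashes.

```python
# Rough size of the search space from the quoted M^N estimate.
# M is the quoted poster's guess at the number of distinct instructions.
# CHECKS_PER_SECOND is a hypothetical test rate, assumed for illustration only.

M = 10**9
CHECKS_PER_SECOND = 10**6
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def sequences(m: int, n: int) -> int:
    """Number of distinct instruction sequences of length n over m instructions."""
    return m ** n


def years_to_check(m: int, n: int, rate: int = CHECKS_PER_SECOND) -> float:
    """Years needed to try every length-n sequence once at `rate` checks per second."""
    return sequences(m, n) / rate / SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (1, 2, 10):
        print(f"N={n:2d}: {sequences(M, n):.3e} sequences, "
              f"{years_to_check(M, n):.3e} years")
```

Under those assumptions, N = 1 takes about a thousand seconds to cover exhaustively, while N = 2 already takes on the order of thirty thousand years.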