Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!wuarchive!uunet!auspex!guy
From: guy@auspex.auspex.com (Guy Harris)
Newsgroups: comp.arch
Subject: Re: I crashed our MIPS machine today
Message-ID: <3899@auspex.auspex.com>
Date: 15 Aug 90 19:54:04 GMT
References: <1990Aug10.155458.2237@hod.uit.no>
Organization: Auspex Systems, Santa Clara
Lines: 64

>I thought this should be impossible!

Yeah, it *should* be impossible to crash the OS from non-privileged
user-mode code, but sometimes there are bugs in the OS.

>Would anyone out there like to comment on the strange behaviour I have
>observed on our machines?

From a quick look at a crash dump on an SS1 (SPARCstation 1) running
SunOS 4.0.3c, my suspicion is that the kernel code for handling the
floating-point unit isn't being careful enough in looking at the
floating-point state; it appears to be handing a bad pointer to another
routine that calls the procedure to which that pointer is supposed to
point, only it points into the nether reaches of Hell instead.
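
To make the suspected failure mode concrete, here is a toy model of it.
This is only an illustrative sketch, not the SunOS code (which isn't
shown here): every name in it is hypothetical, and a table lookup stands
in for the pointer the kernel calls through. The unchecked version trusts
whatever selector the saved state contains; the checked version rejects
anything that doesn't name a known routine before calling through it.

```python
# Toy model only: all names here are hypothetical, not real kernel code.

class BadFpState(Exception):
    """Raised when saved floating-point state names no known routine."""


def fp_add(a, b):
    return a + b


def fp_mul(a, b):
    return a * b


# The only routines this pretend "kernel" is willing to call.
HANDLERS = {0: fp_add, 1: fp_mul}


def dispatch_unchecked(state):
    # Trusts the selector found in the saved state. A garbage selector
    # blows up inside the dispatcher (here, with a KeyError).
    return HANDLERS[state["op"]](state["a"], state["b"])


def dispatch_checked(state):
    # Validates the selector first, so bad state becomes an error
    # reported to the caller rather than a wild call.
    op = state.get("op")
    if op not in HANDLERS:
        raise BadFpState("unknown selector: %r" % (op,))
    return HANDLERS[op](state["a"], state["b"])
```

In a real kernel the unchecked case would be a jump to an arbitrary
address rather than a tidy exception, and the checked case would
presumably end with a signal delivered to the offending process.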

In other words, it doesn't appear to bear out the conclusions the
original poster of the program, in "comp.os.vms", drew:

  OK. Here is a quick summary of the HOW TO CRASH A RISC machine from
  a USER-MODE program test. Reports have arrived that all of these machines
  can be crashed using CRASHME.C:
  IBM RT, MIPS, DECSTATION 5000, SPARC.
 
  On the two CISC architectures tried, VAX/VMS and SUN-3, the program
  either completed or exited with a core or register dump, as expected.
 
  Some background/motivation. My experience with microcode programming
  taught me that some sequences of MICROINSTRUCTIONS could wedge or jam
  the hardware in such a way that recovery was impossible without
  a reboot of some kind. The RISC architectures have some of the same
  properties of MICROCODE in that certain instruction sequences have
  UNDEFINED behavior. Now one of the great costs in a CISC machine is
  usually the trouble the designers go through to make sure that
  every instruction returns the MACHINE to a KNOWN STATE. That way
  the behavior of every instruction can be well defined, tested, and
  documented, individually verified and tested, and by simple induction
  be valid for arbitrary SEQUENCES of instructions. (In general).
 
  Engineers of RISC machines don't bother to do this, which is one of
  the reasons they are CHEAPER (the hardware, not the engineers).
 
  The problem of proving that an arbitrary sequence of instructions "N"
  long will not crash the machine is much more costly if N > 1.
  (To say the least, if you know anything about mathematical logic).
  If there are M instructions (and M is probably around 1 BILLION)
  then there may be about M^N cases to check. And what is N? 
  For a classic CISC machine a price is paid to make N = 1, or
  at least small. But for a RISC machine, might N be 10 or more?
 
  Anyway, no need to make too big a deal about this. Probably all the
  vendors can fix things in software alone, and certainly CISC chips
  with bugs in them have been shipped in the past too.
 
  Just a reminder though. There is no free lunch. There really is
  a trade-off between ROBUSTNESS-PRICE/PERFORMANCE-TIME_TO_MARKET.

The *only* way in which you *might* be able to agree with this as being
the source of the problem - at least in the SPARC case, and maybe in the
MIPS case as well - would be to claim that the floating-point support
software was part of the implementation of the architecture, and that
the checks he alleges are made for CISC but not RISC machines weren't
made in the software part of the architecture.  It certainly doesn't
seem to be the case that the *processor* gets stuck in some state "in
such a way that recovery was impossible without a reboot of some kind."
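
For a sense of scale, the M^N count in the quoted text can be worked out
directly. The sketch below simply takes the quoted figures at face value
(M "around 1 BILLION", N of 1, 2, or 10); the function name is made up
for illustration, and nothing here says whether M^N is the right model
of the verification problem.

```python
def sequences_to_check(m, n):
    """Number of distinct length-n sequences drawn from m instructions."""
    return m ** n


M = 10 ** 9  # "around 1 BILLION", as assumed in the quoted text

for n in (1, 2, 10):
    count = sequences_to_check(M, n)
    # Print the exponent rather than the full number for readability.
    print("N = %2d: about 10^%d sequences" % (n, len(str(count)) - 1))
```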