Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!cmcl2!rutgers!ames!oliveb!pyramid!prls!mips!mash
From: mash@mips.UUCP (John Mashey)
Newsgroups: comp.arch
Subject: Re: Was the 360 badly-designed?  (was Re: Compatibility with EBCDIC)
Message-ID: <634@winchester.UUCP>
Date: Thu, 27-Aug-87 14:47:19 EDT
Article-I.D.: winchest.634
Posted: Thu Aug 27 14:47:19 1987
Date-Received: Sat, 29-Aug-87 13:03:50 EDT
References: <1486@cullvax.UUCP> <114@spook.UUCP>
Reply-To: mash@winchester.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 52

In article <114@spook.UUCP> hank@masscomp.UUCP (Hank Cohen) writes:
>One thing that people seem to be missing in this discussion of the 370
>architecture is that the 370 POO specifies much more than the user
>instruction set.
>	 Another provision of the system architecture that is  overlooked 
>in all micro systems that I have seen is error logging and diagnosis.
>In this age of super fast micro processors and very large scale
>integration none of the new RISC chips have thought it prudent to
>provide architectural support for fault detection and diagnosis.

I don't know what the other folks do.  MIPS put a fair amount of effort
into this, although it is important to note that error-logging and
diagnostic approaches inherently differ, at least somewhat, between
VLSI and mainframe designs, i.e., you don't replace part of a chip!
Here are things MIPS did:

a) The CPU/FPU are designed to be easily diagnosable by ordinary code,
i.e., hidden state was avoided, and you generally can exercise the
paths quite well.  This is a necessity for testing the chips in the first
place.  [Maybe one of our VLSI folks will comment in some detail.]

b) There aren't "don't care" bits that can surprise you.

c) The CPU-cache interface includes parity bits, 3 for Tag, Validity,
and Page Frame, and 4 for the data.  The CPU treats a parity error
as a cache miss, then does a refill, thus getting you
over an occasional error.  This is very important, in that the speed
of the whole system depends on the CPU-cache interface speed, and one will
always be pushing that.  A bit is also set, which the OS can test whenever
it likes, to detect that an SRAM is failing.

d) The CPU contains status bits for isolating the caches, swapping the
caches, and testing the parity-checking circuits.

e) The write buffer gate arrays have a loop-back mode for testing them.

f) External memory systems can be built with either parity or ECC.
We use ECC, and the CPU was designed in such a way as to be able to do
ECC-checking in parallel with access, without losing performance.

g) There are a bunch of other minor things that are needed for
handling other error conditions in reasonable ways.
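
To make the parity-as-miss idea in (c) concrete, here is a rough sketch
in Python.  It is a toy model, not the real hardware: the class name
ParityCache, the corrupt() test hook, the direct-mapped layout, and the
choice of one even-parity bit per data byte are all assumptions made for
illustration, and tag parity is left out entirely.  It also assumes a
write-through cache, so that memory always holds a good copy to refill
from.

```python
def byte_parity(word):
    """Return 4 even-parity bits, one per byte of a 32-bit word (assumed layout)."""
    return tuple(bin((word >> (8 * i)) & 0xFF).count("1") & 1 for i in range(4))


class ParityCache:
    """Toy direct-mapped, write-through cache of 32-bit words (hypothetical model)."""

    def __init__(self, memory, n_lines=4):
        self.memory = memory            # backing store: dict of word address -> value
        self.n_lines = n_lines
        self.lines = [None] * n_lines   # each entry: (tag, data, parity) or None
        self.parity_error_seen = False  # sticky status bit for the OS to poll
        self.refills = 0

    def _refill(self, addr):
        index, tag = addr % self.n_lines, addr // self.n_lines
        data = self.memory[addr]
        self.lines[index] = (tag, data, byte_parity(data))
        self.refills += 1
        return data

    def read(self, addr):
        index, tag = addr % self.n_lines, addr // self.n_lines
        line = self.lines[index]
        if line is None or line[0] != tag:
            return self._refill(addr)          # ordinary miss
        _, data, parity = line
        if byte_parity(data) != parity:
            self.parity_error_seen = True      # record it, then treat as a miss
            return self._refill(addr)
        return data                            # clean hit

    def corrupt(self, addr, bit):
        """Test hook: flip one stored data bit without updating its parity."""
        index = addr % self.n_lines
        tag, data, parity = self.lines[index]
        self.lines[index] = (tag, data ^ (1 << bit), parity)


memory = {0x10: 0xDEADBEEF}
cache = ParityCache(memory)
first = cache.read(0x10)        # cold miss, fills the line
cache.corrupt(0x10, bit=5)      # simulate a flaky SRAM cell
second = cache.read(0x10)       # parity mismatch, so refill from memory
```

The program reading the word never sees the bad data; it just pays for
an extra refill.  The sticky flag is what lets the OS notice, at its
leisure, that a part is going bad.  Note that this recovery only works
because nothing in the cache is newer than memory.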

In general, the original point is well taken: higher-performance
systems NEED to be designed with diagnosability in mind, or there
will be serious problems sooner or later, especially if these
machines are to serve as big multi-user systems or servers.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086