Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!cmcl2!rutgers!ames!oliveb!pyramid!prls!mips!mash
From: mash@mips.UUCP (John Mashey)
Newsgroups: comp.arch
Subject: Re: Was the 360 badly-designed? (was Re: Compatibility with EBCDIC)
Message-ID: <634@winchester.UUCP>
Date: Thu, 27-Aug-87 14:47:19 EDT
Article-I.D.: winchest.634
Posted: Thu Aug 27 14:47:19 1987
Date-Received: Sat, 29-Aug-87 13:03:50 EDT
References: <1486@cullvax.UUCP> <114@spook.UUCP>
Reply-To: mash@winchester.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 52

In article <114@spook.UUCP> hank@masscomp.UUCP (Hank Cohen) writes:
>One thing that people seem to be missing in this discussion of the 370
>architecture is that the 370 POO specifies much more than the user
>instruction set.
>	Another provision of the system architecture that is overlooked
>in all micro systems that I have seen is error logging and diagnosis.
>In this age of super fast micro processors and very large scale
>integration none of the new RISC chips have thought it prudent to
>provide architectural support for fault detection and diagnosis.

I don't know what the other folks do. MIPS put a fair amount of effort
into this, although it is important to note that the error logging /
diagnostics approaches inherently differ at least somewhat between
VLSI approaches and mainframe designs, i.e., you don't replace part of
a chip! Here are things MIPS did:

a) The CPU/FPU are designed to be easily diagnosable by ordinary code,
i.e., hidden state was avoided, and you generally can exercise the
paths quite well. This is a necessity for testing the chips in the
first place. [Maybe one of our VLSI folks will comment in some detail.]

b) There aren't "don't care" bits that can surprise you.

c) The CPU-cache interface includes parity bits, 3 for Tag, Validity,
and Page Frame, and 4 for the data.
On a parity error, the CPU treats it as a cache miss, then does a
refill, thus getting you over an occasional error. This is very
important, in that the speed of the whole system depends on the
CPU-cache interface speed, and one will always be pushing that.
A bit is set that the OS can test whenever it feels like it, to detect
that an SRAM is failing.

d) The CPU contains status bits for isolating the caches, swapping the
caches, and testing the parity-checking circuits.

f) The write buffer gate arrays have a loop-back mode for testing them.

g) External memory systems can be built with either parity or ECC.
We use ECC, and the CPU was designed in such a way as to be able to do
ECC-checking in parallel with access, without losing performance.

h) There are a bunch of other minor things that are needed for handling
other error conditions in reasonable ways.

In general, the original point is well taken: higher-performance
systems NEED to be designed with diagnosability in mind, or there will
be serious problems sooner or later, especially if you want these
things to be big multi-user systems / servers.
-- 
-john mashey	DISCLAIMER:
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
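
Footnote to (c): a rough software model of the "parity error is treated
as a cache miss" idea, for readers who want to see the control flow.
This is only a sketch under stated assumptions, not the actual hardware
logic. The direct-mapped, one-word-per-line cache, the names
(cache_read, parity_error_seen, and so on), and the restriction to the
4 data parity bits (tag/valid parity is left out) are all hypothetical
choices made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NLINES 16u  /* hypothetical: direct-mapped, one 32-bit word per line */

typedef struct {
    uint32_t tag;
    uint32_t data;
    uint8_t  data_parity;  /* 4 bits, one per data byte */
    bool     valid;
} cache_line;

typedef struct {
    cache_line      line[NLINES];
    bool            parity_error_seen;  /* sticky bit the OS can poll */
    const uint32_t *memory;             /* backing store, word-indexed */
    uint32_t        memory_words;
} cache;

/* Parity of one byte: 1 if the byte holds an odd number of 1 bits. */
static uint8_t byte_parity(uint8_t b)
{
    b ^= (uint8_t)(b >> 4);
    b ^= (uint8_t)(b >> 2);
    b ^= (uint8_t)(b >> 1);
    return b & 1u;
}

/* Pack the four per-byte parity bits of a word into bits 0..3. */
static uint8_t word_parity(uint32_t w)
{
    uint8_t p = 0;
    for (int i = 0; i < 4; i++)
        p |= (uint8_t)(byte_parity((uint8_t)(w >> (8 * i))) << i);
    return p;
}

/* Reload a line from memory and recompute its parity. */
static void refill(cache *c, uint32_t word_addr)
{
    cache_line *l = &c->line[word_addr % NLINES];
    l->tag = word_addr / NLINES;
    l->data = c->memory[word_addr];
    l->data_parity = word_parity(l->data);
    l->valid = true;
}

void cache_init(cache *c, const uint32_t *mem, uint32_t words)
{
    for (uint32_t i = 0; i < NLINES; i++) {
        c->line[i].tag = 0;
        c->line[i].data = 0;
        c->line[i].data_parity = 0;
        c->line[i].valid = false;
    }
    c->parity_error_seen = false;
    c->memory = mem;
    c->memory_words = words;
}

/* Read one word. A parity mismatch on a matching line is recorded in the
   sticky bit and then handled exactly like a miss: refill and carry on. */
uint32_t cache_read(cache *c, uint32_t word_addr, bool *was_hit)
{
    assert(word_addr < c->memory_words);
    cache_line *l = &c->line[word_addr % NLINES];
    bool match = l->valid && l->tag == word_addr / NLINES;
    bool parity_ok = word_parity(l->data) == l->data_parity;

    if (match && !parity_ok)
        c->parity_error_seen = true;

    bool hit = match && parity_ok;
    if (!hit)
        refill(c, word_addr);
    if (was_hit)
        *was_hit = hit;
    return l->data;
}
```

The caller never sees the bad data: a flipped bit in the SRAM costs one
refill, and the sticky flag is left behind for the OS to notice later.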