Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!ll-xn!cit-vax!amdahl!amdcad!rpw3
From: rpw3@amdcad.UUCP (Rob Warnock)
Newsgroups: comp.arch
Subject: Re: Speed is the one true performance metric
Message-ID: <13776@amdcad.UUCP>
Date: Sun, 16-Nov-86 09:58:08 EST
Article-I.D.: amdcad.13776
Posted: Sun Nov 16 09:58:08 1986
Date-Received: Sun, 16-Nov-86 20:08:26 EST
References: <340@euroies.UUCP> <1989@videovax.UUCP> <798@spar.SPAR.SLB.COM> <3576@utcsri.UUCP>
Organization: [Consultant] San Mateo, CA
Lines: 99
Summary: Broken computers DO give wrong answers (sometimes).

In article <3576@utcsri.UUCP>, greg@utcsri.UUCP (Gregory Smith) writes:
> This is silly. Broken computers don't give wrong answers. They crash,
> or they log soft errors, or they act flaky. It is almost impossible to
> imagine a hardware fault that would have no visible effect other than
> to make the 'value' (whatever it may be) of the output wrong.

Hard to imagine? Maybe, but I've run into it, more than once. (Not a LOT,
you understand, but when you've been around a long time...) Besides, don't
you consider "wrong answers" to be "acting flakey"?

Anyway, try these on for size (the first two occurred on a PDP-10 I was
involved in administering circa 1970-1972):

1. A marginal core memory power supply (but the same thing could happen
   with RAMs) which caused bad data to be read ONLY when certain data
   sets were being processed. (Only two programs could cause the failure,
   and then only with certain inputs. One program was a cross-assembler
   which failed only when assembling certain versions [!] of a particular
   program; the other was an NMR simulation program, again, with certain
   inputs.) In each case, the problem occurred only after the program had
   consumed at least 5 minutes of CPU time. The failure affected NO other
   programs (that we could tell), including the operating system, and
   memory diagnostics did NOT find the problem!
   (The diagnostic patterns were worst-case for the *memories*; the bug
   was a pattern worst-case for the *power supplies*...)

2. A FORTRAN program which "occasionally" got "slightly different"
   answers when run with the same input data. Seems there was a leaky
   transistor driving a reset line to a register which held the number
   of the general register the floating-point result should go into.
   Certain programs would sometimes generate enough noise to cause this
   (already marginal) line to twitch, dumping the results into register 0
   instead of the correct one. The program in question did a LOT of very
   involved matrix calculations (that NMR stuff, again), and the odds of
   the error making a big change in the answers were slim. (Caused a
   major panic when discovered... all the programs used in calculating
   published research results had to be re-run, to see whether
   retractions or corrections were needed.)

3. The infamous ARPAnet "black-hole", wherein an IMP had a memory failure
   whose only effect was to make the routing table entries return zero
   for a large number of hosts (it just happened that the bad memory was
   where the table lived). "Zero" meant "I'm directly connected", so when
   it told its neighbors this (during the normal exchange of routing
   info), they cheerfully sent all their packets to the confused IMP, who
   sent them back out... to IMPs who sent them back in... [I hope I got
   the story right] Yes, parity-protected memory would have prevented
   this one, but that's not always the case. Memories can fail into the
   all-ones condition, too, and simple parity is not enough.

4. A memory card address-decoder that was shorted, causing two banks of
   memory to be read at the same time (each got *written* with the
   correct data).
   Because they collided at a TTL bus, as long as one bank had not been
   addressed "recently" (within a few microseconds), the correct bank won
   the "bus fight" (since the "older" bank's internal logic levels
   drifted up to TTL "high", and since a TTL "low" [usually] wins a fight
   with a "high"). The normal memory diagnostics worked just fine, as did
   the simple address test. But when a certain user program was run, it
   made frequent references to both banks "quickly", causing bad data to
   be read. (Still, NOT necessarily causing parity errors! ...though they
   did occasionally occur.)

> Of course, floating point hardware is a little different, since it
> is used only for numerical calculations which are part of the problem
> ( as opposed to the CPU alu which is also used for indexing, etc.)
> You can always arrange to run an FPU diagnostic every 5 mins if this
> is an issue.

In case #2 above, it didn't help. The floating-point diagnostics didn't
find the problem. The fault wasn't, in fact, in the floating-point
hardware per se, but in the very same CPU ALU used for indexing, etc. It
was just that there were very very few operations other than F.P. which
used that auxiliary "where should the result go?" register, and none
(that we ever knew of) other than the program in question which generated
the right pattern of noise to clear it WHEN IT WAS BEING USED.

Incidentally, problems #1 & #2 (which occurred about a year apart) were
eventually solved when yours truly finally ignored the diagnostics (which
were the only thing the DEC serviceman had been trained to use), and got
out an oscilloscope and started probing around looking for something "not
quite right". Both errant signals showed up quite clearly as being "not
right" on a 'scope, though the systems passed all the diagnostics.

MORAL:	"Testing can show the presence of bugs, but not their absence."
	[E. W. Dijkstra]

COROLLARY: I bet the same thing happens soon (if it hasn't already)
	inside somebody's fancy new CPU chips... And this time they
	won't be able to just poke around with a scope, looking for
	"something". The only solution will be to tell the customer,
	"Well, don't run that program!"  ;-}

Rob Warnock
Systems Architecture Consultant

UUCP:	{amdcad,fortune,sun}!redwood!rpw3
DDD:	(415)572-2607
USPS:	627 26th Ave, San Mateo, CA 94403
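
P.S. On the parity remark in item 3: the sketch below shows why a single
parity bit can catch a word that reads back as all zeros yet miss one that
reads back as all ones. It rests on assumptions that are NOT in the story
above: a 16-bit data word, one odd-parity bit, and hypothetical helper
names (`store`, `odd_parity_ok`). Nothing above says what scheme the IMP
memory actually used.

```python
DATA_BITS = 16               # assumed word width, for illustration only
TOTAL_BITS = DATA_BITS + 1   # data bits plus one parity bit


def store(data: int) -> int:
    """Return the data word with an odd-parity bit appended (hypothetical layout)."""
    ones = bin(data).count("1")
    parity = 0 if ones % 2 == 1 else 1   # make the total count of 1s odd
    return (data << 1) | parity


def odd_parity_ok(word_with_parity: int) -> bool:
    """True if the stored word (data plus parity bit) holds an odd number of 1s."""
    return bin(word_with_parity).count("1") % 2 == 1


stuck_at_zero = 0                        # every bit, parity included, reads 0
stuck_at_one = (1 << TOTAL_BITS) - 1     # every bit, parity included, reads 1

print(odd_parity_ok(store(0x1234)))  # True: a healthy word passes
print(odd_parity_ok(stuck_at_zero))  # False: zero 1s is even, so the fault is caught
print(odd_parity_ok(stuck_at_one))   # True: 17 ones is odd, so the fault slips through
```

With an even number of data bits, the all-ones word always has odd weight
and passes an odd-parity check; even parity has the mirror-image blind
spot for all-zeros.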