Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!ll-xn!cit-vax!amdahl!amdcad!rpw3
From: rpw3@amdcad.UUCP (Rob Warnock)
Newsgroups: comp.arch
Subject: Re: Speed is the one true performance metric
Message-ID: <13776@amdcad.UUCP>
Date: Sun, 16-Nov-86 09:58:08 EST
Article-I.D.: amdcad.13776
Posted: Sun Nov 16 09:58:08 1986
Date-Received: Sun, 16-Nov-86 20:08:26 EST
References: <340@euroies.UUCP> <1989@videovax.UUCP> <798@spar.SPAR.SLB.COM> <3576@utcsri.UUCP>
Organization: [Consultant] San Mateo, CA
Lines: 99
Summary: Broken computers DO give wrong answers (sometimes).

In article <3576@utcsri.UUCP>, greg@utcsri.UUCP (Gregory Smith) writes:
> This is silly. Broken computers don't give wrong answers. They crash,
> or they log soft errors, or they act flaky. It is almost impossible to
> imagine a hardware fault that would have no visible effect other than
> to make the 'value' (whatever it may be) of the output wrong.

Hard to imagine? Maybe, but I've run into it, more than once. (Not a LOT,
you understand, but when you've been around a long time...) Besides, don't
you consider "wrong answers" to be "acting flaky"?

Anyway, try these on for size (the first two occurred on a PDP-10 I was
involved in administering circa 1970-1972):

1. A marginal core memory power supply (but the same thing could happen
   with RAMs) which caused bad data to be read ONLY when certain data sets
   were being processed. (Only two programs could cause the failure, and
   then only with certain inputs. One program was a cross-assembler which
   failed only when assembling certain versions [!] of a particular program;
   the other was an NMR simulation program, again, with certain inputs.)
   In each case, the problem occurred only after the program had consumed
   at least 5 minutes of CPU time.

   The failure affected NO other programs (that we could tell), including
   the operating system, and memory diagnostics did NOT find the problem!
   (The diagnostic patterns were worst-case for the *memories*; the bug was
   a pattern worst-case for the *power supplies*...)

2. A FORTRAN program which "occasionally" got "slightly different" answers
   when run with the same input data. Seems there was a leaky transistor
   driving a reset line to a register which held the number of the general
   register the floating-point result should go into. Certain programs would
   sometimes generate enough noise to cause this (already marginal) line to
   twitch, dumping the results into register 0 instead of the correct one.
   The program in question did a LOT of very involved matrix calculations
   (that NMR stuff, again), and the odds of the error making a big change
   in the answers were slim. (Caused a major panic when discovered... all
   the programs used in calculating published research results had to be
   re-run, to see whether retractions or corrections were needed.)

3. The infamous ARPAnet "black-hole", wherein an IMP had a memory failure
   whose only effect was to make the routing table entries return zero for
   a large number of hosts (it just happened that the bad memory was where
   the table lived). "Zero" meant "I'm directly connected", so when it told
   its neighbors this (during the normal exchange of routing info), they
   cheerfully sent all their packets to the confused IMP, who sent them back
   out... to IMPs who sent them back in...  [I hope I got the story right]

Yes, parity-protected memory would have prevented this one, but that's
not always the case. Memories can fail into the all-ones condition, too,
and simple parity is not enough.
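
To make the parity point concrete, here is a minimal sketch (an illustration only, not a model of the IMP's actual memory). It assumes a hypothetical 16-bit data word plus one parity bit, and checks what each parity convention says about a location that has failed to all-zeros or all-ones, parity bit included. The names `parity_bit`, `passes_check`, and `WIDTH` are made up for the example.

```python
def ones(value: int) -> int:
    """Count the 1 bits in a non-negative integer."""
    return bin(value).count("1")


def parity_bit(data: int, odd: bool = True) -> int:
    """Parity bit that makes the total number of 1s odd (or even)."""
    if odd:
        return 0 if ones(data) % 2 == 1 else 1
    return ones(data) % 2


def passes_check(data: int, stored_parity: int, odd: bool = True) -> bool:
    """True if data plus its stored parity bit looks valid to the checker."""
    total = ones(data) + stored_parity
    return total % 2 == (1 if odd else 0)


WIDTH = 16  # assumed word width; any even width behaves the same way

# A failed location reads back the same value on every bit, parity included.
STUCK_ZEROS = (0, 0)
STUCK_ONES = ((1 << WIDTH) - 1, 1)

for odd in (True, False):
    name = "odd" if odd else "even"
    print(name, "parity:",
          "all-zeros passes =", passes_check(*STUCK_ZEROS, odd=odd),
          "| all-ones passes =", passes_check(*STUCK_ONES, odd=odd))
```

Under these assumptions, odd parity flags the all-zeros word but accepts the all-ones word (17 ones is an odd count), and even parity does the reverse. Whichever convention is chosen, one of the two wholesale failures reads back as valid data.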

4. A memory card address-decoder that was shorted, causing two banks of memory
   to be read at the same time (each got *written* with the correct data).
   Due to the fact that they collided at a TTL bus, as long as one bank had
   not been addressed "recently" (within a few microseconds), the correct bank
   won the "bus fight" (since the "older" bank's internal logic levels drifted
   up to TTL "high", and since a TTL "low" [usually] wins a fight with a "high").
   The normal memory diagnostics worked just fine, as did the simple address
   test. But when a certain user program was run, it made frequent references to
   both banks "quickly", causing bad data to be read. (Still, NOT necessarily
   causing parity errors! ...though they did occasionally occur.)
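
The bus fight in #4 can be sketched as a wired-AND, under some loud simplifications: a "low" always wins, an idle bank's output is treated as all "high", and a recently addressed bank keeps driving the last word it held. The 16-bit width and the names `bus_read` and `other_bank_output` are hypothetical, chosen only for this illustration.

```python
WIDTH = 16                 # assumed word width
MASK = (1 << WIDTH) - 1    # all bits "high"


def other_bank_output(last_word: int, recently_addressed: bool) -> int:
    """What the wrongly enabled bank puts on the bus.

    Simplification: an idle bank has drifted to all highs, while a
    recently addressed bank still drives the last word it held.
    """
    return last_word & MASK if recently_addressed else MASK


def bus_read(selected_word: int, intruder: int) -> int:
    """Wired-AND: a bit reads high only if both drivers are high."""
    return selected_word & intruder & MASK


def same_parity(a: int, b: int) -> bool:
    """True if a and b have the same number of 1 bits modulo 2."""
    return bin(a).count("1") % 2 == bin(b).count("1") % 2


good = 0x1234
idle = bus_read(good, other_bank_output(0x00FF, recently_addressed=False))
busy = bus_read(good, other_bank_output(0x00FF, recently_addressed=True))

print(hex(idle), hex(busy), same_parity(good, busy))
```

With these example values the idle case reads back the correct word, while the busy case loses two 1 bits. Because an even number of bits changed, the data's parity is unchanged, which fits the observation that bad reads did not necessarily raise parity errors. A diagnostic that touches one bank at a time never leaves the busy case.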

> Of course, floating point hardware is a little different, since it
> is used only for numerical calculations which are part of the problem
> ( as opposed to the CPU alu which is also used for indexing, etc.)
> You can always arrange to run an FPU diagnostic every 5 mins if this
> is an issue.

In case #2 above, it didn't help. The floating-point diagnostics didn't
find the problem. The fault wasn't, in fact, in the floating-point
hardware per se, but in the very same CPU ALU used for indexing, etc.
It was just that there were very very few operations other than F.P.
which used that auxiliary "where should the result go?" register, and
none (that we ever knew of) other than the program in question which
generated the right pattern of noise to clear it WHEN IT WAS BEING USED.

Incidentally, problems #1 & #2 (which occurred about a year apart) were eventually
solved when yours truly finally ignored the diagnostics (the only thing
the DEC serviceman had been trained to use), and got out an oscilloscope
and started probing around looking for something "not quite right". Both
errant signals showed up quite clearly as being "not right" on a 'scope,
though the systems passed all the diagnostics.

MORAL: "Testing can show the presence of bugs, but not their absence."
	[E. W. Dijkstra]

COROLLARY: I bet the same thing happens soon (if it hasn't already) inside
	   somebody's fancy new CPU chips...   And this time they won't
	   be able to just poke around with a scope, looking for "something".
	   The only solution will be to tell the customer, "Well, don't
	   run that program!"   ;-}


Rob Warnock
Systems Architecture Consultant

UUCP:	{amdcad,fortune,sun}!redwood!rpw3
DDD:	(415)572-2607
USPS:	627 26th Ave, San Mateo, CA  94403