Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.3 4.3bsd-beta 6/6/85; site ucbvax.BERKELEY.EDU
Path: utzoo!decvax!ucbvax!XX.LCS.MIT.EDU!ARMS-D-Request
From: ARMS-D-Request@XX.LCS.MIT.EDU (Moderator)
Newsgroups: mod.politics.arms-d
Subject: Arms-Discussion Digest V7 #23
Message-ID: <8609272208.AA10221@ucbvax.Berkeley.EDU>
Date: Sat, 27-Sep-86 15:51:00 EDT
Article-I.D.: ucbvax.8609272208.AA10221
Posted: Sat Sep 27 15:51:00 1986
Date-Received: Sat, 27-Sep-86 21:54:35 EDT
Sender: daemon@ucbvax.BERKELEY.EDU
Reply-To: ARMS-D@XX.LCS.MIT.EDU
Organization: The ARPA Internet
Lines: 262
Approved: arms-d@xx.lcs.mit.edu

Arms-Discussion Digest             Saturday, September 27, 1986 3:51PM
Volume 7, Issue 23

Today's Topics:

 Reliability, complexity, and confidence in SDI software (from RISKS)
                                 SDI

----------------------------------------------------------------------

Date: Friday, 26 September 1986  17:22-EDT
From: ESTELL ROBERT G <estell at nwc-143b.ARPA>
Reply-To: ESTELL ROBERT G <estell at nwc-143b.ARPA>
To:   risks <risks at csl.sri.com>
Re:   Reliability, complexity, and confidence in SDI software

I apologize in advance for the length of this piece.  But it's briefer
than the growing list of claims and counter-claims, made by resepctable
folks, based on either/both sound theory or/and actual experience.  
And we're dealing with a critical question: 
	Can very large systems be reliable?


The "bathtub curve" for MECHANICAL "failures" has always made sense to me.
I've heard lectures about how software follows similar curves.  
But I've really been stumped by the notion that "software wears out."

I'd like to attempt to "bound the problem" so to speak.
SUPPOSE that we had a system composed of ten modules; and suppose that
each module had ten possible INTERNAL logical paths, albeit only one 
entry and only one exit.

 The MINIMUM number of logical paths through the system  is ten (10); 
 i.e., *IF* path #1 in module A INVARIABLY invokes path #1 in modules 
 B, C, ... J; and likewise, path #2 in A INVARIABLY invokes path #2 
 in B, C, ... J; etc. then there are only ten paths.
 NOTE I'm also assuming that the modules invariably run in alpahbetical
 order, always start with A, and always finish with J; and never fail
 or otherwise get interrupted.  [I'm trying to avoid nits.]
 Some residential wiring systems are so built; there are many switches
 and outlets on each circuit; but each circuit is an isolated loop to the 
 main "fuze" box; "fuzes" for the kitchen are independent of the den.

 The MAXIMUM number of logical paths through the system is ten billion 
 (10.E10); i.e., *IF* each module can take any one of its ten paths in 
 response to any one of the ten paths from any one of the other ten modules, 
 there are 10**10 possibilities.
 AGAIN assuming that the system always starts with A, runs in  order, etc.
 *IF SEQUENCE IS SIGNIFICANT, and if the starting point is random, THEN
 there are ten!10.E10 paths; i.e., ten factorial times ten billion, or
 36,288,000,000,000,000 possible paths in the system.
 
 Further, *IF INTERRUPTS* are allowed and are significant, then I can't
 compute the exact number of possible paths; but I can guarantee that it's
 >MORE> than 10!10.E10.

End of bounds.  The scope reaches from the trivial, to the impossible.

The GOAL of good engineering practices [for hardware, software, and firmware]
is to design and implement modules that control the possible paths; e.g.,
systems should *NOT* interact in every conceivable way.
It does NOT follow that the interactions should be so restricted that 
there are only ten paths through a ten module system.
BUT there is some reason to HOPE that systems may be so designed, in a tree
structure such that:

 a. AT EACH LEVEL, exactly one module will be "in control" at any instant; 
 b. and that each module will run independently of others at its level; 
 c. and that there are a finite [and reasonably small] number of levels.

In "Levels of Abstraction in Operating Systems", RIACS TR 84.5, Brown,
Denning, and Tichy describe 15 levels, reaching from circuits to shell;
applications sit at level 16.  If one must have a layered application,
then add layers 17, 18, et al.

I will conjecture that at levels 1 and 2 [registers, and instruction set],
there are only five possible states (each):
 (1) not running; 
 (2) running - cannot be interrupted;
 (3) running - but at a possible interrupt point;
 (4) interrupted; and
 (5) error.

I will further conjecture that the GOAL of writing modules at each of the
other layers, from O/S kernel, through user application packages, can
reasonably be to limit any one module to ten possible states.  NOTE that
purely "in line code" can perform numerous functions, without putting the
module in more than a few states.  [e.g., Running, Ready to run, Blocked,
Intrerrupted, Critical region, or Error.]

Such a system, comprised of say 15 applications layers, would assume maybe
290 possible states; that's the SUM of the number of possibilities at each
layer, given the path that WAS ACTUALLY TAKEN to reach each layer.

Yet the number of functions that such a system could perform is at least
the sum of all the functions of all the modules in it.  If you're willing
to risk some interaction, then you can start playing with PRODUCTS [vice
SUMS] of calling modules, called modules, etc.  EVEN SO, if the calling
module at layer "n" can assume half a dozen states, and the called module
at layer "n+1" can assume a similar number, then the possible states of
that pair are about 40; that's more than a dozen, but it's still managable.

In real life, both humans and computers deal with enormously complex systems
using similar schemes.  For instance, two popular parlor games: chess, and
contract bridge.  Each admits millions of possible scenarios.  But in each,
the number of possible sensible *NEXT plays* is confined by the present 
state of affairs.  So-called "look ahead" strategies grow very complex; 
but once a legal play has been made, there are again a small number of 
possible legal "next plays."

In bridge, for instance, at least 635,013,559,600 possible hands can be dealt,
to ONE player [combination of 52 things, 13 at a time].  That one hand does
not uniquely determine the contents of the other three hands.
Whether the hands interact is not a simple question in pure mathematics;
in many cases, they do; but in one unique case, they don't; 
e.g., if dealer gets all 4 aces, and all 4 kings, all 4 queens, and any
jack, then he bids 7 no trump; and it doesn't matter who else has what
else; it's an unbeatable bid.  [Non bridge players, accept both my word 
for it; and my apology for an obscure example.]

We've been playing bridge a lot longer than we've been writing large, real-
time software systems.  I'll conjecture that we don't know nearly as much
about "SDI class systems" as we do about the card game.
But in either case, if we aren't careful, the sheer magnitude of the
numbers can overwhelm us.

BOTTOM LINEs:

1. The curve for debugging software has a DOWNslope and length that is 
some function of the number of possible paths through the code.

2. Good software engineering practice says that one checks the design
before writing lots of code.  ["Some" may be necessary, but not "lots."]
*IF* errors show up in the design, fix them there.
*IF* the DESIGN itself is flawed, then change it.  [e.g., Rethink a design
that allows modules to interact geometrically.]

3. Confidence builds as one approaches the 90% [or other arbitrary level]
point in testing the number of possible paths.

4. The reason that we haven't built confidence in the past is that we've
often run thousands of hours, without knowing either:

 a. how many paths got tested; or
 b. how many paths remained untested.

5. INTERACTIONS do occur - even ones that aren't supposed to.
[Trivial example: My car's cooling and electrical systems are NOT supposed
to interact; and they don't - until the heater hose springs a leak, and
squirts coolant all over the distributor and sparkplugs.]
In "The Arbitration Problem", RIACS TR 85.12, Dennning shows that 
computers are fundamentally NOT ABSOLUTELY predictable; it may be that
an unstable state is triggered ONLY by timing idiosyncracies such as:
 At the same minor cycle of the clock, CPU #1 suffers a floating 
 underflow in the midst of a vector multiplication, AND CPU #2 takes an 
 I/O interrupt from a disk read error, while servicing a page fault.

6. Since interactions do occur, experiences that many have had with
small programs in a well-confined environment do *NOT* necessarily
"scale up" to apply to very large, real-time codes, that run on raw
hardware in a hostile [or just "random"] environment.  NOTE that I'm
claiming that in such a system, the O/S kernel is part of the
real-time system.

7. The "problem space" we've been discussing is at least triangular.
In one corner, there are assembly language monoliths, running on
second- generation computers, without hardware protection; such
systems convince Parnas that "SDI won't ever work."  Written that way,
it won't.  [Important aside: It's one thing to argue that *if* SDI
were built using modern software techniques, it would work.  It's
another thing to realize that in DOD, some (not all) tactical systems
run on ancient computers that cost more to maintain than they would to
replace; and offer less power than a PC AT.  Such facts, known to
Parnas, understandably color his thinking.]

In another corner, there are small [1000 or so lines] modules, running
in a controlled environment, that and have been "proven" to work.
Most of us doubt that such experience scales up to SDI sizes.

In another corner, there are 100,000 line systems that work, in real life,
but without formal proofs.  Probably built using good S/W Eng practices.

8. The KISS principle ["Keep It Simple, Stupid"] eliminates lots of problems.
Prof. Richard Sites, at UCSD in 1978, told of a talk given by Seymour Cray.  
In answer to audience questions about "how to make the circuits run at those
speeds", Cray explained that circuit paths were all of known, fixed lengths; 
and that all paths were terminated cleanly at both ends; and other 
"good EE practices" taught to undergrads.  Less successful builders were 
so wrapped up in megaFLOPS that they got careless.

We could do well to adopt Cray's philosophy for hardware as we build
our software; e.g., build "RISC" programs; write code that does only a
few tasks, but does them very well, both quickly and reliably.  Maybe
that's one reason why UNIX systems are so portable, powerful, and
popular?  [Each module is simple; power comes from piping them
together.]  NOTE that I'm claiming that "RISC" computer architecture
is not new; look at almost every machine that Cray has designed;
instruction sets are limited, and their implementation is superb.

Bob
For the record, I'm speaking "off the record" and expressing personal opinion.

------------------------------

Date: 27 Sep 1986 1013-PDT
From: Rem@IMSSS
Subject: SDI

PRK> Date: 11 Sep 86 04:06:42 GMT
PRK> From: karn@petrus.arpa  (Phil R. Karn)
PRK> Subject: Re: SDI delta launch

PRK> You may be correct. However, as with the Homing Overlay shot several
PRK> years ago, the only conceivable purpose of this "experiment" was the
PRK> political promotion of SDI to a technically naive electorate. As
PRK> evidence for this assertion, I offer the following: Before the launch,
PRK> the SDI people stated that while the launch preparations and mission
PRK> were secret at that time, the results would be made public IF AND ONLY
PRK> IF THE TEST SUCCEEDED.

This could backfire on them if people are intelligent. If only
successful tests are reported, then the number of unsuccessful
unreported tests is unbounded. We could say that 99% of tests were
failure and nobody could refute us using public information. Perhaps
if we present this argument to the public, they will change their
report-only-success policy?

PRK> It is precisely because I am so pro-space that I get very angry when I
PRK> see people perverting space technology in a program that a) cannot
PRK> deliver on its promises, b) exploits the universal fear of nuclear war
PRK> for political and pecuniary purposes and c) could well cause the very
PRK> nuclear war that it was supposed to prevent.  Assuming that the worst
PRK> doesn't happen, there will be an inevitable, enormous and
PRK> indiscriminate backlash against science and technology in general and
PRK> space technology in particular when the general public eventually
PRK> realizes how much money they've wasted on this fraud.

This indeed seems possible. We need to emphasize our interest in
peaceful use of space and our prediction that SDI will be worthless to
the general public as much as possible so that when SDI indeed fails
we can say "I told you so, you should've put your money in peaceful
use of space instead of SDI" and avoid some of the backlash. (But then
contacting the general public so they'll remember what we said instead
of what they think we said is easier said then done.)

PRK> If you think it's rough to get money for NASA or NSF now, just wait until
PRK> the post-Star Wars backlash. It won't matter that a majority of the Nobel
PRK> Laureates and NAS members back in the 1980's said that SDI was a bad idea.

If we keep up the flood of anti-SDI pro-peaceful-space opinions
without let-up, it'll be fresh in their minds when SDI fails, perhaps.

PRK> In other words, I oppose SDI for the same reasons that Better Business
PRK> Bureaus (which are supported by local businessmen) oppose shady
PRK> businesses: enlightened professional self-interest.

I like that analogy. (:- Maybe start the Better Space Bureau? :-)

------------------------------

End of Arms-Discussion Digest
*****************************