Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Posting-Version: version B 2.10.3 4.3bsd-beta 6/6/85; site ucbvax.BERKELEY.EDU Path: utzoo!decvax!ucbvax!XX.LCS.MIT.EDU!ARMS-D-Request From: ARMS-D-Request@XX.LCS.MIT.EDU (Moderator) Newsgroups: mod.politics.arms-d Subject: Arms-Discussion Digest V7 #23 Message-ID: <8609272208.AA10221@ucbvax.Berkeley.EDU> Date: Sat, 27-Sep-86 15:51:00 EDT Article-I.D.: ucbvax.8609272208.AA10221 Posted: Sat Sep 27 15:51:00 1986 Date-Received: Sat, 27-Sep-86 21:54:35 EDT Sender: daemon@ucbvax.BERKELEY.EDU Reply-To: ARMS-D@XX.LCS.MIT.EDU Organization: The ARPA Internet Lines: 262 Approved: arms-d@xx.lcs.mit.edu Arms-Discussion Digest Saturday, September 27, 1986 3:51PM Volume 7, Issue 23 Today's Topics: Reliability, complexity, and confidence in SDI software (from RISKS) SDI ---------------------------------------------------------------------- Date: Friday, 26 September 1986 17:22-EDT From: ESTELL ROBERT G Reply-To: ESTELL ROBERT G To: risks Re: Reliability, complexity, and confidence in SDI software I apologize in advance for the length of this piece. But it's briefer than the growing list of claims and counter-claims, made by resepctable folks, based on either/both sound theory or/and actual experience. And we're dealing with a critical question: Can very large systems be reliable? The "bathtub curve" for MECHANICAL "failures" has always made sense to me. I've heard lectures about how software follows similar curves. But I've really been stumped by the notion that "software wears out." I'd like to attempt to "bound the problem" so to speak. SUPPOSE that we had a system composed of ten modules; and suppose that each module had ten possible INTERNAL logical paths, albeit only one entry and only one exit. The MINIMUM number of logical paths through the system is ten (10); i.e., *IF* path #1 in module A INVARIABLY invokes path #1 in modules B, C, ... J; and likewise, path #2 in A INVARIABLY invokes path #2 in B, C, ... J; etc. then there are only ten paths. NOTE I'm also assuming that the modules invariably run in alpahbetical order, always start with A, and always finish with J; and never fail or otherwise get interrupted. [I'm trying to avoid nits.] Some residential wiring systems are so built; there are many switches and outlets on each circuit; but each circuit is an isolated loop to the main "fuze" box; "fuzes" for the kitchen are independent of the den. The MAXIMUM number of logical paths through the system is ten billion (10.E10); i.e., *IF* each module can take any one of its ten paths in response to any one of the ten paths from any one of the other ten modules, there are 10**10 possibilities. AGAIN assuming that the system always starts with A, runs in order, etc. *IF SEQUENCE IS SIGNIFICANT, and if the starting point is random, THEN there are ten!10.E10 paths; i.e., ten factorial times ten billion, or 36,288,000,000,000,000 possible paths in the system. Further, *IF INTERRUPTS* are allowed and are significant, then I can't compute the exact number of possible paths; but I can guarantee that it's >MORE> than 10!10.E10. End of bounds. The scope reaches from the trivial, to the impossible. The GOAL of good engineering practices [for hardware, software, and firmware] is to design and implement modules that control the possible paths; e.g., systems should *NOT* interact in every conceivable way. It does NOT follow that the interactions should be so restricted that there are only ten paths through a ten module system. BUT there is some reason to HOPE that systems may be so designed, in a tree structure such that: a. AT EACH LEVEL, exactly one module will be "in control" at any instant; b. and that each module will run independently of others at its level; c. and that there are a finite [and reasonably small] number of levels. In "Levels of Abstraction in Operating Systems", RIACS TR 84.5, Brown, Denning, and Tichy describe 15 levels, reaching from circuits to shell; applications sit at level 16. If one must have a layered application, then add layers 17, 18, et al. I will conjecture that at levels 1 and 2 [registers, and instruction set], there are only five possible states (each): (1) not running; (2) running - cannot be interrupted; (3) running - but at a possible interrupt point; (4) interrupted; and (5) error. I will further conjecture that the GOAL of writing modules at each of the other layers, from O/S kernel, through user application packages, can reasonably be to limit any one module to ten possible states. NOTE that purely "in line code" can perform numerous functions, without putting the module in more than a few states. [e.g., Running, Ready to run, Blocked, Intrerrupted, Critical region, or Error.] Such a system, comprised of say 15 applications layers, would assume maybe 290 possible states; that's the SUM of the number of possibilities at each layer, given the path that WAS ACTUALLY TAKEN to reach each layer. Yet the number of functions that such a system could perform is at least the sum of all the functions of all the modules in it. If you're willing to risk some interaction, then you can start playing with PRODUCTS [vice SUMS] of calling modules, called modules, etc. EVEN SO, if the calling module at layer "n" can assume half a dozen states, and the called module at layer "n+1" can assume a similar number, then the possible states of that pair are about 40; that's more than a dozen, but it's still managable. In real life, both humans and computers deal with enormously complex systems using similar schemes. For instance, two popular parlor games: chess, and contract bridge. Each admits millions of possible scenarios. But in each, the number of possible sensible *NEXT plays* is confined by the present state of affairs. So-called "look ahead" strategies grow very complex; but once a legal play has been made, there are again a small number of possible legal "next plays." In bridge, for instance, at least 635,013,559,600 possible hands can be dealt, to ONE player [combination of 52 things, 13 at a time]. That one hand does not uniquely determine the contents of the other three hands. Whether the hands interact is not a simple question in pure mathematics; in many cases, they do; but in one unique case, they don't; e.g., if dealer gets all 4 aces, and all 4 kings, all 4 queens, and any jack, then he bids 7 no trump; and it doesn't matter who else has what else; it's an unbeatable bid. [Non bridge players, accept both my word for it; and my apology for an obscure example.] We've been playing bridge a lot longer than we've been writing large, real- time software systems. I'll conjecture that we don't know nearly as much about "SDI class systems" as we do about the card game. But in either case, if we aren't careful, the sheer magnitude of the numbers can overwhelm us. BOTTOM LINEs: 1. The curve for debugging software has a DOWNslope and length that is some function of the number of possible paths through the code. 2. Good software engineering practice says that one checks the design before writing lots of code. ["Some" may be necessary, but not "lots."] *IF* errors show up in the design, fix them there. *IF* the DESIGN itself is flawed, then change it. [e.g., Rethink a design that allows modules to interact geometrically.] 3. Confidence builds as one approaches the 90% [or other arbitrary level] point in testing the number of possible paths. 4. The reason that we haven't built confidence in the past is that we've often run thousands of hours, without knowing either: a. how many paths got tested; or b. how many paths remained untested. 5. INTERACTIONS do occur - even ones that aren't supposed to. [Trivial example: My car's cooling and electrical systems are NOT supposed to interact; and they don't - until the heater hose springs a leak, and squirts coolant all over the distributor and sparkplugs.] In "The Arbitration Problem", RIACS TR 85.12, Dennning shows that computers are fundamentally NOT ABSOLUTELY predictable; it may be that an unstable state is triggered ONLY by timing idiosyncracies such as: At the same minor cycle of the clock, CPU #1 suffers a floating underflow in the midst of a vector multiplication, AND CPU #2 takes an I/O interrupt from a disk read error, while servicing a page fault. 6. Since interactions do occur, experiences that many have had with small programs in a well-confined environment do *NOT* necessarily "scale up" to apply to very large, real-time codes, that run on raw hardware in a hostile [or just "random"] environment. NOTE that I'm claiming that in such a system, the O/S kernel is part of the real-time system. 7. The "problem space" we've been discussing is at least triangular. In one corner, there are assembly language monoliths, running on second- generation computers, without hardware protection; such systems convince Parnas that "SDI won't ever work." Written that way, it won't. [Important aside: It's one thing to argue that *if* SDI were built using modern software techniques, it would work. It's another thing to realize that in DOD, some (not all) tactical systems run on ancient computers that cost more to maintain than they would to replace; and offer less power than a PC AT. Such facts, known to Parnas, understandably color his thinking.] In another corner, there are small [1000 or so lines] modules, running in a controlled environment, that and have been "proven" to work. Most of us doubt that such experience scales up to SDI sizes. In another corner, there are 100,000 line systems that work, in real life, but without formal proofs. Probably built using good S/W Eng practices. 8. The KISS principle ["Keep It Simple, Stupid"] eliminates lots of problems. Prof. Richard Sites, at UCSD in 1978, told of a talk given by Seymour Cray. In answer to audience questions about "how to make the circuits run at those speeds", Cray explained that circuit paths were all of known, fixed lengths; and that all paths were terminated cleanly at both ends; and other "good EE practices" taught to undergrads. Less successful builders were so wrapped up in megaFLOPS that they got careless. We could do well to adopt Cray's philosophy for hardware as we build our software; e.g., build "RISC" programs; write code that does only a few tasks, but does them very well, both quickly and reliably. Maybe that's one reason why UNIX systems are so portable, powerful, and popular? [Each module is simple; power comes from piping them together.] NOTE that I'm claiming that "RISC" computer architecture is not new; look at almost every machine that Cray has designed; instruction sets are limited, and their implementation is superb. Bob For the record, I'm speaking "off the record" and expressing personal opinion. ------------------------------ Date: 27 Sep 1986 1013-PDT From: Rem@IMSSS Subject: SDI PRK> Date: 11 Sep 86 04:06:42 GMT PRK> From: karn@petrus.arpa (Phil R. Karn) PRK> Subject: Re: SDI delta launch PRK> You may be correct. However, as with the Homing Overlay shot several PRK> years ago, the only conceivable purpose of this "experiment" was the PRK> political promotion of SDI to a technically naive electorate. As PRK> evidence for this assertion, I offer the following: Before the launch, PRK> the SDI people stated that while the launch preparations and mission PRK> were secret at that time, the results would be made public IF AND ONLY PRK> IF THE TEST SUCCEEDED. This could backfire on them if people are intelligent. If only successful tests are reported, then the number of unsuccessful unreported tests is unbounded. We could say that 99% of tests were failure and nobody could refute us using public information. Perhaps if we present this argument to the public, they will change their report-only-success policy? PRK> It is precisely because I am so pro-space that I get very angry when I PRK> see people perverting space technology in a program that a) cannot PRK> deliver on its promises, b) exploits the universal fear of nuclear war PRK> for political and pecuniary purposes and c) could well cause the very PRK> nuclear war that it was supposed to prevent. Assuming that the worst PRK> doesn't happen, there will be an inevitable, enormous and PRK> indiscriminate backlash against science and technology in general and PRK> space technology in particular when the general public eventually PRK> realizes how much money they've wasted on this fraud. This indeed seems possible. We need to emphasize our interest in peaceful use of space and our prediction that SDI will be worthless to the general public as much as possible so that when SDI indeed fails we can say "I told you so, you should've put your money in peaceful use of space instead of SDI" and avoid some of the backlash. (But then contacting the general public so they'll remember what we said instead of what they think we said is easier said then done.) PRK> If you think it's rough to get money for NASA or NSF now, just wait until PRK> the post-Star Wars backlash. It won't matter that a majority of the Nobel PRK> Laureates and NAS members back in the 1980's said that SDI was a bad idea. If we keep up the flood of anti-SDI pro-peaceful-space opinions without let-up, it'll be fresh in their minds when SDI fails, perhaps. PRK> In other words, I oppose SDI for the same reasons that Better Business PRK> Bureaus (which are supported by local businessmen) oppose shady PRK> businesses: enlightened professional self-interest. I like that analogy. (:- Maybe start the Better Space Bureau? :-) ------------------------------ End of Arms-Discussion Digest *****************************