Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site ames.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!gamma!epsilon!zeta!sabre!petrus!bellcore!decvax!genrad!panda!talcott!harvard!seismo!lll-crg!dual!ames!eugene
From: eugene@ames.UUCP (Eugene Miya)
Newsgroups: net.arch
Subject: Re: Scientific Computing and mips
Message-ID: <1119@ames.UUCP>
Date: Sun, 1-Sep-85 20:38:47 EDT
Article-I.D.: ames.1119
Posted: Sun Sep 1 20:38:47 1985
Date-Received: Wed, 4-Sep-85 05:21:08 EDT
References: <419@kontron.UUCP> <2300001@uicsl> <1093@ames.UUCP>
  <29898@lanl.ARPA> <1517@peora.UUCP> <30105@lanl.ARPA> <1062@sdcsvax.UUCP>
Organization: NASA-Ames Research Center, Mtn. View, CA
Lines: 272

Thank you for your patience. I have mulled some of this over, and I
have some incomplete thoughts, if you will bear with me. I have some
preliminary research, some of it graphical, on the Cray-1/S, X-MP, 2,
Cyber 205, Convex C-1, and ELXSI, and if I could find a VAX or a 68000
with a good clock, I would do them, too.

I think science proceeds in three phases:

    Detection [note: Boolean]. Lots of qualitative information [hunches].
    Identification [note: maybe enumerations, classes, sets, types].
    Detailed Analysis [more complex mathematics: stat, etc.].

I would like to summarize some of the issues posted so far. I mentioned
"atomicity" in the first posting. Atomic in two senses: small and
low-level, and indivisible. Scalability seems to be another [too many
ilities] concern. A variation of this is termed "realistic" by some.
[My comments indented from here.]

> Message-ID: <29898@lanl.ARPA>
> Another metric is a sampling of large, mostly unmodifiable, commercial
> codes. My preference is MSC/Nastran (.5e6+lines of finite element code).
> It runs on a surprising number of machines, and is a rigorous test of not
> only performance (cpu and io) but the scientific/engineering environment
> available to the average user.
> george spix gas@lanl

  [Portability, generality] I am surprised no one objected. The problem
  is that the behavior of large programs like this is not well
  understood. Sure, you can take a time, you can measure core usage,
  and so on, but it doesn't generalize, and the code still has
  problems, bugs, and desired new features; otherwise MSC would not
  still be in business [I'm stretching this one, I know]. This brings
  up the issue of analytic method [mentioned below]: separability of
  components. We have few measures of I/O, and if we add two things
  together, we get a sum.

  One problem with computer science is that we have poor empirical and
  laboratory skills and tools. [I am working on a series of TRs on
  experiment design in CS, but my management wants me to work on their
  stuff.] On one hand people aren't mathematical enough in their
  analysis, yet we tend to lack a lot of `controls' when taking
  measurements [too mathematical]. There seem to be gaps between those
  who do queueing models, those who insist on measuring the near-atomic
  qualities of systems, and letters like the above insisting on
  'realism.' Do these issues scale?

>Message-ID: <1512@peora.UUCP>
>
>> Whetstones were mentioned in another letter, but the only people who use
>> these are computer manufacturers.
>This statement isn't true. For example, back when I was a graduate-student
>researcher in computer architectures, we used the Whetstones to test our
>vertical-migration software.

  I would still place you in that category. Whetstones are rather
  arbitrary, and I did some work to locate the original book (not a
  paper) on ALGOL. Our problems have a greater degree of heterogeneity,
  as George points out above.

>> What qualities do our performance metrics need to have?
>
>I think you need to make your performance measurements in such a way that
>you get a set of distinct numbers which can be used analytically to determine
>performance for a given program if you know certain properties of the
>program.
>For example:
>
>1) The rate of execution of each member of the set of arithmetic operations
>provided by the machine's instruction set, ...
>..., with cache disabled.
>
>2) The rate of execution of 1-word memory-to-memory moves, with cache
>disabled.
>
>3) The rate of execution of a tight loop ...register-to-register
>moves, with cache disabled.
>
>4) The rate of execution of a tight loop ... , with cache enabled.
>
>5) The rate of execution of a tight loop performing (same word size as #3
>and #4 above) memory-to-memory moves that produce all cache "hits", with
>cache enabled. Note that this gives you two properties of your cache: your
>speedup for operand fetch and store resulting from caching, and any
>performance penalties resulting from a write-through vs. write-back cache.
>
>6) Specifications such as the number of registers available to the user,
>the size of the cache, etc.
>
>Well, you get the idea, anyway... personally I tend to feel that statistical
>performance measurements are not nearly as useful as analytical ones; I
>would rather see a list of fairly distinct performance properties of a pro-
>cessor anytime, since I think you can do more with them in terms of
>saying how the machine will perform for a given application that way.

  I agree with you here, but why do I say that?

>I separated out the various forms of caching (operations in registers, and
>use of a cache between the CPU and the primary memory) because so many
>people "fudge" their results that way without giving any information from
>which you can determine real performance. The above list is just meant to
>suggest "qualities" rather than being an exhaustive list; i.e., that the
>performance metrics should reveal (rather than hide) the set of factors
>that actually influence performance. [Unfortunately, this would never suit
>most marketing organizations nor customers, since they want an all-
>encompassing number.]

  Sad but true. The all-encompassing number is a problem.
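
  The analytic approach in the quoted list can be sketched roughly as
  follows. This is only an illustration under loud assumptions: the
  machine names, operation classes, rates, and operation counts are all
  hypothetical (nothing here was measured), and the model assumes the
  operation classes do not overlap in time, which real machines
  routinely violate.

```python
# Hypothetical per-operation rates in operations per second.
# These numbers are invented for illustration, not measurements.
RATES = {
    "machine_a": {"fp_add": 4.0e6, "fp_div": 0.5e6, "mem_move": 2.0e6},
    "machine_b": {"fp_add": 2.0e6, "fp_div": 1.0e6, "mem_move": 4.0e6},
}


def predicted_seconds(rates, op_counts):
    """Predict run time as the sum of count/rate over operation classes.

    Assumes each class executes serially with no overlap (an assumption).
    """
    return sum(count / rates[op] for op, count in op_counts.items())


# Two hypothetical programs described only by their operation counts.
add_heavy = {"fp_add": 8.0e6, "fp_div": 0.0, "mem_move": 2.0e6}
div_heavy = {"fp_add": 1.0e6, "fp_div": 4.0e6, "mem_move": 2.0e6}

for name, counts in (("add_heavy", add_heavy), ("div_heavy", div_heavy)):
    for machine, rates in RATES.items():
        print(name, machine, predicted_seconds(rates, counts))
```

  With these made-up numbers, machine_a is predicted to be faster on
  the add-heavy program and machine_b on the divide-heavy one, which is
  the kind of reversal a single all-encompassing number hides.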
  I would like to do a bit in functional analysis, principally
  vector-valued measures, but Alan Smith at Berkeley suggested staying
  with measures as simple as possible. [I agree in principle.] But I
  want to consider them [Alan suggests factor analysis at most].

>The metrics should also be compiler-independent.
>Shyy-Anzr: J. Eric Roskos

  How do you convince anyone of compiler independence? What about
  compiler dependence? What tests determine compiler characteristics?
  Limitations? OS independent, too. A problem with uniprocessor
  architectures, computers, and operating systems is that all are
  constructed in such a way as to make measurement difficult. Taking a
  measurement affects the thing being observed. High-level measurement
  concepts are lacking; instead we measure oscilloscope pulses and say
  so many x's were transferred only because we know how many x's there
  were to begin with.

> Message-ID: <1517@peora.UUCP>
> > It helps draw the line between special purpose and general purpose
> > environments (or, less tactfully, usable and unusable machines)..
>
> Would it be possible to discuss this here? . . .
> what properties of general purpose
> environments make them "unusable" for scientific/engineering computing?
> --
> Shyy-Anzr: J. Eric Roskos

  How do you measure an OS and make it distinct from the language, the
  compiler, and the algorithm used? Where does the fine line get drawn
  between a program and its translator unless you have a different
  compiler implementation to compare?

>Message-ID: <3517@dartvax.UUCP>
>> A problem is, certainly, how we measure things.
>It might be interesting to define some fairly simple standard operations
>and ask how long it takes to perform the operations. Typical standard

  Standards send a bit of a chill up my back. It's a bit early to
  standardize.
  The following are shortened, but much like the above:

>add -- takes two words (at least 32 bits) from memory, adds them together,
>index -- picks up an array offset from memory, performs bounds checking
>on the offset (we don't all write in C),
>ptr_load -- (P->Record.Field)
>array_loop -- load each element of an array into a register.
>
>These
>simple operations would be a better measure than even simpler instructions
>because each operation does something "useful". These operations can
>also have advantages over high-level language benchmarks because they are
>not dependent on the quality of a compiler.
>
>The qualities that I am aiming for here are primarily usefulness and
>simplicity.
>
>-- Chuck
>chuck@dartvax

  Dependence and independence seem to be a common theme. How dependent
  are most tests?

  Another problem is one of decomposition and parallelism. This will be
  especially important in future architectures. Are two operations
  performed sequentially equivalent to two operations performed in
  parallel? I think the answer is YES AND NO. We have a situation
  analogous to Brooks' Mythical Man-Month: you can have 9 women working
  9 months (81 woman-months) for 9 babies, but you can't get 9 women to
  work 1 month for 1 baby.

  Another problem, more down to earth, is the clock on a given system.
  Crays have a beautiful system clock. I cannot say the same for the
  Cyber. One of my problems is just to understand the behavior of
  different systems' clocks. Needless to say, ticks of 1/50 to 1/100 of
  a second don't cut it. Too much can happen during a tick. Repeating
  an operation many times and dividing afterward to get inside a tick
  leaves too much to compilers and OSes. I'd love to get my hands on a
  VAX with a 1 microsecond clock.

> jww@SDCSVAX.ARPA
> Another side issue that certain problems benchmark certain ways.
> For example, in supporting a SIMSCRIPT II.5 discrete-event simulation,
> we find that the best predictor of user performance is double-precision
> ("single" on your Cray, george) floating point speed.
> There are a
> lot of floating point comparisons on the event chain, plus the heavy
> use of psuedo-random gamma functions, etc. requires F.P. multiplies and
> divides.

  How many people really use gamma functions? [sorry, don't answer that]

  A local comment on this. One of our users gave a talk the other day.
  He placed a single statement of FORTRAN on the screen. The problem is
  a fluid dynamics problem, and notably this statement had 18 FP
  divisions on 3-D arrays [the user wanted to point that out: the Cray
  1/X division is relatively inefficient]. This says nothing of the +s
  and -s for the array indices of the 30-40 variables, the FP +s, -s,
  and *s, or the tremendous storage requirements. The user liked to
  point out that the CFT compiler, as much as we complain, reduced the
  18 divides to 7. The Cray's real power is what it does on the
  indices!

> For compilation, however, integer performance -- particularly simple moves
> and single-level indirect addressing -- is the best predictor of speed. ...
> That's why machines with
> strong integer scalar performance (e.g., Cray 1?) have it over those that
> focus only on MFLOP's.

  What machine only focuses on MFLOPS? Have you run on it? Good
  architectures, as Brian Reid pointed out in net.micro.mac, are a good
  balance of tradeoffs.

> Benchmarks typically are several hundred lines, with limited complexity
> and usually small data cases. If you want to test typical throughput,
> you need a typical program--even . . . 200,000
> lines of source. This also assures that if the system was "tuned",
> it was probably a very limited sort of tuning that any owner of such
> program would try anyway.

  Do we in principle really need the 200K line program? Why can't we
  come up with adequate smaller programs to give us an idea how the
  200K line program works? In other words, why does the US need a
  Missouri [a show me state]?
  Can't we just take it for granted the sun is 93 million miles away
  rather than remeasure it, so long as we know what a mile is?

  Our benchmarks tend to be too kind. We need benchmarks, I think,
  which deliberately `break machines' along the lines of those
  validation suites which check compiler limits, and so forth.

  On MWFs I tend to think that we can separate the OS, the compiler,
  and the language from the machine. On TTSs I think it is not
  possible. Today's Sunday, so I don't care.

>
> > It's my belief that this market requires
> > "general purpose architectures" with "general purpose (usable)
> > environments"
> > george spix gas@lanl
>
> There have been no shortages of proposed architectures. There
> haven't been as many "usable architectures,"

  [true]

>
> A clever user will take "Program A" and put it up on machines X,Y,Z,
> spending less than a week on each test. Which ever machine runs it
> fastest, wins. From the user's standpoint, that's much better
> than listening to MIP's, MFLOP's, or other mumbo-jumbo.
>
> Joel West CACI, Inc. - Federal (c/o UC San Diego)

  The tension here is between the desire to make general, portable,
  usable programs and the desire to take advantage of machine
  performance features. I sometimes wonder if we will really have a
  Cray-on-a-desk and then it passes ;-). Few consider what a Cray is:
  word-oriented, big memory, vector registers, underdeveloped software
  [oops, sorry Bence and George].

Lastly, I wish to thank LLNL, Cray, and Convex for time on some of
their machines. I tried cutting this down more; I will do better next
time. Sorry for rambo-ing, :-), I am still working on these ideas. Some
of my existing prototype tests look at memory contention, vector
instruction sets, compiler tricks and limitations.

--eugene miya
  NASA Ames Research Center
  {hplabs,ihnp4,dual,hao,decwrl,allegra}!ames!aurora!eugene
  emiya@ames-vmsb