Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site ames.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!gamma!epsilon!zeta!sabre!petrus!bellcore!decvax!genrad!panda!talcott!harvard!seismo!lll-crg!dual!ames!eugene
From: eugene@ames.UUCP (Eugene Miya)
Newsgroups: net.arch
Subject: Re: Scientific Computing and mips <sorry, obscenely long>
Message-ID: <1119@ames.UUCP>
Date: Sun, 1-Sep-85 20:38:47 EDT
Article-I.D.: ames.1119
Posted: Sun Sep  1 20:38:47 1985
Date-Received: Wed, 4-Sep-85 05:21:08 EDT
References: <419@kontron.UUCP> <2300001@uicsl> <1093@ames.UUCP>
Organization: NASA-Ames Research Center, Mtn. View, CA
Lines: 272

<29898@lanl.ARPA> <1517@peora.UUCP> <30105@lanl.ARPA> <1062@sdcsvax.UUCP>

Thank you for your patience.  I have mulled some of this over, and
I have some incomplete thoughts, if you will bear with me.  I have some
preliminary research, some of it graphical, on the Cray-1/S, X-MP, 2, Cyber
205, Convex C-1, and ELXSI, and if I could find a VAX or a 68000 with
a good clock, I would do them, too.

I think science proceeds in three phases:
	Detection [note: Boolean], lots of qualitative information [hunches]
	Identification [note: maybe enumerations,classes,sets,types]
	Detailed Analysis [more complex mathematics: stat, etc.]
I would like to summarize some of the issues posted so far.  I mentioned
"atomicity" in the first posting.  Atomic in two senses: small and
low-level, and indivisible.  Scalability seems to be another [too many
ilities] concern.  A variation of this is termed "realistic" by some.
[My comments indented from here.]

> Message-ID: <29898@lanl.ARPA>
> Another metric is a sampling of large, mostly unmodifiable, commercial
> codes.  My preference is MSC/Nastran (.5e6+lines of finite element code).
> It runs on a surprising number of machines, and is a rigorous test of not
> only performance (cpu and io) but the scientific/engineering environment
> available to the average user.
> george spix    gas@lanl
	[Portability, generality]
	I am surprised no one objected.  The problem with this is that the
	behavior of large programs like this is not well understood.  Sure, you
	can take a timing, you can measure core usage, and so on, but it
	doesn't generalize, and the code still has problems, bugs, desired
	new features; otherwise MSC would not still be in business.
	[I'm stretching this one, I know.]

	This brings up the issue of analytic method: [mentioned below]
		separability of components.
	We have few measures of I/O, and if we add two things
	together, we get a sum.  One problem with computer science
	is that we have poor empirical and laboratory skills and tools.
	[I am working on a series of TRs on experiment design in CS,
	but my management wants me to work on their stuff.]
	On one hand people aren't mathematical enough in their analysis,
	yet we tend to lack a lot of `controls' when taking measurements
	[too mathematical].

	There seem to be gaps between those who do
	queueing models, those who insist on measuring the near
	atomic qualities of systems, and letters like the above
	insisting on 'realism.'  Do these issues scale?

>Message-ID: <1512@peora.UUCP>
>
>> Whetstones were mentioned in another letter, but the only people who use
>> these are computer manufacturers.
>This statement isn't true.  For example, back when I was a graduate-student
>researcher in computer architectures, we used the Whetstones to test our
>vertical-migration software.

	I would still place you in that category.  Whetstones are
	rather arbitrary and I did some work to locate the original
	book (not a paper) on ALGOL.  Our problems have a greater
	degree of heterogeneity as George points out above.

>> What qualities do our performance metrics need to have?
>
>I think you need to make your performance measurements in such a way that
>you get a set of distinct numbers which can be used analytically to determine
>performance for a given program if you know certain properties of the
>program.  For example:
>
>1) The rate of execution of each member of the set of arithmetic operations
>provided by the machine's instruction set, ...
>..., with cache disabled.
>
>2) The rate of execution of 1-word memory-to-memory moves, with cache
>disabled.
>
>3) The rate of execution of a tight loop ...register-to-register
>moves, with cache disabled.
>
>4) The rate of execution of a tight loop ... , with cache enabled.
>
>5) The rate of execution of a tight loop performing (same word size as #3
>and #4 above) memory-to-memory moves that produce all cache "hits", with
>cache enabled.  Note that this gives you two properties of your cache: your
>speedup for operand fetch and store resulting from caching, and any
>performance penalties resulting from a write-through vs. write-back cache.
>
>6) Specifications such as the number of registers available to the user,
>the size of the cache, etc.
>
>Well, you get the idea, anyway... personally I tend to feel that statistical
>performance measurements are not nearly as useful as analytical ones; I
>would rather see a list of fairly distinct performance properties of a pro-
>cessor anytime, since I think you can do more with them in terms of
>saying how the machine will perform for a given application that way.

	I agree with you here, but why do I say that?
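
	A minimal sketch of the analytic approach quoted above: given a set
	of separately measured rates and a count of how often a program uses
	each operation, predict its run time as a sum of count/rate terms.
	The rate table, the operation names, and the program counts are all
	invented for illustration; they are not measurements of any machine.
	The model also assumes the components are separable (no overlap,
	pipelining, or cache interaction), which is exactly the assumption
	in question.

```python
# Hypothetical per-operation rates in operations per second.
# These figures are made up for the example, not measured.
RATES = {
    "fp_add": 5.0e6,
    "fp_mul": 4.0e6,
    "fp_div": 1.0e6,
    "mem_move": 2.0e6,
}

def predict_seconds(op_counts, rates):
    """Predict run time as the sum of count / rate over each operation class."""
    total = 0.0
    for op, count in op_counts.items():
        if op not in rates:
            raise KeyError("no measured rate for operation: " + op)
        total += count / rates[op]
    return total

# Hypothetical operation counts for one program.
program = {
    "fp_add": 10_000_000,
    "fp_mul": 8_000_000,
    "fp_div": 1_000_000,
    "mem_move": 4_000_000,
}

print(predict_seconds(program, RATES))  # 2 + 2 + 1 + 2 seconds
```

	Comparing such a prediction against a measured run is one way to see
	how far the separability assumption holds for a given code.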

>I separated out the various forms of caching (operations in registers, and
>use of a cache between the CPU and the primary memory) because so many
>people "fudge" their results that way without giving any information from
>which you can determine real performance.  The above list is just meant to
>suggest "qualities" rather than being an exhaustive list; i.e., that the
>performance metrics should reveal (rather than hide) the set of factors
>that actually influence performance. [Unfortunately, this would never suit
>most marketing organizations nor customers, since they want an all-
>encompassing number.]
	Sad but true.  The all-encompassing number is a problem.
	I would like to do a bit in functional analysis: principally
	vector valued measures, but Alan Smith at Berkeley suggested
	staying with measures as simple as possible. [I agree in principle.]
	But I want to consider them [Alan suggests factor analysis at most].
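
	One simple thing a vector of measures allows that a single number
	does not is a dominance test: machine A is clearly better than B only
	if it is at least as good on every component.  The sketch below
	assumes larger values are better; the three components and all the
	numbers are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere.

    Assumes both vectors list the same measures in the same order and
    that larger values are better.
    """
    if len(a) != len(b):
        raise ValueError("measure vectors must have the same length")
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better

# Hypothetical (scalar rate, vector rate, I/O rate) for three machines.
machine_x = (4.0, 60.0, 2.0)
machine_y = (6.0, 20.0, 3.0)
machine_z = (3.0, 15.0, 1.0)

print(dominates(machine_x, machine_z))  # True: x wins on every measure
print(dominates(machine_x, machine_y))  # False: y has the better scalar rate
print(dominates(machine_y, machine_x))  # False: x has the better vector rate
```

	Machines x and y are incomparable here, and any single
	all-encompassing number would have to hide that.
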
>The metrics should also be compiler-independent.
>Shyy-Anzr:  J. Eric Roskos

	How do you convince anyone of compiler independence?
	What about compiler dependence?  What tests determine compiler
	characteristics?  Limitations?
	OS independent, too.
	A problem with uniprocessor architectures, computers, and operating
	systems is that all are constructed in such a way as to make
	measurement difficult.  Taking a measurement affects the
	thing being observed.  High-level measurement concepts are lacking;
	instead we measure oscilloscope pulses and say so many x's were
	transferred only because we know how many x's there were to begin
	with.
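
	A sketch of the "measurement affects the thing observed" problem:
	reading the clock itself costs time, so one common correction is to
	estimate that cost from back-to-back readings and subtract it.  The
	function names and the fake clock are hypothetical; the fake clock
	exists only to make the arithmetic visible.

```python
import time

def clock_overhead(clock=time.perf_counter, runs=1000):
    """Smallest gap seen between two back-to-back clock readings."""
    best = float("inf")
    for _ in range(runs):
        start = clock()
        stop = clock()
        best = min(best, stop - start)
    return best

def timed(work, clock=time.perf_counter, runs=1000):
    """Time work() once and subtract the estimated cost of the probe."""
    overhead = clock_overhead(clock, runs)
    start = clock()
    work()
    stop = clock()
    return max(0.0, (stop - start) - overhead)

class FakeClock:
    """Hypothetical clock: every reading costs exactly 1.0 time unit."""
    def __init__(self):
        self.t = 0.0
    def __call__(self):
        self.t += 1.0
        return self.t

fake = FakeClock()

def fake_work():
    fake.t += 10.0  # pretend the work takes 10 units

# Raw difference would be 11 units; the correction recovers 10.
corrected = timed(fake_work, clock=fake, runs=5)
print(corrected)  # 10.0
```

	On a real system the overhead is not constant, so the correction is
	itself only an estimate.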

> Message-ID: <1517@peora.UUCP>
> > It helps draw the line between special purpose and general purpose
> > environments (or, less tactfully, usable and unusable machines)..
> 
> Would it be possible to discuss this here? . . .
> what properties of general purpose
> environments make them "unusable" for scientific/engineering computing?
> -- 
> Shyy-Anzr:  J. Eric Roskos

	How do you measure an OS and make it distinct from the language,
	the compiler, and the algorithm used?  Where does the fine line
	get drawn between a program and its translator unless you
	have a different compiler implementation to compare?

>Message-ID: <3517@dartvax.UUCP>
>> A problem is, certainly, how we measure things.
>It might be interesting to define some fairly simple standard operations
>and ask how long it takes to perform the operations.  Typical standard

	Standards send a bit of a chill up my back.  It's a bit
	early to standardize.
	The following are shortened, but much like the above:
>add -- takes two words (at least 32 bits) from memory, adds them together,
>index -- picks up an array offset from memory, performs bounds checking
>on the offset (we don't all write in C),
>ptr_load --  (P->Record.Field)
>array_loop -- load each element of an array into a register.
>
>These
>simple operations would be a better measure than even simpler instructions
>because each operation does something "useful".  These operations can
>also have advantages over high-level language benchmarks because they are
>not dependent on the quality of a compiler.
>
>The qualities that I am aiming for here are primarily usefulness and 
>simplicity.
>
>-- Chuck
>chuck@dartvax

	Dependence and independence seem to be a common theme.
	How dependent are most tests?

	Another problem is one of decomposition and parallelism.
	This will be especially important in future architectures.
	Are two operations performed sequentially equivalent to two
	operations performed in parallel?  I think the answer is YES AND NO.
	We have a situation analogous to Brooks' Mythical Man-Month:
	you can have 9 women working 9 months (81 woman-months) for 9 babies,
	but you can't get 9 women to work 1 month for 1 baby.
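
	The YES AND NO can be put as arithmetic.  The sketch below assumes
	every task is indivisible and takes the same time on one worker, a
	deliberately crude model: adding workers raises throughput, but the
	elapsed time for a single task never drops below one task time.

```python
def elapsed(tasks, workers, task_time):
    """Elapsed time for identical, indivisible tasks spread over workers."""
    if tasks < 0 or workers < 1:
        raise ValueError("need tasks >= 0 and workers >= 1")
    rounds = (tasks + workers - 1) // workers  # integer ceiling of tasks / workers
    return rounds * task_time

print(elapsed(9, 1, 9))  # 81: nine tasks done one after another
print(elapsed(9, 9, 9))  # 9: nine tasks side by side
print(elapsed(1, 9, 9))  # 9: one task gains nothing from nine workers
```

	The same total work (81 units) gives very different elapsed times
	depending on whether it decomposes.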

	Another problem, more down to earth, is the clock on a given system.
	Crays have a beautiful system clock.  I cannot say the same for
	the Cyber.  One of my problems is just to understand the behavior
	of different system clocks.  Needless to say, clocks that tick every
	1/50 to 1/100 of a second don't cut it.  Too much can happen during
	a tick.  Repeating an operation many times and dividing the elapsed
	ticks by the count leaves too much to compilers and OSes.  I'd love
	to get my hands on a VAX with a 1 microsecond clock.
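
	A sketch of two things one can do about a coarse clock: find the
	smallest step the clock ever shows, and work out how many repetitions
	of an operation are needed before one tick of uncertainty is a small
	fraction of the total.  The function names are hypothetical, and the
	3 microsecond operation time is an invented figure.

```python
import math
import time

def observed_tick(clock=time.perf_counter, samples=20):
    """Smallest nonzero step seen between successive clock readings."""
    smallest = math.inf
    for _ in range(samples):
        t0 = clock()
        t1 = clock()
        while t1 == t0:  # spin until the clock visibly advances
            t1 = clock()
        smallest = min(smallest, t1 - t0)
    return smallest

def repetitions_needed(tick, op_time, rel_error):
    """Repetitions so that one tick is at most rel_error of the total time."""
    return math.ceil(tick / (rel_error * op_time))

print(observed_tick())
# A 1/50 second clock, a 3 microsecond operation, 1% tolerated error:
print(repetitions_needed(1 / 50, 3e-6, 0.01))  # 666667
```

	Several hundred thousand repetitions is a long time for a compiler
	to hoist, fold, or delete the loop body and for the OS to intervene.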

> 	jww@SDCSVAX.ARPA
> Another side issue that certain problems benchmark certain ways.
> For example, in supporting a SIMSCRIPT II.5 discrete-event simulation,
> we find that the best predictor of user performance is double-precision
> ("single" on your Cray, george) floating point speed.  There are a
> lot of floating point comparisons on the event chain, plus the heavy
> use of psuedo-random gamma functions, etc. requires F.P. multiplies and
> divides.
	How many people really use gamma functions?  [sorry, don't answer that]
	A local comment on this.  One of our users gave a talk the other
	day.  He placed a single statement of FORTRAN on the screen.
	The problem is a fluid dynamics problem and, notably, this
	statement had 18 FP divisions on 3-D arrays [user wanted to point that
	out: the Cray 1/X division is relatively inefficient].  This says
	nothing of the +s and -s for the array indices of the 30-40
	variables, the FP +s, -s, and *s, or the tremendous storage
	requirements.  The user liked to point out that the CFT compiler,
	as much as we complain, reduced the 18 divides to 7.
	The Cray's real power is what it does on the indices!
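
	One way a compiler can cut a divide count is to notice several
	divisions by the same denominator and replace them with one
	reciprocal and several multiplies.  Which transformations CFT
	actually applied to that statement is not known here, so the sketch
	below is only an assumed illustration of the general idea, with
	made-up operands.

```python
import math

def naive(a, b, c, d):
    """Three divisions by the same denominator."""
    return a / d + b / d + c / d

def hoisted(a, b, c, d):
    """One division, then three multiplications by the reciprocal."""
    r = 1.0 / d
    return a * r + b * r + c * r

x = naive(1.0, 2.0, 3.0, 7.0)
y = hoisted(1.0, 2.0, 3.0, 7.0)
print(x, y)
print(math.isclose(x, y, rel_tol=1e-12))  # True
```

	The two forms are not guaranteed to agree in the last bits, which
	is one reason such a rewrite is a compiler policy decision and not
	a free win.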

> For compilation, however, integer performance -- particularly simple moves
> and single-level indirect addressing -- is the best predictor of speed.
...
> That's why machines with 
> strong integer scalar performance (e.g., Cray 1?) have it over those that 
> focus only on MFLOP's.

	What machine only focuses on MFLOPS?  Have you run on it?
	Good architectures, as Brian Reid pointed out in net.micro.mac,
	are a good balance of tradeoffs.

> Benchmarks typically are several hundred lines, with limited complexity
> and usually small data cases.  If you want to test typical throughput,
> you need a typical program--even . . . 200,000
> lines of source.  This also assures that if the system was "tuned",
> it was probably a very limited sort of tuning that any owner of such
> program would try anyway.
	Do we in principle really need the 200K line program?
	Why can't we come up with adequate smaller programs to give us an
	idea how the 200K line program works?  In other words,
	why does the US need a Missouri [a show-me state]?  Can't
	we just take it for granted the sun is 93 million miles away
	rather than remeasure it, so long as we know what a mile is?

	Our benchmarks tend to be too kind.  We need benchmarks, I think,
	which deliberately `break machines' along the lines of those
	validation suites which check compiler limits, and so forth.
	On MWFs I tend to think that we can separate the OS, the compiler,
	and the language from the machine.  On TTSs I think it is not possible.
	Today's Sunday, so I don't care.
> 
> > It's my belief that this market requires
> > "general purpose architectures" with "general purpose (usable)
> > environments"
> > 	george spix       gas@lanl
> 
> There have been no shortages of proposed architectures.  There
> haven't been as many "usable architectures," [true]
> 
> A clever user will take "Program A" and put it up on machines X,Y,Z,
> spending less than a week on each test.  Which ever machine runs it
> fastest, wins.  From the user's standpoint, that's much better
> than listening to MIP's, MFLOP's, or other mumbo-jumbo.
> 
> 	Joel West	CACI, Inc. - Federal (c/o UC San Diego)

	The tension here is between the desire to make general, portable,
	usable programs and the desire to take advantage of machine
	performance features.

	I sometimes wonder if we will really have a Cray-on-a-desk
	and then it passes ;-).  Few consider what a Cray is:
	word-oriented, big memory, vector registers, underdeveloped
	software [oops, sorry Bence and George].

Lastly, I wish to thank LLNL, Cray, and Convex for time on some of
their machines.  I tried cutting this down more; I will try to do better
next time.  Sorry for rambo-ing, :-), I am still working on these
ideas.  Some of my existing prototype tests look at memory contention,
vector instruction sets, compiler tricks and limitations.

--eugene miya
  NASA Ames Research Center
  {hplabs,ihnp4,dual,hao,decwrl,allegra}!ames!aurora!eugene
  emiya@ames-vmsb