Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!watmath!clyde!caip!topaz!rutgers!husc6!panda!genrad!decvax!decwrl!labrea!glacier!mips!hansen
From: hansen@mips.UUCP
Newsgroups: net.arch
Subject: Re: Floating point performance & Mr. Mashey's Mythical Mhz
Message-ID: <726@mips.UUCP>
Date: Sun, 19-Oct-86 03:15:53 EDT
Article-I.D.: mips.726
Posted: Sun Oct 19 03:15:53 1986
Date-Received: Tue, 21-Oct-86 21:31:53 EDT
References: <340@euroies.UUCP> <1989@videovax.UUCP> <722@mips.UUCP> <377@garth.UUCP>
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 177

In article <377@garth.UUCP> kissell@garth.UUCP (Kevin Kissell) writes:
>I don't understand how someone of John's sophistication can insist on
>repeating such a clearly fallacious argument.  The statement "cycle time
>is likely to be a property of the technology" is simply untrue, as I have
>pointed out in previous postings.  Cycle time is a the product of gate delays
>(a property of technology) and the number of sequential gates between latches
>(a property of architecture).  For example, let us consider two machines
>that are familiar to John and myself and yet of interest to the newsgroup:
>the MIPS R2000 and the Fairchild Clipper.  An 8 Mhz R2000 has a cycle time
>of 125ns.  A 33Mhz Clipper has a cycle time of 30ns.  Yet both are built
>with essentially the same 2-micron CMOS technology.  I somehow doubt that
>Fairchild's CMOS transistors switch four times faster than that of whoever
>is secretly building R2000s this week.  The difference is architectural.

	"cycle time is likely to be a property of the technology" is
	clearly a simplification that is useful for making relatively
	crude comparisons between widely varying machine designs.
	Cycle time, while a crude measure, has the advantage
	that it is clearly observable and well-documented.

	In practice, the number of sequential gates between latches
	is also generally a property of the technology, given that
	designers are attempting to optimize their own design.
	It is counterproductive to over-pipeline a design, as
	pipe registers themselves add delay and complexity.
	Let me emphasize, however, that I do not intend to
	assert that the Fairchild design is over-pipelined.

	Now let us address the general issue of a comparison
	of the technology of the two machines discussed above,
	(two machines that were clearly chosen entirely at random).
	It is indeed safe to assume that an 8 MHz R2000 has a
	cycle time of 125 ns. However, 8 MHz is not the maximum
	clock rate that the silicon will support - that figure
	is 16.67 MHz, or a cycle time of 60 ns (worst case over commercial
	temperatures). This 16.67 MHz R2000 part is built in a
	2-micron CMOS technology, and Fairchild's part is
	built in a process that is also described as a 2-micron CMOS
	technology. However, the phrase "2-micron CMOS technology"
	is actually very vague.
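
	As a check on the arithmetic, cycle time is just the reciprocal
	of the clock rate. A minimal sketch (the function name is
	hypothetical, chosen only for illustration):

```python
def cycle_time_ns(clock_mhz):
    """Return the clock period in nanoseconds for a clock rate in MHz."""
    return 1000.0 / clock_mhz

# Clock rates quoted in this thread.
for name, mhz in [("R2000 at 8 MHz", 8.0),
                  ("R2000 at 16.67 MHz", 16.67),
                  ("Clipper at 33 MHz", 33.0)]:
    print(f"{name}: {cycle_time_ns(mhz):.1f} ns")
```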
	
	The available public literature from both companies is
	not sufficient to compare these technologies point-by-point,
	but I fully expect that Fairchild has pushed harder
	on effective transistor gate length and oxide thickness to
	reach 33 MHz than MIPS has yet employed to reach 16.67 MHz.
	A difference in comparable gate speed of a factor of
	two is actually entirely plausible, though we believe the
	actual ratio is more on the order of 1.5.
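
	To make the decomposition concrete: if cycle time is gate delay
	times the number of gate levels between latches (Kevin's own
	formula), the ratio of clock rates factors into a technology
	term and a pipelining term. A rough sketch, treating the 1.5
	gate-speed ratio as an assumption rather than a measured figure:

```python
clipper_mhz = 33.0
r2000_mhz = 16.67
gate_speed_ratio = 1.5  # assumed: Clipper gates about 1.5x faster

# Ratio of the two clock rates (Clipper / R2000).
clock_ratio = clipper_mhz / r2000_mhz

# cycle_time = gate_delay * gate_levels, so whatever part of the clock
# ratio is not explained by faster gates must come from fewer gate
# levels per pipeline stage (R2000 levels / Clipper levels).
gate_levels_ratio = clock_ratio / gate_speed_ratio

print(f"clock ratio:       {clock_ratio:.2f}")
print(f"gate levels ratio: {gate_levels_ratio:.2f}")
```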

	We have been getting our process technology from the same
	suppliers week after week. By using a slightly less aggressive
	technology, we are able to get reliable, multiple-sourced processing.

>As I understand it, the R2000 was designed to take advantage of delayed
>load/branch techniques, and to execute instructions in a small number of
>clocks, which in fact go hand-in-hand.  A load or branch can take as little
>as two clocks.  But the addition of two numbers cannot take less than one
>clock, and so the ALU has a leasurely 125ns to do something that it could
>in principle have done more quickly, had it been more heavily pipelined.

	I have to disagree on several of the points claimed here.
	The R2000 design will execute load and branch instructions
	at a rate of one instruction per cycle (a 60 ns cycle),
	and takes one 60 ns cycle to perform an integer ALU operation.
	In fact, the R2000 will execute ALL instructions in a
	single cycle, which substantially simplified the design.
	It is, of course, entirely untrue that the addition of
	two numbers cannot take less than one clock, but this is
	not the heart of the matter: the integer ALU is not
	the critical path in the R2000 design.

>The Clipper was designed from fairly well-established supercomputer and
>mainframe techniques.  The cycle time is the time required to do the smallest
>amount  of useful work - an integer ALU operation at 30ns.  Other instructions
>must then of course be multiples of that basic unit.  Assuming cache hits,
>a load takes 4/6 clocks (120/180ns vs 250ns for the R2000) and a branch takes
>9 (270ns vs. 250ns for the R2000).

	Correcting the numbers above, we have 120/180 ns (Clipper)
	vs. 60 ns (R2000) for a load, and 270 ns vs 60 ns for a branch.
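
	The corrected figures follow directly from clock counts times
	cycle time. A small sketch: the Clipper clock counts are the
	ones Kevin quotes (cache hits assumed), the R2000 counts are the
	one-per-cycle rate claimed above, and the names are illustrative:

```python
CLIPPER_CYCLE_NS = 30  # 33 MHz, rounded as in the quoted article
R2000_CYCLE_NS = 60    # 16.67 MHz

# Clocks per operation; a tuple holds the alternatives quoted (4/6 for loads).
clipper_clocks = {"load": (4, 6), "branch": (9,)}
r2000_clocks = {"load": (1,), "branch": (1,)}

def to_ns(clocks, cycle_ns):
    """Convert clock counts per operation into nanoseconds."""
    return {op: tuple(n * cycle_ns for n in counts)
            for op, counts in clocks.items()}

clipper_ns = to_ns(clipper_clocks, CLIPPER_CYCLE_NS)
r2000_ns = to_ns(r2000_clocks, R2000_CYCLE_NS)

for op in ("load", "branch"):
    print(op, "Clipper:", clipper_ns[op], "ns  R2000:", r2000_ns[op], "ns")
```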

>It should be noted that both machines allow for the overlapped execution
>of instructions, but in different ways.  The R2000 overlaps register
>operations with loads and branches using delay slots.  The Clipper
>overlaps loads but not branches, using resource scoreboarding instead
>of delay slots.  This means that the R2000 can branch more efficiently
>(assuming the assembler can fill the delay slot), but the Clipper can
>have more instructions executing concurrently than the R2000 (4 vs 2)
>in in-line code.

	Resource scoreboarding is no more effective at using load
	delay slots (which are delays inherent in the computation)
	than static scheduling. Since a scoreboard controller issues
	instructions in the order in which they are presented,
	an operation that depends on the value of
	a pending load instruction must wait for
	the load to complete on either machine. The number of
	delay cycles is, however, an important factor in
	determining performance. It is hardly advantageous
	to have 4 cycle (or is it 6 cycle?) load instructions,
	no matter how slickly this is portrayed as a feature with
	the phrase "can have more instructions executing concurrently."
	The R2000 can fill the delay slot with a useful instruction,
	(which can even be an additional load instruction) over 70%
	of the time. With what frequency can Clipper compilers find
	three instructions, none of which can be a load, to
	fill the three load delay slots on a Clipper?
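
	The point can be illustrated with a toy model of in-order,
	single-issue execution. This is a sketch under stated
	assumptions, not a description of either chip: the code fragment
	is hypothetical, and load latencies of 2 and 4 cycles stand in
	for "one delay slot" and "three delay slots". Whether the wait
	is enforced by a hardware interlock or by compiler-inserted
	no-ops, the dependent instruction issues at the same cycle.

```python
def issue_times(program, load_latency, alu_latency=1):
    """Return the issue cycle of each instruction in an in-order,
    single-issue machine. An instruction issues no earlier than one
    cycle after its predecessor, and not before its sources are ready."""
    ready = {}       # register -> cycle at which its value is available
    times = []
    next_slot = 0    # earliest cycle the next instruction may issue
    for op, dest, srcs in program:
        t = max([next_slot] + [ready.get(r, 0) for r in srcs])
        times.append(t)
        ready[dest] = t + (load_latency if op == "load" else alu_latency)
        next_slot = t + 1
    return times

# Hypothetical fragment: a load, one independent add, then a use of the load.
program = [
    ("load", "r1", ["r9"]),
    ("add",  "r2", ["r3", "r4"]),  # independent of the load
    ("add",  "r5", ["r1", "r2"]),  # needs the loaded value
]

one_slot = issue_times(program, load_latency=2)     # one load delay slot
three_slots = issue_times(program, load_latency=4)  # three load delay slots

for label, times in (("one slot", one_slot), ("three slots", three_slots)):
    stalls = times[-1] - (len(times) - 1)
    print(f"{label}: issue cycles {times}, stall cycles {stalls}")
```

	With one independent instruction available, the single slot is
	filled and nothing stalls; with three slots, two cycles are lost
	unless two more independent instructions can be found.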

>Draw your own conclusions about "architectural efficiency".

	The Clipper designers claim 5 MIPS performance at 33 MHz,
	while the R2000 performs at 10 MIPS at 16.67 MHz.
	The Fairchild technology is as much as twice as
	aggressive as the R2000 technology, but the Clipper
	only achieves half the performance. My conclusion
	is that the R2000 is two to four times as "efficient"
	an architecture.

	For Clipper to reach the same performance in the same technology,
	using their current architecture, they need 66 MHz parts,
	with an input clock rate well above the broadcast FM radio band.
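
	The arithmetic behind these figures, as a sketch. The
	"technology advantage" factor is an assumption ranging from 1
	(equal processes) to 2 (Fairchild's process twice as fast), and
	the 66 MHz figure assumes MIPS scales linearly with clock rate:

```python
clipper_mips, clipper_mhz = 5.0, 33.0
r2000_mips, r2000_mhz = 10.0, 16.67

# Raw performance ratio, R2000 over Clipper.
perf_ratio = r2000_mips / clipper_mips

def efficiency_ratio(tech_advantage):
    """Relative architectural efficiency (R2000 / Clipper) after
    crediting the Clipper's process with the given speed advantage."""
    return perf_ratio * tech_advantage

# Clock rate the Clipper would need to match the R2000's MIPS rating,
# assuming performance scales linearly with clock rate.
required_clipper_mhz = clipper_mhz * r2000_mips / clipper_mips

for tech in (1.0, 1.5, 2.0):
    print(f"technology advantage {tech}: efficiency ratio {efficiency_ratio(tech)}")
print(f"Clipper clock needed for {r2000_mips} MIPS: {required_clipper_mhz} MHz")
```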

>>Machine	Mhz	KWhet	KWhet/Mhz
>>80287		 8	 300	 40
>>32332-32081	15	 728	 50		(these from Ray Curry,
>>32332-32381	15	1200	 80		in <3833@nsc.UUCP>) (projected)
>>32332-32310	15	1600	100*		"" "" (projected)
>>Clipper?	33	1200?	 40		guess? anybody know better #?
>>68881		12.5	 755	 60		(from discussion)
>>68881		20	1240	 60		claimed by Moto, in SUN3-260
>>SUN FPA	16.6	1700	100*		DP (from Hough) (in SUN3-160)
>>MIPS R2360	 8	1160	140*		DP (interim, with restrictions)
>>MIPS R2010	 8	4500	560		DP (simulated)
>
>John's guess for the Clipper is off by over a factor of two.  The Clipper
>FORTRAN compiler was brought up only recently.  In its present sane but
>unoptimizing state, I obtained the following result on an Interpro 32C
>running CLIX System V.3 at 33 Mhz (1 wait state), using a prototype Green
>Hills Clipper FORTRAN compiler with Fairchild math libraries:
>
>		Mhz	Kwhet	Kwhet/Mhz
>Clipper	33	2920	Who cares?  Kwhet/Kg and Kwhet/cm2 are of
>				more practical consequence.
>
>Kevin D. Kissell
>Fairchild Advanced Processor Division

Clipper		33	2920	90 = Kwhet/MHz

	I'd like to thank Kevin for providing this performance data
	and point out that this ratio is a respectable accomplishment
	on Fairchild's part - this number is comparable to the
	values obtained by using multiple-chip FP processors
	built with Weitek arithmetic units and interfaced to
	microcoded processors. While the FP arithmetic operations
	take longer in the Clipper than in Weitek parts
	(which are built in an unmistakably slower technology),
	by reducing communications overhead, the overall performance
	comes out comparably well.
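
	For reference, the Kwhet/MHz column can be recomputed from the
	quoted table. A sketch using the figures as posted, with the
	Clipper row taken from Kevin's 2920 Kwhet measurement; the table
	itself rounds these ratios to one or two significant digits:

```python
# (machine, clock in MHz, Kwhetstones) as quoted in this thread.
rows = [
    ("80287", 8, 300),
    ("32332-32081", 15, 728),
    ("32332-32381 (projected)", 15, 1200),
    ("32332-32310 (projected)", 15, 1600),
    ("68881", 12.5, 755),
    ("68881 (SUN3-260)", 20, 1240),
    ("SUN FPA", 16.6, 1700),
    ("MIPS R2360", 8, 1160),
    ("MIPS R2010 (simulated)", 8, 4500),
    ("Clipper", 33, 2920),
]

# Kwhet per MHz for each machine, unrounded.
ratios = {name: kwhet / mhz for name, mhz, kwhet in rows}

print(f"{'Machine':26s}{'MHz':>6s}{'Kwhet':>7s}{'K/MHz':>8s}")
for name, mhz, kwhet in rows:
    print(f"{name:26s}{mhz:6.1f}{kwhet:7d}{ratios[name]:8.1f}")
```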

	Let me make clear why Kwhet/MHz or MIPS/MHz ratios are useful: 
	they provide some insight into where the emphasis was placed 
	in the design, and where future derivative designs can reach. 
	It's my view that Kevin's remarks confirm that the Clipper design 
	was intended from the start to build a machine with a low MIPS/MHz
	ratio, with the clock rate based on the lowest conceivable
	executable unit. It should also be clear what level of
	architectural efficiency results from optimizing integer
	ALU operations (Clipper), rather than from optimizing the architecture
	to execute load, store and branch operations (MIPS).

-- 

Craig Hansen			|	 "Evahthun' tastes
MIPS Computer Systems		|	 bettah when it
...decwrl!mips!hansen		|	 sits on a RISC"