Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rutgers!sri-spam!sri-unix!hplabs!amdcad!bcase
From: bcase@amdcad.UUCP (Brian Case)
Newsgroups: net.arch
Subject: Re: Floating point performance & Mr. Mashey's Mythical Mhz
Message-ID: <13553@amdcad.UUCP>
Date: Tue, 28-Oct-86 13:44:33 EST
Article-I.D.: amdcad.13553
Posted: Tue Oct 28 13:44:33 1986
Date-Received: Tue, 28-Oct-86 21:38:22 EST
References: <340@euroies.UUCP> <1989@videovax.UUCP>
Reply-To: bcase@amdcad.UUCP (Brian Case)
Organization: Advanced Micro Devices, Sunnyvale, California
Lines: 70

>Perhaps what we are missing is that for a given level of technology, a longer
>clock cycle allows us to have a larger depth of combinational circuitry.  That
>is, we can have each clock work through more gates.  So, a 4 MHz clock which
>governs propagation through a combinational circuit 4 gates deep will do
>roughly the same work as a 1 MHz clock governing propagation through a
>combinational circuit 16 gates deep.  Perhaps a better measure is the depth of
>gates required to implement a FLOP, (or an instruction, or a window, etc.).

Yes, but if the 4 MHz/4 gates implementation can support pipelining and the
pipeline can be kept full (one of the major goals of RISC), then it will do
4 times the work at 4 times the clock speed; in other words the FLOPS/MHz or
MIPS/MHz or whatever/MHz will be the same!  Thus, I still think this isn't
such a bad metric to use for comparison.  If pipelining can't be implemented
or the pipeline can't be kept full for a reasonable portion of the time,
then the FLOPS/MHz will indeed go down, making FLOPS/MHz a misleading indicator.
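
As a rough sketch of the arithmetic above (a toy model only: the function
name ops_completed and the idealized no-stall pipeline are assumptions made
for illustration, not a description of any real machine), the following
Python counts the operations finished in a fixed time by a 1 MHz unit with
single-cycle operations and by a 4 MHz unit with 4-cycle operations, the
latter with the pipe always full and with no overlap at all:

```python
def ops_completed(clock_mhz, stages, microseconds, pipelined):
    """Operations finished in the given time under an idealized model.

    Assumption: every operation needs `stages` clock cycles of latency.
    A pipelined unit starts a new operation every cycle; an unpipelined
    unit waits for the previous operation to finish.  No stalls, hazards,
    or memory delays are modeled.
    """
    cycles = int(clock_mhz * microseconds)
    if cycles < stages:
        return 0
    if pipelined:
        # The first result appears after `stages` cycles, then one per cycle.
        return cycles - stages + 1
    return cycles // stages


if __name__ == "__main__":
    t = 1000  # microseconds of run time (arbitrary choice)
    cases = [
        ("1 MHz, 1 cycle per op", 1, ops_completed(1, 1, t, False)),
        ("4 MHz, 4 stages, pipe full", 4, ops_completed(4, 4, t, True)),
        ("4 MHz, 4 stages, no overlap", 4, ops_completed(4, 4, t, False)),
    ]
    for name, mhz, ops in cases:
        print(f"{name}: {ops} ops, {ops / mhz:.1f} ops per MHz")
```

With the pipe full, the 4 MHz unit finishes close to four times as many
operations, so operations per MHz stay nearly the same; with no overlap it
finishes the same number as the 1 MHz unit and operations per MHz drop to a
quarter, which is the case where the metric misleads.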

>The very fast clock, heavily pipelined machines like the Cray and Clipper
>follow the first approach, while the slower clock, less pipelined machines
>like the Berkeley RISC and MIPS follow the second approach.  Which is better is

Now wait a minute.  I don't think anyone at Berkeley, Stanford, or MIPS Co.
will agree with this statement.  The clock speeds may vary among the machines
you mention, but that is basically a consequence of implementation technology.
I think everyone is trying to make pipestages as short as possible so that
future implementations will be able to exploit future technology to the
fullest extent.

>probably dependent upon the technology used to implement the architecture and
>the desired speed.  For instance, if we want a very fast vector processor, we
>should probably choose the fast clock, more pipelined architecture.  If we want
>a better price/performance ratio, we should probably choose the slow clock,
>less pipelined architecture.

I certainly agree that if a very fast vector processor is required, the highest
clock speed possible with the most pipelining that makes sense should be
chosen.  But why should we choose a different approach for the better price/
performance ratio?  Unless you are trying only to decrease price (which is
not the same as improving price/performance), one should still aim for the
highest possible clock speed and pipelining.  If the price/performance is
right, I don't care if my add takes one cycle at 1 MHz or 4 at 4 MHz.  In
addition, for little extra cost (I claim but can't unconditionally prove),
the 4 at 4 MHz version will in some cases give me the option of 4 times the
throughput.  I do acknowledge that I am starting to talk about a machine
for which FLOPS/MHz may not be a good comparison metric.

>BOLD UNSUPPORTED CLAIM: The "best" architecture is technology dependent.  The
>quality of an architecture is dependent on the technology used to implement it,
>and no architecture is "best" under more than a limited range of technologies.
>For instance, under technologies in which the bandwidth to memory is most
>limited, stack architectures (Burroughs, Lilith) will be "better".  Under 
>technologies where the ability to process instructions is most limited, the
>wide register to register architectures will be "better".

I agree that technology influences (or maybe "should influence") architecture.
But I don't think limited memory bandwidth indicates a stack architecture;
rather, I would say a stack architecture is contraindicated!  If memory
bandwidth is a limiting factor on performance, then many registers are needed!
Optimizations which reduce memory bandwidth requirements are those that keep
computed results in registers for later re-use; such optimizations are
difficult, at best, to realize for a stack architecture.  When you say "the
ability to process instructions is most limited" I guess that you mean "the
ability to fetch instructions is most limited" (because any processor whose
ability to actually process its own instructions is most limited is probably
not worth discussing).  In this case, I would think that shorter instructions
in which some part of operand addressing is implicit (e.g. instructions for a
stack machine) would be indicated; "wide register to register" instructions
would simply make matters worse.  Probably the best thing to do is design the
machine right the first time, i.e. give it enough instruction bandwidth.
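
To put rough numbers on both halves of that argument, here is a small
Python sketch that evaluates r = (a + b) * (a - b) on a naive stack machine
and on a register machine.  Everything in it is hypothetical: the
instruction names, the byte sizes (3-byte push/pop carrying an address,
1-byte stack arithmetic, fixed 4-byte register instructions), and the
assumption that the evaluation stack itself is held inside the processor.
The stack code is deliberately naive and uses no stack-shuffling
operations to avoid reloading a and b.

```python
# Hypothetical instruction sizes in bytes, chosen only for illustration.
STACK_BYTES = {"push": 3, "pop": 3, "add": 1, "sub": 1, "mul": 1}
REG_BYTES = 4  # every register-machine instruction, fixed width

# r = (a + b) * (a - b), naive stack code: operands are re-pushed from memory.
stack_code = [
    ("push", "a"), ("push", "b"), ("add",),
    ("push", "a"), ("push", "b"), ("sub",),
    ("mul",), ("pop", "r"),
]

# The same expression on a register machine: a and b are loaded once
# and then re-used from registers.
reg_code = [
    ("load", "r1", "a"), ("load", "r2", "b"),
    ("add", "r3", "r1", "r2"), ("sub", "r4", "r1", "r2"),
    ("mul", "r5", "r3", "r4"), ("store", "r5", "r"),
]


def stack_cost(code):
    """Return (data memory references, instruction bytes) for stack code."""
    data_refs = sum(1 for ins in code if ins[0] in ("push", "pop"))
    instr_bytes = sum(STACK_BYTES[ins[0]] for ins in code)
    return data_refs, instr_bytes


def reg_cost(code):
    """Return (data memory references, instruction bytes) for register code."""
    data_refs = sum(1 for ins in code if ins[0] in ("load", "store"))
    instr_bytes = REG_BYTES * len(code)
    return data_refs, instr_bytes


if __name__ == "__main__":
    print("stack machine   :", stack_cost(stack_code))
    print("register machine:", reg_cost(reg_code))
```

Under these made-up encodings the stack version makes 5 data references in
18 instruction bytes and the register version makes 3 data references in
24 instruction bytes: fewer data references when results stay in registers,
fewer instruction bytes when operand addressing is implicit.  The exact
figures depend entirely on the assumed encodings.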

I fear that this posting reads like a flame; it is not intended to be a flame.