Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rochester!crowl
From: crowl@rochester.ARPA (Lawrence Crowl)
Newsgroups: net.arch
Subject: Re: Floating point performance & Mr. Mashey's Mythical Mhz
Message-ID: <22097@rochester.ARPA>
Date: Mon, 3-Nov-86 17:50:59 EST
Article-I.D.: rocheste.22097
Posted: Mon Nov  3 17:50:59 1986
Date-Received: Tue, 4-Nov-86 02:41:33 EST
References: <340@euroies.UUCP> <1989@videovax.UUCP>
Reply-To: crowl@rochester.UUCP (Lawrence Crowl)
Organization: U of Rochester, CS Dept, Rochester, NY
Lines: 125

>>> mash@mips.UUCP (John Mashey)
)) crowl@rochtest.UUCP (Lawrence Crowl)
> bcase@amdcad.UUCP (Brian Case)
] mash@mips.UUCP (John Mashey)
crowl@rochtest.UUCP (Lawrence Crowl)

>>> ... MWhets/Mhz, etc, as way to factor out transient technology...

))Perhaps what we are missing is that for a given level of technology, a longer
))clock cycle allows us to have a larger depth of combinational circuitry.  That
))is, we can have each clock work through more gates.  So, a 4 MHz clock which
))governs propagation through a combinational circuit 4 gates deep will do
))roughly the same work as a 1 MHz clock governing propagation through a
))combinational circuit 16 gates deep.  Perhaps a better measure is the depth of
))gates required to implement a FLOP, (or an instruction, or a window, etc.).

]Can you suggest some numbers for different machines? One of the reasons
]I proposed a (simplistic) measure is the absolute difficulty of finding
]such things out.

No, I cannot suggest numbers.  I suspect they would be difficult to obtain.
Maybe I should think more next time.

>Yes, but if the 4 Mhz/4 gates implementation can support pipelining and the
>pipeline can be kept full (one of the major goals of RISC), then it will do
>4 times the work at 4 times the clock speed; in other words the FLOPS/MHz or
>MIPS/MHz or whatever/MHz will be the same!  Thus, I still think this isn't
>such a bad metric to use for comparison.  If pipelining can't be implemented
>or the pipeline can't be kept full for a reasonable portion of the time,
>then the FLOPS/MHz will indeed go down, making FLOPS/MHz a misleading indicator.

One of us is confused here, and I do not know which.  Assume an instruction
takes a constant 16 gates of combinational depth.  The 4 MHz, 4 gate machine
will need four clock cycles per instruction while the 1 MHz, 16 gate machine
will need one.  Unpipelined, both machines execute 1 MIPS, but they differ by
a factor of 4 in MHz/MIPS.  If we pipeline the 4 MHz, 4 gate machine into four
stages, the MHz/MIPS will be the same but the performance will differ by a
factor of 4.
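
The arithmetic can be checked with a short Python sketch.  The function name
and the idealized assumptions (every instruction needs exactly 16 gates of
depth, and a pipelined machine never stalls) are illustrative only and do not
describe any real machine.

```python
from fractions import Fraction

# Assumed constant combinational depth per instruction, as in the text.
GATE_DEPTH_PER_INSTRUCTION = 16


def throughput(clock_mhz, gates_per_cycle, pipelined):
    """Return (MIPS, MHz per MIPS) for an idealized machine."""
    stages = Fraction(GATE_DEPTH_PER_INSTRUCTION, gates_per_cycle)
    # Unpipelined: one instruction completes every `stages` cycles.
    # Ideally pipelined (no stalls): one instruction completes every cycle.
    cycles_per_instruction = 1 if pipelined else stages
    mips = Fraction(clock_mhz) / cycles_per_instruction
    return mips, Fraction(clock_mhz) / mips


slow_clock = throughput(1, 16, pipelined=False)      # 1 MHz, 16 gates
fast_unpipelined = throughput(4, 4, pipelined=False)  # 4 MHz, 4 gates
fast_pipelined = throughput(4, 4, pipelined=True)     # 4 MHz, 4 gates, piped

for name, (mips, mhz_per_mips) in [
    ("1 MHz / 16 gates", slow_clock),
    ("4 MHz / 4 gates, unpipelined", fast_unpipelined),
    ("4 MHz / 4 gates, pipelined", fast_pipelined),
]:
    print(f"{name}: {mips} MIPS, {mhz_per_mips} MHz/MIPS")
```

Under these assumptions the first two machines have equal MIPS but unequal
MHz/MIPS, while the first and third have equal MHz/MIPS but unequal MIPS.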

))The very fast clock, heavily pipelined machines like the Cray and Clipper
))follow the first approach, while the slower clock, less pipelined machines
))like the Berkeley RISC and MIPS follow the second approach.  Which is better is

>Now wait a minute.  I don't think anyone at Berkeley, Stanford, or MIPS Co.
>will agree with this statement.  The clock speeds may vary among the machines
>you mention, but that is basically a consequence of implementation technology.
>I think everyone is trying to make pipestages as short as possible so that
>future implementations will be able to exploit future technology to the
>fullest extent.

There are at least two approaches, exemplified by the following two examples.
The first has a clock controlling progress through three stages from the
register bank to the ALU, through the ALU, and back to the register bank.  
The second approach is to do all this in one stage.  The first approach has
the potential to pipe while the second has a lower clock rate.  In both cases
faster clock rates allow faster implementations.  Which machines take which
approach?

))probably dependent upon the technology used to implement the architecture and
))the desired speed.  For instance, if we want a very fast vector processor, we
))should probably choose the fast clock, more pipelined architecture.  If we
))want a better price/performance ratio, we should probably choose the slow
))clock, less pipelined architecture.

>I certainly agree that if a very fast vector processor is required, the highest
>clock speed possible with the most pipelining that makes sense should be
>chosen.  But why should we choose a different approach for the better price/
>performance ratio?  Unless you are trying only to decrease price (which is
>not the same as increasing price/performance), one should still aim for the
>highest possible clock speed and pipelining.  If the price/performance is
>right, I don't care if my add takes one cycle at 1 MHz or 4 at 4Mhz.  In
>addition, for little extra cost (I claim but can't unconditionally prove),
>the 4 at 4 Mhz version will in some cases give me the option of 4 times the
>throughput.  I do acknowledge that I am starting to talk about a machine
>for which FLOPS/MHz may not be a good comparison metric.

Higher clock rates generally imply higher quality parts, more EMI shielding,
etc., which implies a higher cost.  You do not expect a 3000 RPM engine to
cost the same as an 8000 RPM engine, do you?  In addition, exploiting pipeline
potential generally costs significant development effort and gates to control
the piping.  Now, adding some pipelining to a simple scheme is probably cost
effective, but adding as much as is possible is not.  We must find a balance.

))BOLD UNSUPPORTED CLAIM: The "best" architecture is technology dependent.  The
))quality of an architecture is dependent on the technology used to implement
))it, and no architecture is "best" under more than a limited range of
))technologies.  For instance, under technologies in which the bandwidth to
))memory is most limited, stack architectures (Burroughs, Lilith) will be
))"better".  Under technologies where the ability to process instructions is
))most limited, the wide register to register architectures will be "better".

>I agree that technology influences (or maybe "should influence") architecture.
>But I don't think limited memory bandwidth indicates a stack architecture,
>rather, I would say a stack architecture is contraindicated!  If memory
>bandwidth is a limiting factor on performance, then many registers are needed!
>Optimizations which reduce memory bandwidth requirements are those that keep
>computed results in registers for later re-use; such optimizations are
>difficult, at best, to realize for a stack architecture.  

Stacks and registers are not incompatible.  It is easy to imagine a machine
which did pushes and pops between the stack and a register bank.  If register
to register architectures are allowed to store temporaries and local variables
in registers, the stack architecture should be allowed to also.  We should
separate the notion of registers as a means of evaluating expressions from
the notion of registers as a storage medium.
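
To make the idea concrete, here is a rough Python sketch of an operand stack
whose top entries are held in a small register bank and spilled to memory only
when the bank overflows.  The class, the spill policy, and the access counting
are hypothetical illustrations, not a description of the Burroughs or Lilith
hardware.  The count covers operand stack traffic only, not instruction
fetches or variable loads.

```python
class RegisterBackedStack:
    """Operand stack whose top entries live in a small register bank."""

    def __init__(self, num_registers):
        self.num_registers = num_registers
        self.registers = []       # top of stack, most recent entry last
        self.memory = []          # spilled entries, deeper in the stack
        self.memory_accesses = 0  # spills plus refills

    def push(self, value):
        if len(self.registers) == self.num_registers:
            # Register bank is full: spill the deepest register to memory.
            self.memory.append(self.registers.pop(0))
            self.memory_accesses += 1
        self.registers.append(value)

    def pop(self):
        if not self.registers:
            # Register bank is empty: refill one entry from memory.
            self.registers.append(self.memory.pop())
            self.memory_accesses += 1
        return self.registers.pop()


def evaluate(postfix, num_registers):
    """Evaluate a postfix program; return (result, memory accesses)."""
    stack = RegisterBackedStack(num_registers)
    for token in postfix:
        if token == "+":
            right, left = stack.pop(), stack.pop()
            stack.push(left + right)
        elif token == "*":
            right, left = stack.pop(), stack.pop()
            stack.push(left * right)
        else:
            stack.push(token)
    return stack.pop(), stack.memory_accesses


# (1 + 2) * (3 + 4) in postfix form.
program = [1, 2, "+", 3, 4, "+", "*"]
print(evaluate(program, num_registers=4))
print(evaluate(program, num_registers=1))
```

With a bank at least as deep as the expression, the evaluation generates no
operand traffic to memory; with a single register, every nested temporary
is spilled and refilled.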

>When you say "the ability to process instructions is most limited" I guess
>that you mean "the ability to fetch instructions is most limited" (because
>any processor whose ability to actually process its own instructions is most
>limited is probably not worth discussing).  In this case, I would think that
>shorter instructions in which some part of operand addressing is implicit
>(e.g. instructions for a stack machine) would be indicated; "wide register to
>register" instructions would simply make matters worse.  Probably the best
>thing to do is design the machine right the first time, i.e. give it enough
>instruction bandwidth.

"The ability to fetch instructions" is precisely what I did NOT mean.  You
seem to have effectively argued for a stack architecture when bandwidth to
memory is limited.  After all, instructions are in memory.  What I meant
by "the ability to process instructions" is how quickly you can deal with
an instruction once it is in the CPU (relative to getting it into the CPU
in the first place).
-- 
  Lawrence Crowl		716-275-5766	University of Rochester
			crowl@rochester.arpa	Computer Science Department
 ...!{allegra,decvax,seismo}!rochester!crowl	Rochester, New York,  14627