Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rutgers!sri-spam!sri-unix!hplabs!amdcad!bcase
From: bcase@amdcad.UUCP (Brian Case)
Newsgroups: net.arch
Subject: Re: Floating point performance & Mr. Mashey's Mythical Mhz
Message-ID: <13553@amdcad.UUCP>
Date: Tue, 28-Oct-86 13:44:33 EST
Article-I.D.: amdcad.13553
Posted: Tue Oct 28 13:44:33 1986
Date-Received: Tue, 28-Oct-86 21:38:22 EST
References: <340@euroies.UUCP> <1989@videovax.UUCP>
Reply-To: bcase@amdcad.UUCP (Brian Case)
Organization: Advanced Micro Devices, Sunnyvale, California
Lines: 70

>Perhaps what we are missing is that for a given level of technology, a longer
>clock cycle allows us to have a larger depth of combinational circuitry. That
>is, we can have each clock work through more gates. So, a 4 MHz clock which
>governs propagation through a combinational circuit 4 gates deep will do
>roughly the same work as a 1 MHz clock governing propagation through a
>combinational circuit 16 gates deep. Perhaps a better measure is the depth of
>gates required to implement a FLOP, (or an instruction, or a window, etc.).

Yes, but if the 4 MHz/4 gates implementation can support pipelining and
the pipeline can be kept full (one of the major goals of RISC), then it
will do 4 times the work at 4 times the clock speed; in other words, the
FLOPS/MHz or MIPS/MHz or whatever/MHz will be the same! Thus, I still
think this isn't such a bad metric to use for comparison. If pipelining
can't be implemented or the pipeline can't be kept full for a reasonable
portion of the time, then the FLOPS/MHz will indeed go down, making
FLOPS/MHz a misleading indicator.

>The very fast clock, heavily pipelined machines like the Cray and Clipper
>follow the first approach, while the slower clock, less pipelined machines
>like the Berkeley RISC and MIPS follow the second approach. Which is better is

Now wait a minute. I don't think anyone at Berkeley, Stanford, or MIPS Co.
will agree with this statement.
The clock speeds may vary among the machines you mention, but that is
basically a consequence of implementation technology. I think everyone is
trying to make pipestages as short as possible so that future
implementations will be able to exploit future technology to the fullest
extent.

>probably dependent upon the technology used to implement the architecture and
>the desired speed. For instance, if we want a very fast vector processor, we
>should probably choose the fast clock, more pipelined architecture. If we want
>a better price/performance ratio, we should probably choose the slow clock,
>less pipelined architecture.

I certainly agree that if a very fast vector processor is required, the
highest clock speed possible with the most pipelining that makes sense
should be chosen. But why should we choose a different approach for the
better price/performance ratio? Unless you are trying only to decrease
price (which is not the same as improving price/performance), one should
still aim for the highest possible clock speed and pipelining. If the
price/performance is right, I don't care if my add takes one cycle at
1 MHz or 4 cycles at 4 MHz. In addition, for little extra cost (I claim
but can't unconditionally prove), the 4 at 4 MHz version will in some
cases give me the option of 4 times the throughput. I do acknowledge that
I am starting to talk about a machine for which FLOPS/MHz may not be a
good comparison metric.

>BOLD UNSUPPORTED CLAIM: The "best" architecture is technology dependent. The
>quality of an architecture is dependent on the technology used to implement it,
>and no architecture is "best" under more than a limited range of technologies.
>For instance, under technologies in which the bandwidth to memory is most
>limited, stack architectures (Burroughs, Lilith) will be "better". Under
>technologies where the ability to process instructions is most limited, the
>wide register to register architectures will be "better".

I agree that technology influences (or maybe "should influence")
architecture. But I don't think limited memory bandwidth indicates a stack
architecture; rather, I would say a stack architecture is contraindicated!
If memory bandwidth is a limiting factor on performance, then many
registers are needed! Optimizations which reduce memory bandwidth
requirements are those that keep computed results in registers for later
re-use; such optimizations are difficult, at best, to realize for a stack
architecture.

When you say "the ability to process instructions is most limited" I guess
that you mean "the ability to fetch instructions is most limited" (because
any processor whose ability to actually process its own instructions is
most limited is probably not worth discussing). In this case, I would
think that shorter instructions in which some part of operand addressing
is implicit (e.g. instructions for a stack machine) would be indicated;
"wide register to register" instructions would simply make matters worse.
Probably the best thing to do is design the machine right the first time,
i.e. give it enough instruction bandwidth.

I fear that this posting reads like a flame; it is not intended to be a
flame.
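
A minimal sketch of the arithmetic behind the "4 MHz/4 gates versus
1 MHz/16 gates" comparison earlier in this article, for anyone who wants
to play with the numbers. It is a crude model and not a description of
any real machine. The function names and the `fill` parameter are
hypothetical, and the linear blend between "pipeline always full" and
"one operation at a time" is an assumption made only for illustration.

```python
def ops_per_cycle(latency_cycles: int, fill: float) -> float:
    """Average operations completed per clock cycle (toy model).

    latency_cycles: cycles one operation needs from start to finish.
    fill: assumed fraction (0.0 to 1.0) of the time the pipeline is
          kept full. At 1.0 a new operation starts every cycle; at 0.0
          each operation waits for the previous one to finish.
    """
    if latency_cycles < 1 or not 0.0 <= fill <= 1.0:
        raise ValueError("need latency_cycles >= 1 and 0.0 <= fill <= 1.0")
    # Assumption: blend linearly between one op per cycle (full pipeline)
    # and one op per latency_cycles cycles (no overlap at all).
    return fill + (1.0 - fill) / latency_cycles


def mops(clock_mhz: float, latency_cycles: int, fill: float) -> float:
    """Millions of operations per second under the toy model."""
    return clock_mhz * ops_per_cycle(latency_cycles, fill)


def mops_per_mhz(clock_mhz: float, latency_cycles: int, fill: float) -> float:
    """The 'whatever/MHz' figure of merit discussed in the article."""
    return mops(clock_mhz, latency_cycles, fill) / clock_mhz


if __name__ == "__main__":
    designs = [
        ("1 MHz, 16 gates/cycle, 1-cycle op", 1.0, 1, 0.0),
        ("4 MHz, 4 gates/cycle, pipeline full", 4.0, 4, 1.0),
        ("4 MHz, 4 gates/cycle, no overlap", 4.0, 4, 0.0),
    ]
    for name, clock, latency, fill in designs:
        print(f"{name}: {mops(clock, latency, fill):.2f} Mops, "
              f"{mops_per_mhz(clock, latency, fill):.2f} Mops/MHz")
```

With the numbers from the quoted example, the sketch gives 1.0 per MHz
for both the 1 MHz design and the 4 MHz design with a full pipeline (the
latter at 4 times the absolute throughput), and 0.25 per MHz when the
4 MHz pipeline never overlaps operations. Intermediate values of `fill`
depend entirely on the assumed blend and should not be read as
predictions.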