Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rochester!crowl
From: crowl@rochester.ARPA (Lawrence Crowl)
Newsgroups: net.arch
Subject: Re: Floating point performance & Mr. Mashey's Mythical Mhz
Message-ID: <22097@rochester.ARPA>
Date: Mon, 3-Nov-86 17:50:59 EST
Article-I.D.: rocheste.22097
Posted: Mon Nov 3 17:50:59 1986
Date-Received: Tue, 4-Nov-86 02:41:33 EST
References: <340@euroies.UUCP> <1989@videovax.UUCP>
Reply-To: crowl@rochester.UUCP (Lawrence Crowl)
Organization: U of Rochester, CS Dept, Rochester, NY
Lines: 125

>>> mash@mips.UUCP (John Mashey)
))  crowl@rochtest.UUCP (Lawrence Crowl)
>   bcase@amdcad.UUCP (Brian Case)
]   mash@mips.UUCP (John Mashey)
    crowl@rochtest.UUCP (Lawrence Crowl)

>>> ... MWhets/MHz, etc, as a way to factor out transient technology...

))Perhaps what we are missing is that for a given level of technology, a longer
))clock cycle allows us to have a larger depth of combinational circuitry.  That
))is, we can have each clock work through more gates.  So, a 4 MHz clock which
))governs propagation through a combinational circuit 4 gates deep will do
))roughly the same work as a 1 MHz clock governing propagation through a
))combinational circuit 16 gates deep.  Perhaps a better measure is the depth of
))gates required to implement a FLOP, (or an instruction, or a window, etc.).

]Can you suggest some numbers for different machines?  One of the reasons
]I proposed a (simplistic) measure is the absolute difficulty of finding
]such things out.

No, I cannot suggest numbers.  I suspect they would be difficult to obtain.
Maybe I should think more next time.

>Yes, but if the 4 MHz/4 gates implementation can support pipelining and the
>pipeline can be kept full (one of the major goals of RISC), then it will do
>4 times the work at 4 times the clock speed; in other words the FLOPS/MHz or
>MIPS/MHz or whatever/MHz will be the same!  Thus, I still think this isn't
>such a bad metric to use for comparison.
>If pipelining can't be implemented
>or the pipeline can't be kept full for a reasonable portion of the time,
>then the FLOPS/MHz will indeed go down, making FLOPS/MHz a misleading indicator.

One of us is confused here, and I do not know which.  Assume an instruction
takes a constant 16 combinational gates.  The 4 MHz, 4 gate machine will
require 4 stages while the 1 MHz, 16 gate machine will require one stage.
Unpipelined, both machines will execute 1 MIPS, but they differ by a factor
of 4 in MHz/MIPS.  If we pipeline the 4 MHz, 4 gate machine into a four stage
pipeline, the MHz/MIPS will be the same but the performance will differ by a
factor of 4.

))The very fast clock, heavily pipelined machines like the Cray and Clipper
))follow the first approach, while the slower clock, less pipelined machines
))like the Berkeley RISC and MIPS follow the second approach.  Which is better is

>Now wait a minute.  I don't think anyone at Berkeley, Stanford, or MIPS Co.
>will agree with this statement.  The clock speeds may vary among the machines
>you mention, but that is basically a consequence of implementation technology.
>I think everyone is trying to make pipestages as short as possible so that
>future implementations will be able to exploit future technology to the
>fullest extent.

There are at least two approaches, exemplified by the following two examples.
The first has a clock controlling progress through three stages: from the
register bank to the ALU, through the ALU, and back to the register bank.  The
second approach is to do all this in one stage.  The first approach has the
potential to pipeline while the second has a lower clock rate.  In both cases
faster clock rates allow faster implementations.  Which machines take which
approach?

))probably dependent upon the technology used to implement the architecture and
))the desired speed.  For instance, if we want a very fast vector processor, we
))should probably choose the fast clock, more pipelined architecture.
))If we
))want a better price/performance ratio, we should probably choose the slow
))clock, less pipelined architecture.

>I certainly agree that if a very fast vector processor is required, the highest
>clock speed possible with the most pipelining that makes sense should be
>chosen.  But why should we choose a different approach for the better
>price/performance ratio?  Unless you are trying only to decrease price (which
>is not the same as increasing price/performance), one should still aim for the
>highest possible clock speed and pipelining.  If the price/performance is
>right, I don't care if my add takes one cycle at 1 MHz or 4 at 4 MHz.  In
>addition, for little extra cost (I claim but can't unconditionally prove),
>the 4 at 4 MHz version will in some cases give me the option of 4 times the
>throughput.  I do acknowledge that I am starting to talk about a machine
>for which FLOPS/MHz may not be a good comparison metric.

Higher clock rates generally imply higher quality parts, more EMI shielding,
etc., which implies a higher cost.  You do not expect a 3000 RPM engine to cost
the same as an 8000 RPM engine, do you?  In addition, exploiting pipeline
potential generally costs significant development effort and gates to control
the pipelining.  Now, adding some pipelining to a simple scheme is probably
cost effective, but adding as much as is possible is not.  We must find a
balance.

))BOLD UNSUPPORTED CLAIM:  The "best" architecture is technology dependent.  The
))quality of an architecture is dependent on the technology used to implement
))it, and no architecture is "best" under more than a limited range of
))technologies.  For instance, under technologies in which the bandwidth to
))memory is most limited, stack architectures (Burroughs, Lilith) will be
))"better".  Under technologies where the ability to process instructions is
))most limited, the wide register to register architectures will be "better".

>I agree that technology influences (or maybe "should influence") architecture.
>But I don't think limited memory bandwidth indicates a stack architecture,
>rather, I would say a stack architecture is contraindicated!  If memory
>bandwidth is a limiting factor on performance, then many registers are needed!
>Optimizations which reduce memory bandwidth requirements are those that keep
>computed results in registers for later re-use; such optimizations are
>difficult, at best, to realize for a stack architecture.

Stacks and registers are not incompatible.  It is easy to imagine a machine
which did pushes and pops between the stack and a register bank.  If register
to register architectures are allowed to store temporaries and local variables
in registers, the stack architecture should be allowed to also.  We should
separate the notion of registers as a means to evaluate expressions from the
notion of registers as a storage medium.

>When you say "the ability to process instructions is most limited" I guess
>that you mean "the ability to fetch instructions is most limited" (because
>any processor whose ability to actually process its own instructions is most
>limited is probably not worth discussing).  In this case, I would think that
>shorter instructions in which some part of operand addressing is implicit
>(e.g. instructions for a stack machine) would be indicated; "wide register to
>register" instructions would simply make matters worse.  Probably the best
>thing to do is design the machine right the first time, i.e. give it enough
>instruction bandwidth.

"The ability to fetch instructions" is precisely what I did NOT mean.  You seem
to have effectively argued for a stack architecture when bandwidth to memory is
limited.  After all, instructions are in memory.  What I meant by "the ability
to process instructions" is this: once you have the instruction in the CPU, how
quickly can you deal with it (relative to getting it into the CPU in the first
place)?
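
The 4 MHz versus 1 MHz arithmetic argued over earlier in this post can be
checked with a small sketch.  It assumes, as the example does, that one
instruction needs a fixed 16 gates of combinational depth, and that a pipeline,
where present, is always kept full.  The function and variable names are
illustrative only and come from none of the posts.

```python
GATES_PER_INSTRUCTION = 16  # assumed fixed combinational depth of one instruction


def mips(clock_mhz, gates_per_clock, pipelined):
    """Millions of instructions per second for an idealized machine.

    Assumes a pipeline, when present, is always kept full.
    """
    stages = GATES_PER_INSTRUCTION // gates_per_clock  # clocks per instruction
    if pipelined:
        return clock_mhz          # one instruction completes every clock
    return clock_mhz / stages     # each instruction occupies every stage in turn


# (description, clock in MHz, gate depth per clock, pipelined?)
machines = [
    ("1 MHz, 16 gates, one stage", 1, 16, False),
    ("4 MHz, 4 gates, unpipelined", 4, 4, False),
    ("4 MHz, 4 gates, pipelined", 4, 4, True),
]

for name, clock, gates, piped in machines:
    rate = mips(clock, gates, piped)
    print(f"{name}: {rate:g} MIPS, {clock / rate:g} MHz/MIPS")
```

Under these assumptions the two unpipelined machines tie at 1 MIPS while
differing by a factor of 4 in MHz/MIPS, and the pipelined machine matches the
1 MHz machine in MHz/MIPS while running 4 times as fast.  A real pipeline
stalls, so the pipelined figure is an upper bound rather than a prediction.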
-- 
  Lawrence Crowl                716-275-5766    University of Rochester
                        crowl@rochester.arpa    Computer Science Department
 ...!{allegra,decvax,seismo}!rochester!crowl    Rochester, New York, 14627