Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!watmath!clyde!caip!topaz!rutgers!husc6!panda!genrad!decvax!decwrl!labrea!glacier!mips!hansen
From: hansen@mips.UUCP
Newsgroups: net.arch
Subject: Re: Floating point performance & Mr. Mashey's Mythical Mhz
Message-ID: <726@mips.UUCP>
Date: Sun, 19-Oct-86 03:15:53 EDT
Article-I.D.: mips.726
Posted: Sun Oct 19 03:15:53 1986
Date-Received: Tue, 21-Oct-86 21:31:53 EDT
References: <340@euroies.UUCP> <1989@videovax.UUCP> <722@mips.UUCP> <377@garth.UUCP>
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 177

In article <377@garth.UUCP> kissell@garth.UUCP (Kevin Kissell) writes:
>I don't understand how someone of John's sophistication can insist on
>repeating such a clearly fallacious argument. The statement "cycle time
>is likely to be a property of the technology" is simply untrue, as I have
>pointed out in previous postings. Cycle time is the product of gate delays
>(a property of technology) and the number of sequential gates between latches
>(a property of architecture). For example, let us consider two machines
>that are familiar to John and myself and yet of interest to the newsgroup:
>the MIPS R2000 and the Fairchild Clipper. An 8 Mhz R2000 has a cycle time
>of 125ns. A 33Mhz Clipper has a cycle time of 30ns. Yet both are built
>with essentially the same 2-micron CMOS technology. I somehow doubt that
>Fairchild's CMOS transistors switch four times faster than that of whoever
>is secretly building R2000s this week. The difference is architectural.

The statement "cycle time is likely to be a property of the technology"
is clearly a simplification that is useful for making relatively crude
comparisons between widely varying machine designs. Cycle time, while a
crude measure, has the advantage that it is clearly observable and
well-documented.
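
To make the quantities in the quoted text concrete, here is a minimal
sketch of the two relations being argued over: cycle time as the
reciprocal of clock rate, and cycle time as gate delay times the number
of sequential gates between latches. The function names and the gate
figures (2.5 ns, 12 gates) are purely illustrative assumptions; neither
posting gives gate-level numbers for either part.

```python
def cycle_time_ns(clock_mhz):
    """Cycle time in nanoseconds for a clock rate given in MHz."""
    return 1000.0 / clock_mhz


def cycle_time_from_gates(gate_delay_ns, gates_between_latches):
    """The quoted model: gate delay times sequential gates between latches."""
    return gate_delay_ns * gates_between_latches


# Clock rates taken from the quoted text.
r2000_8mhz_ns = cycle_time_ns(8.0)       # 125 ns
clipper_33mhz_ns = cycle_time_ns(33.0)   # a little over 30 ns

# Hypothetical gate figures, chosen only to show how the product works.
example_ns = cycle_time_from_gates(gate_delay_ns=2.5, gates_between_latches=12)

print(r2000_8mhz_ns, round(clipper_33mhz_ns, 1), example_ns)
```

Under this model the same cycle time can come from faster gates or from
fewer gates per pipeline stage, which is exactly why the two factors are
hard to separate from published clock rates alone.
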

In practice, the number of sequential gates between latches is also
generally a property of the technology, given that designers are
attempting to optimize their own design. It is counterproductive to
over-pipeline a design, as pipe registers themselves add delay and
complexity. Let me emphasize, however, that I do not intend to assert
that the Fairchild design is over-pipelined.

Now let us address the general issue of a comparison of the technology
of the two machines discussed above (two machines that were clearly
chosen entirely at random). It is indeed safe to assume that an 8 MHz
R2000 has a cycle time of 125 ns. However, 8 MHz is not the maximum
clock rate that the silicon will support - that figure is 16.67 MHz, or
a cycle time of 60 ns (worst case over commercial temperatures).

This 16.67 MHz R2000 part is built in a 2-micron CMOS technology, and
Fairchild's part is built in a process that is also described as a
2-micron CMOS technology. However, the phrase "2-micron CMOS technology"
is actually very vague. The available public literature from both
companies is not sufficient to compare these technologies
point-by-point, but I fully expect that Fairchild has pushed harder on
effective transistor gate length and oxide thickness to reach 33 MHz
than MIPS has yet done to reach 16.67 MHz. A difference in comparable
gate speed of a factor of two is actually entirely plausible, though we
believe the actual ratio is more on the order of 1.5.

We have been getting our process technology from the same suppliers week
after week. By using a slightly less aggressive technology, we are able
to get reliable, multiple-sourced processing.

>As I understand it, the R2000 was designed to take advantage of delayed
>load/branch techniques, and to execute instructions in a small number of
>clocks, which in fact go hand-in-hand. A load or branch can take as little
>as two clocks.
>But the addition of two numbers cannot take less than one
>clock, and so the ALU has a leisurely 125ns to do something that it could
>in principle have done more quickly, had it been more heavily pipelined.

I have to disagree on several of the points claimed here. The R2000
design will execute load and branch instructions at a rate of one
instruction per cycle (a 60 ns cycle), and takes one 60 ns cycle to
perform an integer ALU operation. In fact, the R2000 will execute ALL
instructions in a single cycle, which substantially simplified the
design. It is, of course, entirely untrue that the addition of two
numbers cannot take less than one clock, but this is not the heart of
the matter: the integer ALU is not the critical path in the R2000
design.

>The Clipper was designed from fairly well-established supercomputer and
>mainframe techniques. The cycle time is the time required to do the smallest
>amount of useful work - an integer ALU operation at 30ns. Other instructions
>must then of course be multiples of that basic unit. Assuming cache hits,
>a load takes 4/6 clocks (120/180ns vs 250ns for the R2000) and a branch takes
>9 (270ns vs. 250ns for the R2000).

Correcting the numbers above, we have 120/180 ns (Clipper) vs. 60 ns
(R2000) for a load, and 270 ns (Clipper) vs. 60 ns (R2000) for a branch.

>It should be noted that both machines allow for the overlapped execution
>of instructions, but in different ways. The R2000 overlaps register
>operations with loads and branches using delay slots. The Clipper
>overlaps loads but not branches, using resource scoreboarding instead
>of delay slots. This means that the R2000 can branch more efficiently
>(assuming the assembler can fill the delay slot), but the Clipper can
>have more instructions executing concurrently than the R2000 (4 vs 2)
>in in-line code.

Resource scoreboarding is no more effective at using load delay slots
(which are delays inherent in the computation) than static scheduling.
Since instructions are issued in the order in which they are presented
in a scoreboard controller, an operation that depends on the value of a
pending load instruction must wait for the load to complete on either
machine. The number of delay cycles is, however, an important factor in
determining performance. It is hardly advantageous to have 4-cycle (or
is it 6-cycle?) load instructions, no matter how slickly this is
portrayed as a feature with the phrase "can have more instructions
executing concurrently." The R2000 can fill the delay slot with a useful
instruction (which can even be an additional load instruction) over 70%
of the time. With what frequency can Clipper compilers find three
instructions, none of which can be a load, to fill the three load delay
slots on a Clipper?

>Draw your own conclusions about "architectural efficiency".

The Clipper designers claim 5 MIPS performance at 33 MHz, while the
R2000 performs at 10 MIPS at 16.67 MHz. The Fairchild technology is as
much as twice as aggressive as the R2000 technology, but the Clipper
only achieves half the performance. My conclusion is that the R2000 is
two to four times as "efficient" an architecture. For Clipper to reach
the same performance in the same technology, using their current
architecture, they need 66 MHz parts, with an input clock rate well
above the broadcast FM radio band.

>>Machine        Mhz    KWhet   KWhet/Mhz
>>80287          8      300     40
>>32332-32081    15     728     50      (these from Ray Curry,
>>32332-32381    15     1200    80      in <3833@nsc.UUCP>) (projected)
>>32332-32310    15     1600    100*    ""      ""      (projected)
>>Clipper?       33     1200?   40      guess? anybody know better #?
>>68881          12.5   755     60      (from discussion)
>>68881          20     1240    60      claimed by Moto, in SUN3-260
>>SUN FPA        16.6   1700    100*    DP (from Hough) (in SUN3-160)
>>MIPS R2360     8      1160    140*    DP (interim, with restrictions)
>>MIPS R2010     8      4500    560     DP (simulated)
>
>John's guess for the Clipper is off by over a factor of two. The Clipper
>FORTRAN compiler was brought up only recently.
>In its present sane but
>unoptimizing state, I obtained the following result on an Interpro 32C
>running CLIX System V.3 at 33 Mhz (1 wait state), using a prototype Green
>Hills Clipper FORTRAN compiler with Fairchild math libraries:
>
>               Mhz    Kwhet   Kwhet/Mhz
>Clipper        33     2920    Who cares? Kwhet/Kg and Kwhet/cm2 are of
>                              more practical consequence.
>
>Kevin D. Kissell
>Fairchild Advanced Processor Division

Clipper          33     2920    90 = Kwhet/MHz

I'd like to thank Kevin for providing this performance data and point
out that this ratio is a respectable accomplishment on Fairchild's
part - this number is comparable to the values obtained by using
multiple-chip FP processors built with Weitek arithmetic units and
interfaced to microcoded processors. While the FP arithmetic operations
take longer in the Clipper than in Weitek parts (which are built in an
unmistakably slower technology), the Clipper reduces communications
overhead, so the overall performance comes out comparably well.

Let me make clear why Kwhet/MHz or MIPS/MHz ratios are useful: they
provide some insight into where the emphasis was placed in the design,
and where future derivative designs can reach. It's my view that Kevin's
remarks confirm that the Clipper design was intended from the start to
be a machine with a low MIPS/MHz ratio, with the clock rate based on the
lowest conceivable executable unit. It should also be clear what level
of architectural efficiency results from optimizing integer ALU
operations (Clipper), rather than from optimizing the architecture to
execute load, store and branch operations (MIPS).

-- 
Craig Hansen                    |        "Evahthun' tastes
MIPS Computer Systems           |        bettah when it
...decwrl!mips!hansen           |        sits on a RISC"
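
The per-MHz ratios used in this article can be checked with a few lines
of arithmetic. The sketch below uses only figures stated in the article
(5 MIPS at 33 MHz and 2920 Kwhet for the Clipper, 10 MIPS at 16.67 MHz
for the R2000); the helper name `per_mhz` and the variable names are
illustrative assumptions, not anything defined by either vendor.

```python
def per_mhz(rate, clock_mhz):
    """Normalize a performance figure (MIPS or Kwhet) by clock rate in MHz."""
    return rate / clock_mhz


# Integer performance figures quoted in the article.
clipper_mips_per_mhz = per_mhz(5.0, 33.0)
r2000_mips_per_mhz = per_mhz(10.0, 16.67)

# Ratio of the two normalized figures (roughly a factor of four).
efficiency_ratio = r2000_mips_per_mhz / clipper_mips_per_mhz

# Clock rate the Clipper would need to reach 10 MIPS at its current MIPS/MHz.
clipper_clock_for_10_mips = 10.0 / clipper_mips_per_mhz

# Whetstone figure reported in the quoted text, normalized the same way.
clipper_kwhet_per_mhz = per_mhz(2920.0, 33.0)

print(round(efficiency_ratio, 2))
print(round(clipper_clock_for_10_mips, 1))
print(round(clipper_kwhet_per_mhz, 1))
```

The raw MIPS figures differ by a factor of two and the per-MHz figures
by about a factor of four, which is one way to read the "two to four
times" range above, depending on how much of the clock-rate gap is
credited to process technology.
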