Path: utzoo!mnetor!uunet!husc6!bbn!rochester!cornell!batcomputer!pyramid!prls!mips!earl
From: earl@mips.COM (Earl Killian)
Newsgroups: comp.arch
Subject: Re: RPM-40 [really forwarding]
Message-ID: <1800@gumby.mips.COM>
Date: 8 Mar 88 00:01:55 GMT
References: <9758@steinmetz.steinmetz.UUCP> <9799@steinmetz.steinmetz.UUCP>
Lines: 79
In-reply-to: oconnor@sungoddess.steinmetz's message of 5 Mar 88 03:28:02 GMT

In article <9799@steinmetz.steinmetz.UUCP> oconnor@sungoddess.steinmetz
(Dennis M. O'Connor) writes:

   So far I agree, but there's more ...

   How often forwarding is needed is only PART of the story. The other
   part is how often you could "fill" the delay from forwarding.

   ] Here are some numbers from the Am29000 simulator running a small "nroff"
   ] instructions executed: 89435
   ] instructions requiring alu forwarding: 41420 (46%)
   ] instructions forwarding from load buffer: 13669 (15%)

   But if I can fill 90%, say, of the one-cycle latency delays with a
   reorganizer, then I only incur a penalty of 5%, which means, for
   RPM40, that a bypass path is justified only if it incurs a penalty
   of 1.2 nanoseconds or less. If I can fill 80% of the latencies,
   then a bypass that inflicts a penalty on the basic cycle time of
   2.5 nanoseconds or less is a win.

   SO not only do we need data like you've provided, we need to know
   how often we can reorganize the delay away. Unfortunately, I don't
   really have good data for either of these factors.

   ] I haven't seen published studies on dynamic forwarding frequencies --
   ] does anyone know of such papers?

   I, too, would be VERY interested in any such works.

In article <475@imagine.PAWL.RPI.EDU> jesup@pawl23.pawl.rpi.edu
(Randell E. Jesup) writes:

   1) Slows down critical path. Any finely tuned RISC CPU will most
   probably have its cycle time determined by the latency through the
   ALU. Using a loopback of ALU results might result (depending on
   layout, tech, etc) in up to a 20% slowdown in the ALU, plus
   increase the chip area and layout problems. This doesn't mean a
   loopback is a loss necessarily, but that it does have a measurable
   cost which must be weighed against the benefits.

   2) In combination with (1) above, I'm not sure that having a
   one-cycle delay in ALU results causes any large loss. A good
   reorganizer can fill those latencies, or move the ALU op forward
   into, for example, a load delay. In high-speed (> 15 MHz) RISCs
   (and maybe slower ones as well), load delays are usually the
   determining factor, or a large part of it.

   What studies do you have that compare RISCs with 1-cycle ALU delays
   and 0-cycle? I'd like to see anything you can drag up.

To answer these questions I reran a local analysis program on the
results of 13 program runs.

First a note on terminology: I call the latency of an op the time it
takes until you can reference the result. The delay is the latency
minus the time to issue the instruction itself (usually latency - 1).

The program defaults to

   -alu_rate 1 -alu_latency 1
   -shift_rate 1 -shift_latency 1
   -load_rate 1 -load_latency 2

i.e. a model where you can use the result of an alu/shift instruction
in the next instruction and the result of a load one after that.
E.g. the MIPSco R2000. I instead specified

   -alu_rate 1 -alu_latency 2
   -shift_rate 1 -shift_latency 2
   -load_rate 1 -load_latency 3
   -reorganize

which simulates no bypassing (i.e. increase latencies by 1, but leave
rates alone). The -reorganize says to reorganize to the new
constraints before analysis.

I then took the ratio of the new cycle count to the old cycle count
for each run and averaged:

   13 samples
   minimum         1.024   (-1.7o)
   harmonic mean   1.207   (-0.091o)
   geometric mean  1.212   (-0.045o)
   mean            1.217   o=0.1150, cov=0.09449
   median          1.228   (+0.096o)
   maximum         1.408   (+1.7o)

I.e. the lack of bypassing is equivalent to a cycle time increase of
about 20%. I.e. 5ns @ 40MHz.

The effect was as low as 2.4% and as high as 41%, which simply proves
you can prove anything you like by looking at single data points.

Anyway, I hope the hard data helps the discussion.
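
For anyone who wants to redo the break-even arithmetic used in this thread, here is a rough sketch in Python. It is not the analysis program mentioned above. The function names are made up for illustration, and it assumes a baseline of one instruction per cycle, one lost cycle per unfilled delay, and a 25 ns cycle (1 / 40 MHz).

```python
# Rough sketch of the break-even arithmetic discussed in the post.
# Assumptions (not from the original analysis program): baseline of one
# instruction per cycle, each unfilled forwarding delay costs exactly one
# cycle, and all function names here are hypothetical.

def cycle_time_ns(clock_mhz):
    """Cycle time in nanoseconds for a clock given in MHz."""
    return 1000.0 / clock_mhz

def unfilled_penalty(forward_fraction, fill_fraction):
    """Fractional increase in cycles when only part of the delays are filled.

    forward_fraction: share of instructions that need a forwarded result.
    fill_fraction: share of those delays a reorganizer manages to fill.
    """
    return forward_fraction * (1.0 - fill_fraction)

def break_even_ns(cycle_ratio, clock_mhz):
    """Cycle-time stretch (ns) at which a bypass path stops paying off.

    cycle_ratio: cycles without bypassing divided by cycles with bypassing.
    """
    return (cycle_ratio - 1.0) * cycle_time_ns(clock_mhz)

if __name__ == "__main__":
    clock_mhz = 40.0
    # Am29000 simulator figures quoted in the thread.
    alu_forward = 41420 / 89435

    for fill in (0.90, 0.80):
        ratio = 1.0 + unfilled_penalty(alu_forward, fill)
        print("fill %.0f%%: penalty %.1f%%, break-even %.2f ns"
              % (fill * 100, (ratio - 1.0) * 100,
                 break_even_ns(ratio, clock_mhz)))

    # Measured cycle-count ratios from the 13-run summary in the post.
    for label, ratio in (("minimum", 1.024), ("mean", 1.217),
                         ("maximum", 1.408)):
        print("%s ratio %.3f: break-even %.2f ns"
              % (label, ratio, break_even_ns(ratio, clock_mhz)))
```

With the quoted Am29000 figures this gives roughly 1.2 ns at a 90% fill rate and 2.3 ns at 80% (the 2.5 ns in the quoted text comes from rounding the penalty up to 10%). The measured mean ratio of 1.217 works out to a little over 5 ns at 40 MHz, in line with the rounded figure in the post.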