Path: utzoo!mnetor!uunet!husc6!bbn!rochester!cornell!batcomputer!pyramid!prls!mips!earl
From: earl@mips.COM (Earl Killian)
Newsgroups: comp.arch
Subject: Re: RPM-40 [really forwarding]
Message-ID: <1800@gumby.mips.COM>
Date: 8 Mar 88 00:01:55 GMT
References: <9758@steinmetz.steinmetz.UUCP> <9799@steinmetz.steinmetz.UUCP>
Lines: 79
In-reply-to: oconnor@sungoddess.steinmetz's message of 5 Mar 88 03:28:02 GMT


In article <9799@steinmetz.steinmetz.UUCP> oconnor@sungoddess.steinmetz (Dennis M. O'Connor) writes:

   So far I agree, but there's more ...
   How often forwarding is needed is only PART of the story. The other
   part is how often you could "fill" the delay from forwarding.

   ] Here are some numbers from the Am29000 simulator running a small "nroff"
   ] instructions executed:				89435
   ] instructions requiring alu forwarding:		41420 (46%)
   ] instructions forwarding from load buffer:	13669 (15%)

   But if I can fill 90%, say, of the one-cycle latency delays with
   a reorganizer, then I only incur a penalty of 5%, which means,
   for RPM40, that a bypass path is justified only if it incurs
   a penalty of 1.2 nanoseconds or less. If I can fill 80% of
   the latencies, then a bypass that inflicts a penalty on the
   basic cycle time of 2.5 nanoseconds or less is a win. So
   not only do we need data like you've provided, we need to
   know how often we can reorganize the delay away. Unfortunately,
   I don't really have good data for either of these factors.

   ] I haven't seen published studies on dynamic forwarding frequencies --
   ] does anyone know of such papers?

   I, too, would be VERY interested in any such works.
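
The break-even arithmetic in the quoted paragraph can be sketched as
below.  This is only my reading of it, under two assumptions: just the
46% ALU forwarding figure counts, and every delay the reorganizer fails
to fill costs one full 25ns cycle (40MHz).  The function name and its
parameters are made up for illustration.

```python
# Sketch of the quoted break-even estimate (names are hypothetical).
# Without a bypass, a fraction forward_frac * (1 - fill_frac) of the
# instructions each cost one extra cycle.  A bypass wins if it
# stretches the cycle time by less than that fraction of a cycle.

def break_even_ns(fill_frac, forward_frac=0.46, cycle_ns=25.0):
    """Largest cycle-time penalty (ns) at which a bypass still wins."""
    unfilled = forward_frac * (1.0 - fill_frac)  # extra cycles per instruction
    return unfilled * cycle_ns

for fill in (0.90, 0.80):
    print(f"fill {fill:.0%}: break-even {break_even_ns(fill):.2f} ns")
```

This gives about 1.15ns and 2.3ns; the quoted 1.2ns and 2.5ns come
from rounding 4.6% and 9.2% up to 5% and 10%.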

In article <475@imagine.PAWL.RPI.EDU> jesup@pawl23.pawl.rpi.edu (Randell E. Jesup) writes:

   1) Slows down critical path.  Any finely tuned RISC CPU will most
   probably have its cycle time determined by the latency through the
   ALU.  Using a loopback of ALU results might result (depending on
   layout, tech, etc) in up to a 20% slowdown in the ALU, plus
   increase the chip area and layout problems.  This doesn't mean a
   loopback is a loss necessarily, but that it does have a measurable
   cost which must be weighed against the benefits.

   2) In combination with (1) above, I'm not sure that having a
   one-cycle delay in ALU results causes any large loss.  A good
   reorganizer can fill those latencies, or move the ALU op forward
   into, for example, a load delay.  In high-speed (> 15 MHz) RISCs
   (and maybe slower ones as well), load delays are usually the
   determining factor, or a large part of it.  What studies do you
   have that compare RISCs with 1-cycle ALU delays and 0-cycle?  I'd
   like to see anything you can drag up.

To answer these questions I reran a local analysis program on the
results of 13 program runs.

First a note on terminology: I call the latency of an op the time it
takes until you can reference the result.  The delay is the latency
minus the time to issue the instruction itself (usually latency - 1).

The program defaults to
	-alu_rate 1 -alu_latency 1 -shift_rate 1 -shift_latency 1
	-load_rate 1 -load_latency 2
i.e. a model where you can use the result of an alu/shift instruction
in the next instruction and the result of a load one instruction after
that, as on the MIPSco R2000.  I instead specified
	-alu_rate 1 -alu_latency 2 -shift_rate 1 -shift_latency 2
	-load_rate 1 -load_latency 3 -reorganize
which simulates no bypassing (i.e. increase latencies by 1, but leave
rates alone).  The -reorganize flag says to reorganize the code to the
new constraints before analysis.  For each run I then took the ratio of
the new cycle count to the old count, and summarized the 13 ratios:

13 samples
minimum		    1.024 (-1.7o)
harmonic mean	    1.207 (-0.091o)
geometric mean	    1.212 (-0.045o)
mean		    1.217 o=0.1150, cov=0.09449
median		    1.228 (+0.096o)
maximum		    1.408 (+1.7o)
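
A note on reading the table: the parenthesized figures look like
offsets from the arithmetic mean in units of the standard deviation
("o" standing in for sigma), and "cov" looks like the coefficient of
variation, o/mean.  That reading is an assumption, but the quick check
below reproduces the printed values to within rounding, using only the
numbers in the table.  The variable and function names are made up.

```python
# Consistency check on the summary table (interpretation is assumed).
mean, sigma = 1.217, 0.1150

rows = {
    "minimum": 1.024,
    "harmonic mean": 1.207,
    "geometric mean": 1.212,
    "median": 1.228,
    "maximum": 1.408,
}

def sigmas_from_mean(value):
    """Offset of value from the mean, in standard deviations."""
    return (value - mean) / sigma

offsets = {name: sigmas_from_mean(value) for name, value in rows.items()}
cov = sigma / mean  # coefficient of variation

for name, off in offsets.items():
    print(f"{name:15s} {rows[name]:.3f} ({off:+.2g}o)")
print(f"cov = {cov:.5f}")
```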

On average, the lack of bypassing is equivalent to a cycle time
increase of about 20%, i.e. about 5ns on the 25ns cycle at 40MHz.  The
effect was as low as 2.4% and as high as 41%, which simply proves you
can prove anything you like by looking at single data points.
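
The conversion to nanoseconds, as a small sketch: at 40MHz the cycle is
25ns, so a cycle-count ratio r costs the same as stretching the cycle
by (r - 1) * 25ns.  The function name is made up; the ratios are the
ones in the table above.

```python
# Convert a cycle-count ratio into an equivalent cycle-time stretch.
# (Hypothetical helper, shown only to spell out the arithmetic.)

def equivalent_penalty_ns(ratio, clock_mhz=40.0):
    """Cycle-time increase (ns) that costs as much as the extra cycles."""
    cycle_ns = 1000.0 / clock_mhz   # 25 ns at 40 MHz
    return (ratio - 1.0) * cycle_ns

for label, ratio in [("minimum", 1.024), ("mean", 1.217), ("maximum", 1.408)]:
    print(f"{label:8s} ratio {ratio:.3f} -> {equivalent_penalty_ns(ratio):.1f} ns")
```

So on these runs a bypass path looks worthwhile unless it stretches the
cycle by more than roughly 5ns, compared with the 1.2ns to 2.5ns
estimated in the quoted article.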

Anyway, I hope the hard data helps the discussion.
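
P.S. For anyone who wants to experiment with the latency settings
above, here is a toy sketch of the cycle counting.  It is not the
analysis program used for the numbers in this post: it is a
single-issue, in-order model with rate 1 for everything, it does no
reorganizing, and the instruction format, the function name and the
five-instruction trace are all made up for illustration.

```python
# Toy in-order cycle counter (hypothetical; not the real analysis tool).
# Each instruction is (kind, dest, sources).  An instruction issues in
# the first cycle in which it is next in line and all of its source
# registers are available.

def count_cycles(trace, latency):
    ready = {}   # register -> first cycle in which its value can be used
    cycle = 0    # earliest cycle in which the next instruction may issue
    for kind, dest, sources in trace:
        # Stall until every source operand is available.
        issue = max([cycle] + [ready.get(reg, 0) for reg in sources])
        if dest is not None:
            ready[dest] = issue + latency[kind]
        cycle = issue + 1   # rate 1: at most one instruction per cycle
    return cycle

BYPASS = {"alu": 1, "shift": 1, "load": 2}      # the default model above
NO_BYPASS = {"alu": 2, "shift": 2, "load": 3}   # every latency one longer

# Made-up trace: a load feeding a short chain of dependent operations.
TRACE = [
    ("load", "r1", ["r9"]),
    ("alu", "r2", ["r1", "r8"]),
    ("shift", "r3", ["r2"]),
    ("alu", "r4", ["r7", "r8"]),
    ("alu", "r5", ["r3", "r4"]),
]

old = count_cycles(TRACE, BYPASS)
new = count_cycles(TRACE, NO_BYPASS)
print(f"bypass: {old} cycles, no bypass: {new} cycles, ratio {new / old:.3f}")
```

A trace this short and this dependent exaggerates the effect, and a
reorganizer would recover some of it by moving the independent alu op
into a stall slot.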