Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!wuarchive!cs.utexas.edu!oakhill!marvin
From: marvin@oakhill.UUCP (Marvin Denman)
Newsgroups: comp.sys.m88k
Subject: Re: Information wanted on m88000 Risc workstations
Message-ID: <2811@yogi.oakhill.UUCP>
Date: 11 Jan 90 00:55:42 GMT
References: <641@s5.Morgan.COM> <25A64468.11498@paris.ics.uci.edu> <648@s5.Morgan.COM> <1879@xyzzy.UUCP> <TOM.90Jan9101628@hcx2.ssd.csd.harris.com>
Reply-To: cs.utexas.edu!oakhill!marvin (Marvin Denman)
Organization: Motorola Inc., Austin, Texas
Lines: 86

In article <TOM.90Jan9101628@hcx2.ssd.csd.harris.com> tom@ssd.csd.harris.com (Tom Horsley) writes:
>>  ...
>>  Discussion by Tom Wood of Data General about the possibility of boosting
>>  Mflops by a factor of 3 or better with improved compiler technology.
>>  ...
>This may be true for single precision, but it is hard to see how you can get
>the pipe full for double precision. Any instruction with a double precision
>source operand requires two (count'em 2) cycles before the 88k will even
>bother looking at the next instruction. Then for double precision float
>instructions there are two cycles required in the first FP1 pipe stage
>(although the one of these FP1 cycles can overlap with the last of the two
>decode cycles, so perhaps this is not so bad).

Two cycles to issue a double precision operation is an artifact of the
88100 implementation.  The penalty is limited to two cycles when an
instruction is initiated and when it terminates, though.  The pipes
generally compress out bubbles, so any stalls at the end of the pipe are
usually hidden unless the pipe is already full for some reason.

>
>>Code Generation Technique      Cycles/iteration      Mflops
>>    Naive code                      19                 2.10
>>    Naive code, 2 unrolls          35/2                2.28
>>    Sophisticated, 4 unrolls       28/4                5.71
>>    Sophisticated, 8 unrolls       48/8                6.67
 
>In your example, even if everything is pipelined, the minimum number of
>instructions that seem to be required just to do the computation is:
 
>instruction   number   cycles
>       addu        2        2   loop overhead
>        bb1        1        1
>        cmp        1        1
>   fadd.ddd        8       16   loop body
>   fmul.ddd        8       16
>       ld.d       16       16
>       st.d        8       16
>-----------------------------
>                           68

>As near as I can tell, this example does not work out as well as the
>original poster implied.  Couple this with the real world fact (known even
>by Cray users with heavy duty vectorizing compilers) that an awful lot of
>real world algorithms have dependencies on previous results. No matter how
>good your compiler is, it cannot pipeline these algorithms, because the next
>thing depends on the last thing.  (Obviously it is worth the trouble to
>pipeline when you can, I am just saying it is not always possible).

The example in question was obviously for single precision.  The 68-cycle
count appears to be approximately correct for the best case in double
precision, assuming the loop can be unrolled 8 times before running out of
registers.  One clock could probably be saved in this case by optimizing the
loop to use bcnd instead of the compare and branch sequence.
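
As a rough check of the arithmetic, the quoted instruction table can be
tallied with a few lines of Python.  This is only a sketch: the 20 MHz clock
and the figure of 2 flops per iteration are assumptions inferred from the
single precision table (2 flops / 19 cycles * 20 MHz gives the quoted 2.10
Mflops), and the one-cycle bcnd saving is the "probably" estimate above, not
a measured number.

```python
# Tally of the quoted best-case double precision loop (8 unrolls).
# ASSUMPTIONS (not stated in the thread): 20 MHz clock, 2 flops per
# iteration.  Both are inferred from the quoted single precision table.
CLOCK_MHZ = 20.0
FLOPS_PER_ITERATION = 2

# (instruction, count, cycles per instruction), copied from the quoted table
quoted_loop = [
    ("addu", 2, 1),      # loop overhead
    ("bb1", 1, 1),
    ("cmp", 1, 1),
    ("fadd.ddd", 8, 2),  # loop body
    ("fmul.ddd", 8, 2),
    ("ld.d", 16, 1),
    ("st.d", 8, 2),
]


def total_cycles(loop):
    """Sum count * cycles over every row of the table."""
    return sum(count * cycles for _, count, cycles in loop)


def mflops(cycles, iterations):
    """Mflops for `iterations` loop iterations completed in `cycles`."""
    return FLOPS_PER_ITERATION * iterations * CLOCK_MHZ / cycles


if __name__ == "__main__":
    cycles = total_cycles(quoted_loop)
    print("quoted double precision loop:", cycles, "cycles")
    print("Mflops at 8 unrolls: %.2f" % mflops(cycles, 8))
    # Estimated effect of replacing cmp + bb1 with a single bcnd
    print("Mflops with bcnd:    %.2f" % mflops(cycles - 1, 8))
    print("naive single precision: %.2f" % mflops(19, 1))
```

Under those assumptions the 68-cycle double precision loop still comes out
well ahead of the 19-cycle naive single precision loop on a per-iteration
basis.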

I haven't coded this loop, but I have unrolled similar loops such as Linpack.
Comparing 68 cycles to 19 cycles is not an apples-to-apples comparison: the
68 cycles cover 8 iterations in double precision, while the 19 cycles are for
one iteration in single precision.  The naive code would also be slowed
somewhat by using double precision.  As a first guess I would say that the
ratio of unrolled code to naive code will still be close to 3.  Compilers
have much room for improvement, particularly in floating point numerical
code.  The current compilers do very little scheduling and, as far as I am
aware, no unrolling of loops.  Just scheduling operations with latencies
greater than 1 will improve performance significantly.  Unrolling loops will
make a large difference in this type of code.
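
For readers who have not seen the transformation, here is a minimal sketch
of unrolling by 4 in Python.  The loop body is an assumption: a Linpack-style
daxpy (y[i] = y[i] + a * x[i]) is used because it has the same mix of two
loads, one multiply, one add and one store per iteration as the quoted
table.  The function names are hypothetical, and the point is the shape of
the code a compiler could generate, not Python performance.

```python
def daxpy_naive(a, x, y):
    """y[i] += a * x[i], one element per trip through the loop."""
    for i in range(len(x)):
        y[i] += a * x[i]


def daxpy_unrolled4(a, x, y):
    """Same result, four elements per trip plus a cleanup loop."""
    n = len(x)
    main = n - n % 4  # largest multiple of 4 not exceeding n
    for i in range(0, main, 4):
        # The four multiplies do not depend on each other, so a scheduler
        # is free to start each one before the previous one finishes.
        t0 = a * x[i]
        t1 = a * x[i + 1]
        t2 = a * x[i + 2]
        t3 = a * x[i + 3]
        # Likewise the four adds and stores are independent.
        y[i] += t0
        y[i + 1] += t1
        y[i + 2] += t2
        y[i + 3] += t3
    # Cleanup loop for the 0 to 3 leftover elements
    for i in range(main, n):
        y[i] += a * x[i]


if __name__ == "__main__":
    x = [float(i) for i in range(10)]
    y1 = [1.0] * 10
    y2 = [1.0] * 10
    daxpy_naive(2.0, x, y1)
    daxpy_unrolled4(2.0, x, y2)
    print(y1 == y2)
```

The unrolled version pays the loop overhead (increment, compare, branch) once
per four elements, and it exposes independent operations that can be
interleaved to cover the floating point latency.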

Data dependencies between iterations of a loop are a very significant problem
when unrolling loops.  I hope the compiler will recognize the absence of
dependencies well enough to unroll most loops that can be unrolled.  I agree
that on some loops there are dependencies that hinder unrolling.  If these
can be identified, though, the compiler may even be able to remove redundant
loads.  There is so much room for improvement that I find it difficult to be
pessimistic about how much is possible.

>For highest performance in all cases, give me the float unit with the
>highest raw speed, pipelining only works if my algorithm is suitable, raw
>speed always works.

I disagree.  I think that unless the latency is very short (2 or maybe 3
cycles), pipelining will pay off on a normal application mix.  The longer the
latency, the more likely it is that you will want to unroll or reschedule
code.  It will be interesting to see if MIPS goes to pipelining floating
point instructions in future parts.
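
The tradeoff can be put in a toy model.  This is a sketch only: it assumes a
pipelined unit can start one new independent operation per cycle, it ignores
issue restrictions and memory traffic, and the latencies of 3 and 5 cycles
are made-up illustrative values, not timings for any real part.

```python
def unpipelined_cycles(n_ops, latency):
    """Each operation must finish before the next one can start."""
    return n_ops * latency


def pipelined_cycles(n_ops, latency, independent):
    """Independent operations start one per cycle (toy assumption).

    A dependent chain gets no overlap, since each operation has to
    wait for the result of the one before it.
    """
    if n_ops == 0:
        return 0
    if independent:
        return latency + (n_ops - 1)
    return n_ops * latency


if __name__ == "__main__":
    n = 8
    fast = 3  # hypothetical "raw speed" unit, not pipelined
    slow = 5  # hypothetical pipelined unit with longer latency
    print("independent ops: unpipelined", unpipelined_cycles(n, fast),
          "vs pipelined", pipelined_cycles(n, slow, True))
    print("dependent chain: unpipelined", unpipelined_cycles(n, fast),
          "vs pipelined", pipelined_cycles(n, slow, False))
```

With these numbers the pipelined unit wins clearly on independent operations
and loses on a fully dependent chain, so the outcome depends on how much of
the application mix can be unrolled or rescheduled.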
-- 

Marvin Denman
Motorola 88000 Design
cs.utexas.edu!oakhill!marvin