Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!samsung!uakari.primate.wisc.edu!uflorida!novavax!hcx1!tom
From: tom@ssd.csd.harris.com (Tom Horsley)
Newsgroups: comp.sys.m88k
Subject: Re: Information wanted on m88000 Risc workstations
Message-ID: <TOM.90Jan9101628@hcx2.ssd.csd.harris.com>
Date: 9 Jan 90 15:16:28 GMT
References: <641@s5.Morgan.COM> <25A64468.11498@paris.ics.uci.edu>
	<648@s5.Morgan.COM> <1879@xyzzy.UUCP>
Sender: news@hcx1.UUCP
Organization: Harris Computer Systems Division
Lines: 84
In-reply-to: wood@dg-rtp.dg.com's message of 8 Jan 90 20:22:06 GMT

>I'd like to entertain a discussion on the FP performance of the 88k.
>I have yet to see a compiler that takes advantage of the pipeline
>on this machine to any extent.  Theoretically, you can have 5 FP adds
>and 6 FP multiplies going on at once (if I understand correctly, the total
>here is not 11, but 9: at most 5 FP adds or at most 6 FP multiplies and
>no more than 9 total).  So how would you feel if someone were able to
>boost Mflops by a factor of say 3 (or better) by improving the compiler 
>technology?

This may be true for single precision, but it is hard to see how you can get
the pipe full for double precision. Any instruction with a double precision
source operand requires two (count 'em, 2) cycles before the 88k will even
bother looking at the next instruction. Then for double precision float
instructions there are two cycles required in the first FP1 pipe stage
(although one of these FP1 cycles can overlap with the last of the two
decode cycles, so perhaps this is not so bad).

>Code Generation Technique      Cycles/iteration      Mflops
>
>    Naive code                      19                 2.10
>    Naive code, 2 unrolls          35/2                2.28
>    Sophisticated, 4 unrolls       28/4                5.71
>    Sophisticated, 8 unrolls       48/8                6.67
>
>Well, how 'bout it!?

In your example, even if everything is pipelined, the minimum number of
instructions that seems to be required just to do the computation for the
8-unroll case is:

instruction   number   cycles

       addu        2        2   loop overhead
        bb1        1        1
        cmp        1        1

   fadd.ddd        8       16   loop body
   fmul.ddd        8       16
       ld.d       16       16
       st.d        8       16
-----------------------------
                           68

As near as I can tell 68 is not equal to 48. Do you have actual assembler
code that does this inner loop in 48 cycles? Could you post it?
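
To make the arithmetic above easy to check, here is a minimal Python sketch
of the tally. The counts and cycle totals come from the table; the 20 MHz
clock and the 2 flops per iteration (one fadd plus one fmul) are assumptions
inferred from the quoted Mflops figures, not numbers stated anywhere in this
thread. The names TALLY, total_cycles and mflops are made up for
illustration.

```python
# (count, cycles per instruction) for 8 unrolled iterations.
# Per-instruction cycles are the table's cycle totals divided by the counts.
TALLY = {
    "addu":     (2, 1),   # loop overhead
    "bb1":      (1, 1),
    "cmp":      (1, 1),
    "fadd.ddd": (8, 2),   # loop body
    "fmul.ddd": (8, 2),
    "ld.d":     (16, 1),
    "st.d":     (8, 2),
}


def total_cycles(tally):
    """Sum count * cycles over every instruction in the tally."""
    return sum(count * cycles for count, cycles in tally.values())


def mflops(cycles, iterations, flops_per_iter=2, clock_mhz=20.0):
    """Mflops for a loop taking `cycles` cycles to run `iterations` iterations.

    ASSUMPTION: the defaults (2 flops per iteration, 20 MHz) are inferred
    from the quoted table, not stated in the post.
    """
    return flops_per_iter * iterations * clock_mhz / cycles


if __name__ == "__main__":
    cycles = total_cycles(TALLY)
    print(cycles)                       # 68
    print(mflops(cycles, 8))            # rate implied by 68 cycles
    print(mflops(48, 8))                # rate implied by the claimed 48 cycles
```

Under those assumptions, 68 cycles for 8 iterations works out to roughly
4.7 Mflops, against the 6.67 quoted for 48 cycles.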

As near as I can tell, this example does not work out as well as the
original poster implied.  Couple this with the real world fact (known even
by Cray users with heavy duty vectorizing compilers) that an awful lot of
real world algorithms have dependencies on previous results. No matter how
good your compiler is, it cannot pipeline these algorithms, because the next
thing depends on the last thing.  (Obviously it is worth the trouble to
pipeline when you can; I am just saying it is not always possible.)
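
The dependency point can be shown with a toy model. This is only a sketch:
it assumes one operation issues per cycle, in order, and that a result
becomes usable `depth` cycles after issue. The depth of 5 is borrowed from
the "5 FP adds going on at once" figure quoted earlier; treating it as the
result latency is my assumption, and the model ignores the extra issue
cycles for double precision. The function name finish_cycle is
hypothetical.

```python
def finish_cycle(deps, depth):
    """Return the cycle in which the last operation's result is available.

    deps[i] is the index of the earlier operation whose result operation i
    needs, or None if operation i is independent.
    """
    done = []        # done[i] = cycle at which operation i's result is ready
    next_issue = 0   # earliest cycle at which the next operation may issue
    for dep in deps:
        issue = next_issue
        if dep is not None:
            # Stall until the operand produced by the earlier op is ready.
            issue = max(issue, done[dep])
        done.append(issue + depth)
        next_issue = issue + 1
    return max(done) if done else 0


if __name__ == "__main__":
    n, depth = 8, 5
    independent = [None] * n                   # e.g. eight unrelated adds
    chained = [None] + list(range(n - 1))      # each add needs the previous sum
    print(finish_cycle(independent, depth))    # 12
    print(finish_cycle(chained, depth))        # 40
```

Eight independent adds finish in 12 cycles in this model, while a chain of
eight where each needs the previous result takes 40, and no amount of
scheduling changes that.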

Another note said something about doing these sorts of optimizations at the
assembly level. This is also likely to turn out to be very hard.  The code
generated by the compiler is very likely to have the st.d instruction right
after the fadd.ddd instruction and right before the next set of ld.d
instructions. Unless the assembler is equipped to do enough symbolic
execution to prove that there is no aliasing, it is going to have to leave
the st.d in front of the next set of ld.d instructions. This effectively
serializes the code since the thing being stored is the result of the fadd,
and there are very few things that can be reordered to fill pipeline slots.
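
Here is a small Python sketch of why the store cannot be moved past the
following loads without an aliasing proof. Memory is modelled as a plain
list, and the loop body (load, add 1.0, store) is a made-up stand-in, not
the loop from the original example. The function names in_order and
loads_hoisted are hypothetical.

```python
def in_order(mem, src, dst, n):
    """Load, add, store, one element at a time (the order the compiler emits)."""
    mem = list(mem)
    for i in range(n):
        mem[dst + i] = mem[src + i] + 1.0
    return mem


def loads_hoisted(mem, src, dst, n):
    """Issue every load before any store, as a reordering pass might want to."""
    mem = list(mem)
    loaded = [mem[src + i] for i in range(n)]
    for i in range(n):
        mem[dst + i] = loaded[i] + 1.0
    return mem


if __name__ == "__main__":
    mem = [0.0] * 8

    # Disjoint source and destination: the reordering is harmless.
    print(in_order(mem, 0, 4, 4) == loads_hoisted(mem, 0, 4, 4))   # True

    # Destination overlaps the source by one element: each store feeds
    # the next load, so hoisting the loads changes the answer.
    print(in_order(mem, 0, 1, 4))        # [0.0, 1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0]
    print(loads_hoisted(mem, 0, 1, 4))   # [0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```

An assembler looking only at register and address operands cannot tell
which of these two cases it is in, so it has to keep the original order.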

For highest performance in all cases, give me the float unit with the
highest raw speed. Pipelining only works if my algorithm is suitable; raw
speed always works.

Note: If the sample code had a divide instruction in it, it would be orders
of magnitude worse. Divides are *really* awful (they can't even be
pipelined).

Note Note: I am not fundamentally against the 88k. In fact, I like it. I
just wish the double precision performance were better. The main reason to
buy an 88k box over and above a MIPS or a 486 hot box is the existence of
the BCS standard. DEC has effectively shot MIPS in the foot by deciding to
run their boxes with the bytes backward. This makes it nearly impossible to
imagine a useful BCS ever happening across the full line of MIPS based
boxes.
--
=====================================================================
domain: tahorsley@ssd.csd.harris.com  USMail: Tom Horsley
  uucp: ...!novavax!hcx1!tahorsley            511 Kingbird Circle
      or  ...!uunet!hcx1!tahorsley            Delray Beach, FL  33444
======================== Aging: Just say no! ========================