Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!samsung!uakari.primate.wisc.edu!uflorida!novavax!hcx1!tom
From: tom@ssd.csd.harris.com (Tom Horsley)
Newsgroups: comp.sys.m88k
Subject: Re: Information wanted on m88000 Risc workstations
Message-ID:
Date: 9 Jan 90 15:16:28 GMT
References: <641@s5.Morgan.COM> <25A64468.11498@paris.ics.uci.edu> <648@s5.Morgan.COM> <1879@xyzzy.UUCP>
Sender: news@hcx1.UUCP
Organization: Harris Computer Systems Division
Lines: 84
In-reply-to: wood@dg-rtp.dg.com's message of 8 Jan 90 20:22:06 GMT

>I'd like to entertain a discussion on the FP performance of the 88k.
>I have yet to see a compiler that takes advantage of the pipeline
>on this machine to any extent. Theoretically, you can have 5 FP adds
>and 6 FP multiplies going on at once (if I understand correctly, the total
>here is not 11, but 9: at most 5 FP adds or at most 6 FP multiplies and
>no more than 9 total). So how would you feel if someone were able to
>boost Mflops by a factor of say 3 (or better) by improving the compiler
>technology?

This may be true for single precision, but it is hard to see how you can
get the pipe full for double precision. Any instruction with a double
precision source operand requires two (count 'em, 2) cycles before the 88k
will even bother looking at the next instruction. Then for double precision
float instructions there are two cycles required in the first FP1 pipe
stage (although one of these FP1 cycles can overlap with the last of the
two decode cycles, so perhaps this is not so bad).

>Code Generation Technique     Cycles/iteration    Mflops
>
>  Naive code                        19             2.10
>  Naive code, 2 unrolls             35/2           2.28
>  Sophisticated, 4 unrolls          28/4           5.71
>  Sophisticated, 8 unrolls          48/8           6.67
>
>Well, how 'bout it!?
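
The quoted table can be sanity-checked with a few lines of arithmetic. The sketch below is only that: it assumes a 20 MHz clock and 2 floating-point operations (one add, one multiply) per loop iteration. Neither figure is stated in the quoted article; they are inferred because they reproduce the quoted Mflops column. The names `mflops` and `QUOTED_TABLE` are made up for illustration.

```python
# Sanity check of the quoted Cycles/iteration -> Mflops table.
# ASSUMPTIONS (not stated in the post): a 20 MHz clock and 2 floating-point
# operations (one add, one multiply) per loop iteration.
CLOCK_MHZ = 20.0
FLOPS_PER_ITERATION = 2


def mflops(total_cycles, unroll=1,
           clock_mhz=CLOCK_MHZ, flops_per_iteration=FLOPS_PER_ITERATION):
    """Mflops for an unrolled loop body that takes total_cycles cycles
    and covers `unroll` iterations of the original loop."""
    cycles_per_iteration = total_cycles / unroll
    return clock_mhz * flops_per_iteration / cycles_per_iteration


# (technique, total cycles, unroll factor, Mflops as quoted)
QUOTED_TABLE = [
    ("Naive code", 19, 1, 2.10),
    ("Naive code, 2 unrolls", 35, 2, 2.28),
    ("Sophisticated, 4 unrolls", 28, 4, 5.71),
    ("Sophisticated, 8 unrolls", 48, 8, 6.67),
]

for technique, cycles, unroll, quoted in QUOTED_TABLE:
    computed = mflops(cycles, unroll)
    print(f"{technique:26s} computed {computed:5.2f}   quoted {quoted:.2f}")
```

Under these assumptions every row agrees with the quoted figure to within 0.01 Mflops; the 35/2 row works out to about 2.286, so the quoted 2.28 looks truncated rather than rounded.
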

In your example, even if everything is pipelined, the minimum number of
instructions that seems to be required just to do the computation is:

    instruction   number   cycles
    addu             2        2      loop overhead
    bb1              1        1
    cmp              1        1
    fadd.ddd         8       16      loop body
    fmul.ddd         8       16
    ld.d            16       16
    st.d             8       16
    -----------------------------
                             68

As near as I can tell 68 is not equal to 48. Do you have actual assembler
code that does this inner loop in 48 cycles? Could you post it? As near as
I can tell, this example does not work out as well as the original poster
implied.

Couple this with the real world fact (known even by Cray users with heavy
duty vectorizing compilers) that an awful lot of real world algorithms have
dependencies on previous results. No matter how good your compiler is, it
cannot pipeline these algorithms, because the next thing depends on the
last thing. (Obviously it is worth the trouble to pipeline when you can; I
am just saying it is not always possible.)

Another note said something about doing these sorts of optimizations at the
assembly level. This is also likely to turn out to be very hard. The code
generated by the compiler is very likely to have the st.d instruction right
after the fadd.ddd instruction and right before the next set of ld.d
instructions. Unless the assembler is equipped to do enough symbolic
execution to prove that there is no aliasing, it is going to have to leave
the st.d in front of the next set of ld.d instructions. This effectively
serializes the code, since the thing being stored is the result of the
fadd, and there are very few things that can be reordered to fill pipeline
slots.

For highest performance in all cases, give me the float unit with the
highest raw speed. Pipelining only works if my algorithm is suitable; raw
speed always works.

Note: If the sample code had a divide instruction in it, it would be orders
of magnitude worse. Divides are *really* awful (they can't even be
pipelined).

Note Note: I am not fundamentally against the 88k. In fact, I like it.
I just wish the double precision performance were better. The main reason
to buy an 88k box over and above a MIPS or a 486 hot box is the existence
of the BCS standard. DEC has effectively shot MIPS in the foot by deciding
to run their boxes with the bytes backward. This makes it nearly impossible
to imagine a useful BCS ever happening across the full line of MIPS based
boxes.
--
=====================================================================
domain: tahorsley@ssd.csd.harris.com       USMail: Tom Horsley
  uucp: ...!novavax!hcx1!tahorsley                 511 Kingbird Circle
      or  ...!uunet!hcx1!tahorsley                 Delray Beach, FL 33444
======================== Aging: Just say no! ========================