Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!wuarchive!cs.utexas.edu!oakhill!marvin
From: marvin@oakhill.UUCP (Marvin Denman)
Newsgroups: comp.sys.m88k
Subject: Re: Information wanted on m88000 Risc workstations
Message-ID: <2811@yogi.oakhill.UUCP>
Date: 11 Jan 90 00:55:42 GMT
References: <641@s5.Morgan.COM> <25A64468.11498@paris.ics.uci.edu> <648@s5.Morgan.COM> <1879@xyzzy.UUCP>
Reply-To: cs.utexas.edu!oakhill!marvin (Marvin Denman)
Organization: Motorola Inc., Austin, Texas
Lines: 86

In article tom@ssd.csd.harris.com (Tom Horsley) writes:
>> ...
>> Discussion by Tom Wood of Data General about the possibility of boosting
>> Mflops by a factor of 3 or better with improved compiler technology.
>> ...
>This may be true for single precision, but it is hard to see how you can get
>the pipe full for double precision. Any instruction with a double precision
>source operand requires two (count'em 2) cycles before the 88k will even
>bother looking at the next instruction. Then for double precision float
>instructions there are two cycles required in the first FP1 pipe stage
>(although the one of these FP1 cycles can overlap with the last of the two
>decode cycles, so perhaps this is not so bad).

Two cycles to issue a double precision operation is an artifact of the 88100
implementation. The penalty is only two cycles for initiating and terminating
instructions though. The pipes generally compress out bubbles, so any stalls
at the end of the pipe are usually hidden unless the pipe is full for some
reason.

>
>>Code Generation Technique     Cycles/iteration    Mflops
>>  Naive code                       19             2.10
>>  Naive code, 2 unrolls            35/2           2.28
>>  Sophisticated, 4 unrolls         28/4           5.71
>>  Sophisticated, 8 unrolls         48/8           6.67
>In your example, even if everything is pipelined, the minimum number of
>instructions that seem to be required just to do the computation is:
>instruction     number    cycles
>  addu             2         2     loop overhead
>  bb1              1         1
>  cmp              1         1
>  fadd.ddd         8        16     loop body
>  fmul.ddd         8        16
>  ld.d            16        16
>  st.d             8        16
>-----------------------------
>                            68
>As near as I can tell, this example does not work out as well as the
>original poster implied. Couple this with the real world fact (known even
>by Cray users with heavy duty vectorizing compilers) that an awful lot of
>real world algorithms have dependencies on previous results. No matter how
>good your compiler is, it cannot pipeline these algorithms, because the next
>thing depends on the last thing. (Obviously it is worth the trouble to
>pipeline when you can, I am just saying it is not always possible).

The example in question was obviously for single precision. The 68 cycle
figure appears to be approximately correct for best case double precision,
assuming the loop in double precision can be unrolled 8 times before running
out of registers. One clock could probably be saved in this case by
optimizing the loop to use bcnd instead of the compare and branch sequence.
I haven't coded this loop, but I have unrolled similar loops such as Linpack.

Comparing 68 cycles to 19 cycles is not an apples to apples comparison: the
68 cycles cover 8 unrolled iterations in double precision, while the 19
cycles cover a single iteration in single precision. The naive code would
also be slowed somewhat by using double precision. As a first guess I would
say that the ratio of unrolled code to naive code will still be close to 3.

Compilers have much room for improvement, particularly in floating point
numerical code. The current compilers do very little scheduling and no
unrolling of loops that I am aware of. Just scheduling operations with
latencies greater than 1 will improve performance significantly.
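
The cycle counts and Mflops figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is not from either poster. It assumes a 20 MHz clock and 2 floating point operations per loop iteration, values picked only because they reproduce the quoted Mflops column to within 0.01. The names `mflops`, `total_cycles` and `DOUBLE_TALLY` are made up for illustration.

```python
# Sketch only. CLOCK_MHZ and FLOPS_PER_ITERATION are assumptions, not
# figures stated in the thread; they were chosen because they reproduce
# the quoted Mflops column to within 0.01.
CLOCK_MHZ = 20.0
FLOPS_PER_ITERATION = 2


def mflops(cycles, iterations=1):
    """Mflops for a loop body of `cycles` clocks covering `iterations` iterations."""
    return CLOCK_MHZ * FLOPS_PER_ITERATION * iterations / cycles


# (instruction, count, total cycles) copied from the quoted double
# precision tally for the loop unrolled 8 times.
DOUBLE_TALLY = [
    ("addu", 2, 2),
    ("bb1", 1, 1),
    ("cmp", 1, 1),
    ("fadd.ddd", 8, 16),
    ("fmul.ddd", 8, 16),
    ("ld.d", 16, 16),
    ("st.d", 8, 16),
]


def total_cycles(tally):
    """Sum the cycles column of an instruction tally."""
    return sum(cycles for _name, _count, cycles in tally)


if __name__ == "__main__":
    # Single precision rows from the quoted table: (label, cycles, unrolls).
    for label, cycles, unrolls in [
        ("Naive code", 19, 1),
        ("Naive code, 2 unrolls", 35, 2),
        ("Sophisticated, 4 unrolls", 28, 4),
        ("Sophisticated, 8 unrolls", 48, 8),
    ]:
        print(f"{label:26s} {mflops(cycles, unrolls):5.2f} Mflops")

    # Double precision estimate under the same assumptions.
    dp = total_cycles(DOUBLE_TALLY)
    print(f"double, 8 unrolls: {dp} cycles, {mflops(dp, 8):.2f} Mflops")
    # One clock saved if bcnd replaces the compare and branch sequence.
    print(f"with bcnd: {dp - 1} cycles, {mflops(dp - 1, 8):.2f} Mflops")
```

Under these assumptions the per-iteration cost is the fair basis for comparison: 68 cycles over 8 iterations is 8.5 cycles per double precision iteration, against 6 cycles per iteration for the best single precision row.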
Unrolling loops will make a large difference in this type of code. Data
dependencies between iterations of a loop are a very significant problem
when unrolling. Hopefully the compiler will recognize the nondependencies
well enough to unroll most loops that can be unrolled. I agree that on some
loops there are dependencies that hinder unrolling. If these can be
identified, though, the compiler may even be able to remove redundant loads.
There is so much room for improvement that I find it difficult to be
pessimistic about the amount of improvement that is possible.

>For highest performance in all cases, give me the float unit with the
>highest raw speed, pipelining only works if my algorithm is suitable, raw
>speed always works.

I disagree. I think that unless the latency is very short (2 or maybe 3
cycles), pipelining will pay off on a normal application mix. The longer
the latency, the more likely it is that you will want to unroll or reschedule
code. It will be interesting to see if MIPS goes to pipelining floating
point instructions in future parts.

--
Marvin Denman
Motorola 88000 Design
cs.utexas.edu!oakhill!marvin