Path: utzoo!attcan!uunet!mstan!amull
From: amull@Morgan.COM (Andrew P. Mullhaupt)
Newsgroups: comp.sys.m88k
Subject: Re: Information wanted on m88000 Risc workstations
Summary: Unrolled Linpack
Message-ID: <671@s5.Morgan.COM>
Date: 12 Jan 90 01:55:26 GMT
References: <641@s5.Morgan.COM> <25A64468.11498@paris.ics.uci.edu> <2811@yogi.oakhill.UUCP>
Organization: Morgan Stanley & Co. NY, NY
Lines: 88

In article <2811@yogi.oakhill.UUCP>, marvin@oakhill.UUCP (Marvin Denman) writes:
> In article tom@ssd.csd.harris.com (Tom Horsley) writes:
||| ...
||| Discussion by Tom Wood of Data General about the possibility of boosting
||| Mflops by a factor of 3 or better with improved compiler technology.
||| ...
||This may be true for single precision, but it is hard to see how you can get
||the pipe full for double precision. Any instruction with a double precision
||source operand requires two (count'em 2) cycles before the 88k will even
||bother looking at the next instruction. Then for double precision float
||instructions there are two cycles required in the first FP1 pipe stage
||(although the one of these FP1 cycles can overlap with the last of the two
||decode cycles, so perhaps this is not so bad).
|
|
| The example in question was obviously for single precision. The 68 cycle
| appears to be approximately correct for best case double precision assuming
| the loop in double precision can be unrolled 8 times before running out of
| registers. One clock could probably be saved in this case by optimizing the
| loop to use bcnd instead of the compare and branch sequence.
|
| I haven't coded this loop, but I have unrolled similar loops such as Linpack.
| Comparing 68 cycles to 19 cycles is not an apples to apples comparison.
| The naive code would also be slowed somewhat by using double precision.
| As a first guess I would say that the ratio of unrolled code to naive code will
| still be close to 3.
| Compilers have much room for improvement particularly
| in floating point numerical code. The current compilers do very little
| scheduling and no unrolling of loops that I am aware of. Just scheduling
| operations with latencies greater than 1 will improve performance significantly.
| Unrolling loops will make a large difference in this type of code.

God help us, I hope not. Unless we're reading off different pages, the BLAS
(Basic Linear Algebra Subroutines) have loops which are unrolled in many
places just for this reason. If the compiler insists on rolling them back
up - that's its affair.

Now you can't win by unrolling every loop, because some loops are big
enough that unrolling them pops you out of cache, etc., so don't expect
unrolling loops to be the winner every time. You can sometimes get nailed
by inlining stuff for the same reason.

Now what to unroll is a harder question than it used to be because you've
so many different sizes of cache and stuff across the different machines,
but looking at the old CDC 6600 architecture and its multiply scheduled
pipelines will likely show that scientific computing has been around
this block once before. The tricks are often worthwhile, and I would
expect every self-respecting compiler to be aware of the available
weapons.

|
| Data dependencies between iterations of a loop are a very significant problem
| with unrolling loops. Hopefully the compiler will recognize the nondependencies
| well enough to unroll most loops that can be unrolled. I agree that on some
| loops there are dependencies that hinder unrolling. If these can be identified
| though the compiler may even be able to remove redundant loads. There is so
| much room for improvement that I find it difficult to be pessimistic about
| the amount of improvement that is possible.
|
||For highest performance in all cases, give me the float unit with the
||highest raw speed, pipelining only works if my algorithm is suitable, raw
||speed always works.
|
| I disagree. I think that unless the latency is very short (2 or maybe 3 cycles)
| that pipelining will pay off on a normal application mix. The longer the
| latency, the more likely it is that you will want to unroll or reschedule code.
| It will be interesting to see if MIPS goes to pipelining floating point
| instructions in future parts.

It seems to me that 68 versus 48 clocks is about a 40% penalty for double
precision. That's too much for my taste - (I can tolerate about a 20%
differential). If you think about exercising your bus, etc., double
precision probably gets higher efficiency but hurts your cache hit ratio
as compared to single precision.

It is quite likely that there are two user communities here - the
single precision fans and the double precision fans. We will most likely
end up preferring different machines. I come from the double precision
school of thought, and almost ignore single precision benchmarks. I
would expect the other camp does the reverse.

It should be pretty obvious that unrolling (a control overhead reduction
technique) will be more efficacious when the amount of real work done on
each pass of the loop is smaller. All other things being equal, we should
expect single precision code to benefit more from the application of
unrolling. (It really helps integer code no end).

But back to my original confusion - am I the only one with a BLAS which
unrolls its own loops? Given that I'm talking about double precision
arithmetic, should I really expect the compiler to find yet another
factor of two? I'll believe it when I see it.

Later,
Andrew Mullhaupt
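
For anyone who has not looked inside a BLAS, here is a minimal sketch in C of
the kind of hand unrolling under discussion. It is not taken from any
particular BLAS or from the 88k code in this thread. It follows the general
shape of the reference Fortran DAXPY for unit stride (a short cleanup loop,
then a main loop unrolled by 4). The names daxpy_naive and daxpy_unrolled4
and the translation to C are assumptions made for illustration only.

```c
#include <stddef.h>

/* Straightforward loop: one multiply-add per trip, and the loop
   test and branch are paid once for every element. */
void daxpy_naive(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Hand-unrolled by 4 (hypothetical name, illustrative only).
   The first loop handles the n % 4 leftover elements; the second
   does four independent multiply-adds per trip, so the loop test
   and branch are paid once for every four elements. */
void daxpy_unrolled4(size_t n, double a, const double *x, double *y)
{
    size_t m = n % 4;
    size_t i;

    /* Cleanup loop: at most 3 trips. */
    for (i = 0; i < m; i++)
        y[i] += a * x[i];

    /* Main loop: the remaining count is a multiple of 4. */
    for (; i < n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
}
```

Both routines compute the same y = a*x + y. The four statements in the
unrolled body do not depend on one another, which is what gives a scheduler
(or a pipelined float unit) something to overlap. Whether a given compiler
can add anything on top of source that already looks like this is exactly
the open question above.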