Path: utzoo!attcan!uunet!mstan!amull
From: amull@Morgan.COM (Andrew P. Mullhaupt)
Newsgroups: comp.sys.m88k
Subject: Re: Information wanted on m88000 Risc workstations
Summary: Unrolled Linpack
Message-ID: <671@s5.Morgan.COM>
Date: 12 Jan 90 01:55:26 GMT
References: <641@s5.Morgan.COM> <25A64468.11498@paris.ics.uci.edu> <2811@yogi.oakhill.UUCP>
Organization: Morgan Stanley & Co. NY, NY
Lines: 88

In article <2811@yogi.oakhill.UUCP>, marvin@oakhill.UUCP (Marvin Denman) writes:
> In article <TOM.90Jan9101628@hcx2.ssd.csd.harris.com> tom@ssd.csd.harris.com (Tom Horsley) writes:
|||  ...
|||  Discussion by Tom Wood of Data General about the possibility of boosting
|||  Mflops by a factor of 3 or better with improved compiler technology.
|||  ...
||This may be true for single precision, but it is hard to see how you can get
||the pipe full for double precision. Any instruction with a double precision
||source operand requires two (count'em 2) cycles before the 88k will even
||bother looking at the next instruction. Then for double precision float
||instructions there are two cycles required in the first FP1 pipe stage
||(although the one of these FP1 cycles can overlap with the last of the two
||decode cycles, so perhaps this is not so bad).
| 
| 
| The example in question was obviously for single precision.  The 68 cycle
| appears to be approximately correct for best case double precision assuming 
| the loop in double precision can be unrolled 8 times before running out of 
| registers.  One clock could probably be saved in this case by optimizing the
| loop to use bcnd instead of the compare and branch sequence.
| 
| I haven't coded this loop, but I have unrolled similar loops such as Linpack.  
| Comparing 68 cycles to 19 cycles is not an apples to apples comparison.
| The naive code would also be slowed somewhat by using double precision.  
| As a first guess I would say that the ratio of unrolled code to naive code will 
| still be close to 3.  Compilers have much room for improvement particularly
| in floating point numerical code.  The current compilers do very little
| scheduling and no unrolling of loops that I am aware of.  Just scheduling
| operations with latencies greater than 1 will improve performance significantly.
| Unrolling loops will make a large difference in this type of code.

God help us, I hope not. Unless we're reading off different pages, the
BLAS (Basic Linear Algebra Subprograms) have loops which are already
unrolled in many places for just this reason. If the compiler insists on
rolling them back up, that's its affair. Now you can't win by unrolling every
loop, because some loops are big enough that unrolling them pops you
out of cache, etc., so don't expect unrolling loops to be the winner
every time. You can sometimes get nailed by inlining stuff for the same
reason. What to unroll is a harder question than it used to be,
because you've so many different sizes of cache and so on across the
different machines, but a look at the old CDC 6600 architecture
and its multiply scheduled pipelines will likely show that scientific
computing has been around this block once before. The tricks are often
worthwhile, and I would expect every self-respecting compiler to be
aware of the available weapons. 
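
For concreteness, here is a minimal sketch in C of the kind of hand
unrolling I mean. It is modeled loosely on the unit-stride case of the
reference DAXPY, which unrolls by four with a short cleanup loop. The
name daxpy_unrolled4 and the C interface are made up for illustration;
this is not the actual BLAS routine.

```c
#include <stddef.h>

/* y[i] += a * x[i] for i in [0, n), unit stride only.
 * Hypothetical name and signature, for illustration only. */
void daxpy_unrolled4(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;
    size_t rem = n % 4;

    /* Cleanup loop: handle the n mod 4 leftover elements first. */
    for (; i < rem; i++)
        y[i] += a * x[i];

    /* Main loop: four independent multiply-adds per loop branch.
     * (n - rem) is a multiple of 4, so this never runs past n. */
    for (; i < n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
}
```

The four statements in the main loop do not depend on one another, so
a scheduler is free to overlap them. Whether a factor of four is the
right choice on any given machine is exactly the open question.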
| 
| Data dependencies between iterations of a loop are a very significant problem
| with unrolling loops.  Hopefully the compiler will recognize the nondependencies
| well enough to unroll most loops that can be unrolled.  I agree that on some
| loops there are dependencies that hinder unrolling.  If these can be identified
| though the compiler may even be able to remove redundant loads.  There is so
| much room for improvement that I find it difficult to be pessimistic about 
| the amount of improvement that is possible.
| 
||For highest performance in all cases, give me the float unit with the
||highest raw speed, pipelining only works if my algorithm is suitable, raw
||speed always works.
| 
| I disagree.  I think that unless the latency is very short (2 or maybe 3 cycles)
| that pipelining will pay off on a normal application mix.  The longer the
| latency, the more likely it is that you will want to unroll or reschedule code.
| It will be interesting to see if MIPS goes to pipelining floating point 
| instructions in future parts.

It seems to me that 68 versus 48 clocks is about a 40% penalty for
double precision (20 extra clocks on 48, or roughly 42%). That's too
much for my taste; I can tolerate about a 20% differential. If you
think about exercising your bus, etc.,
double precision probably gets higher efficiency but hurts your cache
hit ratio as compared to single precision. 

It is quite likely that there are two user communities here - the
single precision fans and the double precision fans. We will most
likely end up preferring different machines. I come from the double
precision school of thought, and almost ignore single precision
benchmarks. I would expect the other camp does the reverse. It should
be pretty obvious that unrolling (a control overhead reduction
technique) will be more efficacious when the amount of real work
done on each pass of the loop is smaller. All other things being
equal, we should expect single precision code to benefit more from the
application of unrolling. (It really helps integer code no end.)
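
To put rough numbers on that, here is a toy cost model in C. It
assumes each pass costs a fixed amount of real work plus a fixed
control overhead, and that unrolling by u spreads the overhead over u
passes. The function name and the cycle counts are invented for
illustration; the model ignores pipeline latency and cache effects
entirely, so treat it as a sketch and not a prediction for the 88k.

```c
/* Estimated speedup from unrolling by a factor of u under a toy model:
 *   rolled cost per element   = work + overhead
 *   unrolled cost per element = work + overhead / u
 * All quantities are in cycles and are hypothetical. */
double unroll_speedup(double work, double overhead, unsigned u)
{
    return (work + overhead) / (work + overhead / (double)u);
}
```

With 2 cycles of overhead and u = 4, a loop doing 2 cycles of work per
pass comes out at 4/2.5 = 1.6 times faster, while one doing 8 cycles
of work per pass gets only 10/8.5, or about 1.18. The cheaper the loop
body, the more unrolling buys you.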

But back to my original confusion - am I the only one with a BLAS
which unrolls its own loops? Given that I'm talking about double
precision arithmetic, should I really expect the compiler to find 
yet another factor of two? I'll believe it when I see it.

Later,
Andrew Mullhaupt