Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!uunet!mstan!amull
From: amull@Morgan.COM (Andrew P. Mullhaupt)
Newsgroups: comp.sys.m88k
Subject: Re: Information wanted on m88000 Risc workstations
Summary: Some strange history
Message-ID: <681@terminus.Morgan.COM>
Date: 16 Jan 90 03:43:02 GMT
References: <641@s5.Morgan.COM> <25A64468.11498@paris.ics.uci.edu> <50855@bbn.COM>
Organization: Morgan Stanley & Co. NY, NY
Lines: 51

In article <50855@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
> In article <671@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:
> 
> >But back to my original confusion - am I the only one with a BLAS
> >which unrolls its own loops? Given that I'm talking about double
> >precision arithmetic, should I really expect the compiler to find 
> >yet another factor of two? I'll believe it when I see it.
> 
> In older machines (those without any scalar pipeline) the only
> advantage of unrolling loops was to reduce loop overhead.  Now, with
> scalar pipelines, a good instruction scheduler can likewise take
> advantage of unrolling; that is, in a loop like a(i) = b(i)*s which is
> unrolled say 4 times, b(i:i+3) can be fetched at the start of the loop
> and put into 4 registers by 4 sequential loads (I assume using
> displacement addressing).  Then the four muls can be started
> sequentially, followed by 4 stores.  The time to do 4 loop iterations
> in this case should be only slightly more than the time to do one
> (with all cache hits).
> 
> Note that with more in the loop (like maybe two fetched vectors instead of
> one) and maybe an add, you use up all the registers real fast, especially
> with double precision.
> 
As a matter of fact, the first time I had to worry about unrolling loops was
on a CDC 6600 (it was delivered in 1963 - and was the third one built). Not
that I was programming it then - but that's how old the machine was. Now this
'box' had (if memory serves) eight arithmetic pipelines which could all be
simultaneously running. It was something along the lines of integer and floating
add, subtract, multiply and divide (but it wasn't exactly that; the exact
specification for the Cyber, a descendant machine, can be found in Michael
Metcalf's interesting book _FORTRAN Optimization_). I'm not sure how old loop
unrolling is, but the FORTRAN compiler for the CDC 6600 had it by the time I
got around to that machine. In fact this is one of the machines where hand-coded
assembler was as likely to run slower than the FORTRAN compiler's output as
faster, because the compiler took care to schedule the pipes. It could even move code across
loops or function calls in order to schedule better. Now this is at least a
15 year old compiler and a 27 year old machine. I don't think the role of 
loop unrolling is really in a new and different light - and I'm somewhat
disappointed in at least one of the compilers I've run across for RISC.

For example: the Sun 4 compiler willfully punishes you if you unroll your loops
in the source. It doesn't unroll them for you either. The gcc-1.35 compiler
for the same machine quite understands what you want and you get as much as a
factor of 10 speedup. This proves that the hardware is not the problem. 

Can anyone who has an m88k and a C compiler check out what happens if you unroll
loops at the source level and post a short summary? 
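
A minimal sketch of the kind of source-level unrolling in question, using the
a(i) = b(i)*s loop from the quoted article. The function names scale_rolled
and scale_unrolled4 are made up for illustration, and the code assumes that
a and b do not overlap. It is only a test case to time, not a tuned BLAS
routine.

```c
#include <stddef.h>

/* Plain loop: one load, one multiply, one store and one loop test
 * per element. */
void scale_rolled(double *a, const double *b, double s, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        a[i] = b[i] * s;
}

/* Same computation unrolled by 4 at the source level.  The loads are
 * grouped first, then the multiplies, then the stores, which is the
 * ordering the quoted article describes a scheduler as wanting.
 * Assumes a and b do not overlap (assumption, not checked here). */
void scale_unrolled4(double *a, const double *b, double s, size_t n)
{
    size_t i = 0;

    for (; i + 3 < n; i += 4) {
        /* four independent loads */
        double b0 = b[i];
        double b1 = b[i + 1];
        double b2 = b[i + 2];
        double b3 = b[i + 3];

        /* four independent multiplies */
        double p0 = b0 * s;
        double p1 = b1 * s;
        double p2 = b2 * s;
        double p3 = b3 * s;

        /* four stores */
        a[i]     = p0;
        a[i + 1] = p1;
        a[i + 2] = p2;
        a[i + 3] = p3;
    }

    /* cleanup loop for the 0 to 3 elements left over */
    for (; i < n; i++)
        a[i] = b[i] * s;
}
```

Timing both versions over a long vector, with the optimizer off and on, should
show whether a given compiler rewards or punishes the hand unrolling.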


Later,
Andrew Mullhaupt