Path: utzoo!attcan!uunet!aplcen!samsung!brutus.cs.uiuc.edu!apple!bbn!bbn.com!slackey
From: slackey@bbn.com (Stan Lackey)
Newsgroups: comp.sys.m88k
Subject: Re: Information wanted on m88000 Risc workstations
Message-ID: <50855@bbn.COM>
Date: 12 Jan 90 20:07:41 GMT
References: <641@s5.Morgan.COM> <25A64468.11498@paris.ics.uci.edu> <2811@yogi.oakhill.UUCP> <671@s5.Morgan.COM>
Sender: news@bbn.COM
Reply-To: slackey@BBN.COM (Stan Lackey)
Organization: Bolt Beranek and Newman Inc., Cambridge MA
Lines: 23

In article <671@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:

>But back to my original confusion - am I the only one with a BLAS
>which unrolls its own loops? Given that I'm talking about double
>precision arithmetic, should I really expect the compiler to find 
>yet another factor of two? I'll believe it when I see it.

On older machines (those without any scalar pipeline) the only
advantage of unrolling loops was to reduce loop overhead.  Now, with
scalar pipelines, a good instruction scheduler can also take
advantage of unrolling.  That is, in a loop like a(i) = b(i)*s which is
unrolled, say, 4 times, b(i:i+3) can be fetched at the start of the loop
and put into 4 registers by 4 sequential loads (I assume using
displacement addressing).  Then the 4 multiplies can be started
sequentially, followed by 4 stores.  The time to do 4 loop iterations
in this case should be only slightly more than the time to do one
(with all cache hits).
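
As a rough illustration of that schedule, here is a hand-unrolled sketch
in C.  The function name scale_unrolled4 and the cleanup loop for
leftover elements are my own additions for the example, and it assumes
a and b do not overlap.  Whether a given compiler actually emits the
loads, multiplies and stores in this grouped order is up to its
scheduler; the source order below only shows the intent.

```c
#include <stddef.h>

/* a[i] = b[i] * s, unrolled 4 times (hypothetical example routine).
 * Assumes a and b do not overlap. */
void scale_unrolled4(double *a, const double *b, double s, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* 4 independent loads, each a fixed displacement from &b[i] */
        double b0 = b[i];
        double b1 = b[i + 1];
        double b2 = b[i + 2];
        double b3 = b[i + 3];

        /* 4 independent multiplies; none waits on another's result,
         * so they can be issued back to back into the pipeline */
        double p0 = b0 * s;
        double p1 = b1 * s;
        double p2 = b2 * s;
        double p3 = b3 * s;

        /* 4 stores, again at fixed displacements from &a[i] */
        a[i]     = p0;
        a[i + 1] = p1;
        a[i + 2] = p2;
        a[i + 3] = p3;
    }

    /* cleanup: the 0 to 3 elements left when n is not a multiple of 4 */
    for (; i < n; i++)
        a[i] = b[i] * s;
}
```

The loop index is bumped once per 4 elements, which is the old
loop-overhead saving; the grouping into loads, multiplies and stores is
what gives the scheduler something to overlap.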

Note that with more in the loop (say two fetched vectors instead of
one, plus an add), you use up all the registers real fast, especially
with double precision.
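
To put a number on that, here is the same kind of sketch for
a(i) = b(i)*s + c(i), again unrolled 4 times.  The routine name
scale_add_unrolled4 is made up for the example and the arrays are
assumed not to overlap.  With all the loads done up front there are 8
loaded values plus s live at once, i.e. 9 doubles.  If each double
takes a pair of registers (which I believe is the case on the 88100,
but treat that as an assumption), that is 18 registers before counting
the pointers and the loop index.

```c
#include <stddef.h>

/* a[i] = b[i] * s + c[i], unrolled 4 times (hypothetical example
 * routine).  Assumes a, b and c do not overlap. */
void scale_add_unrolled4(double *a, const double *b, const double *c,
                         double s, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* 8 loads up front: 8 double values live at once, plus s */
        double b0 = b[i],     c0 = c[i];
        double b1 = b[i + 1], c1 = c[i + 1];
        double b2 = b[i + 2], c2 = c[i + 2];
        double b3 = b[i + 3], c3 = c[i + 3];

        /* 4 independent multiply-then-add chains */
        double r0 = b0 * s + c0;
        double r1 = b1 * s + c1;
        double r2 = b2 * s + c2;
        double r3 = b3 * s + c3;

        /* 4 stores */
        a[i]     = r0;
        a[i + 1] = r1;
        a[i + 2] = r2;
        a[i + 3] = r3;
    }

    /* cleanup for the elements left over when n is not a multiple of 4 */
    for (; i < n; i++)
        a[i] = b[i] * s + c[i];
}
```

Unrolling by 8 instead of 4 would roughly double the live values, which
is where you start running out.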

-Stan