Path: utzoo!attcan!uunet!aplcen!samsung!brutus.cs.uiuc.edu!apple!bbn!bbn.com!slackey
From: slackey@bbn.com (Stan Lackey)
Newsgroups: comp.sys.m88k
Subject: Re: Information wanted on m88000 Risc workstations
Message-ID: <50855@bbn.COM>
Date: 12 Jan 90 20:07:41 GMT
References: <641@s5.Morgan.COM> <25A64468.11498@paris.ics.uci.edu> <2811@yogi.oakhill.UUCP> <671@s5.Morgan.COM>
Sender: news@bbn.COM
Reply-To: slackey@BBN.COM (Stan Lackey)
Organization: Bolt Beranek and Newman Inc., Cambridge MA
Lines: 23

In article <671@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:
>But back to my original confusion - am I the only one with a BLAS
>which unrolls its own loops? Given that I'm talking about double
>precision arithmetic, should I really expect the compiler to find
>yet another factor of two? I'll believe it when I see it.

In older machines (those without any scalar pipeline) the only
advantage of unrolling loops was to reduce loop overhead. Now, with
scalar pipelines, a good instruction scheduler can also take advantage
of unrolling. That is, in a loop like

	a(i) = b(i)*s

which is unrolled, say, 4 times, b(i:i+3) can be fetched at the start
of the loop and put into 4 registers by 4 sequential loads (I assume
using displacement addressing). Then the four muls can be started
sequentially, followed by 4 stores. The time to do 4 loop iterations
in this case should be only slightly more than the time to do one
(with all cache hits).

Note that with more in the loop (like maybe two fetched vectors
instead of one, and maybe an add), you use up all the registers real
fast, especially with double precision.

-Stan
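
To make the schedule described above concrete, here is a rough sketch in C of the a(i) = b(i)*s loop unrolled 4 times, with the loads, multiplies, and stores grouped the way a scheduler might order them. The function names scale_simple and scale_unrolled4 are made up for illustration, and the sketch assumes a and b do not overlap (otherwise hoisting the loads above the stores would change the result). Whether a given compiler actually keeps this ordering, or does better on its own, depends on the compiler and the machine.

```c
#include <stddef.h>

/* Straightforward loop: one load, one multiply, one store per trip. */
void scale_simple(double *a, const double *b, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * s;
}

/*
 * Same computation unrolled by 4.
 * Assumption (not from the original post): a and b do not overlap.
 */
void scale_unrolled4(double *a, const double *b, double s, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* 4 sequential loads; constant offsets from i map naturally
           onto displacement addressing. */
        double b0 = b[i];
        double b1 = b[i + 1];
        double b2 = b[i + 2];
        double b3 = b[i + 3];

        /* 4 independent multiplies that can be issued back to back. */
        double p0 = b0 * s;
        double p1 = b1 * s;
        double p2 = b2 * s;
        double p3 = b3 * s;

        /* 4 stores. */
        a[i]     = p0;
        a[i + 1] = p1;
        a[i + 2] = p2;
        a[i + 3] = p3;
    }

    /* Cleanup loop for the 0 to 3 elements left when n is not a
       multiple of 4. */
    for (; i < n; i++)
        a[i] = b[i] * s;
}
```

Even this simple case keeps four loaded values plus s live at once, and each of those is a double. A loop that fetches two vectors and adds them would need roughly twice as many, which is the register pressure mentioned above.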