Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!uunet!mstan!amull
From: amull@Morgan.COM (Andrew P. Mullhaupt)
Newsgroups: comp.sys.m88k
Subject: Re: Information wanted on m88000 Risc workstations
Summary: Some strange history
Message-ID: <681@terminus.Morgan.COM>
Date: 16 Jan 90 03:43:02 GMT
References: <641@s5.Morgan.COM> <25A64468.11498@paris.ics.uci.edu> <50855@bbn.COM>
Organization: Morgan Stanley & Co. NY, NY
Lines: 51

In article <50855@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
> In article <671@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:
>
> >But back to my original confusion - am I the only one with a BLAS
> >which unrolls its own loops? Given that I'm talking about double
> >precision arithmetic, should I really expect the compiler to find
> >yet another factor of two? I'll believe it when I see it.
>
> In older machines (those without any scalar pipeline) the only
> advantage of unrolling loops was to reduce loop overhead. Now, with
> scalar pipelines, a good instruction scheduler can likewise take
> advantage of unrolling; that is, in a loop like a(i) = b(i)*s which is
> unrolled say 4 times, b(i:i+3) can be fetched at the start of the loop
> and put into 4 registers by 4 sequential loads (I assume using
> displacement addressing). Then the four muls can be started
> sequentially, followed by 4 stores. The time to do 4 loop iterations
> in this case should be only slightly more than the time to do one
> (with all cache hits).
>
> Note that with more in the loop (like maybe two fetched vectors instead of
> one) and maybe an add, you use up all the registers real fast, especially
> with double precision.
>

As a matter of fact, the first time I had to worry about unrolling
loops was on a CDC 6600 (it was delivered in 1963 - and was the third
one built). Not that I was programming it then - but that's how old
the machine was.
Now this 'box' had (if memory serves) eight arithmetic pipelines which
could all be running simultaneously: it was something along the lines
of integer and floating add, subtract, multiply and divide (but it
wasn't exactly that - the exact specification for the Cyber, a
descendant machine, can be found in Michael Metcalf's interesting book
_FORTRAN Optimization_).

I'm not sure how old loop unrolling is, but the FORTRAN compiler for
the CDC 6600 had it by the time I got around to that machine. In fact
this is one of the machines where hand-coded assembler was as likely
to come out slower than the FORTRAN compiler's code as faster, because
the compiler took care to schedule the pipes. It could even move code
across loops or function calls in order to schedule better.

Now this is at least a 15 year old compiler and a 27 year old machine.
I don't think loop unrolling really appears in a new and different
light - and I'm somewhat disappointed in at least one of the compilers
I've run across for RISC. For example: the Sun 4 compiler willfully
punishes you if you unroll your loops in the source. It doesn't unroll
them for you either. The gcc-1.35 compiler for the same machine quite
understands what you want and you get as much as a factor of 10
speedup. This shows that the hardware is not the problem.

Can anyone who has an m88k and a C compiler check out what happens if
you unroll loops at the source level and post a short summary?

Later,
Andrew Mullhaupt
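
To make the request above concrete, here is a minimal sketch in C of
the kind of source-level comparison meant: the quoted a(i) = b(i)*s
loop written once plainly and once unrolled by 4, with the loads,
multiplies and stores grouped the way the quoted article describes.
The function names scale_rolled and scale_unrolled4 are made up for
illustration, and the sketch assumes an ANSI C compiler and that a and
b do not overlap. It only shows the two loop shapes; it does no timing
and says nothing about what any particular compiler will do with them.

```c
#include <stddef.h>

/* Plain loop: one load, one multiply, one store per trip. */
void scale_rolled(double *a, const double *b, double s, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        a[i] = b[i] * s;
}

/*
 * Same computation unrolled by 4 at the source level.
 * Assumes a and b do not overlap (hypothetical restriction for
 * this sketch), so grouping the loads ahead of the stores cannot
 * change the result.
 */
void scale_unrolled4(double *a, const double *b, double s, size_t n)
{
    size_t i = 0;
    double b0, b1, b2, b3;

    for (; i + 4 <= n; i += 4) {
        /* four sequential loads into four temporaries */
        b0 = b[i];
        b1 = b[i + 1];
        b2 = b[i + 2];
        b3 = b[i + 3];

        /* four independent multiplies a scheduler can overlap */
        b0 *= s;
        b1 *= s;
        b2 *= s;
        b3 *= s;

        /* four stores */
        a[i]     = b0;
        a[i + 1] = b1;
        a[i + 2] = b2;
        a[i + 3] = b3;
    }

    /* cleanup loop for the 0 to 3 elements left over */
    for (; i < n; i++)
        a[i] = b[i] * s;
}
```

Calling both on the same input with a large n, and timing each over
many repetitions, would give the short summary asked for. Two fetched
vectors or an added term would need more temporaries per trip, which
is the register pressure the quoted article warns about.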