Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!bonnie.concordia.ca!IRO.UMontreal.CA!matrox!uvm-gen!kira!news
From: pegram@kira.UUCP (Robert B. Pegram)
Newsgroups: comp.arch
Subject: Re: Loop instructions
Message-ID: <1991Apr30.191526.3746@uvm.edu>
Date: 30 Apr 91 19:15:26 GMT
References: <63942@bbn.BBN.COM>
Sender: news@uvm.edu
Organization: Univ. of Vermont, Eng., Math., and Bus. Admin. (EMBA) Computer Facility
Lines: 52

From article <63942@bbn.BBN.COM>, by pplacewa@bbn.com (Paul W Placeway):
> ts@cup.portal.com (Tim W Smith) writes:
>
> < Chris Torek says:
> < > However, it turns out that on the 68020 it is almost invariably faster
> < > to avoid DBcc anyway (bcopy, for instance, should be unrolled).  Score
>
> < If you unroll too far, don't you start to miss on the instruction
> < cache?  The optimum seems to be unrolled enough to lower loop overhead
> < but rolled enough to fit the loop in the cache.
>
> After hacking too many DSP things, all I have to say about loop
> unrolling is that it's a good technique to make up for a bad
> architecture.
>
> What I want in my processor is a zero-overhead-per-loop down-counting
> loop instruction.  The TMS320 series, and Motorola DSP 56000 and 96000
> have had this sort of thing for quite a while.  For those of you who
> happen to be unfamiliar, the idea is that the PC addressing hardware
> has a loop beginning, end, and count register and the hardware does a
> decrement-branch-nonzero when the PC == end-of-loop, resetting it to
> beginning-of-loop, while in the instruction fetch stage.
>

Yup, that's very familiar; I see it in my Analog Devices 2100s also.
The other nice thing is the modified Harvard architecture.  It allows
two data fetches at one time, provided the instructions are in the
cache and some of the data is stored in the instruction memory space.

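For anyone who hasn't run into it, the loop hardware described in the
quoted text can be modelled in a few lines of C.  This is only a sketch
under my own assumptions: the names (fetch_unit, next_pc, zol_vector_sum,
OP_ACC, OP_HALT) are made up for illustration and are not the registers
or opcodes of any real part.  It shows a one-instruction loop body being
repeated by the fetch stage, with a count of the instructions actually
issued.

```c
#include <stddef.h>

/* Hypothetical model of the loop registers in the PC addressing hardware. */
typedef struct {
    size_t pc;          /* program counter */
    size_t loop_start;  /* address of the first instruction in the loop */
    size_t loop_end;    /* address of the last instruction in the loop */
    size_t loop_count;  /* iterations remaining; 0 means no active loop */
} fetch_unit;

/* Compute the next PC after fetching the instruction at f->pc.
 * The decrement-branch-nonzero is done here, in the fetch stage,
 * so it never occupies an instruction slot of its own. */
static void next_pc(fetch_unit *f)
{
    if (f->loop_count > 0 && f->pc == f->loop_end) {
        f->loop_count--;
        if (f->loop_count > 0) {
            f->pc = f->loop_start;  /* branch back at no cost */
            return;
        }
    }
    f->pc++;                        /* fall through past the loop */
}

/* Two made-up opcodes: accumulate the next element, and halt. */
enum opcode { OP_ACC, OP_HALT };

/* Sum n ints on the toy machine; *executed receives the number of
 * instructions issued, including the final OP_HALT. */
long zol_vector_sum(const int *v, size_t n, size_t *executed)
{
    static const enum opcode program[] = { OP_ACC, OP_HALT };
    fetch_unit f = { 0, 0, 0, n };  /* loop body is the single OP_ACC */
    long acc = 0;
    size_t i = 0;

    *executed = 0;
    if (n == 0)
        f.pc = 1;                   /* empty vector: skip the loop body */

    for (;;) {
        enum opcode op = program[f.pc];
        (*executed)++;
        if (op == OP_HALT)
            break;
        acc += v[i++];              /* OP_ACC: one instruction per element */
        next_pc(&f);
    }
    return acc;
}
```

In this model, summing n elements issues n accumulate instructions plus
the halt, with no separate decrement or branch in the stream.  A software
loop on the same toy machine would spend at least one more instruction
per element on the loop test.
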
Frankly, I think it's good for general-purpose computer architects to
get away from the idea that all programs are geared to the time frame
of human beings - the optimizations necessary to maintain millisecond
or microsecond responses can also be useful in more general-purpose
programs.

> Note that this could also be done in a bunch of different ways, like
> for instance in a superscalar just doing a DBNZ in parallel with the
> previous instruction.  I don't care much how it actually works, but it
> is quite fun to have things like a vector sum or vector-vector add or
> an inline bcopy() that run at one instruction per operation.
>
> Now if only the "real" processor in my workstation could do such
> things.
>
>		-- Paul Placeway

Amen to that.

Bob Pegram
pegram@griffin.uvm.edu or ...!uvm-gen!pegram