Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!bonnie.concordia.ca!IRO.UMontreal.CA!matrox!uvm-gen!kira!news
From: pegram@kira.UUCP (Robert B. Pegram)
Newsgroups: comp.arch
Subject: Re: Loop instructions
Message-ID: <1991Apr30.191526.3746@uvm.edu>
Date: 30 Apr 91 19:15:26 GMT
References: <63942@bbn.BBN.COM>
Sender: news@uvm.edu
Organization: Univ. of Vermont, Eng., Math., and Bus. Admin. (EMBA) Computer Facility
Lines: 52

From article <63942@bbn.BBN.COM>, by pplacewa@bbn.com (Paul W Placeway):
> ts@cup.portal.com (Tim W Smith) writes:
> 
> < Chris Torek says:
> < > However, it turns out that on the 68020 it is almost invariably faster
> < > to avoid DBcc anyway (bcopy, for instance, should be unrolled).  Score
> 
> < If you unroll too far, don't you start to miss on the instruction
> < cache?  The optimum seems to be unrolled enough to lower loop overhead
> < but rolled enough to fit the loop in the cache.
> 
> After hacking too many DSP things, all I have to say about loop
> unrolling is that it's a good technique to make up for a bad
> architecture.
> 
> What I want in my processor is a zero-overhead-per-loop down-counting
> loop instruction.  The TMS320 series, and Motorola DSP 56000 and 96000
> have had this sort of thing for quite a while.  For those of you who
> happen to be unfamiliar, the idea is that the PC addressing hardware
> has a loop beginning, end, and count register and the hardware does a
> decrement-branch-nonzero when the PC == end-of-loop, resetting it to
> beginning-of-loop, while in the instruction fetch stage.
> 

Yup, that's very familiar; I see it in my Analog Devices 2100s too.
The other nice thing is the modified Harvard architecture.  It
allows two data fetches at one time, provided the instructions
are in the cache and some of the data is stored in the instruction
memory space.  Frankly, I think it would be good for general-purpose
computer architects to get away from the idea that all programs are
geared to the time frame of human beings.  The optimizations necessary
to maintain millisecond or microsecond responses can also be useful in
more general-purpose programs.
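
For anyone who hasn't met this kind of loop hardware, here is a rough C
model of the mechanism Paul describes above.  It is only a sketch: the
struct, the three-instruction "program", and every name in it are made
up for illustration and do not correspond to the registers or mnemonics
of any particular DSP.  It also assumes a one-instruction loop body.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical loop registers kept by the PC addressing hardware. */
struct loop_regs {
    size_t start;   /* address of the first instruction in the loop */
    size_t end;     /* address of the last instruction in the loop  */
    size_t count;   /* iterations still to run                      */
};

/* Next-PC logic, evaluated during instruction fetch.  The decrement
 * and branch happen here, so no loop-control instruction is issued. */
static size_t next_pc(size_t pc, struct loop_regs *lr)
{
    if (pc == lr->end && lr->count > 0) {
        lr->count--;
        if (lr->count > 0)
            return lr->start;   /* go around again */
    }
    return pc + 1;              /* fall through past the loop */
}

/* Illustrative opcodes, not taken from any real instruction set. */
enum op { OP_SETUP, OP_ADD, OP_HALT };

/* Sum n words using a one-instruction loop body.
 * Stores the total in *sum and returns the instructions issued. */
size_t hw_loop_sum(const int *v, size_t n, long *sum)
{
    static const enum op prog[] = { OP_SETUP, OP_ADD, OP_HALT };
    struct loop_regs lr = { 0, 0, 0 };
    size_t pc = 0, issued = 0, i = 0;
    long acc = 0;

    assert(v != NULL && sum != NULL);
    if (n == 0) {               /* this sketch skips the empty loop */
        *sum = 0;
        return 0;
    }

    for (;;) {
        enum op cur = prog[pc];
        issued++;
        if (cur == OP_HALT)
            break;
        if (cur == OP_SETUP) {  /* load the loop registers once */
            lr.start = 1;
            lr.end = 1;
            lr.count = n;
        } else {                /* OP_ADD: the whole loop body */
            acc += v[i++];
        }
        pc = next_pc(pc, &lr);
    }
    *sum = acc;
    return issued;
}
```

In this model, summing four words issues six instructions: one setup,
four adds, and one halt.  A software loop with a separate
decrement-and-branch would issue roughly two per element instead.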

> Note that this could also be done in a bunch of different ways, like
> for instance in a superscalar just doing a DBNZ in parallel with the
> previous instruction.  I don't care much how it actually works, but it
> is quite fun to have things like a vector sum or vector-vector add or
> an inline bcopy() that run at one instruction per operation.
> 
> Now if only the "real" processor in my workstation could do such
> things.
> 
> 		-- Paul Placeway <pplaceway@bbn.com>

Amen to that.

Bob Pegram

pegram@griffin.uvm.edu
	or
...!uvm-gen!pegram