Path: utzoo!utgpu!utstat!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!bloom-beacon!apple!baum
From: baum@Apple.COM (Allen J. Baum)
Newsgroups: comp.arch
Subject: Re: How to use silicon (was Re: Intel/MIPS Dhrystone ratio)
Message-ID: <27600@apple.Apple.COM>
Date: 20 Mar 89 18:26:51 GMT
References: <37196@bbn.COM> <1989Mar16.190043.23227@utzoo.uucp> <24889@amdcad.AMD.COM> <355@bnr-fos.UUCP>
Reply-To: baum@apple.UUCP (Allen Baum)
Organization: Apple Computer, Inc.
Lines: 59

[]
In article <24889@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <1989Mar16.190043.23227@utzoo.uucp> (Henry Spencer) writes:
>| In article <37196@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>| >I predict that the next hardware features to come back will be
>| >auto-increment addressing and the hardware handling of unaligned data.
>| 
>| Again, why?  Auto-increment addressing is useful only if instructions
>| are expensive, because it sneaks two instructions into one.  However,
>| the trend today is just the opposite:  the CPUs are outrunning the
>| main memory.  Since instructions can be cached fairly effectively,
>| they are getting cheaper and data is getting more expensive.  Doing
>| the increment by hand often costs you almost nothing, because it can
>| be hidden in the delay slot(s) of the memory access.  Autoincrement
>| showed up best in tight loops, exactly where effective caching can be
>| expected to largely eliminate memory accesses for instructions.  Why
>| bother with autoincrement?
>
>Also, auto-incrementing addressing modes imply:
>
>	- Another adder (to increment the address register in parallel)
>
>	- Another writeback port to the register file
>
>Unless you wish to sequence the instruction over multiple cycles :-(
>
>I'm certain that most people can find something better to do with these
>resources than auto-increment.

Well, I'll have to slightly disagree here. Auto-increment does not cost another
adder (for my particular definition of auto-increment); it just writes the
result of the effective address calculation back to the base register. If you 
want to be tricky, you can use a multiplexor to select the memory address to
be the base register itself, or the effective address calculation, giving you
pre- or post-auto-increment. It does cost an extra writeback port. This
can be finessed, perhaps, by waiting for a cycle not using the writeback port,
but you can't count on it.
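
   As an illustration only, the datapath described above can be sketched as
a small C simulation. The toy machine (8 registers, 64 words of memory) and
the names load_update, PRE_UPDATE and POST_UPDATE are hypothetical and
describe no real instruction set. The sketch assumes the writeback port is
always free, which, as noted, you can't count on.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy machine; the sizes are arbitrary assumptions. */
enum { NUM_REGS = 8, MEM_WORDS = 64 };
enum update_mode { PRE_UPDATE, POST_UPDATE };

static int32_t reg[NUM_REGS];
static int32_t mem[MEM_WORDS];

/* Load with base-register update: rd <- mem[addr], rb <- rb + disp.
 * One adder forms the effective address. A multiplexor decides whether
 * memory sees the effective address (pre) or the old base (post).
 * The effective address goes back to rb through the extra writeback port. */
static int32_t load_update(int rd, int rb, int32_t disp, enum update_mode mode)
{
    int32_t base = reg[rb];
    int32_t ea = base + disp;                          /* the only adder */
    int32_t addr = (mode == PRE_UPDATE) ? ea : base;   /* the multiplexor */

    assert(addr >= 0 && addr < MEM_WORDS);
    int32_t value = mem[addr];

    reg[rb] = ea;      /* extra writeback port: base register update */
    reg[rd] = value;   /* ordinary load writeback; wins if rd == rb,
                          an arbitrary choice made for this sketch */
    return value;
}
```

   With reg[1] = 4 and disp = 1, a POST_UPDATE load reads mem[4] and leaves
reg[1] = 5; a following PRE_UPDATE load reads mem[6] and leaves reg[1] = 6.
In both modes the same single add supplies the value written to the base.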
   Now, the question is, can loops profitably use this kind of addressing mode?
Or, should you just schedule the address updates in branch and load shadows
because you can't find anything else to put there?
   Note that if you have a superscalar architecture, and can do two
instructions in parallel (see the Intel 80960CA paper in the newest Compcon
proceedings), you can do this kind of thing as a matter of course; but it's a
lot more expensive to do it that way: you do need a separate read port and
adder, as well as a write port.
   If, in fact, compilers can generate this code (and I believe they
can), and it can be scheduled (i.e. there aren't lots of dead cycles hanging
around just waiting to be filled with these address update instructions),
then it looks like a reasonable tradeoff. It's probably time to dust off those
benchmarks and see how often it occurs, and how many cycles it will save.
   Since this kind of operation is used almost exclusively inside a loop,
it has quite a bit of leverage. Yes, instruction caching is most effective
there, but that just means the separate update instruction won't cost you
fetch cycles above and beyond its own execution cycle, not that folding it
into the load won't save you any cycles.
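
   A back-of-the-envelope version of that leverage argument, as a sketch
only. The function cycles_saved and its parameters are hypothetical. It
assumes each separate update instruction takes one cycle, that an update
placed in an otherwise dead slot (load or branch shadow) is free, and that
the auto-update form never stalls waiting for the writeback port.

```c
/* Hypothetical cost model, not a measurement.
 * updates_per_iter:    address updates needed in each loop iteration
 * free_slots_per_iter: dead cycles per iteration that a separate update
 *                      instruction could fill at no cost
 * Returns the cycles saved over the whole loop by folding the updates
 * into the memory instructions. */
static long cycles_saved(long iterations, int updates_per_iter,
                         int free_slots_per_iter)
{
    int unhidden = updates_per_iter - free_slots_per_iter;
    if (unhidden < 0)
        unhidden = 0;          /* every update already hides in a shadow */
    return iterations * (long)unhidden;
}
```

   Under these assumptions a loop with two address updates and one dead slot
per iteration saves one cycle per iteration; a loop with more dead slots than
updates saves nothing, which is exactly what the benchmarks would have to
settle.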
   Besides, who says you can't find something else to do with the extra
write port when you're not doing address updates?
--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum