Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!uunet!samsung!zaphod.mps.ohio-state.edu!rpi!leah!albanycs!crdgw1!crdos1!davidsen
From: davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr)
Newsgroups: comp.arch
Subject: Re: '040 vs. SPARC (was: Next computer...)
Message-ID: <2114@crdos1.crd.ge.COM>
Date: 9 Feb 90 18:30:37 GMT
References: <8905@portia.Stanford.EDU> <160@zds-ux.UUCP> <38415@apple.Apple.COM> <2101@crdos1.crd.ge.COM> <19233@dartvax.Dartmouth.EDU> <2105@crdos1.crd.ge.COM> <29099@amdcad.AMD.COM>
Reply-To: davidsen@crdos1.crd.ge.com (bill davidsen)
Organization: GE Corp R&D Center, Schenectady NY
Lines: 44

In article <29099@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:

| But the complex instruction typically binds many operations together,
| *reducing* the ability to efficiently overlap subsequent operations.
| However, if the complex instruction is split into its constituent
| parts, there is much more opportunity for instruction scheduling.

  Performance depends on how it's done. If the CPU can't do anything
else when it starts a complex instruction, then the gains from possible
internal overlap of phases will have to outweigh the blocking of the
CPU. If the CPU can continue to execute at least some other
instructions, then a smart compiler can probably find instructions to
overlap with it.

  This isn't black and white, where all complex instructions are a lose
and all simple ones are a win. Splitting one operation into several
instructions means more instruction fetches, and that costs memory
bandwidth, too.

| Either this will take an extra cycle to write back the incremented
| address register (in which case an explicit add is just as fast), or
| an extra register file port just to write the incremented address at
| the same time the load data is written.  If more register file ports
| are going to be added, I'd rather issue multiple, general-purpose
| instructions, which have a much greater chance of being used than a
| limited auto-increment mode.

  What I said about memory bandwidth applies here, but even more to the
point, a load or store through a pointer (address register) usually has
at least one cycle overhead after the address is used, even with cache.
This can be used to do the increment without slowing anything down, and
without running another instruction decode. The primary issue, I
believe, is whether the added complexity slows down the instruction
decode. Given the number of gates available I believe the answer is
"usually not."

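  To make the comparison concrete, here is a rough C sketch of the two
forms being argued about. The function names are made up for
illustration, and whether a given compiler actually emits an
auto-increment load for the first form depends on the target; this only
shows the source-level shapes, not any particular machine's code.

```c
#include <stddef.h>

/* Post-increment form. On a machine with an auto-increment addressing
   mode, a compiler may fold the load and the pointer bump into a
   single instruction (an assumption about the target, not a promise). */
long sum_autoinc(const int *p, size_t n)
{
    long total = 0;

    while (n-- > 0)
        total += *p++;          /* load, then bump the address */
    return total;
}

/* Explicit form. The load and the address add are written as separate
   steps, the way they would be issued without auto-increment. */
long sum_explicit(const int *p, size_t n)
{
    long total = 0;

    while (n-- > 0) {
        int v = *p;             /* load through the pointer */
        p = p + 1;              /* separate add to the address register */
        total += v;
    }
    return total;
}
```

  Both compute the same result; the argument is only over how many
instructions get fetched and decoded per element.
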
  There are people who argue against having increment, stating that it's
not general purpose and that the increment should be two discrete
instructions, namely (1) load the immediate value 1 into a second
register, and (2) add that register to the register to be incremented.
I don't agree with
this, either, but I can see that it is the ultimate extension of the
RISC method.
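
  As a sketch of what that trade looks like, here is a toy register
machine in C that counts instructions issued. Everything in it (the
struct cpu layout, the op_* helpers, the three-register file) is
hypothetical and stands in for no real instruction set.

```c
/* Hypothetical three-register machine, for illustration only. */
enum { R0, R1, R2, NREGS };

struct cpu {
    long reg[NREGS];
    unsigned long issued;       /* instructions issued so far */
};

static void op_inc(struct cpu *c, int rd)
{
    c->reg[rd] += 1;
    c->issued++;
}

static void op_loadi(struct cpu *c, int rd, long imm)
{
    c->reg[rd] = imm;
    c->issued++;
}

static void op_add(struct cpu *c, int rd, int rs)
{
    c->reg[rd] += c->reg[rs];
    c->issued++;
}

/* Dedicated increment: one instruction, no scratch register. */
void increment_single(struct cpu *c, int rd)
{
    op_inc(c, rd);
}

/* Split increment: load immediate 1 into tmp, then add tmp to rd.
   Two instructions, and tmp is overwritten along the way. */
void increment_split(struct cpu *c, int rd, int tmp)
{
    op_loadi(c, tmp, 1);
    op_add(c, rd, tmp);
}
```

  The split version costs a second instruction fetch and ties up a
second register, which is the price of doing without the special case.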

-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me