Path: utzoo!utgpu!water!watmath!clyde!rutgers!rochester!cornell!batcomputer!itsgw!imagine!pawl22.pawl.rpi.edu!jesup
From: jesup@pawl22.pawl.rpi.edu (Randell E. Jesup)
Newsgroups: comp.arch
Subject: Re: conditional branches
Message-ID: <400@imagine.PAWL.RPI.EDU>
Date: 21 Feb 88 04:51:55 GMT
References: <191@telesoft.UUCP> <1556@gumby.mips.COM> <375@imagine.PAWL.RPI.EDU> <1610@gumby.mips.COM>
Sender: news@imagine.PAWL.RPI.EDU
Reply-To: beowulf!lunge!jesup@steinmetz.UUCP
Organization: RPI Public Access Workstation Lab - Troy, NY
Lines: 53

In article <1610@gumby.mips.COM> earl@mips.COM (Earl Killian) writes:
>In article <375@imagine.PAWL.RPI.EDU> jesup@pawl1.pawl.rpi.edu (Randell E. Jesup) writes:
>
>   Think about compare & branch from a hardware point of view.  To do
...
>   for this computation and run it in parallel, you MIGHT be able to
>   pull it off, though I doubt it.  It would cost LOTS of chip area,
>   and would probably be your critical path that determines your cycle
>   time (certainly it would be if you didn't have a parallel adder!)

>Hardware makes things go faster.  That's why RISC machines tend to
>have more hardware in them than CISCs (they find room for the extra
>hardware by tossing out the firmware, for a net savings).  It is
>perfectly reasonable to dedicate an adder to computing
>PC+branchdisplacement on every instruction (not just branch
>instructions), and selecting between that and PC+1 based on the branch
>decision.  Perfectly reasonable because that one adder just added 10%
>to your performance.

	You may well be right.  It's always worth checking into when doing
a design, because it IS a potential win, if you can pull it off.  However,
when working at the cutting edge, a large adder is a BIG amount of chip
space to use.  A full ALU (which is just an adder and a shifter) can take
20+% of the entire chip.  The bigger the chip, the lower the yield, and
the longer the delays on intra-chip wiring runs.
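
To put rough numbers on the yield point, here is a minimal sketch using the
simple Poisson yield model, Y = exp(-A * D).  The model is a textbook
approximation, not anything specific to a real process, and the die area,
defect density, and adder size below are made-up figures chosen only to show
the direction of the effect.

```python
import math


def poisson_yield(area_cm2, defects_per_cm2):
    """Fraction of good dies under the simple Poisson model Y = exp(-A * D)."""
    return math.exp(-area_cm2 * defects_per_cm2)


# Hypothetical numbers, for illustration only (not from any real process).
DEFECT_DENSITY = 2.0    # defects per cm^2 (assumed)
BASE_AREA = 0.5         # die area in cm^2 without the extra adder (assumed)
ADDER_FRACTION = 0.10   # extra adder area as a fraction of the base die (assumed)

base = poisson_yield(BASE_AREA, DEFECT_DENSITY)
bigger = poisson_yield(BASE_AREA * (1 + ADDER_FRACTION), DEFECT_DENSITY)

print(f"yield without extra adder: {base:.3f}")
print(f"yield with extra adder:    {bigger:.3f}")
```

Under this model any added area lowers yield, so the performance gain from
the extra adder has to be weighed against fewer good dies per wafer.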
	There are also timing considerations.  If your cycle time is set by
the delay through a 32-bit adder, you may have trouble getting the result of
the comparison out in time.  Also, in a pipelined machine the choice is
between PC+N+1 and PC+N+disp, but that's minor.  Another thing is that
running those bits and the PC off to the adder might slow down one of the
decode stages.  Lastly, you have to find the bits in the instruction to
specify both registers and the displacement, and this extra format may also
slow down decoding.
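
The encoding squeeze is simple arithmetic.  As a rough sketch, assume a
32-bit instruction word, a 6-bit opcode, and 5-bit register specifiers (32
registers).  These field widths are assumptions picked for illustration, not
a description of any particular chip.

```python
WORD_BITS = 32     # instruction width (assumed)
OPCODE_BITS = 6    # opcode field width (assumed)
REG_BITS = 5       # 32 registers -> 5 bits per register specifier (assumed)


def displacement_bits(num_regs_named):
    """Bits left for the branch displacement after opcode and register fields."""
    return WORD_BITS - OPCODE_BITS - num_regs_named * REG_BITS


def signed_range(bits):
    """Smallest and largest values of a two's-complement field of this width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


for regs in (0, 2):
    bits = displacement_bits(regs)
    lo, hi = signed_range(bits)
    print(f"{regs} registers named: {bits}-bit displacement, range {lo}..{hi}")
```

With these widths, naming two registers in the branch costs 10 bits of
displacement compared with a branch that names none.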
	The point here isn't that it's impossible to get a win with conditional
branches, but that there are a LOT of side-effects that have to be dealt
with in doing it.
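
For reference, the scheme being argued about (a dedicated adder computing
the branch target on every instruction, with the branch decision selecting
between it and the fall-through address) can be sketched behaviorally as
below.  This is only a software model under assumptions: the PC counts
instruction words, the comparison is "branch if equal", and n stands for
however far the fetch PC has run ahead of the branch in the pipeline.  The
function and parameter names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # model a 32-bit program counter


def next_pc(pc, disp, is_branch, rs, rt, n=0):
    """Pick the next fetch address for a compare-and-branch instruction.

    Both candidate addresses are computed unconditionally, as two parallel
    adders would in hardware; the compare result only drives the final select.
    """
    fall_through = (pc + n + 1) & MASK32   # incrementer path
    target = (pc + n + disp) & MASK32      # dedicated branch adder path
    taken = is_branch and (rs == rt)       # register compare (assumed: equality)
    return target if taken else fall_through


print(next_pc(100, 8, True, 5, 5))    # registers equal: branch taken
print(next_pc(100, 8, True, 5, 6))    # registers differ: fall through
print(next_pc(100, 8, False, 5, 5))   # not a branch: fall through
```

In the model the select is free.  In hardware, the compare, both adds, and
the mux all have to settle within the same cycle, which is where the timing
concerns above come from.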

>Branch decisions can have practically the same timing constraints as
>load/store instructions in a simple pipeline; if you can do the
>address add for the load/stores, then you can do the branch decision.

"simple pipeline"?  Many chips use the ALU stage of loads/stores for the
address computation.  Once again, you can throw an address-calculator in here
to speed them up by a cycle or so, but here it might impact your register
file design, for indexed load/stores.

The moral:  Chip design is a wondrously complicated place, full of hidden
connections and fragilities.  Very few ideas are easy to implement.

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup