Path: utzoo!utgpu!water!watmath!clyde!rutgers!rochester!cornell!batcomputer!itsgw!imagine!pawl22.pawl.rpi.edu!jesup
From: jesup@pawl22.pawl.rpi.edu (Randell E. Jesup)
Newsgroups: comp.arch
Subject: Re: conditional branches
Message-ID: <400@imagine.PAWL.RPI.EDU>
Date: 21 Feb 88 04:51:55 GMT
References: <191@telesoft.UUCP> <1556@gumby.mips.COM> <375@imagine.PAWL.RPI.EDU> <1610@gumby.mips.COM>
Sender: news@imagine.PAWL.RPI.EDU
Reply-To: beowulf!lunge!jesup@steinmetz.UUCP
Organization: RPI Public Access Workstation Lab - Troy, NY
Lines: 53

In article <1610@gumby.mips.COM> earl@mips.COM (Earl Killian) writes:
>In article <375@imagine.PAWL.RPI.EDU> jesup@pawl1.pawl.rpi.edu (Randell E. Jesup) writes:
>
> Think about compare & branch from a hardware point of view. To do ...
> for this computation and run it in parallel, you MIGHT be able to
> pull it off, though I doubt it. It would cost LOTS of chip area,
> and would probably be your critical path that determines your cycle
> time (certainly it would be if you didn't have a parallel adder!)

>Hardware makes things go faster. That's why RISC machines tend to
>have more hardware in them than CISCs (they find room for the extra
>hardware by tossing out the firmware, for a net savings). It is
>perfectly reasonable to dedicate an adder to computing
>PC+branchdisplacement on every instruction (not just branch
>instructions), and selecting between that and PC+1 based on the branch
>decision. Perfectly reasonable because that one adder just added 10%
>to your performance.

You may well be right. It's always worth checking into when doing a
design, because it IS a potential win, if you can pull it off. However,
when working at the cutting edge, a large adder is a BIG amount of chip
space to use. A full ALU (which is just an adder and a shifter) can take
20+% of the entire chip. The bigger the chip, the lower the yield, and
the more delays in intra-chip runs.

There also are timing considerations.
If you have your cycle timed out to be the same as the time through a
32-bit adder, you may have trouble getting the result of the comparison
out in time. Also, the calculation in a pipelined machine is between
PC+N+1 and PC+N+disp, but that's minor. Another thing is that running
those bits and the PC off to the adder might slow down one of the decode
stages. Lastly, you have to find the bits in the instruction to specify
both registers and the displacement, and this extra format may also slow
down decoding.

The point here isn't that it's impossible to get a win with conditional
branches, but that there are a LOT of side-effects that have to be dealt
with in doing it.

>Branch decisions can have practically the same timing constraints as
>load/store instructions in a simple pipeline; if you can do the
>address add for the load/stores, then you can do the branch decision.

"Simple pipeline"? Many chips use the ALU stage of loads/stores for the
address computation. Once again, you can throw an address-calculator in
here to speed them up by a cycle or so, but here it might impact your
register file design, for indexed load/stores.

The moral: Chip design is a wondrously complicated place, full of hidden
connections and fragilities. Very few ideas are easy to implement.

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup
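
For readers following the thread, the next-PC selection argued over above
(a dedicated adder producing PC+N+disp on every instruction, with a mux
choosing between it and PC+N+1) can be modelled in a few lines. This is
only a behavioural sketch, not a description of any particular chip: the
function name, the word-addressed PC, and the 32-bit wraparound are all
assumptions made for illustration, and it says nothing about the area or
cycle-time costs that are the real subject of the post.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit program counter (illustrative)


def next_pc(pc: int, disp: int, taken: bool, n: int = 0) -> int:
    """Behavioural model of next-PC selection (hypothetical names).

    pc    -- address of the branch, in instruction words
    disp  -- signed branch displacement, in instruction words
    taken -- outcome of the branch comparison
    n     -- instructions already fetched past the branch when it
             resolves (0 for an unpipelined model)
    """
    # Sequential path: a simple incrementer.
    sequential = (pc + n + 1) & MASK32
    # Branch-target path: the dedicated adder, computed every cycle
    # whether or not the instruction turns out to be a branch.
    target = (pc + n + disp) & MASK32
    # The branch decision only drives the final 2-to-1 selection.
    return target if taken else sequential
```

In hardware both sums would be formed in parallel and the comparison
result would arrive last; the timing concern raised in the post is whether
that comparison can reach the select input of the mux within the cycle.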