Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rutgers!ames!amdcad!bcase
From: bcase@amdcad.AMD.COM (Brian Case)
Newsgroups: comp.arch
Subject: Re: AM29000 Booleans [numbers; long]
Message-ID: <16588@amdcad.AMD.COM>
Date: Thu, 7-May-87 12:12:10 EDT
Article-I.D.: amdcad.16588
Posted: Thu May 7 12:12:10 1987
Date-Received: Sat, 9-May-87 04:43:24 EDT
References: <1270@aw.sei.cmu.edu> <16560@amdcad.AMD.COM> <369@winchester.UUCP>
Reply-To: bcase@amdcad.UUCP (Brian Case)
Organization: Advanced Micro Devices, Inc., Sunnyvale, Ca.
Lines: 61

In article <369@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>This is clearly correct [i.e., optimize for more frequent case],
>although I suspect compiler writers may moan a little.

Why? The straightforward approach works very well: always generate
compare/branch sequences, then look at the pairs and see if they can be
optimized to eliminate the compare. Believe me on this one, there is no
cause for moaning over adding this optimization to the peepholer.

>However, it does raise an interesting question: was it not possible
>with the 29K pipeline to offer the "other" fast branches, i.e., those that
>do no arithmetic comparison, but that include the following set:
>beqz
>beq (compares 2 regs)
>bne (compares 2 regs)
>bnez
>bgtz
>blez
>bgez*
>bltz*
>
>The *'d ones are the ones equivalent to the 29K's instructions.
>Here is some data: over a set of 12 programs [as, ccom, compress, dhrystone,

[lots of good data, typical of John's postings (thanks again!)]

>b) Not having the other fast branches is about a 9% hit. If the
>cycle time is improved that much by not supporting them [unlikely,
>but possible], then not having them is a win, else, it would have better
>to do the full set, and then the 1-bit can go back in the other end
>of the register.

Well, with our four-stage pipeline, branches must be executed in the
decode stage in order to avoid having double-delayed branches (something
that we *very* much want to avoid since at least some of our customers
will be doing some non-trivial assembly coding. Single-delayed is tough
enough for human beings).

In order to execute a conditional branch in the decode stage, all the
information must be present then: the branch condition and the target
address. The Am29000 has a dedicated branch-offset adder to form target
addresses, and the branch condition is available, at the VERY end of the
cycle, from either the register file or from the ALU (if a compare
instruction is generating the branch condition, then it is in the execute
stage when the branch is in the decode stage, and forwarding will send
the boolean result to the branch target mux select line just in time).

Since our register file does write before read to avoid two levels of
forwarding (and for other reasons, such as overlapping the local register
file sp + offset calculation with writing), there is no time to do a full,
32-bit zero detect to implement the most common branch, the
"if register == zero/notzero" construct. We realized that there would be
significant benefit if it were possible, but there just didn't seem to be
a way to fit it in with all the other features we wanted. Plus, we wanted
to clear the way for future implementation in other technologies. Granted,
we may have made other decisions that will make those implementations
less than perfect, but who's perfect? (Don't answer that. :-)

Does the MIPS processor have double-delayed branches? That would explain
why that processor can get away with these things.

    bcase
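
For anyone wondering what the compare/branch peephole pass mentioned at the top of this post might look like, here is a minimal sketch in Python. It is not the actual Am29000 compiler code. The instruction names (cmp_lt0, br_true, br_signset, and so on) are a made-up toy IR, and the pass assumes the boolean temporary written by the compare is dead after the branch. It only folds the tests against zero that reduce to looking at the sign bit (the starred bgez/bltz cases in the quoted list); an equality test against zero still needs its compare.

```python
from typing import List, NamedTuple, Optional


class Insn(NamedTuple):
    """Toy IR instruction; all opcode names here are hypothetical."""
    op: str                       # e.g. "cmp_lt0", "cmp_eq0", "br_true"
    dst: Optional[str]            # register written (None for branches)
    src: str                      # register read
    target: Optional[str] = None  # branch label (None for compares)


# (compare op, branch op) pairs that collapse into one sign-bit branch.
# "x < 0" is true exactly when the sign bit of x is set.
SIGN_EQUIV = {
    ("cmp_lt0", "br_true"): "br_signset",
    ("cmp_lt0", "br_false"): "br_signclear",
    ("cmp_ge0", "br_true"): "br_signclear",
    ("cmp_ge0", "br_false"): "br_signset",
}


def peephole(code: List[Insn]) -> List[Insn]:
    """Fold compare/branch pairs when the branch can test the sign bit directly.

    Assumption: the compare's destination register is not read again after
    the branch, so dropping the compare is safe.
    """
    out: List[Insn] = []
    i = 0
    while i < len(code):
        cur = code[i]
        nxt = code[i + 1] if i + 1 < len(code) else None
        fused = None
        # The branch must consume exactly the boolean the compare produced.
        if nxt is not None and cur.dst is not None and nxt.src == cur.dst:
            fused = SIGN_EQUIV.get((cur.op, nxt.op))
        if fused is not None:
            # Branch on the original operand; the compare disappears.
            out.append(Insn(fused, None, cur.src, nxt.target))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out


example = [
    Insn("cmp_lt0", "t0", "r1"),
    Insn("br_true", None, "t0", "L1"),
    Insn("cmp_eq0", "t1", "r2"),   # needs a full zero detect, so it stays
    Insn("br_true", None, "t1", "L2"),
]
optimized = peephole(example)
```

In this example the first pair collapses into a single sign-bit branch on r1, while the cmp_eq0/br_true pair is left alone. A real pass would also need liveness information before deleting the compare.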