Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rutgers!ames!amdcad!bcase
From: bcase@amdcad.AMD.COM (Brian Case)
Newsgroups: comp.arch
Subject: Re: AM29000 Booleans [numbers; long]
Message-ID: <16588@amdcad.AMD.COM>
Date: Thu, 7-May-87 12:12:10 EDT
Article-I.D.: amdcad.16588
Posted: Thu May  7 12:12:10 1987
Date-Received: Sat, 9-May-87 04:43:24 EDT
References: <1270@aw.sei.cmu.edu> <16560@amdcad.AMD.COM> <369@winchester.UUCP>
Reply-To: bcase@amdcad.UUCP (Brian Case)
Organization: Advanced Micro Devices, Inc., Sunnyvale, Ca.
Lines: 61

In article <369@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>This is clearly correct [i.e., optimize for more frequent case],
>although I suspect compiler writers may moan a little.

Why?  The straightforward approach works very well:  always generate
compare/branch sequences, then look at the pairs and see if they can
be optimized to eliminate the compare.  Believe me on this one, there
is no cause for moaning over adding this optimization to the peepholer.
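
To make the idea concrete, here is a minimal sketch of such a peephole
pass in Python.  It is an illustration under stated assumptions, not the
actual compiler code: the tuple encoding of instructions and the names
FOLD, reads, and peephole are made up for this example, and the mnemonics
are only loosely modeled on the Am29000 compare and jump-on-boolean
instructions.  It assumes a boolean is TRUE when the top bit of the
register is set, so a signed compare against zero followed by a branch on
its result can be replaced by a branch on the original register.  The
liveness check is deliberately crude and conservative.

```python
# Hypothetical instruction encoding (tuples), for illustration only:
#   ("cplt", dst, src, imm)  dst = TRUE if src <  imm (signed)
#   ("cpge", dst, src, imm)  dst = TRUE if src >= imm (signed)
#   ("jmpt", reg, label)     branch if the top bit of reg is set
#   ("jmpf", reg, label)     branch if the top bit of reg is clear

# (compare, branch) pairs that fold into a single branch on the sign bit
# of the compared register, provided the compare is against zero.
FOLD = {
    ("cplt", "jmpt"): "jmpt",  # branch if x < 0   -> sign bit set
    ("cplt", "jmpf"): "jmpf",  # branch if x >= 0  -> sign bit clear
    ("cpge", "jmpt"): "jmpf",  # branch if x >= 0  -> sign bit clear
    ("cpge", "jmpf"): "jmpt",  # branch if x < 0   -> sign bit set
}


def reads(ins, reg):
    """True if the instruction uses reg as a source operand."""
    if ins[0] in ("jmpt", "jmpf"):
        return ins[1] == reg
    return reg in ins[2:]


def peephole(code):
    """Drop compare-with-zero instructions whose only use is the next branch."""
    out = []
    i = 0
    while i < len(code):
        ins = code[i]
        nxt = code[i + 1] if i + 1 < len(code) else None
        foldable = (
            nxt is not None
            and (ins[0], nxt[0]) in FOLD
            and ins[3] == 0              # compare against literal zero
            and nxt[1] == ins[1]         # branch tests the compare result
            # Conservative: the temporary must not be read anywhere else.
            and not any(reads(other, ins[1])
                        for j, other in enumerate(code)
                        if j not in (i, i + 1))
        )
        if foldable:
            out.append((FOLD[(ins[0], nxt[0])], ins[2], nxt[2]))
            i += 2
        else:
            out.append(ins)
            i += 1
    return out
```

A compare against anything other than a literal zero, or one whose result
is read elsewhere, is left alone, so the pass only ever removes work.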

>However, it does raise an interesting question:  was it not possible
>with the 29K pipeline to offer the "other" fast branches, i.e., those that
>do no arithmetic comparison, but that include the following set:
>beqz
>beq	(compares 2 regs)
>bne	(compares 2 regs)
>bnez
>bgtz
>blez
>bgez*
>bltz*
>
>The *'d ones are the ones equivalent to the 29K's instructions.
>Here is some data: over a set of 12 programs [as, ccom, compress, dhrystone,

[lots of good data, typical of John's postings (thanks again!)]

>b) Not having the other fast branches is about a 9% hit.  If the
>cycle time is improved that much by not supporting them [unlikely,
>but possible], then not having them is a win, else, it would have been better
>to do the full set, and then the 1-bit can go back in the other end
>of the register.

Well, with our four-stage pipeline, branches must be executed in the
decode stage in order to avoid having double-delayed branches (something
that we *very* much want to avoid since at least some of our customers
will be doing some non-trivial assembly coding.  Single-delayed is
tough enough for human beings).  In order to execute a conditional
branch in the decode stage, all the information must be present then:
the branch condition and the target address.  The Am29000 has a
dedicated branch-offset adder to form target addresses, and the branch
condition is available, at the VERY end of the cycle, from either the
register file or from the ALU (if a compare instruction is generating
the branch condition, then it is in the execute stage when the branch
is in the decode stage, and forwarding will send the boolean result
to the branch target mux select line just in time).  Since our register
file does write before read to avoid two levels of forwarding (and
for other reasons, such as overlapping the local register file sp+
offset calculation with writing), there is no time to do a full,
32-bit zero detect to implement the most common construct, branch if
register == zero/notzero.  We realized that there would be significant
benefit if it were possible, but there just didn't seem to be a way
to fit it in with all the other features we wanted.  Plus, we wanted
to clear the way for future implementation in other technologies.
Granted, we may have made other decisions that will make those
implementations less than perfect, but who's perfect?  (Don't answer
that. :-)
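
As a rough illustration of the timing argument above, the Python sketch
below contrasts a sign-bit test, which looks at one wire, with a 32-bit
zero detect built as an OR-reduction tree.  This is a toy model, not the
Am29000 circuit: the gate fan-in values and the function names zero_detect
and sign_test are assumptions made for the example, and "levels" counts
gate stages only, ignoring wire delay and everything else that matters in
real silicon.

```python
def zero_detect(value, width=32, fan_in=2):
    """Return (is_zero, levels) using an OR tree of fan_in-input gates."""
    bits = [(value >> i) & 1 for i in range(width)]
    levels = 0
    while len(bits) > 1:
        # One gate stage: OR together each group of fan_in signals.
        bits = [int(any(bits[i:i + fan_in]))
                for i in range(0, len(bits), fan_in)]
        levels += 1
    return bits[0] == 0, levels


def sign_test(value, width=32):
    """Return (is_negative, levels); reading the top bit needs no gates."""
    return ((value >> (width - 1)) & 1) == 1, 0
```

In this model, testing the top bit costs zero extra gate stages, while the
zero detect costs five stages with 2-input gates or three with 4-input
gates, all of which would have to fit after the register read.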

Does the MIPS processor have double-delayed branches?  That would
explain why that processor can get away with these things.
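
For readers who have not hand-scheduled delay slots, here is a small
Python sketch of what single- versus double-delayed branches mean for the
order in which instructions execute.  It is a toy model with made-up names
(run, the "op"/"jmp" tuples) and it assumes unconditional branches and no
branch inside a delay slot; it does not describe any particular processor.

```python
def run(program, delay_slots):
    """Return the names of the ops executed, in order.

    program is a list of ("op", name) or ("jmp", target_index) tuples.
    A branch takes effect only after delay_slots further instructions
    have executed.  Assumes no branch sits in another branch's delay slot.
    """
    trace = []
    pc = 0
    pending = None  # (slots_remaining, target) for a branch in flight
    while pc < len(program):
        ins = program[pc]
        next_pc = pc + 1
        if pending is not None:
            remaining, target = pending
            remaining -= 1
            if remaining == 0:
                next_pc = target      # last delay slot: redirect after it
                pending = None
            else:
                pending = (remaining, target)
        if ins[0] == "jmp":
            if delay_slots == 0:
                next_pc = ins[1]
            else:
                pending = (delay_slots, ins[1])
        else:
            trace.append(ins[1])
        pc = next_pc
    return trace
```

With a program of a, jmp, b, c, d, e where the jump targets e, one delay
slot executes b before the target and two delay slots execute both b and
c, which is the extra bookkeeping an assembly programmer would have to
carry around.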

    bcase