Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!mips!zalman
From: zalman@mips.com (Zalman Stern)
Newsgroups: comp.arch
Subject: Re: Bitfield and loop instructions--a good idea?
Message-ID: <2302@spim.mips.COM>
Date: 16 Apr 91 01:36:59 GMT
References: <1991Apr15.193425.3436@waikato.ac.nz>
Sender: news@mips.COM
Organization: MIPS Computer Systems, Sunnyvale, California
Lines: 107
Nntp-Posting-Host: dish.mips.com

Check out the IBM RS/6000 (POWER architecture for He-Man fans). It has
interesting mask and loop counter instructions.

The mask instructions come in many flavors. The basic idea is to construct
a mask from two 5 bit quantities called Mask Begin (MB) and Mask End (ME).
The following is from the IBM manual (*) page 5-205 (text in square
brackets is mine):

    If the MB value is less than the ME value + 1, then the mask bits between
    and including the starting point [bit numbered MB] and the end point [bit
    numbered ME] are set to ones. All other bits are set to zeros.

    If the MB value is the same as the ME value + 1, then all 32 mask bits are
    set to ones.

    If the MB value is greater than the ME value + 1, then the mask bits
    between and including the ME value + 1 and the MB value - 1 are set to
    zeros. All other bits are set to ones.

    [I'm pretty sure bit 0 is most significant.]

(*) IBM RISC System/6000 Assembler Language Reference (SC23-2197-00)
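
For the curious, here is a rough C sketch of the mask rule quoted above. It
is my own illustration, not taken from the IBM manual: the name power_mask
is made up, and it assumes bit 0 is the most significant bit.

```c
#include <assert.h>
#include <stdint.h>

/* Build the 32-bit mask from MB and ME (each 0..31).
   Assumption: IBM bit numbering, bit 0 is the most significant bit. */
uint32_t power_mask(unsigned mb, unsigned me)
{
    const uint32_t all = 0xFFFFFFFFu;
    uint32_t from_mb, upto_me;

    assert(mb < 32 && me < 32);
    from_mb = all >> mb;                     /* ones in bits MB..31 */
    upto_me = (uint32_t)(all << (31 - me));  /* ones in bits 0..ME  */

    if (mb <= me)                 /* MB < ME + 1: ones in MB..ME only */
        return from_mb & upto_me;
    /* MB == ME + 1 gives all ones; MB > ME + 1 leaves zeros
       in bits ME+1 .. MB-1 and ones everywhere else. */
    return from_mb | upto_me;
}
```

For example, power_mask(8, 15) gives 0x00FF0000, power_mask(16, 15) gives
all ones, and power_mask(24, 7) gives 0xFF0000FF.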

Here is a brief listing of instructions that use the mask mechanism:

    maskg	Given MB and ME in registers, construct a mask and put it
		in a third register.
    maskir	Given an already constructed mask in a register, insert a
		source register into a destination register under the mask.
		That is, everywhere there is a one in the mask, copy the bit
		from the source register to the destination register. Where
		there are zeros in the mask, leave the bits alone. Note
		that the mask can be more general than the above
		description of masks.
    rlimi	Rotate left immediate then mask insert. Takes rotation
		amount, MB, and ME as 5 bit immediates. Rotates source
		register and then does a mask insert into destination
		register. (See above description of insertion.)
    rlinm	Rotate left immediate then AND with mask. Takes rotation
		amount, MB, and ME as 5 bit immediates. Rotates source
		register then ands result with mask. Places result of and in
		destination register.
    rlmi	Same as rlimi except rotate amount is in a register.
    rlnm	Same as rlinm except rotate amount is in a register.
    rrib	Takes bit 0 of source register, rotates it by an amount in
		a register and inserts it into the destination register.
		(I'm not sure what this is used for.)
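
As a sketch of what the listing above describes, here are C emulations of
maskir, rlinm, and rlimi. The function names (emu_*, mask32, rotl32) are
made up for illustration, bit 0 is assumed to be the most significant bit,
and this models only the result value, not condition codes or anything
else the real instructions may do.

```c
#include <assert.h>
#include <stdint.h>

/* Mask from MB and ME (0..31), bit 0 = MSB (assumed). */
uint32_t mask32(unsigned mb, unsigned me)
{
    const uint32_t all = 0xFFFFFFFFu;
    uint32_t from_mb = all >> mb;
    uint32_t upto_me = (uint32_t)(all << (31 - me));

    assert(mb < 32 && me < 32);
    return (mb <= me) ? (from_mb & upto_me) : (from_mb | upto_me);
}

/* Rotate left by n (taken mod 32), avoiding a shift by 32. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (uint32_t)((x << n) | (x >> (32 - n))) : x;
}

/* maskir: take src bits where mask is one, keep dest bits elsewhere. */
uint32_t emu_maskir(uint32_t dest, uint32_t src, uint32_t mask)
{
    return (src & mask) | (dest & ~mask);
}

/* rlinm: rotate left, then AND with the generated mask. */
uint32_t emu_rlinm(uint32_t src, unsigned sh, unsigned mb, unsigned me)
{
    return rotl32(src, sh) & mask32(mb, me);
}

/* rlimi: rotate left, then insert into dest under the generated mask. */
uint32_t emu_rlimi(uint32_t dest, uint32_t src,
                   unsigned sh, unsigned mb, unsigned me)
{
    return emu_maskir(dest, rotl32(src, sh), mask32(mb, me));
}
```

So emu_rlinm(0x12345678, 8, 24, 31) extracts the top byte as 0x12, and
emu_rlimi(0xAAAAAAAA, 0xFF, 8, 16, 23) drops a byte into the middle of the
destination, giving 0xAAAAFFAA.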

Rlinm does left and right logical immediate shifts. (I think it corresponds
to what most shifter hardware looks like internally as well.) I hacked some
assembly for this machine and was amazed at how often rlimi came in handy.
(This instruction is so cool I'd almost advocate adding it to our
instruction set. But Earl would probably hit me over the head with an R4000
architecture manual and make me write "Thou shalt pay attention to dynamic
instruction frequency statistics!" 100 times on my whiteboard.)
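
To make the shift claim concrete, here is a C sketch of how logical shifts
by a constant fall out of the rotate-then-AND-with-mask form (rlinm in the
listing above). The function names are hypothetical, and the MB/ME choices
are the usual ones as I understand them, with bit 0 as the most significant
bit.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left by sh, then AND with ones in bits mb..me
   (bit 0 = MSB). Only mb <= me is handled here, which is
   all that the two shifts below need. */
uint32_t rot_and_mask(uint32_t x, unsigned sh, unsigned mb, unsigned me)
{
    const uint32_t all = 0xFFFFFFFFu;
    uint32_t rot, mask;

    assert(sh < 32 && mb <= me && me < 32);
    rot = sh ? (uint32_t)((x << sh) | (x >> (32 - sh))) : x;
    mask = (all >> mb) & (uint32_t)(all << (31 - me));
    return rot & mask;
}

/* Shift left by n: rotate left n, keep bits 0 .. 31-n. */
uint32_t shl_imm(uint32_t x, unsigned n)
{
    assert(n < 32);
    return rot_and_mask(x, n, 0, 31 - n);
}

/* Shift right by n: rotate left 32-n, keep bits n .. 31. */
uint32_t shr_imm(uint32_t x, unsigned n)
{
    assert(n < 32);
    return rot_and_mask(x, (32 - n) & 31, n, 31);
}
```

The mask simply throws away the bits that wrapped around during the rotate,
which is why one instruction covers both directions.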

The POWER architecture also provides a vast set of shift instructions.
These can be used to do double precision shifts and multiple precision
shifts with great efficiency. Add in the count leading zeros and you have a
fairly complete set of bit flicking instructions. (Herman Rubin might even
like this machine. Then again he'd probably bitch about it being worthless
due to the difficulty of using the FP hardware on integer data :-))
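
For readers who have not met these operations, here is a plain C sketch of
what a double precision left shift and a count leading zeros compute. It
is not the POWER instruction sequence, just the arithmetic being sped up;
the names dshl_hi, dshl_lo, and clz32 are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* High word of the 64-bit value (hi:lo) shifted left by n, 0 <= n < 64. */
uint32_t dshl_hi(uint32_t hi, uint32_t lo, unsigned n)
{
    assert(n < 64);
    if (n == 0)
        return hi;
    if (n < 32)   /* bits shifted out of lo flow into hi */
        return (uint32_t)((hi << n) | (lo >> (32 - n)));
    return (uint32_t)(lo << (n - 32));
}

/* Low word of the same shift. */
uint32_t dshl_lo(uint32_t lo, unsigned n)
{
    assert(n < 64);
    return (n < 32) ? (uint32_t)(lo << n) : 0u;
}

/* Number of zero bits above the highest set bit; 32 for zero input. */
unsigned clz32(uint32_t x)
{
    unsigned n = 0;

    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}
```

Without hardware help each of these is several instructions plus branches
or a loop, which is where the dedicated instructions would pay off.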

As to loop type instructions, the RIOS has a count register in the branch
unit (a separate chip in the current implementation). There is an
instruction that will decrement the count register and conditionally branch
based on the result of the decrement and possibly a condition register
field. This lends itself to loops which are testing a bound and an exit
condition. (I.e., stepping through an array looking for a value.) The count
register cannot easily be used as an index since moving it from the branch
chip to the fixed point chip is expensive (i.e. it is not a general purpose
register (GPR)).
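
Here is a C sketch of the loop shape that fits this model: a bounded search
where the counter only counts and a separate pointer walks the array. The
variable ctr merely stands in for the count register, and find_first is a
made-up name; what a real compiler emits may well differ.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the first element equal to key, or n if none. */
size_t find_first(const int *a, size_t n, int key)
{
    size_t ctr = n;        /* stands in for the count register          */
    const int *p = a;      /* a GPR walks the array; ctr is not an index */

    if (ctr == 0)
        return n;
    do {
        if (*p == key)     /* exit condition (condition register test)  */
            return (size_t)(p - a);
        p++;
    } while (--ctr != 0);  /* decrement the count, branch if nonzero    */
    return n;
}
```

Note that the loop never reads ctr except to test it against zero, which is
what lets the counter live entirely in the branch unit.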

Loops which fit this model incur no overhead for the counter or the branch
as the branch unit executes instructions in parallel with the fixed point
and floating point units. In fact it is key to the architecture since
maintaining the counter in the fixed point unit would likely impose a delay
(up to three cycles) on every iteration of the loop. This will show up on
more complex loop conditions anyway. (Loops with function calls also lose
as the count register is caller save.) I'm convinced that compare-and-branch
instructions (like on MIPS R-series and HP PA) are the way to go.
(Some other time maybe I'll post a comparison of the R4000 and RIOS branch
instructions. If the RIOS had a load instruction that set the condition
codes it would do better...)

All in all this stuff is neat and adds character to the architecture.
However, I'm not sure it is the best way to do things. Some codes will
perform better due to the bit manipulation instructions, but I doubt they
show up very often at all. (And for multi-precision shifts, there better be
one very important code to make it worthwhile because they never show up in
general purpose stuff.)

The only real way to know is to measure these things across large bodies of
code. I would like to hear what sort of measurements the IBM people did and
how they justified many of these instructions. Both MIPS and HP have been
rumored to use the "one percent rule" in that if an instruction can't boost
performance by 1%, then it doesn't belong in the architecture. For
measurement tools, MIPS has had pixie from very early on and Sun now has
spix for SPARC. What sort of tools do the IBM people use to analyze their
architecture? (At least one RIOS architect has posted here recently...)

Beyond performance, I'd argue that these features limit future
architectural flexibility.  For example, I bet doing an upward compatible
64 bit RIOS architecture is going to be much more difficult than going from
MIPS-II (R3000) to MIPS-III (R4000). (With 64 bit registers you now need 6
bits for those mask constants. It is less important to make double
precision shifts fast when single precision will now do 64 bits. Etc...)