Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!mips!zalman
From: zalman@mips.com (Zalman Stern)
Newsgroups: comp.arch
Subject: Re: Bitfield and loop instructions--a good idea?
Message-ID: <2302@spim.mips.COM>
Date: 16 Apr 91 01:36:59 GMT
References: <1991Apr15.193425.3436@waikato.ac.nz>
Sender: news@mips.COM
Organization: MIPS Computer Systems, Sunnyvale, California
Lines: 107
Nntp-Posting-Host: dish.mips.com

Check out the IBM RS/6000 (POWER architecture for He-Man fans). It has
interesting mask and loop counter instructions.

The mask instructions come in many flavors. The basic idea is to
construct a mask from two 5 bit quantities called Mask Begin (MB) and
Mask End (ME). The following is from the IBM manual (*) page 5-205
(text in square brackets is mine):

    If the MB value is less than the ME value + 1, then the mask bits
    between and including the starting point [bit numbered MB] and the
    end point [bit numbered ME] are set to ones. All other bits are
    set to zeros.

    If the MB value is the same as the ME value + 1, then all 32 mask
    bits are set to ones.

    If the MB value is greater than the ME value + 1, then the mask
    bits between and including the ME value + 1 and the MB value - 1
    are set to zeros. All other bits are set to ones.

[I'm pretty sure bit 0 is most significant.]

(*) IBM RISC System/6000 Assembler Language Reference (SC23-2197-00)

Here is a brief listing of instructions that use the mask mechanism:

maskg   Given MB and ME in registers, construct a mask and put it in a
        third register.

maskir  Given an already constructed mask in a register, insert a
        source register into a destination register under the mask.
        That is, everywhere there is a one in the mask, copy the bit
        from the source register to the destination register. Where
        there are zeros in the mask, leave the bits alone. Note that
        the mask can be more general than the above description of
        masks.

rlimi   Rotate left immediate then mask insert.
        Takes rotation amount, MB, and ME as 5 bit immediates. Rotates
        the source register and then does a mask insert into the
        destination register. (See above description of insertion.)

rlinm   Rotate left immediate then AND with mask. Takes rotation
        amount, MB, and ME as 5 bit immediates. Rotates the source
        register, then ANDs the result with the mask. Places the
        result of the AND in the destination register.

rlmi    Same as rlimi except rotate amount is in a register.

rlnm    Same as rlinm except rotate amount is in a register.

rrib    Takes bit 0 of source register, rotates it by an amount in a
        register and inserts it into the destination register. (I'm
        not sure what this is used for.)

Rlinm does left and right logical immediate shifts. (I think it
corresponds to what most shifter hardware looks like internally as
well.) I hacked some assembly for this machine and was amazed at how
often rlinm came in handy. (This instruction is so cool I'd almost
advocate adding it to our instruction set. But Earl would probably hit
me over the head with an R4000 architecture manual and make me write
"Thou shalt pay attention to dynamic instruction frequency
statistics!" 100 times on my whiteboard.)

The POWER architecture also provides a vast set of shift instructions.
These can be used to do double precision shifts and multiple precision
shifts with great efficiency. Add in the count leading zeros
instruction and you have a fairly complete set of bit flicking
instructions. (Herman Rubin might even like this machine. Then again
he'd probably bitch about it being worthless due to the difficulty of
using the FP hardware on integer data :-))

As to loop type instructions, the RIOS has a count register in the
branch unit (a separate chip in the current implementation). There is
an instruction that will decrement the count register and
conditionally branch based on the result of the decrement and possibly
a condition register field. This lends itself to loops which are
testing a bound and an exit condition.
(I.e., stepping through an array looking for a value.) The count
register cannot easily be used as an index since moving it from the
branch chip to the fixed point chip is expensive (i.e. it is not a
general purpose register (GPR)). Loops which fit this model incur no
overhead for the counter or the branch, as the branch unit executes
instructions in parallel with the fixed point and floating point
units. In fact it is key to the architecture, since maintaining the
counter in the fixed point unit would likely impose a delay (up to
three cycles) on every iteration of the loop. This delay will show up
on more complex loop conditions anyway. (Loops with function calls
also lose, as the count register is caller save.)

I'm convinced that compare-and-branch instructions (like on MIPS
R-series and HP PA) are the way to go. (Some other time maybe I'll
post a comparison of the R4000 and RIOS branch instructions. If the
RIOS had a load instruction that set the condition codes it would do
better...)

All in all this stuff is neat and adds character to the architecture.
However, I'm not sure it is the best way to do things. Some codes will
perform better due to the bit manipulation instructions, but I doubt
they show up very often at all. (And for multi-precision shifts, there
had better be one very important code to make it worthwhile, because
they never show up in general purpose stuff.)

The only real way to know is to measure these things across large
bodies of code. I would like to hear what sort of measurements the IBM
people did and how they justified many of these instructions. Both
MIPS and HP have been rumored to use the "one percent rule": if an
instruction can't boost performance by 1%, then it doesn't belong in
the architecture. For measurement tools, MIPS has had pixie from very
early on and Sun now has spix for SPARC. What sort of tools do the IBM
people use to analyze their architecture? (At least one RIOS architect
has posted here recently...)

Beyond performance, I'd argue that these features limit future
architectural flexibility. For example, I bet doing an upward
compatible 64 bit RIOS architecture is going to be much more difficult
than going from MIPS-I (R3000) to MIPS-III (R4000). (With 64 bit
registers you now need 6 bits for those mask constants. It is less
important to make double precision shifts fast when single precision
will now do 64 bits. Etc...)
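
For anyone who wants to play with the mask rule quoted from the IBM
manual near the top, here is a rough C sketch of it, plus models of the
rotate-then-AND and rotate-then-insert operations built on top. The
function names (power_mask, rotl32, rlinm, rlimi, shl_imm, shr_imm) are
made up for illustration and only model the value left in the
destination register; they are not IBM's definitions and ignore any
condition register effects. The mapping of logical shifts onto
rotate-then-AND is my understanding of the usual idiom, not something
taken from the manual text quoted above. Bit 0 is assumed to be the
most significant bit.

```c
#include <assert.h>
#include <stdint.h>

/* Build the 32-bit mask from MB and ME (both 0..31).
   Assumption: bit 0 is the most significant bit, so "bit i" is the
   C value 0x80000000 >> i. */
uint32_t power_mask(unsigned mb, unsigned me)
{
    assert(mb < 32 && me < 32);
    uint32_t from_mb  = UINT32_C(0xFFFFFFFF) >> mb;         /* ones in bits mb..31 */
    uint32_t up_to_me = UINT32_C(0xFFFFFFFF) << (31 - me);  /* ones in bits 0..me  */
    if (mb <= me)
        return from_mb & up_to_me;  /* MB < ME+1: ones in bits mb..me only */
    /* MB == ME+1 gives all ones; MB > ME+1 leaves zeros in me+1..mb-1. */
    return from_mb | up_to_me;
}

/* Rotate left by n (taken modulo 32). */
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (uint32_t)((x << n) | (x >> (32 - n))) : x;
}

/* Model of "rotate left immediate then AND with mask". */
uint32_t rlinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    return rotl32(rs, sh) & power_mask(mb, me);
}

/* Model of "rotate left immediate then mask insert": bits of the
   rotated source replace bits of ra wherever the mask has a one. */
uint32_t rlimi(uint32_t ra, uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    uint32_t m = power_mask(mb, me);
    return (rotl32(rs, sh) & m) | (ra & ~m);
}

/* Logical shifts by a constant 0..31, expressed as rotate-then-AND. */
uint32_t shl_imm(uint32_t x, unsigned n)
{
    assert(n < 32);
    return rlinm(x, n, 0, 31 - n);          /* keep bits 0..31-n */
}

uint32_t shr_imm(uint32_t x, unsigned n)
{
    assert(n < 32);
    return rlinm(x, (32 - n) & 31, n, 31);  /* keep bits n..31 */
}
```

With these, extracting a right-justified byte field is a single call
such as rlinm(x, 16, 24, 31), which is the sort of thing that makes
the instruction come in handy. The sketch also shows where the 5 bit
fields are baked in: a 64 bit version would need MB, ME and the
rotate amount to range over 0..63.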