Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!think.com!zaphod.mps.ohio-state.edu!uakari.primate.wisc.edu!ra!uvaarpa!murdoch!hemlock!clc5q From: clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) Newsgroups: comp.arch Subject: Re: new instructions Message-ID: <1991May21.191034.25980@murdoch.acc.Virginia.EDU> Date: 21 May 91 19:10:34 GMT References: <9105200213.AA05095@ucbvax.Berkeley.EDU> Sender: usenet@murdoch.acc.Virginia.EDU Organization: University of Virginia Computer Science Department Lines: 105 In article <9105200213.AA05095@ucbvax.Berkeley.EDU> jbs@WATSON.IBM.COM writes: > > > Herman Rubin also writes: >How do you expect users who do not even know of the existence of the operations >to use them? > > I expect the compiler to generate the instructions for them. If >the compiler won't generate an instruction this is a strong reason for >not having it in the instruction set. As a compiler writer in this RISCy age, I am inclined to agree with this statement. If we compiler writers, in our infinite brilliance, cannot figure out how to use an instruction, it doesn't belong in the instruction set. Before this becomes a mantra that is chanted mindlessly, let's look at an example that might make us think a little harder. On the VAX architecture, there are two common ways for a compiler to generate code for the integer remainder operation. Let's say that register r6 contains the value of variable "x", register r7 contains the value of variable "y", and we want register r11, currently unused, to contain the value of variable "z" (as determined by a register allocator.) All three variables are 32-bit integers (VAX "longword" types.) Registers r6 through r11 are the general purpose arithmetic registers; we can use others, such as r0 and r1, as scratch registers that are not live across a function call. We will see this scratch register usage in both examples below. Given the C source code statement: z = x % y; /* z gets the remainder of x divided by y */ We will see the "cc" compiler generate the following code: divl3 r7,r6,r0 /* divide r6/r7, put quotient in r0 */ mull2 r7,r0 /* multiply r7*r0, put product in r0 */ subl3 r0,r6,r11 /* subtract x - (x DIV y)*y; put in r11 */ So, the remainder is left in r11, the location for further uses of "z". Thus, we computed "z = x - ((x/y)*y);" where the division is integer division. The main alternative to the above is to use the nonorthogonal VAX instruction called "ediv". This takes a 64-bit (VAX "quadword") dividend and 32-bit divisor and divides them, producing a 32-bit quotient and a 32-bit remainder. Very CISCy 4-operand instruction; no equivalent exists for 32-bit dividend. We need to create a 64-bit dividend to use it; unfortunately, we tend to have our 32-bit quotient in a register surrounded by live registers. So we generate the 3-instruction sequence: movl r6,r1 /* Transfer quotient to r1 */ clrl r0 /* Zero out upper word to form 64-bit r0/r1 register pair quotient */ ediv r7,r0,r2,r11 /* Divide r0-r1 pair by r7; throw away quotient into r2 and keep remainder in r11 */ Intuitively, it seems like a lot of work to use the extended division when you really don't need 64-bit division. But the first two operations are very simple register operations (and optimization might eliminate the second instruction, as r0 might already be cleared.) I was told that the "cc" compiler did things the first way because "ediv" is such a slow instruction. But I got curious and timed the alternatives myself: [from an old description of the experiment] : I then created a shell script that executes the programs 20 times each, for a total of 20 times 30,000 == 600,000 runs through the loop. I ran the script three times to judge the variance. Results: ................................................... So, looking at the time added by the computation with which we are concerned, we get about a 24% speedup when we only want the remainder, and a 27% speedup when we want the quotient and the remainder. [end excerpt] (I was referring to running the experiment two ways: producing the quotient AND remainder by two methods, and producing just the remainder by the two methods. The advantage of ediv is greater when use it to get two results in one instruction, as the first method needs to copy the quotient result somewhere before destroying it.) So, a 24% speedup over what the "cc" compiler produces on a VAX 8600 running BSD Unix 4.2. When I heard that hardware changes in the VAX 8600 might have obsoleted an instruction selection made by "cc" for the VAX 11/780, I did the test again on a VAX 11/750 running Ultrix. More than 20% improvement for ediv there, too. MORAL: Just because the compiler doesn't use it doesn't mean that the compiler SHOULDN'T have used it. Use many compilers for many languages and applications to measure instruction usage, and second-guess the compiler to ensure good conclusions. Question for RISC architects: When evaluating a large set of customer programs to find instruction frequencies, how do you avoid biasing results because of inefficient compilers? Even if you ran all C programs through "cc" and "gcc", and let's say "gcc" uses ediv for remaindering, the count for ediv is lower than it ought to be. "cc" will only use it for true 64/32 bit division. Perhaps we should be counting intermediate code statement frequencies to decide what higher level operations are important, and then deciding what lower level instructions are needed? ----------------------------------------------------------------------------- "The use of COBOL cripples the mind; its teaching should, therefore, be regarded as a criminal offence." E.W.Dijkstra, 18th June 1975. ||| clc5q@virginia.edu (Clark L. Coleman)