Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!think.com!zaphod.mps.ohio-state.edu!uakari.primate.wisc.edu!ra!uvaarpa!murdoch!hemlock!clc5q
From: clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman)
Newsgroups: comp.arch
Subject: Re: new instructions
Message-ID: <1991May21.191034.25980@murdoch.acc.Virginia.EDU>
Date: 21 May 91 19:10:34 GMT
References: <9105200213.AA05095@ucbvax.Berkeley.EDU>
Sender: usenet@murdoch.acc.Virginia.EDU
Organization: University of Virginia Computer Science Department
Lines: 105

In article <9105200213.AA05095@ucbvax.Berkeley.EDU> jbs@WATSON.IBM.COM writes:
>
>
>        Herman Rubin also writes:
>How do you expect users who do not even know of the existence of the operations
>to use them?
>
>        I expect the compiler to generate the instructions for them.  If
>the compiler won't generate an instruction this is a strong reason for
>not having it in the instruction set.

As a compiler writer in this RISCy age, I am inclined to agree with this
statement. If we compiler writers, in our infinite brilliance, cannot
figure out how to use an instruction, it doesn't belong in the instruction
set. Before this becomes a mantra that is chanted mindlessly, let's look
at an example that might make us think a little harder.

On the VAX architecture, there are two common ways for a compiler to generate
code for the integer remainder operation. Let's say that register r6 contains
the value of variable "x", register r7 contains the value of variable "y",
and we want register r11, currently unused, to contain the value of variable
"z" (as determined by a register allocator.) All three variables are 32-bit
integers (VAX "longword" types.) Registers r6 through r11 are the general
purpose arithmetic registers; we can use others, such as r0 and r1, as scratch
registers that are not live across a function call. We will see this scratch
register usage in both examples below.

Given the C source code statement:

	z = x % y;   /* z gets the remainder of x divided by y */

We will see the "cc" compiler generate the following code:

	divl3	r7,r6,r0	/* divide r6/r7, put quotient in r0 */
	mull2	r7,r0		/* multiply r7*r0, put product in r0 */
	subl3	r0,r6,r11	/* subtract x - (x DIV y)*y; put in r11 */

So, the remainder is left in r11, the location for further uses of "z".
Thus, we computed "z = x - ((x/y)*y);" where the division is integer
division.

The main alternative to the above is to use the nonorthogonal VAX instruction
called "ediv". This takes a 64-bit (VAX "quadword") dividend and 32-bit divisor
and divides them, producing a 32-bit quotient and a 32-bit remainder. Very
CISCy 4-operand instruction; no equivalent exists for 32-bit dividend. We need
to create a 64-bit dividend to use it; unfortunately, we tend to have our
32-bit quotient in a register surrounded by live registers. So we generate
the 3-instruction sequence:

	movl	r6,r1		/* Transfer quotient to r1 */
	clrl	r0		/* Zero out upper word to form 64-bit r0/r1
				   register pair quotient */
	ediv	r7,r0,r2,r11	/* Divide r0-r1 pair by r7; throw away quotient
				   into r2 and keep remainder in r11 */

Intuitively, it seems like a lot of work to use the extended division when
you really don't need 64-bit division. But the first two operations are
very simple register operations (and optimization might eliminate the
second instruction, as r0 might already be cleared.) I was told that the
"cc" compiler did things the first way because "ediv" is such a slow
instruction. But I got curious and timed the alternatives myself:

[from an old description of the experiment] :

I then created a shell script that executes the programs 20 times each,
for a total of 20 times 30,000 == 600,000 runs through the loop. I ran
the script three times to judge the variance. Results:

................................................... So, looking at the 
time added by the computation with which we are concerned, we get about
a 24% speedup when we only want the remainder, and a 27% speedup when we
want the quotient and the remainder.

[end excerpt]

(I was referring to running the experiment two ways: producing the quotient
AND remainder by two methods, and producing just the remainder by the two
methods. The advantage of ediv is greater when use it to get two results
in one instruction, as the first method needs to copy the quotient result
somewhere before destroying it.)

So, a 24% speedup over what the "cc" compiler produces on a VAX 8600 running
BSD Unix 4.2. When I heard that hardware changes in the VAX 8600 might have
obsoleted an instruction selection made by "cc" for the VAX 11/780, I did
the test again on a VAX 11/750 running Ultrix. More than 20% improvement
for ediv there, too.

MORAL: Just because the compiler doesn't use it doesn't mean that the
compiler SHOULDN'T have used it. Use many compilers for many languages
and applications to measure instruction usage, and second-guess the
compiler to ensure good conclusions.

Question for RISC architects: When evaluating a large set of customer programs
to find instruction frequencies, how do you avoid biasing results because of
inefficient compilers? Even if you ran all C programs through "cc" and "gcc",
and let's say "gcc" uses ediv for remaindering, the count for ediv is lower
than it ought to be. "cc" will only use it for true 64/32 bit division. Perhaps
we should be counting intermediate code statement frequencies to decide what
higher level operations are important, and then deciding what lower level
instructions are needed?

-----------------------------------------------------------------------------
"The use of COBOL cripples the mind; its teaching should, therefore, be 
regarded as a criminal offence." E.W.Dijkstra, 18th June 1975.
|||  clc5q@virginia.edu (Clark L. Coleman)