Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!think.com!zaphod.mps.ohio-state.edu!wuarchive!udel!haven.umd.edu!uvaarpa!murdoch!hemlock!clc5q
From: clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman)
Newsgroups: comp.arch
Subject: Re: new instructions
Message-ID: <1991May23.210519.23443@murdoch.acc.Virginia.EDU>
Date: 23 May 91 21:05:19 GMT
References: <9105200213.AA05095@ucbvax.Berkeley.EDU> <1991May21.191034.25980@murdoch.acc.Virginia.EDU> <25874@as0c.sei.cmu.edu>
Sender: usenet@murdoch.acc.Virginia.EDU
Organization: University of Virginia Computer Science Department
Lines: 92

In article <25874@as0c.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
>In article <1991May21.191034.25980@murdoch.acc.Virginia.EDU> clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) writes:
>
>>Given the C source code statement:
>
>>	z = x % y;   /* z gets the remainder of x divided by y */
>
>>... we generate
>>the 3-instruction sequence:
>>
>>	movl	r6,r1		/* Transfer quotient to r1 */
>>	clrl	r0		/* Zero out upper word to form 64-bit r0/r1
>>				   register pair quotient */
>>	ediv	r7,r0,r2,r11	/* Divide r0-r1 pair by r7; throw away quotient
>>				   into r2 and keep remainder in r11 */
>
>I hope not.  From the previous code fragment, it is clear you are
>expecting the remainder from SIGNED division.  If you want the same
>answer as before, the code must be
>
>	MOVL R6,R1		; construct the sign-extended 64-bit ...
>	ASHQ #-32,R0,R0		; dividend in the register pair <R0,R1>
>	EDIV ... as before
>

Thanks for pointing out my error. I looked into the VAX Architecture Handbook
and it seems that you are trying to get "ASHL  #-32,R1,R0" in your second
statement. "ASHQ #-32,R0,R0" takes a heck of a long time and gives the wrong
answer. "ASHL #-32,R1,R0" takes the sign bit of R1 and fills R0 with it.
Unfortunately, this seems to be the best way to sign extend on the VAX.
(The coercion instructions don't include CVTLQ == coerce longword to quadword,
so the apparently slower pseudo-shift is the best we can do.)
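
To see why the sign fill matters, here is a minimal C sketch. It models
only the arithmetic, not VAX register layout or timing, and the helper
names (widen_zero, widen_sign, rem64) are made up for illustration. It
assumes C99 semantics, where the remainder takes the sign of the dividend.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models clearing the upper longword. The upper
   32 bits of the 64-bit dividend are zero whatever the sign of x. */
static int64_t widen_zero(int32_t x)
{
    return (int64_t)(uint32_t)x;
}

/* Hypothetical helper: models the sign fill. The upper 32 bits are
   copies of the sign bit of x. */
static int64_t widen_sign(int32_t x)
{
    return (int64_t)x;
}

/* Models only the remainder result of a 64-by-32 signed divide. */
static int32_t rem64(int64_t dividend, int32_t divisor)
{
    assert(divisor != 0);
    return (int32_t)(dividend % divisor);
}
```

With x = -7 and y = 3, rem64(widen_sign(x), y) agrees with the C
expression x % y, while rem64(widen_zero(x), y) divides 4294967289 by 3
and so returns a different remainder. For non-negative x the two
widenings give the same result.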

>You might like to time THAT sequence, and rethink your post.  Or you
>could take my word for it, that when you include the cost of having
>to reserve and target into an even-odd register pair, the EDIV is
>almost always slower.

Well, I timed the new sequence, and a little over half of my speedup
disappeared, but it is still more than 10% faster than the code that
"cc -O" generates.

As for register allocation issues, that is a complex subject on the VAX.
Registers R6 through R11 are "allocable" general-purpose registers. When
translating most source code statements, you can consider R0 through R5
available. (BTW, Robert, this little tutorial is not directed at you, but
at those new to the VAX register set.) As long as you aren't doing weird
assembly language instructions like CRC or POLY or the string instructions,
R2 through R5 are not going to be trampled by anything. R0 and R1 are used
for the return value of a function, so usage of them has to be temporary
and not live across a function call.

Really good register allocation will use R0 through R11 as much as is legal.
Simpler and poorer register allocation will only use R6 through R11 except
when a single intermediate-code statement is translated into multiple
assembly language statements, and those statements need scratch registers
that will be dead upon conclusion of the single intermediate code operation.
Thus, my code above used R0 through R2 temporarily.

The point here is that "cc -O" is doing the same thing. It generated a sequence
of 3 instructions for the remainder operation, and used R0 as a scratch 
register. Thus, for simple and stupid register allocators, R0 and R1 are
always available as a nice even-odd register pair for scratch usage.
(Although the VAX does not care about even-odd pairs, so I am not sure
why you mentioned them. A contiguous pair is all that is needed.) A smarter
allocator might want to avoid using "ediv" for the remainder operation
because of the need to reserve a pair of registers. (A REALLY smart code
generator might look to see if a pair is available for scratch use, and
generate the "ediv" code if it were, and the "cc -O" sequence otherwise.
And the first instruction of my sequence is unnecessary if the next lower
numbered register is unused at the moment; the ASHL and EDIV can be done
in place.)
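
The check such a code generator would make can be sketched in C as
follows. This is only an illustration, not how any real VAX compiler
works: the bitmask representation of free registers, the strategy names,
and both function names are assumptions of mine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical strategy names; neither comes from a real compiler. */
enum rem_strategy { REM_USE_EDIV, REM_USE_FALLBACK };

/* Bit n of free_mask is set when register Rn is free for scratch use.
   Returns the lower register number of the first free contiguous pair
   among R0..R11, or -1 if no such pair exists. */
static int find_free_pair(uint16_t free_mask)
{
    for (int r = 0; r < 11; r++) {
        if (((free_mask >> r) & 1u) && ((free_mask >> (r + 1)) & 1u))
            return r;
    }
    return -1;
}

/* Picks the "ediv" sequence when a scratch pair is free, otherwise the
   fallback sequence. The chosen pair (or -1) is stored in *pair_lo. */
static enum rem_strategy choose_rem_strategy(uint16_t free_mask, int *pair_lo)
{
    assert(pair_lo != 0);
    *pair_lo = find_free_pair(free_mask);
    return *pair_lo >= 0 ? REM_USE_EDIV : REM_USE_FALLBACK;
}
```

With only R0 and R1 free (mask 0x0003) this picks the "ediv" strategy
using the pair starting at R0; with only R0 and R2 free (mask 0x0005)
there is no contiguous pair, so it falls back.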

The point still remains: "cc -O" produces less than optimal code that
biases instruction count analysis of the architecture. I am still wondering
how system architects handle this bias when determining the future path
of the architecture, and how it affects the famous pronouncements that
CISCs all have umpteen never-used instructions and umpteen more
rarely-used instructions.

(BTW, I will check the timings again on the VAX-11/750. The speedup I
confirmed was on the VAX 8600. If the result is different on the VAX-11/750,
that just shows that a code generator can become outdated and bias
instruction counts. So the point remains the same.)


-----------------------------------------------------------------------------
"The use of COBOL cripples the mind; its teaching should, therefore, be 
regarded as a criminal offence." E.W.Dijkstra, 18th June 1975.
|||  clc5q@virginia.edu (Clark L. Coleman)