Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!thunder.mcrcim.mcgill.edu!snorkelwacker.mit.edu!spool.mu.edu!samsung!usc!zaphod.mps.ohio-state.edu!rpi!sarah!bingnews!kym
From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell)
Newsgroups: comp.arch
Subject: Re: standard extensions
Message-ID: <1991Feb25.201406.18643@bingvaxu.cc.binghamton.edu>
Date: 25 Feb 91 20:14:06 GMT
References: <3381.27c548c3@iccgcc.decnet.ab.com> <PH.91Feb22180807@ama-1.ama.caltech.edu> <1991Feb25.135057.23667@linus.mitre.org>
Organization: State University of New York at Binghamton
Lines: 97

In article <1991Feb25.135057.23667@linus.mitre.org> bs@faron.mitre.org (Robert D. Silverman) writes:
>Yes. There always is assembler. However, the cost of calling an assembler
>routine to do a division returning both quotient and remainder is very
>expensive. In-lining mechanisms for assembler code are woefully inadequate
>in current languages. One can, of course, look at the intermediate assembler
>code generated by the compiler of a HLL and then adjust your assembler code
>so that register usage is correct, but this is very clumsy and must be redone
>everytime you make a change to the HLL code.

I find it hard to criticize the facilities for inlining assembler code (not 
that I would ever do it myself -- well, hardly ever) provided by modern 
compilers, e.g. gcc as outlined in earlier posts to this group and elsewhere. 
At least some modern compilers would seem to provide the facilities for 
including such code in the (sometimes extensive) dataflow analysis they 
already perform for the HLL code -- even to the extent of allocating available 
registers for your use in the inline assembler code.  It would therefore seem 
low-level tweaking as described above is at least on the way out, if not 
already `passed over'.

But on the other hand -- why put assembler into a program at all? Although I 
understand there _are_ a (very few) application areas where even a 10% 
reduction in running time represents big bucks, I find it hard to accept that 
the ability to return the quotient and remainder, e.g., at the same time will 
make even _that_ much difference to code that already would almost certainly 
include more time-intensive instructions in the mix.

For example, suppose we compare two tight loops, one with separate divide & 
modulus/remainder (whichever you need); the other with some combined code. 
Suppose there are n other instructions in each loop.  Let's discard pipe 
flushing considerations and further assume all instructions take a single 
cycle. We have, in loop I say, n+3 instructions (let's even allow another move 
instruction to get at either the quotient or remainder as they may have ended 
up in inconvenient places) and in loop II n+1 instructions. 

Iterate both loops N times. Loop I takes N(n+3) cycles: loop II takes N(n+1) 
cycles -- the ratio is obviously (n+3)/(n+1). This will exceed a 10% 
difference if n < 19.

So, provided there are no more than about 20 instructions in the loop in this 
obviously _idealized_ circumstance, the body of the loop will run 10% or 
more faster given the fancy combined quot/rem facility.  Considering our loop 
is only _part_ of a larger program the impact on the total running time of the 
complete code will be even less marked given the facility.

But perhaps typical loops, esp those containing divide and remainder
instructions, are small? Well, after looking through my source code I don't 
happen to find _any_ loops with both divide & remainder in them -- I guess I 
tend to avoid it (blush). However, below is an _example_ that I will not say 
is _typical_, but is at least indicative of the kind of code I'm looking for 
(it's an inner loop from an FFT routine). With -O a Sun3/60 cc produces the 
code following.  Strangely, there are about 20 instructions in the loop.

------

for(xp=x,yp=y,zp=z; yp<z; ++xp,++yp) {
	if(*xp==M-1) *zp++ = M-*yp;
	else if(*yp==M-1) *zp++ = M-*xp;
	else *zp++ = *xp**yp % M;
	}

L77003:
	cmpl	#1000,a0@
	jne	L77005
	movl	#1001,d0
	subl	a1@,d0
	movl	d0,a5@
	jra	LY00000
L77005:
	cmpl	#1000,a1@
	jne	L77007
	movl	#1001,d0
	subl	a0@,d0
	movl	d0,a5@
	jra	LY00000
L77007:
	movl	a0@,d0
	mulsl	a1@,d0
	divsll	#1001,d1:d0
	movl	d1,a5@
LY00000:
	addqw	#4,a5
	addqw	#4,a0
	addqw	#4,a1
	cmpl	a6@(-12),a1
	jcs	L77003

---

To summarize: I don't think provision of a combined divide/remainder
(or divide/modulus for that matter) instruction will necessarily
speed up the total running time of any real programs appreciably
(i.e. more than, say, 10%). Perhaps Bob Silverman could illustrate
some circumstance which contradicts this?

-kym
===
No sig is a good sig (this isn't one).


Brought to you by Super Global Mega Corp .com