Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!samsung!munnari.oz.au!metro!cluster!rex
From: rex@cs.su.oz (Rex Di Bona)
Newsgroups: comp.arch
Subject: Re: standard extensions
Message-ID: <2124@cluster.cs.su.oz.au>
Date: 26 Feb 91 09:41:14 GMT
References: <3381.27c548c3@iccgcc.decnet.ab.com> <1991Feb25.135057.23667@linus.mitre.org> <1991Feb25.201406.18643@bingvaxu.cc.binghamton.edu>
Sender: news@cluster.cs.su.oz.au
Reply-To: rex@cluster.cs.su.oz (Rex Di Bona)
Organization: Basser Dept of Computer Science, University of Sydney, Australia
Lines: 140

In article <1991Feb25.201406.18643@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:

[ Inlining HLL calls to assembly routines ]

> For example, suppose we compare two tight loops, one with separate divide &
> modulus/remainder (whichever you need); the other with some combined code.
> Suppose there are n other instructions in each loop. Let's discard pipe
> flushing considerations and further assume all instructions take a single
> cycle.

Single cycle? For a divide instruction? On what machine?

Let's take a real example here instead (say the 68020, or the R3000), as I
have the info just behind me here. Ugh, there... let's see, the 68020,
divide, here we are:

Divide unsigned ranges from 76 cycles (best case, register to register) to a
massive 78+24 = 102 cycles for the worst case, a divide with the source
addressed as program counter memory indirect pre-indexed (PCMIP). For signed
division we add 12 cycles, so the best case is 88 cycles and the worst case
is 114 cycles. Now our move takes (best case) 0 cycles (it is totally
overlapped with other instructions), up to a worst case of 3+49 = 51 cycles
(the 49 is for the PCMIP-to-PCMIP addressing mode).

> We have, in loop I say, n+3 instructions (let's even allow another move
> instruction to get at either the quotient or remainder as they may have ended
> up in inconvenient places) and in loop II n+1 instructions.
>
> Iterate both loops N times. Loop I takes N(n+3) cycles: loop II takes N(n+1)
> cycles -- the ratio is obviously (n+3)/(n+1). This will exceed a 10%
> difference if n < 19.

Back in 68020 land...

For a move of a register to a register we can assume it is overlapped, as we
don't need the result in the next instruction, for a cost of 0 cycles.
(Add more if we move the intermediate result to memory...)

For unsigned division...

For the best case, the ratio of the combined loop to the separate div + rem
loop is

    (n*(average cycles per instruction) + 76)/(n*Ave + (76 * 2))

and the separate version is still 10% or more slower provided:

    Assumption:
        average cycles = 1,   n < 685
        average cycles = 2,   n < 343
        average cycles = 3,   n < 229
        average cycles = 4,   n < 172
        average cycles = 10,  n < 69
        average cycles = 20,  n < 35

For the worst case...

    (n*(average cycles per instruction) + 102)/(n*Ave + (102 * 2))

    Assumption:
        average cycles = 1,   n < 919
        average cycles = 2,   n < 460
        average cycles = 3,   n < 307
        average cycles = 4,   n < 230
        average cycles = 10,  n < 92
        average cycles = 20,  n < 46

If we look at the example program given in the article, we have SIGNED
division, so...

For the best case...

    (n*(average cycles per instruction) + 88)/(n*Ave + (88 * 2))

    Assumption:
        average cycles = 1,   n < 793
        average cycles = 2,   n < 397
        average cycles = 3,   n < 265
        average cycles = 4,   n < 199
        average cycles = 10,  n < 80
        average cycles = 20,  n < 40

For the worst case...

    (n*(average cycles per instruction) + 114)/(n*Ave + (114 * 2))

    Assumption:
        average cycles = 1,   n < 1027
        average cycles = 2,   n < 514
        average cycles = 3,   n < 343
        average cycles = 4,   n < 257
        average cycles = 10,  n < 103
        average cycles = 20,  n < 52
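In case anyone wants to check the arithmetic, or plug in their own cycle
counts, here is a quick C sketch that reproduces those limits. It assumes
the simple cost model above (n other instructions at Ave cycles each, plus
one divide for the combined loop or two for the separate one, moves free)
and kym's 10% criterion; the divide costs are the 68020 figures quoted
above, and everything else is purely for illustration:

/*
 * Reproduce the "n < ..." limits above.  The separate div/rem loop costs
 * n*ave + 2*D cycles per iteration, the combined loop n*ave + D, so the
 * separate one is at least 10% slower when
 *     n*ave + 2*D >= 1.1 * (n*ave + D)   =>   n <= 9*D/ave
 * where D is the divide cost in cycles.
 */
#include <stdio.h>

int main(void)
{
    static const int divcost[] = { 76, 102, 88, 114 };  /* 68020 costs above */
    static const char *label[] = {
        "unsigned, best case", "unsigned, worst case",
        "signed, best case",   "signed, worst case"
    };
    static const double ave[] = { 1, 2, 3, 4, 10, 20 };
    int i, j;

    for (i = 0; i < 4; i++) {
        printf("%s (%d cycles):\n", label[i], divcost[i]);
        for (j = 0; j < 6; j++) {
            /* largest n that still gives a 10% or better win */
            int nmax = (int)(9.0 * divcost[i] / ave[j]);
            printf("    average cycles = %2g,  n < %d\n", ave[j], nmax + 1);
        }
    }
    return 0;
}

Running it prints the same n limits as the tables above; swap other numbers
into divcost[] and ave[] (the R3000 guesses further down, say) to see how
sensitive the break-even point is.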
> So, provided there are no more than about 20 instructions in the loop in this
> obviously _idealized_ circumstance, the body of the loop will run 10% or
> more faster given the fancy combined quot/rem facility. Considering our loop
> is only _part_ of a larger program the impact on the total running time of
> the complete code will be even less marked given the facility.

Yes, but it is usually the case that cycle times for other, more normally
executed instructions are _far_ less than those for divide instructions. In
the above examples we find that we still require a large number of other
instructions before we arrive at the point where the overhead is lost in the
rest of the loop. I would assume (rough guess here) that 10 cycles per 68020
instruction is about right, taking into account cache misses and such,
perhaps less. But the lower the average cycle count for those other
instructions, the better it is to use the one instruction to do both the div
and the remainder...

(Considering the R3000, it becomes (slightly) easier. We have the following:
divide takes, hmm... let me see, oops, the Kane book doesn't seem to mention
the time taken for a divide. OK then, let's assume it is 19 cycles (the time
for an FP divide), and also try it at 12 cycles (a lower bound, the time for
a multiply). We get (assuming 1.2 cycles per instruction):

    19 cycles for a divide,  n < 143
    12 cycles for a divide,  n < 91

The 1.2 figure comes from a hazy recollection of a talk by John Mashey, where
he was going on about why it wasn't really 1.00 cycles per instruction, and
how they (MIPS) wanted the cycles per instruction to go below 1.00 :-)

> To summarize: I don't think provision of a combined divide/remainder
> (or divide/modulus for that matter) instruction will necessarily
> speed up the total running time of any real programs appreciably
> (i.e. more than, say, 10%). Perhaps Bob Silverman could illustrate
> some circumstance which contradicts this?
>
> -kym
> ===
> No sig is a good sig (this isn't one).

I hope that the above numbers might convince you that having the one
instruction generate both is better in those pesky real world situations,
using real world machines. I doubt (oops, rash prediction here) that a
division will become as cheap as a move instruction, unless something
(radically) changes the speed of the processing unit without changing the
speed of memory (in which case we'll be back at super CISC anyway).

--------
Rex di Bona (rex@cs.su.oz.au)
Penguin Lust is NOT immoral