Path: utzoo!mnetor!tmsoft!torsqnt!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!rpi!batcomputer!munnari.oz.au!metro!cluster!rex
From: rex@cs.su.oz (Rex Di Bona)
Newsgroups: comp.arch
Subject: Re: standard extensions
Message-ID: <2134@cluster.cs.su.oz.au>
Date: 27 Feb 91 10:50:33 GMT
References: <1991Feb25.135057.23667@linus.mitre.org> <1991Feb25.201406.18643@bingvaxu.cc.binghamton.edu> <2124@cluster.cs.su.oz.au> <1991Feb27.034955.28860@bingvaxu.cc.binghamton.edu>
Sender: news@cluster.cs.su.oz.au
Reply-To: rex@cluster.cs.su.oz (Rex Di Bona)
Organization: Basser Dept of Computer Science, University of Sydney, Australia
Lines: 83

In article <1991Feb27.034955.28860@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> Since I am not entirely clear as to what the article boils down to, apart
> from ``kym, you are _WAY_ wrong''

Sorry, but; yes. The time for a 'divide' in what I would consider a general
purpose machine would out weigh (by a large factor) the time taken for more
normal instructions...

> I ran a program on a Sun3/60 created from the loop given last time with and 
> without adding an extra signed divide (that I made certain was not removed by 
> the optimizer) and found there was a `slow down' of about 30% over 100 reps
> (arithmetic mean). 

So, for each iteration it was 30% for that extra divide, ie, a divide was
taking 42% of the time for the loop???

> Considering how far a Sun was from the model I indicated in my previous post, 
> I'm a bit amazed it didn't diverge more (from Rex Di Bona's post and other 
> private email I had been expecting maybe orders of magnitude).
> 
> -kym

Oh, I don't know, having a divide be so long is about what I figured...
Your loops are not as long as you might think, as they have those if
statements. Here are the instruction counts...

				For 68K		For R3000
the initial if comparison,	6		6
the *zp++ = M-*yp		5		13
the else if compare.		6		6
the *zp++ = M-*xp		6		11
the fluff for the final else	5		18
Stalls				-		10 (? I'm not sure)

Anyway, looking at how these branches are used, we find (on a sample
run of 100 program executions) the two (non division) branches are
executed a total of 0 (yes, zero) times, and the division is executed
each time. (Note 1, below) So we have a loop that is 6+6+5 = 17 (and
the division) instructions long....  if adding a division increases the
time by 30% we can conclude that a division is (roughly) equal to 13
(If 17 = 57.% then 42.% is 13) instructions???? (if we assume the divide
takes 78 cycles, this works out at an average of 6 cycles for a 68K
instruction, not too unreasonable?) On the R3000 we have (6+6+18)*1.2+10 =
46, for a divide of about 19.3 cycles.. again, not too unreasonable?

Note 1: If you look at your program the conditions to take the top two
decision traces depend _totally_ upon random numbers being equal...
You do _not_,  I repeat _not_ change the seed to this random number
generator between runs, so it _always_ produces the same sequence.
This is a bug(feature??) of rand()? So, you would always be executing
_exactly_ the same code, the differences you measured were in fact the
instabilities in the kernel timing of your program :-(

> 
> P.S. my `back of the envelope' calcs using the above result indicates a real 
> 68k signed divide in a similar loop as before will contribute 10% of the
> total at 68 instructions 
	      ^^^^^^^^^^^^ - Instructions are a bad measure for a CISC, as
the time each takes varies so much.

Um, lets see... (assumption ... an instruction is 6 cycles) we get that we
need 703 additional instructions (apart from the divide) to make adding
an extra divide only 10% more overhead...
68 instructions (68 * 6 = 408 cycles) means an extra 16% overhead.

I ran your program on a MIPS RC3230 and got very similar numbers, which
also supports my premis. The additional divide was taking 20 cycles, which
_completely_ overwhelms the rest of the loop, so adding in that extra divide
is taking a _big_ performance hit!

What the important thing is is the _relative_ speed of divide to more
normal (move, simple alu) instructions. When the divide instruction is
much greater than the cost of a 'move' instruction you have to have _a
lot_ of move instructions to compenstate for that extra divide. Unless
you have these you lose big. In your case, you have a relatively small
number of 'other' instructions (17 in the 68020, and 30 in the R3000)
which ment that the divide instruction was almost equal in time to all these
instructions!  (This is why it is easier when multiplying by a constant
to do the shift and adds instead of the multiply instruction (SPARC is
except from this example :-)
--------
Rex di Bona (rex@cs.su.oz.au)
Penguin Lust is NOT immoral