Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!zaphod.mps.ohio-state.edu!rpi!sarah!bingnews!kym
From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell)
Newsgroups: comp.arch
Subject: Re: standard extensions
Message-ID: <1991Feb27.214758.14226@bingvaxu.cc.binghamton.edu>
Date: 27 Feb 91 21:47:58 GMT
References: <2124@cluster.cs.su.oz.au> <1991Feb27.034955.28860@bingvaxu.cc.binghamton.edu> <2134@cluster.cs.su.oz.au>
Organization: State University of New York at Binghamton
Lines: 76

In article <2134@cluster.cs.su.oz.au> rex@cluster.cs.su.oz (Rex Di Bona) writes:
[I say I'm not way off even considering we're moving away from my original 
assumptions]
> Sorry, but; yes. The time for a 'divide' in what I would consider a general
> purpose machine would out weigh (by a large factor) the time taken for more
> normal instructions...

From my original post I kinda eliminated CISC workstations & such from
consideration, but let's run with Rex's ball anyhow.

[I measured that an extra divide in a loop caused a slow down of 30%]
> So, for each iteration it was 30% for that extra divide, ie, a divide was
> taking 42% of the time for the loop???

No -- 30%.
	1 x divide + other stuff = 1.0
	2 x divide + offer stuff = 1.3
	---
	therefore 1 x divide = 0.3 of loop

> I ran your program on a MIPS RC3230 and got very similar numbers, which
> also supports my premis. The additional divide was taking 20 cycles, which
> _completely_ overwhelms the rest of the loop, so adding in that extra divide
> is taking a _big_ performance hit!

Using pixie on a Magnum I find the following with my little loop with
one divide and the whole thing repeated 100 times:

Number of cycles:		8.2e7
Number of instructions:		3.8e7 (i.e. 2.1 c/i av)
Number of divides executed:	2.6%

I also find a divide takes about 4 add times on average.

So we want to know how big the loop is such that an extra divide accounts for 
10% --

	Td
	----------	= 0.1
	2 Td + n Ta

==>	n = 32 Where Td = 4 Ta (say).

I still think I'm in the ballpark wrt my previous posts even though we have 
diverged a little wrt the architectural assumptions.

As Rex rightly points out the number of instructions _executed_ in my little 
loop is actually fairly small (which only improves my claim that loops
of around 20 instructions would suffice to swap a `divide effect'). There are 
also two other tricks -- the number of divides may be much smaller than the 
number of loop iterations (depending on the size of M), and the `divide' 
effect will be masked by an equal number of multiplies (fairly typical of the 
codes originally in question I think). Which all goes to show the importance 
of measurement.

As to what the _typical_ size of loops that contain, e.g. a divide & remainder,
ARE -- I have no real idea at this time. The loop I have posted is (actually 
the smallest) from various code that I think may be typical of its type. If 
there is any interest, maybe I can do some more pixifying of other programs & 
come up with some better indication.

To summarize: we have an indication that EVEN WHERE DIVIDES ARE EXPENSIVE 
(e.g. on the 68k's as Rex intimates) the cost of an extra divide in (what I 
guess I'm claiming to be) typical loops is not very high wrt the total time 
of the loop (i.e. 30% in my simple test case of the last post) and rapidly 
less when considering the total running time of the program as a whole.

My original point was that talking about the cost of an extra instruction 
or two IN ISOLATION is (both common and) misleading. The total running cost of 
a program is fairly insensitive to the cost of an individual instruction in a 
loop. I indicated at the time that my argument was overly simplistic, but I'm 
surprised at how little it seems to `diverge' (although you _could_ say
30% is 3 times 10% -- I prefer 30% is only 20% away :-) when altering some of 
the underlying assumptions.

-kym