Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!zaphod.mps.ohio-state.edu!rpi!sarah!bingnews!kym From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) Newsgroups: comp.arch Subject: Re: standard extensions Message-ID: <1991Feb27.214758.14226@bingvaxu.cc.binghamton.edu> Date: 27 Feb 91 21:47:58 GMT References: <2124@cluster.cs.su.oz.au> <1991Feb27.034955.28860@bingvaxu.cc.binghamton.edu> <2134@cluster.cs.su.oz.au> Organization: State University of New York at Binghamton Lines: 76 In article <2134@cluster.cs.su.oz.au> rex@cluster.cs.su.oz (Rex Di Bona) writes: [I say I'm not way off even considering we're moving away from my original assumptions] > Sorry, but; yes. The time for a 'divide' in what I would consider a general > purpose machine would out weigh (by a large factor) the time taken for more > normal instructions... From my original post I kinda eliminated CISC workstations & such from consideration, but let's run with Rex's ball anyhow. [I measured that an extra divide in a loop caused a slow down of 30%] > So, for each iteration it was 30% for that extra divide, ie, a divide was > taking 42% of the time for the loop??? No -- 30%. 1 x divide + other stuff = 1.0 2 x divide + offer stuff = 1.3 --- therefore 1 x divide = 0.3 of loop > I ran your program on a MIPS RC3230 and got very similar numbers, which > also supports my premis. The additional divide was taking 20 cycles, which > _completely_ overwhelms the rest of the loop, so adding in that extra divide > is taking a _big_ performance hit! Using pixie on a Magnum I find the following with my little loop with one divide and the whole thing repeated 100 times: Number of cycles: 8.2e7 Number of instructions: 3.8e7 (i.e. 2.1 c/i av) Number of divides executed: 2.6% I also find a divide takes about 4 add times on average. So we want to know how big the loop is such that an extra divide accounts for 10% -- Td ---------- = 0.1 2 Td + n Ta ==> n = 32 Where Td = 4 Ta (say). I still think I'm in the ballpark wrt my previous posts even though we have diverged a little wrt the architectural assumptions. As Rex rightly points out the number of instructions _executed_ in my little loop is actually fairly small (which only improves my claim that loops of around 20 instructions would suffice to swap a `divide effect'). There are also two other tricks -- the number of divides may be much smaller than the number of loop iterations (depending on the size of M), and the `divide' effect will be masked by an equal number of multiplies (fairly typical of the codes originally in question I think). Which all goes to show the importance of measurement. As to what the _typical_ size of loops that contain, e.g. a divide & remainder, ARE -- I have no real idea at this time. The loop I have posted is (actually the smallest) from various code that I think may be typical of its type. If there is any interest, maybe I can do some more pixifying of other programs & come up with some better indication. To summarize: we have an indication that EVEN WHERE DIVIDES ARE EXPENSIVE (e.g. on the 68k's as Rex intimates) the cost of an extra divide in (what I guess I'm claiming to be) typical loops is not very high wrt the total time of the loop (i.e. 30% in my simple test case of the last post) and rapidly less when considering the total running time of the program as a whole. My original point was that talking about the cost of an extra instruction or two IN ISOLATION is (both common and) misleading. The total running cost of a program is fairly insensitive to the cost of an individual instruction in a loop. I indicated at the time that my argument was overly simplistic, but I'm surprised at how little it seems to `diverge' (although you _could_ say 30% is 3 times 10% -- I prefer 30% is only 20% away :-) when altering some of the underlying assumptions. -kym