Path: utzoo!mnetor!tmsoft!torsqnt!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!rpi!batcomputer!munnari.oz.au!metro!cluster!rex From: rex@cs.su.oz (Rex Di Bona) Newsgroups: comp.arch Subject: Re: standard extensions Message-ID: <2134@cluster.cs.su.oz.au> Date: 27 Feb 91 10:50:33 GMT References: <1991Feb25.135057.23667@linus.mitre.org> <1991Feb25.201406.18643@bingvaxu.cc.binghamton.edu> <2124@cluster.cs.su.oz.au> <1991Feb27.034955.28860@bingvaxu.cc.binghamton.edu> Sender: news@cluster.cs.su.oz.au Reply-To: rex@cluster.cs.su.oz (Rex Di Bona) Organization: Basser Dept of Computer Science, University of Sydney, Australia Lines: 83 In article <1991Feb27.034955.28860@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes: > Since I am not entirely clear as to what the article boils down to, apart > from ``kym, you are _WAY_ wrong'' Sorry, but; yes. The time for a 'divide' in what I would consider a general purpose machine would out weigh (by a large factor) the time taken for more normal instructions... > I ran a program on a Sun3/60 created from the loop given last time with and > without adding an extra signed divide (that I made certain was not removed by > the optimizer) and found there was a `slow down' of about 30% over 100 reps > (arithmetic mean). So, for each iteration it was 30% for that extra divide, ie, a divide was taking 42% of the time for the loop??? > Considering how far a Sun was from the model I indicated in my previous post, > I'm a bit amazed it didn't diverge more (from Rex Di Bona's post and other > private email I had been expecting maybe orders of magnitude). > > -kym Oh, I don't know, having a divide be so long is about what I figured... Your loops are not as long as you might think, as they have those if statements. Here are the instruction counts... For 68K For R3000 the initial if comparison, 6 6 the *zp++ = M-*yp 5 13 the else if compare. 6 6 the *zp++ = M-*xp 6 11 the fluff for the final else 5 18 Stalls - 10 (? I'm not sure) Anyway, looking at how these branches are used, we find (on a sample run of 100 program executions) the two (non division) branches are executed a total of 0 (yes, zero) times, and the division is executed each time. (Note 1, below) So we have a loop that is 6+6+5 = 17 (and the division) instructions long.... if adding a division increases the time by 30% we can conclude that a division is (roughly) equal to 13 (If 17 = 57.% then 42.% is 13) instructions???? (if we assume the divide takes 78 cycles, this works out at an average of 6 cycles for a 68K instruction, not too unreasonable?) On the R3000 we have (6+6+18)*1.2+10 = 46, for a divide of about 19.3 cycles.. again, not too unreasonable? Note 1: If you look at your program the conditions to take the top two decision traces depend _totally_ upon random numbers being equal... You do _not_, I repeat _not_ change the seed to this random number generator between runs, so it _always_ produces the same sequence. This is a bug(feature??) of rand()? So, you would always be executing _exactly_ the same code, the differences you measured were in fact the instabilities in the kernel timing of your program :-( > > P.S. my `back of the envelope' calcs using the above result indicates a real > 68k signed divide in a similar loop as before will contribute 10% of the > total at 68 instructions ^^^^^^^^^^^^ - Instructions are a bad measure for a CISC, as the time each takes varies so much. Um, lets see... (assumption ... an instruction is 6 cycles) we get that we need 703 additional instructions (apart from the divide) to make adding an extra divide only 10% more overhead... 68 instructions (68 * 6 = 408 cycles) means an extra 16% overhead. I ran your program on a MIPS RC3230 and got very similar numbers, which also supports my premis. The additional divide was taking 20 cycles, which _completely_ overwhelms the rest of the loop, so adding in that extra divide is taking a _big_ performance hit! What the important thing is is the _relative_ speed of divide to more normal (move, simple alu) instructions. When the divide instruction is much greater than the cost of a 'move' instruction you have to have _a lot_ of move instructions to compenstate for that extra divide. Unless you have these you lose big. In your case, you have a relatively small number of 'other' instructions (17 in the 68020, and 30 in the R3000) which ment that the divide instruction was almost equal in time to all these instructions! (This is why it is easier when multiplying by a constant to do the shift and adds instead of the multiply instruction (SPARC is except from this example :-) -------- Rex di Bona (rex@cs.su.oz.au) Penguin Lust is NOT immoral