Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!samsung!munnari.oz.au!metro!cluster!rex
From: rex@cs.su.oz (Rex Di Bona)
Newsgroups: comp.arch
Subject: Re: standard extensions
Message-ID: <2124@cluster.cs.su.oz.au>
Date: 26 Feb 91 09:41:14 GMT
References: <3381.27c548c3@iccgcc.decnet.ab.com> <1991Feb25.135057.23667@linus.mitre.org> <1991Feb25.201406.18643@bingvaxu.cc.binghamton.edu>
Sender: news@cluster.cs.su.oz.au
Reply-To: rex@cluster.cs.su.oz (Rex Di Bona)
Organization: Basser Dept of Computer Science, University of Sydney, Australia
Lines: 140

In article <1991Feb25.201406.18643@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:

[ Inlining HLL calls to assembly routines ]

> For example, suppose we compare two tight loops, one with separate divide &
> modulus/remainder (whichever you need); the other with some combined code.
> Suppose there are n other instructions in each loop. Let's discard pipe
> flushing considerations and further assume all instructions take a single
> cycle.

Single cycle? For a divide instruction? On what machine?

Let's take a real example here instead (say the 68020, or the R3000), as I
have the info just behind me here. Ugh, there... let's see, the 68020,
divide, here we are:

Divide unsigned ranges from 76 cycles (best case, register to register) to a
massive 78+24 = 102 cycles for the worst case, a divide with the source
addressed as program counter memory indirect pre-indexed (PCMIP). For signed
division we add 12 cycles, so the best case is 88 cycles and the worst case
is 114 cycles. Now our move takes (best case) 0 cycles (it is totally
overlapped with other instructions), up to a worst case of 3+49 = 51 cycles
(the 49 is for the PCMIP-to-PCMIP addressing mode).

> We have, in loop I say, n+3 instructions (let's even allow another move
> instruction to get at either the quotient or remainder as they may have ended
> up in inconvenient places) and in loop II n+1 instructions.
>
> Iterate both loops N times. Loop I takes N(n+3) cycles: loop II takes N(n+1)
> cycles -- the ratio is obviously (n+3)/(n+1). This will exceed a 10%
> difference if n < 19.

Back in 68020 land...

For a move of a register to a register we can assume it is overlapped, as we
don't need the result in the next instruction, for a cost of 0 cycles.
(Add more if we move the intermediate result to memory...)

For unsigned division...

For the best case, the ratio of the combined loop to the separate div + rem
loop is

    (n*(average cycles per instruction) + 76)/(n*Ave + (76 * 2))

and the separate version is still 10% or more slower provided:

    Assumption:
        average cycles = 1,   n < 685
        average cycles = 2,   n < 343
        average cycles = 3,   n < 229
        average cycles = 4,   n < 172
        average cycles = 10,  n < 69
        average cycles = 20,  n < 35

For the worst case...

    (n*(average cycles per instruction) + 102)/(n*Ave + (102 * 2))

    Assumption:
        average cycles = 1,   n < 919
        average cycles = 2,   n < 460
        average cycles = 3,   n < 307
        average cycles = 4,   n < 230
        average cycles = 10,  n < 92
        average cycles = 20,  n < 46

If we look at the example program given in the article, we have SIGNED
division, so...

For the best case...

    (n*(average cycles per instruction) + 88)/(n*Ave + (88 * 2))

    Assumption:
        average cycles = 1,   n < 793
        average cycles = 2,   n < 397
        average cycles = 3,   n < 265
        average cycles = 4,   n < 199
        average cycles = 10,  n < 80
        average cycles = 20,  n < 40

For the worst case...

    (n*(average cycles per instruction) + 114)/(n*Ave + (114 * 2))

    Assumption:
        average cycles = 1,   n < 1027
        average cycles = 2,   n < 514
        average cycles = 3,   n < 343
        average cycles = 4,   n < 257
        average cycles = 10,  n < 103
        average cycles = 20,  n < 52
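In case anyone wants to check the arithmetic, or plug in their own cycle
counts, here is a quick C sketch that reproduces those limits. It assumes
the simple cost model above (n other instructions at Ave cycles each, plus
one divide for the combined loop or two for the separate one, moves free)
and kym's 10% criterion; the divide costs are the 68020 figures quoted
above, and everything else is purely for illustration:

/*
 * Reproduce the "n < ..." limits above.  The separate div/rem loop costs
 * n*ave + 2*D cycles per iteration, the combined loop n*ave + D, so the
 * separate one is at least 10% slower when
 *     n*ave + 2*D >= 1.1 * (n*ave + D)   =>   n <= 9*D/ave
 * where D is the divide cost in cycles.
 */
#include <stdio.h>

int main(void)
{
    static const int divcost[] = { 76, 102, 88, 114 };  /* 68020 costs above */
    static const char *label[] = {
        "unsigned, best case", "unsigned, worst case",
        "signed, best case",   "signed, worst case"
    };
    static const double ave[] = { 1, 2, 3, 4, 10, 20 };
    int i, j;

    for (i = 0; i < 4; i++) {
        printf("%s (%d cycles):\n", label[i], divcost[i]);
        for (j = 0; j < 6; j++) {
            /* largest n that still gives a 10% or better win */
            int nmax = (int)(9.0 * divcost[i] / ave[j]);
            printf("    average cycles = %2g,  n < %d\n", ave[j], nmax + 1);
        }
    }
    return 0;
}

Running it prints the same n limits as the tables above; swap other numbers
into divcost[] and ave[] (the R3000 guesses further down, say) to see how
sensitive the break-even point is.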
> So, provided there are no more than about 20 instructions in the loop in this
> obviously _idealized_ circumstance, the body of the loop will run 10% or
> more faster given the fancy combined quot/rem facility. Considering our loop
> is only _part_ of a larger program the impact on the total running time of
> the complete code will be even less marked given the facility.

Yes, but it is usually the case that cycle times for other, more normally
executed instructions are _far_ less than those for divide instructions. In
the above examples we find that we still require a large number of other
instructions before we arrive at the point where the overhead is lost in the
rest of the loop. I would assume (rough guess here) that 10 cycles per 68020
instruction is about right, taking into account cache misses and such,
perhaps less. But the lower the average cycle count for those other
instructions, the better it is to use the one instruction to do both the div
and the remainder...

(Considering the R3000, it becomes (slightly) easier. We have the following:
divide takes, hmm... let me see, oops, the Kane book doesn't seem to mention
the time taken for a divide. OK then, let's assume it is 19 cycles (the time
for an FP divide), and also try it at 12 cycles (a lower bound, the time for
a multiply). We get (assuming 1.2 cycles per instruction):

    19 cycles for a divide,  n < 143
    12 cycles for a divide,  n < 91

The 1.2 figure comes from a hazy recollection of a talk by John Mashey, where
he was going on about why it wasn't really 1.00 cycles per instruction, and
how they (MIPS) wanted the cycles per instruction to go below 1.00 :-)

> To summarize: I don't think provision of a combined divide/remainder
> (or divide/modulus for that matter) instruction will necessarily
> speed up the total running time of any real programs appreciably
> (i.e. more than, say, 10%). Perhaps Bob Silverman could illustrate
> some circumstance which contradicts this?
>
> -kym
> ===
> No sig is a good sig (this isn't one).

I hope that the above numbers might convince you that having the one
instruction generate both is better in those pesky real world situations,
using real world machines. I doubt (oops, rash prediction here) that a
division will become as cheap as a move instruction, unless something
(radically) changes the speed of the processing unit without changing the
speed of memory (in which case we'll be back at super CISC anyway).

--------
Rex di Bona (rex@cs.su.oz.au)
Penguin Lust is NOT immoral