Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!tut.cis.ohio-state.edu!pt.cs.cmu.edu!b.gp.cs.cmu.edu!Ralf.Brown@B.GP.CS.CMU.EDU
From: Ralf.Brown@B.GP.CS.CMU.EDU
Newsgroups: comp.sys.ibm.pc
Subject: Re: Ami Bios and the V20
Message-ID: <25655921@ralf>
Date: 18 Nov 89 12:29:05 GMT
Sender: ralf@b.gp.cs.cmu.edu
Organization: Carnegie Mellon University School of Computer Science
Lines: 57
In-Reply-To: <206900137@prism>

In article <206900137@prism>, rob@prism.TMC.COM wrote:
>    It's sort of surprising how little difference a V20 makes. As another
>note mentioned, it also claims to drastically speed up (by a factor of 3
>to 6) effective address calculation, which, unlike integer multiplies and
>divides, is a real factor in most code. Yet a test program I wrote that
>simply looped around a bunch of statements like
>
>    MOV AX, [BX+SI+2]
>
>showed a speedup of only about 20%, as I recall. Looping overhead was
>clearly a consideration (the V20 doesn't claim to speed up loops
>significantly), but I still expected a larger gain.

A major problem is that the 8088 and V20 are bus-bound. Any instruction
that executes in less than four clock cycles per byte will drain the
four-byte instruction prefetch queue. Once the prefetch queue is empty,
instructions run only as fast as they can be fetched from memory (at one
byte every four clock cycles). Since every branch empties the prefetch
queue (and the instructions at the destination may not let it refill), the
prefetch queue spends a significant percentage of the time empty.

For example, the sequence
        SHL AX,1
        SHL AX,1
        SHL AX,1
        SHL AX,1
takes eight clocks according to the official Intel instruction timings.
Unfortunately, each of these instructions is two bytes long, so it takes
eight clocks to fetch each instruction.
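
The arithmetic behind that comparison can be sketched in a few lines of
Python. This is only a back-of-the-envelope check, not a timing model: the
function names are made up for illustration, and the only figures used are
the ones quoted above (two clocks and two bytes per SHL AX,1, four clocks
per byte fetched).

```python
BUS_CLOCKS_PER_BYTE = 4  # 8088/V20: one byte per four-clock bus cycle


def fetch_clocks(length_bytes):
    """Clocks needed just to pull an instruction's bytes over the bus."""
    return length_bytes * BUS_CLOCKS_PER_BYTE


def is_bus_bound(exec_clocks, length_bytes):
    """True if the instruction executes faster than it can be fetched,
    i.e. in fewer than four clocks per byte."""
    return exec_clocks < fetch_clocks(length_bytes)


# SHL AX,1 as described above: (execution clocks, length in bytes)
SHL_EXEC, SHL_LEN = 2, 2

nominal_total = 4 * SHL_EXEC             # official timing for four SHLs
fetch_total = 4 * fetch_clocks(SHL_LEN)  # bus time for the same eight bytes

print("bus-bound:", is_bus_bound(SHL_EXEC, SHL_LEN))
print("nominal clocks:", nominal_total)
print("fetch clocks:", fetch_total)
```

With nothing in the prefetch queue to draw on, the bus time (32 clocks) is
the figure that matters, not the nominal eight.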
Thus, the best case is when the instruction queue is full at the start of
this sequence:
        SHL AX,1    two clocks, PQ now has two bytes and is fetching a third
        SHL AX,1    two clocks, PQ now empty, third byte arrives at end
        SHL AX,1    only one byte, so start fetching next
                    four clocks later, we can start, so total is six clocks
        SHL AX,1    wait two clocks for first byte, four for second, then
                    two clocks to execute = eight clocks
        Total: 18 clocks

Worst case is when the prefetch queue is empty, with the next byte two
clocks away. Then the first three instructions each take eight clocks to
execute, and the last takes ten clocks, for a total of 34 clocks.

You should see a greater improvement when replacing an 8086 with a V30,
since they can fetch two bytes every four clocks and have a six-byte
prefetch queue, greatly reducing the bus-boundedness of the processor (the
above instruction sequence runs in eight to 16 clocks, depending on how
full the prefetch queue is at the beginning).

--
UUCP: {ucbvax,harvard}!cs.cmu.edu!ralf -=-=-=-=- Voice: (412) 268-3053 (school)
ARPA: ralf@cs.cmu.edu  BIT: ralf%cs.cmu.edu@CMUCCVMA  FIDO: Ralf Brown 1:129/46
FAX: available on request                      Disclaimer? I claimed something?
"How to Prove It" by Dana Angluin
 8. proof by wishful citation:
    The author cites the negation, converse, or generalization of a theorem
    from the literature to support his claims.
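
The hand counts above can be roughly reproduced with a toy timeline model.
The Python sketch below is an approximation under stated assumptions, not
a cycle-accurate simulator: sequence_clocks and its parameters are
hypothetical names, bus cycles are assumed to run back to back, an
instruction is assumed to start only once all of its bytes have arrived,
and queue-full stalls and 8086 word alignment are ignored.

```python
from math import ceil

BUS_CYCLE = 4  # clocks per bus cycle, as stated in the post


def sequence_clocks(instrs, preloaded, bytes_per_fetch, first_arrival):
    """Clocks to run instrs, a list of (length_bytes, exec_clocks) pairs.

    preloaded        bytes already in the prefetch queue at clock 0
    bytes_per_fetch  bytes per bus cycle (1 for 8088/V20, 2 for 8086/V30)
    first_arrival    clock at which the first in-flight fetch completes
    """
    clock = 0
    needed = 0  # total instruction bytes consumed so far
    for length, exec_clocks in instrs:
        needed += length
        missing = needed - preloaded  # bytes that must come over the bus
        if missing > 0:
            fetches = ceil(missing / bytes_per_fetch)
            ready = first_arrival + (fetches - 1) * BUS_CYCLE
        else:
            ready = 0  # everything was already in the queue
        # Start when the previous instruction is done and the bytes are in.
        clock = max(clock, ready) + exec_clocks
    return clock


SHL_X4 = [(2, 2)] * 4  # four SHL AX,1: two bytes, two clocks each

# 8088/V20, four-byte queue full at the start
print(sequence_clocks(SHL_X4, preloaded=4, bytes_per_fetch=1, first_arrival=4))
# 8086/V30, six-byte queue full at the start
print(sequence_clocks(SHL_X4, preloaded=6, bytes_per_fetch=2, first_arrival=4))
# 8086/V30, queue empty, first word two clocks away
print(sequence_clocks(SHL_X4, preloaded=0, bytes_per_fetch=2, first_arrival=2))
```

Under these assumptions the three calls give 18, 8 and 16 clocks, in line
with the figures above. For an empty queue on the 8088 the model gives 32
or 34 clocks depending on whether the first byte is taken to arrive after
two clocks or four, so treat it as a rough guide rather than an exact
count.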