Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!tut.cis.ohio-state.edu!pt.cs.cmu.edu!b.gp.cs.cmu.edu!Ralf.Brown@B.GP.CS.CMU.EDU
From: Ralf.Brown@B.GP.CS.CMU.EDU
Newsgroups: comp.sys.ibm.pc
Subject: Re: Ami Bios and the V20
Message-ID: <25655921@ralf>
Date: 18 Nov 89 12:29:05 GMT
Sender: ralf@b.gp.cs.cmu.edu
Organization: Carnegie Mellon University School of Computer Science
Lines: 57
In-Reply-To: <206900137@prism>

In article <206900137@prism>, rob@prism.TMC.COM wrote:
 >   It's sort of surprising how little difference a V20 makes. As another
 >note mentioned, it also claims to drastically speed up (by a factor of 3 
 >to 6) effective address calculation, which, unlike integer multiplies and 
 >divides, is a real factor in most code. Yet a test program I wrote that 
 >simply looped around a bunch of statements like
 >
 >                   MOV   AX, [BX+SI+2]
 >
 >showed a speedup of only about 20%, as I recall. Looping overhead was
 >clearly a consideration (the V20 doesn't claim to speed up loops
 >significantly), but I still expected a larger gain.

A major problem is that the 8088 and V20 are bus-bound.  Any instruction
that executes in fewer than four clock cycles per byte will drain the
four-byte instruction prefetch queue.  Once the prefetch queue is empty,
instructions run only as fast as they can be fetched from memory (at one
byte every four clock cycles).  Since every branch empties the prefetch
queue (and the instructions at the destination may not let it refill),
the prefetch queue spends a significant percentage of the time empty.

For example, the sequence
	SHL  AX,1
	SHL  AX,1
	SHL  AX,1
	SHL  AX,1
takes eight clocks (two per instruction) according to the official Intel
instruction timings.  Unfortunately, each of these instructions is two
bytes long, so it takes eight clocks to fetch each one.  Thus, the best
case is when the prefetch queue is full at the start of this sequence:
	SHL  AX,1    two clocks, PQ now has two bytes and is fetching a third
	SHL  AX,1    two clocks, PQ now empty, third byte arrives at end
	SHL  AX,1    only one byte queued, so wait four clocks for the second,
		     then two clocks to execute = six clocks
	SHL  AX,1    wait two clocks for first byte, four for second,
		     then two clocks to execute = eight clocks
Total: 18 clocks

Worst case is when the prefetch queue is empty, with the next byte two
clocks away.  Then the first three instructions each take eight clocks
to execute, and the last takes ten clocks, for a total of 34 clocks.

You should see a greater improvement when replacing an 8086 with a V30,
since both fetch two bytes every four clocks and have a six-byte
prefetch queue, which makes the processor far less bus-bound.  (The
above instruction sequence runs in eight to 16 clocks, depending on how
full the prefetch queue is at the beginning.)
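
The bookkeeping in the walkthroughs above can be replayed with a small
simulation.  This is only a rough sketch in Python, and the model is my
own simplification, not anything from Intel or NEC: the bus starts a
fetch whenever it is idle and the queue has room, a fetch takes four
clocks, an instruction starts only once all of its bytes are queued, and
its bytes leave the queue when it starts.  The function and parameter
names (total_clocks, fetch_left, and so on) are made up for this
illustration.

```python
def total_clocks(queue_size, fetch_bytes, queued, fetch_left,
                 n_instr=4, instr_bytes=2, exec_clocks=2, fetch_clocks=4):
    """Clocks needed to run n_instr identical instructions (toy model).

    queued     -- bytes already in the prefetch queue at clock 0
    fetch_left -- clocks left on a fetch already in flight (0 = none)
    """
    clock = 0
    exec_left = 0      # clocks left on the instruction now executing
    started = 0        # instructions started so far
    while True:
        # Execution unit idle: finish, or start the next instruction
        # if all of its bytes are already in the queue.
        if exec_left == 0:
            if started == n_instr:
                return clock
            if queued >= instr_bytes:
                queued -= instr_bytes
                exec_left = exec_clocks
                started += 1
        # Bus idle and room in the queue: begin another fetch.
        if fetch_left == 0 and queued + fetch_bytes <= queue_size:
            fetch_left = fetch_clocks
        # Advance one clock.
        clock += 1
        if exec_left > 0:
            exec_left -= 1
        if fetch_left > 0:
            fetch_left -= 1
            if fetch_left == 0:
                queued += fetch_bytes


# 8088/V20: four-byte queue, one byte per four-clock fetch, queue full.
print(total_clocks(queue_size=4, fetch_bytes=1, queued=4, fetch_left=0))
# 8086/V30: six-byte queue, two bytes per fetch, queue full.
print(total_clocks(queue_size=6, fetch_bytes=2, queued=6, fetch_left=0))
# 8086/V30: queue empty, next fetch completes in two clocks.
print(total_clocks(queue_size=6, fetch_bytes=2, queued=0, fetch_left=2))
```

Under these assumptions the three calls print 18, 8 and 16, matching
the figures above.  For the 8088 with an empty queue and the next byte
two clocks away (queued=0, fetch_left=2) the model gives 32 rather than
the 34 quoted above, so treat it as a rough bookkeeping aid and not as
a cycle-accurate model of the bus interface.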


--
UUCP: {ucbvax,harvard}!cs.cmu.edu!ralf -=-=-=-=- Voice: (412) 268-3053 (school)
ARPA: ralf@cs.cmu.edu  BIT: ralf%cs.cmu.edu@CMUCCVMA  FIDO: Ralf Brown 1:129/46
FAX: available on request                      Disclaimer? I claimed something?
"How to Prove It" by Dana Angluin
  8.  proof by wishful citation:
      The author cites the negation, converse, or generalization of a theorem
      from the literature to support his claims.