Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!mit-eddie!uw-beaver!tektronix!cae780!amdcad!amd!intelca!intsc!tomk
From: tomk@intsc.UUCP
Newsgroups: comp.arch
Subject: Re: Re: recent 386 timings from Intel
Message-ID: <929@intsc.UUCP>
Date: Thu, 9-Apr-87 19:54:46 EST
Article-I.D.: intsc.929
Posted: Thu Apr  9 19:54:46 1987
Date-Received: Sat, 11-Apr-87 15:07:23 EST
References: <221@winchester.mips.UUCP> <2130@intelca.UUCP> <1946@hoptoad.uucp>
Organization: Intel Sales, Silicon Valley, Ca.
Lines: 66

> In article <2130@intelca.UUCP>, clif@intelca.UUCP (Clif Purkiser) writes:
> >                  The system we used to run was an Intel Multibus I system 
> > running Unix System V Release 3.0.  The CPU board was a 386/24 
> > MultiBus I which has a 64 Kbyte direct-mapped write-through
> > cache and 2-3 wait states for cache misses.
And John Gilmore replies:
> 
> Hmm, let's make sure:  cache hits run with 0 wait states, cache misses
> run with 2-3 wait states?  I'm curious about the construction of such a
> cache.  What is the basic cycle time of the machine, and how many
> cycles does a cache hit take?  Is it accessing main memory over the
> Multibus, or on a local bus?  Is main memory static ram, or dynamic?

The Multibus I board used for the measurements is a standard production
board.  It has 64K bytes of direct-mapped cache built from 45ns data RAMs and
35ns tag RAMs.  The DRAMs are the 120ns access time variety.  They could have
been 150's, but 120's are what we buy a lot of.  The DRAM is local on the
CPU board and is dual-ported to the Multibus.  The 386/20 board (that's
what we are talking about) will support up to 16MB of DRAM (the Multibus I limit).
The DRAM cycles are not started until after a cache miss is detected.
The first access on a cache miss incurs 3 wait states.  When a cache
miss occurs, the CPU is switched into pipelined address mode, and any
subsequent misses incur 2 wait states.  When a cache hit occurs again,
the CPU resumes operating in non-pipelined address mode.
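
For anyone who wants to play with the numbers, the hit/miss behavior described
above can be sketched as a small state machine.  This is only an illustrative
model, not anything taken from the board: the function names are made up, and
it assumes a cache hit costs 0 wait states and that "subsequent" means the
misses immediately following the first one, before the next hit.

```python
def wait_states(accesses):
    """Return the wait states for each access in a hit/miss trace.

    accesses: iterable of booleans, True for a cache hit, False for a miss.
    Assumed model: hit = 0 wait states, first miss = 3, and each further
    miss before the next hit = 2 (pipelined address mode).
    """
    pipelined = False  # CPU starts in non-pipelined address mode
    result = []
    for hit in accesses:
        if hit:
            result.append(0)
            pipelined = False  # a hit drops back to non-pipelined mode
        elif pipelined:
            result.append(2)   # miss while already in pipelined mode
        else:
            result.append(3)   # first miss pays the full penalty
            pipelined = True
    return result


def average_wait_states(accesses):
    """Average wait states per access over a hit/miss trace."""
    per_access = wait_states(accesses)
    return sum(per_access) / len(per_access)
```

Feeding it a trace such as hit, miss, miss, hit, miss gives 0, 3, 2, 0, 3 wait
states under these assumptions.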

With this setup we have measured an average of 0.7 wait states running UNIX
OS code.

The basic bus cycle time of the machine is 2 CPU clocks.  At 16MHz that is
125ns, 100ns @ 20MHz.  Each wait state adds 62.5ns @ 16MHz and 50ns @ 20MHz.
The basic instruction execution time is 4.5 clocks on the average with some
magical instruction mix (details available on request).  Adding a wait state
slows down execution approx. 20%.
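
The timing arithmetic can be checked with a few lines of code.  This is a rough
sketch only: the function names are invented for illustration, and it assumes
each wait state is exactly one extra CPU clock on top of the 2-clock bus cycle.

```python
def clock_ns(mhz):
    """Length of one CPU clock in nanoseconds at the given clock rate."""
    return 1000.0 / mhz


def bus_cycle_ns(mhz, wait_states=0.0):
    """Bus cycle time in ns: 2 CPU clocks plus one clock per wait state.

    wait_states may be fractional to represent a measured average.
    """
    return (2 + wait_states) * clock_ns(mhz)


if __name__ == "__main__":
    for mhz in (16, 20):
        print(mhz, "MHz: clock", clock_ns(mhz), "ns,",
              "0-wait bus cycle", bus_cycle_ns(mhz), "ns,",
              "at 0.7 average wait states", bus_cycle_ns(mhz, 0.7), "ns")
```

Passing the measured average of 0.7 wait states gives an effective average bus
cycle time under the same one-clock-per-wait-state assumption.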

For those curious about the compiler: the benchmarks were run with the
Green Hills C compiler with the optimization switch OFF.  The Green Hills
technology does a lot of optimization even without the -O switch, so it
is hard to tell how badly it destroys the inner loops of the Dhrystone
benchmark.  The other side of the coin, though, is that they do the same type
of optimizations on the other machines.  Again, compare systems, not CPUs.
This is also why I always tell anyone interested in the 386 to come in
with their favorite benchmark and run it on the box I have.  So far the
only place the 25MHz 68020's have beaten the 16MHz 386 is when the main
loop of the code fits in 256 bytes.


------
"Ever notice how your mental image of someone you've 
known only by phone turns out to be wrong?  
And on a computer net you don't even have a voice..."

  tomk@intsc.UUCP  			Tom Kohrs
					Regional Architecture Specialist
		   			Intel - Santa Clara

PS: John, there will be a 386/20 manual in the mail to you as soon as I can
find one.