Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!csd4.milw.wisc.edu!bionet!ames!oliveb!intelca!mipos3!blabla!kds
From: kds@blabla.intel.com (Ken Shoemaker)
Newsgroups: comp.arch
Subject: Re: 486 and 68040
Message-ID: <3975@mipos3.intel.com>
Date: 25 Apr 89 23:18:18 GMT
References: <17131@cup.portal.com> <12435@reed.UUCP> <3913@mipos3.intel.com> <17999@winchester.mips.COM>
Sender: news@mipos3.intel.com
Reply-To: kds@blabla.UUCP (Ken Shoemaker)
Organization: Santa Clara Microprocessor Division, Intel Corp., Santa Clara, CA
Lines: 111

In article <17999@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <3913@mipos3.intel.com> kds@blabla.UUCP (Ken Shoemaker) writes:
>>In article <12435@reed.UUCP> mdr@reed.UUCP (Mike Rutenberg) writes:
>>
>> blah blah blah
>>
>All of this is indeed impressive (really).  Having been out of the country
>awhile, I may have missed some things; I am curious about one thing:
>how is it that, with apparently the same technology:
>
>	the i860, with split I & D caches (2-set assoc), and a RISC-style
>	instruction set,
>	has a 1-cycle stall following a load if the data is referenced,
>and
>	the 486, with a joint cache (4-set), and more complex decoding,
>	has no such stall.
>
>The potential answers would appear to be:
>	1) the i860 folks screwed up, and didn't take advantage of the
>	same cache technology.

Not too likely.

>	2) The i860 folks were aiming for a higher potential clock rate,
>	and although they could have built no-stall loads at 33MHz, they
>	couldn't at 40/50, and so built it to go with coming cycle-time
>	improvements, whereas the 486 folks didn't, or weren't aiming for
>	as high eventual clock-rates.

Not that either.

>	3) The 486 claims of 1-cycle loads included zero impact for
>	instruction-fetching (from the joint cache).  (likewise on stores,
>	pops,pushes, etc).  Note, of course, that we all beat up SPARC
>	implementations for having a 2-cycle load / 3-cycle store for
>	a similar (although not identical) reason......

Well, data accesses have higher priority to the cache than instruction
accesses.  Instruction accesses happen 16 bytes at a time, and fill up a
32-byte circular instruction queue.  The actual instruction decoder works
out of this queue.  Because of the size of the queue and the speed at which
it is filled from the cache, the number of instruction/data conflicts for
the cache is relatively small.  However, best performance is achieved if
branch and especially subroutine jump targets are 16-byte aligned.  Also,
fetching the instructions at the target of a branch doesn't conflict with
any data accesses, since the "data" access slot of the branch instruction
is taken by a speculative access of the instructions at the target of the
branch.  The comparison with the SPARC isn't especially relevant, since
they have only a single 32-bit path to memory, i.e., cache, and need access
to that path to fetch an instruction every clock they are going to execute
a new instruction.  We don't, because of the 32-byte queue and because we
fetch an average of 4 instructions every clock the instruction fetcher gets
access to the cache.
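
A rough way to see why the conflicts stay small is to simulate the queue.
The sketch below is a toy model, not the actual i486 fetch logic.  Only the
16-byte fetch, the 32-byte queue, and the priority of data over instruction
accesses come from the description above; the function name, the 3-byte
average instruction length, and the one-instruction-per-clock decode rate
are assumptions made up for illustration.

```python
def simulate_prefetch(data_access_pattern, avg_insn_bytes=3,
                      queue_bytes=32, fetch_bytes=16):
    """Toy model: count clocks on which the decoder finds too little to decode.

    data_access_pattern holds one boolean per clock; True means a data
    access owns the cache on that clock (data has priority over fetch).
    avg_insn_bytes is an assumed figure, not a measured one.
    """
    queued = 0    # bytes currently waiting in the instruction queue
    starved = 0   # clocks on which the decoder could not get an instruction
    for data_has_cache in data_access_pattern:
        # The fetcher gets the cache only when no data access wants it
        # and the queue has room for a whole 16-byte block.
        if not data_has_cache and queued + fetch_bytes <= queue_bytes:
            queued += fetch_bytes
        # The decoder then tries to consume one average-length instruction.
        if queued >= avg_insn_bytes:
            queued -= avg_insn_bytes
        else:
            starved += 1
    return starved


# A data access on every other clock, for 100 clocks.
every_other = [clock % 2 == 1 for clock in range(100)]
print(simulate_prefetch(every_other))
```

Under these assumptions the decoder never starves even when a data access
takes the cache on every other clock, because one 16-byte fetch covers
several clocks of decoding.  Real instruction mixes and branch behavior
will differ.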

>	4) Somehow, the cache speed is so fast that there is plenty of
>	time to do everything, i.e., the critical paths are elsewhere.

The cache access path isn't the most critical path on the chip.

There is a fifth answer which wasn't advanced, however (and probably many
more, for that matter).  The one I'd like to mention is that the pipelines
are organized differently.  In most RISC machines, you have a load delay
slot and a branch delay slot.  Both give you an idle clock that you attempt
to fill in with something that doesn't have anything to do with the branch
or the load.  On the i486, on the other hand, you don't get load delay
slots, and you don't get deferred branches.  You also get a two-stage
instruction decode.  This means that you can run the memory cycle one clock
earlier with respect to the execution stage in the pipeline than you can on
most RISC machines, because the execution stage is one clock later in the
pipeline.  Thus no load delay slot.  This also means that you take another
clock on taken branches, which is why a taken branch on the i486 requires
3 clocks, whereas on most RISC machines it takes 2 clocks (the second
being the branch delay slot).  We think that this is a good tradeoff, since
we need the extra clock to decode the instructions anyway, and it also
improves the performance of all that object code out there for the x86
architecture which isn't going to get recompiled to take advantage of the
load delay slot if it were there.  This is simplified, and probably isn't
very clear.  I will try to put together a longer description of the i486
pipeline sometime and post it on the network.  In the meantime, the April
and May issues of Michael Slater's Microprocessor Report should have most 
of the gory details in John Wharton's articles.  Should have pictures and
diagrams and all that stuff!
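
The tradeoff can be put into a back-of-the-envelope cost model.  This is
only a sketch: the 3-clock and 2-clock taken-branch figures and the presence
or absence of a load delay come from the text above, while the event names,
the one-clock cost for everything else, and the assumption that RISC delay
slots go unfilled (as they would for code that is never rescheduled) are
illustrative choices, not measurements.

```python
# Per-event costs in clocks.  Branch and load figures follow the post;
# charging every other instruction one clock is an assumption.
I486 = {"base": 1, "load_use_stall": 0, "taken_branch": 3}
RISC = {"base": 1, "load_use_stall": 1, "taken_branch": 2}


def clocks(trace, machine):
    """Sum clocks for a trace of 'op', 'load_use' and 'taken_branch' events."""
    total = 0
    for event in trace:
        if event == "taken_branch":
            total += machine["taken_branch"]
        elif event == "load_use":
            # A load whose result is needed by the very next instruction.
            total += machine["base"] + machine["load_use_stall"]
        else:
            total += machine["base"]
    return total


# Hypothetical trace: two load-then-use pairs, one plain op, one taken branch.
trace = ["load_use", "op", "load_use", "taken_branch"]
print(clocks(trace, I486), clocks(trace, RISC))
```

In this model the i486 organization wins whenever unfilled load delays
outnumber taken branches, and loses otherwise.  A compiler that fills the
delay slots changes the RISC side of the comparison.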

>Can somebody who knows (KS?)  say anything about 3); in particular, there's
>a note in EETimes article (April 17, p. 36) about "aligned instruction
>access: 3-clock penalty for nonalignment"  (which sounds like a branch to
>something not aligned on a quad-word boundary costs 3 cycles?)

This has nothing to do with branches.  The i486 supports accesses to
non-aligned objects in memory, just like all other x86 machines.  You will
get better performance if you keep all your objects in memory aligned.  That
is all it means.  The i486 also adds a segment attribute that will cause the
processor to trap all unaligned accesses, however.  You can use this to make
sure that you don't have any of these, whether to ensure "portability" of
your databases with most RISC processors, to ensure that you are getting the
most performance from your application, to give you cheap run-time tag
checking, etc.
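
What the trap amounts to can be sketched in software.  This is a model of
the idea only, not of the i486 mechanism: the exception class, the function
names, and the trap_unaligned switch are hypothetical, and objects are
assumed to have power-of-two sizes with natural alignment.

```python
class UnalignedAccess(Exception):
    """Stands in for the processor trap (hypothetical name)."""


def is_aligned(address, size):
    """True if a size-byte object (size a power of two) is naturally aligned."""
    return address & (size - 1) == 0


def checked_access(address, size, trap_unaligned=False):
    """Classify an access; raise instead if trapping is switched on."""
    if is_aligned(address, size):
        return "aligned"
    if trap_unaligned:
        raise UnalignedAccess(f"{size}-byte access at {address:#x}")
    return "unaligned"   # still works without the trap, just slower


print(checked_access(0x1000, 4))   # 4-byte object on a 4-byte boundary
print(checked_access(0x1002, 4))   # straddles a boundary, tolerated by default
```

The tag-checking use presumably relies on keeping a type tag in the low
address bits, so that going through a pointer with the wrong tag produces an
unaligned address and therefore a trap.  That reading is an interpretation,
not something spelled out above.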

>Also, can anybody say anything about the cache-access, i.e., to get
>16 bytes in one cycle, it presumably has a 128-bit bus to the decode unit.
>(Does it?  or is it 2 8-byte accesses per pre-fetch? I'd guess 1 16-byte
>access, but I haven't seen anything yet that says one way or another.)

I think this is covered above: 128 bits in one clock.  You want to use as
much of this as possible, especially at the target of a branch, so you want
to try to align your branch targets on 16-byte boundaries.
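
To put numbers on the alignment advice, here is a small sketch.  It assumes
that each cache access delivers one aligned 16-byte block, which the advice
implies but the text does not state outright; the function names are
hypothetical.

```python
FETCH_BYTES = 16   # assumed: one cache access returns an aligned 16-byte block


def useful_bytes_at_target(target):
    """Bytes of the first fetched block that lie at or after the target."""
    return FETCH_BYTES - (target % FETCH_BYTES)


def align_up(address, boundary=FETCH_BYTES):
    """Round address up to the next multiple of boundary (a power of two)."""
    return (address + boundary - 1) & ~(boundary - 1)


for target in (0x1000, 0x1008, 0x100F):
    print(hex(target), useful_bytes_at_target(target), hex(align_up(target)))
```

A target at offset 15 within a block gets only one useful byte from its
first fetch, while an aligned target gets all 16.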
---------------
I've decided to take George Bush's advice and watch his press conferences
	with the sound turned down...			-- Ian Shoales
Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|pur-ee|hacgate|oliveb}!intelca!mipos3!kds