Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!bloom-beacon!apple!versatc!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: 486 and 68040
Message-ID: <18201@winchester.mips.COM>
Date: 27 Apr 89 01:54:25 GMT
References: <17131@cup.portal.com> <12435@reed.UUCP> <3913@mipos3.intel.com> <17999@winchester.mips.COM> <3975@mipos3.intel.com>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 52

In article <3975@mipos3.intel.com> kds@blabla.UUCP (Ken Shoemaker) writes:
(description of 486 stuff)
Good comments; thanks.  At least some of my guesses were right :-)

>of this queue.  Because of the size of the queue and speed it is filled from
>the cache, the amount of instruction/data conflicts with the cache are
>relatively small.  However, best performance is achieved if branch and
>especially subroutine jump targets are 16-byte aligned.
	This is where I'd gotten confused with the "aligned" penalties.
>Also, the fetching
>of the instructions at the target of a branch don't conflict with any data
>accesses since the "data" access slot of the branch instruction is taken by
>a speculative access of the instructions at the target of the branch....
>....  We don't, because of the 32 byte queue and that we fetch an
>average of 4 instructions every clock the instruction fetcher gets access to
>the cache.
	Can you say anything about the actual conflict penalties, i.e., the
	percentage of time a load or store stalls because of this?  As a
	gross guess, one would say 25% of the time, but it wouldn't surprise
	me if the number were lower than that, given the things that could
	be done.
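
	One way a gross guess like 25% could be arrived at, as a minimal
	sketch only.  The assumptions are mine, not anything stated about
	the 486: the pipeline consumes about one instruction per clock, the
	fetcher gets about 4 instructions per cache access (the figure
	quoted above), and a data access is equally likely to land on any
	clock.  The function name is hypothetical.

```python
def naive_conflict_fraction(instrs_per_fetch: float,
                            instrs_per_clock: float = 1.0) -> float:
    """Fraction of clocks in which the instruction fetcher wants the cache.

    Under the (assumed) independence of fetch and data accesses, this is
    also the chance that a given load or store collides with a fetch.
    """
    if instrs_per_fetch <= 0:
        raise ValueError("instrs_per_fetch must be positive")
    # The fetcher needs one cache access per `instrs_per_fetch` instructions
    # consumed; it can never need the cache more than every clock.
    return min(1.0, instrs_per_clock / instrs_per_fetch)


print(naive_conflict_fraction(4))  # 0.25
```

	Anything that lets the fetcher slip its access into an idle cache
	slot, such as the 32-byte queue, would push the real number below
	this estimate.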

>There is a fifth answer which wasn't advanced, however (and probably many
>more, for that matter).  The one I'd like to mention is that the pipelines
>are organized differently....
>....  On the i486, on the other hand, you don't get load delay
>slots, and you don't get deferred branches.  You also get a two stage
>instruction decode.  This means that you can run the memory cycle one clock
>earlier with respect to the execution stage in the pipeline than you can on
>most risc machines because the execution stage is one clock later in the
>pipeline.  Thus no load delay slot.  This also means that you take another 
>clock on branches taken, which is why a branch taken on the i486 requires 
>3 clocks, whereas on most risc machines it takes 2 clocks (the second 
>being the branch delay slot).  We think that this is a good tradeoff, since
>we need the extra clock to decode the instructions anyway....
Yes, certainly a good tradeoff; loads are more frequent than branches.
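
The arithmetic behind that can be sketched roughly as follows.  This is
a back-of-the-envelope model, not a description of either pipeline: the
one extra clock per taken branch comes from the 3-versus-2 figure quoted
above, while the instruction-mix numbers in the example are placeholders
I made up, not measurements.  The function and parameter names are
hypothetical.

```python
def net_clocks_saved_per_instr(load_freq: float,
                               clocks_saved_per_load: float,
                               taken_branch_freq: float,
                               extra_clocks_per_taken_branch: float = 1.0) -> float:
    """Net clocks saved per instruction by trading a load delay for a
    slower taken branch.  Positive means the trade pays off.

    load_freq                     -- loads as a fraction of all instructions
    clocks_saved_per_load         -- average clocks saved per load; on a
                                     machine with a load delay slot this is
                                     roughly the fraction of slots left unfilled
    taken_branch_freq             -- taken branches as a fraction of instructions
    extra_clocks_per_taken_branch -- 3 - 2 = 1, per the quoted figures
    """
    gain = load_freq * clocks_saved_per_load
    loss = taken_branch_freq * extra_clocks_per_taken_branch
    return gain - loss


# Placeholder mix, NOT measured data: 25% loads, every load saves a full
# clock, 12.5% taken branches.
print(net_clocks_saved_per_instr(0.25, 1.0, 0.125))  # 0.125
```

The trade breaks even when load_freq * clocks_saved_per_load equals
taken_branch_freq * extra_clocks_per_taken_branch, so the answer depends
on how often the load delay would actually have cost a clock.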
>
>>Can somebody who knows (KS?)  say anything about 3); in particular, there's
>>a note in EETimes article (April 17, p. 36) about "aligned instruction
>>access: 3-clock penalty for nonalignment"  (which sounds like a branch to
>>something not aligned on a quad-word boundary costs 3 cycles?)
>
>This has nothing to do with branches.  The i486 supports accesses to
>non-aligned object in memory, ....
From your comment above about subroutine calls to 16-byte-aligned targets,
it sounds like the article may have mixed the two things together
(branch-target alignment and accesses to non-aligned objects).
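
For the branch-target half of that, here is a minimal sketch of why a
16-byte-aligned target would help.  It assumes, hypothetically, that the
first fetch after a taken branch delivers one naturally aligned 16-byte
block; nothing above confirms that detail, and the names are my own.  It
says nothing about the penalty for non-aligned data, which is a separate
matter.

```python
FETCH_BLOCK = 16  # bytes; taken from the "16-byte aligned" remark quoted above


def is_target_aligned(addr: int, block: int = FETCH_BLOCK) -> bool:
    """True if a branch target sits at the start of a fetch block."""
    return addr % block == 0


def useful_bytes_in_first_block(addr: int, block: int = FETCH_BLOCK) -> int:
    """Bytes at or after the target within its (assumed) aligned fetch block.

    Bytes before the target in the same block are of no use to the
    instruction stream that starts at the target.
    """
    return block - (addr % block)


print(is_target_aligned(0x1000), useful_bytes_in_first_block(0x1000))  # True 16
print(is_target_aligned(0x100C), useful_bytes_in_first_block(0x100C))  # False 4
```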

I'll look forward to the further postings, especially on the pipeline.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086