Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!bloom-beacon!apple!versatc!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: 486 and 68040
Message-ID: <18201@winchester.mips.COM>
Date: 27 Apr 89 01:54:25 GMT
References: <17131@cup.portal.com> <12435@reed.UUCP> <3913@mipos3.intel.com> <17999@winchester.mips.COM> <3975@mipos3.intel.com>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 52

In article <3975@mipos3.intel.com> kds@blabla.UUCP (Ken Shoemaker) writes:
(description of 486 stuff)
good comments. thanx. at least some of my guesses were right :-)

>of this queue. Because of the size of the queue and speed it is filled from
>the cache, the amount of instruction/data conflicts with the cache are
>relatively small. However, best performance is achieved if branch and
>especially subroutine jump targets are 16-byte aligned.
This is where I'd gotten confused with the "aligned" penalties.

>Also, the fetching
>of the instructions at the target of a branch don't conflict with any data
>accesses since the "data" access slot of the branch instruction is taken by
>a speculative access of the instructions at the target of the branch....
>.... We don't, because of the 32 byte queue and that we fetch an
>average of 4 instructions every clock the instruction fetcher gets access to
>the cache.
Can you say anything about the actual conflict penalties, i.e., the
percentage of time a load or store stalls due to this? I.e., one would
grossly guess 25% of the time, but it wouldn't surprise me if the number
was lower than that, given the things that could be done.

>There is a fifth answer which wasn't advanced, however (and probably many
>more, for that matter). The one I'd like to mention is that the pipelines
>are organized differently....
>.... On the i486, on the other hand, you don't get load delay
>slots, and you don't get deferred branches. You also get a two stage
>instruction decode. This means that you can run the memory cycle one clock
>earlier with respect to the execution stage in the pipeline than you can on
>most risc machines because the execution stage is one clock later in the
>pipeline. Thus no load delay slot. This also means that you take another
>clock on branches taken, which is why a branch taken on the i486 requires
>3 clocks, whereas on most risc machines it takes 2 clocks (the second
>being the branch delay slot). We think that this is a good tradeoff, since
>we need the extra clock to decode the instructions anyway....
Yes, certainly a good tradeoff; loads are more frequent than branches.

>
>>Can somebody who knows (KS?) say anything about 3); in particular, there's
>>a note in EETimes article (April 17, p. 36) about "aligned instruction
>>access: 3-clock penalty for nonalignment" (which sounds like a branch to
>>something not aligned on a quad-word boundary costs 3 cycles?)
>
>This has nothing to do with branches. The i486 supports accesses to
>non-aligned object in memory, ....
From your comment above, re subr. calls to 16-byte aligned things, it
sounds like the article may have gotten the 2 things mixed in together.

I'll look forward to the further postings, especially on the pipeline.
--
-john mashey	DISCLAIMER:
UUCP:	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:	408-991-0253 or 408-720-1700, x253
USPS:	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
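
A rough way to put numbers on the load-delay-slot versus 3-clock-taken-branch tradeoff discussed above is sketched below. This is only a back-of-the-envelope model, not anything from either poster: the function names are made up, and the instruction frequencies passed in are hypothetical inputs rather than measurements. The one figure taken from the thread is that a taken branch costs 3 clocks on the i486 versus 2 on most RISC machines, i.e. one extra clock.

```python
def per_instruction_cost(load_freq, taken_branch_freq, unfilled_slot_fraction,
                         extra_branch_clocks=1):
    """Return (delay_slot_cost, extra_decode_cost) in clocks per instruction.

    delay_slot_cost:   pipeline WITH a load delay slot; one clock is lost
                       for each load whose slot holds no useful work.
    extra_decode_cost: pipeline with an extra decode stage and NO load
                       delay slot; each taken branch pays extra_branch_clocks
                       (3 clocks instead of 2 in the thread, so 1 by default).

    All frequencies are fractions of executed instructions and are
    assumptions supplied by the caller, not measured data.
    """
    delay_slot_cost = load_freq * unfilled_slot_fraction * 1
    extra_decode_cost = taken_branch_freq * extra_branch_clocks
    return delay_slot_cost, extra_decode_cost


def break_even_unfilled_fraction(load_freq, taken_branch_freq,
                                 extra_branch_clocks=1):
    """Fraction of unfilled load delay slots at which both costs are equal."""
    if load_freq <= 0:
        raise ValueError("load_freq must be positive")
    return taken_branch_freq * extra_branch_clocks / load_freq


if __name__ == "__main__":
    # Hypothetical mix: 25% loads, 10% taken branches, no delay slots filled.
    slot_cost, decode_cost = per_instruction_cost(0.25, 0.10, 1.0)
    print("load delay slot cost per instruction:", slot_cost)
    print("extra branch clock cost per instruction:", decode_cost)
    print("break-even unfilled fraction:",
          break_even_unfilled_fraction(0.25, 0.10))
```

The model ignores whether the RISC branch delay slot itself gets filled and any cache conflicts, so it only shows how the comparison depends on how much more frequent loads are than taken branches and on how often a compiler fails to fill the load delay slot.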