Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!csd4.milw.wisc.edu!bionet!ames!oliveb!intelca!mipos3!blabla!kds
From: kds@blabla.intel.com (Ken Shoemaker)
Newsgroups: comp.arch
Subject: Re: 486 and 68040
Message-ID: <3975@mipos3.intel.com>
Date: 25 Apr 89 23:18:18 GMT
References: <17131@cup.portal.com> <12435@reed.UUCP> <3913@mipos3.intel.com> <17999@winchester.mips.COM>
Sender: news@mipos3.intel.com
Reply-To: kds@blabla.UUCP (Ken Shoemaker)
Organization: Santa Clara Microprocessor Division, Intel Corp., Santa Clara, CA
Lines: 111

In article <17999@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <3913@mipos3.intel.com> kds@blabla.UUCP (Ken Shoemaker) writes:
>>In article <12435@reed.UUCP> mdr@reed.UUCP (Mike Rutenberg) writes:
>>
>> blah blah blah
>>
>All of this is indeed impressive (really).  Having been out of the country
>awhile, I may have missed some things; I am curious about one thing:
>how is it that, with apparently the same technology:
>
>    the i860, with split I & D caches (2-set assoc), and a RISC-style
>    instruction set,
>    has a 1-cycle stall following a load if the data is referenced,
>and
>    the 486, with a joint cache (4-set), and more complex decoding,
>    has no such stall.
>
>The potential answers would appear to be:
>    1) the i860 folks screwed up, and didn't take advantage of the
>    same cache technology.

Not too likely

>    2) The i860 folks were aiming for a higher potential clock rate,
>    and although they could have built no-stall loads at 33MHz, they
>    couldn't at 40/50, and so built it to go with coming cycle-time
>    improvements, whereas the 486 folks didn't, or weren't aiming for
>    as high eventual clock-rates.

Not that either

>    3) The 486 claims of 1-cycle loads included zero impact for
>    instruction-fetching (from the joint cache). (likewise on stores,
>    pops,pushes, etc).
>    Note, of course, that we all beat up SPARC
>    implementations for having a 2-cycle load / 3-cycle store for
>    a similar (although not identical) reason......

Well, data accesses have higher priority to the cache than instruction
accesses.  Instruction accesses happen 16 bytes at a time, and fill up
a 32-byte circular instruction queue.  The actual instruction decoder
works out of this queue.  Because of the size of the queue and the
speed at which it is filled from the cache, the number of
instruction/data conflicts with the cache is relatively small.
However, best performance is achieved if branch and especially
subroutine jump targets are 16-byte aligned.  Also, fetching the
instructions at the target of a branch doesn't conflict with any data
accesses, since the "data" access slot of the branch instruction is
taken by a speculative access of the instructions at the target of the
branch.

The comparison with the SPARC isn't especially relevant, since they
have only a single 32-bit path to memory, i.e., cache, and need access
to that path to fetch an instruction every clock they are going to
execute a new instruction.  We don't, because of the 32-byte queue and
because we fetch an average of 4 instructions every clock the
instruction fetcher gets access to the cache.

>    4) Somehow, the cache speed is so fast that there is plenty of
>    time to do everything, i.e., the critical paths are elsewhere.

The cache access path isn't the most critical path on the chip.

There is a fifth answer which wasn't advanced, however (and probably
many more, for that matter).  The one I'd like to mention is that the
pipelines are organized differently.  In most RISC machines, you have a
load delay slot and a branch delay slot.  Both give you an idle clock
that you attempt to fill in with something that doesn't have anything
to do with the branch or the load.  On the i486, on the other hand, you
don't get load delay slots, and you don't get deferred branches.  You
also get a two-stage instruction decode.
This means that you can run the memory cycle one clock earlier with
respect to the execution stage in the pipeline than you can on most
RISC machines, because the execution stage is one clock later in the
pipeline.  Thus no load delay slot.  This also means that you take
another clock on taken branches, which is why a taken branch on the
i486 requires 3 clocks, whereas on most RISC machines it takes 2 clocks
(the second being the branch delay slot).  We think that this is a good
tradeoff, since we need the extra clock to decode the instructions
anyway, and it also improves the performance of all that object code
out there for the x86 architecture which isn't going to get recompiled
to take advantage of the load delay slot if it were there.

This is simplified, and probably isn't very clear.  I will try to put
together a longer description of the i486 pipeline sometime and post it
on the network.  In the meantime, the April and May issues of Michael
Slater's Microprocessor Report should have most of the gory details in
John Wharton's articles.  Should have pictures and diagrams and all
that stuff!

>Can somebody who knows (KS?) say anything about 3); in particular, there's
>a note in EETimes article (April 17, p. 36) about "aligned instruction
>access: 3-clock penalty for nonalignment" (which sounds like a branch to
>something not aligned on a quad-word boundary costs 3 cycles?)

This has nothing to do with branches.  The i486 supports accesses to
non-aligned objects in memory, just like all other x86 machines.  You
will get better performance if you keep all your objects in memory
aligned.  That is all it means.  The i486 also adds a segment attribute
that will cause the processor to trap all unaligned accesses, however.
You can use this to make sure that you don't have any of these: to
ensure "portability" of your databases with most RISC processors, to
ensure that you are getting the most performance from your application,
to give you cheap run-time tag checking, etc.
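
For anyone who wants to check their own data layouts, here is a minimal
sketch of the two alignment tests discussed in this post.  It assumes
"aligned" means natural alignment (the address is a multiple of the
object's size, with the size a power of two), which the post does not
spell out.  The function names are made up for illustration, and
nothing here models the actual i486 trap mechanism or its clock counts.

```python
def is_naturally_aligned(address: int, size: int) -> bool:
    """Return True if an object of `size` bytes at `address` starts on a
    multiple of its own size (assumed definition of "aligned")."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a positive power of two")
    return address & (size - 1) == 0


def align_up(address: int, boundary: int = 16) -> int:
    """Round a non-negative `address` up to the next multiple of
    `boundary`, e.g. to place a branch target on a 16-byte boundary."""
    if boundary <= 0 or boundary & (boundary - 1):
        raise ValueError("boundary must be a positive power of two")
    return (address + boundary - 1) & ~(boundary - 1)


if __name__ == "__main__":
    # A 4-byte object at 0x1002 is not naturally aligned; at 0x1004 it is.
    print(is_naturally_aligned(0x1002, 4), is_naturally_aligned(0x1004, 4))
    # A branch target at 0x1001 would be moved up to 0x1010.
    print(hex(align_up(0x1001)))
```

An assembler's alignment directive does the same rounding as align_up
when it pads before a label; the helper is only meant to make the
arithmetic explicit.
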
>Also, can anybody say anything about the cache-access, i.e., to get
>16 bytes in one cycle, it presumably has a 128-bit bus to the decode unit.
>(Does it? or is it 2 8-byte accesses per pre-fetch?  I'd guess 1 16-byte
>access, but I haven't seen anything yet that says one way or another.)

I think this is covered above.  128 bits in one clock.  You want to use
as much of this as possible, especially at the target of a branch, so
you want to try to 16-byte align your branch targets.
---------------
I've decided to take George Bush's advice and watch his press conferences
	with the sound turned down...			-- Ian Shoales
Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|pur-ee|hacgate|oliveb}!intelca!mipos3!kds