Path: utzoo!utgpu!utstat!jarvis.csri.toronto.edu!mailrus!ncar!ames!vsi1!wyse!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: 486 and 68040
Message-ID: <17999@winchester.mips.COM>
Date: 24 Apr 89 05:30:53 GMT
References: <17131@cup.portal.com> <12435@reed.UUCP> <3913@mipos3.intel.com>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 59

In article <3913@mipos3.intel.com> kds@blabla.UUCP (Ken Shoemaker) writes:
>In article <12435@reed.UUCP> mdr@reed.UUCP (Mike Rutenberg) writes:
>>Michael Slater writes:
>>>- The degree to which clocks per instruction has been reduced. Intel's 486
>>> provides single-clock loads, stores, and moves. Assuming a cache hit,
>>> data can be used by the instruction immediately following the load, with
>>> no stall cycle at all. It remains to be seen if the 040 will do this.
>
>In addition, register to register "simple" arithmetic ops (i.e., everything
>except multiply and divide) take one clock. Pushes and pops take one clock.
>Branch-not-taken takes one clock (if taken it is 3 clocks)....

All of this is indeed impressive (really). Having been out of the country
awhile, I may have missed some things; I am curious about one thing:

how is it that, with apparently the same technology:
	the i860, with split I & D caches (2-set assoc), and a RISC-style
	instruction set, has a 1-cycle stall following a load if the data
	is referenced, and
	the 486, with a joint cache (4-set), and more complex decoding, has
	no such stall.

The potential answers would appear to be:
1) the i860 folks screwed up, and didn't take advantage of the same
   cache technology.
OR
2) The i860 folks were aiming for a higher potential clock rate, and
   although they could have built no-stall loads at 33MHz, they couldn't
   at 40/50, and so built it to go with coming cycle-time improvements,
   whereas the 486 folks didn't, or weren't aiming for as high eventual
   clock-rates.
OR
3) The 486 claims of 1-cycle loads included zero impact for
   instruction-fetching (from the joint cache). (likewise on stores,
   pops, pushes, etc). Note, of course, that we all beat up SPARC
   implementations for having a 2-cycle load / 3-cycle store for a
   similar (although not identical) reason......
OR
4) Somehow, the cache speed is so fast that there is plenty of time to
   do everything, i.e., the critical paths are elsewhere.

Can somebody who knows (KS?) say anything about 3); in particular, there's
a note in EETimes article (April 17, p. 36) about
"aligned instruction access: 3-clock penalty for nonalignment"
(which sounds like a branch to something not aligned on a quad-word
boundary costs 3 cycles?)

Also, can anybody say anything about the cache-access, i.e., to get
16 bytes in one cycle, it presumably has a 128-bit bus to the decode unit.
(Does it? or is it 2 8-byte accesses per pre-fetch? I'd guess 1 16-byte
access, but I haven't seen anything yet that says one way or another.)

(GUESS: above: 1) seems very unlikely. 2) seems possible. 3) seems likely.
4) Seems possible, but unlikely, unless there is really a LONG critical
path somewhere else, and this seems unlikely.)
-- 
-john mashey	DISCLAIMER:
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086