Path: utzoo!utgpu!utstat!jarvis.csri.toronto.edu!mailrus!ncar!ames!vsi1!wyse!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: 486 and 68040
Message-ID: <17999@winchester.mips.COM>
Date: 24 Apr 89 05:30:53 GMT
References: <17131@cup.portal.com> <12435@reed.UUCP> <3913@mipos3.intel.com>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 59

In article <3913@mipos3.intel.com> kds@blabla.UUCP (Ken Shoemaker) writes:
>In article <12435@reed.UUCP> mdr@reed.UUCP (Mike Rutenberg) writes:
>>Michael Slater writes:
>>>- The degree to which clocks per instruction has been reduced.  Intel's 486
>>>   provides single-clock loads, stores, and moves.  Assuming a cache hit,
>>>   data can be used by the instruction immediately following the load, with
>>>   no stall cycle at all.  It remains to be seen if the 040 will do this.
>
>In addition, register to register "simple" arithmetic ops (i.e., everything
>except multiply and divide) take one clock.  Pushes and pops take one clock.
>Branch-not-taken takes one clock (if taken it is 3 clocks)....

All of this is indeed impressive (really).  Having been out of the country
awhile, I may have missed some things; I am curious about one thing:
how is it that, with apparently the same technology:

	the i860, with split I & D caches (2-set assoc), and a RISC-style
	instruction set,
	has a 1-cycle stall following a load if the data is referenced,
and
	the 486, with a joint cache (4-set), and more complex decoding,
	has no such stall.
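
To put a number on what that one stall cycle costs, here is a toy cycle
count. It is only a sketch: it assumes a single-issue pipeline, every access
hitting the cache, and a fixed penalty when an instruction reads the register
written by the load immediately before it. The names (Instr, count_cycles,
load_use_stall) and the five-instruction program are made up for
illustration; nothing here is taken from the i860 or 486 documentation.

```python
# Toy model: cycles for a straight-line sequence on a single-issue pipeline
# with a configurable load-use penalty. All names are hypothetical.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    op: str                        # "load" or "alu"
    dest: Optional[str] = None     # register written
    srcs: Tuple[str, ...] = ()     # registers read


def count_cycles(program: List[Instr], load_use_stall: int) -> int:
    """One cycle per instruction, plus a penalty for each load-use pair."""
    cycles = 0
    prev = None
    for ins in program:
        cycles += 1  # assumes a cache hit and no other hazards
        # Penalty only when this instruction reads what the previous load wrote.
        if prev is not None and prev.op == "load" and prev.dest in ins.srcs:
            cycles += load_use_stall
        prev = ins
    return cycles


program = [
    Instr("load", dest="r1", srcs=("r2",)),
    Instr("alu", dest="r3", srcs=("r1", "r4")),   # uses r1 right after the load
    Instr("load", dest="r5", srcs=("r2",)),
    Instr("alu", dest="r6", srcs=("r4", "r4")),   # independent of r5
    Instr("alu", dest="r7", srcs=("r5", "r6")),   # r5 is ready by now
]

print(count_cycles(program, load_use_stall=1))  # 6 (one load-use stall)
print(count_cycles(program, load_use_stall=0))  # 5 (no stall at all)
```

In this sketch the second load costs nothing extra because an independent
instruction sits between it and its use, which is the usual way a compiler
hides such a stall when it can find something to schedule there.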

The potential answers would appear to be:
	1) the i860 folks screwed up, and didn't take advantage of the
	same cache technology.
	OR
	2) The i860 folks were aiming for a higher potential clock rate,
	and although they could have built no-stall loads at 33MHz, they
	couldn't at 40/50, and so built it to go with coming cycle-time
	improvements, whereas the 486 folks did not design for that, or
	were not aiming for such high eventual clock rates.
	OR
	3) The 486 claims of 1-cycle loads assume zero impact from
	instruction-fetching (from the joint cache), and likewise for stores,
	pops, pushes, etc.  Note, of course, that we all beat up SPARC
	implementations for having a 2-cycle load / 3-cycle store for
	a similar (although not identical) reason......
	OR
	4) Somehow, the cache speed is so fast that there is plenty of
	time to do everything, i.e., the critical paths are elsewhere.
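
Possibility 3) can be sketched with another toy model. The assumptions are
loud ones: a joint cache with a single port, a data access that occupies the
port for the cycle in which its instruction issues, a prefetch that can only
use the port when no data access does, and fixed-length instructions (the
real 486 has variable-length ones). The function simulate and its parameters
mem_every, instr_bytes, fetch_bytes and buf_bytes are hypothetical, chosen
only to show how the effect would arise, not to describe the actual part.

```python
# Toy model of instruction fetch competing with data accesses for the single
# port of a joint cache. All names and default values are hypothetical.
def simulate(n_instrs, mem_every, instr_bytes=4, fetch_bytes=16, buf_bytes=32):
    """Return the cycles needed to issue n_instrs instructions.

    mem_every: every mem_every-th instruction is a load/store (0 = none).
    """
    buf = 0      # bytes currently in the prefetch buffer
    done = 0     # instructions issued so far
    cycles = 0
    while done < n_instrs:
        cycles += 1
        port_busy = False
        if buf >= instr_bytes:
            # Issue one instruction; a load/store takes the cache port.
            is_mem = mem_every > 0 and done % mem_every == mem_every - 1
            buf -= instr_bytes
            done += 1
            port_busy = is_mem
        # Prefetch only if the port is free and the buffer has room.
        if not port_busy and buf + fetch_bytes <= buf_bytes:
            buf += fetch_bytes
    return cycles


print(simulate(100, mem_every=0))                 # 101: fetch never starves
print(simulate(100, mem_every=1))                 # 125: 4 issues, then 1 fetch
print(simulate(100, mem_every=1, fetch_bytes=8))  # 150: 2 issues, then 1 fetch
```

Under these assumptions a "1-cycle" load is only 1 cycle while the prefetch
buffer stays ahead; once data accesses crowd the port, the cost shows up as
issue slots lost to refilling the buffer, and it grows as the fetch width
shrinks.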

Can somebody who knows (KS?) say anything about 3)?  In particular, there's
a note in an EETimes article (April 17, p. 36) about "aligned instruction
access: 3-clock penalty for nonalignment"  (which sounds like a branch to
something not aligned on a quad-word boundary costs 3 cycles?)
Also, can anybody say anything about the cache-access, i.e., to get
16 bytes in one cycle, it presumably has a 128-bit bus to the decode unit.
(Does it?  or is it 2 8-byte accesses per pre-fetch? I'd guess 1 16-byte
access, but I haven't seen anything yet that says one way or another.)

(GUESS, for the answers above: 1) seems very unlikely.  2) seems possible.
3) seems likely.  4) seems possible, but unlikely, unless there is really
a LONG critical path somewhere else, and that seems unlikely.)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086