Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!wuarchive!cs.utexas.edu!usc!orion.oac.uci.edu!uci-ics!ucla-cs!oahu!frazier
From: frazier@oahu.cs.ucla.edu (Greg Frazier)
Newsgroups: comp.arch
Subject: Re: RISC vs CISC (rational discussion, not religious wars)
Keywords: Die Space
Message-ID: <28985@shemp.CS.UCLA.EDU>
Date: 9 Nov 89 17:50:24 GMT
References: <503@ctycal.UUCP> <15126@haddock.ima.isc.com> <28942@shemp.CS.UCLA.EDU> <31097@winchester.mips.COM>
Sender: news@CS.UCLA.EDU
Reply-To: frazier@oahu.UUCP (Greg Frazier)
Organization: UCLA Computer Science Department
Lines: 61

In article <31097@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <28942@shemp.CS.UCLA.EDU> frazier@oahu.UUCP (Greg Frazier) writes:
>... [me proposing multiple CPU chips]
>I don't think we're anywhere near this yet, and this can be seen by
>analyzing the layout and nature of million-transistor chips [like i860s].
>If you look at the i860 die, you find that:
>	a) Most of the transistors are in the caches.
>	b) Most of the space is the FPU, registers, integer datapath, etc.
>	   Some of this stuff is wires, and it doesn't shrink as well as
>	   transistors do.
>	c) At the top speed claimed for it, eventually [50MHz], 12KB
>	   of cache is NOWHERE near big enough for efficiency, by itself.
[ further discussion of how the $ is still too small ]

Yeah, I've been wondering why people are bothering with on-chip
caches.  Admittedly, a 2-chip CPU would probably be significantly
more expensive than a single-chip CPU, but I think the product
would be more flexible, and achieve better performance, if,
instead of putting the $ on the CPU chip, the $ were provided on
a companion chip.  This should increase the latency to the cache
by only a single clock cycle, while significantly boosting the
hit ratio - particularly if this were a data $ (leave the
instruction $ on chip - it belongs there).
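The latency-vs-hit-ratio tradeoff above can be checked with a quick
average-memory-access-time (AMAT) calculation.  Here is a minimal
modern sketch; the hit ratios and cycle counts are the illustrative
assumptions used later in this post, not measurements of any real chip:

```python
# Back-of-envelope AMAT comparison: small on-chip cache vs. a larger
# off-chip cache that costs one extra cycle on a hit.
# All numbers are illustrative assumptions, not measurements.

def amat(hit_ratio, hit_cycles, miss_cycles):
    """Average cycles per memory reference."""
    return hit_ratio * hit_cycles + (1.0 - hit_ratio) * miss_cycles

on_chip  = amat(0.85, 1, 14)   # on-chip $:  85% hits, 1-cycle hit, 14-cycle miss
off_chip = amat(0.95, 2, 14)   # off-chip $: 95% hits, 2-cycle hit, 14-cycle miss

print(f"on-chip : {on_chip:.2f} cycles/reference")
print(f"off-chip: {off_chip:.2f} cycles/reference")

# Break-even miss delay M solves 0.85*1 + 0.15*M = 0.95*2 + 0.05*M,
# i.e. 0.10*M = 1.05, so M = 10.5 cycles: for shorter miss delays
# the on-chip cache wins, for longer ones the off-chip cache wins.
break_even = (0.95 * 2 - 0.85 * 1) / (0.15 - 0.05)
print(f"break-even miss delay: {break_even:.1f} cycles")
```

Changing either hit ratio moves the break-even point, which is the
"your mileage will vary" caveat below.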
With a separate cache chip, one could a) put some intelligence on
it, and/or b) make it expandable, so that a user could choose to
provide two or three $ chips.  Of course, this is not a terribly
new idea - but I don't know why it isn't being used in the newer
chips.  Realtime people would love it - they would put NO $ chips
on, and simply populate the board with fast local memory (almost
the same thing, but with more predictable behavior).

This scheme would free up space on chip for... two CPUs, or
multiple FPUs, or a single-cycle FPU (that's an attractive one!),
or any of a host of other performance-boosting schemes.  A
disadvantage is that the CPU chip would have to have multiple
ports - a chip with both the inst. $ and the data $ on chip can
have the two $'s share a single port to memory, since
(presumably) neither is using it very often.  However, unless one
wants 128-bit paths to memory (which one might), I don't think
having 2 ports is a big deal.

Quick performance analysis hack:

on-chip $:  assume 85% hit ratio, 1 cycle delay on hit,
            14 cycle delay on miss
off-chip $: assume 95% hit ratio, 2 cycle delay on hit,
            14 cycle delay on miss (a smart $ will not incur
            extra miss delay)

on-chip  memory speed: .85*1 + .15*14 = 2.95 cycles/reference
off-chip memory speed: .95*2 + .05*14 = 2.60 cycles/reference - a win!

With the hit ratios I have assumed, the break-even point is a
miss delay of about 10.5 cycles - below that, the on-chip cache
becomes a win.  Of course, change the hit ratios and you change
the break-even point, so your mileage will vary.  As a final
note, if the $ chip is closely married to the CPU chip, there is
no reason why the 2-cycle delay can't be achieved, I think.

Greg Frazier
"They thought to use and shame me but I win out by nature, because a true
freak cannot be made.  A true freak must be born." - Geek Love

Greg Frazier	frazier@CS.UCLA.EDU	!{ucbvax,rutgers}!ucla-cs!frazier