Path: utzoo!attcan!utgpu!watmath!att!rutgers!ucsd!usc!samsung!uunet!littlei!omepd!toms
From: toms@omews44.intel.com (Tom Shott)
Newsgroups: comp.arch
Subject: Re: RISC vs CISC (rational discussion, not religious wars)
Message-ID:
Date: 15 Nov 89 16:40:03 GMT
References: <503@ctycal.UUCP> <15126@haddock.ima.isc.com> <28942@shemp.CS.UCLA.EDU> <31097@winchester.mips.COM> <28985@shemp.CS.UCLA.EDU> <9769@june.cs.washington.edu> <31198@winchester.mips.COM>
Sender: news@omepd.UUCP
Organization: OME, INTeL Corp., Hillsboro, Oregon
Lines: 37
In-reply-to: mash@mips.COM's message of 11 Nov 89 02:37:44 GMT

To throw some more gas on this fire: as we top 50 MHz for chip speed, the
biggest problem becomes getting data on and off chip.  You start needing
delay cycles between reads and writes on an I/O pin to turn the bus
around, or a chip w/ lots of pins to get data in and out faster.  Putting
a cache on the die has been beat into the ground.

One solution that has not been discussed is flip-chip technology.  In
flip-chip technology many die are mounted directly on a ceramic carrier
(this is used in IBM mainframes).  The result is lower interconnect
capacitance (smaller feature size for lines, no pins) and the ability to
match an exact SRAM to an exact CPU.  I/Os (not pins) are cheap w/
flip-chip technology.  You get higher bandwidth to the cache (wider and
faster lines) and the ability to optimize your process for digital logic
on one die and RAM cells on the other.

Other ways to deal w/ the interconnect speed problem are architectural.
Delayed loads have been spoken about; it just takes more smarts in the
compiler to use those slots.  A longer pipeline w/ the load at the start
will also hide off-chip delays.  A novel architecture from the Computer
Systems Group at UIUC, published by Dave Archer, et al., used multiple
tasks running on one CPU to hide delays.  For example, w/ a 4-stage
pipeline, the CPU chip would run four tasks at once.
I don't remember the details, but it worked out that each task executed at
1/4 of full speed.  (I think dummy pipeline stages were used between the
stages.)  But during that delay time, memory fetch latency was hidden
(also data dependencies).  Realistically, I would expect this technique to
be used only for large systems aimed at multiuser applications: you need
four tasks always ready to run.
-- 
-----------------------------------------------------------------------------
Tom Shott          INTeL, 2111 NE 25th Ave., Hillsboro, OR 97123, (503) 696-4520
toms@omews44.intel.com OR toms%omews44.intel.com@csnet.relay.com
INTeL.. Designers of the 960 Superscalar uP and other uP's