Path: utzoo!attcan!utgpu!watmath!att!rutgers!ucsd!usc!samsung!uunet!littlei!omepd!toms
From: toms@omews44.intel.com (Tom Shott)
Newsgroups: comp.arch
Subject: Re: RISC vs CISC (rational discussion, not religious wars)
Message-ID:
Date: 15 Nov 89 16:40:03 GMT
References: <503@ctycal.UUCP> <15126@haddock.ima.isc.com> <28942@shemp.CS.UCLA.EDU> <31097@winchester.mips.COM> <28985@shemp.CS.UCLA.EDU> <9769@june.cs.washington.edu> <31198@winchester.mips.COM>
Sender: news@omepd.UUCP
Organization: OME, INTeL Corp., Hillsboro, Oregon
Lines: 37
In-reply-to: mash@mips.COM's message of 11 Nov 89 02:37:44 GMT

To throw some more gas on this fire: as we top 50 MHz for chip speed, the
biggest problem becomes getting data on and off chip.  You start needing
delay cycles between reads and writes on an I/O pin to turn the bus
around, or a chip w/ lots of pins to get data in and out faster.  Putting
a cache on the die has been beat into the ground.

One solution that has not been discussed is flip-chip technology.  In
flip-chip technology many die are mounted directly on a ceramic carrier
(this is used in IBM mainframes).  The result is lower interconnect
capacitance (smaller feature size for lines, no pins) and the ability to
match an exact SRAM to an exact CPU.  I/Os (not pins) are cheap w/
flip-chip technology.  You get higher bandwidth to the cache (wider and
faster lines) and the ability to optimize your process for digital logic
on one die and RAM cells on the other.

Other ways to deal w/ the interconnect speed problem are architectural.
Delayed loads have been spoken about; it just takes more smarts in the
compiler to use those slots.  A longer pipeline w/ the load at the start
will also hide off-chip delays.  A novel architecture from the Computer
Systems Group at UIUC, published by Dave Archer, et al., used multiple
tasks running on one CPU to hide delays.  For example, w/ a 4-stage
pipeline, the CPU chip would run four tasks at once.
I don't remember the details, but it worked out that each task executed at
1/4 of full speed.  (I think dummy pipeline stages were used between the
stages.)  But during that delay time, memory fetch latency was hidden
(also data dependencies).  Realistically, I would expect this technique to
be used only for large systems aimed at multiuser applications: you need
four tasks always ready to run.
-- 
-----------------------------------------------------------------------------
Tom Shott          INTeL, 2111 NE 25th Ave., Hillsboro, OR 97123, (503) 696-4520
toms@omews44.intel.com OR toms%omews44.intel.com@csnet.relay.com
INTeL.. Designers of the 960 Superscalar uP and other uP's