Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!lll-winken!lll-lcc!pyramid!weitek!sci!kenm
From: kenm@sci.UUCP (Ken McElvain)
Newsgroups: comp.arch
Subject: Re: Horizontal pipelining
Message-ID: <11444@sci.UUCP>
Date: Tue, 24-Nov-87 02:07:56 EST
Article-I.D.: sci.11444
Posted: Tue Nov 24 02:07:56 1987
Date-Received: Sat, 28-Nov-87 00:43:05 EST
References: <201@PT.CS.CMU.EDU> <388@sdcjove.CAM.UNISYS.COM> <988@edge.UUCP> <958@winchester.UUCP>
Organization: Silicon Compilers Systems Corp. San Jose, Ca
Lines: 52
Summary: miss rates up, miss penalties down

In article <958@winchester.UUCP>, mash@mips.UUCP (John Mashey) writes:
> In article <380@PT.CS.CMU.EDU> lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) writes:
+ >This discussion needs a new title...
+ >
+ >There are two reasons to share functional units.
+ > - cost, or, if you will, duty cycle.
+ > - simplicity ( in the sense of RISCness ).
+ >
+ >The duty cycle argument says that if a unit is rarely used, then you get a
+ >more effective design by sharing it among all the instruction-issue units.
+ >Note that a lot of the average Cray sits idle while the rest is being
+ >useful.  The counter-argument is that decreasing {prices, power consumption,
+ >etc} make sharing less of a win. Plus, sharing puts constraints on packaging
+ >- you have to get there from here.
+ 
+ >If you assume a single-chip CPU, I guess it's a bad idea.
+ 
+ That's the critical observation, and observe that an increasing piece
+ of the computing spectrum is being dominated by single-chip CPUs,
+ whose design tradeoffs are very different from having boards full of
+ [TTL, ECL, etc] logic.  For example, if you want to micro-time-slice N
+ processes, you must provide N sets of the highest-speed state in the
+ memory hierarchy [registers], and in fact, you'd probably want
+ N sets of caches also.  [Think about having N processes thrashing
+ around interleaved in the same cache: it is hard to see how this
+ will help your hit rates very much. TLBs likewise]  If you were building CPUs
+ that were multiple boards anyway, it might not be impossible to replicate
+ the registers without incurring awful speed penalties: there will be
+ a limit, but certainly, successful systems have been built this way,
+ if only to minimize context switching time. Board yields don't drop
+ like a stone just because you used a little more space.
+ On the other hand, if it's VLSI, you can be up against serious limits,
+ and you have to think hard about what's on the chips.
+ 

I agree that cache [or TLB] hit rates will almost certainly go down.
However, miss penalties will also drop.  It is quite possible that
a cache fill could happen in the time it takes for the barrel
to turn around.

A ten-stage barrel processor running at 25 MHz (40 ns per stage, so
400 ns per full rotation) would easily allow over 300 ns for a cache
fill before it costs the thread another instruction slot.  The
performance limit here is likely to be the bandwidth of the cache
fill mechanism.
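
The arithmetic behind those figures can be sketched in a few lines of
Python.  This is only a rough model, not a description of any real
machine: the function names, the two-stage miss-detection delay, and
the 5% miss rate are all assumptions picked for illustration.

```python
import math


def rotation_ns(stages: int, clock_mhz: float) -> float:
    """Time for the barrel to return to the same thread, in ns."""
    cycle_ns = 1000.0 / clock_mhz
    return stages * cycle_ns


def slots_lost(fill_ns: float, stages: int, clock_mhz: float,
               detect_stages: int = 2) -> int:
    """Issue slots one thread forfeits while waiting for a cache fill.

    detect_stages is a hypothetical number of stages that pass before
    the miss is known and the fill can start.
    """
    cycle_ns = 1000.0 / clock_mhz
    # Time from the missing instruction's issue until its data is ready.
    ready_ns = detect_stages * cycle_ns + fill_ns
    turn_ns = rotation_ns(stages, clock_mhz)
    # The thread's next slot comes one full rotation after issue; every
    # further rotation it is still waiting costs one slot.
    return max(0, math.ceil(ready_ns / turn_ns) - 1)


def fill_demand_per_sec(clock_mhz: float, miss_rate: float) -> float:
    """Fills per second the fill path must sustain at full issue rate."""
    return clock_mhz * 1e6 * miss_rate


if __name__ == "__main__":
    print(rotation_ns(10, 25))           # 400.0 ns per rotation
    print(slots_lost(300, 10, 25))       # 0: fill hides inside one turn
    print(slots_lost(500, 10, 25))       # 1: fill spills into the next turn
    print(fill_demand_per_sec(25, 0.05)) # assumed 5% miss rate
```

With two stages assumed for miss detection, 320 ns of the 400 ns
rotation remain for the fill, which is one way to arrive at "over
300 ns".  The last function shows why bandwidth becomes the limit:
the barrel hides latency per thread, but the fill path still sees the
misses of all the threads combined.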

Another issue is the instruction set.  It's not clear that you want
a large register file.  It may be much better to move toward a
memory-to-memory architecture (though I would recommend keeping some
base registers).  A number of other areas also have some surprising
tradeoffs.

Ken McElvain
Silicon Compiler Systems
decwrl!sci!kenm