Path: utzoo!mnetor!uunet!husc6!bloom-beacon!gatech!purdue!i.cc.purdue.edu!j.cc.purdue.edu!pur-ee!hankd
From: hankd@pur-ee.UUCP (Hank Dietz)
Newsgroups: comp.arch
Subject: Re: SPARC and multiprocessing (large read latencies)
Message-ID: <8063@pur-ee.UUCP>
Date: 5 May 88 20:16:48 GMT
References: <1521@pt.cs.cmu.edu> <28200135@urbsdc> <4921@bloom-beacon.MIT.EDU> <381@mancol.UUCP>
Organization: Purdue University Engineering Computer Network
Lines: 53
Summary: Here's what Smith is doing and what we're doing

In article <381@mancol.UUCP>, jh@mancol.UUCP (John Hanley) writes:
> In article <8029@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
> >...why hasn't anyone talked about the fact that processors like SPARC are not
> >really designed for large-scale multiprocessing, e.g., they have no provision
> >for "hiding" BIG, stochastic, memory reference delays across a log n stage
> >interconnection network, etc.? I think it's pretty uninteresting to talk about
> >multi-processor systems which are small enough that "snooping caches" work.... 
> 
> My favorite method of keeping the CPU busy while a memory-read request is
> traversing the network is the one used by the Denelcor HEP: context-switch
> on a cache miss.... [or alternatively....]
> LD A, <wait>, LD B, <wait>, LD C, <wait>, you could say PREFETCH A, PREFETCH B,
> PREFETCH C, LD A, LD B, LD C.  If the time to execute the three non-blocking
> prefetch instructions is comparable to the network latency, you win big, since
> they execute during time that would have been spent idle anyway....
> Something I haven't seen is the above PREFETCH instructions implemented in
> hardware.  Call it an intelligent look-ahead cache, or an aux. CPU. 
> Predictive memory requests are made not only on the instruction stream,
> but also on the data stream, a few instructions ahead of time....
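
To put rough numbers on the prefetch scheme quoted above, here is a toy
cycle count in Python.  The model is an assumption made for illustration,
not a description of any real machine:  every instruction issues in one
cycle, a memory reference comes back `latency` cycles after it is issued,
and the network accepts any number of outstanding requests.  The function
names are made up.

```python
# Toy cycle count: blocking loads vs. prefetch-then-load.
# Assumed model: 1 cycle per instruction issue, a fixed network latency,
# and no limit on the number of requests in flight.

def blocking_cycles(n_loads, latency):
    # LD A, <wait>, LD B, <wait>, ...: each load issues, then stalls.
    return n_loads * (1 + latency)

def prefetch_cycles(n_loads, latency):
    # PREFETCH A, PREFETCH B, ..., then LD A, LD B, ...
    clock = 0
    ready = []
    for _ in range(n_loads):            # the non-blocking prefetches
        clock += 1                      # issuing one takes a cycle
        ready.append(clock + latency)   # when its data will arrive
    for arrival in ready:               # the real loads, in the same order
        clock = max(clock, arrival)     # stall only if the data is not here
        clock += 1                      # the LD itself
    return clock
```

With three loads and a latency of three cycles (comparable to the time to
issue the prefetches), the blocking sequence takes 12 cycles in this model
and the prefetched one takes 7.  With zero latency the prefetches are pure
overhead:  6 cycles instead of 3.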

Burton Smith, of HEP fame, is sort-of doing both in his latest machine; so
are we (CARP -- the Compiler-oriented Architecture Research group at Purdue).

I believe Burton's machine microtasks, a la HEP, but he also has a method
whereby many memory references (or other slow operations) can be initiated
without waiting for earlier ones to complete.  (I still don't know how much
of his design is in the public domain, so I can't say much more about it.)

The CARP machine doesn't microtask, but we do some very sneaky interrupt
enabling.  The processor has provision for multiple delayed operations to
be initiated without waiting for earlier ones to complete, and interrupts
are enabled only when the processor has to wait LONGER than the compiler
expected.  The interrupt latency is high this way (perhaps dozens of
instructions between interrupt accept states), but this isn't such a bad
problem in a multiprocessor machine where ANY processor can service ANY
interrupt.  There are actually two separate delayed operation mechanisms in
the CARP machine:  one for delays known at compile-time and one for delays
where only the expected delay is known at compile-time.  For some
operations, the expected-delay-based mechanism uses late targeting; i.e.,
the destination register is not specified until the item has arrived, so
having multiple items pending does not reduce the usable register address
space (selection of a staging register is implicit in the type of the
delayed operation).
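
A toy software model may make the late-targeting idea easier to picture.
This is only a sketch under loud assumptions, not the CARP hardware:  the
class and method names, the single staging queue (standing in for "one
staging register per operation type"), and the cycle counts are all
invented for illustration.

```python
# Toy model of delayed operations with late targeting.
# All names and timings here are hypothetical.
from collections import deque

class ToyProcessor:
    def __init__(self, n_regs=8):
        self.regs = [0] * n_regs
        self.staging = deque()        # pending items: (arrival_time, value)
        self.clock = 0
        self.interrupt_cycles = 0     # cycles spent with interrupts enabled

    def start_load(self, value, actual_delay):
        # Initiate a delayed operation.  No destination register is named
        # yet, so a pending item ties up no register address.
        self.staging.append((self.clock + actual_delay, value))
        self.clock += 1

    def work(self, cycles):
        # Other instructions the compiler scheduled to cover the
        # expected delay.
        self.clock += cycles

    def take(self, dest):
        # Late targeting: the destination is chosen only at this point.
        arrival, value = self.staging.popleft()
        if arrival > self.clock:
            # The wait is LONGER than the compiler planned for; in this
            # model that is the only time interrupts are enabled.
            self.interrupt_cycles += arrival - self.clock
            self.clock = arrival
        self.regs[dest] = value
        self.clock += 1
```

For example, start two loads back to back, do four cycles of other work,
then take both items.  If the first arrives on time and the second is late,
only the second take opens an interrupt window, and neither load occupied a
register address while it was pending.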

We look at it this way:  if you want high speedup from multiprocessing, and
not everything can be parallelized, you don't want to slow the sequential
parts by microtasking.  The result is that we implement machine-use
priorities by dynamically changing the parallelism-width dedicated to each
task, and we concentrate on other mechanisms for hiding delays...
preferably mechanisms which do not "use up" parallelism that we could have
used to achieve speedup through parallel execution.
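
The arithmetic behind that worry can be sketched Amdahl-style.  The
assumptions are mine, for illustration only:  a fraction `seq` of the work
is sequential, the rest spreads perfectly over `n` processors, and
microtasking among `k` contexts makes any single instruction stream run k
times slower while leaving total throughput unchanged.

```python
# Back-of-envelope speedup with a microtasking penalty on sequential code.
# seq: sequential fraction of the work (0..1)
# n:   number of processors
# k:   contexts sharing each processor (k=1 means no microtasking)

def speedup(seq, n, k=1):
    parallel = (1.0 - seq) / n   # parallel part keeps all n processors busy
    sequential = seq * k         # one stream gets 1/k of its processor
    return 1.0 / (sequential + parallel)
```

With 10% sequential work on 64 processors, this gives a speedup a bit under
9 with no microtasking, and only about 1.2 if each processor is shared
among 8 contexts.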

						-hankd