Path: utzoo!mnetor!uunet!lll-winken!lll-tis!ames!umd5!brl-adm!cmcl2!phri!manhat!mancol!jh
From: jh@mancol.UUCP (John Hanley)
Newsgroups: comp.arch
Subject: Re: SPARC and multiprocessing (large read latencies)
Message-ID: <381@mancol.UUCP>
Date: 4 May 88 04:53:22 GMT
References: <1521@pt.cs.cmu.edu> <28200135@urbsdc> <4921@bloom-beacon.MIT.EDU> <51409@sun.uucp> <8029@pur-ee.UUCP>
Reply-To: jh@mancol.UUCP (John Hanley)
Organization: Manhattan College
Lines: 69

In article <8029@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>...why hasn't anyone talked about the fact that processors like SPARC are not
>really designed for large-scale multiprocessing, e.g., they have no provision
>for "hiding" BIG, stochastic, memory reference delays across a log n stage
>interconnection network, etc.? I think it's pretty uninteresting to talk about
>multi-processor systems which are small enough that "snooping caches" work.... 

My favorite method of keeping the CPU busy while a memory-read request is
traversing the network is the one used by the Denelcor HEP: context-switch
on a cache miss.  When a process requests that a disk block be read on a
time-sharing system, the scheduler tries to get useful work done during the
rotational latency by blocking the process and executing another one; when
a process requests a memory word on a HEP'ish system the CPU tries to get
useful work done during the network latency by executing another process.
This, of course, requires extremely light-weight processes so that time to
context-switch is comparable to time to execute any other instruction.
One way of doing this is to have a very memory-intensive architecture, with
almost no registers besides PC and PSW (and even the status word can be
dispensed with; cf. recent comp.arch discussion).  Another method is to
sacrifice single-process speed for parallel speedup, by putting a multiplexor
in front of every single register, so that a context switch is effected by
simply changing the index register that addresses the MUX.  To prevent the
need for reloading the MMU's page-descriptors on every context-switch, it is
preferable for most switches to be between "threads" of the same program
(same virtual address space) rather than between processes running unrelated
programs.
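
To make the multiplexed-register idea concrete, here is a minimal software
sketch in C.  It is only a model of the hardware, not a description of the
HEP or of any real machine: the names (Cpu, reg, switch_context) and the sizes
(4 contexts, 8 registers) are made up for illustration.  Every register
access goes through the "current" index, so a context switch is just a change
of that index and nothing gets saved or restored.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CONTEXTS 4   /* hypothetical number of hardware threads */
#define NUM_REGS     8   /* hypothetical registers per thread */

typedef struct {
    uint32_t regs[NUM_CONTEXTS][NUM_REGS]; /* one register bank per context */
    unsigned current;                      /* the index that addresses the MUX */
} Cpu;

/* Every register access is routed through the MUX index. */
uint32_t *reg(Cpu *cpu, unsigned r)
{
    assert(r < NUM_REGS);
    return &cpu->regs[cpu->current][r];
}

/* A context switch only changes the MUX index (round-robin here);
   no register contents are copied. */
void switch_context(Cpu *cpu)
{
    cpu->current = (cpu->current + 1) % NUM_CONTEXTS;
}
```

A thread that writes a register, gets switched out, and is later switched
back in finds its value untouched, because the other threads were using
different banks the whole time.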

Another tack is to have a conventional ("context switches are expensive")
processor that rarely waits on cache misses.  Writing to a memory location
is always fast because you don't have to wait around for it to finish.  Reads
cause problems.  Suppose you come to a code fragment that is about to do three
array references (low probability of being in the cache).  Rather than saying
LD A, <wait>, LD B, <wait>, LD C, <wait>, you could say PREFETCH A, PREFETCH B,
PREFETCH C, LD A, LD B, LD C.  If the time to execute the three non-blocking
prefetch instructions is comparable to the network latency, you win big, since
they execute during time that would have been spent idle anyway.  Code density
is shot to hell, and the compiler has to be _very_ smart about cache-hit
likelihoods (or else runtime profiling has to be done, which is tricky because
adding or removing PREFETCH instructions itself changes what is in the cache
at any given moment).
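
The PREFETCH A, PREFETCH B, PREFETCH C, LD A, LD B, LD C sequence can be
sketched in C as follows.  This assumes a compiler that provides the
__builtin_prefetch intrinsic (GCC and Clang do); on anything else the PREFETCH
macro below degrades to a no-op.  The function name sum_three is made up for
the example.  A prefetch is only a hint, so the result is the same whether or
not the hardware acts on it.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
/* Non-blocking read hint: rw = 0 (read), locality = 3 (keep in cache). */
#define PREFETCH(addr) __builtin_prefetch((addr), 0, 3)
#else
#define PREFETCH(addr) ((void)(addr))   /* no intrinsic: do nothing */
#endif

long sum_three(const long *a, const long *b, const long *c, size_t i)
{
    assert(a != NULL && b != NULL && c != NULL);

    /* Issue all three requests first so their latencies can overlap... */
    PREFETCH(&a[i]);
    PREFETCH(&b[i]);
    PREFETCH(&c[i]);

    /* ...then do the loads, which may now hit in the cache. */
    long x = a[i];
    long y = b[i];
    long z = c[i];
    return x + y + z;
}
```

With only three loads there is little independent work to hide the latency
behind; the payoff grows when other useful instructions can be placed between
the prefetches and the loads.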

Something I haven't seen is the above PREFETCH instructions implemented in
hardware.  Call it an intelligent look-ahead cache, or an aux. CPU. 
Predictive memory requests are made not only on the instruction stream,
but also on the data stream, a few instructions ahead of time.  Is this
impractical because the aux. CPU has to be nearly as complicated as the CPU
itself, so you'd get better elapsed times from dual processors that spend a
lot of time waiting, rather than a single processor that hardly ever waits
on a cache miss?  (The aux. CPU could be on the same chip as the CPU -- do
any available processors do data prefetch as well as instruction prefetch?)
In some cases the address to prefetch simply can't be computed soon enough
(LD i, LD base_addr+4*i), but usually there's enough play in the data
dependencies that instructions can be rescheduled to allow predictions to be
made in time (LD i, do something else useful while simultaneously computing
base_addr+4*i, prefetch base_addr+4*i while doing some other useful things,
do the array reference (either by recalculating base_addr+4*i or by grabbing
the already computed result from the aux. CPU)). 
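
The rescheduling described above (load the index early, compute
base_addr+4*i, prefetch, do other work, then make the array reference) looks
roughly like this when done by hand in C.  It is a sketch under assumptions:
__builtin_prefetch is a GCC/Clang intrinsic, and gather_sum and
PREFETCH_DISTANCE are names invented here.  The distance of 4 iterations is
arbitrary; a good value depends on the memory latency of the machine.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch((addr), 0, 3)
#else
#define PREFETCH(addr) ((void)(addr))   /* no intrinsic: do nothing */
#endif

#define PREFETCH_DISTANCE 4   /* hypothetical look-ahead, in iterations */

/* Sum table[idx[k]] for k in [0, n).  The index for a later iteration is
   read early so its address can be prefetched while the current
   iteration's work is being done. */
long gather_sum(const int *table, const size_t *idx, size_t n)
{
    assert(table != NULL && idx != NULL);

    long sum = 0;
    for (size_t k = 0; k < n; k++) {
        if (k + PREFETCH_DISTANCE < n)
            PREFETCH(&table[idx[k + PREFETCH_DISTANCE]]);
        sum += table[idx[k]];   /* the actual array reference */
    }
    return sum;
}
```

Here the address is simply recalculated at the point of use; grabbing the
already computed address from somewhere else is the other option mentioned
above.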

Since all we're interested in is reducing the percentage of cache misses, it
is by no means necessary to make the aux. CPU as intelligent as the primary
CPU; the aux. is permitted to give up on a complicated address calculation
and say, "I don't know," incurring only the penalty of a few wasted cycles
on the primary.  Is this a loose enough constraint to make the aux. CPU
practical, or is the idea economically infeasible (the dollars for extra
compute power would be better spent on another processor)?
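
As a toy illustration of an aux. unit that is allowed to give up, here is a
C sketch.  Note the substitution: instead of looking ahead in the instruction
stream and computing addresses, this one only guesses a constant stride from
the last few data addresses, which is a much simpler technique than the one
proposed above.  The names AuxPredictor and aux_observe are hypothetical.
The point is only the interface: the unit either offers an address to
prefetch or says "I don't know."

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t last_addr;    /* most recent data address seen */
    uint32_t last_stride;  /* difference between the last two addresses */
    bool     have_last;    /* at least one address has been seen */
    bool     have_stride;  /* at least two addresses have been seen */
    bool     confident;    /* the last two strides matched */
} AuxPredictor;

/* Record one data address.  Returns true and stores a predicted next
   address in *guess, or returns false ("I don't know") when the last
   two strides disagree or there is too little history. */
bool aux_observe(AuxPredictor *p, uint32_t addr, uint32_t *guess)
{
    assert(p != NULL && guess != NULL);

    if (p->have_last) {
        uint32_t stride = addr - p->last_addr;   /* modular arithmetic */
        p->confident = p->have_stride && stride == p->last_stride;
        p->last_stride = stride;
        p->have_stride = true;
    }
    p->last_addr = addr;
    p->have_last = true;

    if (!p->confident)
        return false;        /* give up: no prefetch is issued */
    *guess = addr + p->last_stride;
    return true;
}
```

A wrong or missing guess costs the primary CPU nothing beyond the cache miss
it would have taken anyway, which is what makes the loose constraint
tolerable.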


             --John Hanley
              System Programmer, Manhattan College
              ..!cmcl2.nyu.edu!manhat!jh  or  hanley@nyu.edu   (CMCL2<=>NYU.EDU)