Path: utzoo!mnetor!uunet!lll-winken!lll-tis!ames!umd5!brl-adm!cmcl2!phri!manhat!mancol!jh
From: jh@mancol.UUCP (John Hanley)
Newsgroups: comp.arch
Subject: Re: SPARC and multiprocessing (large read latencies)
Message-ID: <381@mancol.UUCP>
Date: 4 May 88 04:53:22 GMT
References: <1521@pt.cs.cmu.edu> <28200135@urbsdc> <4921@bloom-beacon.MIT.EDU> <51409@sun.uucp> <8029@pur-ee.UUCP>
Reply-To: jh@mancol.UUCP (John Hanley)
Organization: Manhattan College
Lines: 69

In article <8029@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>...why hasn't anyone talked about the fact that processors like SPARC are not
>really designed for large-scale multiprocessing, e.g., they have no provision
>for "hiding" BIG, stochastic, memory reference delays across a log n stage
>interconnection network, etc.? I think it's pretty uninteresting to talk about
>multi-processor systems which are small enough that "snooping caches" work....

My favorite method of keeping the CPU busy while a memory-read request is traversing the network is the one used by the Denelcor HEP: context-switch on a cache miss. When a process requests that a disk block be read on a time-sharing system, the scheduler tries to get useful work done during the rotational latency by blocking the process and executing another one; when a process requests a memory word on a HEP'ish system the CPU tries to get useful work done during the network latency by executing another process.

This, of course, requires extremely light-weight processes so that time to context-switch is comparable to time to execute any other instruction. One way of doing this is to have a very memory-intensive architecture, with almost no registers besides PC and PSW (and even the status word can be dispensed with; c.f. recent comp.arch discussion).
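
To put rough numbers on the switch-on-miss idea above, here is a minimal sketch of a toy model, not a description of how the HEP actually schedules. Everything in it is an assumption made for illustration: the function name `simulate`, the fixed `run_cycles` between misses, the fixed network `latency`, and the `switch_cost` parameter. Every thread is assumed to miss after exactly `run_cycles` of useful work, and the CPU always picks the thread whose outstanding read completes earliest.

```python
def simulate(threads, run_cycles, latency, total_cycles, switch_cost=1):
    """Toy model: fraction of cycles doing useful work when the CPU
    switches threads on every cache miss (all parameters hypothetical)."""
    ready_at = [0] * threads  # cycle at which each thread's pending read returns
    clock = 0
    useful = 0
    while clock < total_cycles:
        # Choose the thread whose outstanding read completes earliest.
        t = min(range(threads), key=lambda i: ready_at[i])
        if ready_at[t] > clock:
            clock = ready_at[t]  # nothing runnable: the CPU idles
            continue
        burst = min(run_cycles, total_cycles - clock)
        useful += burst
        clock += burst
        ready_at[t] = clock + latency  # the thread misses; its read goes out
        clock += switch_cost           # pay for the context switch
    return useful / total_cycles


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        print(n, "threads:", simulate(n, run_cycles=10, latency=90,
                                      total_cycles=1000, switch_cost=0))
```

In this model, with 10 cycles of work per miss and a 90-cycle latency, a single thread keeps the CPU about 10% busy, and roughly ten threads are needed to hide the latency completely. A nonzero `switch_cost` eats directly into that, which is why the switch has to cost about as much as an ordinary instruction.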
Another method is to sacrifice single-process speed for parallel speedup, by putting a multiplexor in front of every single register, so that a context switch is effected by simply changing the index register that addresses the MUX. To prevent the need for reloading the MMU's page-descriptors on every context-switch, it is preferable for most switches to be between "threads" of the same program (same virtual address space) rather than between processes running unrelated programs.

Another tack is to have a conventional ("context switches are expensive") processor that rarely waits on cache misses. Writing to a memory location is always fast because you don't have to wait around for it to finish. Reads cause problems. Suppose you come to a code fragment that is about to do three array references (low probability of being in the cache). Rather than saying LD A, LD B, LD C, you could say PREFETCH A, PREFETCH B, PREFETCH C, LD A, LD B, LD C. If the time to execute the three non-blocking prefetch instructions is comparable to the network latency, you win big, since they execute during time that would have been spent idle anyway. Code density is shot to hell, and the compiler has to be _very_ smart about cache-hit likelihoods (or else runtime profiling has to be done, which is tricky because adding or removing PREFETCH instructions itself changes what is in the cache at any given moment).

Something I haven't seen is the above PREFETCH instructions implemented in hardware. Call it an intelligent look-ahead cache, or an aux. CPU. Predictive memory requests are made not only on the instruction stream, but also on the data stream, a few instructions ahead of time. Is this impractical because the aux. CPU has to be nearly as complicated as the CPU itself, so you'd get better elapsed times from dual processors that spend a lot of time waiting, rather than a single processor that hardly ever waits on a cache miss? (The aux.
CPU could be on the same chip as the CPU -- do any available processors do data prefetch as well as instruction prefetch?)

In some cases the address to prefetch simply can't be computed soon enough (LD i, LD base_addr+4*i), but usually there's enough play in the data dependencies that instructions can be rescheduled to allow predictions to be made in time (LD i, do something else useful while simultaneously computing base_addr+4*i, prefetch base_addr+4*i while doing some other useful things, do the array reference (either by recalculating base_addr+4*i or by grabbing the already computed result from the aux. CPU)).

Since all we're interested in is reducing the percentage of cache misses, it is by no means necessary to make the aux. CPU as intelligent as the primary CPU; the aux. is permitted to give up on a complicated address calculation and say, "I don't know," incurring only the penalty of a few wasted cycles on the primary. Is this a loose enough constraint to make the aux. CPU practical, or is the idea economically infeasible (the dollars for extra compute power would be better spent on another processor)?

      --John Hanley
        System Programmer, Manhattan College
        ..!cmcl2.nyu.edu!manhat!jh   or   hanley@nyu.edu   (CMCL2<=>NYU.EDU)
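
The PREFETCH A, PREFETCH B, PREFETCH C, LD A, LD B, LD C ordering discussed above can be illustrated with a rough timing sketch. This is a toy model under stated assumptions, not any real instruction set: the function name `run`, the one-cycle `issue_cost`, and the fixed `latency` are all hypothetical. The network is assumed to accept several outstanding reads at once, and every address is assumed to miss the first time it is touched.

```python
def run(program, latency, issue_cost=1):
    """Toy model: total cycles to execute a list of (op, address) pairs,
    where op is "PREFETCH" (non-blocking) or "LD" (blocks until data arrives)."""
    clock = 0
    arrives_at = {}  # address -> cycle at which its data reaches the cache
    for op, addr in program:
        clock += issue_cost
        if op == "PREFETCH":
            # Start the read and keep going; a duplicate request changes nothing.
            arrives_at.setdefault(addr, clock + latency)
        elif op == "LD":
            # Start the read if nobody has yet, then stall until the data is in.
            arrival = arrives_at.setdefault(addr, clock + latency)
            clock = max(clock, arrival)
        else:
            raise ValueError("unknown op: " + op)
    return clock


naive = [("LD", "A"), ("LD", "B"), ("LD", "C")]
prefetched = [("PREFETCH", a) for a in "ABC"] + [("LD", a) for a in "ABC"]

if __name__ == "__main__":
    print("naive:     ", run(naive, latency=20))
    print("prefetched:", run(prefetched, latency=20))
```

With a 20-cycle latency the naive sequence pays for three full round trips back to back, while the prefetched sequence overlaps them and pays for roughly one. The same model covers the aux. CPU case if the PREFETCH entries are thought of as issued by the auxiliary unit rather than appearing in the instruction stream; an "I don't know" from the aux. simply means the corresponding PREFETCH entry is absent and the LD stalls as in the naive case.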