Path: utzoo!mnetor!uunet!husc6!bloom-beacon!gatech!purdue!i.cc.purdue.edu!j.cc.purdue.edu!pur-ee!hankd
From: hankd@pur-ee.UUCP (Hank Dietz)
Newsgroups: comp.arch
Subject: Re: SPARC and multiprocessing (large read latencies)
Message-ID: <8063@pur-ee.UUCP>
Date: 5 May 88 20:16:48 GMT
References: <1521@pt.cs.cmu.edu> <28200135@urbsdc> <4921@bloom-beacon.MIT.EDU> <381@mancol.UUCP>
Organization: Purdue University Engineering Computer Network
Lines: 53
Summary: Here's what Smith is doing and what we're doing

In article <381@mancol.UUCP>, jh@mancol.UUCP (John Hanley) writes:
> In article <8029@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
> >...why hasn't anyone talked about the fact that processors like SPARC are not
> >really designed for large-scale multiprocessing, e.g., they have no provision
> >for "hiding" BIG, stochastic, memory reference delays across a log n stage
> >interconnection network, etc.? I think it's pretty uninteresting to talk about
> >multi-processor systems which are small enough that "snooping caches" work....
>
> My favorite method of keeping the CPU busy while a memory-read request is
> traversing the network is the one used by the Denelcor HEP: context-switch
> on a cache miss.... [or alternatively....]
> LD A, , LD B, , LD C, , you could say PREFETCH A, PREFETCH B,
> PREFETCH C, LD A, LD B, LD C. If the time to execute the three non-blocking
> prefetch instructions is comparable to the network latency, you win big, since
> they execute during time that would have been spent idle anyway....
> Something I haven't seen is the above PREFETCH instructions implemented in
> hardware. Call it an intelligent look-ahead cache, or an aux. CPU.
> Predictive memory requests are made not only on the instruction stream,
> but also on the data stream, a few instructions ahead of time....

Burton Smith, of HEP fame, is sort-of doing both in his latest machine; so
are we (CARP -- the Compiler-oriented Architecture Research group at Purdue).
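
To put rough numbers on the quoted PREFETCH idea, here is a minimal
cycle-count sketch. It is only a toy model, not a description of HEP, SPARC,
or the CARP machine. The assumptions are mine: a single-issue processor, one
cycle per instruction (`issue`), a fixed network latency, and a network that
accepts a new request every cycle. The function names are hypothetical.

```python
def blocking_cycles(n_loads, latency, issue=1):
    """Plain LD, LD, LD: each load issues, then stalls for the full latency."""
    return n_loads * (issue + latency)


def prefetch_cycles(n_loads, latency, issue=1):
    """PREFETCH x n followed by LD x n, under the toy assumptions above.

    Each prefetch is non-blocking: it costs `issue` cycles and its data
    arrives `latency` cycles after it has been issued. The loads then run
    in order and stall only for whatever latency has not yet elapsed.
    """
    t = 0
    arrival = []
    for _ in range(n_loads):
        t += issue                  # issue one non-blocking prefetch
        arrival.append(t + latency)  # cycle at which its data is available
    for ready in arrival:
        t = max(t, ready)           # stall only if the data is not here yet
        t += issue                  # the load itself now completes at once
    return t


if __name__ == "__main__":
    for latency in (3, 20):
        print(latency, blocking_cycles(3, latency), prefetch_cycles(3, latency))
```

In this model the three latencies overlap instead of adding up, so the
saving grows with the latency; a real network with stochastic delays and
contention would of course behave less neatly.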
I believe Burton's machine microtasks, a la HEP, but he also has a method
whereby many memory references (or other slow operations) can be initiated
without waiting for earlier ones to complete. (I still don't know how much
of his design is in the public domain, so I can't say much more about it.)

The CARP machine doesn't microtask, but we do some very sneaky interrupt
enabling... the CARP machine processor has provision for multiple delayed
operations to be initiated without waiting for earlier ones to complete, and
interrupts are only enabled when the processor has to wait LONGER than the
compiler expected. The interrupt latency is high this way (perhaps dozens
of instructions between interrupt accept states), but this isn't such a bad
problem when you consider a multiprocessor machine where ANY processor could
service ANY interrupt.

There are actually two separate delayed operation mechanisms in the CARP
machine: one for compile-time known delays and one for delays where only the
expected delay is known at compile-time. For some operations, the
expected-delay-based mechanism uses late targeting; i.e., the destination
register in register address space is not specified until the item has
arrived, hence the usable register address space is not reduced by having
multiple items pending (selection of a staging register is implicit in the
type of the delayed operation).

We look at it this way: if you want to get high speedup by multiprocessing,
then, since not everything can be parallelized, you don't want to slow the
sequential parts by microtasking. The result is that we implement machine
use priorities by dynamically changing the parallelism-width dedicated to
each task, and we concentrate on other mechanisms for hiding delays...
preferably mechanisms which do not "use up" parallelism that we could have
used to achieve speedup through parallel execution.

                                                -hankd
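
The interrupt-enabling rule described above (interrupts are accepted only
while the processor waits longer than the compiler expected) can be
illustrated with a small sketch. This is not the CARP design, whose details
are not given here; it is a toy model under my own simplifying assumptions:
delayed operations are handled one after another rather than overlapped,
the compiler schedules exactly `expected` cycles of useful work behind each
one, and the name `interrupt_windows` is hypothetical.

```python
def interrupt_windows(ops):
    """Toy model of "enable interrupts only on an unexpectedly long wait".

    ops: list of (expected_delay, actual_delay) pairs, in issue order.
    ASSUMPTION: operations are not overlapped, and the compiler has
    scheduled `expected_delay` cycles of useful work behind each one.

    Returns (total_cycles, windows), where windows is a list of
    (start, end) cycle intervals during which interrupts are enabled.
    """
    t = 0
    windows = []
    for expected, actual in ops:
        t += expected              # compiler-scheduled work hides this much
        if actual > expected:      # the item is later than the compiler assumed
            stall = actual - expected
            windows.append((t, t + stall))  # interrupts accepted only here
            t += stall
    return t, windows


if __name__ == "__main__":
    total, windows = interrupt_windows([(10, 8), (10, 15), (12, 12)])
    print(total, windows)
```

Operations that arrive on time or early open no window at all, which is why
the interrupt latency on any one processor can stretch over many
instructions.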