Path: utzoo!attcan!uunet!svcs1!andy
From: andy@svcs1.UUCP (Andy Piziali)
Newsgroups: comp.arch
Subject: Re: Late, Lamented E&S-1 -- whats it look like?
Keywords: architecture, parallelism, latency
Message-ID: <329@svcs1.UUCP>
Date: 24 Nov 89 19:38:56 GMT
References: <36652@apple.Apple.COM> <324@svcs1.UUCP> <36725@apple.Apple.COM>
Reply-To: andy@svcs1.UUCP (Andy Piziali)
Organization: Silicon Valley Computer Society PC UNIX SIG, San Jose, Ca.
Lines: 42
Distribution:

In article <36725@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>The problem with multiprocessors is to get my dusty decks running on them.
>And, if you have no architectural support for this, just software, then
>anyone could use the same technique and slap together a bunch of micros to
>achieve the same end. So, what are the architectural features that permitted
>this system to be used effectively?

The issue of running "dusty decks" on new computers has always been
addressed with software technology, compilers which translate old programs
into the new machine's instruction set and, in the case of parallel
machines, map data parallelism onto multiple processors. In the case of the
ES-1, the compiler has an intimate knowledge of the CU pipeline for use in
scheduling optimized code and source language preprocessors are used to
detect parallel operations and create multiple threads within the original,
single-threaded, dusty deck.

On top of the necessary compiler technology, there must then be
architectural support for coordinating the multiple threads of control
created by the compiler. In the ES-1, there are three mechanisms for
inter-thread synchronization: atomic memory accesses, signals, and
interrupts.

The atomic memory accesses are your typical test-and-set operations: read
and set, read and reset, and reset first.

The signal mechanism is a means for threads to asynchronously communicate.
A hardware control block is constructed by the thread (A) specifying what
signals the thread is expecting.
When another thread (B) sends thread A a signal, the receipt of the signal
is recorded in the control block and, if thread A is not currently active
(running on a CU), a processor running a lower priority thread is
interrupted.

The CUs in an ES-1 may send interrupts to one another for use in
asynchronous event signalling.

>What was the latency through the crossbar (ie. how many delay slots were
>there after a load?)

I feel more comfortable answering how load latency is hidden in general in
the ES-1 than citing specific machine parameters. The integer and floating
point registers are independently scoreboarded and are always non-blocking.
There is no fixed number of delay slots after loads. Instruction issue
continues past a load and stalls only when a later instruction names the
destination register of an outstanding load as a source register.
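
To make the scoreboarding description concrete, here is a minimal sketch in
Python of in-order issue with non-blocking loads. It is not the ES-1
implementation: the `Instr` tuple, the `issue_cycles` function, the
one-instruction-per-cycle issue rate, the single-cycle latency for non-load
operations, and the 4-cycle load latency are all assumptions chosen for
illustration, since no machine parameters are given above. Write-after-write
hazards are not modelled.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record used only for this sketch."""
    op: str                 # "load" or any other operation name
    dest: str               # destination register
    srcs: Tuple[str, ...]   # source registers


def issue_cycles(program, load_latency=4):
    """Return the cycle at which each instruction issues.

    Issue is in order, at most one instruction per cycle (assumed).
    A load does not block issue; only an instruction that reads a
    register whose value is still outstanding has to wait.
    """
    ready_at = {}   # scoreboard: register -> first cycle its value is usable
    cycle = 0
    issued = []
    for ins in program:
        # Stall only until every source register has its value.
        pending = [ready_at.get(reg, 0) for reg in ins.srcs]
        cycle = max([cycle] + pending)
        issued.append(cycle)
        # Assumed latencies: load_latency cycles for loads, 1 for the rest.
        latency = load_latency if ins.op == "load" else 1
        ready_at[ins.dest] = cycle + latency
        cycle += 1
    return issued


program = [
    Instr("load", "r1", ("r9",)),       # result usable at cycle 4
    Instr("add", "r2", ("r3", "r4")),   # independent of the load
    Instr("add", "r5", ("r6", "r7")),   # independent of the load
    Instr("add", "r8", ("r1", "r2")),   # first use of the loaded value
]

print(issue_cycles(program))
```

Under these assumed latencies the two independent adds issue in the shadow
of the load and only the instruction that reads r1 waits. Moving that
instruction to directly follow the load would stall it until the load
completes, which is the sense in which there is no fixed number of delay
slots.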