Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!unmvax!deimos.cis.ksu.edu!uxc!uxc.cso.uiuc.edu!mcdurb!aglew
From: aglew@mcdurb.Urbana.Gould.COM
Newsgroups: comp.arch
Subject: Re: Complex Instructions
Message-ID: <28200307@mcdurb>
Date: 7 May 89 17:40:00 GMT
References: <38629@bbn.COM>
Lines: 96
Nf-ID: #R:bbn.COM:38629:mcdurb:28200307:000:4740
Nf-From: mcdurb.Urbana.Gould.COM!aglew May 7 12:40:00 1989

>prem@crackle.amd.com (Prem Sobel) writes
>>Many years ago, there was a machine called the Interdata Model 70 which
>>had instructions for atomically adding or removing items from circular
>>double ended queues. The data structure was defined reasonably efficiently
>>and the machine was microcoded.
>>
>>Yet no one, no compiler seriously used these instructions. The reason was,
>>amazingly, that individual instructions were faster!!! I never looked at
>>the microcode, so I cannot comment why that was.
>
> Is this sort of finding not precisely the reason behind RISC architecture?
> The realisation that many an instruction to handle a relatively
> sophisticated job turned out to be slower than a series of very simple
> instructions to do the equivalent thing?
>
> I myself am not quite clear yet on why this should be ( is there perhaps a
> reason that larger, more sophisticated microcode routines can't be made
> optimally fast?). Yet it does seem to be a fact of life.
>
>
> Alastair Milne

The Gould NP1 had queue manipulation instructions that provided
atomicity. And, yes, the OS avoided using them, because a hand-crafted
sequence of individual instructions was faster.
    But the CISCy queue manipulation instructions were left in because
the hand-crafted sequences used privileged operations not callable
from user mode. Creating a system call to do the same operation was
more expensive than leaving in the microcoded instructions - especially
since the system call would have had to do privilege checks that the
hand-crafted inside-the-OS version didn't.

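
To make the trade-off concrete, here is a rough sketch of queue
operations built from simple instructions plus one atomic primitive.
Everything in it is an assumption made for illustration: the names
(deque_t, deq_push_back, ...), the fixed capacity, and the use of a
C11 test-and-set spinlock. It is not the NP1's or the Interdata's
data structure, and the in-kernel sequences described above relied on
privileged operations rather than a user-level lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEQ_CAP 8                /* hypothetical fixed capacity */

typedef struct {
    atomic_flag lock;            /* test-and-set spinlock guarding the rest */
    size_t head;                 /* index of the front element */
    size_t count;                /* number of stored elements */
    int items[DEQ_CAP];
} deque_t;

static void deq_init(deque_t *q) {
    assert(q != NULL);
    atomic_flag_clear(&q->lock);
    q->head = 0;
    q->count = 0;
}

/* The only non-ordinary primitive used: an atomic test-and-set. */
static void deq_lock(deque_t *q) {
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;                        /* spin until the flag was previously clear */
}

static void deq_unlock(deque_t *q) {
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

static bool deq_push_back(deque_t *q, int v) {
    deq_lock(q);
    bool ok = q->count < DEQ_CAP;
    if (ok) {
        q->items[(q->head + q->count) % DEQ_CAP] = v;
        q->count++;
    }
    deq_unlock(q);
    return ok;                   /* false means the queue was full */
}

static bool deq_push_front(deque_t *q, int v) {
    deq_lock(q);
    bool ok = q->count < DEQ_CAP;
    if (ok) {
        q->head = (q->head + DEQ_CAP - 1) % DEQ_CAP;  /* step back, wrapping */
        q->items[q->head] = v;
        q->count++;
    }
    deq_unlock(q);
    return ok;
}

static bool deq_pop_front(deque_t *q, int *out) {
    deq_lock(q);
    bool ok = q->count > 0;
    if (ok) {
        *out = q->items[q->head];
        q->head = (q->head + 1) % DEQ_CAP;
        q->count--;
    }
    deq_unlock(q);
    return ok;                   /* false means the queue was empty */
}

static bool deq_pop_back(deque_t *q, int *out) {
    deq_lock(q);
    bool ok = q->count > 0;
    if (ok) {
        *out = q->items[(q->head + q->count - 1) % DEQ_CAP];
        q->count--;
    }
    deq_unlock(q);
    return ok;
}
```

Each operation is a handful of loads, stores and arithmetic wrapped
around a single atomic primitive. A spinlock like this is only safe
among threads that are all actually running; against interrupts on
the same processor it can deadlock, which is one reason an in-kernel
version would reach for privileged operations instead.
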
    Providing user-level access to these instructions was only
important because Gould's Real-Time Simulator customers regularly did
explicit parallel programming (the queue operations were message
primitives, if you will) - and they *did* *not* want to have to run
their entire application privileged.

Now, I would much rather have decomposed these CISCy operations into
a RISCy set of primitives that might be used to implement other
multiprocessing primitives. But the normal arithmetic and logic
primitives are not sufficient.
    Several times I have attempted to start discussion in comp.arch
about what such primitives might be - here is what I remember of the
previous discussions (I have the highlights archived, but not easily
accessible). It may be time for another go-around.


RISCy PRIMITIVES FOR PARALLEL PROCESSING
========================================

Active Memory
-------------

Some manufacturers implement special kinds of memory that provide
extended semantics suitable for multiprocessing - like Sequent's
MULTIBUS semaphore locations. Somebody at MIPS said something similar,
but it wasn't clear if the approach was actually used.
    PRO: no extra CPU operations.
    CON: resource allocation problem - especially if you want to make
	these accessible to ordinary user processes - and if the active
	memory is limited in size (the HEP's full/empty bits on all
	memory locations wouldn't suffer this).

Atomicity Boundaries
--------------------

Letting the user block interrupts, but only for a maximum amount of
time (erroring if still blocked at the end of the max time), has been
tried, e.g. in Honeywell and Norsk Data(?) machines.

Similarly, architecting that interrupts may occur only at, say,
modulo N instruction boundaries (either static or dynamic - static
requiring special provision for loops) can let the user specify some
uninterruptible (atomic wrt interrupts on the same processor)
operations, with help from the language system to locate operations
between the appropriate boundaries.
    Apparently the ARM does something similar, only checking for
interrupts at branches.

Obviously, approaches similar to those for interrupts can be applied
to multiple processor synchronization.

Split Synchronization
---------------------

My contribution has been to point out that such synchronization
operations may take a long time, and may be split into
START-SYNCHRONIZATION and STOP-SYNCHRONIZATION operations - and that
other instructions may be placed between these synchronization
points. So you do not necessarily have to stop the entire processor
when performing one of these operations.
    Recently a similar idea appeared in ASPLOS III in the paper titled
"Fuzzy Barriers..."


Andy "Krazy" Glew   aglew@urbana.mcd.mot.com   uunet!uiucdcs!mcdurb!aglew
Motorola Microcomputer Division, Champaign-Urbana Design Center
1101 E. University, Urbana, Illinois 61801, USA.

My opinions are my own, and are not the opinions of my employer, or
any other organisation. I indicate my company only so that the reader
may account for any possible bias I may have towards our products.
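
The START-SYNCHRONIZATION / STOP-SYNCHRONIZATION split described
under "Split Synchronization" can be sketched in software as a
two-phase barrier. This is only an illustration using POSIX threads,
not a proposal for how hardware would do it; split_barrier_t,
sb_start, sb_stop and sb_demo are hypothetical names. The assumed
usage rule is that each thread calls sb_stop before its next
sb_start.

```c
#include <assert.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t cv;
    unsigned parties;    /* number of participating threads */
    unsigned arrived;    /* arrivals so far in the current phase */
    unsigned phase;      /* count of completed phases */
} split_barrier_t;

static void sb_init(split_barrier_t *b, unsigned parties) {
    assert(parties > 0);
    pthread_mutex_init(&b->mu, NULL);
    pthread_cond_init(&b->cv, NULL);
    b->parties = parties;
    b->arrived = 0;
    b->phase = 0;
}

/* START-SYNCHRONIZATION: announce arrival and return at once. */
static unsigned sb_start(split_barrier_t *b) {
    pthread_mutex_lock(&b->mu);
    unsigned ticket = b->phase;
    if (++b->arrived == b->parties) {   /* last arrival completes the phase */
        b->arrived = 0;
        b->phase++;
        pthread_cond_broadcast(&b->cv);
    }
    pthread_mutex_unlock(&b->mu);
    return ticket;
}

/* STOP-SYNCHRONIZATION: wait until the ticket's phase has completed. */
static void sb_stop(split_barrier_t *b, unsigned ticket) {
    pthread_mutex_lock(&b->mu);
    while (b->phase == ticket)
        pthread_cond_wait(&b->cv, &b->mu);
    pthread_mutex_unlock(&b->mu);
}

typedef struct { split_barrier_t *b; int id; int *slot; int seen; } worker_t;

static void *worker(void *arg) {
    worker_t *w = arg;
    w->slot[w->id] = w->id + 1;         /* must be visible after the sync */
    unsigned t = sb_start(w->b);
    volatile int local = 0;             /* independent work that overlaps */
    for (int i = 0; i < 1000; i++)      /* with the synchronization */
        local += i;
    sb_stop(w->b, t);
    w->seen = w->slot[1 - w->id];       /* safe: the other thread arrived */
    return NULL;
}

/* Runs two workers; each reports the value the other one wrote. */
static int sb_demo(void) {
    split_barrier_t b;
    int slot[2] = {0, 0};
    worker_t w[2] = { {&b, 0, slot, 0}, {&b, 1, slot, 0} };
    pthread_t th[2];
    sb_init(&b, 2);
    for (int i = 0; i < 2; i++)
        pthread_create(&th[i], NULL, worker, &w[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(th[i], NULL);
    pthread_cond_destroy(&b.cv);
    pthread_mutex_destroy(&b.mu);
    return w[0].seen * 10 + w[1].seen;
}
```

The loop between sb_start and sb_stop stands in for the "other
instructions" that can be placed between the two synchronization
points: a thread only blocks in sb_stop if the others have not
arrived by the time its independent work is done.
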