Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!unmvax!deimos.cis.ksu.edu!uxc!uxc.cso.uiuc.edu!mcdurb!aglew
From: aglew@mcdurb.Urbana.Gould.COM
Newsgroups: comp.arch
Subject: Re: Complex Instructions
Message-ID: <28200307@mcdurb>
Date: 7 May 89 17:40:00 GMT
References: <38629@bbn.COM>
Lines: 96
Nf-ID: #R:bbn.COM:38629:mcdurb:28200307:000:4740
Nf-From: mcdurb.Urbana.Gould.COM!aglew    May  7 12:40:00 1989


>prem@crackle.amd.com (Prem Sobel) writes
>>Many years ago, there was a machine called the Interdata Model 70 which
>>had instructions for atomically adding or removing items from circular
>>double ended queues. The data structure was defined reasonably efficiently
>>and the machine was microcoded.
>>
>>Yet no one, no compiler seriously used these instructions. The reason was,
>>amazingly, that individual instructions were faster!!! I never looked at
>>the microcode, so I cannot comment on why that was.
>
>    Is this sort of finding not precisely the reason behind RISC architecture?
>    The realisation that many an instruction to handle a relatively
>    sophisticated job turned out to be slower than a series of very simple
>    instructions to do the equivalent thing?
>
>    I myself am not quite clear yet on why this should be ( is there perhaps a
>    reason that larger, more sophisticated microcode routines can't be made
>    optimally fast?).  Yet it does seem to be a fact of life.  
>
>
>    Alastair Milne

The Gould NP1 had queue manipulation instructions that provided atomicity.
And, yes, the OS avoided using them, because a hand-crafted set of individual
instructions was faster.

But the CISCy queue manipulation instructions were left in because the
hand-crafted sequences used privileged operations not callable from user mode.
Creating a system call to do the same operation was more expensive than
leaving the microcoded instructions in - especially since the system call would
have had to do privilege checks that the hand-crafted inside-the-OS version
didn't.
    Providing user-level access to these instructions was only important
because Gould's Real-Time Simulator customers regularly did explicit parallel
programming (the queue operations were message primitives, if you will)
- and they *did* *not* want to have to run their entire application
privileged.

Now, I would much rather have decomposed these CISCy operations into a RISCy
set of primitives that might be used to implement other multiprocessing
primitives. But the normal arithmetic and logic primitives are not sufficient.
Several times I have attempted to start discussion in comp.arch about what
such primitives might be - here is what I remember of the previous
discussions (I have the highlights archived, but not easily accessible).
It may be time for another go-around.
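
    To make the decomposition idea concrete, here is a minimal sketch of a
message-list operation built from one small atomic primitive rather than a
microcoded queue instruction. It assumes a compare-and-swap style primitive,
expressed through C11 <stdatomic.h>; that choice is an assumption for
illustration and is not one of the candidates discussed in this post. The
names msg, msg_list, msg_push and msg_take_all are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical message node; names are illustrative only. */
struct msg {
    struct msg *next;
    int payload;
};

/* Head of a shared LIFO list of pending messages. */
struct msg_list {
    _Atomic(struct msg *) head;
};

/* Push one node. The compare-and-swap is retried until no other
   processor changed the head between our load and our store. */
static void msg_push(struct msg_list *l, struct msg *m)
{
    assert(m != NULL);
    struct msg *old = atomic_load(&l->head);
    do {
        m->next = old;  /* 'old' is refreshed by a failed exchange */
    } while (!atomic_compare_exchange_weak(&l->head, &old, m));
}

/* Detach the whole list with a single atomic exchange. Taking
   everything at once sidesteps the ABA hazard that a
   compare-and-swap based single-node pop would have. */
static struct msg *msg_take_all(struct msg_list *l)
{
    return atomic_exchange(&l->head, NULL);
}
```

    Note that this is a LIFO list, not the circular double-ended queue of
the original instructions; a receiver that needs arrival order would have to
reverse the detached list. Neither routine needs a privileged operation,
which is the property the user-mode customers cared about.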

RISCy PRIMITIVES FOR PARALLEL PROCESSING
========================================

Active Memory
-------------
    Some manufacturers implement special kinds of memory that provide extended
semantics suitable for multiprocessing - like Sequent's MULTIBUS semaphore
locations.  Somebody at MIPS said something similar, but it wasn't clear
if the approach was actually used.
    PRO: no extra CPU operations.
    CON: resource allocation problem - especially if you want to make
these accessible to ordinary user processes - and if the active memory is
limited in size (the HEP's full/empty bits on all memory locations wouldn't
suffer this).
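
    For readers who have not met full/empty bits, the following is a rough
software emulation of one such location, assuming the usual description of
the semantics: a write succeeds only when the cell is empty and leaves it
full, and a read succeeds only when the cell is full and leaves it empty.
Real active memory would make the access wait in hardware; this sketch
returns false instead so the caller can retry. The names fe_cell,
fe_try_write and fe_try_read are hypothetical, and the intermediate BUSY
state is an artifact of the emulation, not of any real machine.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { CELL_EMPTY, CELL_BUSY, CELL_FULL };

/* One emulated memory location with a full/empty tag. */
struct fe_cell {
    atomic_int state;  /* CELL_EMPTY, CELL_BUSY or CELL_FULL */
    int value;         /* only touched while state is CELL_BUSY */
};

/* Write only if the cell is empty: EMPTY -> BUSY -> FULL. */
static bool fe_try_write(struct fe_cell *c, int v)
{
    int expected = CELL_EMPTY;
    if (!atomic_compare_exchange_strong(&c->state, &expected, CELL_BUSY))
        return false;              /* cell was full or in transit */
    c->value = v;
    atomic_store(&c->state, CELL_FULL);
    return true;
}

/* Read only if the cell is full: FULL -> BUSY -> EMPTY. */
static bool fe_try_read(struct fe_cell *c, int *out)
{
    assert(out != NULL);
    int expected = CELL_FULL;
    if (!atomic_compare_exchange_strong(&c->state, &expected, CELL_BUSY))
        return false;              /* cell was empty or in transit */
    *out = c->value;
    atomic_store(&c->state, CELL_EMPTY);
    return true;
}
```

    The emulation costs extra CPU operations on every access, which is
exactly the overhead the PRO line above says the hardware version avoids.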

Atomicity Boundaries
--------------------
    Letting the user block interrupts, but only for a maximum amount of time
(erroring if still blocked at the end of the max time) has been tried, e.g.
in Honeywell and Norsk Data(?) machines.
    Similarly, architecting that interrupts may occur only at, say, modulo
N instruction boundaries (either static or dynamic - static requiring special
provision for loops) can let the user specify some uninterruptible
(atomic wrt interrupts on the same processor) operations, with help from
the language system to locate operations between the appropriate boundaries.
    Apparently the ARM does something similar, only checking for interrupts 
at branches.
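
    The help needed from the language system in the static modulo-N case
can be sketched as a small placement check. The model below is an
assumption made for illustration: instructions are numbered from 0,
interrupts are taken only immediately before an instruction whose index is
a multiple of n, and the helper names are hypothetical. It covers
straight-line code only and ignores the special provision for loops.

```c
#include <assert.h>
#include <stdbool.h>

/* First interrupt check point at or after instruction index pc,
   assuming check points sit at multiples of n. */
static unsigned next_boundary(unsigned pc, unsigned n)
{
    assert(n > 0);
    return (pc + n - 1) / n * n;
}

/* True if instructions start .. start+len-1 contain no check point
   strictly inside the sequence, so no interrupt can arrive after
   the first instruction and before the last one completes. */
static bool is_uninterruptible(unsigned start, unsigned len, unsigned n)
{
    return len == 0 || next_boundary(start + 1, n) >= start + len;
}

/* Number of no-ops an assembler or compiler would place before the
   sequence so that it fits between two check points. */
static unsigned pad_needed(unsigned start, unsigned len, unsigned n)
{
    assert(len <= n);  /* a longer sequence can never fit */
    if (is_uninterruptible(start, len, n))
        return 0;
    return next_boundary(start, n) - start;
}
```

    With n = 8, a four-instruction sequence starting at index 6 straddles
the check point at 8, so pad_needed reports two no-ops; the same sequence
starting at index 4 needs none.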
    
Obviously, approaches similar to those for interrupts can be applied to 
multiple processor synchronization.
    
Split Synchronization
---------------------
    My contribution has been to point out that such synchronization operations
may take a long time, and may be split into START-SYNCHRONIZATION and
STOP-SYNCHRONIZATION operations - and that other instructions may be placed 
between these synchronization points. So you do not necessarily have to
stop the entire processor when performing one of these operations.
    Recently a similar idea appeared in ASPLOS III in the paper titled
"Fuzzy Barriers..."
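
    A minimal software sketch of the split form, assuming C11 atomics and a
fixed number of participants. The names split_barrier, sync_start,
sync_done and sync_stop are hypothetical; this is an illustration of the
START/STOP split, not the mechanism of any particular machine or of the
ASPLOS paper.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct split_barrier {
    atomic_uint arrived;     /* participants that have started this round */
    atomic_uint generation;  /* bumped when the last participant arrives */
    unsigned parties;        /* fixed number of participants */
};

/* START-SYNCHRONIZATION: announce arrival without waiting.
   Returns a ticket identifying the round that was joined. */
static unsigned sync_start(struct split_barrier *b)
{
    assert(b->parties > 0);
    unsigned ticket = atomic_load(&b->generation);
    if (atomic_fetch_add(&b->arrived, 1) + 1 == b->parties) {
        atomic_store(&b->arrived, 0);         /* reset for the next round */
        atomic_fetch_add(&b->generation, 1);  /* release everyone */
    }
    return ticket;
}

/* Non-blocking check: has the round named by ticket completed? */
static bool sync_done(struct split_barrier *b, unsigned ticket)
{
    return atomic_load(&b->generation) != ticket;
}

/* STOP-SYNCHRONIZATION: wait only if the others are not there yet. */
static void sync_stop(struct split_barrier *b, unsigned ticket)
{
    while (!sync_done(b, ticket))
        ;  /* spin; a real system might yield or sleep here */
}
```

    A participant would call sync_start, carry on with whatever
instructions do not depend on the other processors, and call sync_stop only
when it actually needs the synchronization to have completed. The more
independent work fits between the two calls, the less time is spent
spinning.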


Andy "Krazy" Glew   aglew@urbana.mcd.mot.com   uunet!uiucdcs!mcdurb!aglew
   Motorola Microcomputer Division, Champaign-Urbana Design Center
	   1101 E. University, Urbana, Illinois 61801, USA.
   
My opinions are my own, and are not the opinions of my employer, or
any other organisation. I indicate my company only so that the reader
may account for any possible bias I may have towards our products.