Xref: utzoo comp.sys.mips:227 comp.arch:11876
Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!uwm.edu!gem.mps.ohio-state.edu!tut.cis.ohio-state.edu!pt.cs.cmu.edu!spice.cs.cmu.edu!af
From: af@spice.cs.cmu.edu (Alessandro Forin)
Newsgroups: comp.sys.mips,comp.arch
Subject: Mips, Mach, test-and-set
Message-ID: <6565@pt.cs.cmu.edu>
Date: 17 Oct 89 23:05:33 GMT
Organization: Carnegie-Mellon University, CS/RI
Lines: 64


A number of posts in various newsgroups raised the issue of how to write
parallel programs for MIPS-based machines, given the lack of any form of
interlocked instruction in the instruction set.  This article describes how
the problem was solved in the port of Mach to the MIPS M-500 and
DECStation-3100 done at CMU.

We wanted a machine-independent solution, but with attention to performance
issues.  User programs should be allowed to run unmodified on different MIPS
boxes running Mach, modulo the byte-order.
We decided to provide two solutions: one that is guaranteed to work under
all conditions, and one that might be more efficient, but which might fail
if the hardware or the user program do not meet certain restrictions.
Programs are only guaranteed to be portable if they use the former solution,
as the second might make use of special hardware features only available on
some specific MIPS (multi)processor.  Users are expected to write parallel
programs using appropriate compilers or support libraries, such as the Mach
C-Threads library.  The library will export the appropriate interface
functions, e.g.  mutex_try_lock() for C-Threads.  

The first solution is implemented by adding a new instruction to
the MIPS architecture, a test-and-set instruction (you guessed it).
It is encoded as a "special" instruction with "function" code 
	#define tas_op	0x0F
which is emulated in the kernel.  Since the MIPS architecture
can handle exceptions very efficiently the emulation is also
efficient and gives a constant value of less than 8usec on the Pmax,
in the absence of page faults.
If one is smart enough to use test-and-testandset spin locks
the latency on a busy lock is then just one memory reference.

The second solution for the Pmax is currently based on a new algorithm
(derived from Lamport's) that does not need hardware support, as the
hardware provides none.  Given the type of cache used on the Pmax
(write-through) the performance of this solution is not as good as one would
like: about 2usecs in the good outcome.  This algorithm is also only working
for uniprocessor machines, and known to be failable on multiprocessor
machines.  We will continue to explore this software alternative for the
user-level libraries and see if we can make it a guaranteed,
machine-independent solution.  Clearly, a multiprocessor without
cache-coherency in hardware is hopeless in this respect, but might stand a
chance of implementing the first solution.

Since Lamport's algorithms were mentioned, it is worth noting that
those described in the TOCS paper "A Fast Mutual-Exclusion Algorithm"
are inappropriate for user-level programs on MIPS.  The first problem
is that Lamport explicitly noted that the first (fast) algorithm was
only intended for interrupt-level routines in the kernel, e.g. with no
pre-emption.  Too many people fail to realize this, even if it is
clearly stated in the paper.
The second, much more subtle problem, is that the other algorithms described
in the paper use an atomic "compare register with memory" instruction, which
is not available on any RISC machine.  These algorithms are instead directly
applicable to CISC machines like the Vax.

sandro-

	Alessandro Forin
	School of Computer Science
	Carnegie Mellon University
	5000 Forbes Avenue
	Pittsburgh, PA 15213
	ARPA: af@cs.cmu.edu
	Phone: (412) 268-6861