Xref: utzoo comp.sys.mips:227 comp.arch:11876 Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!uwm.edu!gem.mps.ohio-state.edu!tut.cis.ohio-state.edu!pt.cs.cmu.edu!spice.cs.cmu.edu!af From: af@spice.cs.cmu.edu (Alessandro Forin) Newsgroups: comp.sys.mips,comp.arch Subject: Mips, Mach, test-and-set Message-ID: <6565@pt.cs.cmu.edu> Date: 17 Oct 89 23:05:33 GMT Organization: Carnegie-Mellon University, CS/RI Lines: 64 A number of posts in various newsgroups raised the issue of how to write parallel programs for MIPS-based machines, given the lack of any form of interlocked instruction in the instruction set. This article describes how the problem was solved in the port of Mach to the MIPS M-500 and DECStation-3100 done at CMU. We wanted a machine-independent solution, but with attention to performance issues. User programs should be allowed to run unmodified on different MIPS boxes running Mach, modulo the byte-order. We decided to provide two solutions: one that is guaranteed to work under all conditions, and one that might be more efficient, but which might fail if the hardware or the user program do not meet certain restrictions. Programs are only guaranteed to be portable if they use the former solution, as the second might make use of special hardware features only available on some specific MIPS (multi)processor. Users are expected to write parallel programs using appropriate compilers or support libraries, such as the Mach C-Threads library. The library will export the appropriate interface functions, e.g. mutex_try_lock() for C-Threads. The first solution is implemented by adding a new instruction to the MIPS architecture, a test-and-set instruction (you guessed it). It is encoded as a "special" instruction with "function" code #define tas_op 0x0F which is emulated in the kernel. Since the MIPS architecture can handle exceptions very efficiently the emulation is also efficient and gives a constant value of less than 8usec on the Pmax, in the absence of page faults. If one is smart enough to use test-and-testandset spin locks the latency on a busy lock is then just one memory reference. The second solution for the Pmax is currently based on a new algorithm (derived from Lamport's) that does not need hardware support, as the hardware provides none. Given the type of cache used on the Pmax (write-through) the performance of this solution is not as good as one would like: about 2usecs in the good outcome. This algorithm is also only working for uniprocessor machines, and known to be failable on multiprocessor machines. We will continue to explore this software alternative for the user-level libraries and see if we can make it a guaranteed, machine-independent solution. Clearly, a multiprocessor without cache-coherency in hardware is hopeless in this respect, but might stand a chance of implementing the first solution. Since Lamport's algorithms were mentioned, it is worth noting that those described in the TOCS paper "A Fast Mutual-Exclusion Algorithm" are inappropriate for user-level programs on MIPS. The first problem is that Lamport explicitly noted that the first (fast) algorithm was only intended for interrupt-level routines in the kernel, e.g. with no pre-emption. Too many people fail to realize this, even if it is clearly stated in the paper. The second, much more subtle problem, is that the other algorithms described in the paper use an atomic "compare register with memory" instruction, which is not available on any RISC machine. These algorithms are instead directly applicable to CISC machines like the Vax. sandro- Alessandro Forin School of Computer Science Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA 15213 ARPA: af@cs.cmu.edu Phone: (412) 268-6861