Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!ucsd!rutgers!rochester!pt.cs.cmu.edu!andrew.cmu.edu!pgc+
From: pgc+@andrew.cmu.edu (Paul G. Crumley)
Newsgroups: comp.arch
Subject: Re: RT/PC Unaligned Accesses
Message-ID:
Date: 2 Apr 89 01:23:57 GMT
References: <4618@pt.cs.cmu.edu>
Organization: Information Technology Center, Carnegie Mellon, Pittsburgh, PA
Lines: 139
In-Reply-To: <4618@pt.cs.cmu.edu>

Hello,

For those that are not familiar with the design of the RT/PC processor, I will describe what is done for unaligned access to memory for data accesses and instruction fetches. All of this is described in detail in the "IBM RT PC Hardware Technical Reference Volume I", part number 75X0232.

Instructions on the RT/PC must be halfword aligned. (Each word contains four 8-bit bytes.) The Instruction Address Register (IAR) has bit 0, the LSB, forced to a zero. If you execute a branch to an absolute location (most branches are relative and contain the displacement in halfwords), thus loading the whole IAR, the LSB is silently forced to a zero. If one tries to load the IAR using the privileged instruction MTS (Move To SCR (System Control Register)), the LSB is forced to a value of zero. These absolute branches and the MTS instruction are the only ways that a program can attempt to get an odd value in the IAR. In both cases the resulting state of the CPU is completely defined.

The processor has instructions that load and store bytes, halfwords and words. For halfword accesses (both load and store) the LSB is silently forced to a zero, and for word accesses the two LSBs are silently forced to be zeroes. Load and store instructions have no effect on the condition bits, and no exceptions are generated when the low order bit or bits are forced to zero.

There are two instructions which are used to move a group of registers to or from storage as a unit. These are the Load Multiple (LM) and Store Multiple (STM) instructions.
These instructions move N registers to an area of memory 4*N bytes in size. This area must start on a word aligned boundary.

For both instruction fetches and data loads and stores it is possible to generate an exception from the MMU (Memory Management Unit) or other storage controllers on the RSC (ROMP (RT's CPU) Storage Channel). It is possible for instructions, and the source or target area for a LM or STM instruction, to span a page boundary. If such a condition causes a page fault exception, all of the information needed to restart the instruction fetch or the data access is available to the processor.

Now that we all know what is being discussed, let's see if there really are problems with this scheme. I will attempt to discuss some of Don's complaints, as these are fairly representative of what I have heard from other people in the past about the way the RT/PC handles unaligned accesses.

"How can one write a correct program on a machine that drops the bits and doesn't report any exceptions?"

If you have a program that runs correctly on some other processor and you have used language supported ways to create new objects, this program will run correctly on an RT/PC. If your program assumes you can stuff an integer into an array of floating point numbers, you might have problems. If the language doesn't understand type coercions or otherwise support the allocation of new objects, you are out of luck. Anything you do is not portable, so it has to be isolated from the rest of the program anyway.

"OK, what about importing pointers to things? How can one write correct code in that case?"

Pointers that one gets from an untrusted object always have to be checked. They have to be checked to make sure they won't page fault. They have to be checked to be certain they are pointing to a valid instance of the class of object to which they claim to point. In most cases, this checking process is non-trivial.
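
The alignment portion of such a check can be sketched in a few lines. This is only an illustration in Python, with addresses modeled as plain integers; the helper name is_aligned is invented for this sketch and does not come from any RT/PC library or from the Technical Reference.

```python
# Illustrative only: is_aligned() is a hypothetical helper, and addresses
# are modeled as plain integers rather than real machine pointers.

def is_aligned(addr, size):
    """Return True if addr is a multiple of size.

    size is the access size in bytes (1, 2 or 4 for byte, halfword and
    word) and must be a power of two for the mask trick to be valid.
    """
    if size <= 0 or (size & (size - 1)) != 0:
        raise ValueError("size must be a positive power of two")
    # An address is aligned when all bits below the size boundary are zero.
    return (addr & (size - 1)) == 0
```

A caller holding an imported address would test is_aligned(addr, 4) before treating it as a word pointer. The test is a single mask and compare, which is small next to the work of validating what the pointer refers to.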
The work of checking the low order bits is simple and quick compared to the work required to determine if the pointer you just got is really a pointer to a valid object.

"Doesn't the need to pad all the data structures waste a lot of space?"

For most languages, the compilers are free to rearrange the order of the elements of a structure any way they desire. With smart compilers, a worst case of 3 bytes per structure is used for padding. In compensation, programs operate faster when the data is loaded and stored in single memory system cycles rather than requiring two such cycles. If a particular language (COBOL?) really agrees to order the elements of structures in exactly the way the programmer desires, there are instructions in the RT/PC's repertoire that allow those elements to be moved correctly.

"Oh, great...now we have to rewrite all our test cases."

No comment.

"Come on, there must be SOME case where a piece of code that works on a machine that fixes up unaligned memory access doesn't work on the RT/PC and the RT/PC doesn't complain!"

Well, yes, there are such cases. One can imagine a program that grabs an arbitrary instruction stream, smashes that instruction stream into memory starting at an arbitrary location, and jumps to the instructions. Another example is a simple-minded malloc that hands back arbitrary addresses. Such programs might work on some machine but fail on the RT/PC. Though it is simple to imagine such program fragments, in practice the code that does these types of functions was isolated in language constructs, library routines and operating system services long ago.

"Look, Paul, how can you possibly defend such an inelegant piece of garbage? This seems so out of character for you!"

Well, what can I say? The RT/PC executes correct programs correctly.
I have never had a problem with the way the RT/PC accesses memory, and I have written lots of code that manipulates imported pointers, implements device drivers, whips together instruction streams on the fly, and performs other non-trivial functions.

I believe the silicon saved by not needing the shifters and more complex microcode was put to better use in a variety of ways. Some of these uses include:

-- Programs execute more reliably on an RT/PC than on many other processors. The RT/PC implements parity generation and checking with automatic retransmission on inter-module busses. This causes transient errors that would be undetected on other processors to be detected and fixed on the RT/PC.

-- The RT/PC's MMU allows access to storage to be regulated with very fine granularity. This allows options for software designs that were not feasible on many other machines. Debuggers can allow programs to execute at full speed and trap on accesses to an arbitrarily large number of data objects. It is possible to allow multiple processes to share data areas and to use new ways of synchronizing changes to that data. Mapped files that must be consistent across multiple data spaces or processors are possible.

-- The RT/PC's MMU provides many facilities that greatly enhance the system's performance. The MMU defines a very large virtual address space (2**40). This space can be used in a variety of ways by programs and operating systems. There are plenty of TLB (Translation Look-aside Buffer) entries, thus providing good virtual memory performance. Hardware is provided that allows quick process switches. DMA can use real or virtual addresses. Memory accesses on the RT/PC are interleaved, thus providing faster data loads and stores.

Please note that all of the above listed items refer to the MMU. It turns out that the CPU generates all the address bits for memory system access. (The CPU does force the low order bit of the IAR to zero, thus limiting instruction fetches to halfword boundaries.)
It is the MMU chip that drops the low order bits for unaligned accesses. I have ignored this fact till now since it is an implementation detail. The system looks the same to the programmer no matter where the bits are dropped. Still, I want to point out that the way the RT/PC implements unaligned accesses is not a product of the RISC processor, but of the MMU. Other devices on the RSC are free to support unaligned accesses in any manner they like.

As with all implementations, there are trade-offs. The silicon saved in the MMU by not supporting arbitrary alignment was put to good use. The RT/PC operates more reliably and with higher performance than many similar systems. When good compilers are used, the current RT/PCs, models 12X & 13X, perform very well.

In conclusion, I hope this note helps to clear up the issues involving the manner in which the RT/PC implements unaligned accesses. Though the dropping of low order address bits may seem to prevent any hope of producing correct object code, we see that such concerns are not a problem. If there are problems with the way a program accesses data on the RT/PC, there are problems with that code on other machines too.

Best regards,
Paul
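
As a concrete illustration of the address forcing described in this note, the following is a toy model in Python. It is only a sketch under stated assumptions: the name effective_address and the table ACCESS_SIZES are made up here, addresses are plain integers, and nothing in it is taken from the Technical Reference. It models only the silent clearing of low order bits, not exceptions, paging, or where in the hardware the bits are dropped.

```python
# Toy model of the low-order-bit forcing described in the post.
# effective_address() and ACCESS_SIZES are hypothetical names for this
# sketch; this is not RT/PC code.

ACCESS_SIZES = {"byte": 1, "halfword": 2, "word": 4}


def effective_address(addr, kind):
    """Return the address actually used for an access of the given kind.

    Low order bits are silently cleared: none for a byte access, one for
    a halfword access, two for a word access. No error is raised for an
    unaligned address, mirroring the behavior the post describes.
    """
    size = ACCESS_SIZES[kind]
    # Clearing the bits below the size boundary rounds the address down.
    return addr & ~(size - 1)


if __name__ == "__main__":
    for kind in ("byte", "halfword", "word"):
        print(kind, hex(effective_address(0x1003, kind)))
```

In this model a word load from 0x1003 quietly becomes a word load from 0x1000, while a byte load from 0x1003 is unchanged. A program that only ever generates aligned addresses sees no difference between this behavior and that of a machine which traps on or fixes up unaligned accesses.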