Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!lll-crg!nike!oliveb!glacier!mips!mash
From: mash@mips.UUCP (John Mashey)
Newsgroups: net.arch
Subject: Re: Delayed Loads
Message-ID: <694@mips.UUCP>
Date: Wed, 17-Sep-86 03:11:01 EDT
Article-I.D.: mips.694
Posted: Wed Sep 17 03:11:01 1986
Date-Received: Fri, 19-Sep-86 22:11:09 EDT
References: <5100133@ccvaxa> <486@weitek.UUCP>
Reply-To: mash@mips.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 65

In article <486@weitek.UUCP> mahar@weitek.UUCP (Mike Mahar) writes:
>In article <5100133@ccvaxa>, aglew@ccvaxa.UUCP writes:
>>
>> There has been some discussion of delayed branches in this newsgroup;
>> can anybody say anything useful about delayed load/stores? Ie. memory
>> access functions that are defined to work the same way as delayed
>> branches, not to take effect until after a few more instructions.
>> (example; further discussion by Mike on Weitek 713[67] alus & sequencer.)

I missed the original of this. Both the Stanford MIPS and the MIPS
Computer Systems R2000 use non-interlocked load instructions. A code
reorganizer rearranges instructions to place independent ones in the
"load-delay-slots". Note that the load latency always exists, whether
software fills the slot or leaves a nop, or the hardware provides an
interlock. The main issues are in deciding how much interlocking and
optimization one can expect from the software, and therefore can leave
out of the hardware.

One can observe several distinct design styles in the handling of
load-delay latency, or of other operations that produce results used
by later instructions. The main ones are:
a) Hardware interlocks, with some parallelism.
b) Non-pipelined implementation, i.e., an extreme form of a) that makes
   most interlocks unnecessary! (but slow)
c) Software scheduling required for correctness everywhere.
d) Some combination of a) and c) required for correctness.
e) Basically a), but designed with c) in mind.

Most computers use a), with the complexity of the interlocks dependent
on the nature of the architecture, and on the aggressiveness of the
pipelining. A good example would be a 360/91, and presumably some of
the faster 30XX machines. Note that a complex architecture may require
considerable hardware to dynamically detect operations that can be done
in parallel, do them that way, and make sure everything is fine when
exceptions happen. [I.e., nothing stops CISCs from being fast, but it
takes a lot of gates!] Branch handling gets exciting, for example.

The "bottom" end of many computer families is often in class b).

I assume that many specialized VLSI parts use c). I don't know of any
aggressively-pipelined general CPU architecture that does this. Can
anybody post some?

Many RISC designs fall in d). For example, MIPS R2000 uses software to
fill load and branch delays, while using hardware interlocks for
integer multiply/divide, and for some floating-point operations. The
HP Spectrum (I think) fills branch delays by software, but uses
hardware for load delays. In either case, at least some of the
hardware design was predicated on the expected nature of compilers,
i.e., things were left out of the hardware based on knowing what the
compilers might be able to do.

The CDC 6600 probably falls in e), i.e., the FORTRAN compiler would
rearrange code to help things go fast, but the hardware could handle
all of the interlocks itself [I think. Anybody know different?]

It's amusing to note that people have done reorganizing compilers for
machines whose architecture provides interlocks, but whose faster
members can run faster given code that has been organized with more
aggressive pipelines in mind. [i.e., big IBM machines]
--
-john mashey DISCLAIMER:
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
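
For readers who want to see what "a code reorganizer fills the
load-delay slot" can mean in practice, here is a minimal sketch in
Python. It is an illustration only, not the actual MIPS reorganizer:
the Instr type, the op names, MOVABLE_OPS and fill_load_delay_slots are
all hypothetical, it assumes a single delay slot, it works on one basic
block, and it only ever moves simple register-to-register operations.
When the instruction after a load reads the loaded register, it hoists
a later independent instruction into the slot, and falls back to a nop
if it finds none.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    """Hypothetical instruction record used only for this sketch."""
    op: str                 # e.g. "lw", "add", "sub", "nop"
    dest: Optional[str]     # register written, or None
    srcs: tuple = ()        # registers read


NOP = Instr("nop", None, ())

# Assumption: only plain ALU ops are moved; loads, stores and branches
# stay where they are, so memory and control ordering is never changed.
MOVABLE_OPS = {"add", "sub", "and", "or", "xor"}


def conflicts(x, y):
    """Conservative register-dependence test between two instructions."""
    return (
        (y.dest is not None and y.dest in x.srcs)      # x reads what y writes
        or (x.dest is not None and x.dest in y.srcs)   # x writes what y reads
        or (x.dest is not None and x.dest == y.dest)   # both write same reg
    )


def fill_load_delay_slots(block):
    """Return a copy of the block with every one-slot load hazard removed."""
    out = list(block)
    i = 0
    while i < len(out) - 1:
        load = out[i]
        if load.op == "lw" and load.dest in out[i + 1].srcs:
            # Hazard: the next instruction wants the value too early.
            for j in range(i + 2, len(out)):
                cand = out[j]
                hopped = out[i + 1:j]   # instructions cand would move past
                if (cand.op in MOVABLE_OPS
                        and not conflicts(cand, load)
                        and not any(conflicts(cand, y) for y in hopped)):
                    out.insert(i + 1, out.pop(j))   # hoist into the slot
                    break
            else:
                out.insert(i + 1, NOP)              # nothing usable: pad
        i += 1
    return out


block = [
    Instr("lw", "r1", ("r4",)),
    Instr("add", "r2", ("r1", "r5")),   # reads r1 right after the load
    Instr("sub", "r3", ("r6", "r7")),   # independent of both
]
print([ins.op for ins in fill_load_delay_slots(block)])
# prints ['lw', 'sub', 'add']
```

The latency is still there in either outcome: the sub merely does
useful work during it, where a nop or a hardware interlock would just
wait. A production reorganizer would also have to reason about memory
references, branches, and what happens across basic-block boundaries,
none of which this sketch attempts.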