Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!lll-crg!lll-lcc!styx!twg-ap!amdahl!pyramid!decwrl!glacier!mips!mash
From: mash@mips.UUCP (John Mashey)
Newsgroups: net.arch
Subject: Re: Delayed Loads
Message-ID: <697@mips.UUCP>
Date: Sat, 20-Sep-86 14:56:41 EDT
Article-I.D.: mips.697
Posted: Sat Sep 20 14:56:41 1986
Date-Received: Sun, 21-Sep-86 18:42:47 EDT
References: <5100133@ccvaxa> <1115@masscomp.UUCP>
Reply-To: mash@mips.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 50

In article <1115@masscomp.UUCP> hank@masscomp.UUCP (Hank Cohen) writes:
>In article <5100133@ccvaxa> aglew@ccvaxa.UUCP writes:
>>
>>There has been some discussion of delayed branches in this newsgroup;
>>can anybody say anything useful about delayed load/stores? Ie. memory
>>access functions that are defined to work the same way as delayed
>>branches, not to take effect until after a few more instructions.
>>
>The benefit of such an approach is similar to that of delayed
>branches. In a pipelined processor the result of an operation is not
>available immediately so if the next instruction in the pipe requires the
>result then the pipeline must be stopped until the result is ready. This
>interlock logic tends to significantly complicate the design of the CPU
>and slows down execution times. Performance of pipelined processors can be
>improved by generating code that does not generate data dependent pipeline
>interlocks. Presumably microprocessors without pipeline interlocks have
>delayed stores as well as delayed branches and for the same reason.

No. Delayed branches and delayed loads are the identical problem, one
each for Instruction and Data. There's no reason to delay stores, since
you already have the data you want. The problem with stores is having
enough buffering to smooth the flow of data to memory, and not stall the
processor waiting for the write to happen.
Solutions to the problem include: register windows (which help the
subset of writes that would be subroutine register saves), stack caches
(which help the writes that are near the top of the stack), and either
write-back caches (like on an 8600), or write-thru caches with write
buffers [e.g., the 1-deep write buffer on the 780, or a MIPS 4-deep
write buffer, or (lots of others)].

>
>An even thornier problem arises if you allow self modifying code to be run
>on your machine. i.e. You build a real von Neumann machine. The problem of
>detecting stores into the instruction stream of a pipelined processor is
>even more difficult than detecting data interdependencies. On the Amdahl
>470 v8 (the pipelined processor that I am most familiar with) the attempt
>is not even made to detect stores into instructions that are already in
>execution. All that they try to do is see if a store is "close" in which
>case the entire pipeline is flushed and serialized.

A pleasant thing about doing an architecture from scratch is the ability
to forbid the use of stores into the instruction stream. [Obviously, you
must be able to create executable code, but you can require a system call
to indicate weird cache manipulations.] There appears to be a fair amount
of hardware in many high-end machines dedicated to worrying about this
[relatively rare] event, which is too bad. Had it been forbidden from day
one, I suspect little performance would be lost; certainly, most
high-level languages don't do this kind of thing anyway.
--
-john mashey	DISCLAIMER:
UUCP:	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253
USPS:	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
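
The point above about write buffers smoothing the flow of stores to
memory can be illustrated with a toy model. This is only a sketch under
stated assumptions: the function name stall_cycles, the one-instruction-
per-cycle processor, the single memory port, and the 4-cycle write
latency are all made up for illustration and are not figures for the
780, the 8600, or any MIPS part. Only the buffer depths 1 and 4 echo
the examples in the text.

```python
def _tick(pending, busy, write_latency):
    """Advance memory by one cycle; retire the head write if it finishes."""
    if pending > 0:
        busy -= 1
        if busy == 0:
            pending -= 1
            # Start draining the next buffered write, if there is one.
            busy = write_latency if pending > 0 else 0
    return pending, busy


def stall_cycles(instrs, depth, write_latency):
    """Count cycles a hypothetical processor stalls on a full write buffer.

    instrs        -- sequence of bools, True where the instruction is a store
    depth         -- number of write-buffer entries (assumed >= 1)
    write_latency -- cycles memory needs to retire one write (assumed value)
    """
    pending = 0  # writes sitting in the buffer, including the one draining
    busy = 0     # cycles left before the head write retires
    stalls = 0
    for is_store in instrs:
        if is_store:
            # A store can only issue when a buffer entry is free.
            while pending == depth:
                stalls += 1
                pending, busy = _tick(pending, busy, write_latency)
            pending += 1
            if pending == 1:
                busy = write_latency  # buffer was empty: start this write
        # The instruction itself takes one cycle; memory drains in parallel.
        pending, busy = _tick(pending, busy, write_latency)
    return stalls


# A burst of four back-to-back stores (say, register saves on a call),
# followed by a dozen non-store instructions.
burst = [True] * 4 + [False] * 12

if __name__ == "__main__":
    for depth in (1, 2, 4):
        print(depth, stall_cycles(burst, depth, write_latency=4))
```

In this model a buffer at least as deep as the burst absorbs it with no
stalls, while a shallower buffer makes the processor wait for memory
whenever the buffer is full. Real machines differ in many ways (read
traffic competing for the memory port, address matching, and so on)
that this sketch deliberately ignores.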