Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!lll-crg!lll-lcc!styx!twg-ap!amdahl!pyramid!decwrl!glacier!mips!mash
From: mash@mips.UUCP (John Mashey)
Newsgroups: net.arch
Subject: Re: Delayed Loads
Message-ID: <697@mips.UUCP>
Date: Sat, 20-Sep-86 14:56:41 EDT
Article-I.D.: mips.697
Posted: Sat Sep 20 14:56:41 1986
Date-Received: Sun, 21-Sep-86 18:42:47 EDT
References: <5100133@ccvaxa> <1115@masscomp.UUCP>
Reply-To: mash@mips.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 50

In article <1115@masscomp.UUCP> hank@masscomp.UUCP (Hank Cohen) writes:
>In article <5100133@ccvaxa> aglew@ccvaxa.UUCP writes:
>>
>>There has been some discussion of delayed branches in this newsgroup;
>>can anybody say anything useful about delayed load/stores? I.e., memory
>>access functions that are defined to work the same way as delayed
>>branches, not to take effect until after a few more instructions.
>>
>The benefit of such an approach is similar to that of delayed 
>branches.  In a pipelined processor the result of an operation is not
>available immediately so if the next instruction in the pipe requires the
>result then the pipeline must be stopped until the result is ready.  This
>interlock logic tends to significantly complicate the design of the CPU
>and slows down execution times.  Performance of pipelined processors can be
>improved by generating code that does not generate data dependent pipeline
>interlocks.  Presumably microprocessors without pipeline interlocks have
>delayed stores as well as delayed branches and for the same reason.
No.
Delayed branches and delayed loads are the same problem, one on the
instruction side and one on the data side: in each case the pipeline is
waiting for something to arrive from memory.  There's no reason to delay
stores, since you already have the data you want.  The problem with stores
is having enough buffering to smooth the flow of data to memory, so the
processor doesn't stall waiting for the write to happen.  Solutions to the
problem include:
register windows (which help the subset of writes that would be subroutine
register saves), stack caches (which help the writes that are near the
top of the stack), and either write-back caches (like on an 8600), or
write-through caches with write buffers [e.g., the 1-deep write buffer
on the 780, or a MIPS 4-deep write buffer, or (lots of others)].
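
To make the buffering point concrete, here is a rough sketch of how write-buffer
depth affects store stalls.  It is a toy model, not a description of any real
machine: the function name stall_cycles, the one-write-at-a-time memory, and the
fixed write latency are all assumptions made for illustration.

```python
from collections import deque

def stall_cycles(store_cycles, depth, write_latency):
    """Toy model: total cycles the CPU stalls on stores.

    store_cycles  -- sorted cycle numbers at which the CPU would issue
                     each store if it never stalled (assumed input)
    depth         -- number of entries in the write buffer (>= 1)
    write_latency -- cycles memory needs per write; writes retire one
                     at a time, in order (assumed)
    """
    pending = deque()      # completion times of buffered writes, oldest first
    last_done = 0          # cycle at which memory finishes its latest write
    stall = 0
    for ideal in store_cycles:
        now = ideal + stall            # earlier stalls push this store later
        # Drop writes that memory has already completed.
        while pending and pending[0] <= now:
            pending.popleft()
        # Buffer full: the CPU waits for the oldest write to drain.
        if len(pending) == depth:
            stall += pending[0] - now
            now = pending.popleft()
        # Memory starts this write once it is free.
        last_done = max(now, last_done) + write_latency
        pending.append(last_done)
    return stall

# A burst of four back-to-back stores (e.g. subroutine register saves)
# against a 4-cycle memory write:
burst = [0, 1, 2, 3]
one_deep = stall_cycles(burst, depth=1, write_latency=4)
four_deep = stall_cycles(burst, depth=4, write_latency=4)
```

In this model the 4-deep buffer absorbs the burst completely, while the 1-deep
buffer stalls on every store after the first.  A buffer only smooths bursts,
though: if stores keep arriving faster than memory can retire them, any finite
depth eventually fills.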
>
>An even thornier problem arises if you allow self modifying code to be run
>on your machine, i.e., you build a real von Neumann machine.  The problem of
>detecting stores into the instruction stream of  a pipelined processor is
>even more difficult than detecting data interdependencies.  On the Amdahl
>470 v8 (the pipelined processor that I am most familiar with) the attempt
>is not even made to detect stores into instructions that are already in
>execution.  All that they try to do is see if a store is "close" in  which
>case the entire pipeline is flushed and serialized.

A pleasant thing about doing an architecture from scratch is the ability to
forbid the use of stores into the instruction stream. [Obviously, you must
be able to create executable code, but you can require a system call to
indicate weird cache manipulations.]  There appears to be a fair amount of
hardware in many high-end machines dedicated to worrying about this
[relatively rare] event, which is too bad.  Had it been forbidden from
day one, I suspect little performance would be lost; certainly, most
high-level languages don't do this kind of thing anyway.
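
A minimal sketch of that contract, assuming an instruction cache that never
watches data stores and a hypothetical sync_icache call standing in for the
system call.  The class and method names are invented for illustration and do
not describe any actual MIPS interface.

```python
class ToyMachine:
    """Toy model: unified memory plus an I-cache that does not snoop stores."""

    def __init__(self, program):
        self.memory = list(program)  # "instructions" are just strings here
        self.icache = {}             # address -> cached instruction

    def fetch(self, addr):
        # Instruction fetch: fill the cache on a miss, then read from it.
        if addr not in self.icache:
            self.icache[addr] = self.memory[addr]
        return self.icache[addr]

    def store(self, addr, value):
        # Data-side write: updates memory only.  No hardware checks
        # whether addr is also sitting in the instruction cache.
        self.memory[addr] = value

    def sync_icache(self, start, length):
        # Hypothetical system call: software announces that it wrote
        # code, and the stale I-cache entries are invalidated.
        for addr in range(start, start + length):
            self.icache.pop(addr, None)


m = ToyMachine(["add", "sub"])
before = m.fetch(0)        # brings "add" into the I-cache
m.store(0, "mul")          # overwrite the instruction in memory
stale = m.fetch(0)         # still the old instruction: nothing snooped
m.sync_icache(0, 1)        # the explicit call the software must make
fresh = m.fetch(0)         # now the new instruction is seen
```

The hardware in this sketch has no store-into-instruction-stream detection at
all; correctness is pushed onto the rare program that generates code, which
must make the explicit call before executing what it wrote.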
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086