Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!lll-crg!nike!oliveb!glacier!mips!mash
From: mash@mips.UUCP (John Mashey)
Newsgroups: net.arch
Subject: Re: Delayed Loads
Message-ID: <694@mips.UUCP>
Date: Wed, 17-Sep-86 03:11:01 EDT
Article-I.D.: mips.694
Posted: Wed Sep 17 03:11:01 1986
Date-Received: Fri, 19-Sep-86 22:11:09 EDT
References: <5100133@ccvaxa> <486@weitek.UUCP>
Reply-To: mash@mips.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 65

In article <486@weitek.UUCP> mahar@weitek.UUCP (Mike Mahar) writes:
>In article <5100133@ccvaxa>, aglew@ccvaxa.UUCP writes:
>> 
>> There has been some discussion of delayed branches in this newsgroup;
>> can anybody say anything useful about delayed load/stores? Ie. memory
>> access functions that are defined to work the same way as delayed
>> branches, not to take effect until after a few more instructions.
>> (example; further discussion by Mike on Weitek 713[67] ALU & sequencer.)

I missed the original of this.  Both the Stanford MIPS and the MIPS Computer
Systems R2000 use non-interlocked load instructions.  A code reorganizer
rearranges instructions to place independent ones in the "load-delay-slots".
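
Here is a toy sketch of what such a reorganizer does, assuming a single
load-delay slot and a straight-line block with register dependencies only
(no stores, no branches).  The names Insn, depends, and fill_load_delay are
made up for illustration; this is not the actual MIPS reorganizer.

```python
from typing import List, NamedTuple, Optional, Tuple


class Insn(NamedTuple):
    op: str                  # mnemonic, e.g. "lw", "add", "nop"
    dest: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]    # registers read


NOP = Insn("nop", None, ())


def depends(later: Insn, earlier: Insn) -> bool:
    """True if `later` may not be moved above `earlier` (register hazards only)."""
    raw = earlier.dest is not None and earlier.dest in later.srcs
    war = later.dest is not None and later.dest in earlier.srcs
    waw = later.dest is not None and later.dest == earlier.dest
    return raw or war or waw


def fill_load_delay(block: List[Insn]) -> List[Insn]:
    """Assumed rule: the instruction right after a load must not read its result."""
    out = list(block)
    i = 0
    while i < len(out):
        ld = out[i]
        if ld.op == "lw" and i + 1 < len(out) and ld.dest in out[i + 1].srcs:
            # Try to hoist a later instruction that is independent of
            # everything it would have to jump over, including the load.
            for j in range(i + 2, len(out)):
                cand = out[j]
                if all(not depends(cand, out[k]) for k in range(i, j)):
                    out.insert(i + 1, out.pop(j))
                    break
            else:
                out.insert(i + 1, NOP)   # nothing suitable: pad the slot
        i += 1
    return out


block = [
    Insn("lw", "r1", ("r9",)),
    Insn("add", "r2", ("r1", "r3")),   # reads r1 in the delay slot: illegal
    Insn("add", "r4", ("r5", "r6")),   # independent, can fill the slot
]
scheduled = fill_load_delay(block)
```

In this example the independent add moves into the slot; if the block held
only the first two instructions, a nop would be inserted instead.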

Note that the load latency always exists, whether software fills the
slot or leaves a nop there, or the hardware provides an interlock.
The main issues are in deciding how much interlocking and optimization
one can expect from the software, and therefore can leave out of the
hardware.
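
To make that concrete, here is a rough cycle count under an assumed toy
model: one instruction issues per cycle, a loaded value becomes usable one
cycle late, and an interlock stalls issue until the sources are ready.
The function name and the tuple format are hypothetical.

```python
def cycles(block, load_delay=1):
    """Count cycles for a list of (op, dest, srcs) tuples on the toy machine."""
    total = 0
    ready = {}   # register -> first cycle in which its value can be read
    for op, dest, srcs in block:
        # An interlock delays issue until every source register is ready.
        start = max([total] + [ready.get(r, 0) for r in srcs])
        total = start + 1
        if dest is not None:
            ready[dest] = total + (load_delay if op == "lw" else 0)
    return total


lw = ("lw", "r1", ("r9",))
use = ("add", "r2", ("r1", "r3"))
other = ("add", "r4", ("r5", "r6"))
nop = ("nop", None, ())

interlocked = cycles([lw, use, other])       # hardware stalls one cycle
padded = cycles([lw, nop, use, other])       # software left a nop
filled = cycles([lw, other, use])            # software filled the slot
```

Under these assumptions the interlocked and nop-padded versions take the
same number of cycles; only the version with a useful instruction in the
slot hides the latency.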

One can observe several distinct design styles in the handling of
load-delay latency, or of other operations that produce results used by
later instructions.  These include:
	a) Hardware interlocks, with some parallelism.
	b) Non-pipelined implementation, i.e., an extreme form of a)
		that makes most interlocks unnecessary! (but slow)
	c) Software scheduling required for correctness everywhere.
	d) Some combination of a) and c) required for correctness.
	e) Basically a), but designed with c) in mind.

Most computers use a), with the complexity of interlock dependent on
the nature of the architecture, and on the aggressiveness of pipelining.
A good example would be a 360/91, and presumably some of the faster 30XX
machines.  Note that a complex architecture may require considerable
hardware to dynamically detect operations that can be done in parallel,
do them that way, and make sure everything is fine when exceptions happen.
[I.e., nothing stops CISCs from being fast, but it takes a lot of gates!]
Branch handling gets exciting, for example.

The "bottom" end of many computer families is often in class b).

I assume that many specialized VLSI parts use c).  I don't know of
any aggressively-pipelined general CPU architecture that does this.
Can anybody post some examples?

d) Many RISC designs fall in d).  For example, MIPS R2000 uses software
to fill load and branch delays, while using hardware interlocks for
integer multiply/divide, and for some floating-point operations.
The HP Spectrum (I think) fills branch delays by software, but uses
hardware for load delays.  In either case, at least some of the hardware
design was predicated on the expected nature of compilers, i.e., things
were left out of the hardware based on knowing what the compilers
might be able to do.

e) The CDC 6600 probably falls in e), i.e., the FORTRAN compiler would
rearrange code to help things go fast, but the hardware could handle
all of the interlocks itself [I think. Anybody know different?]

It's amusing to note that people have done reorganizing compilers for
machines whose architecture provides interlocks, but whose faster
members can run faster given code that has been organized with
more aggressive pipelines in mind. [i.e., big IBM machines]
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086