Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!usc!apple!amdcad!dgcad!dg-rtp!siberia!hamilton
From: hamilton@siberia.rtp.dg.com (Eric Hamilton)
Newsgroups: comp.arch
Subject: Re: Second-generation RISC
Message-ID: <1991Mar26.210643.2052@dg-rtp.dg.com>
Date: 26 Mar 91 21:06:43 GMT
References: <6128@baird.cs.strath.ac.uk> <7425@titcce.cc.titech.ac.jp> <3189@inews.intel.com> <705@seqp4.UUCP>
Sender: usenet@dg-rtp.dg.com (Usenet Administration)
Reply-To: hamilton@siberia.rtp.dg.com (Eric Hamilton)
Organization: Data General Corporation, Research Triangle Park, NC
Lines: 55

In article <705@seqp4.UUCP>, jdarcy@seqp4.ORG (Jeffrey d'Arcy) writes:
|> kds@blabla.intel.com (Ken Shoemaker) writes:
|> >Delayed
|> >branches will probably go away as they really are an artifact of having a
|> >short, fixed length pipeline.
|> 
|> As long as there's at least one pipe stage devoted to instruction fetch and
|> decode, I think delayed branches would make sense.  If this part of the 
|> pipeline ceases to be fixed length, maybe we'll see multiple-instruction
|> branch delays, with a *variable* number of instructions after the branch.
|> 
This is a slightly backwards way of looking at the problem.  The question
is not how many branch delays are implied by the pipeline structure; that
will change from implementation to implementation.  The question is how
many branch delays can be profitably filled by compilers.

Suppose, for example, that we discover that 99% of the time the compiler
can profitably fill three branch delay slots.  Then, regardless of the
pipeline structure of the machine, the delayed branches should support three
delay slots.  Instead of writing:

	instr a
	instr b		(executes in four clocks on a non-superscalar
	branch.n foo     machine with one fetch/decode stage)
	instr c

we write:

	branch.n foo
	instr a		(also executes in four clocks on a non-superscalar
	instr b		 machine with one fetch/decode stage)
	instr c

Since, by hypothesis, the compiler can profitably use all three slots nearly
all the time, we can live with the occasional no-op inserted when all three
slots cannot be filled.
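
To make the no-op padding concrete, here is a minimal sketch in Python of what the compiler's last step might look like. The function name, the list-of-strings representation, and the "nop" mnemonic are all assumptions for illustration; they are not any real compiler's interface.

```python
def emit_delayed_branch(target, candidates, slots=3):
    """Emit a delayed branch followed by exactly `slots` instructions.

    `candidates` holds the instructions the compiler found that can safely
    move after the branch (hypothetical input). Any slot that cannot be
    filled is padded with a no-op.
    """
    chosen = list(candidates[:slots])
    padding = ["nop"] * (slots - len(chosen))
    return ["branch.n " + target] + chosen + padding


# The common case: all three slots are filled with useful work.
full = emit_delayed_branch("foo", ["instr a", "instr b", "instr c"])

# The occasional bad case: only one useful instruction, so two no-ops.
padded = emit_delayed_branch("foo", ["instr a"])
```

The cost of the bad case is one wasted instruction time per no-op, which is tolerable only because, by hypothesis, it is rare.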

But look at what happens when we move this code to a two-instructions-per-cycle
superscalar...
The first sequence now wastes one entire cycle, or two instruction times,
and has an effective execution time of three clocks.  The second sequence
wastes nothing and has an effective execution time of two clocks.  The second
sequence, in other words, executes on a given implementation at its best
possible speed, regardless of the pipeline structure.
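
The clock counts above can be reproduced with a toy model, sketched here in Python. The model is my own simplification, not a description of any real machine: instructions issue in order at a fixed width, nothing stalls on data dependencies, the branch can issue alongside other instructions, and the branch target cannot issue until one fetch/decode stage after the cycle in which the branch issues. The names below are made up for the example.

```python
def effective_clocks(sequence, issue_width, fetch_decode_stages=1):
    """Clocks consumed before the branch target can begin issuing.

    Assumed toy model: in-order issue of `issue_width` instructions per
    cycle, no dependency stalls, and the target becomes issuable
    `fetch_decode_stages` cycles after the cycle in which the branch issues.
    """
    branch_cycle = sequence.index("branch") // issue_width
    # Cycles needed just to issue everything in the sequence (ceiling divide).
    issue_cycles = -(-len(sequence) // issue_width)
    # First cycle (0-based) in which the target could issue; numerically this
    # is also the count of clocks that must elapse before it does.
    target_ready = branch_cycle + 1 + fetch_decode_stages
    return max(issue_cycles, target_ready)


one_slot = ["a", "b", "branch", "c"]      # branch late, one delay slot
three_slots = ["branch", "a", "b", "c"]   # branch early, three delay slots

for width in (1, 2):
    print(width, effective_clocks(one_slot, width),
          effective_clocks(three_slots, width))
```

Under these assumptions both sequences take four clocks at width one, while at width two the one-slot sequence takes three clocks and the three-slot sequence takes two, matching the figures given above.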

The moral of the story is that delayed branching should be designed around
the best that the compiler can do, not the idiosyncrasies of a particular
implementation.  The compiler should be able to generate code that uses all
the available pipelining, without worrying about precisely how much pipelining
that is.

----------------------------------------------------------------------
Eric Hamilton				+1 919 248 6172
Data General Corporation		hamilton@dg-rtp.rtp.dg.com
62 Alexander Drive			...!mcnc!rti!xyzzy!hamilton
Research Triangle Park, NC  27709, USA