Path: utzoo!mnetor!uunet!husc6!mailrus!ames!amdahl!nsc!voder!apple!bcase
From: bcase@Apple.COM (Brian Case)
Newsgroups: comp.arch
Subject: Re: The WM Machine
Message-ID: <9331@apple.Apple.Com>
Date: 5 May 88 18:39:36 GMT
References: <5339@aw.sei.cmu.edu>
Reply-To: bcase@apple.UUCP (Brian Case)
Organization: Apple Computer Inc, Cupertino, CA
Lines: 54

In article <5339@aw.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
>Two Operations per Instruction
>------------------------------
>This feature seems to me quite unwarranted.  One major point of a RISC
>design is that instructions are simple enough to be executed quickly.
>But here is a machine where, within one instruction, there are two
>operations, the one gated on the result of the other.  This forces the
>instruction to take twice as long, regardless of how many execution
>units might be available.

No, no, no.  The data dependency rule is included specifically to allow
a pipeline stage between the two ALUs.  Wulf is clearly assuming that the
pipeline stage is implemented (at least it is clear to me).  With that
stage in place, the basic cycle of the WM machine should be no longer than
that of any other RISC machine.
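
To make the pipelining argument concrete, here is a toy model (mine, not
Wulf's).  It assumes each instruction has the form (a op1 b) op2 c, that
operands are plain values so there are no dependencies between instructions,
and that a latch sits between ALU1 and ALU2.  The name run_pipelined and the
tuple encoding are made up for illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "&": operator.and_, "|": operator.or_}

def run_pipelined(instrs):
    """Each instr is (op1, a, b, op2, c), meaning (a op1 b) op2 c.

    Returns (results, cycles).  ALU1 and ALU2 each do one operation per
    cycle; the latch carries ALU1's result to ALU2 for the next cycle.
    """
    pending = list(instrs)
    latch = None          # hypothetical pipeline register between the ALUs
    results = []
    cycles = 0
    while pending or latch is not None:
        cycles += 1
        # ALU2 finishes the instruction that ALU1 started last cycle.
        if latch is not None:
            inner, op2, c = latch
            results.append(OPS[op2](inner, c))
        # In the same cycle, ALU1 starts the next instruction.
        if pending:
            op1, a, b, op2, c = pending.pop(0)
            latch = (OPS[op1](a, b), op2, c)
        else:
            latch = None
    return results, cycles

program = [("+", 1, 2, "-", 3), ("&", 6, 3, "|", 8), ("-", 10, 4, "+", 1)]
print(run_pipelined(program))   # ([0, 10, 7], 4)
```

Three two-operation instructions finish in four cycles, not six: each cycle
does one ALU operation's worth of work per ALU, so the cycle time is set by
one ALU delay rather than two.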

>A design in which twice as many instructions are executed, but successive
>instructions are independent, clearly cannot perform worse, and will perform
>better if several execution units can be run in parallel.  One main lesson
>of prior designs - achieve higher parallelism by uncoupling mutually dependent
>operations - has simply been ignored.

I don't think Wulf is ignoring anything.  The complexity of an
n-instruction-at-a-time machine (n > 1) is much greater than that of a WM
machine.  I don't know how much greater, but it is greater.  Maybe
significantly.

>The WM sketch intends the first half of each instruction to be overlapped with
>the second half of the preceding instruction, and there is a dependency
>rule about that.  But the same apparatus is all that is needed to execute
>independent instructions in parallel, so the implementation complexity of
>WM is no less, but its performance is impeded by the extra dependency
>within each instruction.

The implementation complexity of WM is A LOT LESS than that of a machine
that tries to execute two instructions at the same time (two fewer register
file ports, no "scoreboarding" needed, etc.).  At least *I* would much rather
implement a WM than a two-instruction-at-a-time conventional RISC.  However,
the performance of a two-instruction-at-a-time conventional RISC might be
greater, and that approach generalizes (but then, just implement a
two-instruction-at-a-time version of the WM...).
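
The register file port count can be checked with simple arithmetic.  The
sketch below assumes a conventional RISC instruction reads two registers and
writes one, and that a WM instruction reads three and writes one; the helper
name ports_needed is made up for illustration.

```python
def ports_needed(instrs_per_cycle, reads_per_instr, writes_per_instr):
    """Register file ports needed to sustain the given issue rate."""
    return instrs_per_cycle * (reads_per_instr + writes_per_instr)

# Assumed operand counts, not taken from any particular implementation.
wm_ports = ports_needed(1, reads_per_instr=3, writes_per_instr=1)
dual_ports = ports_needed(2, reads_per_instr=2, writes_per_instr=1)

print(wm_ports, dual_ports, dual_ports - wm_ports)   # 4 6 2
```

Under those assumptions the two-instruction-at-a-time machine needs six
ports to the WM's four.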

>Moreover, there are several RISC machines where a comparison,
>test, and jump decision can be done in one cycle.  Therefore, no purpose
>is served by separating the comparison from the jump decision, since
>there is no latency to be overlapped.  And the branch-delay optimisation
>issue is equally present, under a different guise, in the WM machine, as
>its author admits.

Those RISC machines that implement compare/branch in one instruction are
fudging a bit (unless the compare is very simple, like == 0, == reg, != 0,
!= reg).  The purpose served by separating the compare from the branch is
quite important!  Some call this branch spreading:  it allows the branch
condition to be evaluated early, so that branch latency is reduced to zero.
This is my one big complaint with Wulf's paper:  he didn't reference the
previous branch spreading work done on the CRISP and RIDGE machines.
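
A rough way to see the effect of branch spreading is to count stall cycles
as a function of how far the compare sits ahead of the branch.  This is a
simplified model of my own, not a description of CRISP, RIDGE, or WM: it
assumes one instruction issues per cycle and that a compare's result becomes
usable RESOLVE_LATENCY cycles after it issues.  The latency value is
illustrative only.

```python
def branch_stall(distance, resolve_latency):
    """Stall cycles seen by a branch issued `distance` instructions
    after the compare that sets its condition."""
    return max(0, resolve_latency - distance)

RESOLVE_LATENCY = 2   # assumed, for illustration

for distance in range(4):
    print(distance, branch_stall(distance, RESOLVE_LATENCY))
```

In this model, distance 0 corresponds to a fused compare-and-branch and pays
the full latency, while a compare moved far enough ahead of the branch pays
nothing.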