Path: utzoo!mnetor!uunet!husc6!think!bloom-beacon!gatech!purdue!umd5!uvaarpa!virginia!uvacs!wulf
From: wulf@uvacs.CS.VIRGINIA.EDU (Bill Wulf)
Newsgroups: comp.arch
Subject: Re: The WM Machine
Message-ID: <2389@uvacs.CS.VIRGINIA.EDU>
Date: 6 May 88 17:54:19 GMT
References: <5339@aw.sei.cmu.edu> <9331@apple.Apple.Com>
Reply-To: wulf@uvacs.cs.virginia.edu.UUCP (Bill Wulf)
Organization: U.Va. CS dept.  Charlottesville, VA
Lines: 70

[..... Aaaaarrrrrgggghhhhh. The perversity of inanimate objects!
This is the message that should have gotten sent BEFORE the last one.
Grumble, mutter, sputter, fume .... ]

I am new to this ... I only began reading this newsgroup a few
days ago when one of the folks here told me that some articles 
had appeared on WM. Alas, I haven't seen them all (like Firth's),
so forgive me for jumping into the middle of this without full
information and any misunderstandings I may have as a consequence.

Re Baum's comments about streaming state and context swaps:  There
are indeed a number of instructions in the full WM defn for saving/
restoring state, including that of the FIFOs -- though, perhaps not
so many as he fears. As should be obvious, I made a conscious decision
to increase the amount of state (witness 64 registers, 32 integer and 32
floating point) to improve performance at the expense of context-swapping.
I am not especially concerned about the number of instructions this
involves (it's only a handful in any case), but the size of the state
is a legit concern.

For what it's worth, the rationale was as follows:  Processors are getting
ever faster. Interrupt rates have gotten higher too, but not as much so
since they are governed more by the external world -- and people don't
type faster, disks don't rotate especially faster, etc. I think this trend
will continue, so in planning processors for the future, it's reasonable
to consider increasing the amount of state if that speeds up the processor.
One can't be silly about it, like blowing up the state by orders of magnitude,
since interrupt rates and the response-time demands of some interrupts
are increasing, but a modest increase is a reasonable thing to consider. I did,
and opted for (some) more state.

I don't want to try to describe all the architectural features of WM related
to OS issues (of which state save/restore is just one) here -- it would just
take too much space. <<insert previous article here>>

Re Case's reply to Firth (which, again, I haven't seen): Right on.

The two ops/instruction add nothing to the cycle time. But note that they DO
add to the latency. Considering the simple 3-stage pipe model of RISCs (decode,
op, write-back), WM has 4 stages (decode, op, op, write-back) -- so a result
takes 4/3 as long to become available. Whether this wins or loses depends on
whether you've done one or two operations in the instruction. If 1, and you
couldn't dispatch an unrelated operation on the next cycle, you lose by about
a third (4/3 rather than 3/3). If you use 2, you win by 50% (4/3 rather than
6/3). The data says you win!
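
The arithmetic above can be checked with a few lines of Python. This is only
a sketch of the simple stage-counting model in the paragraph, assuming the
two operations are dependent, so the 3-stage machine must run them as two
back-to-back instructions; the names RISC_STAGES, WM_STAGES and
relative_latency are made up for illustration.

```python
from fractions import Fraction

# Stage counts from the simple pipeline model in the text.
RISC_STAGES = 3  # decode, op, write-back
WM_STAGES = 4    # decode, op, op, write-back


def relative_latency(ops_used):
    """Latency of one WM instruction divided by the latency of the
    3-stage machine doing the same work.

    Assumption (illustrative): the ops are dependent, so the 3-stage
    machine needs one full instruction latency per op.
    """
    wm_time = Fraction(WM_STAGES)
    risc_time = Fraction(ops_used * RISC_STAGES)
    return wm_time / risc_time


for ops in (1, 2):
    ratio = relative_latency(ops)
    print(f"{ops} op(s): WM takes {ratio} of the 3-stage time")
```

A ratio above 1 means WM is slower for that instruction; a ratio below 1
means it is faster. Inverting the 2-op ratio gives the 50% figure.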

As I said in the CAN article, I was very surprised by the frequency of
use of 2 ops/cycle. When I first started this, I was really only exploring
a number of ways to make all instructions exactly 32 bits. The 2ops/instr
was just one idea, and I expected it to be a loser -- so I built a quick
and dirty compiler that generated this format and was blown away by the
results. (Aside -- After getting the preliminary results, I tried 3, 4,...,
ops/instr (ignoring the fact that I couldn't encode them in 32 bits) and
found that 2 is magic; you get very little benefit from more than 2).

A (perhaps obvious) addition to Case's reply on add-comp-br vs. the way
WM does things: Since the IFU is decoupled from the arithmetic units, the
cond branches can be executed early (assuming the CC is available), which
means that you can start filling the instruction buffer(s) early too,
thus eliminating instruction-fetch latency. You can statistically get a similar
effect with branch-prediction, but the WM scheme works independent of the
way the branch is taken -- so, specifically, it works for the if-then-else case
that is the bugaboo of branch-prediction schemes.
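
To make the comparison with branch prediction concrete, here is a toy model
in Python. It is not taken from the WM definition: the cost formulas, the
function names, and parameters such as refill_cycles and lead_cycles are
assumptions chosen only to illustrate the argument. The prediction model
charges a full buffer refill on every wrong guess; the early-resolution
model stalls only when the condition code arrives too late to hide the refill.

```python
def expected_stall_with_prediction(accuracy, refill_cycles):
    """Average stall per branch, assuming (hypothetically) that a wrong
    guess costs one full instruction-buffer refill."""
    return (1.0 - accuracy) * refill_cycles


def stall_with_early_resolution(lead_cycles, refill_cycles):
    """Stall per branch when the branch is resolved lead_cycles before
    the arithmetic units need the target instructions.

    The direction of the branch does not appear here at all; only the
    availability of the condition code (the lead) matters.
    """
    return max(0, refill_cycles - lead_cycles)


# An if-then-else that goes either way half the time (illustrative numbers).
print(expected_stall_with_prediction(accuracy=0.5, refill_cycles=4))
print(stall_with_early_resolution(lead_cycles=4, refill_cycles=4))
print(stall_with_early_resolution(lead_cycles=1, refill_cycles=4))
```

The last call shows the caveat in the text: if the CC is not available
early enough, the decoupled scheme stalls too.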

General caveat -- at the time I wrote the WM article, I hadn't seen any designs
that handle conditionals like WM (CRISP gets some of the same effect, but with
an entirely different, and I think more expensive, mechanism). Since then, the
ZS-1 design was pointed out to me, which has a very similar mechanism, has
a FIFO interface to memory, and decouples the int & fp units in a similar way.
Some ideas just seem to have a time to be ripe.