Path: utzoo!mnetor!uunet!husc6!think!bloom-beacon!gatech!purdue!umd5!uvaarpa!virginia!uvacs!wulf
From: wulf@uvacs.CS.VIRGINIA.EDU (Bill Wulf)
Newsgroups: comp.arch
Subject: Re: The WM Machine
Message-ID: <2389@uvacs.CS.VIRGINIA.EDU>
Date: 6 May 88 17:54:19 GMT
References: <5339@aw.sei.cmu.edu> <9331@apple.Apple.Com>
Reply-To: wulf@uvacs.cs.virginia.edu.UUCP (Bill Wulf)
Organization: U.Va. CS dept. Charlottesville, VA
Lines: 70

[..... Aaaaarrrrrgggghhhhh. The perversity of inanimate objects! This is
the message that should have gotten sent BEFORE the last one. Grumble,
mutter, sputter, fume .... ]

I am new to this ... I only began reading this newsgroup a few days ago
when one of the folks here told me that some articles had appeared on
WM. Alas, I haven't seen them all (like Firth's), so forgive me for
jumping into the middle of this without full information, and for any
misunderstandings I may have as a consequence.

Re Baum's comments about streaming state and context swaps:

There are indeed a number of instructions in the full WM defn for
saving/restoring state, including that of the FIFOs -- though perhaps
not so many as he fears. As should be obvious, I made a conscious
decision to increase the amount of state (witness 64 registers, 32
integer and 32 floating point) to improve performance at the expense of
context-swapping. I am not especially concerned about the number of
instructions this involves (it's only a handful in any case), but the
size of the state is a legit concern.

For what it's worth, the rationale was as follows: Processors are
getting ever faster. Interrupt rates have gotten higher too, but not as
much so, since they are governed more by the external world -- and
people don't type faster, disks don't rotate especially faster, etc. I
think this trend will continue, so in planning processors for the
future, it's reasonable to consider increasing the amount of state if
that speeds up the processor.
One can't be silly about it, like blowing up the state by orders of
magnitude, since interrupt rates and the response-time demands of some
of the interrupts are increasing, but a modest increase is a reasonable
thing to consider. I did, and opted for (some) more state.

I don't want to try to describe all the architectural features of WM
related to OS issues (of which state save/restore is just one) here --
it would just take too much space.

Re Case's reply to Firth (which, again, I haven't seen):

Right on. The two ops/instruction add nothing to the cycle time. But
note that they DO add to the latency. Consider the simple 3-stage pipe
model of RISCs (decode, op, write-back): WM has 4 stages (decode, op,
op, write-back) -- so a result takes 4/3 as long to become available.
Whether this wins or loses depends on whether you've done one or two
operations in the instruction. If one, and you couldn't dispatch an
unrelated operation on the next cycle, you lose by about 30% (4/3
rather than 3/3). If you use two, you win by 50% (4/3 rather than 6/3).
The data says you win!

As I said in the CAN article, I was very surprised by the frequency of
use of 2 ops/cycle. When I first started this, I was really only
exploring a number of ways to make all instructions exactly 32 bits.
The 2 ops/instr format was just one idea, and I expected it to be a
loser -- so I built a quick and dirty compiler that generated this
format and was blown away by the results. (Aside -- after getting the
preliminary results, I tried 3, 4, ... ops/instr (ignoring the fact
that I couldn't encode them in 32 bits) and found that 2 is magic; you
get very little benefit from more than 2.)

A further (perhaps obvious) addition to Case's reply on add-comp-br vs.
the way WM does things: Since the IFU is decoupled from the arithmetic
units, the cond branches can be executed early (assuming the CC is
available). This means that you can start filling the instruction
buffer(s) early too, thus eliminating instruction-fetch latency.
You can statistically get a similar effect with branch-prediction, but
the WM scheme works independently of which way the branch is taken --
so, specifically, it works for the if-then-else case that is the
bugaboo of branch-prediction schemes.

General caveat -- at the time I wrote the WM article, I hadn't seen any
designs that handle conditionals like WM (CRISP gets some of the same
effect, but with an entirely different, and I think more expensive,
mechanism). Since then, the ZS-1 design was pointed out to me, which
has a very similar mechanism, has a FIFO interface to memory, and
decouples the int & fp units in a similar way. Some ideas just seem to
have a time to be ripe.
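
For anyone who wants to check the latency arithmetic in the reply to
Case above, here is a rough sketch. It is a deliberately simplified
model, not part of the WM definition: it assumes a chain of strictly
dependent operations, one time unit per pipe stage, and no overlap
between successive instructions. The function names are made up for
illustration.

```python
from fractions import Fraction

RISC_STAGES = 3  # decode, op, write-back
WM_STAGES = 4    # decode, op, op, write-back


def risc_latency(dependent_ops):
    """Stage-times for a chain of dependent ops, one op per instruction.

    Assumption: each instruction waits for the previous result, so the
    instructions do not overlap in the pipe.
    """
    return RISC_STAGES * dependent_ops


def wm_latency(dependent_ops):
    """Same chain on the 4-stage pipe, packing up to two ops per instruction."""
    instructions = (dependent_ops + 1) // 2  # round up to whole instructions
    return WM_STAGES * instructions


def wm_relative_to_risc(dependent_ops):
    """WM latency as an exact fraction of the 3-stage RISC latency."""
    return Fraction(wm_latency(dependent_ops), risc_latency(dependent_ops))


for ops in (1, 2):
    print(ops, "op(s): WM/RISC latency =", wm_relative_to_risc(ops))
```

With one op per instruction the ratio comes out to 4/3 (WM is slower by
a third); with two it is 4/6, i.e. the RISC chain takes 3/2 as long,
which is the "win by 50%" figure.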