Path: utzoo!utgpu!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!bloom-beacon!mit-eddie!uw-beaver!cornell!rochester!pt.cs.cmu.edu!k.gp.cs.cmu.edu!lindsay
From: lindsay@k.gp.cs.cmu.edu (Donald Lindsay)
Newsgroups: comp.arch
Subject: Re: VLIW
Message-ID: <3588@pt.cs.cmu.edu>
Date: 16 Nov 88 03:43:41 GMT
References: <70@armada.UUCP> <28200228@urbsdc> <5087@mit-vax.LCS.MIT.EDU> <556@m3.mfci.UUCP> <5097@mit-vax.LCS.MIT.EDU>
Organization: Carnegie-Mellon University, CS/RI
Lines: 43

In article <5097@mit-vax.LCS.MIT.EDU> spectre@mit-vax.UUCP (Joseph D. Morrison) writes:
>Micro-dataflow is an interesting pipeline management mechanism that
>was used in the IBM 360/91 computer.

I think that this is more commonly known as Tomasulo instruction scheduling. There was a study, a few years ago, showing that a Cray-1 would have had higher throughput if it had used this method. This system is essentially the high-price/high-win version of a scoreboard.

Many modern systems have chosen to go with compile-time scheduling, some retaining a few hardware interlocks, some not. The argument is actually deeper than just fancy compilers versus fancy (or self-reliant) hardware. There are two basic issues.

The first issue is branches. They happen very often, and the hardware solutions don't mind. The innovation that made VLIW possible was a compiler innovation (trace scheduling) for scheduling in the presence of branches. It works well on certain kinds of code; only Multiflow has much understanding of how well it works on the rest.

The second issue is cycle counts and synchronization. It used to be common for instructions to take a data-dependent number of clocks. For example, a multiply by a small number would run faster than a multiply by a big number. Also, there were machines with asynchronous units: they were done when they were done, and that was that. (The latest buzzword is "self-timed circuits", but they aren't necessarily like that.)
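To make "a data-dependent number of clocks" concrete, here is a minimal Python sketch of one plausible mechanism: an early-terminating shift-and-add multiplier. The post doesn't say which machines or algorithms it has in mind, so the function name `shift_add_multiply` and the one-cycle-per-multiplier-bit cost model are assumptions for illustration only.

```python
def shift_add_multiply(a: int, b: int) -> tuple[int, int]:
    """Multiply non-negative a by b with an early-exit shift-and-add loop.

    Hypothetical cost model: one "cycle" per multiplier bit examined.
    The loop stops as soon as the remaining bits of b are all zero,
    so a small multiplier finishes in fewer cycles than a big one.
    """
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative operands only")
    product, cycles = 0, 0
    while b:
        if b & 1:          # low multiplier bit set: add the shifted multiplicand
            product += a
        a <<= 1            # shift multiplicand left for the next bit position
        b >>= 1            # consume one multiplier bit
        cycles += 1
    return product, cycles


for multiplier in (3, 200, 65535):
    p, c = shift_add_multiply(1234, multiplier)
    print(f"1234 * {multiplier} = {p} in {c} cycles")
```

Under this model the cycle count equals the multiplier's bit length: a multiplier of 3 exits after 2 cycles, while 65535 takes 16. A compiler scheduling code around such a unit cannot know, in general, which case it will get.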
All in all, the hardware solutions coped fine with all this. The compilers give up and rely on fond hopes.

There are several reasons that data-dependent instruction timing has fallen into disfavor. For one, hardware interlocks only look ahead so far, and are rarely as clever as the Tomasulo scheme, so the compilers were generating code that interlocked a lot. By making the machines more predictable, we've made it possible for compilers to compare possible overlap sequences and compute, at compile time, which will run faster.

That still leaves conditional branches. The approach of HEP was straightforward enough: run someone else (another process) as a crack-stuffer, issuing its instructions into the cycles the branching process can't use. I wonder what the follow-on will look like.

-- 
Don	lindsay@k.gp.cs.cmu.edu	CMU Computer Science
--
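The point about comparing overlap sequences at compile time can be illustrated with a small Python sketch. Everything in it is an assumption for illustration: the opcode latencies in `LATENCY`, the toy basic block, the single-issue in-order pipeline with interlocks, and the brute-force search in the hypothetical `best_schedule` (real compilers use heuristics such as list scheduling rather than trying every permutation). The whole approach depends on the latencies being fixed constants; with data-dependent multiply timing, `LATENCY["mul"]` would not be knowable.

```python
from itertools import permutations

# Hypothetical fixed latencies, in clocks, for a "predictable" machine.
LATENCY = {"load": 3, "mul": 4, "add": 1}

# A tiny basic block in source order: (result name, opcode, source names).
BLOCK = [
    ("a", "load", []),
    ("c", "mul", ["a"]),
    ("b", "load", []),
    ("d", "add", ["b"]),
    ("e", "add", ["c", "d"]),
]


def cycles_for(order):
    """Clocks to finish `order` on a single-issue, in-order pipeline with
    interlocks; returns None if a consumer precedes its producer."""
    ready = {}       # result name -> clock at which its value is available
    next_issue = 0   # earliest clock the next instruction may issue
    finish = 0
    for name, op, srcs in order:
        if any(s not in ready for s in srcs):
            return None  # illegal order: a source has not been produced yet
        # Interlock: stall until every source operand is available.
        issue = max([next_issue] + [ready[s] for s in srcs])
        ready[name] = issue + LATENCY[op]
        finish = max(finish, ready[name])
        next_issue = issue + 1
    return finish


def best_schedule(block):
    """Try every legal ordering (feasible only for tiny blocks); keep the fastest."""
    best = None
    for order in permutations(block):
        cycles = cycles_for(order)
        if cycles is not None and (best is None or cycles < best[0]):
            best = (cycles, [name for name, _, _ in order])
    return best


print("source order:", cycles_for(BLOCK), "clocks")
print("best order:  ", best_schedule(BLOCK))
```

With these numbers the source order takes 9 clocks, and the order a, b, c, d, e takes 8, because the second load issues while the first is still in flight instead of sitting behind the stalled multiply.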