Path: utzoo!utgpu!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!bloom-beacon!mit-eddie!uw-beaver!cornell!rochester!pt.cs.cmu.edu!k.gp.cs.cmu.edu!lindsay
From: lindsay@k.gp.cs.cmu.edu (Donald Lindsay)
Newsgroups: comp.arch
Subject: Re: VLIW
Message-ID: <3588@pt.cs.cmu.edu>
Date: 16 Nov 88 03:43:41 GMT
References: <70@armada.UUCP> <28200228@urbsdc> <5087@mit-vax.LCS.MIT.EDU> <556@m3.mfci.UUCP> <5097@mit-vax.LCS.MIT.EDU>
Organization: Carnegie-Mellon University, CS/RI
Lines: 43

In article <5097@mit-vax.LCS.MIT.EDU> spectre@mit-vax.UUCP (Joseph D. Morrison) writes:
>Micro-dataflow is an interesting pipeline management mechanism that
>was used in the IBM 360/91 computer.

I think that this is more commonly known as Tomasulo instruction
scheduling.  There was a study, a few years ago, showing that a Cray-1
would have had higher throughput if it had used this method.

This system is essentially the high-price/high-win version of a
scoreboard. Many modern systems have chosen to go with compile-time
scheduling, some retaining a few hardware interlocks, some not.

The argument is actually deeper than just fancy compilers versus fancy
(or self-reliant) hardware. There are two basic issues.

The first issue is branches. They happen very often, and the hardware
solutions don't mind. The innovation that made VLIW possible was a
compiler innovation for scheduling in the presence of branches.  It works
well in certain kinds of code: only Multiflow has much understanding
of how well it works on the rest of the code.

The second issue is cycle counts and synchronization. It used to be
common for instructions to take a data-dependent number of clocks.  For
example, a multiply by a small number would run faster than a multiply by
a big number. Also, there were machines with asynchronous units: they
were done when they were done, and that was that. (The latest buzzword is
"self-timed circuits", but they aren't necessarily like that.)
All in all, the hardware solutions coped fine with all this. Compilers,
faced with the same thing, give up and rely on fond hopes.
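
To make the multiply example concrete, here is a minimal sketch of a
shift-and-add multiplier that stops as soon as the remaining multiplier
bits are all zero. It is not modeled on any particular machine; the
function name and the "one loop pass = one clock" accounting are
assumptions made for illustration.

```python
def shift_add_multiply(a, b):
    """Multiply two non-negative ints; return (product, cycles).

    Assumption: each pass through the loop costs one clock, and the
    unit terminates early once no multiplier bits remain.
    """
    product = 0
    cycles = 0
    while b:
        if b & 1:          # low multiplier bit set: add the shifted multiplicand
            product += a
        a <<= 1            # next bit position
        b >>= 1
        cycles += 1
    return product, cycles
```

With this model, multiplying by 3 takes 2 clocks and multiplying by 1000
takes 10, so a compiler cannot know the latency without knowing the data.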

There are several reasons that data-dependent instruction timing has
fallen into disfavor. For one, hardware interlocks look ahead only so far,
and are rarely as clever as the Tomasulo scheme. So, the compilers were
generating code that interlocked a lot. By making the machines more
predictable, we've made it possible for compilers to compare possible
overlap sequences, and compute - at compile time - which will run faster.
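
A toy sketch of that compile-time comparison follows. It assumes a
hypothetical single-issue, in-order machine with fixed latencies and
hardware interlocks, and a four-instruction program in which every
register is written exactly once (so only true dependences matter). The
instruction tuples, latencies, and function names are all made up for
illustration; a real scheduler would not enumerate permutations.

```python
from itertools import permutations

# (text, destination, sources, latency in clocks) -- illustrative values only
PROGRAM = [
    ("load r1",      "r1", (),      3),
    ("add r2=r1+r1", "r2", ("r1",), 1),
    ("load r3",      "r3", (),      3),
    ("add r4=r3+r3", "r4", ("r3",), 1),
]

def legal(order):
    """An ordering is legal if every source is produced before it is used."""
    produced = set()
    for _, dest, srcs, _ in order:
        if any(s not in produced for s in srcs):
            return False
        produced.add(dest)
    return True

def cycles(order):
    """Clock at which the last result is ready, for in-order single issue."""
    ready = {}      # register -> clock at which its value is available
    clock = 0       # earliest clock at which the next instruction may issue
    finish = 0
    for _, dest, srcs, latency in order:
        start = max([clock] + [ready[s] for s in srcs])  # interlock stall
        ready[dest] = start + latency
        finish = max(finish, ready[dest])
        clock = start + 1
    return finish

# Because latencies are fixed, the comparison can be done before running anything.
best = min((p for p in permutations(PROGRAM) if legal(p)), key=cycles)
```

In this model the original order finishes at clock 8, while hoisting the
second load above the first add finishes at clock 5. If the load latency
depended on the data, neither number could be computed ahead of time.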

That still leaves conditional branches. The approach of HEP was
straightforward enough: run someone else as a crack-stuffer. I wonder
what the follow-on will look like.
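
For readers who haven't met the crack-stuffing idea, here is a rough
sketch of it, under stated assumptions: each thread may have only one
instruction in the pipeline at a time, and the issue slot goes round-robin
to whichever thread is ready. The function name, the pipeline depth, and
the cycle count are illustrative parameters, not HEP specifications.

```python
def utilization(num_threads, pipeline_depth, total_cycles=1000):
    """Fraction of clocks in which some thread issues an instruction.

    Assumption: a thread that issues at clock c cannot issue again until
    clock c + pipeline_depth, so branches never need to be predicted.
    """
    next_ready = [0] * num_threads   # clock at which each thread may issue again
    issued = 0
    turn = 0                         # round-robin starting point
    for cycle in range(total_cycles):
        for offset in range(num_threads):
            t = (turn + offset) % num_threads
            if next_ready[t] <= cycle:
                next_ready[t] = cycle + pipeline_depth
                issued += 1
                turn = (t + 1) % num_threads
                break                # at most one issue per clock
    return issued / total_cycles
```

With a depth of 8, one thread keeps the pipe 12.5% busy, four threads 50%,
and eight threads fill every slot. The cost is that a single thread never
runs faster than one instruction per pipeline traversal.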

-- 
Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science
--