Xref: utzoo comp.misc:2089 comp.arch:3905
Path: utzoo!mnetor!uunet!husc6!mailrus!nrl-cmf!ames!sgi!bron
From: bron@olympus.SGI.COM (Bron C. Nelson)
Newsgroups: comp.misc,comp.arch
Subject: Re: Instruction Scheduling
Message-ID: <12678@sgi.SGI.COM>
Date: 12 Mar 88 02:34:19 GMT
References: <12513@sgi.SGI.COM> <12560@sgi.SGI.COM>
Sender: daemon@sgi.SGI.COM
Organization: Silicon Graphics Inc, Mountain View, CA
Lines: 78
Keywords: optimization   pipeline constraints  code re-organization
Summary: Questions for existing compilers/architectures


What you need to know in order to do decent instruction scheduling
is the idea of "dependence;" you need to know which instructions
MUST preceed which other instructions, and which can be done
independently.  For ALU ops, the dependencies can be seen in the
register usage.  The real problem is loads/stores.  Without
some kind of analysis, you cannot move a load past a store, and
you cannot move a store past a load OR a store.  This can severely
inhibit the amount of re-ordering that can be done, particularly
if you only schedule a basic block at a time.

The question I want the answer to is: how much and what kind of
analysis is actually done by the various compilers Out There?

To do a really good job, you need the kind of data dependence
analysis that a vectorizing compiler does.  Cray Research, Multiflow,
and Ardent Computer are the only firms I know of for sure that generate
and this kind of information and then use it when doing instruction
scheduling.  (I'd bet that IBM, CDC, NEC, etc. do it too, but I don't
know for sure).  One problem is that data dependence analysis is done
at a fairly high level, while instruction scheduling is a very low
level optimization.  Transmitting the information through the compiler
can be hard.  Question:  How detailed is the information passed
to the instruction scheduler?  Anyone at Cray/Multiflow/Ardent etc.
care to say?

A lot of things that occur in practice can be seen to be independent
without resorting to full data dependence analysis (disjoint, global
arrays, scalars from the stack, etc.)  How much of this kind of analysis
is done by various compilers?  What sort of disambiguation has been
found most useful for scheduling?  Anyone at MIPSco, AMD, GE, Mot.,
Sun, etc.  care to enlighten us?

Studies done several years ago (sorry, don't have the references, but
I believe they are cited in Ellis's "Bulldog" thesis bibliography)
showed that in typical Fortran programs, a basic block only has about
a factor of 2 of parallelism in the operations.  For cpu's with only a
few 2 cycle instructions (e.g. load/branch like the original Berkeley
RISC and Stanford MIPS machines) this is probably enough; you could
typically find enough "local" parallelism to fill the slots most of
the time.  If many of your instructions take multiple cycles (e.g.
a Cray X-MP where even doing an address-length (24bit) integer add
takes 2 cycles, and a load takes 14), you need to do more, or you'll
wait a lot.  For example, a measurement done at LLNL in July of
1985 showed that nearly **60%** of the cycles on their Cray X-MP48
were spent interlocked waiting for results of previous computations.
(This does NOT count time spent waiting after a branch, which is
probably about another 5-10%.)  Clearly, sceduling only within a
basic block was not enough; the recent Cray compilers will push
some operations (i.e. loads) across block boundries.  VLIW machines
have the same sort of problem; the machine is capable of issuing
instructions faster than the (data dependent) computations can deliver
results.  VLIW machines also break block boundries to find more
instructions to issue concurrently.

The whole point of the last paragraph is to say that the amount of
worry you put into this optimization is very dependent on the payoff
your particular hardware can get out of it (no surprise), and that
the payoff varies dramatically from machine to machine.  SO, the
question (finally!) is:  for YOUR particular architecture, how much
time is spent interlocked (or executing NO-OPs for those machines
without interlocks)?  Aggregate numbers and integer only numbers are
fine, but particularly interesting would be numbers involving multi-
cycle instructions (e.g. floating point).

An aside: do the current "no hardware interlocks" cpus REALLY have
no interlocks, or are they just talking about the integer ALU ops?
Since I know something about the MIPSco chip(s), I'll use that as
an example: when you do an integer divide, do you really put in 35
no-ops, or is this "special cased"?  Does the f.p. co-processor
have interlocks?

This is already too long; I'll beat the subject of memory latency to
death sometime in the future.

---------------------------------------------------------------------
Bron Nelson  bron@sgi.com
Don't blame my employers for my opinions