Xref: utzoo comp.misc:2094 comp.arch:3909
Path: utzoo!mnetor!uunet!hsi!mfci!root
From: root@mfci.UUCP (SuperUser)
Newsgroups: comp.misc,comp.arch
Subject: Re: Instruction Scheduling
Message-ID: <301@m2.mfci.UUCP>
Date: 12 Mar 88 16:40:36 GMT
References: <12513@sgi.SGI.COM> <12560@sgi.SGI.COM> <12678@sgi.SGI.COM>
Reply-To: colwell@multiflow.UUCP (Robert Colwell)
Followup-To: comp.arch
Organization: Multiflow Computer Inc., Branford Ct. 06405
Lines: 65
Keywords: optimization   pipeline constraints  code re-organization

In article <12678@sgi.SGI.COM> bron@olympus.SGI.COM (Bron C. Nelson) writes:
]Question:  How detailed is the information passed
]to the instruction scheduler?  Anyone at Cray/Multiflow/Ardent etc.
]care to say?
I'll defer to one of our compiler people to answer this one (left this
in so you wouldn't think I was avoiding it, even though I am...)
]
].... Clearly, scheduling only within a
]basic block was not enough; the recent Cray compilers will push
]some operations (i.e. loads) across block boundaries.  VLIW machines
]have the same sort of problem; the machine is capable of issuing
]instructions faster than the (data dependent) computations can deliver
]results.  VLIW machines also break block boundaries to find more
]instructions to issue concurrently.
A minor nit:  it's the compiler, in our case a Trace Scheduling (tm)
compacting compiler (hope that keeps our legal guys happy) that breaks
the block boundaries;  the VLIW part makes it worthwhile.  And we're
usually careful to distinguish between instructions, which are the
aggregate of bits that all come flying out of the icache at once,
and packets, which are the individual control fields for the various
functional units.  I usually look at the machine as a parallel
collection of pipelines, each of a different length, most of which
(especially the floating pipes) have a latency of at least one
instruction.
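
To make the instruction/packet distinction concrete, here is a minimal
sketch in Python.  It is only an illustration under stated assumptions:
the unit names in UNITS, the Instruction class and nop_fraction() are
hypothetical, and none of it reflects the real instruction format.  One
instruction carries at most one packet per functional unit, and a unit
with no packet in a given instruction idles.

```python
from dataclasses import dataclass, field

# Hypothetical functional units; not the layout of any real machine.
UNITS = ("ialu0", "ialu1", "fadd", "fmul", "mem")


@dataclass
class Instruction:
    """One wide instruction: at most one packet per functional unit."""
    packets: dict = field(default_factory=dict)  # unit name -> operation text

    def packet(self, unit):
        # A unit with no packet in this instruction idles (a NOP).
        return self.packets.get(unit, "nop")


def nop_fraction(program):
    """Fraction of packet slots in the program that hold NOPs."""
    slots = len(program) * len(UNITS)
    used = sum(1 for ins in program for u in UNITS if u in ins.packets)
    return (slots - used) / slots if slots else 0.0


program = [
    Instruction({"mem": "load r1,(r2)", "ialu0": "add r3,r4,r5"}),
    Instruction({"fmul": "fmul f1,f2,f3"}),
    Instruction({}),  # nothing could be scheduled in this instruction
]
print(nop_fraction(program))  # 3 of 15 packet slots used -> 0.8
```

Counting idle packet slots this way is one possible reading of "time
spent executing NO-OPs"; counting wholly empty instructions is another,
and the two can give very different numbers for the same code.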
]
]The whole point of the last paragraph is to say that the amount of
]worry you put into this optimization is very dependent on the payoff
]your particular hardware can get out of it (no surprise), and that
]the payoff varies dramatically from machine to machine.  SO, the
]question (finally!) is:  for YOUR particular architecture, how much
]time is spent interlocked (or executing NO-OPs for those machines
]without interlocks)?  Aggregate numbers and integer only numbers are
]fine, but particularly interesting would be numbers involving multi-
]cycle instructions (e.g. floating point).
I know you want numbers here, but I think a few more points should
be included.  First, the answers will vary greatly with the code you're
discussing.  I think you know you have a balanced machine if a wide
range of code causes the machine to hit its limits in different
places.  For instance, one program may (in a hypothetical VLIW) cause
the register read ports to become the bottleneck, another may run out
of memory bandwidth, a third might require more floating point units,
a fourth may do a lot of non-pipelined operations (divide), and a
fifth might do chained memory operations that can't be done in
parallel (which means your cpu waits around for the memory latency a
lot -- this hypothetical VLIW has no data cache).  Depending on which
limit is the one currently blocking performance, your answer on the
NOPs will vary a lot.  Perhaps if you pick a benchmark and ask, we'll
be able to compare results better.
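
The "limits in different places" point can be sketched as a simple
resource bound.  This is a rough model, not a description of any real
scheduler: the CAPACITY table, the resource names and the demand
figures are all made up for illustration.  Each program is summarized
by how many uses of each resource it needs, and the resource that
forces the most instructions is the bottleneck.

```python
import math

# Hypothetical number of uses each resource can accept per instruction.
CAPACITY = {"reg_read": 8, "mem": 2, "fp": 2, "divide": 1}


def bottleneck(demand):
    """Return (limiting resource, lower bound on instruction count).

    demand maps a resource name to the total number of uses a program
    needs; resources the program never touches may be left out.
    """
    bounds = {r: math.ceil(demand.get(r, 0) / CAPACITY[r]) for r in CAPACITY}
    limiting = max(bounds, key=bounds.get)
    return limiting, bounds[limiting]


# Three made-up programs that limit in three different places.
prog_a = {"reg_read": 400, "mem": 60, "fp": 80}
prog_b = {"reg_read": 100, "mem": 90, "fp": 20}
prog_c = {"reg_read": 80, "mem": 10, "fp": 20, "divide": 36}

print(bottleneck(prog_a))  # ('reg_read', 50)
print(bottleneck(prog_b))  # ('mem', 45)
print(bottleneck(prog_c))  # ('divide', 36)
```

The bound ignores data dependences and latency entirely, so a real
schedule can only be longer; the sketch shows only why a single NOP
figure means little without naming the code it was measured on.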
]
]An aside: do the current "no hardware interlocks" cpus REALLY have
]no interlocks, or are they just talking about the integer ALU ops?
]Since I know something about the MIPSco chip(s), I'll use that as
]an example: when you do an integer divide, do you really put in 35
]no-ops, or is this "special cased"?  Does the f.p. co-processor
]have interlocks?
We have no hardware interlocks except for a bank stall resolver built
into the memory controllers.
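
Without interlocks it falls to the compiler to keep a dependent
operation away from its producer until the result is ready.  A minimal
sketch of that bookkeeping follows, for a single-issue stream with no
other work available to fill the gaps.  The LATENCY values, the opcode
names and pad_with_nops() are assumptions for illustration and are not
taken from any real machine or compiler.

```python
# Hypothetical latencies, in instructions, from issue to usable result.
LATENCY = {"load": 3, "fmul": 4, "add": 1}


def pad_with_nops(ops):
    """Insert NOPs so that no operation issues before its sources are ready.

    ops is a list of (opcode, dest, sources) in program order.  The
    position of an entry in the returned list is its issue cycle.
    """
    ready = {}  # register -> first cycle at which its value can be read
    out = []
    for opcode, dest, srcs in ops:
        earliest = max((ready.get(s, 0) for s in srcs), default=0)
        while len(out) < earliest:
            out.append("nop")  # nothing else to issue, so pad
        ready[dest] = len(out) + LATENCY[opcode]
        out.append(f"{opcode} {dest}")
    return out


ops = [
    ("load", "r1", ["r0"]),
    ("fmul", "r2", ["r1", "r1"]),
    ("add", "r3", ["r2"]),
]
schedule = pad_with_nops(ops)
print(schedule)
# ['load r1', 'nop', 'nop', 'fmul r2', 'nop', 'nop', 'nop', 'add r3']
print(schedule.count("nop"))  # 5
```

A compacting compiler would try to fill those slots with independent
operations rather than NOPs; the sketch covers only the worst case of a
fully dependent chain.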

]Bron Nelson  bron@sgi.com
I hope to have some reportable numbers on this stuff soon.

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090