Xref: utzoo comp.misc:2089 comp.arch:3905 Path: utzoo!mnetor!uunet!husc6!mailrus!nrl-cmf!ames!sgi!bron From: bron@olympus.SGI.COM (Bron C. Nelson) Newsgroups: comp.misc,comp.arch Subject: Re: Instruction Scheduling Message-ID: <12678@sgi.SGI.COM> Date: 12 Mar 88 02:34:19 GMT References: <12513@sgi.SGI.COM> <12560@sgi.SGI.COM> Sender: daemon@sgi.SGI.COM Organization: Silicon Graphics Inc, Mountain View, CA Lines: 78 Keywords: optimization pipeline constraints code re-organization Summary: Questions for existing compilers/architectures What you need to know in order to do decent instruction scheduling is the idea of "dependence;" you need to know which instructions MUST preceed which other instructions, and which can be done independently. For ALU ops, the dependencies can be seen in the register usage. The real problem is loads/stores. Without some kind of analysis, you cannot move a load past a store, and you cannot move a store past a load OR a store. This can severely inhibit the amount of re-ordering that can be done, particularly if you only schedule a basic block at a time. The question I want the answer to is: how much and what kind of analysis is actually done by the various compilers Out There? To do a really good job, you need the kind of data dependence analysis that a vectorizing compiler does. Cray Research, Multiflow, and Ardent Computer are the only firms I know of for sure that generate and this kind of information and then use it when doing instruction scheduling. (I'd bet that IBM, CDC, NEC, etc. do it too, but I don't know for sure). One problem is that data dependence analysis is done at a fairly high level, while instruction scheduling is a very low level optimization. Transmitting the information through the compiler can be hard. Question: How detailed is the information passed to the instruction scheduler? Anyone at Cray/Multiflow/Ardent etc. care to say? A lot of things that occur in practice can be seen to be independent without resorting to full data dependence analysis (disjoint, global arrays, scalars from the stack, etc.) How much of this kind of analysis is done by various compilers? What sort of disambiguation has been found most useful for scheduling? Anyone at MIPSco, AMD, GE, Mot., Sun, etc. care to enlighten us? Studies done several years ago (sorry, don't have the references, but I believe they are cited in Ellis's "Bulldog" thesis bibliography) showed that in typical Fortran programs, a basic block only has about a factor of 2 of parallelism in the operations. For cpu's with only a few 2 cycle instructions (e.g. load/branch like the original Berkeley RISC and Stanford MIPS machines) this is probably enough; you could typically find enough "local" parallelism to fill the slots most of the time. If many of your instructions take multiple cycles (e.g. a Cray X-MP where even doing an address-length (24bit) integer add takes 2 cycles, and a load takes 14), you need to do more, or you'll wait a lot. For example, a measurement done at LLNL in July of 1985 showed that nearly **60%** of the cycles on their Cray X-MP48 were spent interlocked waiting for results of previous computations. (This does NOT count time spent waiting after a branch, which is probably about another 5-10%.) Clearly, sceduling only within a basic block was not enough; the recent Cray compilers will push some operations (i.e. loads) across block boundries. VLIW machines have the same sort of problem; the machine is capable of issuing instructions faster than the (data dependent) computations can deliver results. VLIW machines also break block boundries to find more instructions to issue concurrently. The whole point of the last paragraph is to say that the amount of worry you put into this optimization is very dependent on the payoff your particular hardware can get out of it (no surprise), and that the payoff varies dramatically from machine to machine. SO, the question (finally!) is: for YOUR particular architecture, how much time is spent interlocked (or executing NO-OPs for those machines without interlocks)? Aggregate numbers and integer only numbers are fine, but particularly interesting would be numbers involving multi- cycle instructions (e.g. floating point). An aside: do the current "no hardware interlocks" cpus REALLY have no interlocks, or are they just talking about the integer ALU ops? Since I know something about the MIPSco chip(s), I'll use that as an example: when you do an integer divide, do you really put in 35 no-ops, or is this "special cased"? Does the f.p. co-processor have interlocks? This is already too long; I'll beat the subject of memory latency to death sometime in the future. --------------------------------------------------------------------- Bron Nelson bron@sgi.com Don't blame my employers for my opinions