Xref: utzoo comp.misc:2094 comp.arch:3909
Path: utzoo!mnetor!uunet!hsi!mfci!root
From: root@mfci.UUCP (SuperUser)
Newsgroups: comp.misc,comp.arch
Subject: Re: Instruction Scheduling
Message-ID: <301@m2.mfci.UUCP>
Date: 12 Mar 88 16:40:36 GMT
References: <12513@sgi.SGI.COM> <12560@sgi.SGI.COM> <12678@sgi.SGI.COM>
Reply-To: colwell@multiflow.UUCP (Robert Colwell)
Followup-To: comp.arch
Organization: Multiflow Computer Inc., Branford Ct. 06405
Lines: 65
Keywords: optimization pipeline constraints code re-organization

In article <12678@sgi.SGI.COM> bron@olympus.SGI.COM (Bron C. Nelson) writes:
]Question: How detailed is the information passed
]to the instruction scheduler?  Anyone at Cray/Multiflow/Ardent etc.
]care to say?

I'll defer to one of our compiler people to answer this one (left this
in so you wouldn't think I was avoiding it, even though I am...)

]
].... Clearly, scheduling only within a
]basic block was not enough; the recent Cray compilers will push
]some operations (i.e. loads) across block boundaries.  VLIW machines
]have the same sort of problem; the machine is capable of issuing
]instructions faster than the (data dependent) computations can deliver
]results.  VLIW machines also break block boundaries to find more
]instructions to issue concurrently.

A minor nit: it's the compiler, in our case a Trace Scheduling (tm)
compacting compiler (hope that keeps our legal guys happy), that breaks
the block boundaries; the VLIW part makes it worthwhile.  And we're
usually careful to distinguish between instructions, which are the
aggregate total of bits that all come flying out of the icache at
once, vs. packets, which are the individual control fields for the
various functional units.  I usually look at the machine as a parallel
collection of pipelines, each of different length, most of which
(especially the floating pipes) have a latency of at least one
instruction.
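
To make the instruction/packet distinction and the "parallel collection
of pipelines" picture concrete, here is a minimal Python sketch.  It is
an illustration only, not the Trace Scheduling compiler: the unit names,
the latencies, and the helpers schedule() and empty_instructions() are
all hypothetical, and the placement is a simple greedy one that looks
only at the data dependencies it is given.

```python
# Hypothetical functional units and result latencies, counted in instructions.
# These numbers are made up for illustration; they describe no real machine.
LATENCY = {"ialu": 1, "load": 3, "fadd": 4, "fmul": 5}


def schedule(ops):
    """Greedily place operations into wide instructions.

    ops is a list of (name, unit, deps) tuples in program order, where deps
    lists the names of earlier operations whose results this one consumes.
    Returns a list of instructions; each is a dict mapping unit -> op name,
    i.e. at most one packet per functional unit per instruction.
    """
    ready_at = {}        # op name -> first instruction that may use its result
    instructions = []    # instructions[t] holds the packets issued at time t
    for name, unit, deps in ops:
        # Earliest slot at which every input has come out of its pipeline.
        t = max((ready_at[d] for d in deps), default=0)
        while True:
            while len(instructions) <= t:
                instructions.append({})
            if unit not in instructions[t]:
                break
            t += 1       # that unit's packet is already taken; try the next one
        instructions[t][unit] = name
        ready_at[name] = t + LATENCY[unit]
    return instructions


def empty_instructions(instructions):
    """Count instructions with no packets at all (pure NOPs, no interlocks)."""
    return sum(1 for packets in instructions if not packets)
```

Feeding it a short dependent chain (two loads, a multiply of both, then
an add of the product) leaves most instructions empty, while independent
work can be slotted into packets that would otherwise go unused.  That
is the sense in which finding operations beyond the basic block pays off
on a wide machine.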

]
]The whole point of the last paragraph is to say that the amount of
]worry you put into this optimization is very dependent on the payoff
]your particular hardware can get out of it (no surprise), and that
]the payoff varies dramatically from machine to machine.  SO, the
]question (finally!) is: for YOUR particular architecture, how much
]time is spent interlocked (or executing NO-OPs for those machines
]without interlocks)?  Aggregate numbers and integer only numbers are
]fine, but particularly interesting would be numbers involving
]multi-cycle instructions (e.g. floating point).

I know you want numbers here, but I think a few more points should be
included.  First, the answers will vary greatly with the code you're
discussing.  I think you know you have a balanced machine if a wide
range of code causes the machine to limit in different places.  For
instance, one program may (in a hypothetical VLIW) cause the register
read ports to become the bottleneck, another may run out of memory
bandwidth, a third might require more floating point units, a fourth
may do a lot of non-pipelined operations (divide), and a fifth might do
chained memory operations that can't be done in parallel (which means
your cpu waits around for the memory latency a lot -- this hypothetical
VLIW has no data cache.)  Depending on which limit is the one currently
blocking performance, your answer on the NOPs will vary a lot.  Perhaps
if you pick a benchmark and ask we'll be able to compare results better.

]
]An aside: do the current "no hardware interlocks" cpus REALLY have
]no interlocks, or are they just talking about the integer ALU ops?
]Since I know something about the MIPSco chip(s), I'll use that as
]an example: when you do an integer divide, do you really put in 35
]no-ops, or is this "special cased"?  Does the f.p. co-processor
]have interlocks?

We have no hardware interlocks except for a bank stall resolver built
into the memory controllers.
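
The "balanced machine" point above, that different programs limit in
different places, can be sketched as a simple resource-bound
calculation.  This is a rough Python illustration under stated
assumptions: the resource names and per-instruction capacities in
CAPACITY are invented, and resource_bounds() and bottleneck() are
hypothetical helpers.  It ignores latency chains and non-pipelined
operations, so it gives only a lower bound on instruction count.

```python
# Hypothetical number of each resource usable per instruction.
# Invented for illustration; not the figures for any real machine.
CAPACITY = {"reg_read_ports": 8, "mem_refs": 2, "fp_ops": 2, "int_ops": 4}


def resource_bounds(demand, capacity=CAPACITY):
    """Minimum instruction count each resource alone would impose.

    demand maps resource name -> total uses needed by the program fragment.
    The bound is ceil(demand / capacity), written with integer arithmetic.
    """
    return {r: -(-demand.get(r, 0) // capacity[r]) for r in capacity}


def bottleneck(demand, capacity=CAPACITY):
    """Return (resource, bound) for the resource with the largest bound."""
    bounds = resource_bounds(demand, capacity)
    limit = max(bounds, key=bounds.get)
    return limit, bounds[limit]
```

Two fragments with different mixes of operations will report different
limiting resources, which is why a single aggregate NOP figure says
little without naming the benchmark.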

]Bron Nelson     bron@sgi.com

I hope to have some reportable numbers on this stuff soon.

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090