Xref: utzoo comp.misc:2090 comp.arch:3906
Path: utzoo!mnetor!uunet!husc6!hao!boulder!sunybcs!bingvaxu!leah!itsgw!imagine!pawl14.pawl.rpi.edu!jesup
From: jesup@pawl14.pawl.rpi.edu (Randell E. Jesup)
Newsgroups: comp.misc,comp.arch
Subject: Re: Instruction Scheduling
Message-ID: <519@imagine.PAWL.RPI.EDU>
Date: 12 Mar 88 07:48:40 GMT
References: <12513@sgi.SGI.COM> <12560@sgi.SGI.COM> <12678@sgi.SGI.COM>
Sender: news@imagine.PAWL.RPI.EDU
Reply-To: beowulf!lunge!jesup@steinmetz.UUCP
Organization: RPI Public Access Workstation Lab - Troy, NY
Lines: 69
Keywords: optimization pipeline constraints code re-organization

In article <12678@sgi.SGI.COM> bron@olympus.SGI.COM (Bron C. Nelson) writes:
> The real problem is loads/stores.  Without
>some kind of analysis, you cannot move a load past a store, and
>you cannot move a store past a load OR a store.  This can severely
>inhibit the amount of re-ordering that can be done, particularly
>if you only schedule a basic block at a time.

        Yup!  There are some simple cases, such as where you're
loading/storing with the same base register, and the register wasn't
changed in between (in particular, this helps local variable references).

        Maybe there should be assembler instructions (not real ones) that
contain some of this information that the compiler/programmer has found out.
Example: (addr is offset(base))

        uldw    reg,addr        - unaliased loadword (no references via
                                  other registers/absolute will occur)
        ustw    addr,reg        - unaliased storeword
        aldw    reg,addr,addr[,addr...]
                                - aliased loadword - references via later
                                  addr are aliases of the one being used.
        etc.

        This is all in all pretty kludgey.  I like this better:

        .alias  addr,addr,addr  - informs back end (assembler/reorganizer)
                that these addresses are aliased; otherwise it can assume
                it's safe to move loads/stores around (within usual
                bounds).  This will only be valid until any one of the
                address registers used is changed, after which the reorg
                must assume full aliasing again until told otherwise.
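
        A rough sketch of how a reorganizer might act on .alias, written
in Python rather than assembler.  Everything named here (Addr, MemOp,
AliasState, may_swap, and the 4-byte word size) is made up for illustration,
and the invalidation rule is only one reading of the description above, not
a definitive implementation.  Register dependences between the two ops are
assumed to be checked elsewhere.

```python
from dataclasses import dataclass, field
from typing import NamedTuple

WORD = 4  # assumed access size in bytes (hypothetical, not from the post)


class Addr(NamedTuple):
    offset: int
    base: str  # base register name, e.g. "r1"


class MemOp(NamedTuple):
    kind: str  # "load" or "store"
    addr: Addr


@dataclass
class AliasState:
    """What the reorganizer currently knows about aliasing."""
    informed: bool = False                      # False => assume full aliasing
    groups: list = field(default_factory=list)  # frozensets of aliased Addrs

    def alias(self, *addrs):
        """Handle one `.alias addr,addr,...` directive."""
        self.informed = True
        self.groups.append(frozenset(addrs))

    def register_written(self, reg):
        """Writing a base register named in any directive voids everything."""
        if any(a.base == reg for g in self.groups for a in g):
            self.informed = False
            self.groups.clear()

    def may_conflict(self, a, b):
        """True if accesses to a and b might touch the same word.

        Assumes no base register is written between the two accesses.
        """
        if a.base == b.base:
            # Same base register: word accesses overlap only if offsets are close.
            return abs(a.offset - b.offset) < WORD
        if not self.informed:
            return True  # no information, so be conservative
        return any(a in g and b in g for g in self.groups)


def may_swap(state, first, second):
    """Memory-ordering check only: may two adjacent memory ops trade places?"""
    if first.kind == "load" and second.kind == "load":
        return True  # two loads never conflict through memory
    return not state.may_conflict(first.addr, second.addr)
```

        With no directive seen, may_swap refuses to move a load past a
store through a different base register.  After state.alias(a, b), ops on a
and b stay ordered while an op on some unlisted address c may move past
either.  A later state.register_written("r2"), where r2 is the base of b,
puts the reorganizer back into the conservative mode.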

        The net effect is lots of .alias directives near the beginnings of
blocks.  But it does do the job (if the compiler can figure it out, that is!)

> Clearly, scheduling only within a
>basic block was not enough; the recent Cray compilers will push
>some operations (i.e. loads) across block boundaries.

        Global optimization is the wave of the future in high-speed
pipelined processors (IMHO).  Actually, some pretty old compilers do this:
it's called removing loop invariants.  It is, of course, a special-case
optimization, but it does move things across block boundaries.  (And boy is
it a pain when it gets it wrong.  I had a Fortran subroutine with
dummy[3] = dummy[3] in it so it wouldn't pull it out of the loop (it was
equivalenced).)

>The whole point of the last paragraph is to say that the amount of
>worry you put into this optimization is very dependent on the payoff
>your particular hardware can get out of it (no surprise), and that
>the payoff varies dramatically from machine to machine.

        The potential payoff is almost directly proportional to the length
of your load delays.  The faster CPUs get, given current memory, board, and
IC packaging technology, the longer these delays will get.

>An aside: do the current "no hardware interlocks" cpus REALLY have
>no interlocks, or are they just talking about the integer ALU ops?

        The RPM40 has NO interlocks, nor would I envision any XP for it
having any either.  However, that doesn't mean SOME interlocks aren't a good
thing, especially in long load delay machines, IFF you can do it without
affecting cycle time (i.e. it doesn't hit the critical path).

     // Randell Jesup                         Lunge Software Development
    //  Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//   beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup
(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)