Xref: utzoo comp.misc:2090 comp.arch:3906
Path: utzoo!mnetor!uunet!husc6!hao!boulder!sunybcs!bingvaxu!leah!itsgw!imagine!pawl14.pawl.rpi.edu!jesup
From: jesup@pawl14.pawl.rpi.edu (Randell E. Jesup)
Newsgroups: comp.misc,comp.arch
Subject: Re: Instruction Scheduling
Message-ID: <519@imagine.PAWL.RPI.EDU>
Date: 12 Mar 88 07:48:40 GMT
References: <12513@sgi.SGI.COM> <12560@sgi.SGI.COM> <12678@sgi.SGI.COM>
Sender: news@imagine.PAWL.RPI.EDU
Reply-To: beowulf!lunge!jesup@steinmetz.UUCP
Organization: RPI Public Access Workstation Lab - Troy, NY
Lines: 69
Keywords: optimization   pipeline constraints  code re-organization

In article <12678@sgi.SGI.COM> bron@olympus.SGI.COM (Bron C. Nelson) writes:
>  The real problem is loads/stores.  Without
>some kind of analysis, you cannot move a load past a store, and
>you cannot move a store past a load OR a store.  This can severely
>inhibit the amount of re-ordering that can be done, particularly
>if you only schedule a basic block at a time.

	Yup!  There are some simple cases, such as where you're loading/storing
with the same base register, and the register wasn't changed in between (in
particular, this helps local variable references).  Maybe there should be
assembler pseudo-instructions (not real ones) that carry some of the
information the compiler/programmer has found out.  Example:

	(addr is offset(base))
	uldw reg,addr	- unaliased loadword (no references via other
			  registers/absolute will occur)
	ustw addr,reg	- unaliased storeword
	aldw reg,addr,addr[,addr...] - aliased loadword - the later addrs
			  are aliases of the first one, which is being loaded.
	etc.

This is, all in all, pretty kludgy.  I like this better:

	.alias addr,addr,addr - informs back end (assembler/reorganizer) that
			  these addresses are aliased; for anything else it can
			  assume it's safe to move loads/stores around (within
			  the usual bounds).

This will only be valid until any one of the address registers used is
changed, after which the reorg must assume full aliasing again until
told otherwise.  The net effect is lots of these near beginnings of
blocks.  But it does do the job (if the compiler can figure it out,
that is!)
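
To make the bookkeeping concrete, here is a rough sketch of the check a
reorganizer might make before swapping two memory references.  It is only an
illustration: the names (Ref, AliasTracker, may_alias), the 4-byte word, and
the rule that one .alias replaces the previous one are all assumptions of
mine, not part of any real assembler.

```python
from dataclasses import dataclass

WORD = 4  # assumed access width in bytes (hypothetical)


@dataclass(frozen=True)
class Ref:
    base: str      # base register name
    offset: int    # displacement from the base register
    version: int   # write count of `base` when the reference was made


class AliasTracker:
    """Hypothetical model of what a reorganizer may assume inside a block."""

    def __init__(self):
        self.versions = {}       # register -> number of writes seen so far
        self.alias_set = None    # (base, offset) pairs from the active .alias
        self.alias_regs = set()  # registers the active .alias depends on

    def ref(self, base, offset):
        return Ref(base, offset, self.versions.get(base, 0))

    def declare_alias(self, *addrs):
        # Assumption: a new .alias simply replaces the previous one.
        self.alias_set = frozenset(addrs)
        self.alias_regs = {base for base, _ in addrs}

    def write_reg(self, reg):
        self.versions[reg] = self.versions.get(reg, 0) + 1
        if reg in self.alias_regs:
            # An address register of the directive changed: back to
            # assuming full aliasing until told otherwise.
            self.alias_set = None
            self.alias_regs = set()

    def _current(self, r):
        return r.version == self.versions.get(r.base, 0)

    def may_alias(self, a, b):
        # Simple case: same base register, unchanged between the two refs,
        # so comparing offsets gives an exact answer.
        if a.base == b.base and a.version == b.version:
            return abs(a.offset - b.offset) < WORD
        # With a valid .alias, only the listed addresses alias each other.
        if self.alias_set is not None and self._current(a) and self._current(b):
            return ((a.base, a.offset) in self.alias_set
                    and (b.base, b.offset) in self.alias_set)
        return True  # no information: assume the worst
```

A load and a store may be swapped only when may_alias() says False; two
loads can always be swapped.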
			
>  Clearly, scheduling only within a
>basic block was not enough; the recent Cray compilers will push
>some operations (i.e. loads) across block boundaries.

	Global optimization is the wave of the future in high-speed pipelined
processors (IMHO).
	Actually, some pretty old compilers do this: it's called removing
loop invariants.  It is, of course, a special-case optimization, but it
does move things across block boundaries.  (And boy is it a pain when the
compiler gets it wrong.  I had a Fortran subroutine with dummy(3) = dummy(3)
in it so the compiler wouldn't pull the reference out of the loop (it was
equivalenced).)
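
For anyone who hasn't been bitten by this, here is a toy model of the
failure.  It is a sketch, not the original Fortran: memory is a Python list,
and "equivalenced" just means the two names x and y are the same index.  The
function names are made up for the illustration.

```python
def loop_as_written(mem, x, y, n):
    """Reload mem[x] every iteration, as the source code says."""
    total = 0
    for i in range(n):
        total += mem[x]   # looks loop-invariant: nothing stores to "x"
        mem[y] = i + 1    # store through a different name
    return total


def loop_hoisted(mem, x, y, n):
    """Same loop after the load has been pulled out as an invariant."""
    total = 0
    t = mem[x]            # hoisted load
    for i in range(n):
        total += t
        mem[y] = i + 1
    return total


# Distinct storage: hoisting is harmless.
print(loop_as_written([10, 0], 0, 1, 3), loop_hoisted([10, 0], 0, 1, 3))
# Shared storage (x and y equivalenced): the two versions disagree.
print(loop_as_written([10], 0, 0, 3), loop_hoisted([10], 0, 0, 3))
```

The hoist is only legal if the optimizer can prove the store never touches
the hoisted location, which is exactly the aliasing question above.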

>The whole point of the last paragraph is to say that the amount of
>worry you put into this optimization is very dependent on the payoff
>your particular hardware can get out of it (no surprise), and that
>the payoff varies dramatically from machine to machine.

	The potential payoff is almost directly proportional to the length
of your load delays.  The faster CPUs get, given current memory, board, and
IC packaging technology, the longer these delays will get.
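
A back-of-the-envelope way to see this is a tiny cycle counter.  The machine
model below is an assumption made for illustration only (single issue, in
order, ALU results usable the next cycle, loads usable load_delay cycles
later); it is not a model of any particular CPU.  On a machine without
interlocks the waiting cycles would be NOPs rather than stalls, but the
count comes out the same.

```python
def cycles(prog, load_delay):
    """Count cycles to issue prog, an ordered list of (op, dest, srcs)."""
    ready = {}  # register -> first cycle at which its value can be read
    clock = 0
    for op, dest, srcs in prog:
        # Wait until every source operand is available.
        start = max([clock] + [ready.get(s, 0) for s in srcs])
        ready[dest] = start + 1 + (load_delay if op == "load" else 0)
        clock = start + 1
    return clock


# Each load immediately followed by its use.
naive = [
    ("load", "r1", []), ("add", "r2", ["r1"]),
    ("load", "r3", []), ("add", "r4", ["r3"]),
    ("load", "r5", []), ("add", "r6", ["r5"]),
]
# Same work with the loads moved up (assumes they don't alias anything).
scheduled = [
    ("load", "r1", []), ("load", "r3", []), ("load", "r5", []),
    ("add", "r2", ["r1"]), ("add", "r4", ["r3"]), ("add", "r6", ["r5"]),
]

for delay in range(5):
    print(delay, cycles(naive, delay), cycles(scheduled, delay))
```

With this toy block the naive order costs 6 + 3*delay cycles, while the
reordered one hides the first couple of delay cycles entirely, so the gap
widens as the delay grows.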

>An aside: do the current "no hardware interlocks" cpus REALLY have
>no interlocks, or are they just talking about the integer ALU ops?

	The RPM40 has NO interlocks, nor would I envision any XP for it
having any either.  However, that doesn't mean SOME interlocks aren't a
good thing, especially in long load delay machines, IFF you can do it
without affecting cycle time (i.e. doesn't hit critical path).

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)