Path: utzoo!attcan!uunet!know!zaphod.mps.ohio-state.edu!uwm.edu!ux1.cso.uiuc.edu!ux1.cso.uiuc.edu!aglew
From: aglew@crhc.uiuc.edu (Andy Glew)
Newsgroups: comp.arch
Subject: Compilers taking advantage of architectural enhancements
Message-ID: <AGLEW.90Oct11144920@dwarfs.crhc.uiuc.edu>
Date: 11 Oct 90 19:49:20 GMT
References: <162639@<1990Oct9> <3300194@m.cs.uiuc.edu>
Sender: news@ux1.cso.uiuc.edu (News)
Organization: Center for Reliable and High-Performance Computing University of
	Illinois at Urbana Champaign
Lines: 135
In-Reply-To: gillies@m.cs.uiuc.edu's message of 10 Oct 90 21:51 CDT

>[gillies@m.cs.uiuc.edu]
>One of the problems in new CPU designs is that designers don't realize
>which architecture enhancements are pointless, because we don't and
>may never have the optimization technology to take advantage of them.

Ah, a potentially interesting and useful topic.  Perhaps we can start
a discussion that will lead to a list of possible hardware
architectural enhancements that a compiler can/cannot take advantage
of?  Maybe the real compiler guys (preston, David C.) will help us
out?

By the way, it is useful in such discussions to distinguish between
run-of-the-mill optimizations that can pretty much be added to any
existing compiler by a fairly good CS grad, and optimizations that
require so much background that only the "experts" in the field can
reasonably be expected to do this.  For example, many of the
parallelizing FORTRAN loop optimizations can reasonably only be
expected to be done by people from Rice or UIUC CSRD (note that I'm
UIUC, but CRHC, not CSRD), so unless you are willing to pay for these
guys (or their assorted front companies) you aren't likely to get them
onto your machine.
    Cost of compiler development can be significant.  Sometimes a
company might put a hardware feature in even though they know a
compiler approach would be better, because they can't afford the
compiler guys (or the compiler guys already have exclusive contracts
with a competitor).


Let me list a few things to start off the discussion.  I hope and
expect to be shot down on a few of them.  It's easy to list a few of
the hardware enhancements that we already know compilers can take
advantage of.


Branch Delay Slots - small number
    One or two branch delay slots can be filled quite easily.
    DIFFICULTY: a reasonably good CS grad can do it.

Branch Delay slots - large number
    Most groups report difficulty filling a large number of branch delay
    slots, although my group's compiler guy (UIUC IMPACT, Prof Weh-M Hwu)
    has been able to get good results for up to 8 delay slots, on slightly
    different hardware (forward semantic, with profiling feedback).
    DIFFICULTY: given the slightly different hardware, and exposure
    	to the necessary ideas, a reasonably good CS grad can do it.

Register file - moderate sized (up to 32 registers)
    Most compilers can use these efficiently. 
    DIFFICULTY:	a reasonably good CS grad can do it.
    	Avoiding IBM's patents is a slight problem.

Register file - large (around 128 registers, or more)
    Most compilers do not get enough benefit from these to justify
    the extra hardware, or the slowed down register access.

Multiple, hierarchical, registers sets
    Like Cray's fast S registers and slower T registers.
    It took a very long time for compilers to take advantage 	
    of these multiple register sets.
    DIFFICULTY: complex.
    

Separate floating point register file
    Most compilers can take advantage of this.
    DIFFICULTY: easy

Heterogenous register file
    Few compilers have been developed that can take advantage of a truly heterogenous
    register file, one in which, for example,  the divide unit writes to registers
    D1..D2, the add unit writes to registers A1..A16, the shift unit writes to registers
    S1..S4 -- even though such hardware conceivably has a cycle time advantage over
    homogenous registers, even on VLIW machines where data can easily be moved
    to generic registers when necessary.
    DIFFICULTY: hard. 

Instruction cache
    Many groups have reported compilers that can take advantage of small instruction caches
    by rearranging code to minimize I-cache misses.  Eg. HP, UIUC IMPACT.
    DIFFICULTY: moderately complex - a good CS grad can do it, but it requires tools
    	and background that take a while to develop, and are not widely available.
    One of the complicating factors is that this sort of optimization is most
    useful when it is done in cooperaton with the linker/loader. Which is something
    that your average compiler hacker doesn't have background in (although it isn't
    too hard).

Data cache - software managed consistency
    This reportedly has been done, but certainly isn't run-of-the-mill.
    DIFFICULTY: needs a skilled compiler expert.

Micro-scheduling parallelism (like CONVEX's ASAP)
    Apparently isn't too hard to develop compiler support for this.
    Hardware is rather expensive, though.
    DIFFICULTY: moderately complex

Vectorizing
    Done in lots of places, but high performance vectorizing compilers
    still seem to be a problem for a lot of companies.
    DIFFICULTY: complex.  The easy things are easy, but milking the
    last few FLOPs out of your vector unit requires compiler experts.

Parallelizing - fine grain, small numbers of processors
    (like Alliant's concurrent hardware)
    Done, eg. by Alliant and Cray
    DIFFICULTY: to be competitive with this sort of hardware, you
    	need real compiler experts on your team.  Harder than vectorizing.

Parallelizing, fine grain, large numbers of processors.
    What Tera is trying to do.
    DIFFICULTY: very complex. Leading edge.
    	Good luck, Tera.

Multiple functional units - heterogenous - VLIW or superscalar
    DIFFICULTY: complex.
    	Complexity of the packing algorithms grows rapidly as
    	you try to go beyond basic block parallelism

Multiple functional units - homogenous - VLIW or superscalar
    DIFFICULTY: moderately complex
    	Easier than the heterogenous case, and the packing algorithms
    	are considerably easier.

Special hardware instructions - scalar
    Taking advantage of simple instructions like abs(), conditional exchange, etc.
    DIFFICULTY:
    	(1) When treated not as a compiler problem, but as a problem of simply
    	    writing libraries to inline optimized machine code, EASY
    	    Requires inlining support.
    	(2) When treated as a compiler problem, eg. compiler should recognize
    	    IF a < 0 THEN a = -a
    	    and automatically convert it to a = ABS(a)
    	    MODERATELY COMPLEX.
    I expect Herman Rubin to comment on this.

--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]